Search engines now blocked from indexing non-canonical pages

Some more information on noindex from the Google docs:

See Crawl Budget Management For Large Sites | Google Search Central | Documentation | Google Developers

See Consolidate Duplicate URLs with Canonicals | Google Search Central | Documentation | Google Developers

This command doesn’t seem to work. I updated a smaller Discourse site today to test it, ran the command, and still see the noindex headers.
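
For anyone else verifying this, a quick browser-console check is possible, assuming the block is delivered via the X-Robots-Tag response header (the URL below is a hypothetical non-canonical page; substitute one from your own site):

// Hypothetical non-canonical URL; if the block is active, the
// X-Robots-Tag response header should include "noindex".
fetch('/t/example-topic/123?page=2', { method: 'HEAD' })
  .then(r => console.log(r.headers.get('x-robots-tag')));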


Edit: I’m not sure how that setting works, but I don’t see it in SiteSettings, at least from the frontend (as admin) in the browser console:

// Dump every frontend-visible site setting as formatted JSON
var d = Discourse.SiteSettings;
document.body.innerHTML = `<pre>${JSON.stringify(d, null, 4)}</pre>`;
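
Rather than eyeballing the whole dump, you can filter the keys for anything index-related (a minimal sketch building on the snippet above; the frontend only exposes client-visible settings, so absence here isn’t conclusive):

// List only settings whose names mention "index"
var d = Discourse.SiteSettings;
Object.keys(d)
  .filter(k => k.toLowerCase().includes('index'))
  .forEach(k => console.log(k, d[k]));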

It looks like that setting is for robots.txt, not noindex. Wouldn’t that already be true on most Discourse sites?

Oh sorry, the correct setting is SiteSetting.allow_indexing_non_canonical_urls. Fixed it in the OP.
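
If you want to confirm it from the same browser console, something like the line below should work, with the caveat (an assumption on my part) that server-only or hidden settings are not serialized to the client, so undefined does not prove the setting is absent:

// undefined here means "not exposed to the client", not "doesn't exist"
console.log(Discourse.SiteSettings.allow_indexing_non_canonical_urls);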

We continued analyzing issues following this change and decided to roll it back:

The goal behind it was to limit the crawl budget Google spends scanning non-canonical topic links.

Since this change was applied, we rolled out two fixes that made it unnecessary:

  1. Topic RSS feeds are no longer followed, and links inside those feeds are not followed either. E.g.: https://meta.discourse.org/t/search-engines-now-blocked-from-indexing-non-canonical-pages/218985.rss

  2. Post RSS feeds now contain canonical links (a quick spot-check sketch follows this list). E.g.: https://meta.discourse.org/posts.rss
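
A rough way to spot-check the second fix from the browser console (run it on the Discourse site itself to avoid CORS; the element names are an assumption about standard RSS structure):

// Print the first few <item><link> values; after the fix these
// should be canonical topic URLs.
fetch('/posts.rss')
  .then(r => r.text())
  .then(xml => {
    const doc = new DOMParser().parseFromString(xml, 'application/xml');
    [...doc.querySelectorAll('item > link')]
      .slice(0, 5)
      .forEach(l => console.log(l.textContent));
  });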

Combined, these two changes mean crawlers no longer discover a large number of non-canonical links on Discourse sites.

This frees crawl budget and means the site setting is no longer a requirement. Site operators are still free to experiment with it; however, it is disabled by default.
