Search engines now blocked from indexing non-canonical pages

Some more information on noindex from the Google docs:

See Crawl Budget Management For Large Sites | Google Search Central | Documentation | Google Developers

See Consolidate Duplicate URLs with Canonicals | Google Search Central | Documentation | Google Developers

This command doesn’t seem to work. I updated a smaller Discourse site today to test it, ran the command, and still see the noindex headers.
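
For anyone else verifying this, a quick browser-console check is possible, assuming the block is delivered via the X-Robots-Tag response header (the URL below is a hypothetical non-canonical page; substitute one from your own site):

// Hypothetical non-canonical URL; if the block is active, the
// X-Robots-Tag response header should include "noindex".
fetch('/t/example-topic/123?page=2', { method: 'HEAD' })
  .then(r => console.log(r.headers.get('x-robots-tag')));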


Edit: I’m not sure how that setting works, but I don’t see it in SiteSettings, at least from the frontend (as admin) in the browser console:

// Dump every frontend-visible site setting as formatted JSON
var d = Discourse.SiteSettings;
document.body.innerHTML = `<pre>${JSON.stringify(d, null, 4)}</pre>`;
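
Rather than eyeballing the whole dump, you can filter the keys for anything index-related (a minimal sketch building on the snippet above; the frontend only exposes client-visible settings, so absence here isn’t conclusive):

// List only settings whose names mention "index"
var d = Discourse.SiteSettings;
Object.keys(d)
  .filter(k => k.toLowerCase().includes('index'))
  .forEach(k => console.log(k, d[k]));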

It looks like that setting is for robots.txt, not noindex. Wouldn’t that already be true on most Discourse sites?

Oh sorry, the correct setting is SiteSetting.allow_indexing_non_canonical_urls. Fixed it in the OP.
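
If you want to confirm it from the same browser console, something like the line below should work, with the caveat (an assumption on my part) that server-only or hidden settings are not serialized to the client, so undefined does not prove the setting is absent:

// undefined here means "not exposed to the client", not "doesn't exist"
console.log(Discourse.SiteSettings.allow_indexing_non_canonical_urls);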

We continued analyzing issues following this change and decided to roll it back:

The goal behind it was to limit the crawl budget Google spends scanning non-canonical topic links.

Since this change was applied, we rolled out two fixes that made it unnecessary:

  1. Topic RSS feeds are no longer followed, and links inside those feeds are not followed either. E.g.: https://meta.discourse.org/t/search-engines-now-blocked-from-indexing-non-canonical-pages/218985.rss

  2. Post RSS feeds now contain canonical links (a quick spot-check sketch follows this list). E.g.: https://meta.discourse.org/posts.rss
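
A rough way to spot-check the second fix from the browser console (run it on the Discourse site itself to avoid CORS; the element names are an assumption about standard RSS structure):

// Print the first few <item><link> values; after the fix these
// should be canonical topic URLs.
fetch('/posts.rss')
  .then(r => r.text())
  .then(xml => {
    const doc = new DOMParser().parseFromString(xml, 'application/xml');
    [...doc.querySelectorAll('item > link')]
      .slice(0, 5)
      .forEach(l => console.log(l.textContent));
  });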

Combined, these two changes mean crawlers no longer discover a large number of non-canonical links on Discourse sites.

This frees crawl budget and means the site setting is no longer a requirement. Site operators are still free to experiment with it; however, it is disabled by default.
