rrit
(Ayke)
March 4, 2022, 5:02pm
23
Some more information on noindex from the Google docs:
See Crawl Budget Management For Large Sites | Google Search Central | Documentation | Google for Developers
Consolidate duplicate content. Eliminate duplicate content to focus crawling on unique content rather than unique URLs.
Block crawling of URLs that you don’t want indexed. Some pages might be important to users, but you don’t want them to appear in Search results. For example, infinite scrolling pages that duplicate information on linked pages, or differently sorted versions of the same page. If you can’t consolidate them as described in the first bullet, block these unimportant (for search) pages using robots.txt or the URL Parameters tool (for duplicate content reached by URL parameters).
Don’t use noindex, as Google will still request the page but then drop it when it sees the noindex tag, wasting crawling time. Don’t use robots.txt to temporarily reallocate crawl budget for other pages; use robots.txt to block pages or resources that you don’t want Google to crawl at all. Google won’t shift this newly available crawl budget to other pages unless Google is already hitting your site’s serving limit.
See How to Specify a Canonical with rel="canonical" and Other Methods | Google Search Central | Documentation | Google for Developers
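To make the two approaches above concrete: blocking unimportant duplicate pages happens in robots.txt, while consolidation happens with a canonical link in the page head. A minimal sketch (the Disallow patterns and URLs here are illustrative examples, not Discourse defaults):

```
# robots.txt — block crawling of parameterized duplicates
User-agent: *
Disallow: /*?sort=
Disallow: /*?page=

# In the <head> of each duplicate variant, point at the canonical URL:
<link rel="canonical" href="https://example.com/t/topic-slug/123">
```

Note the difference in effect: a robots.txt rule stops Google from requesting the URL at all, whereas rel="canonical" still lets Google crawl the page but consolidates its signals onto the canonical URL.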
3 Likes
j127
March 8, 2022, 7:06pm
26
This command doesn’t seem to work. I updated a smaller Discourse site today to test it, ran the command, and still see the noindex headers.
Edit: I’m not sure how that setting works, but I don’t see it in SiteSettings, at least from the frontend (as admin) in the browser console:
// Dump all site settings exposed to the client into the page
var d = Discourse.SiteSettings;
document.body.innerHTML = `<pre>${JSON.stringify(d, null, 4)}</pre>`;
It looks like that setting is for robots.txt, not noindex. Wouldn’t that already be true on most Discourse sites?
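For anyone else verifying this: noindex can be delivered either as an X-Robots-Tag response header or as a robots meta tag in the HTML, so it is worth checking both. A small sketch using only the Python standard library (the header dict and HTML strings below are stand-ins, not captured from a real site):

```python
import re

def has_noindex(headers: dict, html: str) -> bool:
    """Return True if a response signals noindex via the
    X-Robots-Tag header or a <meta name="robots"> tag."""
    # Header names are case-insensitive; normalize before lookup.
    tag = next((v for k, v in headers.items()
                if k.lower() == "x-robots-tag"), "")
    if "noindex" in tag.lower():
        return True
    # Fall back to scanning the HTML for a robots meta tag.
    m = re.search(
        r'<meta\s+name=["\']robots["\']\s+content=["\']([^"\']*)["\']',
        html, re.IGNORECASE)
    return bool(m and "noindex" in m.group(1).lower())

# Illustrative inputs:
print(has_noindex({"X-Robots-Tag": "noindex"}, ""))               # True
print(has_noindex({}, '<meta name="robots" content="noindex">'))  # True
print(has_noindex({}, "<html></html>"))                           # False
```

Feed it the headers and body from any HTTP client; checking only one of the two channels can miss a noindex that the other is sending.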
2 Likes
Falco
(Falco)
March 8, 2022, 7:27pm
27
Oh sorry, the correct setting is SiteSetting.allow_indexing_non_canonical_urls. Fixed it in the OP.
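For anyone looking for where to flip that setting, it can be changed from the Rails console. A sketch assuming a standard Docker-based Discourse install (the /var/discourse path is the usual default, not guaranteed on every setup):

```
cd /var/discourse
./launcher enter app
rails c
SiteSetting.allow_indexing_non_canonical_urls = true
```

Settings changed this way take effect without a rebuild; hidden settings in particular may only be reachable from the console rather than the admin UI.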
3 Likes
sam
(Sam Saffron)
March 15, 2022, 10:33pm
30
We continued analyzing issues following this change and decided to roll it back per:
discourse:main ← discourse:enable_indexing_canonical (opened 10:30 PM, 15 Mar 22 UTC)
We rolled out a change to disable canonical indexing.
The goal behind it was to limit the crawl budget Google spends scanning non-canonical topic links.
Since this change was applied, we have rolled out two fixes that make it unnecessary.
Links in topic RSS feeds are no longer followed. E.g.: https://meta.discourse.org/t/search-engines-now-blocked-from-indexing-non-canonical-pages/218985.rss
Post RSS feeds now contain canonical links. E.g.: https://meta.discourse.org/posts.rss
Combined, these two changes mean crawlers no longer discover a large number of non-canonical links on Discourse sites.
This frees crawl budget and makes the site setting no longer a requirement. Site operators are still free to experiment with it; however, it is disabled by default.
13 Likes