Search engines now blocked from indexing non-canonical pages

You are right, my apologies. I get lost in my own thoughts sometimes. :slight_smile:

Quick question, I assume this feature is already available by default as long as we update Discourse to v2.9?

4 Likes

I think that the feature shouldn’t be on by default. It’s dangerous from a traffic standpoint, even if it’s only on for a brief time, so anyone who updates now might get an unwelcome surprise.

The canonical tag is the way Google recommends dealing with that problem, and it appears to be working already. Doing weird things with canonical tags can lead to strange problems with Google, and a noindex mistake could be difficult to recover from.

2 Likes

I agree with the first part of your post, but I don’t think internal nofollow is ideal. Internal links help tell search engines which pages on the site are important. Google isn’t going to follow every link it sees, because it knows that it has seen them before. If it sees a URL like example.com/t/1234/5 but has already crawled it and knows that its canonical URL is example.com/t/1234, it probably isn’t going to waste computing resources visiting the non-canonical version multiple times.

3 Likes

Remove ‘noindex’ for URLs linked to by external websites

Sorry, by “answers” I mean “posts” in a topic:
All deep links from external domains to posts (e.g. forum.example.com/t/example-topic/5/11) now have the HTTP header X-Robots-Tag: noindex! I suggest removing this HTTP header again.

I suggest that <link rel="canonical" … > never use a URL with a post number (the last number in …/t/example-topic/1234/5) anywhere. Canonical URLs should always point to the topic URL itself (…/t/example-topic/1234). I think it is already implemented like this.
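
To see which URLs currently carry the header, something like the following could be used. This is only a minimal sketch: the helper names `hasNoindex` and `checkUrl` are made up for illustration, and `checkUrl` assumes a runtime with a global `fetch` (a browser or Node 18+). Since Discourse has a special view for crawlers, a request without a crawler user agent may not see the same headers, so treat the result as a hint.

```javascript
// Hypothetical helper: returns true if an X-Robots-Tag header value
// contains a directive that blocks indexing ("noindex" or "none").
function hasNoindex(headerValue) {
  if (!headerValue) return false;
  return headerValue
    .split(",")
    // Drop an optional user-agent prefix such as "googlebot:".
    .map((directive) => directive.split(":").pop().trim().toLowerCase())
    .some((directive) => directive === "noindex" || directive === "none");
}

// Hypothetical helper: fetches a URL and reports whether the response
// carries a blocking X-Robots-Tag header. Assumes a global fetch.
async function checkUrl(url) {
  const response = await fetch(url);
  return hasNoindex(response.headers.get("x-robots-tag"));
}
```

For example, `checkUrl("https://forum.example.com/t/example-topic/5/11")` would resolve to `true` if that post URL is served with the noindex header.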


Rewrite links for search engines if target url is “redirected” by <link rel="canonical" … >

Very good point; better not to add rel="nofollow" here.

Discourse has a special view for crawlers. New suggestion for crawler view only:
Convert all internal links pointing to a post-URL (example.com/t/1234/5) to point to the corresponding topic-URL (example.com/t/1234) instead.
Intention: Don’t announce extra URLs to search engines when these extra URLs are “redirected” by <link rel="canonical" … > anyway.

Locations where such links to posts are found:

  • manually added links in user content
  • automatically generated links in
    • quotes
    • first post of topic: “inbound tracked links” from other topics
    • first post of topic: “selected answer”
    • first post of topic - topic map open: “topic links”/“liked links”

Excursus: Where does Google find all those URLs?


“inbound tracked links” for search engines

For exactly this reason the automatically generated “inbound tracked links from other topics” on the first post of a topic should also be visible to search engines.
Right now these “inbound tracked links” are missing in the crawler view. Edit: They are already in the crawler view.

But they point to the post-url instead of the topic-url (see HTML source):
<div class="crawler-linkback-list" itemscope="" itemtype="http://schema.org/ItemList">
      <div itemprop="itemListElement" itemscope="" itemtype="http://schema.org/ListItem">
        <a href="https://meta.discourse.org/t/removing-the-2-3-4-etc-links-for-each-reply-within-a-topic-url/209648/26" itemscope="" itemtype="http://schema.org/DiscussionForumPosting" itemprop="item">
          <meta itemprop="url" content="https://meta.discourse.org/t/removing-the-2-3-4-etc-links-for-each-reply-within-a-topic-url/209648/26">
          <span itemprop="name">Removing the /2, /3, /4, etc links for each reply within a topic URL</span>
        </a>
        <meta itemprop="position" content="2">
      </div>
</div>
3 Likes

This is a crucial point. It’s one thing to get all of your pages indexed and another to get a relevant ranking for them. In my experience (with big publisher sites), smart internal linking is key to achieving this.

1 Like

I updated just this morning. Do you recommend enabling indexing of non-canonical pages with this?

I would not want to make my indexing worse.

1 Like

For anyone who updates their site after the OP’s post date.

We have data that shows that the new header reduces crawl time on those pages, and they always had the canonical set.

But those pages are not meant to be crawled anyway. The metadata with the URL is set on the topic level; we don’t want Google to crawl the post level as it’s duplicated content.

Cool, so nothing needs to change here.

Doing that at runtime may be too CPU expensive, and saving two versions of every post would be disk expensive.

Our defaults are always what we recommend. However, we maintain and announce site settings so people can choose otherwise if they feel like a default isn’t ideal for their site.

5 Likes

Perfect, then I will leave it as recommended.
Thank you

2 Likes

Last thing, and then I won’t disturb you anymore :sweat_smile:

So could there be problems with sitemap_recent.xml that contains such links?
https://meta.discourse.org/t/category-moderator-improvements/158628?page=2

1 Like

That example is a canonical page, so it isn’t affected in any way by the changes outlined in the OP.

2 Likes

I see a huge difference when there is an external link to a post-url.

# A: 
External Domain
|
|--(link juice)--> post-url
                   |
                   |__/ crawling:      \---> post-url not indexed and
                      \ header noindex /     link-juice mostly gone

# B:
External Domain
|
|--(link juice)--> post-url
                   |
                   |__/ crawling:        \__|--> post-url not indexed
                      \ answer canonical /  |--> topic-url indexed (anyway)
                                                 with link-juice transfer

We should bring this up on

1 Like

For neophytes like me regarding SEO, does it imply that it’s an SEO improvement that could potentially lead to a slight increase/benefit in Google search results?

3 Likes

Yes, that is the goal!

We tested the change in a tech news community over a few months, and we saw a large peak-to-peak increase in anon page views. Our end goal is always to make all Discourse communities healthier on all fronts.

5 Likes

Is this effect visible in Google Search Console report ‘Settings’ → ‘Crawling’ → ‘Crawl stats’ ?

1 Like

Taking into consideration …

A. Decreasing crawls

B. No two versions of content

C. Use canonical tag

D. No nofollow

E. No noindex

… and having internal links at …

… I suggest the following implementation to get the best compromise:

  1. Don’t add http-header X-Robots-Tag: noindex.
    – taking into account [E] –
  2. Keep canonical tags always pointing to the topic-url.
    – decreasing crawls [A] and considering [C] –
  3. For crawler view only: Convert automatically generated links to always link to topic-url instead of post-url - for all links in first post of topic “inbound tracked links from other topics" and “topic map open: topic link/liked links”.
    – decreasing crawls [A] and considering [D], but willfully disregarding [B] –
    On [B]: CPU expenses are for crawler-visits only and consist of doing a regex-replace to cut off the last number of internal urls ending in two numbers, e.g. …/t/example-topic/1234/5 → …/t/example-topic/1234, in the confined borders of first post of topic “inbound tracked links from other topics” and “topic map open” only.
  4. For all views: add internal nofollow to quotes and manually added links in user content.
    – decreasing crawls [A] and considering [B], but slightly disregarding [D] –
    On [D]: important links are already automatically duplicated to the first post of the topic in the “topic map open: topic link/liked links” section [see 3.] and most quotes stay inside the topic itself anyway.
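
The regex-replace from step 3 could look roughly like this. It is a minimal sketch, not how Discourse actually builds the crawler view: the function name `toTopicUrl` and the `siteOrigin` parameter are made up for illustration, and it assumes the forum runs at the root of its domain so that topic paths start with /t/.

```javascript
// Hypothetical helper: rewrites an internal post URL (…/t/slug/1234/5)
// to its topic URL (…/t/slug/1234). External links, topic URLs, and
// anything that does not parse are returned unchanged.
function toTopicUrl(href, siteOrigin) {
  let url;
  try {
    url = new URL(href, siteOrigin); // also resolves relative hrefs
  } catch {
    return href;
  }
  if (url.origin !== new URL(siteOrigin).origin) return href;

  // "/t/", an optional slug, the topic id, then the post number to cut off.
  // Caveat: a purely numeric slug (/t/12/34) is ambiguous and is treated
  // as topic id 12 with post number 34 here.
  const match = url.pathname.match(/^(\/t\/(?:[^\/]+\/)?\d+)\/\d+\/?$/);
  if (!match) return href;

  url.pathname = match[1];
  return url.toString(); // query string and fragment are kept
}
```

Called as `toTopicUrl("/t/example-topic/1234/5", "https://example.com")`, this returns the absolute topic URL. The work per link is one URL parse and one regex match, so applying it only to the link lists in the first post should stay cheap.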

Some idea on internal links

Google says: Consolidate Duplicate URLs with Canonicals (Google Search Central documentation)

And Google says: Create Crawlable Links (Google Search Central documentation)

So Discourse might set internal links like this:

<a href="/t/example-topic/1234" routerLink="/t/example-topic/1234/5">…</a>

For Google the link goes straight to the canonical topic-url …/1234 - and Google does not get to know about the post-url …/1234/5 from this link-syntax.

For user-navigation some additional JavaScript in the Ember-app will do the trick:
e.g. replace href with routerLink.
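
A rough sketch of what that extra JavaScript might do. Everything here is hypothetical: `routerLink` is just the attribute name proposed above, not an existing Discourse attribute, and `navigate` stands in for whatever routing function the Ember app would actually call.

```javascript
// Hypothetical helper: prefer the post URL stored in routerLink and
// fall back to the crawlable href (the topic URL).
function navigationTarget(link) {
  return link.getAttribute("routerLink") || link.getAttribute("href");
}

// Hypothetical wiring: intercept clicks on links that carry routerLink
// and send the user to the post URL instead of the topic URL.
function installPostLinkHandler(root, navigate) {
  root.addEventListener("click", (event) => {
    const link = event.target.closest("a[routerLink]");
    if (!link) return; // ordinary link, leave it to the browser
    event.preventDefault();
    navigate(navigationTarget(link));
  });
}
```

In a browser this could be wired up with `installPostLinkHandler(document, (url) => { window.location.href = url; })`. Crawlers that only read the href would still see the topic URL.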

2 Likes

Looks like a great improvement! Thanks for making this happen @Falco and Discourse team!

3 Likes

Some more information on noindex from the Google docs:


See Crawl Budget Management For Large Sites (Google Search Central documentation)


See Consolidate Duplicate URLs with Canonicals (Google Search Central documentation)

3 Likes

This command doesn’t seem to work. I updated a smaller Discourse site today to test it, ran the command, and still see the noindex headers.


Edit: I’m not sure how that setting works, but I don’t see it in the SiteSettings, at least from the frontend (as admin) in the browser console:

var d = Discourse.SiteSettings;
document.body.innerHTML = `<pre>${JSON.stringify(d, null, 4)}</pre>`;

It looks like that setting is for robots.txt, not noindex. Wouldn’t that already be true on most Discourse sites?

2 Likes

Oh sorry, the correct one is SiteSetting.allow_indexing_non_canonical_urls. Fixed it in the OP.

3 Likes

We continued analyzing issues following this change and decided to roll it back per:

The goal behind it was to limit the crawl budget Google spends scanning non-canonical topic links.

Since this change was applied, we rolled out 2 fixes that made the change unnecessary.

  1. Topic RSS feeds are no longer followed, and links in the RSS feeds are not followed. E.g.: https://meta.discourse.org/t/search-engines-now-blocked-from-indexing-non-canonical-pages/218985.rss

  2. Post RSS feeds now contain canonical links. E.g.: https://meta.discourse.org/posts.rss

Combined, these two changes mean crawlers no longer discover a large number of non-canonical links on Discourse sites.

This frees up crawl budget and means the site setting is no longer required. Site operators are still free to experiment with it; however, it is disabled by default.

11 Likes