Search engines now blocked from indexing non-canonical pages

:warning: Important

Following further investigation, we decided to leave non-canonical indexing enabled. See more details at: Search engines now blocked from indexing non-canonical pages - #30 by sam

original announcement

Discourse will now respond with an X-Robots-Tag: noindex header when the requested page isn’t the canonical page for a resource.
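
As a rough illustration of that rule, here is a minimal sketch. It is not Discourse's actual code: the function name is hypothetical, and it assumes the caller already knows the canonical URL and that both URLs are normalized.

```python
def robots_headers(requested_url: str, canonical_url: str) -> dict:
    """Return the extra response headers for a page request.

    Hypothetical helper: both arguments are assumed to be normalized
    absolute URLs, and the canonical URL is assumed to be known already.
    """
    if requested_url == canonical_url:
        # Canonical page: send nothing extra, so it stays indexable.
        return {}
    # Non-canonical page: ask crawlers not to index this URL.
    return {"X-Robots-Tag": "noindex"}


topic_url = "https://forum.example.com/t/example-topic/1234"
post_url = "https://forum.example.com/t/example-topic/1234/5"
print(robots_headers(post_url, topic_url))
print(robots_headers(topic_url, topic_url))
```

The header only affects indexing. The page itself is still served normally to users and crawlers.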

While Discourse uses an automatic scrolling design for both topic lists and topics, this isn’t what we show search engine crawlers, like GoogleBot. Search engines see paginated topics, with 20 posts on each page. However, since users can link to specific posts in their own posts and will do so using the /t/title/topic_id/post_id URL format, those URLs will be picked up by the crawlers, add duplicated content to your site's search results, and waste the precious and limited crawl budget your domain has.
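
To make the relationship between post-specific URLs and crawler pages concrete, here is a small sketch. The page size of 20 comes from the paragraph above. The ?page=N format is taken from a sitemap URL quoted later in this topic. The function names, and the assumption that posts are numbered contiguously from 1, are hypothetical simplifications.

```python
POSTS_PER_PAGE = 20  # page size of the crawler view, as stated above


def crawler_page_for_post(post_number: int, posts_per_page: int = POSTS_PER_PAGE) -> int:
    """Return the 1-based crawler page on which a post would appear.

    Assumes posts are numbered contiguously from 1 (gaps left by deleted
    posts are ignored in this sketch).
    """
    if post_number < 1:
        raise ValueError("post_number must be >= 1")
    return (post_number - 1) // posts_per_page + 1


def paginated_topic_path(slug: str, topic_id: int, post_number: int) -> str:
    """Build the paginated topic path that contains the given post."""
    page = crawler_page_for_post(post_number)
    base = f"/t/{slug}/{topic_id}"
    # The first page is the plain topic URL; later pages use ?page=N.
    return base if page == 1 else f"{base}?page={page}"


print(paginated_topic_path("example-topic", 1234, 5))
print(paginated_topic_path("example-topic", 1234, 26))
```

Under these assumptions, every post-specific URL duplicates content that is already reachable on one of the paginated pages.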

To alleviate this issue, our community of users suggested adding the X-Robots-Tag: noindex header to URLs like post-specific URLs, which we managed to expand to all non-canonical URLs in Discourse. This was released as a hidden site setting, disabled by default, 3 months ago. Since then we have experimented with having this header enabled on community sites as well as on meta.discourse.org.

Since the results from this period look good so far, we just flipped this setting to be in effect by default.

If for some reason you don’t want this behavior on your instance, you can still enable indexing of non-canonical pages by running docker exec -i app rails runner "SiteSetting.allow_indexing_non_canonical_urls = true" on your server.

Don’t expect any drastic changes in crawling and search results overnight, but over the next months you should see a decrease in crawls and search results for post-specific pages. That will result in more crawl time spent on your site's new topics and on content that wasn’t yet indexed because of crawl budget constraints on your domain.

30 Likes

TL;DR: Don’t block non-canonical pages - just point them to a correct url via <link rel="canonical" … > - that’s what it’s made for.


This feature might harm the SEO link-building in the long run:
All deep-links to answers inside topics are on noindex pages now! Does Google like this?

Actually a canonical tag always pointing to the topic url - even for pages deep-linking on an answer - should perfectly do the job – without adding X-Robots-Tag: noindex:
On the first crawl of a deep-linking answer page, Google recognizes that the page URL (answer inside topic) does not match the canonical URL and then decides to crawl only the canonical URL (topic).
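
To illustrate the comparison described here, the following sketch extracts the canonical URL from a page and checks whether the page is its own canonical. It does not show how Google works internally. The class and function names are hypothetical, and treating a page without a canonical tag as canonical is an assumption of this example.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collects the href of the first <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens.
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels:
            self.canonical = attr.get("href")


def find_canonical(html: str):
    """Return the canonical URL declared in the HTML, or None."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


def is_canonical(page_url: str, html: str) -> bool:
    """True if the page declares itself as canonical, or declares nothing."""
    canonical = find_canonical(html)
    return canonical is None or canonical == page_url


page = '<html><head><link rel="canonical" href="https://forum.example.com/t/example-topic/1234"></head></html>'
print(is_canonical("https://forum.example.com/t/example-topic/1234/5", page))
```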


May we add <a rel="nofollow" …> to all links doing this topic-answer deep-linking? Edit: no, see Search engines now blocked from indexing non-canonical pages - #9 by j127
Thereby we might save even more of this precious and limited crawl budget of search engines:
the search engine would neither extract the link in the first place nor request the URL. Requesting the URL only returns a response with an X-Robots-Tag: noindex HTTP header, which causes the response to be ‘trashed’ and the URL to be added to the search engine's internal ‘noindex’ list.

Some more savings on crawl budget with nofollow added to RSS-feed urls:

6 Likes

I totally agree with @rrit's suggestions.

It would be better to point subpages/posts within the topic to its original canonical rather than blocking them.

Instead of adding noindex, can we add a nofollow tag to each reply under the topic?

1 Like

That’s exactly how it works already, so I’m not sure I follow.

So you suggest that we need to update the URL here

to use a canonical URL with the page number and a post anchor?

Those are already blocked via the robots.txt, but that is a good idea!

Sounds like a good idea too!

3 Likes

You are right, my apologies. I get lost in my own thoughts sometimes. :slight_smile:

Quick question, I assume this feature is already available by default as long as we update Discourse to v2.9?

4 Likes

I think that the feature shouldn’t be on by default. It’s dangerous from a traffic standpoint, even if it’s only on for a brief time, so anyone who updates now might get an unwelcome surprise.

The canonical tag is the way Google recommends dealing with that problem, and it appears to be working already. Doing weird things with canonical tags can lead to strange problems with Google, and a noindex mistake could be difficult to recover from.

2 Likes

I agree with the first part of your post, but I don’t think internal nofollow is ideal. Internal links help tell search engines which pages on the site are important. Google isn’t going to follow every link it sees, because it knows that it’s seen them before. If it sees a URL like example.com/t/1234/5 but has already crawled it and knows that its canonical URL is example.com/t/1234, it probably isn’t going to waste its computing resources visiting the non-canonical version multiple times.

3 Likes

Remove ‘noindex’ for URLs linked to by external websites

Sorry, by “answers” I mean “posts” in a topic:
All deep links from external domains to posts (e.g. forum.example.com/t/example-topic/5/11) have an HTTP header X-Robots-Tag: noindex now! I suggest removing this HTTP header again.

I suggest that <link rel="canonical" … > never use a URL with a post anchor (the last number in …/t/example-topic/1234/5 ) anywhere. Canonical URLs should always point to the topic URL itself (…/t/example-topic/1234 ). I think it is already implemented like this.


Rewrite links for search engines if target url is “redirected” by <link rel="canonical" … >

Very good point, better not to add rel="nofollow" here.

Discourse has a special view for crawlers. New suggestion for crawler view only:
Convert all internal links pointing to a post-URL (example.com/t/1234/5) to point to the corresponding topic-URL (example.com/t/1234) instead.
Intention: Don’t announce extra URLs to search engines when these extra URLs are “redirected” by <link rel="canonical" … > anyway.

Locations where such links to posts are found:

  • manually added links in user content
  • automatically generated links in
    • quotes
    • first post of topic: “inbound tracked links” from other topics
    • first post of topic: “selected answer”
    • first post of topic - topic map open: “topic links”/“liked links”

Excursus: Where does Google find all those URLs?


“inbound tracked links” for search engines

For exactly this reason the automatically generated “inbound tracked links from other topics” on the first post of a topic should also be visible to search engines.
Right now these “inbound tracked links” are missing in the crawler view. Edit: They are already in the crawler view.

But they point to the post-url instead of the topic-url (see HTML source):
<div class="crawler-linkback-list" itemscope="" itemtype="http://schema.org/ItemList">
      <div itemprop="itemListElement" itemscope="" itemtype="http://schema.org/ListItem">
        <a href="https://meta.discourse.org/t/removing-the-2-3-4-etc-links-for-each-reply-within-a-topic-url/209648/26" itemscope="" itemtype="http://schema.org/DiscussionForumPosting" itemprop="item">
          <meta itemprop="url" content="https://meta.discourse.org/t/removing-the-2-3-4-etc-links-for-each-reply-within-a-topic-url/209648/26">
          <span itemprop="name">Removing the /2, /3, /4, etc links for each reply within a topic URL</span>
        </a>
        <meta itemprop="position" content="2">
      </div>
</div>
3 Likes

This is a crucial point. It’s one thing to get all of your pages indexed and another to get a relevant ranking for them. In my experience (with big publisher sites), smart internal linking is key to achieve this.

1 Like

I updated just this morning. Do you recommend enabling indexing of non-canonical pages with this?

I would not want to make my indexing worse.

1 Like

For anyone that updates their site since the OP post date.

We have data that shows that the new header reduces crawl time on those pages, and they always had the canonical set.

But those pages are not meant to be crawled anyway. The metadata with the URL is set on the topic level, we don’t want Google to crawl the post level as it’s duplicated content.

Cool, so nothing needs to change here.

Doing that at runtime may be too CPU expensive, and saving two versions of every post would be disk expensive.

Our defaults are always what we recommend. However, we maintain and announce site settings so people can choose otherwise if they feel like a default isn’t ideal for their site.

5 Likes

Perfect then I will leave as recommended.
Thank you

2 Likes

Last thing and then I don’t disturb anymore :sweat_smile:

So could there be problems with sitemap_recent.xml that contains such links?
https://meta.discourse.org/t/category-moderator-improvements/158628?page=2

1 Like

That example is a canonical page, so it isn’t affected in any way by the changes outlined in the OP.

2 Likes

I see a huge difference when there is an external link to a post-url.

# A: 
External Domain
|
|--(link juice)--> post-url
                   |
                   |__/ crawling:      \---> post-url not indexed and
                      \ header noindex /     link-juice mostly gone

# B:
External Domain
|
|--(link juice)--> post-url
                   |
                   |__/ crawling:        \__|--> post-url not indexed
                      \ answer canonical /  |--> topic-url indexed (anyway)
                                                 with link-juice transfer

We should bring this up on

1 Like

For neophytes like me regarding SEO, does it imply that it’s an SEO improvement that could potentially lead to a slight increase/benefit in Google search results?

3 Likes

Yes, that is the goal!

We tested the change in a tech news community over a few months, and we saw a large peak-to-peak increase in anon page views. Our end goal is always to make all Discourse communities healthier on all fronts.

5 Likes

Is this effect visible in Google Search Console report ‘Settings’ → ‘Crawling’ → ‘Crawl stats’ ?

1 Like

Taking into consideration …

A. Decreasing crawls

B. No two versions of content

C. Use canonical tag

D. No nofollow

E. No noindex

… and having internal links at …

… I suggest the following implementation to get the best compromise:

  1. Don’t add http-header X-Robots-Tag: noindex.
    – taking into account [E] –
  2. Keep canonical tags always pointing to the topic-url.
    – decreasing crawls [A] and considering [C] –
  3. For crawler view only: Convert automatically generated links to always link to topic-url instead of post-url - for all links in first post of topic “inbound tracked links from other topics" and “topic map open: topic link/liked links”.
    – decreasing crawls [A] and considering [D], but willfully disregarding [B] –
    On [B]: CPU expenses are for crawler visits only and consist of doing a regex-replace to cut off the last number of internal URLs ending in two numbers, e.g. …/t/example-topic/1234/5 → …/t/example-topic/1234, within the confined borders of first post of topic “inbound tracked links from other topics” and “topic map open” only.
  4. for all views: add internal nofollow to quotes and manually added links in user content.
    – decreasing crawls [A] and considering [B], but slightly disregarding [D] –
    On [D]: important links are already automatically duplicated to first topic in "topic map open: topic link/liked links”-section [see 3.] and most quotes stay inside the topic itself anyway.
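
The regex-replace from step 3 could look roughly like the sketch below. It is only an illustration of the idea, not an existing Discourse function: the pattern and the function name are hypothetical, and it assumes it is given a single URL of the form …/t/<slug>/<topic_id>/<post_number> without a query string or fragment.

```python
import re

# Matches a URL ending in /t/<slug>/<topic_id>/<post_number> and captures
# everything up to and including the topic id.
# Limitation of this sketch: slug-less URLs such as /t/1234/5 and URLs with
# a query string or fragment are left unchanged.
POST_URL = re.compile(r"^(?P<topic>.*/t/[^/]+/\d+)/\d+/?$")


def to_topic_url(url: str) -> str:
    """Strip the trailing post number, returning the topic URL.

    URLs that do not look like post URLs are returned unchanged.
    """
    match = POST_URL.match(url)
    return match.group("topic") if match else url


print(to_topic_url("https://forum.example.com/t/example-topic/1234/5"))
print(to_topic_url("https://forum.example.com/t/example-topic/1234"))
```

Because the replacement would run only on the automatically generated link lists in the crawler view, user-written post content would not need to be stored in two versions.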

Some idea on internal links

Google says Consolidate Duplicate URLs with Canonicals | Google Search Central  |  Documentation  |  Google Developers

And Google says Create Crawlable Links | Google Search Central  |  Documentation  |  Google Developers

So Discourse might set internal links like this:

<a href="/t/example-topic/1234" routerLink="/t/example-topic/1234/5">…</a>

For Google the link goes straight to the canonical topic-url …/1234 - and Google does not get to know about the post-url …/1234/5 from this link-syntax.

For user-navigation some additional JavaScript in the Ember-app will do the trick:
e.g. replace href with routerLink.

2 Likes

Looks like a great improvement! Thanks for making this happen @Falco and Discourse team!

3 Likes