Search engines now blocked from indexing non-canonical pages

Falco · February 21, 2022, 7:35pm

Important

Following further investigation, we decided to leave indexing of non-canonical pages enabled. See more details at: Search engines now blocked from indexing non-canonical pages - #30 by sam

original announcement

Discourse will now reply with an X-Robots-Tag: noindex header when the requested page isn’t the canonical page for a resource.
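
As a rough illustration of the rule above, the decision can be pictured as a comparison between the requested URL and the canonical URL of the resource. This is a minimal sketch, not Discourse's actual code: `robots_headers` and its arguments are hypothetical names, and the `allow_indexing_non_canonical_urls` keyword only mirrors the site setting named in this announcement.

```ruby
require "uri"

# Hypothetical helper, not part of Discourse: returns the extra response
# headers for a request, given the URL that was requested and the canonical
# URL of the resource it resolves to.
def robots_headers(requested_url, canonical_url, allow_indexing_non_canonical_urls: false)
  # Opting back in to indexing means no header is ever added.
  return {} if allow_indexing_non_canonical_urls

  requested = URI.parse(requested_url)
  canonical = URI.parse(canonical_url)

  # Compare path and query only; scheme and host handling is out of scope here.
  same_page = requested.path == canonical.path && requested.query == canonical.query
  same_page ? {} : { "X-Robots-Tag" => "noindex" }
end

# A post-specific URL differs from the canonical topic URL,
# so this call returns the noindex header.
robots_headers("https://forum.example.com/t/example-topic/1234/5",
               "https://forum.example.com/t/example-topic/1234")
```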

While Discourse uses an automatic scrolling design for both topic lists and topics, this isn’t what we show search engine crawlers like Googlebot. Search engines see paginated topics, with 20 posts on each page. However, since users can link to specific posts in their own posts, and will do so using the /t/title/topic_id/post_id URL format, those URLs get picked up by crawlers, add duplicated content to your site’s search results, and waste the precious and limited crawl budget your domain has.
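
To make the relationship between post-specific URLs and the paginated crawler view concrete, here is a small sketch that maps a /t/title/topic_id/post_id path to the page that would contain that post. It is only an illustration: `crawler_page_path` is a hypothetical helper, the `?page=N` form is taken from the sitemap example later in this thread, and the arithmetic assumes post numbers are contiguous (deleted posts would shift the real pages).

```ruby
POSTS_PER_PAGE = 20 # page size stated above for the crawler view

# Hypothetical helper: maps a post-specific path (/t/title/topic_id/post_id)
# to the paginated crawler page that would contain that post.
# Assumes contiguous post numbers; slugless paths are left untouched.
def crawler_page_path(post_path)
  match = post_path.match(%r{\A(/t/[^/]+/\d+)/(\d+)\z})
  return post_path unless match # not a post-specific path

  topic_path = match[1]
  post_number = match[2].to_i
  page = (post_number - 1) / POSTS_PER_PAGE + 1

  # The first page is the bare topic path; later pages carry ?page=N.
  page > 1 ? "#{topic_path}?page=#{page}" : topic_path
end

crawler_page_path("/t/title/123/45") # post 45 falls on the third page of 20 posts
```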

To alleviate this issue, our community of users suggested adding the X-Robots-Tag: noindex header to post-specific URLs, which we managed to expand to all non-canonical URLs in Discourse. This was released 3 months ago as a hidden site setting, disabled by default. Since then we have experimented with enabling this header on community sites as well as on meta.discourse.org.

Since the results of this period are looking good so far, we just flipped this setting to be in effect by default.

If for some reason you don’t want this behavior on your instance, you can still enable indexing of non-canonical pages by running docker exec -i app rails runner "SiteSetting.allow_indexing_non_canonical_urls = true" on your server.

Don’t expect any drastic changes in crawling and search results overnight, but over the next months you should see a decrease in crawls and search results for post-specific pages. That will result in more crawl time spent on your site’s new topics and on content that wasn’t yet indexed because of crawl budget constraints on your domain.

rrit · February 21, 2022, 11:29pm

TL;DR: Don’t block non-canonical pages - just point them to a correct url via <link rel="canonical" … > - that’s what it’s made for.

This feature might harm the SEO link-building in the long run:
All deep-links to answers inside topics are on noindex pages now! Does Google like this?

Actually a canonical tag always pointing to the topic url - even for pages deep-linking on an answer - should perfectly do the job – without adding X-Robots-Tag: noindex:
On first crawl of a deep-linking answer page Google recognizes that the page url (answer inside topic) does not fit the canonical-url and then decides to only crawl the canonical-url (topic).
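
The check described above can be sketched from the crawler's side. This is only an illustration of the idea, not a description of how Google actually works: `canonical_href` and `defers_to_other_url?` are hypothetical helpers, and a regex stands in for a real HTML parser.

```ruby
# Hypothetical crawler-side helper: returns the href of the first
# <link rel="canonical"> tag in an HTML document, or nil if there is none.
# A regex is enough for a sketch; a real crawler would use an HTML parser.
def canonical_href(html)
  html.scan(/<link\b[^>]*>/i).each do |tag|
    next unless tag.match?(/\brel\s*=\s*["']canonical["']/i)

    href = tag[/\bhref\s*=\s*["']([^"']+)["']/i, 1]
    return href if href
  end
  nil
end

# True when the page declares a canonical URL other than the one that was fetched,
# i.e. when the fetched URL is a non-canonical duplicate.
def defers_to_other_url?(fetched_url, html)
  canonical = canonical_href(html)
  !canonical.nil? && canonical != fetched_url
end
```

With a page whose head contains a canonical link to …/t/example-topic/1234, fetching …/t/example-topic/1234/5 would make `defers_to_other_url?` return true, while fetching the topic URL itself would return false.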

~~May we add <a rel="nofollow" …> to all links doing this topic-answer deep-linking?~~ Edit: no, see Search engines now blocked from indexing non-canonical pages - #9 by j127
Thereby we might save even more of this precious and limited crawl budget of search engines:
the search engine would neither extract the link in the first place nor request the url. Requesting the url returns a response with an X-Robots-Tag: noindex http-header, which causes the response to be ‘trashed’ and the url to be added to the search engine’s internal ‘noindex’-list.

Some more savings on crawl budget with nofollow added to RSS-feed urls:

github.com/discourse/discourse

FEATURE: add nofollow to RSS alternate link in topics and categories

discourse:main ← rr-it:feature/seo-rss-nofollow

opened 03:03PM - 21 Feb 22 UTC

rr-it

+2 -2

The urls of RSS-feeds of topics and categories are already excluded by `robots.txt`. But without `rel="alternate nofollow"` the search engine still extracts the url of an RSS-feed from the `<link rel="alternate" …>` tag inside the header of a topic or category page. Afterwards the search engine determines that this grabbed RSS-feed url is excluded by robots.txt and adds it to a 'noindex'-list. With this change the RSS-feed url stays completely unknown to the search engine: it is neither extracted in the first place nor added to the 'noindex'-list afterwards. Thereby the number of topic-urls the search engine has to handle gets halved.

arkklo · February 22, 2022, 4:34am

I totally agree with @rrit’s suggestions.

It would be better to point subpages/posts within the topic to its original canonical rather than blocking them.

Instead of adding noindex, can we add a nofollow tag to each of the replies under the topic?

Falco · February 22, 2022, 4:47am

That’s exactly how it works already, so I’m not sure I follow.

So you suggest that we need to update the URL here

github.com

discourse/discourse-solved/blob/main/plugin.rb#L308


      
      .where("topic_custom_fields.created_at <= ?", report.start_date)
      .count
end

register_modifier(:search_rank_sort_priorities) do |priorities, _search|
  if SiteSetting.prioritize_solved_topics_in_search
    condition = <<~SQL
        EXISTS (
          SELECT 1
            FROM topic_custom_fields
           WHERE topic_id = topics.id
             AND name = '#{::DiscourseSolved::ACCEPTED_ANSWER_POST_ID_CUSTOM_FIELD}'
             AND value IS NOT NULL
        )
      SQL

    priorities.push([condition, 1.1])
  else
    priorities
  end
end

to use a canonical URL with the page number and a post anchor?

Those are already blocked via the robots.txt, but that is a good idea!

Sounds like a good idea too!

arkklo · February 22, 2022, 5:04am

You are right, my apologies. I get lost in my own thoughts sometimes.

Quick question, I assume this feature is already available by default as long as we update Discourse to v2.9?

j127 · February 22, 2022, 5:14am

I think that the feature shouldn’t be on by default. It’s dangerous from a traffic standpoint, even if it’s only on for a brief time, so anyone who updates now might get an unwelcome surprise.

The canonical tag is the way Google recommends dealing with that problem, and it appears to be working already. Doing weird things with canonical tags can lead to strange problems with Google, and a noindex mistake could be difficult to recover from.

j127 · February 22, 2022, 5:20am

I agree with the first part of your post, but I don’t think internal nofollow is ideal. Internal links help tell search engines which pages on the site are important. Google isn’t going to follow every link it sees, because it knows that it’s seen them before. If it sees a URL like example.com/t/1234/5 but has already crawled it and knows that its canonical URL is example.com/t/1234, it probably isn’t going to waste its computing resources visiting the non-canonical version multiple times.

rrit · February 22, 2022, 10:09am

Remove ‘noindex’ for URLs linked to by external websites

Sorry, by “answers” I mean “posts” in a topic:
All deep-links from external domains to posts (e.g. forum.example.com/t/example-topic/5/11) have an http-header X-Robots-Tag: noindex now! I suggest removing this http-header again.

I suggest that <link rel="canonical" … > never use a URL with a post anchor (the last number in …/t/example-topic/1234/5) anywhere. Canonical URLs should always point to the topic url itself (…/t/example-topic/1234). I think it is already implemented like this.

Rewrite links for search engines if target url is “redirected” by `<link rel="canonical" … >`

Very good point, better not to add rel="nofollow" here.

Discourse has a special view for crawlers. New suggestion for crawler view only:
Convert all internal links pointing to a post-URL (example.com/t/1234/5) to point to the corresponding topic-URL (example.com/t/1234) instead.
Intention: Don’t announce extra URLs to search engines when these extra URLs are “redirected” by <link rel="canonical" … > anyway.

Locations where such links to posts are found:

manually added links in user content
automatically generated links in
- quotes
- first post of topic: “inbound tracked links” from other topics
- first post of topic: “selected answer”
- first post of topic - topic map open: “topic links”/“liked links”

Excursus: Where does Google find all those URLs?

“inbound tracked links” for search engines

For exactly this reason the automatically generated “inbound tracked links from other topics” on the first post of a topic should also be visible to search engines.
~~Right now these “inbound tracked links” are missing in the crawler view.~~ Edit: They are already in the crawler view.

But they point to the post-url instead of the topic-url (see html source):

<div class="crawler-linkback-list" itemscope="" itemtype="http://schema.org/ItemList">
      <div itemprop="itemListElement" itemscope="" itemtype="http://schema.org/ListItem">
        <a href="https://meta.discourse.org/t/removing-the-2-3-4-etc-links-for-each-reply-within-a-topic-url/209648/26" itemscope="" itemtype="http://schema.org/DiscussionForumPosting" itemprop="item">
          <meta itemprop="url" content="https://meta.discourse.org/t/removing-the-2-3-4-etc-links-for-each-reply-within-a-topic-url/209648/26">
          <span itemprop="name">Removing the /2, /3, /4, etc links for each reply within a topic URL</span>
        </a>
        <meta itemprop="position" content="2">
      </div>
</div>

Krischan · February 22, 2022, 10:54am

This is a crucial point. It’s one thing to get all of your pages indexed and another to get a relevant ranking for them. In my experience (with big publisher sites), smart internal linking is key to achieving this.

mstm · February 22, 2022, 5:48pm

I updated just this morning. Do you recommend enabling indexing of non-canonical pages with this?

I would not want to make my indexing worse.

Falco · February 22, 2022, 5:57pm

For anyone who has updated their site since the OP’s post date.

We have data that shows that the new header reduces crawl time on those pages, and they always had the canonical set.

But those pages are not meant to be crawled anyway. The metadata with the URL is set at the topic level; we don’t want Google to crawl the post level, as it’s duplicated content.

Cool, so nothing needs to change here.

Doing that at runtime may be too CPU expensive, and saving two versions of every post will be disk expensive.

Our defaults are always what we recommend. However, we maintain and announce site settings so people can choose otherwise if they feel like a default isn’t ideal for their site.

mstm · February 22, 2022, 6:11pm

Perfect, then I will leave it as recommended.
Thank you

mstm · February 22, 2022, 6:23pm

One last thing and then I won’t disturb you anymore

So could there be problems with sitemap_recent.xml that contains such links?
https://meta.discourse.org/t/category-moderator-improvements/158628?page=2

Falco · February 22, 2022, 6:29pm

That example is a canonical page, so it isn’t affected in any way by the changes outlined in the OP.

rrit · February 22, 2022, 8:19pm

I see a huge difference when there is an external link to a post-url.

# A: 
External Domain
|
|--(link juice)--> post-url
                   |
                   |__/ crawling:      \---> post-url not indexed and
                      \ header noindex /     link-juice mostly gone

# B:
External Domain
|
|--(link juice)--> post-url
                   |
                   |__/ crawling:        \__|--> post-url not indexed
                      \ answer canonical /  |--> topic-url indexed (anyway)
                                                 with link-juice transfer

We should bring this up on

Canapin · February 22, 2022, 9:44pm

For neophytes like me regarding SEO, does it imply that it’s an SEO improvement that could potentially lead to a slight increase/benefit in Google search results?

Falco · February 22, 2022, 10:26pm

Yes, that is the goal!

We tested the change in a tech news community over a few months, and we saw a large peak-to-peak increase in anon page views. Our end goal is always to make all Discourse communities healthier on all fronts.

rrit · February 22, 2022, 11:47pm

Is this effect visible in Google Search Console report ‘Settings’ → ‘Crawling’ → ‘Crawl stats’ ?

rrit · February 23, 2022, 11:00pm

Taking into consideration …

A. Decreasing crawls

B. No two versions of content

C. Use canonical tag

D. No nofollow

E. No noindex

… and having internal links at …

… I suggest the following implementation to get the best compromise:

1. Don’t add http-header X-Robots-Tag: noindex.
– taking into account [E] –
2. Keep canonical tags always pointing to the topic-url.
– decreasing crawls [A] and considering [C] –
3. For crawler view only: Convert automatically generated links to always link to the topic-url instead of the post-url, for all links in the first post of a topic under “inbound tracked links from other topics” and “topic map open: topic links/liked links”.
– decreasing crawls [A] and considering [D], but willfully disregarding [B] –
On [B]: CPU expenses occur for crawler visits only and consist of a regex-replace that cuts off the last number of internal urls ending in two numbers, e.g. …/t/example-topic/1234/5 → …/t/example-topic/1234, confined to the first post’s “inbound tracked links from other topics” and “topic map open” sections only.
4. For all views: add internal nofollow to quotes and manually added links in user content.
– decreasing crawls [A] and considering [B], but slightly disregarding [D] –
On [D]: important links are already automatically duplicated to the first post in the “topic map open: topic links/liked links” section [see 3.], and most quotes stay inside the topic itself anyway.
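
The regex-replace mentioned for the crawler view could look roughly like the following. This is a sketch under assumptions, not a patch against Discourse: `POST_HREF` and `link_to_topics` are hypothetical names, the input is expected to be an already-isolated HTML fragment (such as the linkback list), and the pattern does not check that a link is internal. Slugless urls like /t/1234/5 and urls with a query string are deliberately left untouched.

```ruby
# Matches topic hrefs that end in two numeric segments
# (…/t/slug/topic_id/post_number) and captures everything up to the topic id.
POST_HREF = %r{(href="[^"]*/t/[^"/]+/\d+)/\d+(?=")}

# Hypothetical crawler-view rewrite: cuts the trailing post number from
# matching hrefs. Other attributes (e.g. itemprop content) are not changed.
def link_to_topics(fragment_html)
  fragment_html.gsub(POST_HREF, '\1')
end

fragment = '<a href="https://forum.example.com/t/example-topic/1234/5">Example topic</a>'
link_to_topics(fragment)
# the href now ends in /t/example-topic/1234
```

Because the replacement is a single pass over a small fragment, the cost per crawler request should stay low, which is the point made about [B] above.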

Some idea on internal links

Google says How to Specify a Canonical with rel="canonical" and Other Methods | Google Search Central | Documentation | Google for Developers

And Google says SEO Link Best Practices for Google | Google Search Central | Documentation | Google for Developers

So Discourse might set internal links like this:

<a href="/t/example-topic/1234" routerLink="/t/example-topic/1234/5">…</a>

For Google the link goes straight to the canonical topic-url …/1234 - and Google does not get to know about the post-url …/1234/5 from this link-syntax.

For user-navigation some additional JavaScript in the Ember-app will do the trick:
e.g. replace href with routerLink.

SethWilliams · February 24, 2022, 5:02pm

Looks like a great improvement! Thanks for making this happen @Falco and Discourse team!
