Removing the /2, /3, /4, etc links for each reply within a topic URL

I’m wondering if there’s a way to remove canonical links entirely from a Discourse forum.

I’m referring to the /2, /3, /4, etc extensions that show up for each topic URL as a user scrolls down the page. I would like to have every reply within a topic simply refer to the original URL (not redirecting visitors to the original URL, but removing those paths completely, so they don’t exist).

Originally, I thought this was a cool feature of Discourse, but as I’ve been running a similar forum on my site over the past year with NodeBB (which uses the same canonical link feature for each reply), we’ve discovered that this functionality can be catastrophic for the SEO of a public forum.

Why? Because even though these /2, /3, /4 URLs are canonical links, Google will eventually crawl and index all of them. This means that every new reply within a topic can show up in Google search results. Since they’re basically duplicate versions of the original topic URL, these extra indexed pages rarely get visits, and when they do, visitors don’t stay on the site for more than a couple of seconds.

When Google indexes a lot of extra pages and these pages don’t get much activity, it tells Google that the domain has a lot of low-quality URLs, and that hurts the health of the domain as a whole (as it did in our case). We have lost about 40% of our traffic since launching our public forum, and a big contributing factor was these additional URLs (over 30,000 low-quality URLs were added to our sitemap over the course of 12 months, simply because of the replies that were left in each topic).

Now, if you’re running a private forum, none of this matters, because a private forum doesn’t benefit from any kind of SEO since the whole forum is hidden from the internet anyway. But if you ARE trying to run a public forum and your objective is to pick up organic search traffic, these additional URL strings can have a huge detrimental impact on the overall health of your site.

So, I’m wondering, is there any conceivable way to tweak the settings or create a plugin that will tell a Discourse forum NOT to create these additional URLs for each reply within a topic?

I’m considering migrating our forum from NodeBB over to Discourse, but this will only make sense if there’s a way to make our Discourse forum not create these additional URLs.

3 Likes

This has come up a little bit before: Google indexing same page multiple times: Issue with canonicals

how were you able to determine that this was a significant contributor? a lot of SEO posts made here have been fairly speculative, so some evidence goes a long way!

I have no experience with this plugin and can’t vouch for it personally, but someone has attempted to disable canonical links entirely with a plugin before: Remove Canonical Link Plugin

7 Likes

This has come up a little bit before: Google indexing same page multiple times: Issue with canonicals

If I’m following this conversation correctly, this seems to be referring to a slightly different issue than what I’m talking about above. It’s not a problem to create multiple canonical URLs for a topic if they’re grouped by 20 replies at a time and have unique meta descriptions (page=2, page=3, etc). The problem is when a new URL is created for every individual reply within a topic (/2, /3, /4, etc).

For a topic with 100 replies, the former would result in 5 URLs per topic (100 replies grouped into multiples of 20). The latter would result in 100 URLs per topic (a new, individual URL for every single reply), which creates a big SEO problem.
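To make the arithmetic concrete, here is a rough sketch of the two URL schemes for a single topic. The page size of 20 comes from the numbers above; the function names and the example topic path are made up for illustration and are not taken from Discourse or NodeBB.

```python
import math

PAGE_SIZE = 20  # replies grouped per ?page=N URL, as described above


def paged_urls(base: str, replies: int, page_size: int = PAGE_SIZE) -> list[str]:
    """One URL per group of `page_size` replies (page 1 is the bare topic URL)."""
    pages = max(1, math.ceil(replies / page_size))
    return [base if n == 1 else f"{base}?page={n}" for n in range(1, pages + 1)]


def per_reply_urls(base: str, replies: int) -> list[str]:
    """One URL per reply: /1, /2, /3, ..."""
    return [f"{base}/{n}" for n in range(1, replies + 1)]


topic = "/t/example-topic/123"  # hypothetical topic path
print(len(paged_urls(topic, 100)))      # 5
print(len(per_reply_urls(topic, 100)))  # 100
```

The gap grows linearly with the length of the topic, which is why long threads are where the extra URLs pile up.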

how were you able to determine that this was a significant contributor? a lot of SEO posts made here have been fairly speculative, so some evidence goes a long way!

With tools like Google Search Console, SEMrush, and Ahrefs. All of them highlighted warnings and errors resulting from the massive number of URLs on our site that were being created by these forum topic replies, all of which were being indexed by Google without providing substantial new content. Health scores were in the 30s and 40s when our forum was public. Once we locked down our entire forum and made it private (so Google couldn’t see it) and re-ran the tests, our health score went up into the 80s from this change alone.

I have no experience with this plugin and can’t vouch for it personally, but someone has attempted to disable canonical links entirely with a plugin before: Remove Canonical Link Plugin

I found this as well. Unfortunately, this plugin actually makes the situation worse, because it just removes canonical tags completely while still keeping the /2, /3, etc pages, so these additional URLs are still seen as low-quality duplicate content.

3 Likes

Worth noting that it’s on our roadmap to add an X-Robots-Tag: noindex header to the responses for those pages.
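For anyone who wants to check whether a given response already carries that header, a small sketch along these lines might help. The function name and the sample header values are hypothetical, and it only handles the simple comma-separated form of the header (no per-crawler prefixes).

```python
def has_noindex(headers: dict[str, str]) -> bool:
    """Return True if an X-Robots-Tag header asks crawlers not to index the page."""
    for name, value in headers.items():
        if name.lower() != "x-robots-tag":  # header names are case-insensitive
            continue
        directives = [d.strip().lower() for d in value.split(",")]
        # "none" is shorthand for "noindex, nofollow"
        if "noindex" in directives or "none" in directives:
            return True
    return False


print(has_noindex({"X-Robots-Tag": "noindex"}))    # True
print(has_noindex({"Content-Type": "text/html"}))  # False
```

You would feed it the response headers from whatever HTTP client you already use.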

10 Likes

Good to know. Sounds like a big step in the right direction.

For what it’s worth, we’ve actually seen some instances in our current forum (back when it was still public) where we had disallowed certain subfolders of our forum in the robots.txt file, and Google was crawling them anyway. I believe this is highly irregular… but our discovery was that there are some cases where Google doesn’t follow this directive. The only way to be 100% sure a page doesn’t get indexed is for the page to be hidden behind a login screen or for the page to not exist at all.

From an outsider’s perspective, it seems like it should be simple to tell Discourse not to go through the extra motions of creating these additional URLs for every reply. The software would be doing less work and creating less complexity this way, wouldn’t it?

It would be nice to have a feature in the admin settings to just turn off these extra URLs altogether.

1 Like

Yes, that way we will be using the header instead of listing those URLs in the robots.txt file.

3 Likes

I am not sure I understand what is happening here.

Because Discourse is doing the former: it does create ?page=X canonical URL meta tags for groups of 20 replies. And as you can see here, post numbers are never added to the sitemap URLs, only ?page=X URLs are.

So I was going to tell you that there is no problem.

But then I did a Google search for a topic with many replies, and although page 2 of those search results is full of the ?page=X links, some of the top results actually link to those numbered replies.

But why is this happening? That page does have a correct canonical URL.

rgj@labgate:~$ wget -q -O - "https://meta.discourse.org/t/babble-a-chat-plugin/87297/418"|grep -e "<title" -e canonical
<title>Babble - A Chat Plugin - #418 by HAWK - broken-plugin - Discourse Meta</title>
<link rel="canonical" href="https://meta.discourse.org/t/babble-a-chat-plugin/87297?page=20" />
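The same check can be scripted if you want to run it over many pages. This is only a sketch using Python's standard-library HTML parser; the class and function names are my own, and the sample string is the tag from the output above rather than a live fetch.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collects the href of every <link rel="canonical"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.canonicals: list[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rels and attr.get("href"):
            self.canonicals.append(attr["href"])


def find_canonicals(html: str) -> list[str]:
    """Return all canonical URLs declared in the given HTML."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonicals


sample = '<link rel="canonical" href="https://meta.discourse.org/t/babble-a-chat-plugin/87297?page=20" />'
print(find_canonicals(sample))
```

A page that declares no canonical, or more than one, would show up as an empty or multi-element list.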

3 Likes

I don’t know.

My wild guess is that those links appear elsewhere in the wild so those are the ones that get indexed?

1 Like

Canonical is there to prevent that from happening. At least, in theory.

4 Likes

I had a lengthy discussion about this with a blog-centric community here in Brazil that uses Discourse, and it’s what pushed me to try this new approach of sending the noindex header for those post-specific pages. I should be able to clean up and merge the PR for that next week, and we can start experimenting with it.

7 Likes

But why is this happening? That page does have a correct canonical URL.

Exactly. It shouldn’t be happening (in theory) but it does, and as we experienced with our site, it really will do some damage to the health score of a domain, which can eventually have a big negative impact on the entire domain’s search rankings.

Regarding what @pfaffman said,

My wild guess is that those links appear elsewhere in the wild so those are the ones that get indexed?

This would’ve been my thought too… but we saw that tens of thousands of these individual replies on our site were getting indexed by Google even though absolutely nothing was linking to them. It’s pretty bizarre and I can’t pretend to understand why/how it’s happening, but it underscores the need for a forum admin to simply have the ability to turn off these /2, /3, /4, etc URLs for each reply if they so choose.

I am curious, is it difficult to give Discourse this capability? From my non-coder’s perspective, it seems like this should be easy since it’s just telling the software not to do as much work… but perhaps there’s something more to it that I don’t understand?

2 Likes

I’m not sure, but doing noindex on those might be harmful. Discourse is already handling it correctly by using canonical URLs.

If you noindex them, there is a chance that it will noindex the entire page (because they all share the same canonical URL), which would be disastrous. I don’t know for sure what will happen, but I’d be extremely careful, because Google often handles edge cases unpredictably, and how they handle them can change with updates. I’ve seen weird things happen with canonical tags.

It’s unknown exactly how the ranking algorithm works, and it changes over time, but one other thing to consider is that rankings are a result of inbound links. If an external site links to a /number URL, and that URL returns a noindex header, it’s conceivable that Google might not transfer the inbound “link juice” to the canonical URL, which could be detrimental to the search rankings of Discourse sites.

I think it would be much safer to contact someone at Google Search and let them know that the canonical tag isn’t working for a widely used CMS than to try to come up with a workaround that Google might handle differently as it makes more updates.

6 Likes

No, it underscores the need to fix things. As a software engineer I find it very difficult to remove functionality because it’s not working 100% correctly. Let’s see if we can help and get to the bottom of this instead.

Are you sure about this? I have never seen a post number in a site map.

3 Likes

That’s why it will be under a site setting.

5 Likes

Thanks for questioning this. “Sitemap” was probably the wrong word to use. What I meant was that these numbered posts were being crawled and indexed by Google and showing up as individual pages in Google Analytics and Search Console, resulting in A LOT of low-quality pages on our domain.

If these numbers simply weren’t added to every single reply, Google would only be able to see the original post URL.

2 Likes

Yeah, and if Google honored those rel="canonical" meta tags (which they invented!) that were put in there specifically to prevent Google from doing this, it would not be an issue and we would still be able to link to a specific post at the same time.

7 Likes

Sounds good. It would be ideal if it’s off by default, because it’s not inconceivable that it could cause topic pages to disappear from Google or other search engines.

I’m not sure if it was already mentioned, but another way to fix it without noindex might be to use URL fragments for the posts, since those shouldn’t get counted as separate pages.

/t/slug/id#13
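As a rough illustration of why a fragment would not count as a separate page: the part after the # stays on the client side and is not part of the URL a crawler or browser requests from the server. The helper name and the forum hostname below are hypothetical.

```python
from urllib.parse import urldefrag


def split_post_link(link: str) -> tuple[str, str]:
    """Split a link into the URL that is actually requested and its fragment."""
    url, fragment = urldefrag(link)
    return url, fragment


post_link = "https://forum.example.com/t/slug/123#13"  # hypothetical forum URL
url, fragment = split_post_link(post_link)
print(url)       # https://forum.example.com/t/slug/123
print(fragment)  # 13
```

Every post in the topic would then resolve to the same requested URL, with only the fragment differing.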
5 Likes

Thanks for the suggestion. I’d like to think that would work… but seeing as how the original “correct” method isn’t even working, it makes me skeptical that anything will solve the problem other than just eliminating the numbered replies altogether.

Of course, I’m not saying ALL Discourse users should stop using them. Heck, if it weren’t for Google being dumb (and/or if we intended for our forum to be private and not public), I would be all for it… but just having the option to turn off the automatic numbering of replies would be a huge help to those who run public forums and care about the overall SEO health of their domain.

1 Like

That would remove the ability to link to specific posts though. There would be no way to link to post #789 in a 1,000-post topic, and it would be annoying for users to have to scroll that far.

It’s strange. I searched Google to see if the post ID URLs were getting indexed on my forums, and only the canonical URLs are showing on all the topics I checked.

I do see it on another large Discourse site though. It also appears on this topic. [Google query]

I ran a diff on the two responses like this:

curl -s https://meta.discourse.org/t/removing-the-2-3-4-etc-links-for-each-reply-within-a-topic-url/209648 > 1.html
curl -s https://meta.discourse.org/t/removing-the-2-3-4-etc-links-for-each-reply-within-a-topic-url/209648/8 > 2.html
vim -d 1.html 2.html

One difference that stands out is that the article:published_time values differ, though they should probably be the same, because the pages are otherwise almost identical. I wonder if that meta tag could make Google override the canonical URL. A Google employee says over here that canonical URLs can be ignored in certain cases.

<meta property="article:published_time" content="2021-11-19T15:57:21+00:00" />
<meta property="article:published_time" content="2021-11-20T06:48:06+00:00" />
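If you would rather not eyeball a full diff, something like the following sketch could list only the meta tags whose content differs between two saved pages. The names are my own, it uses only the standard library, and the sample strings are the two tags shown above rather than the full pages.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects <meta> tags keyed by their property (or name) attribute."""

    def __init__(self) -> None:
        super().__init__()
        self.meta: dict[str, str | None] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        key = attr.get("property") or attr.get("name")
        if key:
            self.meta[key] = attr.get("content")


def meta_tags(html: str) -> dict:
    collector = MetaCollector()
    collector.feed(html)
    return collector.meta


def differing_meta(html_a: str, html_b: str) -> dict:
    """Map each meta key whose content differs to its (a, b) pair of values."""
    a, b = meta_tags(html_a), meta_tags(html_b)
    return {k: (a.get(k), b.get(k)) for k in sorted(a.keys() | b.keys()) if a.get(k) != b.get(k)}


page_1 = '<meta property="article:published_time" content="2021-11-19T15:57:21+00:00" />'
page_2 = '<meta property="article:published_time" content="2021-11-20T06:48:06+00:00" />'
print(differing_meta(page_1, page_2))
```

With the saved files from the curl commands, you would pass the contents of 1.html and 2.html instead of the sample strings.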

Also, is the ignore_canonical tag unique to Discourse or is there a chance that Google might be following it? I saw it in the HTML.

<meta property="og:ignore_canonical" content="true" />
3 Likes

It could be slightly more annoying, but if the page=2, page=3, etc URLs still work (which doesn’t really create an SEO problem like the numbered replies do), you could at least link a person to the correct page within a conversation. This would get them most of the way there, provided they’re willing to scroll a little bit.

One forum that works this way is BiggerPockets. Their replies do not have individually numbered URLs, but the topics do have numbered pages, like this: https://www.biggerpockets.com/forums/311-buying-selling-real-estate/topics/1000291-kids-throwing-rocks-at-windows-nearly-everyday-wont-stop?page=2 (take note of the URL as you’re scrolling through each topic and page).

Their forum has always been a major component of what makes the site so special and successful from an SEO standpoint, so it’s a pretty good example of what works.

Interesting. I have no idea if that’s causing the problem or not, but I can see how the inconsistency might confuse Google into ignoring the canonical URL.

Even so, with the nature of how forums work, unless you wanted to completely remove the dates and timelines of each topic, you couldn’t eliminate this, could you? Accounting for the dates and times of each post and reply is sort of an integral part of how forums work.

1 Like