Sitemap plugin - no ?page=… URLs in default sitemap

The #sitemap plugin does not include any ?page=… URLs in the default sitemaps, e.g. https://meta.discourse.org/sitemap_4.xml:

<url>
  <loc>
    https://meta.discourse.org/t/importing-migrating-from-phpbb3/30810
  </loc>
  <lastmod>2022-02-25T21:55:40Z</lastmod>
</url>

In the recent sitemap, the pagination URLs are included, e.g. https://meta.discourse.org/sitemap_recent.xml:

<url>
  <loc>
    https://meta.discourse.org/t/importing-migrating-from-phpbb3/30810?page=18
  </loc>
  <lastmod>2022-03-07T12:03:50Z</lastmod>
</url>

Are there no ?page=… URLs in the default sitemaps by design?
All these ?page=… URLs are canonical URLs and should therefore be added to the default sitemap, e.g.:

<url>
  <loc>
    https://meta.discourse.org/t/importing-migrating-from-phpbb3/30810
  </loc>
  <lastmod>2022-02-25T21:55:40Z</lastmod>
</url>
<url>
  <loc>
    https://meta.discourse.org/t/importing-migrating-from-phpbb3/30810?page=2
  </loc>
  <lastmod>2022-03-02T19:08:07Z</lastmod>
</url>

[…]

<url>
  <loc>
    https://meta.discourse.org/t/importing-migrating-from-phpbb3/30810?page=18
  </loc>
  <lastmod>2022-03-07T12:03:50Z</lastmod>
</url>

I do not think this is deliberate. Can you do a PR to fix it?

Thanks @rrit. A few months ago I noticed this too, but I always thought it was normal :man_facepalming:t2:

I can implement an easy fix which is not very precise about the last edited date: e.g. all pages of one topic would use the same date, that of the last edited post.
As a result, on a new post in a topic (with many posts and many pages), all the pages would get a new last-changed date, even when only the last page needs the new date.

Is this a feasible solution?


Otherwise we would need to bundle all posts of a topic into chunks of 20 posts (one chunk per page) and then calculate the last-changed date for each chunk individually.
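
Here is a minimal, self-contained sketch of that chunking idea in plain Ruby (not actual plugin code; the chunk size of 20 is taken from the paragraph above): slice the topic's posts into pages and use the newest post timestamp in each slice as that page's <lastmod>.

require "time"

POSTS_PER_PAGE = 20 # chunk size assumed from the discussion above

# post_times: timestamps of the topic's posts, in post order
def lastmod_per_page(post_times)
  post_times.each_slice(POSTS_PER_PAGE).map.with_index(1) do |slice, page|
    [page, slice.max.utc.iso8601] # newest post in the slice dates the page
  end
end

# example: 45 hourly posts spread over three pages
times = 45.times.map { |i| Time.utc(2022, 3, 1) + i * 3600 }
lastmod_per_page(times)
# => [[1, "2022-03-01T19:00:00Z"], [2, "2022-03-02T15:00:00Z"], [3, "2022-03-02T20:00:00Z"]]

Only the last slice changes when a new post arrives, so older pages would keep their previous <lastmod>.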

Honestly, I looked at this and I am mixed on any changes here. The issue is not that Google is having trouble discovering content on Discourse forums.

It is that Google is discovering and crawling the content, and then, due to “arbitrary decision making”, deciding that it does not belong in the index.

Does “arbitrary decision making” refer to one of these points?
(See Index Coverage report - Search Console Help.)


On the pro side of adding all these ?page=… canonical URLs to the sitemap:
it gives Google a strong hint on <lastmod> for these URLs. That way Google has no reason to re-crawl unchanged ?page=… URLs and can spend its precious crawl budget on more important URLs.

If ?page=… URLs are missing from the sitemap, Google finds them anyway and does some “arbitrary” re-crawling, even when it is totally unnecessary because nothing in the content has changed.

See Build and Submit a Sitemap | Google Search Central | Documentation | Google Developers
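
To make this concrete, here is a hypothetical sketch (plain Ruby, not actual plugin code) of what such per-page sitemap entries could look like; the topic URL and the two <lastmod> dates are taken from the examples earlier in this topic.

require "time"

def entry(loc, time)
  "<url>\n  <loc>#{loc}</loc>\n  <lastmod>#{time.utc.iso8601}</lastmod>\n</url>"
end

topic_url = "https://meta.discourse.org/t/importing-migrating-from-phpbb3/30810"
# per-page lastmod values, e.g. computed as in the chunking sketch above
pages = { 1 => Time.utc(2022, 2, 25, 21, 55, 40), 18 => Time.utc(2022, 3, 7, 12, 3, 50) }

puts pages.map { |page, t|
  entry(page == 1 ? topic_url : "#{topic_url}?page=#{page}", t)
}.join("\n")

With every page listed this way, Google can skip re-crawling any page whose <lastmod> has not moved.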


Google really keeps track of, and differentiates between, where it knows about URLs from:
“All submitted pages” (sitemap) vs. “All known pages” (links etc.).
See Google Search Console → Index → Coverage Report.

  • “A sitemap is an important way for Google to discover URLs on your site.” see
  • “Google chooses the canonical page based on a number of factors (or signals), such as […], presence of the URL in a sitemap, […].” see
  • “Using a sitemap doesn’t guarantee that all the items in your sitemap will be crawled and indexed, as Google processes rely on complex algorithms to schedule crawling.” see

I hope it will be implemented along with this :slight_smile:

This is certainly something for @Roman_Rizzi to keep in mind when he integrates this into core.

I much prefer merging sitemap into core first, prior to layering on more changes, but once that is done … maybe we can start with canonical page-based URLs on _recent. We have a canonical URL now which is usable in posts.rss; with adequate caching it could also be usable in sitemaps.

I’m having trouble with Google Search Console trying to index URLs like https://example.com/t/title-slug/1234?page=3, which make Discourse throw a 404. Removing the ?page=x parameter makes the URL valid.

I assume this is some kind of side effect of Discourse adding pagination to the version of the site that it serves to crawlers:

Page URLs work fine; you just need more than N posts.

Do you happen to have a ton of deleted posts on said topic?
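
To picture why deletions matter, here is a rough, hypothetical sketch (plain Ruby, not actual Discourse code), assuming the 20-posts-per-page chunking mentioned earlier in this topic: a ?page=N URL only resolves while the topic still has enough posts, so mass-deleting posts can turn a previously crawled page URL into a 404.

CHUNK_SIZE = 20 # assumed posts-per-page for crawler views

def page_count(posts_count)
  (posts_count.to_f / CHUNK_SIZE).ceil
end

def valid_page?(page, posts_count)
  page >= 1 && page <= page_count(posts_count)
end

valid_page?(2, 40) # => true: with 40 posts, ?page=2 exists and gets crawled
valid_page?(2, 15) # => false: after deletions only 15 posts remain, so ?page=2 now 404s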

Hi Sam, thanks for the reply. After posting this I found your explanation here:

But in my case, no: the topics with this problem that I’ve looked at don’t show any modifications to the original threading. The only thing is that they were imported from Drupal. But I need to dig more into other examples to see whether any topics that were originally created in Discourse are also affected, because unfortunately there are tons of them, probably thousands.

Yikes, were tons imported from Drupal? Is that the common thread here?

Yeah, close to 100k topics and ~2M posts. I’m not sure if this issue only affects imported topics, though; I’ll post back here soon if I find any more anomalies.
