Sitemap plugin - no ?page=… URLs in default sitemap

The #sitemap plugin does not include any ?page=… URLs in the default sitemaps, e.g. https://meta.discourse.org/sitemap_4.xml

<url>
  <loc>
    https://meta.discourse.org/t/importing-migrating-from-phpbb3/30810
  </loc>
  <lastmod>2022-02-25T21:55:40Z</lastmod>
</url>

In the recent sitemap, the pagination URLs are included, e.g. https://meta.discourse.org/sitemap_recent.xml

<url>
  <loc>
    https://meta.discourse.org/t/importing-migrating-from-phpbb3/30810?page=18
  </loc>
  <lastmod>2022-03-07T12:03:50Z</lastmod>
</url>

Is the omission of ?page=… URLs from the default sitemaps by design?
All these ?page=… URLs are canonical URLs and should therefore be added to the default sitemap, e.g.

<url>
  <loc>
    https://meta.discourse.org/t/importing-migrating-from-phpbb3/30810
  </loc>
  <lastmod>2022-02-25T21:55:40Z</lastmod>
</url>
<url>
  <loc>
    https://meta.discourse.org/t/importing-migrating-from-phpbb3/30810?page=2
  </loc>
  <lastmod>2022-03-02T19:08:07Z</lastmod>
</url>

[…]

<url>
  <loc>
    https://meta.discourse.org/t/importing-migrating-from-phpbb3/30810?page=18
  </loc>
  <lastmod>2022-03-07T12:03:50Z</lastmod>
</url>


I do not think this is deliberate; can you do a PR to fix it?

Thanks @rrit! I noticed this too a few months ago, but I always thought it was normal :man_facepalming:

I can implement an easy fix that is not very precise about the last-edited date: e.g. all pages of a topic would use the date of the topic's last-edited post.
As a result, a new post in a topic (with many posts and many pages) would give all of its pages a new last-changed date, even though only the last page actually needs it.

Is this a feasible solution?


Otherwise, we would need to bundle a topic's posts into batches of 20 (one batch per page) and calculate the last-changed date for each batch separately.
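The per-batch calculation could look something like this sketch (Python for illustration only; this is not the plugin's actual code, and the 20-posts-per-page constant simply mirrors the batching described above):

```python
POSTS_PER_PAGE = 20  # assumed page size; one sitemap entry per page of posts

def page_lastmods(post_dates):
    """Group a topic's post timestamps (in post order) into pages of
    POSTS_PER_PAGE and return the newest timestamp within each page,
    so only the page that actually changed gets a new last-modified date."""
    return [
        max(post_dates[i:i + POSTS_PER_PAGE])
        for i in range(0, len(post_dates), POSTS_PER_PAGE)
    ]

# 25 posts -> 2 pages; a new post 25 only bumps page 2's date
dates = [f"2022-03-{day:02d}T12:00:00Z" for day in range(1, 26)]
print(page_lastmods(dates))
```

With this approach, an edit or new post on the last page leaves the earlier pages' dates untouched, unlike the easy fix above.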

Honestly, I looked at this and I am mixed on any changes here. The issue is not that Google has trouble discovering content on Discourse forums.

It is that Google discovers and crawls the content, and then, due to "arbitrary decision making", decides that content does not belong in the index.

Does "arbitrary decision making" refer to one of these points?
(See Index Coverage report - Search Console Help)


On the pro side of adding all these ?page=… canonical URLs to the sitemap:
It gives Google a strong <lastmod> hint for these URLs. That way Google has no reason to re-crawl unchanged ?page=… URLs and can spend its precious crawl budget on more important URLs.

If the ?page=… URLs are missing from the sitemap, Google finds them anyway and does some "arbitrary" re-crawling, even when that is totally unnecessary because the content has not changed.
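Emitting one <url> entry per topic page, each with its own <lastmod>, could be sketched like this (Python for illustration only; the function name is hypothetical, not the plugin's actual code):

```python
def sitemap_entries(topic_url, page_lastmods):
    """Build one <url> element per topic page. Page 1 keeps the bare
    topic URL (matching the current default sitemap); later pages get
    the canonical ?page=N URL, each with its own <lastmod> hint."""
    entries = []
    for page, lastmod in enumerate(page_lastmods, start=1):
        loc = topic_url if page == 1 else f"{topic_url}?page={page}"
        entries.append(
            f"<url><loc>{loc}</loc><lastmod>{lastmod}</lastmod></url>"
        )
    return entries

print("\n".join(sitemap_entries(
    "https://meta.discourse.org/t/importing-migrating-from-phpbb3/30810",
    ["2022-02-25T21:55:40Z", "2022-03-02T19:08:07Z"],
)))
```

This would produce exactly the kind of per-page entries shown in the example above, so Google can skip pages whose <lastmod> has not moved.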

See Build and Submit a Sitemap | Google Search Central Documentation


Google really does track and distinguish where it learned about URLs:
"All submitted pages" (sitemap) vs. "All known pages" (links, etc.).
See Google Search Console → Index → Coverage Report

  • “A sitemap is an important way for Google to discover URLs on your site.” see
  • “Google chooses the canonical page based on a number of factors (or signals), such as […], presence of the URL in a sitemap, […].” see
  • “Using a sitemap doesn’t guarantee that all the items in your sitemap will be crawled and indexed, as Google processes rely on complex algorithms to schedule crawling.” see

I hope it will be implemented along with this :slight_smile:


This is certainly something for @Roman_Rizzi to keep in mind when he integrates this into core.

I much prefer merging the sitemap plugin in first, prior to layering on more changes, but once that is done … maybe we can start with canonical page-based URLs on _recent. We have the canonical URL now, which is usable in posts.rss; with adequate caching it could also be usable in sitemaps.
