Are there no ?page=… urls in default sitemaps by design?
All these ?page=… urls are canonical urls and should therefore be added to the default sitemap - e.g.
I can implement an easy fix which is not very precise about the last edited date: e.g. all pages of a topic will use the same date - that of the topic's last edited post.
As a result, when a new post is added to a topic (with many posts and many pages), all of its pages will get a new last changed date - even though only the last page actually needs it.
Is this a feasible solution?
Otherwise we would need to bundle a topic's posts into packages of 20 posts (one package per page) and then calculate the last changed date for each package separately.
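To make that second option concrete, here is a minimal sketch (the function name and the way the post dates are obtained are made up for illustration; the real implementation would of course read them from the posts table): group a topic's posts into chunks of 20 and take the newest edit date in each chunk as that page's lastmod.

```python
from datetime import datetime
from math import ceil

POSTS_PER_PAGE = 20  # assumption: 20 posts per page, as mentioned above


def page_lastmod_dates(post_updated_ats: list[datetime]) -> list[datetime]:
    """One lastmod per page: the newest 'last edited' date among the
    posts that fall on that page (dates given in topic/post order)."""
    pages = ceil(len(post_updated_ats) / POSTS_PER_PAGE)
    return [
        max(post_updated_ats[i * POSTS_PER_PAGE:(i + 1) * POSTS_PER_PAGE])
        for i in range(pages)
    ]

# The "easy fix" from above would instead be [max(post_updated_ats)] * pages,
# i.e. every page of the topic gets the date of the most recently edited post.
```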
On the pro side of adding all these ?page=… canonical urls to the sitemap:
It gives Google a strong hint via <lastmod> for these urls (see the sketch further below). That way Google has no reason to re-crawl unchanged ?page=… urls and might spend its precious crawl budget on more important urls.
If ?page=… urls are missing from the sitemap, Google finds them anyway and does some “arbitrary” re-crawling - even if it’s totally unnecessary because nothing in the content has changed.
Google really keeps track of where it knows a url from and distinguishes between:
“All submitted pages” (sitemap) and “All known pages” (links etc.)
See Google Search Console → Index → Coverage Report
“A sitemap is an important way for Google to discover URLs on your site.” see
“Google chooses the canonical page based on a number of factors (or signals ), such as […], presence of the URL in a sitemap, […].” see
“Using a sitemap doesn’t guarantee that all the items in your sitemap will be crawled and indexed, as Google processes rely on complex algorithms to schedule crawling.” see
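For illustration, a rough sketch of what the extra sitemap entries could look like, reusing the per-page dates from the sketch above (the exact url format - page 1 staying the bare topic url, later pages getting ?page=N - is my assumption):

```python
from datetime import datetime


def sitemap_entries(topic_url: str, lastmods: list[datetime]) -> str:
    """Render one <url> element per page; page 1 keeps the bare topic url,
    later pages get the ?page=N canonical url plus their own lastmod."""
    entries = []
    for page, lastmod in enumerate(lastmods, start=1):
        loc = topic_url if page == 1 else f"{topic_url}?page={page}"
        entries.append(
            f"  <url><loc>{loc}</loc>"
            f"<lastmod>{lastmod.strftime('%Y-%m-%dT%H:%M:%SZ')}</lastmod></url>"
        )
    return "\n".join(entries)

# e.g. sitemap_entries("https://example.com/t/title-slug/1234", dates)
# produces one entry per page, each with its own <lastmod>.
```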
This is certainly something for @Roman_Rizzi to keep in mind when he integrates this into core.
I much prefer merging in sitemap first, prior to layering on more changes, but once that is done … maybe we can start with canonical page-based urls on _recent. We have a canonical url now which is usable in posts.rss; with adequate caching it can also be usable in sitemaps.
I’m having trouble with Google Search Console trying to index URLs like https://example.com/t/title-slug/1234?page=3, which make Discourse throw a 404. Removing the ?page=x parameter makes the URL valid.
I assume this is some kind of side effect of Discourse adding pagination to the version of the site that it serves to crawlers:
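Roughly the arithmetic I picture (just my mental model, assuming the 20-posts-per-page chunking mentioned earlier in this topic): a ?page=3 url only resolves if the crawler view of the topic actually has a third page.

```python
from math import ceil

POSTS_PER_PAGE = 20  # assumption: crawler pages hold 20 posts each


def crawler_page_exists(post_count: int, page: int) -> bool:
    """A ?page=N url is only valid while N <= number of crawler pages,
    so an indexed ?page=N url can start returning 404 if the topic no
    longer has enough posts to fill that page."""
    return 1 <= page <= max(ceil(post_count / POSTS_PER_PAGE), 1)

# e.g. a topic with 45 posts has 3 crawler pages: ?page=3 resolves,
# ?page=4 would 404.
```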
Hi Sam, thanks for the reply. After posting this I found your explanation here:
But in my case, no - the topics with this problem that I’ve looked at don’t show any modifications to the original threading. The only thing is that they were imported from Drupal. But I need to dig into more examples to see whether any topics that were originally created in Discourse are also affected, because unfortunately there are tons of them - probably thousands.
Yeah, close to 100k topics and ~2M posts. I’m not sure if this issue only affects imported topics, though - I’ll post back here soon if I find any more anomalies.