I’ve got a site that’s gotten the “Googlebot identified a significant increase in the number of URLs” error.
It looks like the bad URLs (the ones returning 404s) are all of the form https://site/t/slug/id?page=XXX.
And changing XXX to XXX-y makes the URLs work fine. (I thought it was an off-by-one error at first, but sometimes y needs to be significantly more than 1).
It’s not immediately apparent that it’s due to deleted posts, which was my next guess.
Perhaps if the page number is greater than the number of pages, there should be a 301 to page 1 or something? (I don’t pretend to know anything about SEO.) Or should I tell them just not to worry about those 404s?
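For concreteness, here’s a rough sketch of that idea (this is not actual Discourse code; POSTS_PER_PAGE and the function name are placeholders, and I’m assuming roughly 20 posts per page):

```python
# Rough sketch of "301 when ?page=N points past the last page".
# Not Discourse's real code: names and the 20-posts-per-page chunk
# size are assumptions for illustration.
import math

POSTS_PER_PAGE = 20

def paged_url_response(requested_page, post_count):
    """Return ('301', target) if the requested page is past the end,
    otherwise None (serve the page normally)."""
    last_page = max(1, math.ceil(post_count / POSTS_PER_PAGE))
    if requested_page > last_page:
        return ("301", "?page=%d" % last_page)
    return None

# A topic that shrank to 19 posts, but Google still requests ?page=2:
print(paged_url_response(2, 19))  # ('301', '?page=1')
print(paged_url_response(1, 19))  # None
```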
This is only an issue we can address if you can find actual cases on the forum where the meta tags link to non-existent pages.
If the content is looking good and Google is just having an adventure adjusting to the fact that a book once had 100 pages and now has 10, there is not much we can do. A 301 or 302 here is not ideal; perhaps we can add it for deletions… it’s a giant edge case, though.
Found this topic after encountering the same issue.
I fail to see how this is Google’s problem; you can’t expect every page to be reindexed every hour.
Are you seriously saying that if a link includes ?page=2, which would translate to /21, but the topic has shrunk to 19 posts, Discourse can’t handle that and redirect to the last post, /19 in this case?
In fact, Discourse already has everything it needs to do this right, without putting the blame on external forces. Currently, if the topic has 19 posts and I type /99, it automatically redirects to /19 - so it knows how to handle shrinking topics just fine. So if you make ?page=2 always redirect to /21, without checking whether there are 21+ posts or not, Discourse will do the right thing after two redirects:
…?page=2 => Redirect …/21
…/21 => …/19
Of course, I’m assuming the server is configured to handle multiple redirects.
Does this make sense?
P.S. I think the proposed solution reduces the edge case to the general case and actually simplifies things, since you no longer need to check whether ?page=2 “exists”; you just let the redirects do the right thing.
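A minimal sketch of the two-step idea above, assuming the usual 20-posts-per-page chunking (the function names are invented for illustration; this is not Discourse’s actual routing code):

```python
POSTS_PER_PAGE = 20  # assumed: page 1 = posts 1-20, so page 2 starts at post 21

def page_param_to_post_number(page):
    # Step 1: ?page=N always maps to the first post of that page,
    # with no check that the post still exists.
    return (page - 1) * POSTS_PER_PAGE + 1

def clamp_post_number(post_number, post_count):
    # Step 2: the behaviour described above that already exists,
    # where /99 on a 19-post topic redirects to /19.
    return min(post_number, post_count)

# An old link says ?page=2, but the topic has shrunk to 19 posts:
post_count = 19
step1 = page_param_to_post_number(2)          # 21
step2 = clamp_post_number(step1, post_count)  # 19
print(step1, step2)  # ...?page=2 => /21 => /19
```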
Of course. For example, take the first hit from this search, which links to this (assuming Google shows you the same results; if not, just use the last link of my reply directly).
Actually, you don’t need live examples from elsewhere - take this thread on your own server; it doesn’t do too well either: Googlebot 404 errors due to page numbers - same problem.
This very topic wasn’t split and has the same problem, so how the shrinking came about doesn’t matter.
Take a book analogy: in a book, if you rip out page 77, page 200 remains page 200. In Discourse, page 200 becomes page 199.
But Discourse is not a book that is cast in “paper” once printed; it’s a dynamic, “living” system. As such, I can’t see where the book metaphor fits in this case.
And even if I run with your metaphor: page 200 remains page 200, so if there was a working link to page 200 before, it should still work even if pages 1-199 were ripped out. Except this surely is not the way to fix the problem, which is much simpler if you don’t look at it as a book.
Apologies for not being clear, @sam. I meant that if this topic were to shrink from 21 to 20 posts it would have the same issue, and thus you have your own setup to test with and no need for external live URLs.
This would seem to be a decent solution, but I don’t understand why this is happening in the first place. Why is Discourse adding page numbers that don’t exist to the sitemap?
In the case of the topics I’m examining with this issue, there are no deleted posts or other modifications to the threading.