Googlebot 404 errors due to page numbers

pfaffman · September 4, 2018, 3:46pm

I’ve got a site that’s gotten the “Googlebot identified a significant increase in the number of URLs” error.

It looks like the bad URLs (returning 404s) are all of the form https://site/t/slug/id?page=XXX.

Changing XXX to XXX-y, i.e. a smaller page number, makes the URLs work fine. (I thought it was an off-by-one error at first, but sometimes y needs to be significantly more than 1.)

It’s not immediately apparent that it’s due to deleted posts, which was my next guess.

Perhaps if the page number is greater than the number of pages there should be a 301 to page 1 or something? (I don’t pretend to know anything about SEO). Or should I tell them just not to worry about those 404s?

sam · September 4, 2018, 10:21pm

This is only an issue we can address if you can find actual cases on the forum where the meta tags link to nonexistent pages.

If the content is looking good and Google is just having an adventure adjusting to the fact that a book once had 100 pages and now has 10, there is not much we can do. A 301 or 302 here is not ideal. Perhaps we can add it for deletions… it’s a giant edge case though.

stas00 · February 9, 2019, 10:27pm

Found this topic after encountering the same issue.

I fail to see how this is Google’s problem; you can’t expect every page to be reindexed every hour.

Are you seriously saying that when a link includes ?page=2, which would translate to /21, but the topic has shrunk to 19 posts, Discourse can’t handle that and redirect to the last post (/19 in this case)?

In fact, Discourse already has everything it needs to do this right, without putting the blame on external forces. Currently, if the topic has 19 posts and I type /99, it automatically redirects to /19 - so it knows how to handle shrinking topics just fine. So if you let ?page=2 always redirect to /21, without checking whether there are 21+ posts or not, Discourse will do the right thing after 2 redirects:

…?page=2 => Redirect …/21
…/21 => …/19

Of course, I’m assuming the server is configured to handle multiple redirects.

Does this make sense?

P.S. I think the proposed solution reduces the edge case to the general case and actually simplifies things, since you no longer need to check whether ?page=2 “exists” and can let the redirects do the right thing.
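
The two-step redirect proposed above can be sketched roughly as follows. This is only an illustration, not Discourse's actual code: the function names are hypothetical, and the page size of 20 posts is an assumption inferred from the "?page=2 => /21" example in this post.

```python
# Hypothetical sketch of the proposed redirect chain; not Discourse's code.
POSTS_PER_PAGE = 20  # assumption: inferred from "?page=2 => /21"


def first_post_on_page(page: int) -> int:
    """Post number that a ?page=N URL would translate to, with no existence check."""
    page = max(page, 1)  # treat page 0 or negative pages as page 1
    return (page - 1) * POSTS_PER_PAGE + 1


def clamp_post_number(post_number: int, highest_post: int) -> int:
    """Mimic the existing behaviour described above: /99 on a 19-post topic goes to /19."""
    return max(1, min(post_number, highest_post))


def resolve_chain(page: int, highest_post: int) -> list[int]:
    """Return the post numbers visited while following the redirects."""
    target = first_post_on_page(page)
    chain = [target]
    final = clamp_post_number(target, highest_post)
    if final != target:
        chain.append(final)  # second redirect, only needed when the topic shrank
    return chain


print(resolve_chain(2, 19))  # [21, 19]
print(resolve_chain(2, 40))  # [21]
```

With a 19-post topic, ?page=2 goes to /21 and then to /19, matching the two redirects listed above; with 40 posts only the first redirect happens.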

codinghorror · February 9, 2019, 11:07pm

Make sure you are on latest (2.2 final or 2.3 beta) as an important fix went out for web crawler pagination about a month ago.

stas00 · February 9, 2019, 11:37pm

We are already using 2.3.0.beta2, so that did not fix the issue.

codinghorror · February 10, 2019, 1:51am

Can you provide live URLs demonstrating the problem?

stas00 · February 10, 2019, 2:08am

Of course. For example, take the first hit from this search, which links to this (assuming Google shows you the same results; if not, just use the last link of my reply directly).

If you remove page=2 or use page=1, it works.

Thank you for looking into it, @codinghorror!

stas00 · February 10, 2019, 5:48am

Actually, you don’t need live examples from elsewhere - take this thread on your own server, it doesn’t do too well either: Googlebot 404 errors due to page numbers - same problem.

sam · February 10, 2019, 8:48pm

That topic was split, so you destabilized the page number. It is a fundamental limitation of our numbering implementation.

Take a book analogy: in a book, if you rip out page 77, page 200 remains page 200. In Discourse, page 200 becomes page 199.

I guess maybe it does make sense to auto-redirect high page numbers back to the last page.
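
A minimal sketch of what redirecting high page numbers to the last page could look like. It is not the Discourse implementation: the names are hypothetical, the 20-posts-per-page size is assumed from the ?page=2 / post 21 example earlier in the thread, and the choice of a temporary 302 is also an assumption (the topic may grow back, so the missing page is not necessarily gone for good).

```python
# Hypothetical sketch: clamp an out-of-range ?page=N to the last existing page.
POSTS_PER_PAGE = 20  # assumption, see lead-in


def last_page(post_count: int) -> int:
    """Number of pages in a topic; an empty topic still has page 1."""
    return max(1, (post_count + POSTS_PER_PAGE - 1) // POSTS_PER_PAGE)


def resolve_page(requested: int, post_count: int) -> tuple[int, int]:
    """Return (HTTP status, page to serve or redirect to)."""
    last = last_page(post_count)
    clamped = max(1, min(requested, last))
    if clamped != requested:
        return 302, clamped  # assumption: temporary redirect instead of a 404
    return 200, requested


print(resolve_page(2, 19))      # (302, 1)
print(resolve_page(2, 21))      # (200, 2)
print(resolve_page(200, 3980))  # (302, 199)
```

The last call mirrors the book analogy: once enough posts are removed that only 199 pages remain, a link to page 200 lands on page 199 instead of returning a 404.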

stas00 · February 10, 2019, 9:45pm

This very topic wasn’t split and has the same problem. Therefore, how the shrinking came about does not matter.

sam wrote: “Take a book analogy, in a book if you rip out page 77, page 200 remains page 200. In Discourse page 200 becomes page 199”

But Discourse is not a book that is cast in “paper” once printed; it’s a dynamic, “living” system. As such, I am unable to see where the book metaphor fits in this case.

And even if I run with your metaphor, page 200 remains page 200, so if a link to page 200 worked before, it should still work even if pages 1-199 were ripped out. That surely is not the way to fix this problem, though, which is much simpler if you don’t look at it as a book.

sam · February 10, 2019, 10:03pm

Hmmm not seeing any page=2 here at all, when I Google:

"Googlebot 404 errors due to page numbers" site:meta.discourse.org

Hence I said:

“I guess, maybe it does make sense to auto redirect high page numbers back to last page.”

The risk here is that a change like this masks bugs.

stas00 · February 10, 2019, 10:10pm

Apologies for not being clear, @sam. I meant that if this topic were to shrink from 21 to 20 posts it would have the same issue, and thus you have your own setup to test with and no need for external live URLs.

And I demonstrated how this fails by this direct link.

I gave this example, because @codinghorror initially suggested that the server might not be running the latest code base.

sam wrote: “Hence I said ‘I guess, maybe it does make sense to auto redirect high page numbers back to last page.’ The risk here is that a change like this masks bugs.”

+1

rahim123 · October 18, 2023, 6:49am

This is still a problem.

This would seem to be a decent solution. But I don’t understand why this is happening in the first place? Why is Discourse adding page numbers that don’t exist to the sitemap?

In the case of the topics I’m examining with this issue there are no deleted posts or other modifications to the threading.
