Googlebot 404 errors due to page numbers

I’ve got a site that’s gotten the “Googlebot identified a significant increase in the number of URLs” error.

It looks like the bad URLs (the ones returning 404s) are all of the form https://site/t/slug/id?page=XXX.

And changing XXX to XXX−y (i.e., reducing the page number by some amount y) makes the URLs work fine. (I thought it was an off-by-one error at first, but sometimes y needs to be significantly more than 1.)

It’s not immediately apparent that it’s due to deleted posts, which was my next guess.

Perhaps if the page number is greater than the number of pages there should be a 301 to page 1 or something? (I don’t pretend to know anything about SEO). Or should I tell them just not to worry about those 404s?
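Something like this is what I’m picturing; a rough sketch only (Python with made-up names, not Discourse’s actual code), assuming a fixed number of posts per page:

```python
import math
from dataclasses import dataclass

# Rough sketch of the suggestion above -- not Discourse's actual routing code.
# POSTS_PER_PAGE is an assumption; Topic is a made-up stand-in for the real model.
POSTS_PER_PAGE = 20

@dataclass
class Topic:
    slug: str
    id: int
    post_count: int

def respond(topic: Topic, requested_page: int) -> str:
    last_page = max(1, math.ceil(topic.post_count / POSTS_PER_PAGE))
    if requested_page > last_page:
        # Today this case 404s; the suggestion is a 301 back to the topic instead.
        return f"301 -> https://site/t/{topic.slug}/{topic.id}"
    return f"200 https://site/t/{topic.slug}/{topic.id}?page={requested_page}"

print(respond(Topic("slug", 123, 190), 10))  # 190 posts -> 10 pages -> 200
print(respond(Topic("slug", 123, 12), 3))    # only 1 page -> 301, not 404
```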

1 Like

This is only an issue we can address if you can find actual cases on the forum where the meta tags link to non-existent pages.

If the content looks good and Google is just having an adventure adjusting to the fact that a book that once had 100 pages now has 10, there is not much we can do. A 301 or 302 here is not ideal; perhaps we can add it for deletions… it’s a giant edge case, though.

4 Likes

Found this topic after encountering the same issue.

I fail to see how this is Google’s problem; you can’t expect every page to be reindexed every hour.

Are you seriously saying that if a link includes ?page=2, which would translate to /21, but the topic has shrunk to 19 posts, Discourse can’t handle that and redirect to the last post, /19 in this case?

In fact, Discourse already has everything it needs to do this right, without putting the blame on external forces. Currently, if a topic has 19 posts and I type /99, it automatically redirects to /19, so it already knows how to handle shrinking topics just fine. So if you make ?page=2 always redirect to /21, without checking whether there are 21+ posts or not, Discourse will do the right thing after two redirects:

  1. …?page=2 => redirect to …/21
  2. …/21 => redirect to …/19

Of course, I’m assuming the server is configured to handle multiple redirects.
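In code, the two steps might look roughly like this (an illustrative sketch with made-up names, not Discourse’s actual implementation, assuming 20 posts per page so that ?page=N maps to post (N−1)*20+1):

```python
# Illustrative sketch of the two-redirect idea -- names are made up, not Discourse's.
POSTS_PER_PAGE = 20  # assumption, based on ?page=2 translating to /21

def page_to_post_redirect(topic_url: str, page: int) -> str:
    # Step 1: always translate ?page=N to the first post of that page,
    # without checking how many posts the topic currently has.
    first_post_on_page = (page - 1) * POSTS_PER_PAGE + 1
    return f"{topic_url}/{first_post_on_page}"

def clamp_to_last_post(requested_post: int, highest_post: int) -> int:
    # Step 2: the behaviour that already exists -- /99 on a 19-post topic lands on /19.
    return min(requested_post, highest_post)

topic_url = "https://site/t/slug/id"
step1 = page_to_post_redirect(topic_url, 2)      # .../21
step2 = clamp_to_last_post(21, highest_post=19)  # 19
print(step1)
print(f"{topic_url}/{step2}")
```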

Does this make sense?

P.S. I think the proposed solution reduces the edge case to the general case and actually simplifies things, since you no longer need to check whether ?page=2 “exists”; you just let the redirects do the right thing.

1 Like

Make sure you are on the latest version (2.2 final or 2.3 beta), as an important fix for web crawler pagination went out about a month ago.

1 Like

We are already using 2.3.0.beta2, so that did not fix the issue.

1 Like

Can you provide live URLs demonstrating the problem?

Of course. For example, take the first hit from this search, which links to this. (Assuming Google shows you the same results; if not, just use the last link of my reply directly.)

If you remove page=2 or use page=1, it works.

Thank you for looking into it, @codinghorror!

2 Likes

Actually, you don’t need live examples from elsewhere. Take this thread on your own server; it doesn’t do too well either: Googlebot 404 errors due to page numbers - same problem.

That topic was split, so its page numbering was destabilized; it is a fundamental limitation of our numbering implementation.

Take a book analogy: in a book, if you rip out page 77, page 200 remains page 200. In Discourse, page 200 becomes page 199.
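Roughly, the arithmetic behind the analogy looks like this (a sketch, assuming 20 posts per crawler page as in the /21 example above):

```python
import math

POSTS_PER_PAGE = 20  # assumption: crawler pagination, page 2 starting at post 21

def last_page(post_count: int) -> int:
    return max(1, math.ceil(post_count / POSTS_PER_PAGE))

# A topic that once had 410 posts spanned 21 crawler pages...
print(last_page(410))  # 21
# ...but after a split or deletions it may be down to, say, 365 posts and 19 pages,
# so already-indexed URLs like ?page=20 or ?page=21 now return 404.
print(last_page(365))  # 19
```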

I guess, maybe it does make sense to auto-redirect high page numbers back to the last page.

2 Likes

This very topic wasn’t split and has the same problem, so how the shrinking came about doesn’t matter.

Take a book analogy: in a book, if you rip out page 77, page 200 remains page 200. In Discourse, page 200 becomes page 199.

But Discourse is not a book that is cast in “paper” once printed; it’s a dynamic, “living” system. As such, I don’t see where the book metaphor fits in this case.

And even if I run with your metaphor: page 200 remains page 200, so if there was a working link to page 200 before, it should still work even if pages 1-199 were ripped out. Except this is surely not the way to fix the problem, which is much simpler if you don’t look at it as a book.

Hmmm :thinking: not seeing any page=2 here at all, when I Google:

"Googlebot 404 errors due to page numbers" site:meta.discourse.org

Hence I said

I guess, maybe it does make sense to auto-redirect high page numbers back to the last page.

The risk here is that a change like this masks bugs.

1 Like

Apologies for not being clear, @sam. I meant that if this topic were to shrink from 21 to 20 posts, it would have the same issue, so you have your own setup to test with and don’t need external live URLs.

And I demonstrated how this fails with this direct link.

I gave this example, because @codinghorror initially suggested that the server might not be running the latest code base.

Hence I said

I guess, maybe it does make sense to auto-redirect high page numbers back to the last page.

The risk here is that a change like this masks bugs.

+1

This is still a problem.

This would seem to be a decent solution. But I don’t understand why this is happening in the first place. Why is Discourse adding page numbers that don’t exist to the sitemap?

In the case of the topics I’m examining with this issue, there are no deleted posts or other modifications to the threading.