For the past few weeks, Google has been complaining about URL errors.
I figured out that the page at, e.g., some.forum.com/topic-title/topic-number?page=4 contains 20 posts, from the 60th to the 80th.
So, assuming this topic has 61 posts, Google “detects” a some.forum.com/topic-title/topic-number?page=5 URL, which returns a 404 page.
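To make the arithmetic concrete, here is a minimal Python sketch of which page numbers should exist for a topic. It assumes 20 posts per page, with page 1 holding posts 1 to 20, page 2 holding posts 21 to 40, and so on. The names `POSTS_PER_PAGE`, `last_page` and `page_exists` are made up for illustration and are not Discourse code.

```python
import math

# Assumption: the crawler view shows 20 posts per page, as described above.
POSTS_PER_PAGE = 20


def last_page(post_count: int, per_page: int = POSTS_PER_PAGE) -> int:
    """Return the number of the last page that still contains posts."""
    return max(1, math.ceil(post_count / per_page))


def page_exists(page: int, post_count: int, per_page: int = POSTS_PER_PAGE) -> bool:
    """Return True if ?page=<page> would contain at least one post."""
    return 1 <= page <= last_page(post_count, per_page)
```

Under these assumptions a 61-post topic ends on page 4, so a request for ?page=5 falls past the end and a 404 is the expected answer.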
What I have not been able to figure out is when and why this happens: it does not affect all topics, only quite a few of them.
So if I try https://try.discourse.org/t/is-the-second-amendment-still-relevant-today/291?page=2, it redirects to the 20th post in the topic.
If I try https://try.discourse.org/t/is-the-second-amendment-still-relevant-today/291?page=3, I get a 404.
Both of the above are surely the expected outcomes.
The actual problem is that Google somehow detects that extra page (in the above example it would detect https://try.discourse.org/t/is-the-second-amendment-still-relevant-today/291?page=3) and then complains that the site has URL errors.
I’m not sure. I viewed topics, but didn’t see the next page link on the last page like I do on the Category views. I see a previous page on the last page, which would be expected.
Maybe, but the issue I repro’d doesn’t produce a 404. It simply shows a blank page. It could be that Google is simply trying to verify that it reached the last page of a topic when indexing, and so requests a page that doesn’t exist… Not sure why they’d do that though.
Yeah, so more info on this, I’d guess that Googlebot is guessing the next possible URL based on the canonical URL. (this is all hypothetical)
When I visit Is the Second Amendment still relevant today? - Demo, as Googlebot, I see a canonical URL of <link rel="canonical" href="http://try.discourse.org/t/is-the-second-amendment-still-relevant-today/291" />
When I visit Is the Second Amendment still relevant today? - Demo, as Googlebot, I see a canonical URL of <link rel="canonical" href="http://try.discourse.org/t/is-the-second-amendment-still-relevant-today/291?page=2" />
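For anyone who wants to repeat this check without reading the page source by hand, here is a rough Python sketch that pulls the canonical URL out of an HTML document using only the standard library. `CanonicalFinder` and `find_canonical` are names I made up; fetching the page with a Googlebot user agent is left out, so the function only parses HTML you already have.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> tag seen."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <link ... /> also end up here.
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attr.get("href")


def find_canonical(html: str) -> Optional[str]:
    """Return the canonical URL declared in the HTML, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical
```

Running it against the first page and against ?page=2 should show whether each page declares its own canonical URL, as in the two tags quoted above.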
That is why I’m leaning towards deleted posts. Page 3 may have existed two days ago but no longer does, so when Googlebot returns it tries to refresh its copy of page 3, which is no longer there. Again, pure theory.
As for how to tell whether a topic has deleted posts: open the topic and navigate to the very last post. Take note of the post count in the timeline (for this topic, it should be 14), then click on the timestamp of the last post and see which post number it links to (for this topic, it should be 15). As you can see, they differ on this topic, which means a post was deleted at some point (in this topic’s case, it was Post #2).
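The same comparison can be written as a tiny Python sketch. It assumes post numbers are handed out sequentially and never reused, which is what the manual check above relies on. The function names are hypothetical, and the two inputs are the numbers you read off the timeline and the last post’s link.

```python
def deleted_post_count(post_count: int, last_post_number: int) -> int:
    """Estimate how many posts are missing from the topic.

    post_count is the count shown in the timeline; last_post_number is the
    number the last post's timestamp links to. Assumes sequential numbering.
    """
    return max(0, last_post_number - post_count)


def has_deleted_posts(post_count: int, last_post_number: int) -> bool:
    """Return True when the last post number exceeds the visible post count."""
    return deleted_post_count(post_count, last_post_number) > 0
```

With the numbers from this topic, `has_deleted_posts(14, 15)` reports one missing post.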
We’re using <a rel="next"> and <a rel="prev"> in the body, but the guidelines say to use <link rel="next"> and <link rel="prev"> in the head.
If we get it wrong (which we are), they say:
If Google finds mistakes in your implementation…, we’ll continue to index the page(s), and rely on our own heuristics to understand your content.
So I think @cpradio is right that Googlebot is trying to guess the URL for the next page. It’s getting a 404 page, which it shouldn’t be indexing… but if we put link elements in the head then Googlebot might not need to guess anymore.
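As a rough illustration of what the head could contain, here is a Python sketch that builds the link elements for a given page. It is not how Discourse renders its templates. It assumes page 1 lives at the bare topic URL and later pages at ?page=N, and `pagination_links` is a made-up helper name.

```python
from html import escape
from typing import List


def pagination_links(base_url: str, page: int, last_page: int) -> List[str]:
    """Build <link rel="prev"> / <link rel="next"> tags for the <head>.

    Assumption: page 1 is the bare topic URL, later pages use ?page=N.
    """

    def page_url(n: int) -> str:
        return base_url if n == 1 else f"{base_url}?page={n}"

    links = []
    if page > 1:
        href = escape(page_url(page - 1), quote=True)
        links.append(f'<link rel="prev" href="{href}">')
    # No rel="next" on the last page, so crawlers get no hint of a further page.
    if page < last_page:
        href = escape(page_url(page + 1), quote=True)
        links.append(f'<link rel="next" href="{href}">')
    return links
```

The point of the sketch is the last-page case: only a prev link is emitted there, so nothing in the markup suggests that a page=3 exists for a two-page topic.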