Canonical tag on topic URL


#1

Hi,

For the past few weeks, Google has been complaining about URL errors.

I figured out that the content of e.g. some.forum.com/topic-title/topic-number?page=4 contains 20 posts, from the 60th to the 80th.
So assuming that this topic has 61 posts, Google “detects” a some.forum.com/topic-title/topic-number?page=5 URL, which is a 404 page.
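
To make the arithmetic concrete, here is a minimal sketch of which `?page=N` values should exist for a topic. It assumes 20 posts per page (the page size observed above); the function names are hypothetical and not taken from Discourse.

```python
import math

# Assumption: 20 posts per page, as observed on the paginated topic view.
POSTS_PER_PAGE = 20

def last_page(post_count: int, per_page: int = POSTS_PER_PAGE) -> int:
    """Return the number of the last page that should exist for a topic."""
    return max(1, math.ceil(post_count / per_page))

def page_exists(page: int, post_count: int, per_page: int = POSTS_PER_PAGE) -> bool:
    """True if ?page=<page> should return content rather than a 404."""
    return 1 <= page <= last_page(post_count, per_page)
```

Under these assumptions a topic with 61 posts ends at page 4, so a request for `?page=5` is expected to 404.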

What I was not able to figure out is when and why this is happening, because it does not happen for all topics but for quite a few of them.

Is this a known issue?

Thanks,
CosM


(Régis Hanol) #3

Can you reproduce this issue here or on try?


#4

Thanks for your reply!

So if I try https://try.discourse.org/t/is-the-second-amendment-still-relevant-today/291?page=2, it redirects to the 20th post in the topic.
If I try https://try.discourse.org/t/is-the-second-amendment-still-relevant-today/291?page=3, I get a 404.
Both of the above are surely the expected outcomes.

The actual problem is that Google somehow detects that extra page (in the above example it would detect https://try.discourse.org/t/is-the-second-amendment-still-relevant-today/291?page=3) and then complains that the webpage has URL errors.

Thanks,
CosM


(Régis Hanol) #5

I understand the issue, but unless you can provide us with a reproduction, there’s not much we can do I’m afraid :wink:


#6

Hi,

The above example is a real-time reproduction; this is exactly what is happening.

Thanks,
CosM


(Régis Hanol) #7

I’m not seeing any links that point to https://try.discourse.org/t/is-the-second-amendment-still-relevant-today/291?page=3.


EDIT: could it be related to this?


(cpradio) #8

I’m not sure. I viewed topics, but didn’t see the next page link on the last page like I do on the Category views. I see a previous page on the last page, which would be expected.


(Régis Hanol) #9

I was thinking maybe @CosM was describing your issue instead. (Since I can’t reproduce it)


(cpradio) #10

Maybe, but the issue I repro’d doesn’t produce a 404. It simply shows a blank page. It could be that Google is simply trying to verify that it reached the last page of a topic when indexing, and is thus requesting a page that doesn’t exist… Not sure why they’d do that though.


#11

No, this is not relevant.

@cpradio is probably correct.

Thanks,
CosM


(cpradio) #12

Yeah, so more info on this, I’d guess that Googlebot is guessing the next possible URL based on the canonical URL. (this is all hypothetical)

When I visit "Is the Second Amendment still relevant today? - Demo" as Googlebot, I see a canonical URL of <link rel="canonical" href="http://try.discourse.org/t/is-the-second-amendment-still-relevant-today/291" />

When I visit page 2 of "Is the Second Amendment still relevant today? - Demo" as Googlebot, I see a canonical URL of <link rel="canonical" href="http://try.discourse.org/t/is-the-second-amendment-still-relevant-today/291?page=2" />

If I try to visit https://try.discourse.org/t/is-the-second-amendment-still-relevant-today/291?page=3, I get a 404.

So if Googlebot does try to guess the next page to make sure it indexes them all, the 404 is a good indication to it that it did.
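
For anyone who wants to repeat this check, here is a rough sketch using only the Python standard library. The helper names are hypothetical, and whether a given site serves crawler-specific HTML for this User-Agent string is an assumption to verify against your own forum.

```python
import urllib.request
from html.parser import HTMLParser

# A User-Agent string that Googlebot is known to send; using it here is only
# an approximation of "visiting as Googlebot".
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

class CanonicalFinder(HTMLParser):
    """Collects the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        # Self-closing <link ... /> tags are also routed here by HTMLParser.
        if tag == "link" and self.canonical is None:
            attr = dict(attrs)
            rels = (attr.get("rel") or "").lower().split()
            if "canonical" in rels:
                self.canonical = attr.get("href")

def find_canonical(html: str):
    """Return the canonical URL declared in the HTML, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical

def fetch_as_googlebot(url: str) -> str:
    """Fetch a page with the Googlebot User-Agent header (hypothetical helper)."""
    request = urllib.request.Request(url, headers={"User-Agent": GOOGLEBOT_UA})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `find_canonical(fetch_as_googlebot(url))` for each page of a topic shows which URLs the site itself declares as canonical.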


(cpradio) #13

@CosM, I have one other theory. By chance does the topic in question have any deleted posts, and if so, how many?

Could there at one point have actually been a ?page=3, and then posts were deleted, thus making it only 2 pages now?


#14

Thanks for that @cpradio.
What you are saying here makes sense but:

  1. I am wondering why it targets only some topics and not every single topic that has e.g. 22 or 36 or 86 posts?

  2. What has changed that made it start complaining in the past few weeks?

Edit: I have thought about your theory too, but I believe it is very unlikely that this happened. Is there any way I can find out if a post has been deleted?

Thanks,
CosM


(cpradio) #15

That is why I’m leaning towards deleted posts. Page 3 may have existed two days ago but no longer does, and when Googlebot returns it tries to refresh its copy of page 3, which is no longer there. Again, pure theory.

As to how to tell if a topic has deleted posts: open the topic and navigate to the very last post. Take note of the post count in the timeline (for this topic, it should be 14), then click on the timestamp of the last post and check which post number it links to (for this topic, it should be 15). As you can see, they differ on this topic, which means a post was deleted at some point (in this topic’s case, it was post #2).
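
The comparison above boils down to one subtraction. A tiny sketch, where both inputs are numbers you read off the topic by hand and the function name is hypothetical:

```python
def missing_post_count(visible_posts: int, last_post_number: int) -> int:
    """Number of post numbers no longer visible in the topic.

    visible_posts:    the post count shown in the timeline
    last_post_number: the post number the last post's timestamp links to
    """
    if last_post_number < visible_posts:
        raise ValueError("last post number cannot be lower than the visible count")
    return last_post_number - visible_posts
```

For this topic the values are 14 and 15, so the result is 1; a result of 0 means the two numbers agree and no gap is visible this way.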


#16

No, they are exactly the same.


(cpradio) #17

Bummer. I’m out of ideas unfortunately. I haven’t a clue why Googlebot would be looking for a page that doesn’t exist on a small subset of topics. :frowning:


#18

@cpradio thanks anyway!

@zogstrip any more thoughts?

Thanks,
CosM


(Neil Lalonde) #19

I looked up Google’s guidelines about paginated content, and I think we’re not following them well enough…

  • We’re using <a rel="next"> and <a rel="prev"> in the body, but the guidelines say to use <link rel="next"> and <link rel="prev"> in the head.

If we get it wrong (which we are), they say:

If Google finds mistakes in your implementation…, we’ll continue to index the page(s), and rely on our own heuristics to understand your content.

So I think @cpradio is right that Googlebot is trying to guess the URL for the next page. It’s getting a 404 page, which it shouldn’t be indexing… but if we put link elements in the head then Googlebot might not need to guess anymore.
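
As an illustration only (this is not the actual Discourse change), here is a sketch of how the head elements for a given page could be generated. It assumes page 1 lives at the bare topic URL and later pages at `?page=N`, matching the canonical URLs seen earlier in this topic; `pagination_links` is a hypothetical helper.

```python
from html import escape

def pagination_links(base_url: str, page: int, last_page: int) -> list[str]:
    """Build <link rel="prev"/"next"> tags for the <head> of one topic page."""
    tags = []
    if page > 1:
        # Assumption: page 1 is the bare topic URL, without a ?page= parameter.
        prev_url = base_url if page == 2 else f"{base_url}?page={page - 1}"
        tags.append(f'<link rel="prev" href="{escape(prev_url, quote=True)}">')
    if page < last_page:
        # No rel="next" on the last page, so crawlers are not pointed at a 404.
        next_url = f"{base_url}?page={page + 1}"
        tags.append(f'<link rel="next" href="{escape(next_url, quote=True)}">')
    return tags
```

On the last page only the `prev` tag is emitted, which tells a crawler explicitly that there is nothing further to fetch.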


#20

Thanks @neil!

Is this a fix that you will consider for a future release?
If so, it would be great if you could provide a link so that I can track its progress.

Thanks,
CosM


(Neil Lalonde) #21

Yeah it should be an easy fix. I’ll post here when it’s done.