Google indexed link not pointing to the correct post


(Quintin Par) #1

I did a search for this topic, which is a comment on CodingHorror’s March 2017 post

https://blog.codinghorror.com/thunderbolting-your-video-card/

The first link Google returns is the blog post, and the second is the forum discussion for its comments.


When I click on the link it takes me to another post that is not the one I searched for. The one I searched for is this.


Add option to set canonical_url to embed_url
(Sam Saffron) #2

Something about the Google indexing here is weird. The canonical on that page is:

<link rel="canonical" href="https://discourse.codinghorror.com/t/thunderbolting-your-video-card/5157?page=2">

so I am not even sure how that is showing up there.


(Jeff Atwood) #3

My Google results from your link are in this order:

  1. https://blog.codinghorror.com/thunderbolting-your-video-card/
  2. https://discourse.codinghorror.com/t/thunderbolting-your-video-card/5157/9
  3. https://discourse.codinghorror.com/t/thunderbolting-your-video-card/5157/21
  4. https://discourse.codinghorror.com/t/thunderbolting-your-video-card/5157

The actual correct result would be post number 12…

https://discourse.codinghorror.com/t/thunderbolting-your-video-card/5157/12

… so 9 and 21 are a bit off but certainly “on the same page” ish.

Still, it is odd to search Google for more than a whole quoted paragraph, verbatim.


(Sam Saffron) #4

For your specific case I wonder if canonical should be the parent blog at least for all the comments that render on parent blog.

I also wonder about adding an option to redirect crawlers to the correct page vs canonical.

For a 100% embedded case you want the search to always hit the parent blog, except in super rare cases.


(Jeff Atwood) #5

The actual search in this case is a bit bizarre, so I am not comfortable basing an entire philosophy on a sample size of one.


(Quintin Par) #6

I saw this bug on another forum and wanted an example to showcase, hence the search by a paragraph.


(Quintin Par) #7

Please don’t change the current functionality to a blogpost canonical. Here’s the reason:

https://meta.discourse.org/t/how-can-i-get-google-to-index-all-responses-and-comments-as-new-url-endpoints-or-pages/?source_topic_id=61443

(Gerhard Schlager) #9

For the record, this is how the search results look for me right now:


https://blog.codinghorror.com/thunderbolting-your-video-card/
https://discourse.codinghorror.com/t/thunderbolting-your-video-card/5157?page=2

The strange thing is that the search term isn’t present on page 2. It’s found in post 12, which belongs to page 1. So, when you land on page 2, which translates to posts 21 and onwards, you don’t find what you were looking for. That’s quite confusing for users.

And I think Google gets confused too because the blog post links to post 31 via the image link.

When you visit https://discourse.codinghorror.com/t/thunderbolting-your-video-card/5157/31 as Googlebot you land on a page that contains post 11 up to post 31. And the canonical wrongly points to https://discourse.codinghorror.com/t/thunderbolting-your-video-card/5157?page=2.

I think the correct fix is to show the search bot the full page that includes the linked post. For post 31 that would be page 2 starting at post 21 and ending at max 40.


(Mittineague) #10

Might this involve the discrepancy between visible and deleted posts count? i.e. deleted posts still retain their post id value.


(Gerhard Schlager) #11

No, it doesn’t have anything to do with deleted posts. I consider this a #bug.

It simply doesn’t use the correct post offset. The crawler view renders posts in pages. Post 1-20, 21-40,… If a crawler requests a certain post number, the app should render the right page. For post 31 it needs to select page 2 and render posts 21-40. Everything else results in a wrong search index.
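To make the expected offset concrete, here is a minimal sketch of the arithmetic, not the actual Discourse code. The page size of 20 is taken from the "1-20, 21-40" ranges above; the function names are hypothetical, and post numbers are treated as contiguous.

```python
from typing import Tuple

POSTS_PER_PAGE = 20  # assumption: matches the "1-20, 21-40" crawler pages described above


def page_for_post(post_number: int, per_page: int = POSTS_PER_PAGE) -> int:
    """Return the 1-based crawler page that contains post_number."""
    if post_number < 1:
        raise ValueError("post numbers start at 1")
    return (post_number - 1) // per_page + 1


def post_range_for_page(page: int, per_page: int = POSTS_PER_PAGE) -> Tuple[int, int]:
    """Return the (first, last) post numbers that a page should render."""
    first = (page - 1) * per_page + 1
    return first, first + per_page - 1


def canonical_url(topic_url: str, post_number: int, per_page: int = POSTS_PER_PAGE) -> str:
    """Return the canonical URL of the page holding post_number (page 1 has no query string)."""
    page = page_for_post(post_number, per_page)
    return topic_url if page == 1 else f"{topic_url}?page={page}"
```

With these definitions, post 31 maps to page 2 and the range 21-40, while post 12 maps to page 1, so a canonical of `?page=2` is only consistent when the rendered posts are 21-40.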


(Sam Saffron) #12

This is the fundamental issue… we have no “correct” canonical page if we are displaying content from 2 different pages on the screen. The only way to correct this is to make pages for “crawling” purposes work differently, and that enters other worlds of pain.

For my blog what I do is just keep the whole chunk of comments with the blog post, eg:

https://www.google.com.au/search?q=“One+commonly+overlooked+impedance+to+development+flow+is+typos”

But the issue described here is far more fundamental: we give web crawlers a bunch of content splayed across 2 pages and then we just pick the canonical for one of the posts in the set.

One way I can think of to resolve this: tell Google not to index “post” links (e.g. https://meta.discourse.org/t/google-indexed-link-not-pointing-to-the-correct-post/61443/9 is a post link) using meta tags, which may force its hand to crawl the canonical and index that instead. It may work. I don’t know. Very tricky problem.
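As a rough sketch of that idea only (not how Discourse does it), the check could look like this. The URL shape is inferred from the links in this thread, the helper names are hypothetical, and whether `noindex, follow` actually steers Google to the canonical is untested.

```python
import re
from urllib.parse import urlparse

# Assumed URL shapes, taken from the links quoted in this thread:
#   /t/<slug>/<topic_id>                -> topic link
#   /t/<slug>/<topic_id>/<post_number>  -> post link
POST_LINK = re.compile(r"^/t/[^/]+/\d+/\d+/?$")


def is_post_link(url: str) -> bool:
    """True when the URL path ends in a post number after the topic id."""
    return POST_LINK.match(urlparse(url).path) is not None


def robots_meta(url: str) -> str:
    """Return a robots meta tag for post links, or an empty string otherwise."""
    if is_post_link(url):
        # Standard robots directive: do not index this URL, but still follow its links.
        return '<meta name="robots" content="noindex, follow">'
    return ""
```

Topic links and `?page=N` links are left indexable here, so only the per-post URLs are affected.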

Interestingly there is a far more severe issue I am noticing when I search

google indexing site:meta.discourse.org

I find these 2 broken links, and we need to figure out how they even got indexed:

This on the second page:

https://meta.discourse.org/t/google-complaining-indexed-though-blocked-by-robots-txt/96408?page=2

This on the third:

https://meta.discourse.org/t/canonical-tag-generated-with-page-2/32842?page=4

It does not really make sense how these sneaked in. My first port of call here would be to check the site map plugin to confirm it does not include these bad links, AND then to confirm there is no logic where we are presenting Google with content on these pages instead of an error page.
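A first pass at the site map check could be sketched as below. This is an assumption-heavy illustration: the input is a plain list of URLs already extracted from the site map, `highest_post_numbers` is a hypothetical mapping from topic id to the highest post number in that topic, and the page size of 20 is assumed.

```python
import math
import re
from urllib.parse import parse_qs, urlparse

TOPIC_PATH = re.compile(r"^/t/[^/]+/(\d+)")  # assumed shape: /t/<slug>/<topic_id>


def out_of_range_pages(urls, highest_post_numbers, per_page=20):
    """Return the URLs whose ?page=N points past the last page of the topic."""
    bad = []
    for url in urls:
        parsed = urlparse(url)
        match = TOPIC_PATH.match(parsed.path)
        page_values = parse_qs(parsed.query).get("page")
        if not match or not page_values:
            continue  # not a paged topic link
        topic_id = int(match.group(1))
        if topic_id not in highest_post_numbers:
            continue  # unknown topic, nothing to compare against
        try:
            page = int(page_values[0])
        except ValueError:
            bad.append(url)  # a non-numeric page value is never valid
            continue
        last_page = max(1, math.ceil(highest_post_numbers[topic_id] / per_page))
        if page < 1 or page > last_page:
            bad.append(url)
    return bad
```

Any URL this returns should either be absent from the site map or respond with an error page rather than content.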