Weird Google (and page numbering) behaviour


(Johan Jatko) #1

Actually, I am not sure if this is a bug (i.e. something we can do anything about), but when Google crawls Discourse forums it seems to crawl an enormous number of pages for each topic.

For example, check this search:
Google.
It has results for page numbers over 5 million, and I doubt topics are that long.

When you click the URLs you end up at the topic with no posts, and these links sometimes end up as the top result in searches.


Multiple URLs are being indexed for each forum topic
#2

Looks like search engine gold to me! But the preloaded content might be the place to look. All we are seeing is what sits between those commented lines, which looks like what a list view would show, and given the 5,000,000 lines (super long) I bet it is an artifact of the list view.

I wonder if a rel nofollow or a robots.txt disallow on the list view of topics could sort this out? There are a few places that use list views, so the other thing would be to determine where it is coming from (or whether all of them are doing this). For example, when you go to your profile there is a list view there, as well as the main forum list view.

Interesting byproduct. Good catch.

Edit: After my drive to work I got concerned. I don’t know how nofollow or disallow would be a good thing either. We want the search engine to creep into pages like this, correct? So then my thinking is “how on earth do you control this sort of thing”.

So on the one hand you have ‘search engine gold’, on the other hand if Google engineers take a look-see they may scratch their heads and adjust their tools to ‘not’ do this kind of thing.

So then the question becomes: is it our problem, or Google’s? …And what, if anything, should we do about it?


(Robin Ward) #3

I think this could be solved by providing canonical link tags to Google. If it realizes the page it’s crawling is the same as another, it won’t index it as a separate result.
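
To make the canonical idea concrete, here is a minimal sketch. It is not Discourse's actual code: the URL scheme (`?page=N`), the page size of 20, and the helper names `canonical_url` and `canonical_tag` are all assumptions for illustration. Out-of-range page numbers such as 5000000 are clamped to the last real page, so every empty page would point the crawler at a page that exists.

```python
from html import escape
from math import ceil

POSTS_PER_PAGE = 20  # assumed page size, for illustration only


def canonical_url(base_url: str, topic_path: str, page: int, post_count: int) -> str:
    """Return the canonical URL for a topic page (hypothetical URL scheme).

    Page numbers outside the real range are clamped, so an empty page
    such as ?page=5000000 maps to the last page that has posts.
    """
    last_page = max(1, ceil(post_count / POSTS_PER_PAGE))
    page = min(max(page, 1), last_page)
    url = f"{base_url}{topic_path}"
    if page > 1:
        url += f"?page={page}"
    return url


def canonical_tag(url: str) -> str:
    """Render the <link rel="canonical"> element for the page head."""
    return f'<link rel="canonical" href="{escape(url, quote=True)}">'


if __name__ == "__main__":
    url = canonical_url("https://forum.example.com", "/t/some-topic/123", 5000000, 45)
    print(canonical_tag(url))
```

Whether an empty page should canonicalize to the last page or simply return an error is a design choice; the sketch only shows the clamping variant.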


#4

That is a good solution. Now the hard part is determining which component is the culprit. For example, right below us is “Suggested Topics”: is that the culprit? A public-facing profile page? Or the 404 error page? Hopefully it is not that sneaky :smile:


(Jeff Atwood) #5

But where is that magic number coming from? I don’t see any others when searching Google for…

site:meta.discourse.org inurl:5000000 → 3,430
site:meta.discourse.org inurl:76823 → nothing
site:meta.discourse.org inurl:41000 → nothing
site:meta.discourse.org inurl:967 → nothing
site:meta.discourse.org inurl:51 → 5 results
site:meta.discourse.org inurl:23 → 23 results
site:meta.discourse.org inurl:17 → 39 results

Not sure why that “magical” number 5000000 produces results but nothing under it does. I think this should be ignored.


(Johan Jatko) #6

Try 5000000 and decrease by one, 4999999 and 4999998 give results.
I actually found this from a google search as result #2. (Can’t remember the query now :frowning: )

Edit: 4999995-5000000 give results. Nothing around them does.


(Jeff Atwood) #7

Doesn’t really make sense though – where would those numbers be coming from? Not us…


(Mittineague) #8

Very weird.

I thought they might be error code numbers, but the closest I could find was “debian” and 4999995-5000000 seems like a very small range if that is what they are.

Another of those “unsolved mysteries”.

If not to be totally ignored, at least filed away as “in case something weird happens, check into further”.

But agreed, doesn’t seem like it’s worth giving it much time at present.


(Johan Jatko) #9

I don’t know where it gets 5000000 initially, but it surely gets the rest afterwards:

The previous button is visible even if the topic page has no posts.

This shows up in search results for all kinds of Discourse forums: inurl:page=5000000 - Google Search

I will continue to scout for the reason, but as of now maybe remove the previous link on empty pages and show an error instead?


(Jeff Atwood) #10

I agree with this – @neil? Can we slot this for Monday first thing?


(Neil Lalonde) #11

Good call. I’ll fix that on Monday.


(Neil Lalonde) #12

I just pushed a fix so that the previous link isn’t rendered when there are 0 results.

Seems like Google’s crawler is looking for the last page so it can crawl backwards, starting with what it guesses is the most recent content. If it finds a previous page link but no next page, it thinks it found the last page. That’s my guess anyway.
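
The guess above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the page size, the function names, and the "before" behaviour are reconstructed from the description in this thread, not taken from the Discourse source. It shows how a crawler that lands on an empty page and follows "previous" links keeps discovering more empty pages, and how dropping the link on empty pages stops the walk.

```python
from math import ceil
from typing import Callable, Dict, List, Optional

POSTS_PER_PAGE = 20  # assumed page size, not taken from the thread

Links = Dict[str, Optional[int]]


def last_page(post_count: int) -> int:
    """Number of the last page that actually contains posts."""
    return max(1, ceil(post_count / POSTS_PER_PAGE))


def links_before_fix(page: int, post_count: int) -> Links:
    """Assumed old behaviour: 'previous' is offered even on empty pages."""
    return {
        "prev": page - 1 if page > 1 else None,
        "next": page + 1 if page < last_page(post_count) else None,
    }


def links_after_fix(page: int, post_count: int) -> Links:
    """Behaviour after the fix: pages with 0 results render no links."""
    if page < 1 or page > last_page(post_count):
        return {"prev": None, "next": None}
    return links_before_fix(page, post_count)


def crawl_backwards(start: int, post_count: int,
                    links: Callable[[int, int], Links],
                    max_hops: int = 5) -> List[int]:
    """Follow 'previous' links from `start`, at most `max_hops` times."""
    visited = [start]
    for _ in range(max_hops):
        prev = links(visited[-1], post_count)["prev"]
        if prev is None:
            break
        visited.append(prev)
    return visited


if __name__ == "__main__":
    print(crawl_backwards(5000000, 45, links_before_fix))
    print(crawl_backwards(5000000, 45, links_after_fix))
```

With the assumed old behaviour the walk from page 5000000 visits 4999999, 4999998, and so on; with the fix it stops at the starting page. Returning an error status for empty pages, as suggested earlier in the thread, is not modelled here.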


(Jeff Atwood) #13

Great find @ArmedGuy! As the Commando in Command & Conquer once said…

“keep 'em comin!”


(Jeff Atwood) #14