Why are there so many Disallow rules in robots.txt?


(Gulshan Kumar) #1

Does it prevent indexing in the SERPs?

  • Never.

Why does this happen?

What is the best that can be done?
Please allow editing of the robots.txt file; we will do our best. :wink:

Thanks


(Michael Howell) #2
Disallow: /auth/cas
Disallow: /auth/facebook/callback
Disallow: /auth/twitter/callback
Disallow: /auth/google/callback
Disallow: /auth/yahoo/callback
Disallow: /auth/github/callback
Disallow: /auth/cas/callback

These pages don’t work unless you are logged in, so it makes no sense for spiders to crawl them. They’ll just get an error if they try.

Disallow: /assets/browser-update*.js

I think this is working around a bad habit Firefox has of overeagerly re-downloading Web Workers.

Disallow: /users/
Disallow: /u/

Because they’re infrequently read by humans, black hat SEO people like to spam forum user profiles. Discourse blocks user profiles in order to disincentivise it.

Also, user profiles contain excerpts of the posts, but we don’t want the search engines linking there. They should be linking to the actual posts.

Disallow: /badges/
Disallow: /search
Disallow: /search/
Disallow: /tags
Disallow: /tags/

These pages don’t have a noscript version, so crawlers just get empty responses. Try visiting https://meta.discourse.org/badges/?_escaped_fragment_=1

Disallow: /email/

I dunno. Does it even exist any more?

Disallow: /session
Disallow: /session/
Disallow: /admin
Disallow: /admin/
Disallow: /user-api-key
Disallow: /user-api-key/
Disallow: /*?api_key*
Disallow: /*?*api_key*

Again, these don’t work without being logged in anyway, and spiders can’t log in.
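To make the two wildcard rules above concrete, here is a minimal Python sketch of how a crawler that supports `*` (any run of characters) and a trailing `$` (end of URL) might match them. The helper names `robots_pattern_to_regex` and `is_blocked` are made up for illustration, and the sketch ignores Allow rules and longest-match precedence, so treat it as an approximation rather than any search engine's actual matcher.

```python
import re
from typing import Iterable, Pattern


def robots_pattern_to_regex(pattern: str) -> Pattern[str]:
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    Everything else is matched literally from the start of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for every '*'.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(path_and_query: str, disallow_patterns: Iterable[str]) -> bool:
    """Return True if any Disallow pattern matches the path plus query string."""
    return any(
        robots_pattern_to_regex(p).match(path_and_query) is not None
        for p in disallow_patterns
    )


# The two api_key rules quoted in the post above.
API_KEY_RULES = ["/*?api_key*", "/*?*api_key*"]
```

Under this reading, the first rule catches URLs where `api_key` is the first query parameter, and the second catches it anywhere after the `?`; a topic URL without a query string matches neither.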

Disallow: /groups
Disallow: /groups/

Again, you need JavaScript to view these pages.

Disallow: /uploads/

Don’t want spiders shoveling down huge images, or ending up hotlinked because somebody found it in an image search. This rule stops working if you turn on a CDN, by the way.
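For anyone who wants to check how the plain prefix rules behave, the sketch below feeds a hand-picked subset of them to Python's standard `urllib.robotparser`. The host `forum.example.com`, the agent name `ExampleBot`, and the helper names are placeholders, and the wildcard rules are left out because that module does not interpret `*`.

```python
from urllib.robotparser import RobotFileParser

# A subset of the prefix rules discussed above (illustrative, not the full file).
ROBOTS_TXT = """\
User-agent: *
Disallow: /u/
Disallow: /users/
Disallow: /admin
Disallow: /search
Disallow: /uploads/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def crawlable(parser: RobotFileParser, url: str, agent: str = "ExampleBot") -> bool:
    """Return True if the given agent may fetch the URL under the parsed rules."""
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    for url in (
        "https://forum.example.com/u/alice",
        "https://forum.example.com/t/welcome/1",
        "https://forum.example.com/uploads/default/big.png",
    ):
        print(url, crawlable(parser, url))
```

A robots.txt file only governs the host it is served from, which is why the /uploads/ rule stops applying once the files are served from a separate CDN hostname.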


(Jeff Atwood) #3

That’s not the only reason; the user profiles are all duplicate data, in that your posts and topics are visible on… the topics themselves, which are the focus.


(Gulshan Kumar) #4

Maybe we could replace the rule this way for some paths:

  • First, allow bots to see the link

Allow: /u/

  • Then, set X-Robots-Tag "noindex, follow"

Why? This will actually prevent indexing.

Ref: Robots meta tag and X-Robots-Tag HTTP header specifications (Google Developers)
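As a rough illustration of the proposal above, here is a small Python WSGI middleware that adds `X-Robots-Tag: noindex, follow` to responses under chosen path prefixes. This is not Discourse code (Discourse is a Rails application); `NoindexMiddleware`, `NOINDEX_PREFIXES`, `demo_app`, and `headers_for` are hypothetical names used only to show the idea.

```python
from wsgiref.util import setup_testing_defaults

# Hypothetical choice of paths that should carry the noindex header.
NOINDEX_PREFIXES = ("/u/", "/users/")


class NoindexMiddleware:
    """Wrap a WSGI app and add 'X-Robots-Tag: noindex, follow' on chosen paths."""

    def __init__(self, app, prefixes=NOINDEX_PREFIXES):
        self.app = app
        self.prefixes = tuple(prefixes)

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "")

        def patched_start_response(status, headers, exc_info=None):
            # Append the header only for matching paths; other responses pass through.
            if path.startswith(self.prefixes):
                headers = list(headers) + [("X-Robots-Tag", "noindex, follow")]
            return start_response(status, headers, exc_info)

        return self.app(environ, patched_start_response)


def demo_app(environ, start_response):
    """Stand-in application that returns the same plain-text body for every path."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"page body\n"]


def headers_for(path):
    """Call the wrapped demo app once and return its response headers as a dict."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured.update(headers)

    body = NoindexMiddleware(demo_app)(environ, start_response)
    list(body)  # consume the response iterable
    return captured
```

For the header to have any effect, the matching `Disallow: /u/` rule would also have to be removed from robots.txt, since a crawler that is not allowed to fetch the page never sees its headers.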


(Michael Howell) #5

Indexing the page isn’t a problem. If Bing thinks I’m looking for your user profile, and offers up a link to it, that’s perfectly fine. The reason they’re blocked in robots.txt is that the page contains text and links that are likely to be misleading.


(Jeff Atwood) #6

“duplicates” would be a much more accurate statement here.


(Gulshan Kumar) #7

What if we …

Step 1. Allow crawling of the path /u/
Step 2. Set a noindex, follow header for the path /u/
Step 3. Restrict profile access to logged-in users

If something is restricted to logged-in users, bots cannot access it, so the duplicate-text concern is already handled. Why still disallow the path /u/?


(Michael Howell) #8

Why do you think it would be better to use a noindex header instead of robots.txt? The goal of blocking search engines is not to prevent them from returning user profiles as search results, it’s to prevent them from reading the contents of the user profiles (because the contents of the user profiles are duplicates of other pages and/or spam).


(Gulshan Kumar) #9

Because that is the right way to prevent indexing.


(Gulshan Kumar) #10

Because Google doesn’t recommend using a Disallow rule to handle duplicate content.

https://support.google.com/webmasters/answer/66359?hl=en

But they mention noindex two or three times.


(Jeff Atwood) #11

Why are you posting screenshots of text? Guess what is also very bad for searchability? Pictures of text…

So your point seems to be this – which I had to TYPE IN FROM YOUR SCREENSHOT instead of being linked like a regular link:

https://support.google.com/webmasters/answer/93710

You can prevent a page from appearing in Google Search by including a noindex meta tag in the page’s HTML code, or by returning a ‘noindex’ header in the HTTP request. When Googlebot next crawls that page and sees the tag or header, Googlebot will drop that page entirely from Google Search results, regardless of whether other sites link to it.

:warning: Important! For the noindex directive to be effective, the page must not be blocked by a robots.txt file. If the page is blocked by a robots.txt file, the crawler will never see the noindex directive, and the page can still appear in search results, for example if other pages link to it.

Which means user pages are still kinda present in Google’s indexes though they would never appear as hits for any actual search terms.
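The interaction between the two mechanisms can be sketched as a toy Python model: a crawler that obeys robots.txt never fetches a blocked page, so it never sees a noindex header there. The function name `crawler_outcome`, the agent name `ExampleBot`, and the outcome labels are invented for this sketch; real search engines are considerably more nuanced.

```python
from urllib.robotparser import RobotFileParser


def crawler_outcome(robots_txt, path, response_headers, has_inbound_links):
    """Rough model of what a well-behaved crawler does with one URL.

    response_headers is a plain dict; the exact key 'X-Robots-Tag' is assumed.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    if not parser.can_fetch("ExampleBot", path):
        # Blocked by robots.txt: the page is never fetched, so any noindex
        # header goes unseen. The bare URL can still be listed if linked.
        if has_inbound_links:
            return "url-only listing possible"
        return "unknown to crawler"

    directives = [d.strip() for d in response_headers.get("X-Robots-Tag", "").lower().split(",")]
    if "noindex" in directives:
        return "dropped from results"
    return "indexed with content"


BLOCKED = "User-agent: *\nDisallow: /u/\n"
OPEN = "User-agent: *\nDisallow: /admin\n"

if __name__ == "__main__":
    noindex = {"X-Robots-Tag": "noindex, follow"}
    print(crawler_outcome(BLOCKED, "/u/alice", noindex, True))
    print(crawler_outcome(OPEN, "/u/alice", noindex, True))
    print(crawler_outcome(OPEN, "/u/alice", {}, True))
```

In this model the current setup corresponds to the first case: the profile URL may be known to the index, but its content is never read.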

How is this a problem? Give me a valid search term with actual search keywords that produces a user page.


(Gulshan Kumar) #13

It is the first result when just searching for the homepage URL.

https://www.bing.com/search?q=help.gulshankumar.net


(Jeff Atwood) #14

No repro here on meta


(Gulshan Kumar) #15

I appreciate your case-study.

However, my case is different. If there were a noindex tag, the result might be different.


(Jeff Atwood) #16

I’ll ask again:

I do not consider “type the full domain name into a search box” a valid search.


(Gulshan Kumar) #17

Here is another one; please consider it now.


(Jeff Atwood) #18

In Google – Bing is terrible, and has almost no market share.


(Michael Howell) #19

Also, you just searched for your own login name. Returning the user profile seems like a correct result to me.


(Gulshan Kumar) #20

Unfortunately, Bing powers many other small search engines, so the same results show up there. I learned this from Bing after keeping a site noindexed for a year: other search engines also stopped showing results because I had set noindex for Bing.

Bing may have a small market share, but it has an impact on small search engines :frowning:


(Jeff Atwood) #21

If you feel strongly about it, you can write a plugin to change the behavior. I see no value in spending any effort on this, other than you typing your name into Bing.