Why are there lots of Disallow rules in robots.txt?

Does it prevent indexing in the SERPs?

  • Never.

Why does this happen?

What is the best that can be done?
Please allow editing of the robots.txt file; we will do our best. :wink:

Thanks

Disallow: /auth/cas
Disallow: /auth/facebook/callback
Disallow: /auth/twitter/callback
Disallow: /auth/google/callback
Disallow: /auth/yahoo/callback
Disallow: /auth/github/callback
Disallow: /auth/cas/callback

These pages don’t work unless you are logged in, so it makes no sense for spiders to crawl them. They’ll just get an error if they try.

Disallow: /assets/browser-update*.js

I think this is working around a bad habit Firefox has of overeagerly re-downloading Web Workers.

Disallow: /users/
Disallow: /u/

Because they’re infrequently read by humans, black hat SEO people like to spam forum user profiles. Discourse blocks user profiles in order to disincentivise it.

Also, user profiles contain excerpts of the posts, but we don’t want the search engines linking there. They should be linking to the actual posts.

Disallow: /badges/
Disallow: /search
Disallow: /search/
Disallow: /tags
Disallow: /tags/

These pages don’t have a noscript version, so they just get empty responses. Try visiting https://meta.discourse.org/badges/?_escaped_fragment_=1

Disallow: /email/

I dunno. Does it even exist any more?

Disallow: /session
Disallow: /session/
Disallow: /admin
Disallow: /admin/
Disallow: /user-api-key
Disallow: /user-api-key/
Disallow: /*?api_key*
Disallow: /*?*api_key*

Again, these don’t work without being logged in anyway, and spiders can’t log in.
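
As a side note, the * in those last two rules matches any run of characters, so (with made-up example URLs) the first rule catches api_key as the first query parameter and the second catches it anywhere in the query string:

Disallow: /*?api_key*
# blocks e.g. /t/example-topic?api_key=abc123

Disallow: /*?*api_key*
# blocks e.g. /t/example-topic?page=2&api_key=abc123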

Disallow: /groups
Disallow: /groups/

Again, these pages need JavaScript to view.

Disallow: /uploads/

Don’t want spiders shoveling down huge images, or the images ending up hotlinked because somebody found them in an image search. This rule stops working if you turn on a CDN, by the way, since the files are then served from the CDN’s domain, which is governed by the CDN’s own robots.txt.

11 Likes

That’s not the only reason; the user profiles are all duplicate data, in that your posts and topics are visible on… the topics themselves, which are the focus.

2 Likes

Maybe we could replace this approach for some paths:

  • First, allow bots to see the link

Allow: /u/

  • Then, set X-Robots-Tag "noindex, follow"

Why? This will actually prevent indexing. (A sketch of the header setup follows the reference below.)

Ref: Robots meta tag, data-nosnippet, and X-Robots-Tag specifications
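
For illustration, here is a minimal sketch of how that header could be set, assuming an nginx front end and a hypothetical upstream name (the actual Discourse setup may differ):

# hypothetical nginx snippet: let crawlers fetch /u/ pages, but ask them not to index
location /u/ {
    add_header X-Robots-Tag "noindex, follow" always;
    proxy_pass http://discourse_upstream;   # assumed upstream name
}

robots.txt would then have to stop disallowing /u/, otherwise crawlers would never fetch the page and would never see the header.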

Indexing the page isn’t a problem. If Bing thinks I’m looking for your user profile, and offers up a link to it, that’s perfectly fine. The reason they’re blocked in robots.txt is that the page contains text and links that are likely to be misleading.

2 Likes

“duplicates” would be a much more accurate statement here.

2 Likes

What if we …

Step 1. Allow crawling path /u/
Step 2. Set a "noindex, follow" header for path /u/ (see the quick check below)
Step 3. Restrict profile access to logged-in users
[screenshot of Google’s documentation on the noindex directive; retyped in a later reply]
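
A quick way to check what an anonymous bot would actually receive under that combination is to request a profile URL (hypothetical domain) and look at the status line and headers:

curl -I https://forum.example.com/u/someuser
# expected if steps 1-2 are in place and the profile stays public:
#   HTTP/2 200
#   X-Robots-Tag: noindex, follow
# with step 3 (login required), an anonymous bot would instead get a redirect or an error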

If something is restricted to logged-in users, that means a bot cannot access it. Hence, the duplicate-text concern is already handled. So why still disallow the /u/ path?

Why do you think it would be better to use a noindex header instead of robots.txt? The goal of blocking search engines is not to prevent them from returning user profiles as search results, it’s to prevent them from reading the contents of the user profiles (because the contents of the user profiles are duplicates of other pages and/or spam).

2 Likes

Because that is the right way to prevent indexing.

Because Google doesn’t recommend using a Disallow rule to handle duplicate content.

https://support.google.com/webmasters/answer/66359?hl=en

But they mention noindex two or three times.

Why are you posting screenshots of text? Guess what is also very bad for searchability? Pictures of text…

So your point seems to be this – which I had to TYPE IN FROM YOUR SCREENSHOT instead of being linked like a regular link:

https://support.google.com/webmasters/answer/93710

You can prevent a page from appearing in Google Search by including a noindex meta tag in the page’s HTML code, or by returning a ‘noindex’ header in the HTTP request. When Googlebot next crawls that page and sees the tag or header, Googlebot will drop that page entirely from Google Search results, regardless of whether other sites link to it.

:warning: Important! For the noindex directive to be effective, the page must not be blocked by a robots.txt file. If the page is blocked by a robots.txt file, the crawler will never see the noindex directive, and the page can still appear in search results, for example if other pages link to it.

Which means user pages are still kinda present in Google’s indexes though they would never appear as hits for any actual search terms.
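
For reference, the two forms of noindex the quoted passage talks about look like this; neither is ever seen by a crawler for a path that robots.txt disallows, which is exactly why the bare URL can still surface:

In the page’s HTML:
<meta name="robots" content="noindex">

Or as an HTTP response header:
X-Robots-Tag: noindex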

How is this a problem? Give me a valid search term with actual search keywords that produces a user page.

4 Likes

The first result, just for searching the homepage URL.

https://www.bing.com/search?q=help.gulshankumar.net

No repro here on meta

1 Like

I appreciate your case-study.

However, in my case it’s different. If there were a noindex tag, the result might be different.

I’ll ask again:

I do not consider “type the full domain name into a search box” a valid search.

2 Likes

Here is another one, please consider it now.

In Google, please – Bing is terrible, and has almost no market share.

Also, you just searched for your own login name. Returning the user profile seems like a correct result to me.

3 Likes

Unfortunately, Bing powers many other small search engines, so the same results show up there as well. I learned this from Bing after keeping a page noindexed for a year: the other search engines also stopped showing it because I had noindexed it on Bing.

Maybe Bing has a small market share, but it still affects those smaller search engines :frowning:

If you feel strongly about it, you can write a plugin to change the behavior. I see no value in spending any effort on this, other than you typing your name into Bing.

2 Likes