Why are there lots of Disallow rules in robots.txt?

Gulshan_Kumar · March 30, 2018, 7:23pm

Does it prevent indexing in the SERPs?

Never.

Why does this happen?

What is the best way to handle this?
Please allow editing of the robots.txt file; we will do our best.

Thanks

notriddle · March 30, 2018, 7:34pm

Disallow: /auth/cas
Disallow: /auth/facebook/callback
Disallow: /auth/twitter/callback
Disallow: /auth/google/callback
Disallow: /auth/yahoo/callback
Disallow: /auth/github/callback
Disallow: /auth/cas/callback

These pages don’t work unless you are logged in, so it makes no sense for spiders to crawl them. They’ll just get an error if they try.
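For anyone who wants to check how prefix rules like these behave, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The `ROBOTS_TXT` body, the `examplebot` agent name, and the sample paths are made up for illustration; this is not Discourse's actual file or code.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body built from a few of the rules quoted above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /auth/cas
Disallow: /auth/facebook/callback
Disallow: /u/
"""


def is_crawlable(path: str, agent: str = "examplebot") -> bool:
    """Return True if a compliant crawler may fetch `path` under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for path in ("/t/some-topic/123", "/u/someone", "/auth/cas/callback"):
        print(path, is_crawlable(path))
```

Disallow values are matched as path prefixes, which is why `/auth/cas` also covers `/auth/cas/callback` in this sketch.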

Disallow: /assets/browser-update*.js

I think this is working around a bad habit Firefox has of overeagerly re-downloading Web Workers.

Disallow: /users/
Disallow: /u/

Because they’re infrequently read by humans, black hat SEO people like to spam forum user profiles. Discourse blocks user profiles in order to disincentivise it.

Also, user profiles contain excerpts of the posts, but we don’t want the search engines linking there. They should be linking to the actual posts.

Disallow: /badges/
Disallow: /search
Disallow: /search/
Disallow: /tags
Disallow: /tags/

These pages don’t have a noscript version, so they just get empty responses. Try visiting https://meta.discourse.org/badges/?_escaped_fragment_=1

Disallow: /email/

I dunno. Does it even exist any more?

Disallow: /session
Disallow: /session/
Disallow: /admin
Disallow: /admin/
Disallow: /user-api-key
Disallow: /user-api-key/
Disallow: /*?api_key*
Disallow: /*?*api_key*

Again, these don’t work without being logged in anyway, and spiders can’t log in.
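The two `api_key` rules use wildcards, and it may help to see why both exist. Below is a rough Python sketch of wildcard matching as the major crawlers document it (`*` matches any run of characters, a trailing `$` anchors the end of the URL). The helper names `rule_to_regex` and `rule_matches` and the sample URLs are hypothetical; a real crawler's matcher will differ in details such as percent-encoding. As far as I know, Python's own `urllib.robotparser` only does prefix matching, so this uses `re` instead.

```python
import re


def rule_to_regex(rule: str) -> "re.Pattern[str]":
    """Translate a robots.txt path rule into a compiled regex."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and let each '*' match any run of characters.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def rule_matches(rule: str, path_and_query: str) -> bool:
    """True if `rule` matches from the start of the path-plus-query string."""
    return rule_to_regex(rule).match(path_and_query) is not None


if __name__ == "__main__":
    for rule in ("/*?api_key*", "/*?*api_key*"):
        for url in ("/posts.json?api_key=abc", "/latest.json?page=2&api_key=abc"):
            print(rule, url, rule_matches(rule, url))
```

In this sketch the first rule only catches `api_key` as the first query parameter, while the second also catches it when it appears later in the query string.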

Disallow: /groups
Disallow: /groups/

Again, you need JavaScript to view these pages.

Disallow: /uploads/

Don’t want spiders shoveling down huge images, or ending up hotlinked because somebody found it in an image search. This rule stops working if you turn on a CDN, by the way.

codinghorror · March 30, 2018, 8:31pm

That’s not the only reason; the user profiles are all duplicate data, in that your posts and topics are visible on… the topics themselves, which are the focus.

Gulshan_Kumar · March 30, 2018, 8:38pm

Maybe we could replace it this way for some paths:

First, allow bots to see the link

Allow: /u/

Then, set X-Robots-Tag "noindex, follow"

Why? This will actually prevent indexing.

Ref: Robots meta tag, data-nosnippet, and X-Robots-Tag specifications
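To make the header idea concrete, here is a minimal sketch of a generic Python WSGI middleware that adds `X-Robots-Tag: noindex, follow` to responses under a path prefix. This is only an illustration of the technique: Discourse itself is not a Python application, and the name `add_noindex_header` and the `/u/` default are assumptions for the example.

```python
def add_noindex_header(app, prefix="/u/"):
    """Wrap a WSGI app so responses under `prefix` carry X-Robots-Tag."""

    def middleware(environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            # Only tag responses whose request path falls under the prefix.
            if environ.get("PATH_INFO", "").startswith(prefix):
                headers = list(headers) + [("X-Robots-Tag", "noindex, follow")]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start_response)

    return middleware
```

Note that a crawler can only see this header if robots.txt lets it fetch the page in the first place.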

notriddle · March 30, 2018, 8:40pm

Indexing the page isn’t a problem. If Bing thinks I’m looking for your user profile, and offers up a link to it, that’s perfectly fine. The reason they’re blocked in robots.txt is because the page contains text and links that are likely to be misleading.

codinghorror · March 30, 2018, 8:41pm

“duplicates” would be a much more accurate statement here.

Gulshan_Kumar · March 30, 2018, 8:48pm

What if we …

Step 1. Allow crawling of the path /u/
Step 2. Set a noindex, follow header for the path /u/
Step 3. Restrict profile access to logged-in users

If something is restricted to logged-in users, bots cannot access it. Hence, the duplicate-text concern is already taken care of. So why still disallow the path /u/?

notriddle · March 30, 2018, 8:52pm

Why do you think it would be better to use a noindex header instead of robots.txt? The goal of blocking search engines is not to prevent them from returning user profiles as search results, it’s to prevent them from reading the contents of the user profiles (because the contents of the user profiles are duplicates of other pages and/or spam).

Gulshan_Kumar · March 30, 2018, 8:56pm

Because that is the right way to prevent indexing.

Gulshan_Kumar · March 30, 2018, 9:06pm

Because Google doesn’t recommend using a disallow rule to handle duplicate content.

https://support.google.com/webmasters/answer/66359?hl=en

But they mention noindex two or three times.

codinghorror · March 30, 2018, 9:18pm

Why are you posting screenshots of text? Guess what is also very bad for searchability? Pictures of text…

So your point seems to be this – which I had to TYPE IN FROM YOUR SCREENSHOT instead of being linked like a regular link:

You can prevent a page from appearing in Google Search by including a noindex meta tag in the page’s HTML code, or by returning a ‘noindex’ header in the HTTP request. When Googlebot next crawls that page and sees the tag or header, Googlebot will drop that page entirely from Google Search results, regardless of whether other sites link to it.

Important! For the noindex directive to be effective, the page must not be blocked by a robots.txt file. If the page is blocked by a robots.txt file, the crawler will never see the noindex directive, and the page can still appear in search results, for example if other pages link to it.

Which means user pages are still kinda present in Google’s indexes though they would never appear as hits for any actual search terms.

How is this a problem? Give me a valid search term with actual search keywords that produces a user page.
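The interaction described in the quoted Google text can be sketched in a few lines of Python: a noindex header only counts if robots.txt lets the crawler fetch the page. The function name `crawler_sees_noindex`, the `examplebot` agent, and the sample inputs are hypothetical, and real crawlers apply more rules than this.

```python
from urllib.robotparser import RobotFileParser


def crawler_sees_noindex(robots_txt: str, path: str, response_headers: dict) -> bool:
    """True only if the page is fetchable AND its response carries noindex."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch("examplebot", path):
        # Blocked by robots.txt: the page is never fetched, so the header is never read.
        return False
    tag = response_headers.get("X-Robots-Tag", "")
    directives = [d.strip().lower() for d in tag.split(",")]
    return "noindex" in directives
```

With a `Disallow: /u/` rule in place this returns False for any profile path, whatever headers the page sends, which is the situation the "Important!" note warns about.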

Gulshan_Kumar · March 30, 2018, 9:32pm

It is the first result, just from searching for the homepage URL.

https://www.bing.com/search?q=help.gulshankumar.net

codinghorror · March 30, 2018, 9:33pm

No repro here on meta

Gulshan_Kumar · March 30, 2018, 9:35pm

I appreciate your case study.

However, my case is different. If there were a noindex tag, the result might be different.

codinghorror · March 30, 2018, 9:36pm

I’ll ask again:

I do not consider “type the full domain name into a search box” a valid search.

Gulshan_Kumar · March 30, 2018, 9:39pm

Here is another one; please consider it now.

codinghorror · March 30, 2018, 9:39pm

In Google – Bing is terrible, and has almost no market share.

notriddle · March 30, 2018, 9:41pm

Also, you just searched for your own login name. Returning the user profile seems like a correct result to me.

Gulshan_Kumar · March 30, 2018, 9:44pm

Unfortunately, Bing powers many other small search engines, so the same results show up there. I learned this from Bing after keeping it noindexed for a year: I found that other search engines also stopped showing results because I had set noindex at Bing.

Maybe, Bing has a small market share, but its impact on small SE

codinghorror · March 30, 2018, 9:45pm

If you feel strongly about it, you can write a plugin to change the behavior. I see no value in spending any effort on this, other than you typing your name into Bing.
