Why are there lots of Disallow rules in robots.txt?

Well, Bing has a bug and is not respecting robots.txt. Why not raise this with Microsoft?

Note, I am sure some really weird things are up with Bing; looking at the crawler stats here, it’s going ballistic on Meta.

3 Likes

Actually, it is respecting robots.txt. It is not crawling the pages that are disallowed; it is merely including them in its index, which the robots.txt spec allows just fine.

3 Likes

I refer you to my previous statements on the matter:

We saw this at Stack Overflow as well. A lot of sound and fury from Bing crawlers, resulting in virtually zero real traffic. Bing completely sucks, and has for a decade.

1 Like

The Disallow rule in the robots.txt file prevents crawling, never indexing.
Raising the issue with MS? Seriously, they don’t care about webmasters.
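
As a minimal sketch of that distinction (the paths here are illustrative, not anyone’s actual robots.txt):

    User-agent: *
    # Compliant crawlers will not fetch these URLs...
    Disallow: /u/
    Disallow: /badges/
    # ...but a disallowed URL can still appear in search results
    # (often title-only) if other pages link to it.
    # Disallow controls crawling, not indexing.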

I am against removing this from robots.txt, but I am open to amending Discourse to carry the list and auto-generate noindex meta tags on those pages as well, especially if this reduces traffic from Bing.
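
For reference, the tag being proposed is just the standard robots meta tag (a generic sketch, not Discourse’s actual template output):

    <!-- rendered in the <head> of pages that should stay out of the index -->
    <meta name="robots" content="noindex">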

Worth a try, at least that is a net benefit to every Discourse instance.

1 Like

Personally, @Gulshan_Kumar, I recommend:

  1. Work on getting people to link to your forum’s front page from Twitter and stuff. Unindexed pages should not be ranking higher than your front page, even on Bing.

  2. Don’t worry about Bing returning your user profile when searching for your login name. That’s correct behavior.

And as for @codinghorror: don’t act like Stack Overflow is representative of the internet at large. Bing, being the default search engine in Internet Explorer, gets most of its traffic from a very different demographic than SO or even Discourse Meta targets.

4 Likes

Worldwide, 9%. My main beef is that Bing is objectively very bad, both at providing relevant results and in its crawler behavior.

I mean, for a criterion of “type words in a search box and have something come back”, it works…

1 Like

About Bing, I think every bit helps. If a website is high-quality enough, it will rank for at least some keywords.
My purpose in keeping the noindex is to avoid indexing unnecessary pages. The way Bing/Yahoo randomly show stuff, I feel helpless.

Thanks for participating in this discussion. I greatly appreciate your valuable input.

Sorry, I forgot to mention: what if we just remove the user-profile link for non-logged-in users (anonymous/bots)? That would solve the whole problem.

Because they’re not supposed to be secret.

1 Like

Also note that the crawler view of a user profile… doesn’t actually contain anything other than the bio. Certainly not a list of links to posts.

Now that the user page has been redefined a few times, presenting these links to crawlers might actually be a good thing: surfacing quality content is what those sections are for, and we might as well help the search engine out with them.

@sam’s profile Overview

versus noscript version of me:


      <div id="main-outlet" class="wrap">
        <!-- preload-content: -->
         <h2>riking</h2>

<p><p>Discourse is pretty great</p>
<p><a href="https://github.com/riking" class="onebox" target="_blank">https://github.com/riking</a><br>
<a href="https://twitter.com/riking27" class="onebox" target="_blank">https://twitter.com/riking27</a></p></p>



        <!-- :preload-content -->
        <footer>
          <nav itemscope itemtype='http://schema.org/SiteNavigationElement'>
            <a href='/'>Home</a>
            <a href="/categories">Categories</a>
            <a href="/guidelines">FAQ/Guidelines</a>
            <a href="/tos">Terms of Service</a>
            <a href="/privacy">Privacy Policy</a>
          </nav>
        </footer>
      </div>

      <footer id='noscript-footer'>
        <p>Powered by <a href="https://www.discourse.org">Discourse</a>, best viewed with JavaScript enabled</p>
      </footer>

wow, that’s out of date…

3 Likes

See also

Google has a massive lead on mobile as well, so that transition heavily favors them.

@notriddle - I think profiles are quite frequently read by humans; forum profiles will often have more views than any one of the user’s threads. If blocking user profiles from indexing is meant to disincentivize spammers from placing links, then we’re likely in agreement that links listed inside a profile are valued more when the page is indexed than when it is not.

This would mean that our internal linking will suffer from being on noindexed profile pages. Stronger internal linking is better for SEO and helps search engines crawl content, especially when no sitemap is being used. Simply adding nofollow to external spam/non-whitelisted profile links should be sufficient to disincentivize spammers (if that isn’t done already); many probably don’t even check the robots.txt to see whether profiles get indexed.
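
For illustration, the markup meant here is simply the standard rel attribute on outbound profile links (a generic sketch; the URLs are made up, not real output):

    <!-- external, non-whitelisted bio link: discourage spam while keeping the page indexable -->
    <a href="https://example.com/some-site" rel="nofollow">example.com/some-site</a>

    <!-- internal links stay followable so indexed profiles can pass link equity -->
    <a href="/t/some-topic/123">Some topic</a>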

Here is a thread incl. video discussing how Google will not follow links on noindex pages: https://www.webmasterworld.com/google/4881752.htm

@codinghorror - As for it being duplicate content, I think profiles are less of a duplicate-content issue than a thread appearing under parent and child categories (and tags) simultaneously. The links are duplicates, but on different pages/URLs, with different purposes. Profiles can also include unique content such as the bio; the floating card that displays the bio appears on desktop only, so on mobile the only way to see it is the full profile. Google has also switched to mobile-first indexing, meaning the mobile versions of our sites have become the primary versions: How Does Mobile-First Indexing Work, and How Does It Impact SEO? - Moz

The question is, why wouldn’t we want to allow search engines to crawl our sites better? Reddit allows indexing of user profiles and is basically the largest forum in the world. YouTube allows indexing of channels; Twitter, FB, Google Plus, etc. allow indexing of profiles as well. The only real examples I’ve seen of using noindex for profiles are on old forum software.

I definitely think that blocking user profile pages in robots.txt should not be the default.

2 Likes

Just reviving this.

  1. You can now edit the robots.txt file to taste if you want to

  2. We always supply an X-Robots-Tag: noindex header on pages that should not be indexed (see the sketch after this list)

  3. Turns out that some crawlers “go to town” on sites if we do not give strict guidance in robots.txt; not everyone is Google. We have an incredibly vanilla robots.txt file these days, and it comes at a cost. (We expect everyone to be as well-behaved as Google, and it takes a massive effort to become Google.)
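
The header in point 2 looks roughly like this on the wire (a minimal sketch of one response, not Discourse’s actual output):

    HTTP/1.1 200 OK
    Content-Type: text/html; charset=utf-8
    X-Robots-Tag: noindex

Crawlers that honor X-Robots-Tag (Google does) keep such pages out of their index even when robots.txt allows them to be crawled.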

I think we should probably bring back the “very limiting” robots.txt by default, at least for all non-Google bots.
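
Roughly this shape, for illustration (a hypothetical sketch of the idea; the user agents and paths are examples, not an actual Discourse default):

    # Crawlers known to honor X-Robots-Tag get the permissive treatment
    User-agent: Googlebot
    Disallow: /admin/

    # Everyone else gets the stricter, "very limiting" guidance
    User-agent: *
    Disallow: /u/
    Disallow: /badges/
    Disallow: /search
    Disallow: /admin/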

4 Likes

Sure, feel free to make it so… Google kind of painted us into this corner.

3 Likes