Why are there lots of Disallow rules in robots.txt?

Well, Bing has a bug and is not respecting robots.txt. Why not raise this with Microsoft?

Note, I am sure some really weird things are up with Bing; looking at the crawler stats here, it’s going ballistic on Meta.

3 Likes

Actually, it is respecting robots.txt. It is not crawling the pages that are disallowed; it is merely including them in its index, which the robots.txt spec allows just fine.

3 Likes

I refer you to my previous statements on the matter:

We saw this at Stack Overflow as well. A lot of sound and fury from Bing crawlers, resulting in virtually zero real traffic. Bing completely sucks, and has for a decade.

1 Like

The Disallow rule in the robots.txt file prevents crawling, never indexing.
Raising the issue with MS? Seriously, they don’t care about webmasters.
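
As a minimal sketch of that distinction (the paths here are illustrative, not anyone’s actual robots.txt):

    User-agent: *
    # Compliant crawlers will not fetch these URLs...
    Disallow: /u/
    Disallow: /badges/
    # ...but a disallowed URL can still appear in search results
    # (often title-only) if other pages link to it.
    # Disallow controls crawling, not indexing.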

I am against removing this from robots.txt, but I am open to amending Discourse to carry the list and auto-generate noindex meta tags on those pages as well, especially if this reduces traffic from Bing.
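
For reference, the tag being proposed is just the standard robots meta tag (a generic sketch, not Discourse’s actual template output):

    <!-- rendered in the <head> of pages that should stay out of the index -->
    <meta name="robots" content="noindex">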

Worth a try, at least that is a net benefit to every Discourse instance.

1 Like

Personally, @Gulshan_Kumar, I recommend:

  1. Work on getting people to link to your forum’s front page from Twitter and stuff. Unindexed pages should not be ranking higher than your front page, even on Bing.

  2. Don’t worry about Bing returning your user profile when searching for your login name. That’s correct behavior.

And as for @codinghorror: don’t act like Stack Overflow is representative of the internet at large. Bing, being the default search engine in Internet Explorer, gets most of its traffic from a very different demographic than SO or even Discourse Meta targets.

4 Likes

Worldwide, 9%. My main beef is that Bing is objectively very bad, both at providing relevant results and in its crawler behavior.

I mean, for a criterion of “type words in a search box and have something come back”, it works…

1 Like

About Bing, I think every bit helps. If a website is high-quality enough, it will rank for at least some keywords.
My purpose in keeping the noindex is to avoid indexing unnecessary pages. The way Bing/Yahoo randomly show stuff, I feel helpless.

Thanks for participating in this discussion. I greatly appreciate your valuable input.

Sorry, I forgot to mention: what if we just remove the user-profile link for non-logged-in users (anonymous/bots)? That would solve the whole problem.

Because they’re not supposed to be secret.

1 Like

Also note that the crawler view of a user profile… doesn’t actually contain anything other than the bio. Certainly not a list of links to posts.

Now that the user page has been redefined a few times, presenting these links to crawlers might actually be a good thing: surfacing quality content is what those sections are for, and we might as well help the search engine out with them.

@sam’s profile Overview

versus noscript version of me:


      <div id="main-outlet" class="wrap">
        <!-- preload-content: -->
         <h2>riking</h2>

<p><p>Discourse is pretty great</p>
<p><a href="https://github.com/riking" class="onebox" target="_blank">https://github.com/riking</a><br>
<a href="https://twitter.com/riking27" class="onebox" target="_blank">https://twitter.com/riking27</a></p></p>



        <!-- :preload-content -->
        <footer>
          <nav itemscope itemtype='http://schema.org/SiteNavigationElement'>
            <a href='/'>Home</a>
            <a href="/categories">Categories</a>
            <a href="/guidelines">FAQ/Guidelines</a>
            <a href="/tos">Terms of Service</a>
            <a href="/privacy">Privacy Policy</a>
          </nav>
        </footer>
      </div>

      <footer id='noscript-footer'>
        <p>Powered by <a href="https://www.discourse.org">Discourse</a>, best viewed with JavaScript enabled</p>
      </footer>

wow, that’s out of date…

3 Likes

See also

Google has a massive lead on mobile as well, so that transition heavily favors them.

@notriddle - I think profiles are quite frequently read by humans; forum profiles will often have more views than any one of the user’s threads. If blocking user profiles from indexing is meant to disincentivize spammers from placing links, then we’re likely in agreement that links listed inside a profile are valued more when the page is indexed than when it is not.

This would mean that our internal linking will suffer from being on noindexed profile pages. Stronger internal linking is better for SEO and helps search engines crawl content, especially when no sitemap is being used. Simply adding nofollow to external spam/non-whitelisted profile links should be sufficient to disincentivize spammers (if that isn’t done already); many probably don’t even check the robots.txt to see whether profiles get indexed.
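
For illustration, the markup meant here is simply the standard rel attribute on outbound profile links (a generic sketch; the URLs are made up, not real output):

    <!-- external, non-whitelisted bio link: discourage spam while keeping the page indexable -->
    <a href="https://example.com/some-site" rel="nofollow">example.com/some-site</a>

    <!-- internal links stay followable so indexed profiles can pass link equity -->
    <a href="/t/some-topic/123">Some topic</a>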

Here is a thread incl. video discussing how Google will not follow links on noindex pages: https://www.webmasterworld.com/google/4881752.htm

@codinghorror - As for it being duplicate content, I think profiles are less of a duplicate-content issue than a thread appearing under parent and child categories (and tags) simultaneously. The links are duplicates, but on different pages/URLs, with different purposes. Profiles can also include unique content such as the bio; the floating card that displays the bio appears on desktop only, so on mobile the only way to see it is the full profile. Google has also switched to mobile-first indexing, meaning the mobile versions of our sites have become the primary versions: How Does Mobile-First Indexing Work, and How Does It Impact SEO? - Moz

The question is, why wouldn’t we want to allow search engines to crawl our sites better? Reddit allows indexing of user profiles and is basically the largest forum in the world. YouTube allows indexing of channels; Twitter, FB, Google Plus, etc. allow indexing of profiles as well. The only real examples I’ve seen of using noindex for profiles are on old forum software.

I definitely think that blocking user profile pages in robots.txt should not be the default.

2 Likes

Just reviving this.

  1. You can now edit the robots.txt file to taste if you want to

  2. We always supply an X-Robots-Tag: noindex header on pages that should not be indexed (see the sketch after this list)

  3. Turns out that some crawlers “go to town” on sites if we do not give strict guidance in robots.txt; not everyone is Google. We have an incredibly vanilla robots.txt file these days, and it comes at a cost. (We expect everyone to be as well-behaved as Google, and it takes a massive effort to become Google.)
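
The header in point 2 looks roughly like this on the wire (a minimal sketch of one response, not Discourse’s actual output):

    HTTP/1.1 200 OK
    Content-Type: text/html; charset=utf-8
    X-Robots-Tag: noindex

Crawlers that honor X-Robots-Tag (Google does) keep such pages out of their index even when robots.txt allows them to be crawled.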

I think we should probably bring back the “very limiting” robots.txt by default, at least for all non-Google bots.
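
Roughly this shape, for illustration (a hypothetical sketch of the idea; the user agents and paths are examples, not an actual Discourse default):

    # Crawlers known to honor X-Robots-Tag get the permissive treatment
    User-agent: Googlebot
    Disallow: /admin/

    # Everyone else gets the stricter, "very limiting" guidance
    User-agent: *
    Disallow: /u/
    Disallow: /badges/
    Disallow: /search
    Disallow: /admin/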

4 Likes

Sure, feel free to make it so… Google kind of painted us into this corner.

3 Likes