Has Google changed how it processes robots.txt in Discourse?

My board has been linked from my site for a couple of weeks now, and I have submitted the URL to Google. I had a noindex warning, but it seems to be for the profile pages only, which is good.

Yet nothing is appearing in Google so far. Is there anything I need to do on the board end, or is it simply a matter of waiting for Google to crawl it?

Maybe you can try https://search.google.com/search-console/ ?

It seems to be saying the post pages are blocked by robots.txt, but that isn’t something I have done. Is there a setting in Discourse I need to change to open it up? Thank you.

There is a site setting: search for allow index in robots txt in your site settings. It should be enabled (it is enabled by default).
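If you have console access you can also confirm it directly. A quick sketch, assuming the underlying setting name matches the UI label:

# From a Rails console on the Discourse server:
SiteSetting.allow_index_in_robots_txt
# => true  (the default)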


Thanks Sam, that setting is ticked. Is that the right way around?

Sorry, I was confusing things; it looks like the blocked URLs are the RSS feed equivalents.

I guess it is just a matter of waiting until Google updates or crawls the site then.

Yeah, this keeps on repeating and keeps on generating support requests.

Googlebot is somewhat annoying: you cannot tell it in robots.txt that you don’t want something indexed, because robots.txt only controls crawling. We are working on a fix to appease googlebot, but it will take a while to roll out. The cycle goes like this:

  • We tell googlebot in robots.txt … “Hey … don’t go about indexing all the .rss pages on the site”

  • googlebot finds a link somewhere to a .rss file on the site

  • googlebot then complains to site operators that there is a .rss file out there on the site, but it cannot figure out what to do with the link because it is not allowed to crawl it. It sometimes even surfaces the bare URL in search results.

  • Site operators then complain on meta.

Our general fix here is just to let googlebot crawl every page on the site and use canonicals and indexing hints in HTTP headers to direct it to the pit of success.
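To sketch the shape of that fix (a hedged illustration, not Discourse’s actual code; the controller and action names here are invented), the idea is to stop blocking the page in robots.txt, serve it normally, and carry the indexing hint in the response itself:

# Hedged sketch, not the real Discourse implementation. Because the page is
# no longer blocked in robots.txt, googlebot actually fetches it and can see
# the noindex hint in the HTTP response headers.
class FeedsController < ApplicationController
  before_action :apply_noindex_header

  private

  # "noindex" tells crawlers: you may fetch this page, but keep it out of
  # the index. The canonical link in the page HTML then points crawlers at
  # the page we do want indexed.
  def apply_noindex_header
    response.headers["X-Robots-Tag"] = "noindex"
  end
end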

I am working with @jomaxro on this and we have already made some good progress.

(fyi @codinghorror)


Thanks for the update Sam, that all makes sense and I feel your pain. I’m not an SEO specialist, but I used to run bigger websites and worked with SEO teams; on forums it was often very tricky!


To be clear, this has nothing to do with being a discussion forum. It is related to the … interesting … way Google treats robots.txt. Per Introduction to robots.txt - Search Console Help:

A robotted page can still be indexed if linked to from other sites
While Google won’t crawl or index the content blocked by robots.txt, we might still find and index a disallowed URL if it is linked from other places on the web. As a result, the URL address and, potentially, other publicly available information such as anchor text in links to the page can still appear in Google search results. To properly prevent your URL from appearing in Google Search results, you should password-protect the files on your server or use the noindex meta tag or response header (or remove the page entirely).

We’ve long included pages we don’t want indexed in the default robots.txt file each Discourse site has. This previously worked just fine. At some unknown point in the past this stopped being enough: Google decided to index pages linked from elsewhere even if they were disallowed via robots.txt.

So earlier this year we started testing noindex headers on certain pages. This would work great, except that we now end up with a clash between robots.txt and the header. Per Block search indexing with 'noindex' - Search Console Help:

Important! For the noindex directive to be effective, the page must not be blocked by a robots.txt file. If the page is blocked by a robots.txt file, the crawler will never see the noindex directive, and the page can still appear in search results, for example if other pages link to it.

Which leads us to today. We’re testing the removal of certain pages from robots.txt. We have to be careful: we’re making all these changes based on the Google documentation, so we know we’re good with Googlebot, but we also need to check other major crawlers to ensure we’re not going to cause issues there.


Quoted for emphasis. Google changed behavior here; we didn’t. So it will take a bit of time to adapt.


Hi Jeff, that all makes sense to me and I understand. I just wanted to double-check that I couldn’t have done something in my setup to hide the thread pages from Google. The main home page and the categories are appearing in Google, but none of the thread pages are, and it’s been a couple of months now. This is my site: https://community.jackwallington.com/

I believe we have made all the adjustments on our end to accommodate the recent Google behavior changes… maybe @jomaxro can confirm? You will want to be on the latest version of Discourse.

I’m not certain; I’ll need to check. I believe we’ve made some manual robots.txt changes (on Meta only) during testing…


Looking at https://github.com/discourse/discourse/blob/master/app/controllers/robots_txt_controller.rb#L10, it would appear the changes are local (Meta only). I’ll fix that. We still have a few long-running tests in progress, but I’m pretty confident here.


Necessary changes made per:
https://github.com/discourse/discourse/commit/b52143feff8c32f21ed53033b6a0a65ee45dce0e


Could it be that I have a noindex somewhere for the post pages, even though Google says they ignore this now?

Unless you installed a plugin to add one, I can’t think of a way such a header would be added. Google does not ignore the noindex header; it ignores robots.txt when other sites link to your page. It does still respect robots.txt when crawling, which is why the commit above removes the robots.txt entries in favor of the previously added noindex headers.

I’d suggest signing up for Google Search Console so you can see for yourself what Google is seeing. Perhaps there’s another issue preventing the topics from being seen.
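You can also verify from the outside by fetching one of the feed URLs and looking at the headers a crawler would see. A minimal sketch (the URL is a placeholder; substitute any topic feed path on your site):

require "net/http"

# Fetch a feed URL and print the indexing hint a crawler would receive.
res = Net::HTTP.get_response(URI("https://community.example.com/t/some-topic/123.rss"))
puts res.code            # expect 200, so crawlers can reach the page
puts res["X-Robots-Tag"] # expect "noindex", so they don’t index it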


Thanks Joshua, Google Search Console seems happy and says all threads are listed. Very strange: when I search for them, the thread pages don’t show, but the home and category pages do.


I am going to revert this and make the condition explicit for googlebot.

Googlebot is a very smart crawler, but many other crawlers are not as smart.


Fair enough. Note there’s a later commit that also needs to be reverted.


I made this PR to resolve the issue:

https://github.com/discourse/discourse/pull/11553

Google gets to keep its special rule, and we ship with better protection for the various bots that are not as fancy. The default robots.txt now looks like this:

# See http://www.robotstxt.org/robotstxt.html for documentation on how to use the robots.txt file
#
User-agent: mauibot
Disallow: /


User-agent: semrushbot
Disallow: /


User-agent: ahrefsbot
Disallow: /


User-agent: blexbot
Disallow: /


User-agent: seo spider
Disallow: /


User-agent: *
Disallow: /admin/
Disallow: /auth/
Disallow: /assets/browser-update*.js
Disallow: /email/
Disallow: /session
Disallow: /user-api-key
Disallow: /*?api_key*
Disallow: /*?*api_key*
Disallow: /badges
Disallow: /u
Disallow: /my
Disallow: /search
Disallow: /tags
Disallow: /g
Disallow: /t/*/*.rss
Disallow: /tags/*.rss
Disallow: /c/*.rss


User-agent: Googlebot
Disallow: /admin/
Disallow: /auth/
Disallow: /assets/browser-update*.js
Disallow: /email/
Disallow: /session
Disallow: /user-api-key
Disallow: /*?api_key*
Disallow: /*?*api_key*
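The reason this layout works is the group-selection rule: a crawler obeys only the single User-agent group that best matches it, so Googlebot follows its own leaner group at the bottom and never reads the wildcard rules, while less fancy bots fall through to the stricter * group. To make that concrete, here is a minimal checker, written as a hedged sketch (simplified parsing, one User-agent line per group) rather than anything Discourse ships:

require "net/http"

# Return true if `path` is disallowed for `agent` by `robots_txt`.
# Implements the usual group-selection rule: obey only the most specific
# matching User-agent group, falling back to the "*" group.
def disallowed?(robots_txt, agent, path)
  groups = {}   # user-agent token (downcased) => list of Disallow prefixes
  current = []
  robots_txt.each_line do |line|
    field, _, value = line.strip.partition(":")
    value = value.strip
    case field.strip.downcase
    when "user-agent"
      current = (groups[value.downcase] ||= [])
    when "disallow"
      current << value unless value.empty?
    end
  end

  # Pick the longest (most specific) agent token contained in the agent
  # string, else fall back to the wildcard group.
  key = groups.keys
              .reject { |k| k == "*" }
              .select { |k| agent.downcase.include?(k) }
              .max_by(&:length) || "*"

  (groups[key] || []).any? do |rule|
    # Expand "*" inside a rule into a regex wildcard.
    path.match?(Regexp.new("\\A" + Regexp.escape(rule).gsub("\\*", ".*")))
  end
end

robots = Net::HTTP.get(URI("https://community.example.com/robots.txt"))
disallowed?(robots, "Googlebot", "/t/some-topic/123.rss")    # => false: crawl it, see the noindex header
disallowed?(robots, "SomeOtherBot", "/t/some-topic/123.rss") # => true: kept out entirely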
