Has Google changed how it processes robots.txt in Discourse?

My board has been linked from my site for a couple of weeks now, and I have submitted the URL to Google. I had a noindex warning, but it seems to be for the profile pages only, which is good.

Yet nothing is appearing in Google so far. Is there anything I need to do on the board end, or is it simply a matter of waiting for Google to crawl it?

Maybe you can try https://search.google.com/search-console/ ?

It seems to be saying the post pages are blocked by robots.txt, but that isn’t something I have done. Is there a setting in Discourse I need to change to open it up? Thank you.

There is a site setting: search for allow index in robots txt in your site settings. It should be enabled (it is enabled by default).
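If you have console access you can also confirm it directly. A quick sketch, assuming the underlying setting name matches the UI label:

# From a Rails console on the Discourse server:
SiteSetting.allow_index_in_robots_txt
# => true  (the default)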


Thanks Sam, that setting is ticked. Is that the right way around?

Sorry, I was confusing things; it looks like the blocked URLs are the RSS feed equivalents.

I guess it is just a matter of waiting until Google updates or crawls the site then.

Yeah, this keeps on repeating and keeps on generating support requests.

Googlebot is somewhat annoying: you cannot tell it in robots.txt that you don’t want something indexed, because robots.txt only controls crawling. We are working on a fix to appease googlebot, but it will take a while to roll out. The cycle goes like this:

  • We tell googlebot in robots.txt … “Hey … don’t go about indexing all the .rss pages on the site”

  • googlebot finds a link somewhere to a .rss file on the site

  • googlebot then complains to site operators that there is a .rss file out there on the site, but it cannot figure out what to do with the link because it is not allowed to crawl it. It sometimes even surfaces the bare URL in search results.

  • Site operators then complain on meta.

Our general fix here is just to let googlebot crawl every page on the site and use canonicals and indexing hints in HTTP headers to direct it to the pit of success.
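To sketch the shape of that fix (a hedged illustration, not Discourse’s actual code; the controller and action names here are invented), the idea is to stop blocking the page in robots.txt, serve it normally, and carry the indexing hint in the response itself:

# Hedged sketch, not the real Discourse implementation. Because the page is
# no longer blocked in robots.txt, googlebot actually fetches it and can see
# the noindex hint in the HTTP response headers.
class FeedsController < ApplicationController
  before_action :apply_noindex_header

  private

  # "noindex" tells crawlers: you may fetch this page, but keep it out of
  # the index. The canonical link in the page HTML then points crawlers at
  # the page we do want indexed.
  def apply_noindex_header
    response.headers["X-Robots-Tag"] = "noindex"
  end
end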

I am working with @jomaxro on this and we have already made some good progress.

(fyi @codinghorror)


Thanks for the update Sam, that all makes sense and I feel your pain. I’m not an SEO specialist, but I used to run bigger websites and worked with SEO teams; on forums it was often very tricky!


To be clear, this has nothing to do with being a discussion forum. It is related to the … interesting … way Google treats robots.txt. Per Introduction to robots.txt - Search Console Help:

A robotted page can still be indexed if linked to from other sites
While Google won’t crawl or index the content blocked by robots.txt, we might still find and index a disallowed URL if it is linked from other places on the web. As a result, the URL address and, potentially, other publicly available information such as anchor text in links to the page can still appear in Google search results. To properly prevent your URL from appearing in Google Search results, you should password-protect the files on your server or use the noindex meta tag or response header (or remove the page entirely).

We’ve long included pages we don’t want indexed in the default robots.txt file each Discourse site has. This previously worked just fine. At some unknown point in the past this stopped being enough: Google decided to index pages linked from elsewhere even if they were disallowed via robots.txt.

So earlier this year we started testing noindex headers on certain pages. This would work great, except that we now end up with a clash between robots.txt and the header. Per Block search indexing with 'noindex' - Search Console Help:

Important! For the noindex directive to be effective, the page must not be blocked by a robots.txt file. If the page is blocked by a robots.txt file, the crawler will never see the noindex directive, and the page can still appear in search results, for example if other pages link to it.

Which leads us to today. We’re testing the removal of certain pages from robots.txt. We have to be careful: we’re making all these changes based on the Google documentation, so we know we’re good with Googlebot, but we also need to check other major crawlers to ensure we’re not going to cause issues there.


Quoted for emphasis. Google changed behavior here; we didn’t. So it will take a bit of time to adapt.


Hi Jeff, that all makes sense to me and I understand. I just wanted to double-check that I couldn’t have done something in my setup to hide the thread pages from Google. The main home page and the categories are appearing in Google, but none of the thread pages are, and it’s been a couple of months now. This is my site: https://community.jackwallington.com/

I believe we have made all the adjustments on our end to accommodate the recent Google behavior changes… maybe @jomaxro can confirm? You will want to be on the latest version of Discourse.

I’m not certain; I’ll need to check. I believe we’ve made some manual robots.txt changes (on Meta only) during testing…


Looking at https://github.com/discourse/discourse/blob/master/app/controllers/robots_txt_controller.rb#L10, it would appear the changes are local (Meta only). I’ll fix that. We still have a few long-running tests in progress, but I’m pretty confident here.


Necessary changes made per:
https://github.com/discourse/discourse/commit/b52143feff8c32f21ed53033b6a0a65ee45dce0e


Could it be that I have a noindex somewhere for the post pages, even though Google says they ignore this now?

Unless you installed a plugin to add one, I can’t think of a way such a header would be added. Google does not ignore the noindex header; it ignores robots.txt when other sites link to your page. It does still respect robots.txt when crawling, which is why the commit above removes the robots.txt entries in favor of the previously added noindex headers.

I’d suggest signing up for Google Search Console so you can see for yourself what Google is seeing. Perhaps there’s another issue preventing the topics from being seen.
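You can also verify from the outside by fetching one of the feed URLs and looking at the headers a crawler would see. A minimal sketch (the URL is a placeholder; substitute any topic feed path on your site):

require "net/http"

# Fetch a feed URL and print the indexing hint a crawler would receive.
res = Net::HTTP.get_response(URI("https://community.example.com/t/some-topic/123.rss"))
puts res.code            # expect 200, so crawlers can reach the page
puts res["X-Robots-Tag"] # expect "noindex", so they don’t index it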


Thanks Joshua, Google Search Console seems happy and says all threads are listed. Very strange: when I search for them, the thread pages don’t show, but the home and category pages do.


I am going to revert this and make the condition explicit for googlebot.

Googlebot is a very smart crawler, but many other crawlers are not as smart.


Fair enough. Note there’s a later commit that also needs to be reverted.


I made this PR to resolve the issue:

https://github.com/discourse/discourse/pull/11553

Google gets to keep its special rule, and we ship with better protection for the various bots that are not as fancy. The default robots.txt now looks like this:

# See http://www.robotstxt.org/robotstxt.html for documentation on how to use the robots.txt file
#
User-agent: mauibot
Disallow: /


User-agent: semrushbot
Disallow: /


User-agent: ahrefsbot
Disallow: /


User-agent: blexbot
Disallow: /


User-agent: seo spider
Disallow: /


User-agent: *
Disallow: /admin/
Disallow: /auth/
Disallow: /assets/browser-update*.js
Disallow: /email/
Disallow: /session
Disallow: /user-api-key
Disallow: /*?api_key*
Disallow: /*?*api_key*
Disallow: /badges
Disallow: /u
Disallow: /my
Disallow: /search
Disallow: /tags
Disallow: /g
Disallow: /t/*/*.rss
Disallow: /tags/*.rss
Disallow: /c/*.rss


User-agent: Googlebot
Disallow: /admin/
Disallow: /auth/
Disallow: /assets/browser-update*.js
Disallow: /email/
Disallow: /session
Disallow: /user-api-key
Disallow: /*?api_key*
Disallow: /*?*api_key*
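The reason this layout works is the group-selection rule: a crawler obeys only the single User-agent group that best matches it, so Googlebot follows its own leaner group at the bottom and never reads the wildcard rules, while less fancy bots fall through to the stricter * group. To make that concrete, here is a minimal checker, written as a hedged sketch (simplified parsing, one User-agent line per group) rather than anything Discourse ships:

require "net/http"

# Return true if `path` is disallowed for `agent` by `robots_txt`.
# Implements the usual group-selection rule: obey only the most specific
# matching User-agent group, falling back to the "*" group.
def disallowed?(robots_txt, agent, path)
  groups = {}   # user-agent token (downcased) => list of Disallow prefixes
  current = []
  robots_txt.each_line do |line|
    field, _, value = line.strip.partition(":")
    value = value.strip
    case field.strip.downcase
    when "user-agent"
      current = (groups[value.downcase] ||= [])
    when "disallow"
      current << value unless value.empty?
    end
  end

  # Pick the longest (most specific) agent token contained in the agent
  # string, else fall back to the wildcard group.
  key = groups.keys
              .reject { |k| k == "*" }
              .select { |k| agent.downcase.include?(k) }
              .max_by(&:length) || "*"

  (groups[key] || []).any? do |rule|
    # Expand "*" inside a rule into a regex wildcard.
    path.match?(Regexp.new("\\A" + Regexp.escape(rule).gsub("\\*", ".*")))
  end
end

robots = Net::HTTP.get(URI("https://community.example.com/robots.txt"))
disallowed?(robots, "Googlebot", "/t/some-topic/123.rss")    # => false: crawl it, see the noindex header
disallowed?(robots, "SomeOtherBot", "/t/some-topic/123.rss") # => true: kept out entirely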
