Malformed robots.txt causing issues with indexing

Hi everyone,

We just realised that our Discourse forum is not indexed by Google (we remember that it was indexed about a year ago), and we’re trying to fix it right now. Which settings do we need to make sure are configured properly?

This is what I’ve done so far:

  1. I’ve made sure that “allow index in robots txt” is ticked

  2. I’ve added the following domains to “exclude rel nofollow domains”:

    • grakn.ai (our main site domain)
    • discuss.grakn.ai (our Discourse forum domain)

  3. I’ve made sure that “add rel nofollow to user content” is unticked

  4. I’ve added Googlebot to “whitelisted crawler user agents”

Am I missing any other configurations that I need to set?

Our Google Search Console shows that discuss.grakn.ai still cannot be crawled because it is blocked by robots.txt (see screenshot below).

Thanks in advance for the help!!!

2 Likes

Admin -> Settings -> Enable Robots.txt

Your forum’s robots.txt file is available: https://discuss.grakn.ai/robots.txt

Log in to Google Webmaster Tools and check it: https://www.google.com/webmasters/tools/robots-testing-tool

4 Likes

Out of the box, with all defaults, this works totally fine. Did you modify these settings when you originally installed?

4 Likes

The robots.txt file has this block in the middle, which might be causing problems for crawlers:

User-agent: *
Disallow: /
Noindex: /

Google is indexing pages though:
https://www.google.com/search?q=site%3Ahttps%3A%2F%2Fdiscuss.grakn.ai%2F&num=100

It might be that Googlebot is following your Googlebot-specific rules, while Webmaster Tools is warning you about the wildcard block.
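To illustrate the idea above, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The Googlebot group and the `/t/example` path are assumptions made up for the example (they are not copied from the live file); only the wildcard block matches what is quoted above. Python's parser is also not Google's, so treat this as a rough model of how agent-specific groups and the wildcard group interact.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: an agent-specific group plus the wildcard block.
WITH_GOOGLEBOT_GROUP = """\
User-agent: Googlebot
Disallow: /admin/

User-agent: *
Disallow: /
Noindex: /
"""

# The wildcard block on its own, with no agent-specific group.
WILDCARD_ONLY = """\
User-agent: *
Disallow: /
Noindex: /
"""


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical topic URL on the forum.
URL = "https://discuss.grakn.ai/t/example"

print(can_fetch(WITH_GOOGLEBOT_GROUP, "Googlebot", URL))     # True
print(can_fetch(WITH_GOOGLEBOT_GROUP, "SomeOtherBot", URL))  # False
print(can_fetch(WILDCARD_ONLY, "Googlebot", URL))            # False
```

In this model, a crawler that has its own group never falls through to the wildcard group, while any crawler without one is blocked by `Disallow: /`. The non-standard `Noindex` line is simply ignored by Python's parser.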

(I’m not sure what settings result in that robots.txt output.)

3 Likes

Yes.

  1. Access: https://discuss.grakn.ai/admin/customize/robots

  2. Remove:

    User-agent: *
    Disallow: /
    Noindex: /

  3. Go to Google Webmaster Tools: https://www.google.com/webmasters/tools/robots-testing-tool

Choose a verified property and submit robots.txt again to Google.

I think it should work.
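Before resubmitting, a quick check can confirm that the blanket block is really gone. This is only a rough sketch: `find_blanket_disallow` is a hypothetical helper name, the sample text is made up apart from the block quoted above, and the parsing is deliberately simpler than what real crawlers do.

```python
from typing import List

# Hypothetical robots.txt content containing the block to remove.
SAMPLE = """\
User-agent: Googlebot
Disallow: /admin/

User-agent: *
Disallow: /
Noindex: /
"""


def find_blanket_disallow(robots_txt: str) -> List[int]:
    """Return 1-based line numbers of 'Disallow: /' rules in a 'User-agent: *' group."""
    hits: List[int] = []
    agents: List[str] = []  # user agents named by the current group
    in_rules = False        # True once the current group has an Allow/Disallow rule
    for lineno, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:
                # A User-agent line after rules starts a new group.
                agents = []
                in_rules = False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/" and "*" in agents:
                hits.append(lineno)
    return hits


print(find_blanket_disallow(SAMPLE))  # [5]
```

An empty list means no wildcard group blocks the whole site. The text to check could be copied from https://discuss.grakn.ai/robots.txt after saving the change.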

1 Like

In the end, removing the following block fixed the problem.

User-agent: *
Disallow: /
Noindex: /

Thank you so much, @j127 and @tohaitrieu!!!

Google Search Console now shows that discuss.grakn.ai is queued up for indexing.

Cheers!

2 Likes

I’m very unclear how you ended up in this state. Did you change default site settings related to crawling?

2 Likes

I’m also unclear how we ended up in the above state, @codinghorror. I’ve been the admin of the site for the past year and I did not change anything related to the settings above. I do remember not upgrading for a very long time, and then doing an upgrade shortly before the above issue started occurring, but I don’t know if that’s related.

1 Like