My board has been linked from my site for a couple of weeks now and I submitted the URL to Google. I got a noindex warning, but it seems to apply only to profile pages, which is fine.
Nothing is appearing in Google yet, though. Is there anything I need to do on the board end, or is it simply a matter of waiting for Google to crawl it?
It seems to be saying the post pages are blocked by robots.txt, but that isn’t something I have done. Is there a setting in Discourse I need to change to open it up? Thank you
Yeah, this keeps coming up and keeps generating support requests.
Googlebot is somewhat annoying. You cannot reliably tell it in robots.txt that you don’t want something indexed, because a disallow rule stops crawling, not indexing. We are working on a fix to appease googlebot but it will take a while for it to roll out.
We tell googlebot in robots.txt … “Hey … don’t go about indexing all the .rss pages on the site”
googlebot finds a link somewhere to a .rss file on the site
googlebot then complains to site operators that there is a .rss file out there on the site, but it cannot figure out what to do with the link because it is not allowed to crawl it. It sometimes even includes the URL in search results.
Site operators then complain on meta
Our general fix here is just to let googlebot crawl every page on the site and use canonicals and indexing hints in HTTP headers to direct it to the pit of success.
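To make the idea of canonicals and indexing hints in HTTP headers concrete, here is a minimal sketch. It is not the Discourse implementation; the function name `indexing_headers`, the rule that `.rss` paths get `noindex`, and the `forum.example.com` URLs are all assumptions for illustration. The `X-Robots-Tag` header and the `Link: <...>; rel="canonical"` header are the standard forms Google documents.

```python
from typing import Dict, Optional
from urllib.parse import urlsplit


def indexing_headers(url: str, canonical_url: Optional[str] = None) -> Dict[str, str]:
    """Return hypothetical response headers that steer a crawler for `url`."""
    headers: Dict[str, str] = {}
    path = urlsplit(url).path

    # Assumption: feed URLs stay crawlable but should be kept out of the index.
    if path.endswith(".rss"):
        headers["X-Robots-Tag"] = "noindex"

    # If this URL is an alternate view of another page, point at the preferred one.
    if canonical_url and canonical_url != url:
        headers["Link"] = '<{}>; rel="canonical"'.format(canonical_url)

    return headers


if __name__ == "__main__":
    print(indexing_headers("https://forum.example.com/latest.rss"))
    print(indexing_headers(
        "https://forum.example.com/t/hello/12?page=2",
        canonical_url="https://forum.example.com/t/hello/12",
    ))
```

The point of the design is that the crawler is allowed to fetch everything, and each response tells it what to do with the page.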
I am working with @jomaxro on this and we have already made some good progress.
Thanks for the update Sam, that all makes sense and I feel your pain. I’m not an SEO but I used to run bigger websites and worked with SEO teams, on forums it was often very tricky!
A robotted page can still be indexed if linked to from other sites
While Google won’t crawl or index the content blocked by robots.txt, we might still find and index a disallowed URL if it is linked from other places on the web. As a result, the URL address and, potentially, other publicly available information such as anchor text in links to the page can still appear in Google search results. To properly prevent your URL from appearing in Google Search results, you should password-protect the files on your server or use the noindex meta tag or response header (or remove the page entirely).
We’ve long included pages we don’t want indexed in the default robots.txt file each Discourse site has. This previously worked just fine. At an unknown point in the past this stopped being enough: Google began indexing pages linked from elsewhere even when they are disallowed in robots.txt.
Important! For the noindex directive to be effective, the page must not be blocked by a robots.txt file. If the page is blocked by a robots.txt file, the crawler will never see the noindex directive, and the page can still appear in search results, for example if other pages link to it.
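The interaction described in that quote can be sketched with Python’s standard `urllib.robotparser`. This is only an illustration of the logic, not how Googlebot is implemented; the function name `crawler_sees_noindex`, the `/u/` rule, and the example URL are assumptions.

```python
from typing import Dict
from urllib.robotparser import RobotFileParser


def crawler_sees_noindex(
    robots_txt: str,
    url: str,
    response_headers: Dict[str, str],
    user_agent: str = "Googlebot",
) -> bool:
    """True only if the crawler may fetch `url` AND the response says noindex."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    # A disallowed URL is never fetched, so its headers are never read.
    if not parser.can_fetch(user_agent, url):
        return False

    return "noindex" in response_headers.get("X-Robots-Tag", "").lower()


if __name__ == "__main__":
    url = "https://forum.example.com/u/alice"  # hypothetical profile page
    headers = {"X-Robots-Tag": "noindex"}

    blocking = "User-agent: *\nDisallow: /u/\n"
    print(crawler_sees_noindex(blocking, url, headers))  # blocked: noindex is never seen
    print(crawler_sees_noindex("", url, headers))        # crawlable: noindex is seen
```

With the disallow rule in place the function returns `False`, which is the case where the URL can still show up in results via external links. Removing the rule lets the `noindex` header take effect.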
Which leads us to today. We’re testing the removal of certain pages from the robots.txt. We have to be careful, as we’re making all these changes based on the Google documentation, so we know we’re good with Googlebot, but need to also check other major crawlers to ensure we’re not going to cause issues there.
Hi Jeff, that all makes sense to me and I understand. I just wanted to double-check that I couldn’t have done something in my setup to hide the thread pages from Google? The main home page and categories are appearing in Google but none of the thread pages are, and it’s been a couple of months now. This is my site: https://community.jackwallington.com/
I believe we have made all the adjustments on our end to accommodate the recent Google behavior changes… maybe @jomaxro can confirm? You will want to be on the latest version of Discourse.
Unless you installed a plugin to add that, I can’t think of a way such a header would be added. Google does not ignore the noindex header. Google ignores robots.txt when other sites point to your page. Google does respect robots.txt when crawling, which is why the commit above removes robots.txt entries in favor of the previously added noindex headers.
I’d suggest signing up for Google Search Console so you can see for yourself what Google is seeing. Perhaps there’s another issue preventing the topics from being seen.
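If you want a quick local check alongside Search Console, a rough sketch like the following looks for `noindex` signals in a response you have already fetched. The helper name `noindex_signals` is made up for this example, and it only checks the `X-Robots-Tag` header and `<meta name="robots">` / `<meta name="googlebot">` tags, so treat it as a sanity check rather than a view of what Google actually sees.

```python
from html.parser import HTMLParser
from typing import Dict, List


class _RobotsMetaParser(HTMLParser):
    """Collects the content of <meta name="robots"> / <meta name="googlebot"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            self.directives.append(attr.get("content", "").lower())


def noindex_signals(headers: Dict[str, str], html: str) -> List[str]:
    """Return a description of every noindex hint found in headers or HTML."""
    signals: List[str] = []
    for name, value in headers.items():
        if name.lower() == "x-robots-tag" and "noindex" in value.lower():
            signals.append("header: " + value)

    parser = _RobotsMetaParser()
    parser.feed(html)
    for content in parser.directives:
        if "noindex" in content:
            signals.append("meta: " + content)
    return signals


if __name__ == "__main__":
    # Headers and HTML would normally come from fetching one of your topic URLs.
    print(noindex_signals({"X-Robots-Tag": "noindex"}, "<html></html>"))
```

An empty list for a topic page means neither of these two hints is present, so the cause would have to be something else.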
Thanks Joshua, Google Search Console seems happy and says all threads are listed. Very strange: when I search for them, the thread pages don’t show, but the home and category pages do.