While performing an SEO audit of our site, part of which runs on Discourse, I noticed that Googlebot is spending a fair amount of crawl budget on RSS feeds. This is despite the fact that the out-of-the-box robots.txt file for Discourse disallows these URLs, and despite the noindex header sent in the HTTP response for these RSS URLs.
I’m curious whether there is a non-hacky way to disable these RSS feeds altogether on my site. I suspect not many people are using them (I’ll try to confirm this), but my question stands either way.
There is no checkbox to disable those feeds at the moment.
If you know your way around nginx, you can craft a location block that matches the .rss URLs and returns a 404, and put it in the appropriate section of app.yml.
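As a rough sketch (untested; this assumes the standard Docker install, where the generated nginx config lives at `/etc/nginx/conf.d/discourse.conf` and `app.yml` supports pups `replace` rules), it could look something like this:

```yaml
# app.yml — a sketch, assuming the standard Docker install.
# Injects a location block that returns 404 for any URL ending in .rss.
run:
  - replace:
      filename: "/etc/nginx/conf.d/discourse.conf"
      from: /server.+{/
      to: |
        server {
          # Return 404 for topic and category RSS feed URLs
          location ~ \.rss$ {
            return 404;
          }
```

After rebuilding the container, requests like `/latest.rss` or `/t/some-topic/123.rss` should get a 404 instead of the feed.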
Is Google actually fetching those pages, or are the URLs just showing up in Google Search Console as “Indexed, though blocked by robots.txt”?
I don’t see noindex headers on the RSS feeds. In any case, if a URL is blocked by robots.txt, the crawler never fetches it, so it may never see a robots header even if one is set.
[I removed the `curl -I` output, because it wasn’t using GET, so the robots header was missing.]
Edit: I just re-checked the RSS feeds with a GET request, because I thought the `X-Robots-Tag: noindex` header was on them. It is there, but only in the response to a GET request, not a HEAD request.
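For anyone who wants to reproduce this, something along these lines works (with `example.com` standing in for your forum; `/latest.rss` is one of Discourse’s feed URLs):

```sh
# HEAD request — the noindex header is missing from this response
curl -I https://example.com/latest.rss

# GET request, printing only the response headers — look for
# "X-Robots-Tag: noindex" in the output
curl -s -D - -o /dev/null https://example.com/latest.rss
```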
Now I’m remembering what I did on my main forum. Google Search Console was warning about the RSS URLs being indexed but blocked, so I unblocked the feeds in robots.txt, since blocking them prevents Googlebot from fetching the URLs and ever seeing the noindex header. I’m pretty sure that will resolve the warnings, but I don’t know whether it will stop Googlebot from crawling those URLs.
These are the lines I commented out in the robots.txt override:

```text
# Disallow: /t/*/*.rss
# Disallow: /c/*.rss
```
I’d worry a little about telling crawlers that the RSS feeds exist (via `<link>` tags in the page head; see the example below) but then serving 404s when the bots try to fetch those URLs. It might lead a machine to conclude that there is some technical problem with the site, lowering its quality score (however search engines determine quality, or whether a site might be broken for users).
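For reference, the feed discovery tags in the page head look roughly like this (illustrative values; the exact title and href vary by page):

```html
<link rel="alternate" type="application/rss+xml"
      title="Latest topics" href="https://example.com/latest.rss" />
```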
Yes, I’m not a fan of the 404 solution. It seems like it could send a bad signal to Google, and it’s also a pretty blunt instrument for what I want. What I really want is for the links not to be on the page at all, not for them to become dead links.
To your question: the RSS feeds show as “Excluded by ‘noindex’ tag” in Search Console. It’s unclear whether this means Google spent time fetching them and then excluded them, or whether they were excluded pre-fetch, in which case the impact on crawl budget is likely smaller.
The nginx logs might tell you how many RSS URLs Googlebot is actually fetching. I just checked mine, and Googlebot is crawling the RSS feeds, though I effectively asked it to by unblocking them in robots.txt.
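If you want to check your own logs, something like this should work (the log path is an assumption based on the standard Docker install; adjust for your setup):

```sh
# Count Googlebot requests for .rss URLs in the nginx access log
grep "Googlebot" /var/discourse/shared/standalone/log/var-log/nginx/access.log \
  | grep -c "\.rss"
```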