Way to globally disable all RSS feeds

hhunter · October 25, 2021, 9:22pm

Hello,

While performing an SEO audit of our site, part of which runs on Discourse, I found that Googlebot appears to be spending a fair amount of crawl budget on RSS feeds. This happens even though the out-of-the-box robots.txt file for Discourse disallows these URLs and a noindex header is sent in the HTTP response for them.

I’m curious if there is a non-hacky way to disable these RSS feeds altogether on my site. I don’t suspect that many people are using them (will try to confirm this). But my question still stands.

Thanks for any help on this!

–Hugh

Falco · October 25, 2021, 9:34pm

There is no checkbox to disable those feeds at the moment.

If you know your way around nginx, you can craft a location block that matches URLs ending in `.rss` and returns a 404, then put it in the appropriate app.yml section.

IAmGav · October 25, 2021, 9:36pm

Wouldn’t sending a 404 be even worse for SEO?

hello-smile6 · October 25, 2021, 9:42pm

Why not block them using nginx?

j127 · October 26, 2021, 7:06am

Is Google actually fetching those pages or are the URLs just showing up in the Google Search Console as “indexed but blocked by robots.txt”?

~~I don’t see noindex headers on RSS feeds, but if a URL is blocked by robots.txt and has a robots header, the crawler might not ever see the robots header.~~

[I removed the `curl -I` output, because it wasn’t using GET, so the robots header was missing.]

Edit: I just checked the RSS feeds with a GET request, because I thought the `X-Robots-Tag: noindex` header was on the RSS feeds, and it is there, but only with a GET request.
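The difference between the two request types can be checked with a small helper. This is only a sketch: `has_noindex` is a hypothetical function name, and it assumes the directive arrives as an `X-Robots-Tag` response header. It reads raw response headers on stdin and reports whether a noindex directive is among them.

```shell
# Hypothetical helper: read HTTP response headers on stdin and report
# whether an X-Robots-Tag header containing "noindex" is present.
has_noindex() {
  if grep -i '^x-robots-tag:' | grep -qi 'noindex'; then
    echo "noindex present"
  else
    echo "noindex absent"
  fi
}

# HEAD request (this is what `curl -I` sends):
#   curl -sI https://meta.discourse.org/latest.rss | has_noindex
# GET request, discarding the body and dumping only the headers to stdout:
#   curl -s -o /dev/null -D - https://meta.discourse.org/latest.rss | has_noindex
```

Running both commented commands against the same feed URL should show whether the header only appears on GET for a given site.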

Now I’m remembering what I did on my main forum. Google Search Console was warning about the RSS URLs being indexed but blocked, so I unblocked the feeds in robots.txt, because blocking them prevents Googlebot from fetching the URLs and seeing the noindex header. I’m pretty sure that’s going to resolve the warnings, but I don’t know if it’s going to stop Googlebot from crawling those URLs.

```
# Disallow: /t/*/*.rss
# Disallow: /c/*.rss
```

I’d worry a little about telling crawlers that there are RSS feeds (with `<link>` tags, see below) but then sending 404s when the bots try to fetch those URLs. It might lead a machine to think that there is some technical problem with the site, lowering its quality score (however the search engines determine quality or whether a site might be broken for users).

```
$ curl -s https://meta.discourse.org/latest | grep -i rss
    <link rel="alternate" type="application/rss+xml" title="Latest posts" href="https://meta.discourse.org/posts.rss" />
    <link rel="alternate" type="application/rss+xml" title="Latest topics" href="https://meta.discourse.org/latest.rss" />
   ...
```

hhunter · October 26, 2021, 1:40pm

Yes, I’m not a fan of the 404 solution. It seems like it could send a bad signal to Google, and it’s also a pretty blunt-force way to achieve what I want. What I really want is for the links not to be on the page at all, not for them to be dead links.

To your question, the RSS feeds are "Excluded by 'noindex' tag" in Search Console. It’s unclear whether this means Google spent time fetching them and then excluded them, or whether they were excluded pre-fetch, in which case the impact on crawl budget is likely smaller.

j127 · October 26, 2021, 3:53pm

There might be some information in the nginx logs about how many RSS URLs Googlebot is fetching. I just checked mine and Googlebot is crawling the RSS feeds, but I effectively asked it to do that by unblocking them in robots.txt.
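One rough way to get that number from an access log is sketched below. `count_googlebot_rss` is a hypothetical function name, and the sketch assumes each log line contains the quoted request line (as in nginx’s default combined format) and a user agent string that mentions Googlebot. It matches on the user agent only, so it does not verify that the requests really come from Google.

```shell
# Hypothetical helper: count log lines where a user agent mentioning
# Googlebot requested a URL whose path ends in .rss.
# $1 is the path to an nginx access log (location varies by install).
count_googlebot_rss() {
  grep -i 'googlebot' "$1" \
    | grep -cE '"(GET|HEAD) [^" ]*\.rss(\?[^" ]*)? HTTP'
}

# Usage, with the log path as an assumption to adjust for your setup:
#   count_googlebot_rss /var/log/nginx/access.log
```

Comparing this count with the total number of Googlebot requests in the same log gives a rough sense of how much crawl activity goes to the feeds.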
