Handling of crawler throttling

I have a general question about how crawler throttling is implemented.

According to Change Googlebot crawl rate (Search Console Help), the recommended HTTP status for throttling is 429 (Too Many Requests) or 503 (Service Unavailable).

But reading through the source code it looks like throttling is implemented by
throwing an error: https://github.com/discourse/discourse/blob/85fddf58bc1e751d0ac5b8192a630c59a34aed7d/lib/rate_limiter.rb#L129
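For context, the pattern in that file is an exception-raising rate limiter. Here's a minimal self-contained sketch of the idea (hypothetical names; Discourse's real `RateLimiter` is Redis-backed and more involved):

```ruby
# Minimal sketch of an exception-raising rate limiter (hypothetical;
# not Discourse's actual implementation, which is Redis-backed).
class SketchRateLimiter
  class LimitExceeded < StandardError
    attr_reader :available_in

    def initialize(available_in)
      @available_in = available_in
      super("Rate limit exceeded, retry in #{available_in}s")
    end
  end

  def initialize(max:, per_seconds:)
    @max = max
    @per_seconds = per_seconds
    @hits = [] # timestamps of recent requests
  end

  # Record one request; raise once the window is full. Note that the
  # limiter itself only raises -- it never chooses an HTTP status.
  def performed!
    now = Time.now.to_f
    @hits.reject! { |t| t < now - @per_seconds }
    raise LimitExceeded.new(@per_seconds) if @hits.size >= @max
    @hits << now
  end
end

limiter = SketchRateLimiter.new(max: 2, per_seconds: 60)
2.times { limiter.performed! }   # within the limit
begin
  limiter.performed!             # third call in the window raises
rescue SketchRateLimiter::LimitExceeded => e
  puts "throttled, retry in #{e.available_in}s"
end
```

The HTTP status the crawler sees depends entirely on whatever rescues that exception further up the stack.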

My Ruby on Rails days are long behind me, but I assume an unrescued exception like this surfaces as a generic 500?

The Google crawler doesn't quite understand Discourse's throttling: in Google Search Console I can see that our indexing, and therefore impressions, dropped drastically after throttling was enabled, and the drop is attributed not to throttling but to 5xx server errors.

I understand that throttling instances may sometimes be necessary when they cause too much traffic, but I was expecting Discourse to report an HTTP 429 instead of serving the crawler a 500 Internal Server Error.


I think what you're looking for is

That's the "global" controller rescue for that error, which sets the status code.
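In other words, the raised exception only becomes a status code when a controller-level rescue translates it. A hedged plain-Ruby sketch of that rescue-to-429 mapping (hypothetical names; the real handler lives in Discourse's application controller):

```ruby
# Hypothetical error type standing in for RateLimiter::LimitExceeded.
class RateLimited < StandardError
  attr_reader :available_in

  def initialize(available_in)
    @available_in = available_in
    super("rate limited")
  end
end

# Mimics a controller-level rescue_from handler: run the action and
# translate a RateLimited error into a 429 with a Retry-After header.
# Anything left unrescued would instead surface as a generic 500.
def handle_request
  [200, {}, yield]
rescue RateLimited => e
  [429, { "Retry-After" => e.available_in.to_s }, "Slow down, too many requests."]
end

status, headers, _body = handle_request { raise RateLimited.new(60) }
puts "#{status} Retry-After=#{headers["Retry-After"]}"
```

So as long as the rescue covers the code path in question, the crawler should see a 429, not a 500.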


Thanks! That's reassuring, but it doesn't quite explain why Google Search Console is reporting 5xx errors that correlate with the moment throttling was enabled.

It even reports that it couldn't fetch the Discourse sitemap.xml.

In particular, throttling the sitemap.xml seems problematic.

I assume that's what caused the gap in coverage. I could also believe Google is misreporting a 429 as a 5xx.
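One way to settle that would be to fetch the endpoint while throttled and look at the status line Googlebot actually sees. A self-contained sketch, with a stub TCP server standing in for a throttling Discourse instance (point `Net::HTTP` at the real forum instead):

```ruby
require "socket"
require "net/http"
require "uri"

# Stub server standing in for a site that throttles /sitemap.xml
# (hypothetical; replace the URI below with your real forum's URL).
server = TCPServer.new("127.0.0.1", 0)
port = server.addr[1]

Thread.new do
  client = server.accept
  # Drain the request line and headers.
  while (line = client.gets) && line != "\r\n"; end
  body = "Too Many Requests"
  client.write "HTTP/1.1 429 Too Many Requests\r\n" \
               "Retry-After: 60\r\n" \
               "Content-Length: #{body.bytesize}\r\n" \
               "Connection: close\r\n\r\n#{body}"
  client.close
end

res = Net::HTTP.get_response(URI("http://127.0.0.1:#{port}/sitemap.xml"))
puts "#{res.code} #{res.message} Retry-After=#{res["Retry-After"]}"
```

If a check like this against the live site prints a 5xx rather than a 429, the problem is on the server side rather than Google misreporting.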
