Handling of crawler throttling

vchuravy · September 12, 2023, 6:12pm

I have a general question about how crawler throttling is implemented.

According to Change Googlebot crawl rate - Search Console Help, the recommended HTTP status is 429 (Too Many Requests) or 503 (Service Unavailable).

But reading through the source code it looks like throttling is implemented by
throwing an error: https://github.com/discourse/discourse/blob/85fddf58bc1e751d0ac5b8192a630c59a34aed7d/lib/rate_limiter.rb#L129

My Ruby on Rails days are long behind me, but I am assuming that this results in a generic 500?

The Google crawler doesn’t quite understand Discourse’s throttling. In Google Search Console I can see that our indexing, and therefore our impressions, dropped drastically after throttling was implemented. The reported cause is not throttling, though, but 5xx server errors.

I understand that throttling instances may sometimes be necessary if they cause too much traffic, but I was expecting Discourse to report an HTTP 429 instead of serving the crawler a 500 Internal Server Error.

Falco · September 12, 2023, 6:15pm

I think what you are looking for is

https://github.com/discourse/discourse/blob/85fddf58bc1e751d0ac5b8192a630c59a34aed7d/app/controllers/application_controller.rb#L181-L200

          rescue_from RateLimiter::LimitExceeded do |e|
            retry_time_in_seconds = e&.available_in
          
            response_headers = { "Retry-After": retry_time_in_seconds.to_s }
          
            response_headers["Discourse-Rate-Limit-Error-Code"] = e.error_code if e&.error_code
          
            with_resolved_locale do
              render_json_error(
                e.description,
                type: :rate_limit,
                status: 429,
                extras: {
                  wait_seconds: retry_time_in_seconds,
                  time_left: e&.time_left,
                },
                headers: response_headers,
              )
            end
          end

This is the “global” controller rescue for that error, and it is what sets the status code to 429.
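
To illustrate the pattern outside of Rails, here is a minimal plain-Ruby sketch of how a raised error can end up as a 429 with a Retry-After header rather than a 5xx. The `LimitExceeded` class and the `with_rate_limit_rescue` helper are hypothetical stand-ins for illustration only; they are not Discourse's actual code.

```ruby
# Hypothetical stand-in for RateLimiter::LimitExceeded (not Discourse's class).
class LimitExceeded < StandardError
  attr_reader :available_in

  def initialize(available_in)
    @available_in = available_in
    super("Rate limit exceeded, retry in #{available_in} seconds")
  end
end

# Runs the block and returns a Rack-style [status, headers, body] triple.
# A LimitExceeded raised inside the block is rescued and mapped to 429,
# roughly what a controller-level rescue_from handler does.
def with_rate_limit_rescue
  [200, {}, yield]
rescue LimitExceeded => e
  [429, { "Retry-After" => e.available_in.to_s }, e.message]
end

ok = with_rate_limit_rescue { "page content" }
limited = with_rate_limit_rescue { raise LimitExceeded.new(30) }

puts ok.inspect
puts limited.inspect
```

The point is only that the raise in the rate limiter does not by itself decide the status code; whichever handler rescues the error does.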

vchuravy · September 12, 2023, 6:21pm

Thanks! That’s reassuring, but it doesn’t quite explain why Google Search Console is reporting 5xx errors that correlate with the moment throttling was implemented.

It even reports that it couldn’t fetch the Discourse sitemap.xml

vchuravy · September 12, 2023, 6:34pm

In particular throttling the sitemap.xml seems problematic.

I assume that’s what caused the gap in coverage. I could believe that Google is misreporting 429 as 5xx.
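
One way to narrow this down would be to request the sitemap directly with a crawler-like User-Agent and look at the raw status code. The sketch below uses Ruby's standard Net::HTTP; the host name and the User-Agent string in the usage comment are placeholders, and `classify_status` and `fetch_status` are hypothetical helper names. Whether a given instance throttles based on that User-Agent is an assumption to verify, not something this script can tell you.

```ruby
require "net/http"
require "uri"

# Buckets an HTTP status code into the categories discussed in this thread.
def classify_status(code)
  case code.to_i
  when 200..299 then :ok
  when 429 then :throttled
  when 500..599 then :server_error
  else :other
  end
end

# Fetches a URL with the given User-Agent and returns the numeric status code.
def fetch_status(url, user_agent)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri)
  request["User-Agent"] = user_agent
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(request)
  end
  response.code.to_i
end

# Example usage (placeholder host, replace with your own forum):
# status = fetch_status("https://forum.example.com/sitemap.xml", "Googlebot")
# puts "#{status} => #{classify_status(status)}"
```

If this shows 429 while Search Console shows 5xx, that would support the misreporting theory; if it shows an actual 5xx, the problem is somewhere before the rescue handler.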
