Occasional OAuth errors

We are using external OAuth for user authentication. Occasionally users receive error 500 when coming to the platform: Snippet from error log:

Started GET "/auth/oauth2_basic/callback?code=[coderemoved]&state=[stateremoved]" for [IP] at [timestamp]
(oauth2_basic) Setup endpoint detected, running now.
(oauth2_basic) Callback phase initiated.
Faraday::TimeoutError (Timeout::Error)
lib/final_destination/resolver.rb:31:in `block in lookup'
lib/final_destination/resolver.rb:8:in `synchronize'
lib/final_destination/resolver.rb:8:in `lookup'
lib/final_destination/ssrf_detector.rb:127:in `lookup_ips'
lib/final_destination/ssrf_detector.rb:95:in `lookup_and_filter_ips'
lib/final_destination/http.rb:13:in `connect'
lib/middleware/omniauth_bypass_middleware.rb:43:in `call'
lib/content_security_policy/middleware.rb:12:in `call'
lib/middleware/anonymous_cache.rb:387:in `call'
lib/middleware/gtm_script_nonce_injector.rb:10:in `call'
config/initializers/100-quiet_logger.rb:20:in `call'
config/initializers/100-silence_logger.rb:29:in `call'
lib/middleware/enforce_hostname.rb:24:in `call'
lib/middleware/request_tracker.rb:233:in `call'

If user just reloads the page, everything works, with log info:

Started GET "/auth/oauth2_basic/callback?code=[coderemoved]&state=[stateremoved]" for [IP] at [timestamp]
(oauth2_basic) Setup endpoint detected, running now.
(oauth2_basic) Callback phase initiated.
Processing by Users::OmniauthCallbacksController#complete as HTML
  Parameters: {"code"=>"[coderemoved]", "state"=>"[stateremoved]", "provider"=>"oauth2_basic"}
Deprecation notice: `SiteSetting.anonymous_posting_min_trust_level` has been deprecated. Please use `SiteSetting.anonymous_posting_allowed_groups` instead. (removal in Discourse 3.3) 
At /var/www/discourse/lib/site_setting_extension.rb:160:in `public_send`
start
Redirected to https://[pageremoved]
Completed 302 Found in 83ms (ActiveRecord: 0.0ms | Allocations: 11138)

Sadly no steps to repro. It tends to happen when users who have been away for a longer period of time, but I can not confirm it with confidence. It is possible, that there has been a platform upgrade since their last visit.
Any suggestions or additional information I could provide?

Bumping this. I’m experiencing exactly the same issues.

Timeout error suggests a networking issue. Could just be a network glitch.

I was thinking that, but the error arises way too quickly for it to be normal behaviour. I’m wondering if there might be an overzealous DNS lookup timeout somewhere:

  1. The error is in “resolver.rb”
  2. It is temporarily fixed by refreshing - when the DNS lookup would be cached
  3. For some completely inexplicable reason, I can’t get it to read the OIDC discovery document from any URL that involves our self-hosted DNS. This is despite the fact that I’m perfectly able to curl the file manually from within the docker instance. I’ve eliminated many different variables and the DNS seems to be the only common factor.

Importantly, the Discourse server is able to talk to the OIDC server, even when it fails like this. From the access logs, there is one request:

21/Jan/2024:23:10:21 +0000] "POST /application/o/token/ HTTP/1.1" 200 7998 "-" "Faraday v2.9.0"

when it fails, and two requests:

[21/Jan/2024:23:21:03 +0000] "POST /application/o/token/ HTTP/1.1" 200 7998 "-" "Faraday v2.9.0"
[21/Jan/2024:23:21:05 +0000] "GET /application/o/userinfo/ HTTP/1.1" 200 5254 "-" "Faraday v2.9.0"

when it succeeds. Regardless, it never takes longer than 5 seconds. I’m yet to try setting up a proxy for the OIDC server that uses Cloudflare DNS, but that’ll be my next step.

Common wisdom is that it’s always dns.

Welp, it’s definitely DNS. Rather than set up a proxy I just added my OIDC server to the hosts file in the docker container and it seems to work now. This is a fragile and suboptimal solution though; I think the developers need to fix the timeout so that it’s something sane. This case reminds me of the 500 mile email story.

You can add stuff to your app.yml to update /etc/hosts on a rebuild. You can look at some other templates for examples.

Could be, but not many folks are having trouble. Could your self-hosted DNS server be overloaded sometimes?

I don’t know where to go about changing the timeout. I don’t remember ever doing it.

In my case, IdP and Discourse VMs as sitting next to each other and while noone can completely exclude possible network issues, no other service experiences it.