We are using external OAuth for user authentication. Occasionally users receive error 500 when coming to the platform: Snippet from error log:
Started GET "/auth/oauth2_basic/callback?code=[coderemoved]&state=[stateremoved]" for [IP] at [timestamp]
(oauth2_basic) Setup endpoint detected, running now.
(oauth2_basic) Callback phase initiated.
Faraday::TimeoutError (Timeout::Error)
lib/final_destination/resolver.rb:31:in `block in lookup'
lib/final_destination/resolver.rb:8:in `synchronize'
lib/final_destination/resolver.rb:8:in `lookup'
lib/final_destination/ssrf_detector.rb:127:in `lookup_ips'
lib/final_destination/ssrf_detector.rb:95:in `lookup_and_filter_ips'
lib/final_destination/http.rb:13:in `connect'
lib/middleware/omniauth_bypass_middleware.rb:43:in `call'
lib/content_security_policy/middleware.rb:12:in `call'
lib/middleware/anonymous_cache.rb:387:in `call'
lib/middleware/gtm_script_nonce_injector.rb:10:in `call'
config/initializers/100-quiet_logger.rb:20:in `call'
config/initializers/100-silence_logger.rb:29:in `call'
lib/middleware/enforce_hostname.rb:24:in `call'
lib/middleware/request_tracker.rb:233:in `call'
If user just reloads the page, everything works, with log info:
Started GET "/auth/oauth2_basic/callback?code=[coderemoved]&state=[stateremoved]" for [IP] at [timestamp]
(oauth2_basic) Setup endpoint detected, running now.
(oauth2_basic) Callback phase initiated.
Processing by Users::OmniauthCallbacksController#complete as HTML
Parameters: {"code"=>"[coderemoved]", "state"=>"[stateremoved]", "provider"=>"oauth2_basic"}
Deprecation notice: `SiteSetting.anonymous_posting_min_trust_level` has been deprecated. Please use `SiteSetting.anonymous_posting_allowed_groups` instead. (removal in Discourse 3.3)
At /var/www/discourse/lib/site_setting_extension.rb:160:in `public_send`
start
Redirected to https://[pageremoved]
Completed 302 Found in 83ms (ActiveRecord: 0.0ms | Allocations: 11138)
Sadly no steps to repro. It tends to happen when users who have been away for a longer period of time, but I can not confirm it with confidence. It is possible, that there has been a platform upgrade since their last visit.
Any suggestions or additional information I could provide?
I was thinking that, but the error arises way too quickly for it to be normal behaviour. I’m wondering if there might be an overzealous DNS lookup timeout somewhere:
The error is in “resolver.rb”
It is temporarily fixed by refreshing - when the DNS lookup would be cached
For some completely inexplicable reason, I can’t get it to read the OIDC discovery document from any URL that involves our self-hosted DNS. This is despite the fact that I’m perfectly able to curl the file manually from within the docker instance. I’ve eliminated many different variables and the DNS seems to be the only common factor.
Importantly, the Discourse server is able to talk to the OIDC server, even when it fails like this. From the access logs, there is one request:
when it succeeds. Regardless, it never takes longer than 5 seconds. I’m yet to try setting up a proxy for the OIDC server that uses Cloudflare DNS, but that’ll be my next step.
Welp, it’s definitely DNS. Rather than set up a proxy I just added my OIDC server to the hosts file in the docker container and it seems to work now. This is a fragile and suboptimal solution though; I think the developers need to fix the timeout so that it’s something sane. This case reminds me of the 500 mile email story.
In my case, IdP and Discourse VMs as sitting next to each other and while noone can completely exclude possible network issues, no other service experiences it.