After a rebuild today on 3.3.3 (the latest stable, released a few days ago), SSO stopped working. Users who are still logged in are fine for now, but new sessions end the SSO flow with the error:
Account login timed out, please try logging in again.
Enabling verbose Discourse Connect logging shows:
Verbose SSO log: Nonce is incorrect, was generated in a different browser session, or has expired
However, nothing in our SSO flow has changed in recent years, and the clocks on the servers are in sync.
On the other hand, we very recently updated from 3.3.2 to 3.3.3, which includes security fixes related to Discourse Connect, so that could be the cause.
Probably not relevant, but the rebuild was to enable a CDN. I have already reverted all those changes and the SSO issue remains.
After several rebuilds, I was able to make SSO work again by pinning back to v3.3.2, so it does seem that something introduced in v3.3.3 broke SSO support.
I had a cursory look at a git diff v3.3.2 v3.3.3 and nothing obvious jumped out, but it does have changes related to Discourse Connect.
Still, I suspect this will start hitting more people as they move to 3.3.3 and user sessions start to expire and fail to renew. Maybe worth a closer look by someone who knows the code, especially the SSO flow? /cc @sam
PS: Not sure whether it is relevant: I had updated to 3.3.3 over a day ago, but the issues only came up soon after a rebuild via the console a few hours ago (to enable a CDN, but reverting that didn’t fix SSO).
Yes, in the sense that most people run the tests-passed branch, but no, in the sense that it’s the latest release on the stable branch, shipped this week: 3.3.3: Security and maintenance release
It’s a long shot, but are you generating the nonce in a different browser session, for example by making the SSO requests from the backend of your application, instead of having users go through the SSO process using browser redirects?
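To make the question concrete, here is a minimal sketch of what the provider side of a browser-redirect flow usually looks like: it checks the HMAC-SHA256 signature on the incoming payload, reads the nonce that Discourse generated, and echoes that same nonce back in a signed reply. The secret, the user dict, and the function names are placeholders of mine, not anything taken from your setup.

```python
import base64
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode

# Placeholder: in a real setup this is the shared DiscourseConnect secret.
SECRET = b"example-shared-secret"


def sign(payload: str) -> str:
    """Hex HMAC-SHA256 of the base64 payload, keyed with the shared secret."""
    return hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()


def handle_sso_request(sso: str, sig: str, user: dict) -> str:
    """Validate the incoming payload and build the query string to send back."""
    if not hmac.compare_digest(sign(sso), sig):
        raise ValueError("bad signature")

    # The payload is a base64-encoded query string that carries the nonce.
    fields = parse_qs(base64.b64decode(sso).decode())
    nonce = fields["nonce"][0]

    # The reply must carry the nonce from this request, not one obtained elsewhere.
    reply = urlencode(
        {"nonce": nonce, "email": user["email"], "external_id": user["id"]}
    )
    reply_b64 = base64.b64encode(reply.encode()).decode()
    return urlencode({"sso": reply_b64, "sig": sign(reply_b64)})
```

The returned string is what gets appended to the return URL that the browser is redirected to. If the nonce in that reply came from a request your backend made on its own, it would not belong to the user's browser session.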
There’s a hidden site setting called discourse_connect_csrf_protection that is enabled by default. To allow SSO requests to be made from outside of a user’s session, it needs to be disabled.
I’m assuming that setting was already in place in version 3.3.2, but it’s possible it was added later.
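As a rough illustration of why a nonce can be rejected even when it is well-formed, here is a toy model of a nonce that is tied to a browser session and expires. This is not Discourse's code; the class, the return strings, and the expiry value are all made up for the example.

```python
import secrets
import time

NONCE_TTL_SECONDS = 600  # hypothetical expiry, not Discourse's actual value


class NonceStore:
    """Toy model: each nonce remembers which session asked for it and when."""

    def __init__(self, now=time.time):
        self._now = now
        self._nonces = {}  # nonce -> (session_id, issued_at)

    def issue(self, session_id: str) -> str:
        nonce = secrets.token_hex(16)
        self._nonces[nonce] = (session_id, self._now())
        return nonce

    def check(self, nonce: str, session_id: str) -> str:
        entry = self._nonces.pop(nonce, None)  # a nonce can be used only once
        if entry is None:
            return "unknown"
        owner, issued_at = entry
        if self._now() - issued_at > NONCE_TTL_SECONDS:
            return "expired"
        if owner != session_id:
            return "different session"
        return "ok"
```

In this model, a nonce issued to one session and presented by another fails with "different session", which is the case a backend-initiated request would hit. The same result follows whenever the session that comes back from the provider is not the one that started the login.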
We are not doing anything unusual with SSO, but I tried that anyway by disabling the setting in the Rails console. All it did was remove the error message: when the SSO provider redirected back to Discourse, instead of the “Account login timed out, please try logging in again.” error there was no message at all (error or otherwise), and the user was still logged out.
I’m also grasping at straws here, as this is quite odd. I think the fact that the issue didn’t show up when we initially updated to 3.3.3 via the web interface, but only later (~36h) after a console rebuild, may be a clue, but I don’t know enough about the differences between the two.
I tried upgrading to 3.3.3 again and the issue returned immediately. Switching back to 3.3.2 made SSO work again.
I suspect the issue here is not the DiscourseConnect security fix but rather the nginx change. On tests-passed we had to make a follow-up on Thursday because it was causing problems in some environments, and another user on GitHub noted CSRF issues.
I appreciate that you took the time to look into this, especially on a weekend and given that both stable and SSO are a bit niche. Hopefully it will help others too. Thank you!