Sidekiq queue too large - Google email provider problems

(Chris Tomlinson) #1

My admin panel says that “There are 131453 email jobs that failed.” and “The number of queued jobs is 236488, which is high”.

Anecdotal evidence suggests that my email provider (Google) silently stopped accepting most emails a few weeks ago. Right now there is a high rate of “Job exception: 454 4.7.0 Too many login attempts, please try again later. xxxxxxxxxxx - gsmtp” errors displayed in the log view.

Emails appear to be sent at a rate of about 4 per day at the moment but it is sporadic; I get this figure by looking at the sent mail history of the SMTP account Discourse is configured to connect to. Given that previous topics about sidekiq indicate that there should be a priority given to user signup and password reset emails, it is potentially also notable that since the 4th of June none of these have been sent - only digest emails have succeeded.

The image below demonstrates quite clearly when the problem began.

The forum has 1000 users and low activity so the occasional digest email to most users makes up the bulk of the emails sent out by the system.

I am attempting to find an affordable email provider that doesn’t suck in this special Google way but in the meantime I am concerned at the figures Discourse is reporting regarding this problem and hope someone with more knowledge about Discourse or sidekiq can shed some light on what’s happening.

I guess there is some sort of timeout for email deliverability and those failed jobs relate to emails that have disappeared forever?

Why are the numbers so much higher than the number of emails that the system sends out? The combined total is around 100 times greater than the upper bound of the expected email quantity over a period of a few weeks. Does sidekiq really require 100 jobs for every email that it sends out?!

What will happen when I rebuild and restart Discourse with a new SMTP server? I don’t want every email to be sent 100s of times :smiley:

If Sidekiq is receiving a message to try again later from the email service, why is the rate of login attempts still so high? Shouldn’t it back off exponentially?
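
For reference, here is a minimal sketch of the kind of polynomial backoff Sidekiq's documentation describes for its default retries, roughly `(retry_count ** 4) + 15 + (rand(10) * (retry_count + 1))` seconds. The exact jitter term has varied between Sidekiq versions, and whether Discourse's email jobs use this default schedule is an assumption, not something confirmed in this topic. The function names are hypothetical.

```python
import random


def retry_delay_seconds(retry_count, rng=random):
    """Approximate delay before the next attempt of one failed job.

    retry_count is 0 for the first retry. The formula is an approximation
    of Sidekiq's documented default, not a copy of its source code.
    """
    jitter = rng.randrange(10) * (retry_count + 1)  # 0..9, scaled by attempt
    return retry_count ** 4 + 15 + jitter


def cumulative_delay_bounds(max_retries):
    """Lower and upper bounds (seconds) on the total wait across all retries."""
    lower = sum(c ** 4 + 15 for c in range(max_retries))
    upper = sum(c ** 4 + 15 + 9 * (c + 1) for c in range(max_retries))
    return lower, upper


if __name__ == "__main__":
    for count in (0, 5, 10, 20):
        print(count, retry_delay_seconds(count))
    low, high = cumulative_delay_bounds(25)
    print("25 retries span between", low / 86400, "and", high / 86400, "days")
```

Note that a schedule like this applies to each job independently. If many jobs are queued and each one backs off on its own timer, the provider could still see a steady stream of login attempts even though no individual job retries quickly.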

Does it initiate a new connection for every single email sent out rather than batching through a single SMTP connection? If so, won’t that always upset the upstream SMTP provider since the overhead of that approach is high?
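
To make the distinction concrete, here is a rough sketch of the two approaches using Python's standard `smtplib`. It does not describe how Discourse actually delivers mail (that is the open question here); the helper names, and the idea of passing in a `connect` factory, are hypothetical and only for illustration.

```python
import smtplib
from email.message import EmailMessage


def open_connection(host, port, user, password):
    """Open, secure and authenticate one SMTP session (one login attempt)."""
    conn = smtplib.SMTP(host, port, timeout=30)
    conn.starttls()
    conn.login(user, password)
    return conn


def send_one_connection_per_message(messages, connect):
    """Connect and log in separately for every message; returns login count."""
    logins = 0
    for msg in messages:
        conn = connect()
        logins += 1
        try:
            conn.send_message(msg)
        finally:
            conn.quit()
    return logins


def send_over_shared_connection(messages, connect):
    """Log in once and reuse the session for the whole batch."""
    messages = list(messages)
    if not messages:
        return 0
    conn = connect()
    try:
        for msg in messages:
            conn.send_message(msg)
    finally:
        conn.quit()
    return 1
```

With the first approach, the number of logins the provider sees grows with the number of messages (and with every retry of a failed message), whereas the second keeps it at one per batch. A call might look like `send_over_shared_connection(batch, lambda: open_connection(host, 587, user, password))`, where the port and credentials are placeholders.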

I’m on Discourse (standard docker) version v2.1.0.beta2 +45 and by the way, loving the software so thanks to everyone that’s worked on it and offered support in this forum.

(Jeff Atwood) #2

If you are using Gmail as a general outgoing SMTP mail provider for Discourse you are absolutely in violation of Google’s terms of service, and I’m surprised it works at all.

We repeatedly tell people not to do this in the setup docs, and elsewhere.

(Chris Tomlinson) #3

This is Google Apps rather than Gmail. I didn’t see anything in the Google Apps ToS that prohibited this and didn’t see anything on the Discourse setup docs either despite days of research and careful planning (last year). I did find some recommendations for paid email providers somewhere on the GitHub documentation but none were suitable (e.g. Elasticemail injects invalid unsubscription information into all emails, which causes all email sent through that service to contravene GDPR, and the UK data protection laws that preceded that).

I appreciate the advice and am sorry if I have caused you some frustration by missing something that you felt should have been obvious to me a year ago, although I’m not sure if the specifics of the email provider being used really have much bearing on the overall topic I posted?

(Jeff Atwood) #4

It is possible Google Apps might work. Did you try searching here? I remember topics on it.

(Chris Tomlinson) #5

Yeah I definitely searched here. It was a long time ago now though so I don’t recall exactly which posts I took advice from but the consensus I formed was that it should work. I found a variety of external articles on how to do it too, although I typically wouldn’t blindly follow such articles so I expect I would have combined a few different sources of advice before going ahead with the most appropriate solution for my circumstances. This had been all working just fine for around 9 months since the forum launched until the start of June.

Overall, it seemed as though for quantities of email that are higher than the free tiers of the commercial providers Google offered a good free option and was within their ToS. The sending limits were never going to work forever if the forum got very busy but I have always been far below them (2000 per day) so the “Too many login attempts” error has come as a surprise (and in fact doesn’t tally with the list of expected error messages on that Google limits page).

I can accept that for whatever potentially AI-blackbox reason they have decided to apply some other form of secret limit to the account (and be upset with them for the lack of notification, etc.) but ultimately I’m trying to be pragmatic in abandoning that service now and just want to get a better understanding of the impact the prolonged email deliverability delay will have on my forum when I switch provider to re-enable email.

If the technical answer is that Discourse should be creating all of these sidekiq jobs in the event of a prolonged email outage and that it will recover gracefully when given an opportunity, that’s cool - it just all seems a little out of the ordinary to me and I thought that at the very least the information might act as a useful data point about a particular failure mode.

(Chris Tomlinson) #6

Some questions remain unanswered but the good news is that I have the forum emails working again with Amazon SES and can answer that the queue length bears little resemblance to the number of emails that were sent.

A queue of around 2.5M sidekiq items resulted in about 1500 emails being sent from Discourse, with no evidence of duplicates on the SES dashboard.

To get the last batch of “retry” jobs to clear (125K of them) I went to the sidekiq “retry” page and clicked on “retry all” at the bottom. It asked “am I sure?” I had no idea what I was supposed to be sure of but now I do - the action appears to be synchronous so I hit a 502 error shortly afterwards. Despite this, a lot of the jobs were processed in that time so I just repeated the process a dozen times to get everything back to normal.

The task backlog took around 3 hours to clear on a DigitalOcean $15/month 2-core/2 GB server.