Extreme memory usage due to bad mail credentials

I’ve got a Discourse server which has been failing a lot over recent days, and I’m looking for some help in understanding the problem and fixing it.

I have a lot of OOM killer events, some of which have taken the server down such that a reboot has been required even to get access to ssh.

Disk space is low, with the Redis data on disk having grown to around 20GB at present.
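
For anyone hitting the same thing, the checks I’ve been using amount to roughly this (assuming the standard /var/discourse layout, which is where the Redis files turn out to live, as below):

dmesg -T | grep -i 'killed process'                  # recent OOM killer victims
du -sh /var/discourse/shared/standalone/redis_data   # size of the Redis files on disk
df -h                                                # overall disk usage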

CPU usage is unusually high. Whenever I’ve looked, the culprit is a single ruby process. Watching its activity with strace shows very little (i.e. it’s not busy with any kind of system calls). Watching with ltrace, I see a lot of calls to malloc, memcpy and strlen. I can see some of the contents of the data being manipulated, and it looks like junk. E.g.

memcpy(0xc000da80, "\300\301\302\303\304\305\306\307\310\311\312\313\314\315\316\317\320\321\322\323\324\325\326\327\330\331\332\333\334\335\336\337"..., 87) = 0xc000da80

Note the sequence of byte values.
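
For reference, this is roughly how I’ve been poking at the process (the PID is a placeholder for whichever ruby process shows up at the top):

ps aux --sort=-%cpu | head    # identify the busy ruby process
strace -f -p <PID>            # system calls: almost nothing here
ltrace -p <PID>               # library calls: lots of malloc/memcpy/strlen, as above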

Does anyone know what’s likely to be going on here, or have any suggestions on how to interrogate this further?

I gather the redis data is cache data rather than primary data? Can I safely wipe it to get the server back up? Is there a preferred procedure?

Rebuilding the docker container effectively does this, but is a slower procedure than is presumably necessary.

Do you have a lot of jobs in your sidekiq queues? Did this happen suddenly?

You could consider solving it the hard way by clearing everything in redis. You will lose all pending jobs in sidekiq though. Start up redis-cli and then enter the command flushdb. After that, restart Discourse.
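
Something like this, assuming the standard standalone install where Redis runs inside the app container:

cd /var/discourse
./launcher enter app      # shell inside the app container
redis-cli flushdb         # drops everything in the current Redis database, including pending Sidekiq jobs
exit
./launcher restart app    # then restart Discourse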

Yes, there are a lot of jobs there. Looks like SMTP authentication errors.

Thanks. That’s likely to be enough for me to solve this.

Unless there’s more to this than I’ve understood, it looks like a predictable failure pattern when SMTP is not set up correctly: digest emails get queued and retried indefinitely until redis/sidekiq falls over.

A better approach would be to have a limit on the number of retries of email jobs, and an alert in the dashboard if SMTP is consistently failing for more than some threshold amount of time.

flushdb removes the sidekiq jobs, but doesn’t recover the disk space in /var/discourse/shared/standalone/redis_data. That directory mostly consists of files with names like temp-19685.rdb.

On the strength of this serverfault query, I’m deleting the files. I’m mostly posting this for the sake of anyone who comes after, and I’ll post something if there are any ill effects.
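
For the record, what I’m doing amounts to this (as I understand it, the temp-*.rdb files are leftovers from background saves that never completed, e.g. because the saving process was OOM-killed; dump.rdb is the live snapshot and stays put):

cd /var/discourse/shared/standalone/redis_data
ls -lh            # dump.rdb plus a pile of temp-*.rdb files
rm temp-*.rdb     # delete only the abandoned save files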

Unlikely, as failure to set up email correctly is easily the number one all-time support request for Discourse.

I think there was something unique about your configuration problem, perhaps?

But… there is such an alert.

That sounds reasonable, and suggests there’s another component to what’s going on.

I don’t recall changes to the app configuration file, and it’s dated May 2018.

Another 18K sidekiq jobs have accumulated. In the low-priority sidekiq queue, they look like:

Jobs::UserEmail                 {"type"=>"digest", "user_id"=>809, "current_site_id"=>"default"} 

Under retries, I see entries like:

just now     0     low     Jobs::UserEmail     {"type"=>"digest", "user_id"=>2101,    "current_site_id"=>"default"}    Jobs::HandledExceptionWrapper: Wrapped Net::SMTPAuthenticationError: 435 4.7.8 Error: authentication failed: 
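
If it’s useful to anyone else, the queue and retry counts can also be read straight out of Redis (Sidekiq keeps each queue in a list named queue:<name> and the retries in a sorted set named retry), e.g. from inside the container:

redis-cli llen queue:low       # jobs waiting in the low-priority queue
redis-cli llen queue:default   # jobs waiting in the default queue
redis-cli zcard retry          # jobs scheduled for retry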

shared/standalone/log/rails/sidekiq.log is empty.

The current shared/standalone/log/rails/production.log-20190612 file has entries like:

  Rendered user_notifications/digest.text.erb (122.4ms)
  Rendering user_notifications/digest.html.erb
  Rendered user_notifications/digest.html.erb (103.7ms)
  Rendering user_notifications/digest.text.erb
  Rendered user_notifications/digest.text.erb (114.9ms)
Sent mail to example@example.com (2204.7ms)
Job exception: 435 4.7.8 Error: authentication failed: 

Curiously, I don’t see those entries in the production.log file. Maybe it takes more retry attempts before they appear there?
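
Counting matches per rotated log gives a rough timeline of the failures:

cd /var/discourse/shared/standalone/log/rails
grep -c 'authentication failed' production.log*    # per-file count of SMTP auth failures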

In the containers/app.yml file, the DISCOURSE_SMTP_* details point to smtp.mandrillapp.com on port 587. I can see a good deal of traffic there. Looking at the packets, the connection uses STARTTLS, so I can’t see much more than the handshake. I might be able to get at the rest via strace if it’s important.

The API key that’s being used as the password with mandrillapp doesn’t appear in its list of enabled keys. I don’t know how that happened, but it makes the SMTP authentication errors unsurprising.
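
For anyone wanting to check credentials outside Discourse, something like the following exercises the same SMTP conversation by hand (the username, API key and EHLO name here are placeholders):

# build the AUTH PLAIN token from the username and API key
printf '\0smtp-username\0api-key' | base64

# talk to the same host and port Discourse is configured to use
openssl s_client -starttls smtp -crlf -connect smtp.mandrillapp.com:587
EHLO test.example.com
AUTH PLAIN <token from above>
# a revoked or wrong key gets the same "435 4.7.8 Error: authentication failed" as in the logs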

I expect that I can fix my site, but I am still a bit concerned that these errors weren’t being dealt with better.

I didn’t see one.

I’ve changed the SMTP details now, and am rebuilding the container. Would I still expect to be able to see evidence of that alert if it has been triggered previously?

The old jobs weren’t supposed to pile up, but did. Whether or not there was a dashboard message, these errors were not handled in the expected way.

I didn’t see a dashboard warning, but I’m not a regular user of the site, and it’s possible I just missed it. I’ve asked the client I run the server for whether they’ve seen anything about the SMTP issue on the admin dashboard. I’ll pass that on when I know.

I used to receive the Summary emails for the site, but haven’t recently. Looking back, it appears that the last of those was 21 April 2019.

A test email now gets through OK using the new SMTP details. The sidekiq queues are empty.

If you don’t log in to the site (and don’t have other monitoring set up) you can’t know if it stops being able to send email. There isn’t a way to program around an admin who ignores the site altogether.

Sure. It’s not that there isn’t a site admin, it’s just that I’m mostly not that person. I provide sysadmin support.

I did have a fairly quick look over the admin dashboard when I noticed problems. It told me that the version wasn’t the latest (one beta version out of date), and I updated accordingly. I didn’t note anything about emails, but perhaps I didn’t look in the right place.

Hmm. That “you have lots of sidekiq jobs” warning has been around for a while, but not forever. Was the site waaay out of date?

Also, while getting mail configured in the first place is something many people find really hard, it’s fairly unusual for it to stop working.

It was only a little out of date: 2.3.0.beta10 vs 2.3.0.beta11. The server, however, is quite old and still running Ubuntu 14.04, which is out of support. That means the Docker version is also out of date, and I did see one issue here which made me wonder if that could be related. Seems less likely now.

Indeed. Usually email would be working before users subscribe and content starts flowing, so digest emails piling up significantly would be unlikely.

My best guess here is that the authentication API key got inadvertently deleted at the SMTP service, and these emails then started piling up.

It occurs to me that part of the puzzle might be processes which didn’t complete due to memory constraints. E.g. if raising an alert is something done at the end of a large sidekiq job which failed due to lack of memory, then it might not happen at all. Similarly, if a sidekiq operation involved first attempting delivery of all the queued emails and then cleaning up the ones with too many retries later in the same job, the cleanup would not happen.
