Users receive multiple summary emails when sidekiq queue is >30 minutes

I’ve got a high-volume system with 24GB ram, a 30GB on disk database, and 2M page views/month. There were some bumps as the site got migrated and updated to the new image format, but last I looked it appeared to be running smoothly according to those familiar with the site.

Earlier today I noticed 25K sidekiq jobs. They got processed pretty quickly, but sidekiq reported a 10Ks latency.

A user reported that they had received 5 summary emails in the space of an hour. Here’s the mail-sent graph:

Here’s grafana stats for the day.

The site was upgraded a couple days back when I installed the prometheus exporter plugin.

There were complaints of multiple summary emails earlier, but I chalked that up to the site being swamped due to the image processing.

Does anyone have an explanation for why a user might receive multiple summary emails in a day?

I found a user who has weekly summaries set in their settings. That user received 3-7 summary emails per day over the past few days.

Edit: No I didn’t. But I did find a user who was sent four digests today (in that list of 300 messages):

image

Here’s the query in case I did something stupid:

SELECT user_id, created_at
from email_logs 
WHERE email_type='digest'
AND created_at > now() - interval '4 day' 
ORDER BY user_id, created_at DESC
LIMIT 300

There are more than a dozen users with 3-6 digests today.

Database corruption, maybe? We have no other systemic reports of digest emails going out inappropriately.

Did you check and make sure this user doesn’t have anything weird in their user settings for the activity summary (aka “digest”)

The ones I inspected had “weekly” checked.

I can have a closer look at the rails console.

Any other idea of what kind of corruption could cause this?

.

Zero idea but we have no other reports of this, so it’s specific to the migration or config somehow.

Could there be multiple instances running or multiple sidekiqs or anything bizarre like that? :man_shrugging:

Something like multiple sidekiq is something I considered, or somehow sidekiq is slow so it (or postgres) doesn’t report that the digest has been created when it checks again?

Remember that we were concerned about disk performance. Is it possible that postgres could somehow not commit that the digest has been sent and another one gets started before the last one gets finished?

I’ll double check that there isn’t somehow an extra container running somehow.

EDIT: I see just the data, web, and mail-receiver containers.

I checked the mail logs for the user that complained. There were four digests yesterday:

  created_at: Thu, 21 Mar 2019 15:11:33 UTC +00:00,
  created_at: Thu, 21 Mar 2019 15:58:46 UTC +00:00,
  created_at: Thu, 21 Mar 2019 17:15:29 UTC +00:00,
  created_at: Thu, 21 Mar 2019 18:35:39 UTC +00:00,

Each took had the same time for created_at as updated_at. UserOptions look normal.

  digest_after_minutes: 10080,
1 Like

I think I’ve identified a race condition. I know that the sidekiq queue had a 10K second (~3 hour) latency. Messages are queued every 30 minutes:

https://github.com/discourse/discourse/blob/master/app/jobs/scheduled/enqueue_digest_emails.rb#L3-L4

So a digest gets queued in an hour-long queue. Half an hour later we test for whether to queue another digest, checking the last time that a message was sent:

https://github.com/discourse/discourse/blob/master/app/jobs/scheduled/enqueue_digest_emails.rb#L22

The last digest is queued, but not sent, so last_emailed_at isn’t updated, so another message gets queued.

The solutions that I see are:

  1. this site gets faster hardware (beyond my control)
  2. target_user_ids somehow searches the queue to see if there is a queued, unsent digest (seems really hard and won’t get into core in my lifetime).
  3. Have the process that sends the digest do another check of last_emailed_at before sending the digest (I think this might be the best long-term solution, but haven’t found that code)
  4. I add a hook to modify enqueue_digest_emails.rb to make every 30.minutes something like 4.hours (6 hours?) and hope that is enough time for the queue to get processed before it runs again.

Number 4 above seems like my only immediate solution. Is there some deleterious side effect that I am overlooking?

(I did recently increase db_shared_buffers from 10GB to 16GB for the 30GB on disk database. This might help, but I suspect that it won’t be enough. There are ~25K users at TL1 and above.)

6 Likes

Entirely plausible, I suggest running basic disk perf stats, if your disks are slow (non-SSD) and you have a lot of activity and a large database… it isn’t going to be pretty.

We both know that the disk performance is insufficient. That is beyond my control.

I guess I’ll up the digest queue to 4 hours. I don’t see an alternative.

2 Likes

I think this is a good solution too. Summary emails (and many others) are put into the low-priority queue, so are most likely to be delayed when sidekiq has a lot of jobs to process. I don’t think it’s too much of a hack to check last_emailed_at if it’s a digest email in here.

5 Likes

Thanks, @neil. So something like

      return if type.to_s == "digest" && user.last_emailed_at < user.digest_after_minutes

right about

https://github.com/discourse/discourse/blob/master/app/jobs/regular/user_email.rb#L101

?

2 Likes

Yup, that’s about right.

return if type.to_s == "digest" &&
  user.last_emailed_at < user.user_option.digest_after_minutes.minutes.ago
3 Likes

Aha! .ago. Obvious once I see it.

Should I submit a PR for this two-liner or will it be faster for you to just do it?

That’d be awesome if you could also include a test :wink:

Offer to give a guy an inch. . . :wink:

A test, huh? Having never written one, I think it’ll take me a couple hours to pull that off. It’s not too soon for me to learn how to do that, though. Maybe next week.

I can do it and then you can be my accomplice if something breaks. :stuck_out_tongue:

4 Likes

I think I"m a few minutes away from figuring out how to write the test. . . but then I’ll still have to figure out how to run it.

But I’d be honored to be your accomplice.

And if you’re going to write the test, then I’ll go back to writing bash tab completion for ./launcher. It’s been bugging me for years.

3 Likes

Turns out there’s a similar test to the one I wanted to write that has a problem… It isn’t actually testing anything, so I should fix that too.

  it "doesn't call the mailer when the user is missing" do
    Jobs::UserEmail.new.execute(type: :digest, user_id: 1234)

    expect(ActionMailer::Base.deliveries).to eq([])
  end

^ Passes for valid user_id values too.

5 Likes

This is getting way above my pay grade. :slight_smile:

Fix is here:

https://github.com/discourse/discourse/commit/399e937a38f4b4917e0c44897c308d39d4a844e1

9 Likes