Users receive multiple summary emails when sidekiq queue is >30 minutes

pfaffman · March 21, 2019, 10:15pm

I’ve got a high-volume system with 24GB ram, a 30GB on disk database, and 2M page views/month. There were some bumps as the site got migrated and updated to the new image format, but last I looked it appeared to be running smoothly according to those familiar with the site.

Earlier today I noticed 25K sidekiq jobs. They got processed pretty quickly, but sidekiq reported a 10Ks latency.

A user reported that they had received 5 summary emails in the space of an hour. Here’s the mail-sent graph:

Here’s grafana stats for the day.

The site was upgraded a couple days back when I installed the prometheus exporter plugin.

There were complaints of multiple summary emails earlier, but I chalked that up to the site being swamped due to the image processing.

Does anyone have an explanation for why a user might receive multiple summary emails in a day?

pfaffman · March 21, 2019, 10:28pm

~~I found a user who has weekly summaries set in their settings. That user received 3-7 summary emails per day over the past few days.~~

Edit: No I didn’t. But I did find a user who was sent four digests today (in that list of 300 messages):

Here’s the query in case I did something stupid:

SELECT user_id, created_at
from email_logs 
WHERE email_type='digest'
AND created_at > now() - interval '4 day' 
ORDER BY user_id, created_at DESC
LIMIT 300

There are more than a dozen users with 3-6 digests today.

codinghorror · March 22, 2019, 6:44am

Database corruption, maybe? We have no other systemic reports of digest emails going out inappropriately.

Did you check and make sure this user doesn’t have anything weird in their user settings for the activity summary (aka “digest”)

pfaffman · March 22, 2019, 9:27am

The ones I inspected had “weekly” checked.

I can have a closer look at the rails console.

Any other idea of what kind of corruption could cause this?

.

codinghorror · March 22, 2019, 9:28am

Zero idea but we have no other reports of this, so it’s specific to the migration or config somehow.

Could there be multiple instances running or multiple sidekiqs or anything bizarre like that?

pfaffman · March 22, 2019, 9:38am

Something like multiple sidekiq is something I considered, or somehow sidekiq is slow so it (or postgres) doesn’t report that the digest has been created when it checks again?

Remember that we were concerned about disk performance. Is it possible that postgres could somehow not commit that the digest has been sent and another one gets started before the last one gets finished?

I’ll double check that there isn’t somehow an extra container running somehow.

EDIT: I see just the data, web, and mail-receiver containers.

I checked the mail logs for the user that complained. There were four digests yesterday:

  created_at: Thu, 21 Mar 2019 15:11:33 UTC +00:00,
  created_at: Thu, 21 Mar 2019 15:58:46 UTC +00:00,
  created_at: Thu, 21 Mar 2019 17:15:29 UTC +00:00,
  created_at: Thu, 21 Mar 2019 18:35:39 UTC +00:00,

Each took had the same time for created_at as updated_at. UserOptions look normal.

  digest_after_minutes: 10080,

pfaffman · March 22, 2019, 11:38am

I think I’ve identified a race condition. I know that the sidekiq queue had a 10K second (~3 hour) latency. Messages are queued every 30 minutes:

https://github.com/discourse/discourse/blob/master/app/jobs/scheduled/enqueue_digest_emails.rb#L3-L4

So a digest gets queued in an hour-long queue. Half an hour later we test for whether to queue another digest, checking the last time that a message was sent:

https://github.com/discourse/discourse/blob/master/app/jobs/scheduled/enqueue_digest_emails.rb#L22

The last digest is queued, but not sent, so last_emailed_at isn’t updated, so another message gets queued.

The solutions that I see are:

this site gets faster hardware (beyond my control)
target_user_ids somehow searches the queue to see if there is a queued, unsent digest (seems really hard and won’t get into core in my lifetime).
Have the process that sends the digest do another check of last_emailed_at before sending the digest (I think this might be the best long-term solution, but haven’t found that code)
I add a hook to modify enqueue_digest_emails.rb to make every 30.minutes something like 4.hours (6 hours?) and hope that is enough time for the queue to get processed before it runs again.

Number 4 above seems like my only immediate solution. Is there some deleterious side effect that I am overlooking?

(I did recently increase db_shared_buffers from 10GB to 16GB for the 30GB on disk database. This might help, but I suspect that it won’t be enough. There are ~25K users at TL1 and above.)

codinghorror · March 22, 2019, 11:43am

Entirely plausible, I suggest running basic disk perf stats, if your disks are slow (non-SSD) and you have a lot of activity and a large database… it isn’t going to be pretty.

pfaffman · March 22, 2019, 11:59am

We both know that the disk performance is insufficient. That is beyond my control.

I guess I’ll up the digest queue to 4 hours. I don’t see an alternative.

neil · March 22, 2019, 1:28pm

I think this is a good solution too. Summary emails (and many others) are put into the low-priority queue, so are most likely to be delayed when sidekiq has a lot of jobs to process. I don’t think it’s too much of a hack to check last_emailed_at if it’s a digest email in here.

pfaffman · March 22, 2019, 2:28pm

Thanks, @neil. So something like

      return if type.to_s == "digest" && user.last_emailed_at < user.digest_after_minutes

right about

https://github.com/discourse/discourse/blob/master/app/jobs/regular/user_email.rb#L101

?

neil · March 22, 2019, 2:44pm

Yup, that’s about right.

return if type.to_s == "digest" &&
  user.last_emailed_at < user.user_option.digest_after_minutes.minutes.ago

pfaffman · March 22, 2019, 2:47pm

Aha! .ago. Obvious once I see it.

Should I submit a PR for this two-liner or will it be faster for you to just do it?

zogstrip · March 22, 2019, 2:53pm

That’d be awesome if you could also include a test

pfaffman · March 22, 2019, 2:57pm

Offer to give a guy an inch. . .

A test, huh? Having never written one, I think it’ll take me a couple hours to pull that off. It’s not too soon for me to learn how to do that, though. Maybe next week.

neil · March 22, 2019, 3:11pm

I can do it and then you can be my accomplice if something breaks.

pfaffman · March 22, 2019, 3:17pm

I think I"m a few minutes away from figuring out how to write the test. . . but then I’ll still have to figure out how to run it.

But I’d be honored to be your accomplice.

And if you’re going to write the test, then I’ll go back to writing bash tab completion for ./launcher. It’s been bugging me for years.

neil · March 22, 2019, 3:19pm

Turns out there’s a similar test to the one I wanted to write that has a problem… It isn’t actually testing anything, so I should fix that too.

  it "doesn't call the mailer when the user is missing" do
    Jobs::UserEmail.new.execute(type: :digest, user_id: 1234)

    expect(ActionMailer::Base.deliveries).to eq([])
  end

^ Passes for valid user_id values too.

pfaffman · March 22, 2019, 3:24pm

This is getting way above my pay grade.

neil · March 22, 2019, 4:35pm

Fix is here:

https://github.com/discourse/discourse/commit/399e937a38f4b4917e0c44897c308d39d4a844e1

Topic		Replies	Views
Sidekiq Error Many Summary Mail sent failed Support	0	360	November 15, 2021
Digests all sending at the same time Support activity-summary	6	805	September 22, 2018
More then 3 million enqueued jobs show in sidekiq? How to I stop it generate the queue? Support	17	225	March 6, 2025
Posts in rapid succession do not send all emails Support	2	408	November 29, 2021
Summary email limit sent in 1 hour Support	13	1420	December 11, 2020

Users receive multiple summary emails when sidekiq queue is >30 minutes

Related topics