I found a user who has weekly summaries set in their settings. That user received 3-7 summary emails per day over the past few days.
Edit: No, I didn’t. But I did find a user who was sent four digests today (in that list of 300 messages):
Here’s the query in case I did something stupid:
SELECT user_id, created_at
FROM email_logs
WHERE email_type = 'digest'
  AND created_at > now() - interval '4 days'
ORDER BY user_id, created_at DESC
LIMIT 300
There are more than a dozen users with 3-6 digests today.
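To put a number on that from the Rails console, a per-user count is straightforward; this is a sketch assuming Discourse’s EmailLog model maps onto the email_logs table:

# Count digests per user over the last day; anyone with more than one
# probably hit the bug. (Assumes the EmailLog model; run in `rails c`.)
EmailLog.where(email_type: "digest")
        .where("created_at > now() - interval '1 day'")
        .group(:user_id)
        .count
        .select { |_user_id, n| n > 1 }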
Multiple Sidekiq processes is something I considered; or maybe Sidekiq is slow enough that it (or Postgres) doesn’t report that the digest has been created by the time it checks again?
Remember that we were concerned about disk performance. Is it possible that Postgres somehow doesn’t commit that the digest has been sent, and another one gets started before the last one finishes?
I’ll double-check that there isn’t somehow an extra container running.
EDIT: I see just the data, web, and mail-receiver containers.
I checked the mail logs for the user who complained. There were four digests yesterday:
created_at: Thu, 21 Mar 2019 15:11:33 UTC +00:00,
created_at: Thu, 21 Mar 2019 15:58:46 UTC +00:00,
created_at: Thu, 21 Mar 2019 17:15:29 UTC +00:00,
created_at: Thu, 21 Mar 2019 18:35:39 UTC +00:00,
Each had the same time for created_at as for updated_at. UserOptions look normal.
So a digest gets queued into a queue that takes about an hour to drain. Half an hour later, we test whether to queue another digest by checking the last time a message was sent. The last digest is queued but not sent, so last_emailed_at isn’t updated, and another message gets queued.
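To make the race concrete, here’s a stripped-down sketch of the enqueue half. The class name and the every 30.minutes DSL approximate Discourse’s Jobs::EnqueueDigestEmails scheduled job, but the body is my illustration, not the actual core code:

class Jobs::EnqueueDigestEmails < Jobs::Scheduled
  every 30.minutes

  def execute(args)
    # target_user_ids selects on last_emailed_at, which is only updated when
    # a digest is actually SENT. A digest that is queued but unsent leaves
    # last_emailed_at stale, so the same user qualifies again on the next run.
    target_user_ids.each do |user_id|
      Jobs.enqueue(:user_email, type: :digest, user_id: user_id)
    end
  end
end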
The solutions that I see are:
1. This site gets faster hardware (beyond my control).
2. target_user_ids somehow searches the queue to see if there is a queued, unsent digest (seems really hard and won’t get into core in my lifetime).
3. Have the process that sends the digest re-check last_emailed_at before sending (I think this might be the best long-term solution, but I haven’t found that code).
4. Add a hook to modify enqueue_digest_emails.rb, changing every 30.minutes to something like 4.hours (6 hours?), and hope that’s enough time for the queue to get processed before it runs again (see the sketch after this list).
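For number 4, the smallest patch I can think of is reopening the scheduled job from a plugin and re-declaring the interval. This is a hedged sketch: it assumes that calling the every DSL again overrides the original interval, which I haven’t verified against the scheduler:

# plugin.rb (hypothetical plugin): stretch the digest-enqueue interval so the
# low-priority queue has time to drain between runs.
after_initialize do
  class ::Jobs::EnqueueDigestEmails
    every 6.hours # core uses 30.minutes
  end
end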
Number 4 above seems like my only immediate solution. Is there some deleterious side effect that I am overlooking?
(I did recently increase db_shared_buffers from 10GB to 16GB for the 30GB on-disk database. This might help, but I suspect it won’t be enough. There are ~25K users at TL1 and above.)
Entirely plausible. I suggest running basic disk perf stats; if your disks are slow (non-SSD) and you have a lot of activity and a large database… it isn’t going to be pretty.
I think this is a good solution too. Summary emails (and many others) are put into the low-priority queue, so they are the most likely to be delayed when Sidekiq has a lot of jobs to process. I don’t think it’s too much of a hack to check last_emailed_at if it’s a digest email in here.
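A minimal sketch of what that guard might look like, imagined at the top of Jobs::UserEmail#execute; the one-hour threshold and the exact shape of the check are my assumptions, not the current code:

def execute(args)
  user = User.find_by(id: args[:user_id])
  return if user.nil?

  # If another digest went out while this job sat in the low-priority queue,
  # last_emailed_at is now recent, so bail out instead of doubling up.
  return if args[:type].to_s == "digest" &&
            user.last_emailed_at &&
            user.last_emailed_at > 1.hour.ago

  # ... existing build-and-deliver logic ...
end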
A test, huh? Having never written one, I think it’ll take me a couple of hours to pull that off. It’s not too soon for me to learn how to do that, though. Maybe next week.
Turns out there’s a test similar to the one I wanted to write, and it has a problem… it isn’t actually testing anything, so I should fix that too:
it "doesn't call the mailer when the user is missing" do
Jobs::UserEmail.new.execute(type: :digest, user_id: 1234)
expect(ActionMailer::Base.deliveries).to eq([])
end
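And here’s roughly the test I’d like to end up with for the fix itself. It assumes Discourse’s Fabricate fixtures and a job that skips a digest when last_emailed_at is recent; both the guard and the fixture attributes are assumptions on my part:

it "doesn't send a digest when one was sent recently" do
  user = Fabricate(:user, last_emailed_at: 10.minutes.ago)
  Jobs::UserEmail.new.execute(type: :digest, user_id: user.id)
  expect(ActionMailer::Base.deliveries).to be_empty
end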