CleanUpUploads job never completes, leading to Sidekiq hanging and restarting


(Andrew Waugh) #1

We were having problems with backup jobs failing after Sidekiq restarts, so we increased UNICORN_SIDEKIQ_MAX_RSS as suggested here.

This helped for a while, but since our latest updates (we’re at 3725fd8 right now) we’re seeing:

Incorrect information in /admin:

  • The latest backup it reports is not actually present in /backups
  • “A check for updates has not been performed lately. Ensure sidekiq is running.” (Sidekiq IS running, “latest” is blank, but /admin/upgrade shows “New Version Available!”)

Sidekiq restarts because of “using too much memory” about every half hour.
The backup files do not appear in /backups.
I’m seeing “Job exception: can’t modify frozen String” and “Job exception: deadlock; recursive locking” in /logs, which I believe are new since we started this round of updates.

We’ve tried:

  • ./launcher rebuilds (a couple of times)
  • Changing UNICORN_SIDEKIQ_MAX_RSS up to 3000 (it’s back at 1000 now; see the check below)
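(For anyone with console access, the value actually in effect can be read back from inside the running container; a minimal check:)

    # Inside the app container (./launcher enter app, then rails c):
    # the unicorn master reads this same environment variable, in MB.
    puts ENV["UNICORN_SIDEKIQ_MAX_RSS"]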

Our last successful backup was on the 16th; for about the past week, backups have only run properly about half the time.

What can we try next?


(Matt Palmer) #2

Hmm, that’s the same error as was reported in this topic. Wonder if they’re related?


(Andrew Waugh) #3

Could be, but then mails wouldn’t be transferred, no?

Sidekiq falls over every 30 minutes, like clockwork.


(Andrew Waugh) #5

Does your sidekiq restart every 30 min, exactly?


(jj11909) #6

Sorry mate, I retract my previous statement.

My issue was: Polling seems to have died after message


(Andrew Waugh) #7

After a droplet reboot and yet another full rebuild, our sidekiq is still misbehaving every half hour.

Automated backups do not run at all, but manual ones do (and upload to S3).

Since the last rebuild, /admin now shows the correct information:

/logs (filtered on “Sidekiq”):

Can anyone provide any insight or suggestions as to what we might try to resolve this issue?


(Jeff Atwood) #8

I suspect plugins, maybe? What non-official, third-party plugins do you have in place?


(Andrew Waugh) #9

I have no SSH access. We’re running:
[screenshot of installed plugins]

On a side note, I’ve noticed a couple of things which seem to be a result of the Sidekiq restarts:

Notifications are sometimes repeated or delayed (not surprising).

At least one user stopped getting emails for about 10 days; then, after the last rebuild, they started going out again (the logs repeatedly indicated that he’d hit the 100/day limit).

(Both of the above are consistent with what I would expect to happen with counters which get out of sync if the update procedure fails.)


(Jeff Atwood) #10

Almost certainly a rogue plugin is at fault; otherwise we would be seeing similar sidekiq memory bloat ourselves, and we are not seeing that.


(Sam Saffron) #11

Sidekiq going out of memory is very likely due to a scheduled job. Since there are so many plugins here, it is hard for me to guess which one it is. Maybe mlm-daily-summary; that sounds like something that could be building big strings.

I would recommend first stripping this down to official plugins only and then building it back up to see how memory goes.
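One way to test a single suspect, assuming console access, is to run the job by hand and watch the process RSS before and after; a rough sketch (the job class below is a stand-in, not necessarily the plugin’s real class name):

    # Rough console experiment (rails c inside the container): run one
    # suspect scheduled job by hand and compare process RSS before and after.
    def rss_mb
      `ps -o rss= -p #{Process.pid}`.to_i / 1024  # resident set size, in MB
    end

    before = rss_mb
    Jobs::MailingListDailySummary.new.execute({})  # hypothetical class name; Discourse jobs respond to execute(args)
    puts "RSS grew by ~#{rss_mb - before} MB"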


(Andrew Waugh) #12

That is the premise we’re working on: a job from a plugin that runs every 30 minutes and causes sidekiq to fall over.

It’s a bit odd that the backup job recovered every now and then when it first started happening, but fails consistently now.

We’ll have to step through the plugins; hopefully Gunnar has some time.


(Gunnar Helliesen) #13

I’ll start doing so now, and we’ll see if it makes any difference. Unfortunately it’ll take some time to identify the culprit, as we have a lot of plugins and we’ll need around 24 hours of testing time per plugin.


(Rafael dos Santos Silva) #14

Yeah, we removed “mailing list daily summary” from core because it was bloating sidekiq memory (it now lives in its own plugin), so it’s a very good candidate to test first.


(Andrew Waugh) #15

So, we’ve taken out Babble and MLM. /admin is indicating:

> A check for updates has not been performed lately. Ensure sidekiq is running.

and /logs shows

> Sidekiq is consuming too much memory (using: 2272.23M) for 'forums.jag-lovers.com', restarting

with minor variations in the amount of memory, exactly 30 minutes apart,

and

> Sidekiq heartbeat test failed, restarting

currently preceded by a count of 279.

Questions:

  1. Shouldn’t /admin be correct once the “DashboardStats”, “AboutStats”, and “PeriodicalUpdates” jobs have run?

  2. Is the “Sidekiq heartbeat test failed” log message a summary of the individual “Consuming too much memory” messages?

  3. Is it possible that, even after removing MLM and rebuilding, there are unprocessed MLM mail jobs which sidekiq still needs to work through until things return to normal?


(Rafael dos Santos Silva) #16

If your queue is > 0, yes.


(Andrew Waugh) #17

I’m not sure where to look to see if we have MLM mails in the queue.

Is there a way to flush those jobs?

Surely if the plugin has been removed from app.yml and the site rebuilt, there won’t be a job in the scheduler to even try to handle outstanding mails?
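For anyone with console access, Sidekiq’s API can answer both questions from a rails console; something like the sketch below should show whether any mailing-list jobs are still queued and let us drop them. (The class-name pattern is a guess at what the MLM plugin’s job classes are called.)

    require "sidekiq/api"

    stats = Sidekiq::Stats.new
    puts "enqueued=#{stats.enqueued} scheduled=#{stats.scheduled_size} retries=#{stats.retry_size}"

    # Look for (and optionally drop) leftover mailing-list jobs in every queue.
    Sidekiq::Queue.all.each do |queue|
      queue.each do |job|
        next unless job.klass =~ /MailingList/i  # guessed pattern for MLM job classes
        puts "#{queue.name}: #{job.klass} #{job.args.inspect}"
        # job.delete  # uncomment to flush this job from the queue
      end
    end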


(Andrew Waugh) #18

Even after commenting out all non-Discourse plugins and a rebuild, it still does the exact same thing.

We need some help.


(Jeff Atwood) #19

Are you sure your pending sidekiq job queue was flushed, i.e. reset to zero? If this were some systemic issue, we would see it on our hosting or in meta reports. Neither is happening.


(Gunnar Helliesen) #20

How can I tell?

(Oh, and is there documentation for the Sidekiq interface somewhere?)

Thanks!


(Gunnar Helliesen) #21

I’ve been watching the Sidekiq GUI for a while today. Most of the time it’s idle. We have 0 Busy, 0 Enqueued, 0 Retries, and between 5 and 35 Scheduled.

5 of the Scheduled jobs are “Jobs::UnpinTopic” with dates set months or even years in the future. Most of the time those are the only Scheduled jobs. Every now and then a “Jobs::NotifyMailingListSubscribers” pops up in Busy for a few seconds, and then we get 10-30 “Jobs::UserEmail” in Scheduled. After a minute or two they’re gone too, and we’re back to the baseline 5 future jobs.

This is what I’m seeing. Retries remains at 0, and so does Dead. The Failed count stays at the same number, at least for as long as I’ve been watching the Sidekiq GUI today.
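(For anyone who wants to reproduce this view without the GUI, the Scheduled set can be listed from a rails console with Sidekiq’s API; a small sketch:)

    require "sidekiq/api"

    # List every entry in the Scheduled set with its fire time, e.g. the
    # far-future Jobs::UnpinTopic entries described above.
    Sidekiq::ScheduledSet.new.each do |job|
      puts "#{job.at}  #{job.klass}  #{job.args.inspect}"
    end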

After 30 minutes of this, Sidekiq dies (from Error Logs):

Sidekiq heartbeat test failed, restarting

/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/logster-1.2.9/lib/logster/logger.rb:93:in `add_with_opts'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/logster-1.2.9/lib/logster/logger.rb:50:in `add'
/usr/local/lib/ruby/2.5.0/logger.rb:536:in `warn'
config/unicorn.conf.rb:182:in `check_sidekiq_heartbeat'
config/unicorn.conf.rb:199:in `master_sleep'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/unicorn-5.4.0/lib/unicorn/http_server.rb:294:in `join'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/unicorn-5.4.0/bin/unicorn:126:in `<top (required)>'
/var/www/discourse/vendor/bundle/ruby/2.5.0/bin/unicorn:23:in `load'
/var/www/discourse/vendor/bundle/ruby/2.5.0/bin/unicorn:23:in `<main>'
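For context, the check named in that backtrace (check_sidekiq_heartbeat in config/unicorn.conf.rb) runs in the unicorn master on roughly a 30-minute cycle, which would explain the clockwork restarts. A simplified paraphrase of its logic (approximate names, assumed 500M default, not the real Discourse source):

    # Simplified paraphrase of check_sidekiq_heartbeat; names are placeholders.
    HEARTBEAT_INTERVAL = 30 * 60  # seconds; the check runs on a ~30 minute cycle

    # last_heartbeat: epoch seconds when Sidekiq's heartbeat job last reported;
    # rss_mb:         Sidekiq's current resident set size, in MB.
    def sidekiq_needs_restart?(last_heartbeat, rss_mb)
      limit_mb = ENV.fetch("UNICORN_SIDEKIQ_MAX_RSS", "500").to_i  # assumed default

      # Restart when Sidekiq's RSS exceeds the configured cap.
      if rss_mb > limit_mb
        warn "Sidekiq is consuming too much memory (using: #{rss_mb}M), restarting"
        return true
      end

      # A stale heartbeat means Sidekiq was too wedged to run even a trivial job.
      if last_heartbeat < Time.now.to_i - HEARTBEAT_INTERVAL
        warn "Sidekiq heartbeat test failed, restarting"
        return true
      end

      false
    end

If that reading is right, a “heartbeat test failed” restart means Sidekiq never managed to run even its trivial heartbeat job during the window, which points at something wedging the worker rather than ordinary load.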

If something is tripping up Sidekiq and making it barf, how can I find it?