CleanUpUploads job never completes leading to Sidekiq hanging and restarting


(Sam Saffron) #22

What do you see in the scheduler tab? Is everything done? Anything take forever to run?


(Andrew Waugh) #23

I’ve been watching it too, nothing caught my eye as running overly long.

scheduler shows a few “OK”, and a bunch of “QUEUED”.

/logs just shows the sidekiq restarts, and the odd browser message from a client or a rejected email.


(Gunnar Helliesen) #24

Recurring Jobs history

Worker Last Run Last Result Last Duration Last Owner Next Run Due Actions
Jobs::WhosOnlineGoingOffline 56 seconds ago QUEUED -1ms _scheduler_dewis-app:5582:1:0916df1bf46d0a4b2aaa9031ec7d5ccc right now
Jobs::CategoryStats 22 hours ago OK 17.03 secs right now
Jobs::ProcessBadgeBacklog 40 seconds ago QUEUED -1ms _scheduler_dewis-app:5582:1:0916df1bf46d0a4b2aaa9031ec7d5ccc in 19 seconds
Jobs::MigrateUploadScheme 9 minutes ago QUEUED -1ms _scheduler_dewis-app:5582:1:0916df1bf46d0a4b2aaa9031ec7d5ccc in 40 seconds
Jobs::EnqueueOnceoffs 9 minutes ago QUEUED -1ms _scheduler_dewis-app:5582:1:0916df1bf46d0a4b2aaa9031ec7d5ccc in 46 seconds
Jobs::PollFeed 4 minutes ago QUEUED -1ms _scheduler_dewis-app:5582:1:0916df1bf46d0a4b2aaa9031ec7d5ccc in 53 seconds
Jobs::Heartbeat just now QUEUED -1ms _scheduler_dewis-app:5582:1:0916df1bf46d0a4b2aaa9031ec7d5ccc in 2 minutes
Jobs::PollMailbox 1 minute ago QUEUED -1ms _scheduler_dewis-app:5582:1:0916df1bf46d0a4b2aaa9031ec7d5ccc in 4 minutes
Jobs::EnqueueDigestEmails 19 minutes ago QUEUED -1ms _scheduler_dewis-app:5582:1:0916df1bf46d0a4b2aaa9031ec7d5ccc in 9 minutes
Jobs::AboutStats 19 minutes ago OK 966ms in 9 minutes
Jobs::PeriodicalUpdates 4 minutes ago QUEUED -1ms _scheduler_dewis-app:5582:1:0916df1bf46d0a4b2aaa9031ec7d5ccc in 10 minutes
Jobs::DestroyOldDeletionStubs 19 minutes ago QUEUED -1ms _scheduler_dewis-app:5582:1:0916df1bf46d0a4b2aaa9031ec7d5ccc in 10 minutes
Jobs::UnsilenceUsers 4 minutes ago QUEUED -1ms _scheduler_dewis-app:5582:1:0916df1bf46d0a4b2aaa9031ec7d5ccc in 10 minutes
Jobs::DashboardStats 19 minutes ago QUEUED -1ms _scheduler_dewis-app:5582:1:0916df1bf46d0a4b2aaa9031ec7d5ccc in 10 minutes
Jobs::PendingQueuedPostReminder 19 minutes ago QUEUED -1ms _scheduler_dewis-app:5582:1:0916df1bf46d0a4b2aaa9031ec7d5ccc in 37 minutes
Jobs::CleanUpUploads 19 minutes ago RUNNING -1ms _scheduler_dewis-app:5582:1:0916df1bf46d0a4b2aaa9031ec7d5ccc in 38 minutes
Jobs::PendingUsersReminder 19 minutes ago QUEUED -1ms _scheduler_dewis-app:5582:1:0916df1bf46d0a4b2aaa9031ec7d5ccc in 40 minutes
Jobs::TopRefreshToday 19 minutes ago QUEUED -1ms _scheduler_dewis-app:5582:1:0916df1bf46d0a4b2aaa9031ec7d5ccc in 40 minutes
Jobs::CreateMissingAvatars 19 minutes ago QUEUED -1ms _scheduler_dewis-app:5582:1:0916df1bf46d0a4b2aaa9031ec7d5ccc in 41 minutes
Jobs::PendingFlagsReminder 19 minutes ago QUEUED -1ms _scheduler_dewis-app:5582:1:0916df1bf46d0a4b2aaa9031ec7d5ccc in 42 minutes
Jobs::DirectoryRefreshDaily 19 minutes ago QUEUED -1ms _scheduler_dewis-app:5582:1:0916df1bf46d0a4b2aaa9031ec7d5ccc in 42 minutes
Jobs::ReindexSearch 19 minutes ago QUEUED -1ms _scheduler_dewis-app:5582:1:0916df1bf46d0a4b2aaa9031ec7d5ccc in 1 hour
Jobs::ScheduleBackup 19 minutes ago QUEUED -1ms _scheduler_dewis-app:5582:1:0916df1bf46d0a4b2aaa9031ec7d5ccc in 2 hours
Jobs::CleanUpExports 19 hours ago OK 4ms in 2 hours
Jobs::CleanUpUnusedStagedUsers 19 hours ago OK 21ms in 3 hours
Jobs::PatreonSyncPatronsToGroups 19 minutes ago QUEUED -1ms _scheduler_dewis-app:5582:1:0916df1bf46d0a4b2aaa9031ec7d5ccc in 5 hours
Jobs::Tl3Promotions 19 minutes ago QUEUED -1ms _scheduler_dewis-app:5582:1:0916df1bf46d0a4b2aaa9031ec7d5ccc in 6 hours
Jobs::CleanUpUnmatchedEmails 16 hours ago OK 5ms in 6 hours
Jobs::EnsureDbConsistency 19 minutes ago QUEUED -1ms _scheduler_dewis-app:5582:1:0916df1bf46d0a4b2aaa9031ec7d5ccc in 11 hours
Jobs::AutoQueueHandler 7 hours ago OK 9ms in 16 hours
Jobs::CleanUpUnsubscribeKeys 7 hours ago OK 192ms in 16 hours
Jobs::CleanUpEmailLogs 6 hours ago OK 2.74 secs in 16 hours
Jobs::CleanUpUnmatchedIPs 7 hours ago OK 3ms in 16 hours
Jobs::BadgeGrant 5 hours ago OK 52.29 secs in 18 hours
Jobs::CalculateAvgTime 4 hours ago OK 13.05 secs in 19 hours
Jobs::CleanUpCrawlerStats 4 hours ago OK 7ms in 19 hours
Jobs::BabblePruneHistory 2 hours ago OK 3ms in 21 hours
Jobs::GrantAnniversaryBadges 19 minutes ago QUEUED -1ms _scheduler_dewis-app:5582:1:0916df1bf46d0a4b2aaa9031ec7d5ccc in 22 hours
Jobs::DisableBootstrapMode 19 minutes ago QUEUED -1ms _scheduler_dewis-app:5582:1:0916df1bf46d0a4b2aaa9031ec7d5ccc in 22 hours
Jobs::TopRefreshOlder 19 minutes ago QUEUED -1ms _scheduler_dewis-app:5582:1:0916df1bf46d0a4b2aaa9031ec7d5ccc in 22 hours
Jobs::PurgeOldWebHookEvents 19 minutes ago QUEUED -1ms _scheduler_dewis-app:5582:1:0916df1bf46d0a4b2aaa9031ec7d5ccc in 22 hours
Jobs::PurgeUnactived 19 minutes ago QUEUED -1ms _scheduler_dewis-app:5582:1:0916df1bf46d0a4b2aaa9031ec7d5ccc in 22 hours
Jobs::DailyPerformanceReport 1 hour ago OK 2ms in 23 hours
Jobs::PurgeDeletedUploads 19 minutes ago QUEUED -1ms _scheduler_dewis-app:5582:1:0916df1bf46d0a4b2aaa9031ec7d5ccc in 23 hours
Jobs::DirectoryRefreshOlder 19 minutes ago QUEUED -1ms _scheduler_dewis-app:5582:1:0916df1bf46d0a4b2aaa9031ec7d5ccc in 1 day
Jobs::VersionCheck 19 minutes ago QUEUED -1ms _scheduler_dewis-app:5582:1:0916df1bf46d0a4b2aaa9031ec7d5ccc in 1 day
Jobs::DestroyOldHiddenPosts 19 minutes ago QUEUED -1ms _scheduler_dewis-app:5582:1:0916df1bf46d0a4b2aaa9031ec7d5ccc in 1 day
Jobs::GrantNewUserOfTheMonthBadges 19 minutes ago QUEUED -1ms _scheduler_dewis-app:5582:1:0916df1bf46d0a4b2aaa9031ec7d5ccc in 1 day
Jobs::CleanUpSearchLogs 6 days ago OK 98ms in 1 day
Jobs::PatreonUpdateTokens 19 minutes ago QUEUED -1ms _scheduler_dewis-app:5582:1:0916df1bf46d0a4b2aaa9031ec7d5ccc in 6 days
Jobs::Weekly 19 minutes ago QUEUED -1ms _scheduler_dewis-app:5582:1:0916df1bf46d0a4b2aaa9031ec7d5ccc in 6 days

And that’s it. The BadgeGrant job, that took the longest, is a Patreon plugin job, I think.


(Sam Saffron) #25

In a steady state everything should show OK, if you click history does anything take a real long time?


(Gunnar Helliesen) #26

No, nothing over 1 minute. Most of the entries are in the low ms range. OTOH, even History doesn’t show stuff going back very far (12 hours, by the looks of it).


(Gunnar Helliesen) #27

Actually… Andrew noticed that there are a bunch of Jobs::CleanupUploads jobs in History showing as RUNNING. They were all (most?) started pretty much 30 minutes apart.

Could those be our culprits?


(Sam Saffron) #28

very very likely… you can try to run that job manually via the console.

./launcher enter app
rails c
> Jobs::CleanUpUploads.new.execute(nil)

… wait a lot …


(Andrew Waugh) #29

Hours, days, or weeks?

Sidekiq is still restarting, nearly 24 hours since Gunnar started the command.

Today I notice a whole bunch of “postgres: 10/main: discourse discourse [local] DELETE waiting” processes when I look at /admin/upgrade@/processes.

Those delete tasks seem to be about 30 min apart, in step with the sidekiq restart messages in /logs.


(Gunnar Helliesen) #30

I was just wondering about that. Is there any way we can stop Sidekiq from triggering new CleanUpUploads jobs while the one I started manually is still running?


(Sam Saffron) #31

Honestly, I am not sure, I would expect minutes… not hours… is it stuck? What happens when you run it? after 10 minutes hit ctrl - c, what do you see?


(Gunnar Helliesen) #32

I tried twice. Both times it ran for more than a full 24 hours before I interrupted it. The only result I could see of it running was that it ate a lot of memory.

One Ctrl/C: Nothing. It continued to hang.
Another Ctrl/C: I got the rails prompt back. Nothing else.

What is that job trying to do, anyway? Is there anything we can do to make it complete successfully, so we can get our site functioning properly again?


(Sam Saffron) #33

It runs this:

I have a lot of difficulty helping figuring out where it is hanging if I can not figure out which line it is stuck at.

Maybe the big query it runs at the end never returns? Maybe the first query fails?

Maybe it should not be doing a find_each here and instead it should do a limit 100

My guess is the simplest way for you to ublock yourself would be to set clean_up_uploads to false. But … you are likely to hit storage problems cause files that are abandoned will stick around in your system forever.

My gut feeling here is that the Upload query never completes.


(Kane York) #34

that says it’s hung in a DELETE statement – can we check the pg stats for which statement is running right now?


(Andrew Waugh) #35

I’m going to be out today, but I can try running some permutations of the queries in that job with data explorer in the evening. Perhaps I can narrow it down to a specific query or even upload.


(Andrew Waugh) #36

The cleanup job must have started multiple times via the regular sidekiq schedule, but I see no “DELETE waiting” tasks in /admin/upgrade#/processes now. What happens with tasks from scheduled jobs when sidekiq restarts, do they go zombie and get cleaned out, run to completion, or does the next run of the job clean them up?

Hopefully Gunnar will have a chance to look at pg stats today (He’s on the west coast).


(Gunnar Helliesen) #37

I ran it manually one more time. This time I let it run for about 12 minutes, then hit Ctrl/C once and just waited. After several minutes it spit out this:

root@dewis:~# cd /var/discourse/
root@dewis:/var/discourse# ./launcher enter app
root@dewis-app:/var/www/discourse# rails c
[1] pry(main)> Jobs::CleanUpUploads.new.execute(nil)
^CInterrupt:
from /var/www/discourse/vendor/bundle/ruby/2.5.0/gems/rack-mini-profiler-1.0.0/lib/patches/db/pg.rb:92:in `async_exec'
[2] pry(main)>

Does that help?


(Gunnar Helliesen) #38

I’d love to, but I’m not that familiar with Postgres. How would I do that?


(Matt Palmer) #39
SELECT procpid,current_query FROM pg_stat_activity;

(Alan Tan) #40

I noticed that we were using destroy_all which can blow the memory gasket easily if there are alot of records in the relation. Patched in


(Gunnar Helliesen) #41

Thanks!

Lemme run a quick pull and rebuild and report back.