CleanUpUploads job never completes leading to Sidekiq hanging and restarting

(Gunnar Helliesen) #42

It’s been running for over 30 minutes now without completing. I’ve got one of those DELETE process waiting already (from ps):

postgres  2123  0.0  1.1 7704720 182196 ?      Ss   05:00   0:00 postgres: 10/main: discourse discourse [local] DELETE waiting

I’d like to do as @riking and @mpalmer suggested and find out which statement is hanging, but when I enter Matt’s SQL command I get an error:

discourse=# SELECT procpid,current_query FROM pg_stat_activity;
ERROR:  column "procpid" does not exist
LINE 1: SELECT procpid,current_query FROM pg_stat_activity;

Please keep in mind that I’m not a developer and I’m not working on a development machine. However, I googled and found out about pg_stats_activity, which gave me this:

client backend
 16400 | discourse | 2123 |    16385 | discourse | pry                                                       |             |                 |              -1 | 2018-07-02 05:00:11.960878+00 | 2018-07-02 05:00:11.991213+00 | 2018-07-02 05:26:15.060119+00 | 2018-07-02 05:26:15.060169+00 | Client          | ClientRead          | idle in transaction |   131041787 |              | DELETE FROM "post_uploads" WHERE "post_uploads"."id" = 30966                                                                                                            

Several minutes later, that process is still hanging.

Does this help?

(Kane York) #43

Pulled out what i’m pretty sure are the relevant parts.

So it’s doing a giant loop in Ruby-land?

(Gunnar Helliesen) #44

FYI, I did a rebuild, reboot, and started another manual CleanUpUploads job, and let it sit for ~10 minutes. Sure enough, I got another DELETE waiting process. This time it says:

DELETE FROM "post_uploads" WHERE "post_uploads"."id" = 989209

So, a different id. At least we now know it’s not a specific entry that’s the problem. Right?

Ideas, anyone?

(Gunnar Helliesen) #45

I interrupted it after about 30 minutes. It continued hanging for a long time, then spat:

[1] pry(main)>
from /var/www/discourse/vendor/bundle/ruby/2.5.0/gems/rack-mini-profiler-1.0.0/lib/patches/db/pg.rb:92:in `async_exec'

So, same as last time.

What should I try next? BTW, this is with fresh code pulled an hour ago.

(Gunnar Helliesen) #46

In a fit of desperation, I unchecked the Clean up Uploads setting in Discourse settings. I restarted the server and waited 30 minutes.

Boom. After 30 minutes I got the dreaded “Sidekiq is consuming too much memory” message in the Error Log, and Sidekiq was restarted.

So, unless I’m missing something, the CleanUpUploads job is a red herring. Or?

(Sam Saffron) #47

Possibly your db is in a bad state. One thing we do in some cases like this is a full backup / restore.

Dump the db to file, load db from file. This type of full rebuild tends to recover a bunch of issues in PG. Not saying it will sort everything out, but it will not do any damage.

(Gunnar Helliesen) #48

OK, that sounds like a good thing to try. To save me some googling, do you know of a set of instructions somewhere on how to do this? I’d have to do it from inside the Docker container, correct?


(Sam Saffron) #49

Yes you would, I am not sure where the instruction are online but this will work more or less:

Just be sure to run sudo -iu posgtres dumpdb discourse > backup.db so you have a backup file to work with.

(Gunnar Helliesen) #50

Yes, that bit is kinda important. :wink: Thanks!

(Andrew Waugh) #51

I’m guessing the site should be in read only mode before the backup, and stay until the restore is finished, right?

(Sam Saffron) #52

The site should be down that is the sv stop unicorn thing.

(Gunnar Helliesen) #53

Did you mean pg_dump? There’s no dumpdb command, at least not on the test machine I’m using now (Postgres 9.5).

(Sam Saffron) #54

Yes, sorry, also you really want to use this same chance to upgrade to PG 10

(Gunnar Helliesen) #55

Sure. Our production system is on 10, but I’m performing this procedure on our test system first, just to get familiar with it.

(Gunnar Helliesen) #56

OK, a full dump and restore is now done on our production system as well.

After the restore, I performed a'testing').execute test, as mentioned here:

And it returned very quickly. We still have the “A check for updates has not been performed lately. Ensure sidekiq is running.” error message on the /admin page, though. Sidekiq is running and the Jobs::VersionCheck job did run after reboot.

The Jobs::CleanUpUploads job ran for 30 minutes straight, then Sidekiq was killed for using too much memory and restarted. So, it seems we’re back to square 1.

What can we try next?


(Sam Saffron) #57

I am not sure… I guess PM me with access to the server and I will have a quick look.

(Sam Saffron) #58

Well … this is why this is happening:

> Upload.find(2500).post_uploads.count
=> 941451

So we have an upload somehow associated with about 1 million posts … hmmm… digging deeper

> PostUpload.count
=> 973724

Oh my…

Fixing per:

[27] pry(main)> PostUpload.where(upload_id: 2500).delete_all
=> 941451
[28] pry(main)> PostUpload.count
=> 32274

@Gunnar should be good now, can you keep an eye on this, if this starts again we need to find out what is inserting all these rows into post upload table … cc @zogstrip / @gerhard

My guess is this is some bug in a merge script? Did you merge any users recently?

(Gunnar Helliesen) #59

We may have, but @JagWaugh would know more about that than I do. Andrew?

Thank you!

(Andrew Waugh) #60

No. Not via the shell. I did, via the UX, reassign some post ownerships to merge multiple accounts, but that was back when we first went live.

(Andrew Waugh) #61

If I am not mistaken there was a “great rebake” a while ago, roughly March/April iirc. I have noticed that any of our historical (imported) posts which included a link to our photos site were then oneboxed. Would a huge number of thumbnails, all the same image, have caused this problem?