CleanUpUploads job never completes leading to Sidekiq hanging and restarting


(Sam Saffron) #62

hmmm please explain, can I have an explicit example here?


(Andrew Waugh) #63

Here’s two:

These are both from before we moved to Discourse.

When we first went live they were just urls pointing to a specific album/photo in our photo site. Now they point to something incorrect.

The second post above should point to:
http://www.jag-lovers.org/snaps/snap_view.php3?id=1282259181

EDIT: The form for a url to our photos is:

Welcome to Jag-lovers - The Members' Photo Albumtimestamp of the upload


(Andrew Waugh) #64

Thanks for looking at this Sam.

This started with our scheduled backups failing.

Yesterday the backup ran (i.e. there is a file in /backups), but it sent a PM indicating that the backup had failed. Also, about 10 minutes into that backup, /logs shows

Sidekiq heartbeat test failed, restarting

followed immediately by:

can't alloc thread
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/mini_racer-0.1.15/lib/mini_racer.rb:248:in `new'

Today, the backup ran to completion, no “Failed” PM, and no failed hearbeat.


(Sam Saffron) #65

That is an out-of-memory.

Do you have swap setup? Is memory pressure high? (looks unrelated to the OP though)


(Andrew Waugh) #66

I don’t have SSH access, I’ll ask @Gunnar to give you the current specs (but I think we have oodles of memory).

I monitored our instance while the backup ran yesterday, watching sidekiq. Is sidekiq supposed to pause during the entire backup, or should it only pause while the database dump is going on?


(Gunnar Helliesen) #67

We have 16 GB of memory, 4 GB of swap (and 4 CPU cores).

From app.yml:

 db_shared_buffers: "7168MB"
 db_work_mem: "128MB"
 UNICORN_WORKERS: 8
 UNICORN_SIDEKIQ_MAX_RSS: 1000

Here’s output from top just now:

top - 05:30:07 up 4 days,  5:33,  1 user,  load average: 0.03, 0.23, 0.22
Tasks: 226 total,   1 running, 225 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.4 us,  0.8 sy,  0.0 ni, 98.7 id,  0.1 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 16430868 total,   254948 free,  4148396 used, 12027524 buff/cache
KiB Swap:  4194300 total,  3457204 free,   737096 used.  4290288 avail Mem

Thanks,
Gunnar


(Sam Saffron) #68

Yeah looks like you tuned your self up to the edge here and memory pressure is highish. You can probably get away with 7 workers and 6.5 gig dedicated to PG which will reduce the pressure some.


(Gunnar Helliesen) #69

Cool, thanks! I’ve rebuilt Discourse with the new values and rebooted the server. I’ll let it run for a day or two to see how it settles.

Thanks,
Gunnar


(Andrew Waugh) #70

Same again today. Backup ran, but a “failed PM”, and the same two events in /logs


(Andrew Waugh) #71

Something still not quite right.

Today’s backup completed, without sending a “failed” PM, but /logs shows (just as the backup should have been completing):

Sidekiq heartbeat test failed, restarting
can’t alloc thread (for mini-racer)
Job exception: BackupRestore::OperationRunningError

There are now 3 backups held locally, but our settings are to only retain 2:

and /admin shows:
image

Notice that dashboard was last updated July 11, but the indicates that the latest backup was July 9, although /backups lists 2 backups since July 9.

The Backup on July 9 is there in /backups, but the system sent a PM that the backup had failed that day.

It seems to me like a semaphore for the backup job is out of sync.

Yesterday I noticed that although sidekiq is paused while the backup runs, the “Processed” count ticked up although the only thing executing was the backup. Perhaps I’ve misunderstood how sidekiq works (or what “paused” and “processed” mean).

Feel free to split this off into a separate topic.


(Gunnar Helliesen) #72

@sam, here are the stats after the server has been running for a few days:

gunnar@dewis:~$ free
              total        used        free      shared  buff/cache   available
Mem:       16430868     4301592      148584     7000500    11980692     4681180
Swap:       4194300      946084     3248216
gunnar@dewis:~$

top - 19:43:17 up 2 days, 13:33,  1 user,  load average: 0.62, 0.96, 0.84
Tasks: 234 total,   1 running, 233 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.5 us,  0.4 sy,  0.0 ni, 98.9 id,  0.0 wa,  0.0 hi,  0.1 si,  0.1 st
KiB Mem : 16430868 total,   663128 free,  3797152 used, 11970588 buff/cache
KiB Swap:  4194300 total,  3232148 free,   962152 used.  5185896 avail Mem

Like @JagWaugh posted above, we have entries in our error log, all from 2:04 AM (local time? If so, that would be 9:04 AM UTC).

Sidekiq heartbeat test failed, restarting
can't alloc thread /var/www/discourse/vendor/bundle/ruby/2.5.0/gems/mini_racer-0.2.0/lib/mini_racer.rb:271:in `new'
Job exception: BackupRestore::OperationRunningError

The latest backup file is named 2018-07-11-083252 and has a file date of 8:46 AM (UTC).

What do you think, given this and what @JagWaugh posted above? We suddenly have three complete backup files this morning, and no message from system about a failed backup. Our backup settings are:

enable backups: Checked
readonly mode during backup: Unchecked
maximum backups: 2
automatic backups enabled: Checked
backup frequency: 1
enable uploads to S3: Checked
(S3 settings set and verified)
backup with uploads: Checked

Grateful for any ideas.

Gunnar


(Sam Saffron) #73

Can you disable automatic backups for a few days and see what happens? Maybe there is some bug in the aws upload code that is causing excessive memory usage.

I would:

  1. Disable all automatic backups
  2. If good for 4 days, enable automatic backups without aws upload
  3. If good for 4 days, enable aws upload in backups and see what happens

At least you will isolate what the issue is.


(Gunnar Helliesen) #74

I’ll do as you suggest and report back.

Thanks!
Gunnar


(Andrew Waugh) #77

Picking back up on this:

  1. We ran for a week without any scheduled backups. No sidekiq restarts.
  2. We ran for a week with just scheduled backups. worked fine every day, no sidekiq restarts.
  3. 2 days ago I re-enabled AWS uploads. 1st day it ran fine, today I see a sidekiq restart, followed immediately by a “can’t allocate thread” mini-racer. The backup completed, and is shown in /admin/backups.

When does the AWS upload get triggered? While uploads were still disabled I noticed that with “keep 2 backups” set, the 3rd backup appears while the .gz is still going on, and the number of backups listed only gets reduced when the .gz completes.


(Sam Saffron) #78

Clearly something is fishy with this giant AWS upload code, for context how big is your backup file.

Also be sure to do a rebuild ASAP, I have a very important memory related patch, curious if this resolves the issue for you.


(Andrew Waugh) #79

The backup is 21.7 G.

Just as I finished posting the last post a PM arrived indicating that the backup had failed (no space left on device), but… here’s the kicker… the .gz file for today is there in /admin/backups, and it’s the same size as the (successful) backup file from yesterday.


(Sam Saffron) #80

I get that you want backups and totally want to fix this, but is there any reason not to simply rely on digital ocean backups here?

21.7G is huge, I have a feeling some stuff may need changing.


(Andrew Waugh) #81

We’re at 1ca999f. (Gunnar did a rebuild yesterday)

Isn’t that gonna be difficult to do, using AWS for hosting as we are?


(Sam Saffron) #82

Ooops … was not aware of that … I guess you are just going to have to hold tight a bit here, we are planning to do some work in this area soon.

How much free disk space do you have on local?

The particular change I want was committed yesterday to Discourse Docker, so another rebuild would be a good idea just to make sure it hits.

Have you considered moving uploads to s3 and fronting that with a cdn… it will make your backups way smaller.


(Andrew Waugh) #83

backups is: 43.4 GB (49.8 GB free)
uploads is: 14.3 GB (43.3 GB free)

I’ll ask Gunnar to do a rebuild tonight.

We talked about moving uploads, I can’t remember why we decided not to.