CleanUpUploads job never completes leading to Sidekiq hanging and restarting


(Sam Saffron) #62

hmmm please explain, can I have an explicit example here?


(Andrew Waugh) #63

Here’s two:

These are both from before we moved to Discourse.

When we first went live they were just urls pointing to a specific album/photo in our photo site. Now they point to something incorrect.

The second post above should point to:
http://www.jag-lovers.org/snaps/snap_view.php3?id=1282259181

EDIT: The form for a url to our photos is:

Welcome to Jag-lovers - The Members' Photo Albumtimestamp of the upload


(Andrew Waugh) #64

Thanks for looking at this Sam.

This started with our scheduled backups failing.

Yesterday the backup ran (i.e. there is a file in /backups), but it sent a PM indicating that the backup had failed. Also, about 10 minutes into that backup, /logs shows

Sidekiq heartbeat test failed, restarting

followed immediately by:

can't alloc thread
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/mini_racer-0.1.15/lib/mini_racer.rb:248:in `new'

Today, the backup ran to completion, no “Failed” PM, and no failed hearbeat.


(Sam Saffron) #65

That is an out-of-memory.

Do you have swap setup? Is memory pressure high? (looks unrelated to the OP though)


(Andrew Waugh) #66

I don’t have SSH access, I’ll ask @Gunnar to give you the current specs (but I think we have oodles of memory).

I monitored our instance while the backup ran yesterday, watching sidekiq. Is sidekiq supposed to pause during the entire backup, or should it only pause while the database dump is going on?


(Gunnar Helliesen) #67

We have 16 GB of memory, 4 GB of swap (and 4 CPU cores).

From app.yml:

 db_shared_buffers: "7168MB"
 db_work_mem: "128MB"
 UNICORN_WORKERS: 8
 UNICORN_SIDEKIQ_MAX_RSS: 1000

Here’s output from top just now:

top - 05:30:07 up 4 days,  5:33,  1 user,  load average: 0.03, 0.23, 0.22
Tasks: 226 total,   1 running, 225 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.4 us,  0.8 sy,  0.0 ni, 98.7 id,  0.1 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 16430868 total,   254948 free,  4148396 used, 12027524 buff/cache
KiB Swap:  4194300 total,  3457204 free,   737096 used.  4290288 avail Mem

Thanks,
Gunnar


(Sam Saffron) #68

Yeah looks like you tuned your self up to the edge here and memory pressure is highish. You can probably get away with 7 workers and 6.5 gig dedicated to PG which will reduce the pressure some.


(Gunnar Helliesen) #69

Cool, thanks! I’ve rebuilt Discourse with the new values and rebooted the server. I’ll let it run for a day or two to see how it settles.

Thanks,
Gunnar


(Andrew Waugh) #70

Same again today. Backup ran, but a “failed PM”, and the same two events in /logs


(Andrew Waugh) #71

Something still not quite right.

Today’s backup completed, without sending a “failed” PM, but /logs shows (just as the backup should have been completing):

Sidekiq heartbeat test failed, restarting
can’t alloc thread (for mini-racer)
Job exception: BackupRestore::OperationRunningError

There are now 3 backups held locally, but our settings are to only retain 2:

and /admin shows:
image

Notice that dashboard was last updated July 11, but the indicates that the latest backup was July 9, although /backups lists 2 backups since July 9.

The Backup on July 9 is there in /backups, but the system sent a PM that the backup had failed that day.

It seems to me like a semaphore for the backup job is out of sync.

Yesterday I noticed that although sidekiq is paused while the backup runs, the “Processed” count ticked up although the only thing executing was the backup. Perhaps I’ve misunderstood how sidekiq works (or what “paused” and “processed” mean).

Feel free to split this off into a separate topic.


(Gunnar Helliesen) #72

@sam, here are the stats after the server has been running for a few days:

gunnar@dewis:~$ free
              total        used        free      shared  buff/cache   available
Mem:       16430868     4301592      148584     7000500    11980692     4681180
Swap:       4194300      946084     3248216
gunnar@dewis:~$

top - 19:43:17 up 2 days, 13:33,  1 user,  load average: 0.62, 0.96, 0.84
Tasks: 234 total,   1 running, 233 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.5 us,  0.4 sy,  0.0 ni, 98.9 id,  0.0 wa,  0.0 hi,  0.1 si,  0.1 st
KiB Mem : 16430868 total,   663128 free,  3797152 used, 11970588 buff/cache
KiB Swap:  4194300 total,  3232148 free,   962152 used.  5185896 avail Mem

Like @JagWaugh posted above, we have entries in our error log, all from 2:04 AM (local time? If so, that would be 9:04 AM UTC).

Sidekiq heartbeat test failed, restarting
can't alloc thread /var/www/discourse/vendor/bundle/ruby/2.5.0/gems/mini_racer-0.2.0/lib/mini_racer.rb:271:in `new'
Job exception: BackupRestore::OperationRunningError

The latest backup file is named 2018-07-11-083252 and has a file date of 8:46 AM (UTC).

What do you think, given this and what @JagWaugh posted above? We suddenly have three complete backup files this morning, and no message from system about a failed backup. Our backup settings are:

enable backups: Checked
readonly mode during backup: Unchecked
maximum backups: 2
automatic backups enabled: Checked
backup frequency: 1
enable uploads to S3: Checked
(S3 settings set and verified)
backup with uploads: Checked

Grateful for any ideas.

Gunnar


(Sam Saffron) #73

Can you disable automatic backups for a few days and see what happens? Maybe there is some bug in the aws upload code that is causing excessive memory usage.

I would:

  1. Disable all automatic backups
  2. If good for 4 days, enable automatic backups without aws upload
  3. If good for 4 days, enable aws upload in backups and see what happens

At least you will isolate what the issue is.


(Gunnar Helliesen) #74

I’ll do as you suggest and report back.

Thanks!
Gunnar