hmmm please explain, can I have an explicit example here?
These are both from before we moved to Discourse.
When we first went live they were just urls pointing to a specific album/photo in our photo site. Now they point to something incorrect.
The second post above should point to:
EDIT: The form for a url to our photos is:
Welcome to Jag-lovers - The Members' Photo Album / timestamp of the upload
Thanks for looking at this Sam.
This started with our scheduled backups failing.
Yesterday the backup ran (i.e. there is a file in /backups), but it sent a PM indicating that the backup had failed. Also, about 10 minutes into that backup, /logs shows
Sidekiq heartbeat test failed, restarting
followed immediately by:
can't alloc thread /var/www/discourse/vendor/bundle/ruby/2.5.0/gems/mini_racer-0.1.15/lib/mini_racer.rb:248:in `new'
Today, the backup ran to completion, with no “Failed” PM and no failed heartbeat.
That is an out-of-memory.
Do you have swap setup? Is memory pressure high? (looks unrelated to the OP though)
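For anyone following along, a quick way to answer both questions from a shell on the host (a generic Linux sketch, nothing Discourse-specific):

```shell
# Totals from /proc/meminfo (present on any Linux host)
grep -E '^(MemTotal|MemAvailable|SwapTotal|SwapFree)' /proc/meminfo

# List active swap devices; no lines beyond the header means no swap is configured
cat /proc/swaps
```

A low `MemAvailable` relative to `MemTotal` while swap usage climbs is the usual sign of high memory pressure.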
I don’t have SSH access, I’ll ask @Gunnar to give you the current specs (but I think we have oodles of memory).
I monitored our instance while the backup ran yesterday, watching sidekiq. Is sidekiq supposed to pause during the entire backup, or should it only pause while the database dump is going on?
We have 16 GB of memory, 4 GB of swap (and 4 CPU cores).
```text
db_shared_buffers: "7168MB"
db_work_mem: "128MB"
UNICORN_WORKERS: 8
UNICORN_SIDEKIQ_MAX_RSS: 1000
```
Here’s output from top just now:
```text
top - 05:30:07 up 4 days, 5:33, 1 user, load average: 0.03, 0.23, 0.22
Tasks: 226 total, 1 running, 225 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.4 us, 0.8 sy, 0.0 ni, 98.7 id, 0.1 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 16430868 total, 254948 free, 4148396 used, 12027524 buff/cache
KiB Swap: 4194300 total, 3457204 free, 737096 used. 4290288 avail Mem
```
Yeah, looks like you tuned yourself right up to the edge here and memory pressure is highish. You can probably get away with 7 workers and 6.5 GB dedicated to PG, which will reduce the pressure some.
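In `containers/app.yml` terms, that suggestion would look roughly like this (illustrative values only; check where these keys actually live in your own file):

```yaml
# containers/app.yml — illustrative fragment, not a complete file
params:
  db_shared_buffers: "6656MB"    # ~6.5 GB for PostgreSQL (was 7168MB)
  db_work_mem: "128MB"           # unchanged
env:
  UNICORN_WORKERS: 7             # was 8
  UNICORN_SIDEKIQ_MAX_RSS: 1000  # unchanged
```

A `./launcher rebuild app` is needed before the new values take effect.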
Cool, thanks! I’ve rebuilt Discourse with the new values and rebooted the server. I’ll let it run for a day or two to see how it settles.
Same again today. The backup ran, but a “failed” PM arrived, and the same two events appear in /logs.
Something still not quite right.
Today’s backup completed, without sending a “failed” PM, but /logs shows (just as the backup should have been completing):
Sidekiq heartbeat test failed, restarting
can’t alloc thread (for mini-racer)
Job exception: BackupRestore::OperationRunningError
There are now 3 backups held locally, but our settings are to only retain 2:
and /admin shows:
Notice that the dashboard was last updated July 11, but it indicates that the latest backup was July 9, although /backups lists 2 backups since July 9.
The Backup on July 9 is there in /backups, but the system sent a PM that the backup had failed that day.
It seems to me like a semaphore for the backup job is out of sync.
Yesterday I noticed that although sidekiq is paused while the backup runs, the “Processed” count ticked up even though the only thing executing was the backup. Perhaps I’ve misunderstood how sidekiq works (or what “paused” and “processed” mean).
Feel free to split this off into a separate topic.
@sam, here are the stats after the server has been running for a few days:
```text
gunnar@dewis:~$ free
              total        used        free      shared  buff/cache   available
Mem:       16430868     4301592      148584     7000500    11980692     4681180
Swap:       4194300      946084     3248216

gunnar@dewis:~$ top
top - 19:43:17 up 2 days, 13:33, 1 user, load average: 0.62, 0.96, 0.84
Tasks: 234 total, 1 running, 233 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.5 us, 0.4 sy, 0.0 ni, 98.9 id, 0.0 wa, 0.0 hi, 0.1 si, 0.1 st
KiB Mem : 16430868 total, 663128 free, 3797152 used, 11970588 buff/cache
KiB Swap: 4194300 total, 3232148 free, 962152 used. 5185896 avail Mem
```
Like @JagWaugh posted above, we have entries in our error log, all from 2:04 AM (local time? If so, that would be 9:04 AM UTC).
```text
Sidekiq heartbeat test failed, restarting
can't alloc thread
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/mini_racer-0.2.0/lib/mini_racer.rb:271:in `new'
Job exception: BackupRestore::OperationRunningError
```
The latest backup file is named 2018-07-11-083252 and has a file date of 8:46 AM (UTC).
What do you think, given this and what @JagWaugh posted above? We suddenly have three complete backup files this morning, and no message from system about a failed backup. Our backup settings are:
- enable backups: Checked
- readonly mode during backup: Unchecked
- maximum backups: 2
- automatic backups enabled: Checked
- backup frequency: 1
- enable uploads to S3: Checked (S3 settings set and verified)
- backup with uploads: Checked
Grateful for any ideas.
Can you disable automatic backups for a few days and see what happens? Maybe there is some bug in the aws upload code that is causing excessive memory usage.
- Disable all automatic backups
- If good for 4 days, enable automatic backups without aws upload
- If good for 4 days, enable aws upload in backups and see what happens
At least you will isolate what the issue is.
I’ll do as you suggest and report back.
Picking back up on this:
- We ran for a week without any scheduled backups. No sidekiq restarts.
- We ran for a week with just scheduled backups; they worked fine every day, no sidekiq restarts.
- 2 days ago I re-enabled AWS uploads. The 1st day it ran fine; today I see a sidekiq restart, followed immediately by a “can’t allocate thread” from mini-racer. The backup completed, and is shown in /admin/backups.
When does the AWS upload get triggered? While uploads were still disabled I noticed that with “keep 2 backups” set, the 3rd backup appears while the .gz is still going on, and the number of backups listed only gets reduced when the .gz completes.
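That behavior matches a “write first, prune after” retention scheme: the new backup lands on disk, and only then is the list trimmed to `maximum backups`. Purely as an illustration (this is not Discourse’s actual code), the effect is the same as this sketch, demonstrated in a temp directory:

```shell
# Hypothetical "keep the newest 2" pruning, demonstrated with dummy files
demo=$(mktemp -d)
touch -d 2018-07-09 "$demo/2018-07-09.tar.gz"
touch -d 2018-07-10 "$demo/2018-07-10.tar.gz"
touch -d 2018-07-11 "$demo/2018-07-11.tar.gz"  # new backup lands first: 3 files exist briefly

# Newest first; everything from the 3rd entry on gets removed
ls -1t "$demo" | tail -n +3 | while read -r f; do rm "$demo/$f"; done

ls -1 "$demo"  # only the 2 newest remain
```

Which would explain why you briefly see 3 backups listed while the .gz is still in progress.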
Clearly something is fishy with this giant AWS upload code. For context, how big is your backup file?
Also be sure to do a rebuild ASAP, I have a very important memory related patch, curious if this resolves the issue for you.
The backup is 21.7 GB.
Just as I finished posting, a PM arrived indicating that the backup had failed (no space left on device), but… here’s the kicker… the .gz file for today is there in /admin/backups, and it’s the same size as the (successful) backup file from yesterday.
I get that you want backups and totally want to fix this, but is there any reason not to simply rely on digital ocean backups here?
21.7G is huge, I have a feeling some stuff may need changing.
We’re at 1ca999f. (Gunnar did a rebuild yesterday)
Isn’t that gonna be difficult to do, using AWS for hosting as we are?
Ooops … was not aware of that … I guess you are just going to have to hold tight a bit here, we are planning to do some work in this area soon.
How much free disk space do you have on local?
The particular change I want was committed yesterday to Discourse Docker, so another rebuild would be a good idea just to make sure it hits.
Have you considered moving uploads to s3 and fronting that with a cdn… it will make your backups way smaller.
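For reference, serving uploads from S3 is configured via env vars in `containers/app.yml`, roughly like this (bucket, region, and CDN values below are placeholders; confirm the exact variable names against the current discourse_docker docs):

```yaml
# containers/app.yml — illustrative fragment; all values are placeholders
env:
  DISCOURSE_USE_S3: true
  DISCOURSE_S3_REGION: "us-east-1"
  DISCOURSE_S3_ACCESS_KEY_ID: "<key>"
  DISCOURSE_S3_SECRET_ACCESS_KEY: "<secret>"
  DISCOURSE_S3_BUCKET: "example-uploads"
  DISCOURSE_S3_CDN_URL: "https://cdn.example.com"  # CDN fronting the bucket
```

With uploads living in S3 they no longer need to be bundled into the local backup tarball, which is what shrinks the backups.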
backups is: 43.4 GB (49.8 GB free)
uploads is: 14.3 GB (43.3 GB free)
I’ll ask Gunnar to do a rebuild tonight.
We talked about moving uploads, I can’t remember why we decided not to.