Discourse crashed after backup - http 500


#1

I woke up this morning to a crashed Discourse showing an HTTP 500 error. I ran git pull and ./launcher rebuild app, and it was quickly back online. This has happened once before, so I’d like to figure out why and prevent it from recurring.

A backup runs at 11 pm, and it looks like Discourse crashed shortly after that:

Sidekiq from this morning:

There are no errors in /log.

free -h after rebuilding looks like this:

              total        used        free      shared  buff/cache   available
Mem:           3.9G        2.0G        215M        282M        1.7G        1.3G
Swap:          2.0G         31M        2.0G

Discourse is running alone on a 4 GB Memory / 2 CPU / 60 GB Disk DO droplet. It was originally set up on the smaller 2 GB memory / 2 CPU / 40 GB Disk size droplet.

Plugins:

I’ve checked the S3 bucket, and there is a 4.9GB backup from last night - looks normal.

Should I run ./discourse-setup to tweak the settings, or do I need to look elsewhere?


(Andrew Waugh) #2

How much free space have you got on your VM, and how big are your backups?

We had a weird crash about 4 days ago, just as the backup was running. We’re a bit short on space, and since then the backup has failed a few times for lack of space. Our admin is going to bump the disk size.


#3

It’s on a 60GB disk droplet with ~66% used, so there should be plenty of room, unless Docker is otherwise confined to a smaller space?


(Andrew Waugh) #4

How many backups are you keeping on disk?

You’ve only got 18.3 GB free, which is enough space for a backup of 8-9 GB, depending on how much compression can be done on your /uploads directory.


#5

Keeping 3 backups - each is 4.9 GB.

In the first screenshot in the OP you can see the blip in disk usage that is caused by the backup, but there is plenty of space to go.


(Jay Pfaffman) #6

I’ve been having similar crashes during backups. My instance is a two-container multi-site instance, and restarting the database is all that’s required to get things running again.

I’ve stopped including images in the backups and have staggered backups by 30 minutes to make sure that two aren’t running at once.

I’m not clear if my problem is too many sites on a single 4GB instance (about a dozen, most with almost zero traffic) or there’s something wrong with my database config. The problem appears to have started a couple weeks ago; I don’t know if it was an upgrade or adding a back-breaking straw.


(Andrew Waugh) #7

Interesting.

Are you storing uploads locally, or on S3?

Ours is just a single container. It fails during the addition of uploads to the .gz file. The instance hung once on Monday, Gunnar did a reboot and an update. Since then the backups have failed, but the forum stays up.

It could just be coincidence.

/logs doesn’t reveal anything informative.

Could something be going wrong when Sidekiq comes off pause during the SQL dump? (We’re not going read-only during the backup. Are you?)


#8

Aren’t backups held locally AND uploaded to S3 on completion?

In my case Discourse is not in read-only mode during backup.


(Andrew Waugh) #9

Yes.

i.e. “maximum backups” in settings determines how many copies are stored locally, and afaik once each backup is complete the latest is also copied to S3 AND the excess local one is deleted.


(Sam Saffron) #10

Try doubling your swap, I think the backup compression stage is a huge memory hog.


(Andrew Waugh) #11

At the moment it looks like our site is just plain running out of space (we haven’t enough headroom for the .tar and the .gz, let alone whatever temp files the compression may create). Once the disk is larger we’ll see.
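
As a rough sketch of that headroom check: the two hypothetical shell helpers below (the names and the factor of two are assumptions for illustration, not anything Discourse provides) test whether the free space on a filesystem could hold both the .tar and the .gz at once.

```shell
# Succeed if free space can hold the uncompressed .tar and the .gz together.
# Arguments: $1 = free space in KiB, $2 = expected backup size in KiB.
# Assumption: the worst case needs about twice the finished backup size.
has_backup_headroom() {
    [ "$1" -ge "$(( $2 * 2 ))" ]
}

# Print the free KiB on the filesystem that holds the given path.
# Uses POSIX-format df output, where column 4 is "Available".
free_kib_for() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}
```

For instance, `has_backup_headroom "$(free_kib_for /var/discourse)" 5138022` would check a 4.9 GB backup (about 5138022 KiB), assuming the install lives under /var/discourse. Temp files created during compression are not counted, so treat a pass as a lower bound.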

It’s just conjecture, but it seems a bit of a coincidence that a few people have been having crashes for about a week.


#12

Do I need to rebuild the app after changing swap size?


(Sam Saffron) #13

You should not need one, no, but you must ensure the new swap is enabled.
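
One way to confirm that, as a minimal sketch: the hypothetical helper below (its name is an assumption for illustration) sums the Size column of /proc/swaps, which has the same layout as `swapon -s` output, and prints the active swap total in KiB. The commented commands show one common way to resize a swap file; the /swapfile path and the 4 GiB size are assumptions about your setup.

```shell
# Typical resize steps, run as root. They are left as comments because they
# change the system; /swapfile and the 4 GiB size are assumptions.
#   swapoff /swapfile
#   dd if=/dev/zero of=/swapfile bs=1M count=4096
#   chmod 600 /swapfile
#   mkswap /swapfile
#   swapon /swapfile

# Print the total active swap in KiB from a /proc/swaps-style table.
# Optional argument: a file to read instead of /proc/swaps.
swap_total_kib() {
    awk 'NR > 1 { total += $3 } END { printf "%d\n", total }' "${1:-/proc/swaps}"
}
```

Running `swap_total_kib` after `swapon` should print roughly the new size, for example about 4194300 for a 4 GiB swap file. If it still prints the old figure, the new swap is not active.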


#14

Like this?

swapon -s
Filename				Type		Size	Used	Priority
/swapfile                              	file    	4194300	29264	-1

free -h
              total        used        free      shared  buff/cache   available
Mem:           3.9G        1.9G        148M        270M        1.9G        1.5G
Swap:          4.0G         29M        4.0G

#15

Thanks for this topic. I finally have an answer for a few crashes that hit some partners’ sites while I wasn’t around.

Has the backup system been changed in the last two weeks? I never had this issue before. Tonight, during manual backups, two sites crashed (one with 1 CPU / 1 GB, the other with 2 CPU / 2 GB but only a 130+ MB database without images), which confirmed the issue for me too.

A reboot is enough to fix it, by the way.

Anyway, I’ve deactivated the automatic backup for now. I’ll add some swap if I hit the same issue during a manual backup in the next few days.


#16

Unfortunately this did not solve the problem. I ran a manual backup two days ago, and it was ok. Last night the backup ran automatically and it crashed again.

I’ve turned off backup with uploads for now, but I do need to back up uploads, so this is a temporary workaround - if it works.

Still nothing in /logs.


(Matt Palmer) #17

The symptoms described in this topic are consistent with a bug in Docker I just reported. The linked post has more details and a procedure for downgrading Docker to a less unpleasant version.


(Rich) #18

Sorry to bump an old thread but I’m currently having the exact same issue.

Docker version:

Docker version 18.01.0-ce, build 03596f5

Do you know whether 18.01 still contains the bug you mentioned, @mpalmer?

Is it safe to downgrade from 18.01 to 17.10 as per your linked post?


(Jay Pfaffman) #19

He tested 17.12 and the bug seemed to be fixed. Also, changes to Discourse worked around that bug. It’s a good bet that you’re not having this problem.

Do you have enough RAM and disk storage?


(Rich) #20

Hey @pfaffman

To give you an idea of the scale, our discourse is relatively small, perhaps 5,500 posts or so.

We run on a single DO droplet with 1GB/30GB:

root@greyarro:~# sudo swapon -s
 Filename                                Type            Size    Used    Priority
 /swapfile                               file            2097148 649408  -1

 root@greyarro:~# free -h
               total        used        free      shared  buff/cache   available
 Mem:           992M        778M         72M         42M        141M         42M
 Swap:          2.0G        634M        1.4G

root@greyarro:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            488M     0  488M   0% /dev
tmpfs           100M  4.5M   95M   5% /run
/dev/vda1        29G   13G   17G  45% /
tmpfs           497M  1.3M  495M   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           497M     0  497M   0% /sys/fs/cgroup
/dev/vda15      105M  3.4M  102M   4% /boot/efi
none             29G   13G   17G  45% /var/lib/docker/aufs/mnt/840e6f4d9f2984c1d72ef416b7a64b19395914384de29ad89732b434e0599b5a
shm              64M  4.0K   64M   1% /var/lib/docker/containers/225ed791ef380a3dd272060a6c140d138df139bf696290af3624018633cccc61/shm
tmpfs           100M     0  100M   0% /run/user/0

A couple of days ago I moved all user-uploaded images to S3, so I’m not sure why Docker is using 13GB (perhaps that’s normal).

A db-only backup is approx 24MB and a full backup is 124MB - but the backups usually fail now, then the entire site stops working within a few minutes after that. A full reboot does cure it.

This problem only started recently, perhaps less than a week ago. That was around the time I ran apt-get update (we are running Ubuntu 16.04.3 LTS), and on the same day I also upgraded to Discourse v2.0.0 (currently on beta1 +65).