Scheduled database backups cause periodic out-of-memory errors?


(Peter Bäckgren) #1

I skipped upgrades for several months since I needed (and had) a stable system during that time.

So today I upgraded from something like v1.9.0.beta5 +42 (might have been that; I'm going from memory, having skipped upgrades for a while precisely to avoid ending up in upgrade hell like this) to v1.9.0.beta15 +70.

A Docker upgrade was required, and I followed the instructions here, basically doing steps like:

apt-get update
apt-get dist-upgrade
wget -qO- https://get.docker.com/ | sh
cd /var/discourse
git pull
./launcher rebuild app

And now I keep ending up with 500 server errors and 502 nginx bad gateway errors, sometimes after just a few minutes of uptime, sometimes after half an hour… I'm guessing it's triggered by something specific I do, though, possibly logging in different users via SSO with something going wrong there (that is what I was working on, at least).

I looked into the best log I could think of:

/shared/log/rails/production.log

Which contained a lot of

Job exception: Connection timed out

and lines like

/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/connection/hiredis.rb:58:in `rescue in read'
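
For the record, this is roughly how I've been counting those from the host; the path assumes the standard standalone layout under /var/discourse, so adjust if yours differs:

cd /var/discourse
grep -c 'Job exception: Connection timed out' shared/standalone/log/rails/production.log
tail -n 50 shared/standalone/log/rails/production.log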

So I've been doing a lot of

./launcher stop app
./launcher start app

for the past few hours :frowning:.

So, the next time things go down (and apparently stay down), are there any better places to look to figure out what is going on?

FYI, and this might be far too much information: the exact issue I was investigating was using Firefox to log in with two different users via SSO. Even with different external IDs, they kept getting mapped onto each other every time they logged in (the Discourse user switched from one external ID to the other). Both have the same email address (on purpose). Several other temporary read-only users with different external IDs and the same email work just fine. These two seem to map based on email instead of external ID. SSO overrides are off, etc. Really fishy behaviour, which I of course have to investigate in case it could happen to an admin user.

Of course, tomorrow Finland is celebrating its 100-year independence day, so this was happening at just the right time…


#2

Do you have any external plugins enabled?


(Peter Bäckgren) #3

None. SSO is the only external thing really (and that worked flawlessly prior to the upgrade).

No crashes for hours now. That could indicate it's some particular action that crashes things (though it was crashing even while I was just looking at logs), or that the OS upgrade had some background process running that has completed by now.


#4

What does your memory usage look like? Did you run ./launcher cleanup?
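
Something like this, run on the host, will show it (the docker stats output format varies a bit between Docker versions):

free -m
df -h
docker stats --no-stream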


(Jay Pfaffman) #5

I have a standalone instance that is failing similarly, and another multisite two-container instance whose database I keep needing to restart. The database seems to fail at about 8pm PST, which I think is about 4am GMT.

I can’t make sense of it.

I’ve tried rebuilding the data container (and I think these problems started a week or two back when I last rebuilt it) and upgrading Discourse.

I probably upgraded Docker about the same time too.

Typically, if I restart the data container, it'll last until just before 8pm PST the next day.
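
For the record, the restart is just the usual launcher incantation against the data container ("data" is simply what I named it in my two-container setup):

cd /var/discourse
./launcher restart data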


(Peter Bäckgren) #6

Nope, though I did have the usual disk space cleanup during the upgrade. I really wish the Digital Ocean Discourse droplet had a bit more disk space by now (and that all the old ones would get automatically upgraded).

Running it now:
Starting Cleanup (bytes free 5852244)
Finished Cleanup (bytes free 7882796)


(Jeff Atwood) #7

This is a bit curious if Jay can reproduce it. Any ideas @sam?


(Sam Saffron) #8

To me this sounds like Redis connectivity issues. How is Redis configured? How is disk space? Are there any errors specifically in the Redis logs, Docker logs, or machine logs? Maybe you are having connectivity issues between containers. I don't know; I cannot debug any of this without significantly more information.

I definitely cannot make sense of it without specific logs: Postgres logs would help, or Discourse logs, something.
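
On a standard /var/discourse install those live roughly in the places below (container and directory names will differ on a two-container setup):

cd /var/discourse
./launcher logs app                                     # container stdout: nginx, unicorn, runit services
tail -n 200 shared/standalone/log/rails/production.log  # Discourse application log
ls shared/standalone/log/var-log/                       # redis, postgres and other service logs
tail -n 200 /var/log/syslog                             # host messages, including the OOM killer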


(Jay Pfaffman) #9

This seemed like a fluke, but then it was a fluke 3 or more times, and the last 2 or 3 times it has happened at very close to the same time of day (these timestamps are from UptimeRobot, which checks at 5-minute intervals):

2017-12-02 03:48:47
2017-12-02 04:18:31
2017-12-04 03:57:39
2017-12-05 03:48:16

It’s a 4GB droplet. There is 21GB of disk available. There is a separate data container that includes postgres.template.yml and redis.template.yml, with

db_shared_buffers: "1GB"
db_work_mem: "50MB"
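
Concretely, the data container definition has roughly this shape (trimmed to the relevant bits; the values are mine):

templates:
  - "templates/postgres.template.yml"
  - "templates/redis.template.yml"

params:
  db_shared_buffers: "1GB"
  db_work_mem: "50MB"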

The Postgres logs are empty, and I don't see any Redis logs on the data side. In shared/webmulti/log/rails/unicorn.stdout.log-20171202.gz, I have one Connection Refused error connecting to Redis. There seems to be one of those each day, and it apparently goes away when I restart. There are no timestamps.

One guess was that when backups run, the database gets swamped and gives up, but backups wouldn't kill Redis, would they? I tried changing the backup time to different times in several of the sites, but that didn't seem to change anything.

Sidekiq shows no scheduled jobs, but I think there's something I don't understand about Sidekiq and multisite.

syslog has a few lines like

Dec  5 03:53:24 support systemd-udevd[27095]: Could not generate persistent MAC address for veth13a3e61: No such file or directory
Dec  5 03:53:25 support systemd-udevd[27094]: Could not generate persistent MAC address for vethb128a96: No such file or directory

Another possibility is that this is when the backups get copied over SSH to a remote server.

Happy to provide other logs or give @sam access to the server if you like.


(Sam Saffron) #10

Try disabling scheduled backups: does the issue go away?

Try running a backup manually: does it cause the failure?
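
Disabling is just a site setting (search the admin settings for "backup"). A manual backup can be kicked off from Admin → Backups, or from inside the container with something like:

cd /var/discourse
./launcher enter app
discourse backup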


(Jay Pfaffman) #11

Thanks, @sam.

Well, I ran five backups at once (I've got 11 extremely low-volume sites on this server) and that broke it. It's not entirely clear that it broke in the same way. I'll try disabling all the backups and see if that keeps it from dying in the next 4 hours as it has been.

(And on a tangential note, would it make more sense to just do a pg_dump of all of the databases and back up the attachments with rsync?)
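
Something along these lines is what I have in mind (completely untested; container and path names are just placeholders from my setup):

# dump every database from the postgres in the data container
docker exec data su postgres -c pg_dumpall > all-sites.sql
# copy the uploads out of the shared volume to the backup host
rsync -a /var/discourse/shared/webmulti/uploads/ backuphost:/backups/uploads/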

Edit: with all backups disabled, there was no problem last night. I guess next I'll turn them back on, staggered an hour apart.


(Jeff Atwood) #12

Since we now know one thing that can cause this, @Gunnel_Backgren, how much memory does your instance have, and what is the swapfile size?
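
Something like this will show both:

free -h
swapon --show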


(Sam Saffron) #13

I remember that on freeCodeCamp, backups would take out the server unless it had swap defined.
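
For anyone on a small droplet without swap, the usual 2GB swapfile setup is along these lines (run as root):

fallocate -l 2G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab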


(Jay Pfaffman) #14

I’ve got 2GB swap. The Digital Ocean monitoring shows memory usage staying at around 66% as a flat line. OTOH, I imagine that once memory is exhausted, it stops sending updates to Digital Ocean.

Here's the last thing top(1) said (before I quit it, I don't know exactly when):

KiB Swap:  2097148 total,  1346956 free,   750192 used.    47120 avail Mem 

but I can't be sure those numbers are from when the problem happened… The load average did go up over 5 (8, maybe?), and again, this is a 4GB RAM machine.


(Peter Bäckgren) #15

Yes, I obviously tried to make an additional backup just in case when the problems started, so it's good to know I'm not imagining things :smiley:. I haven't been an active Linux user for 5 years, so bear with me… I won't have any problem giving @sam access to the droplet either, if that speeds anything up.

root@Discourse:~# free -mt
              total        used        free      shared  buff/cache   available
Mem:            992         566          80          21         345         216
Swap:          2047         701        1346
Total:         3040        1268        1426

And since I'm at it, although nothing seems full here:

root@Discourse:~# df -l
Filesystem     1K-blocks     Used Available Use% Mounted on
udev              487728        0    487728   0% /dev
tmpfs             101616     3100     98516   4% /run
/dev/vda1       30832488 21520112   7837084  74% /
tmpfs             508064     1732    506332   1% /dev/shm
tmpfs               5120        0      5120   0% /run/lock
tmpfs             508064        0    508064   0% /sys/fs/cgroup
none            30832488 21520112   7837084  74% /var/lib/docker/aufs/mnt/e07d0424a08bf9249a2bac710820918bf6aef62cf0e34e02d26cbea273255bf4
shm                65536        4     65532   1% /var/lib/docker/containers/e59147096e33953e1fff7616ff08826ba9758e15d297ffcf071f9bcb63a16218/shm
tmpfs             101616        0    101616   0% /run/user/0

Totally off-topic: is there some smart way to show the above in Discourse as fixed-width text, i.e. as readable output with proper spacing?


(Jeff Atwood) #16

Sure, refer to http://commonmark.org/help; there is a 10-minute interactive tutorial there as well.
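
The short version for terminal output like that is a fenced code block: put a line with three backticks before the output and another after it, for example:

```
free -mt
(command output here, spacing preserved)
```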


(Peter Bäckgren) #17

So, has this issue been investigated or even resolved? I couldn't find anything memory-related in the December GitHub commits, so I'm guessing not? Well, I'm upgrading one beta notch anyway, just in case.

Current status:

I tried doing absolutely nothing to the system for as long as possible (only posting). At the second automatic backup it was time to restart the app again.

It is mildly interesting that apparently two backups in a row were happening at the time of the crash, when normally there is only one.

I guess I need to set aside some time for this during the Christmas holidays…


(Sam Saffron) #18

We have plans to fix this, just not sure if it will be this year or next.

We need to eliminate the forking from the backup process.


(Dmitry Krasnoperov) #19

Yep, I upgraded Docker as well.
Same problem here. Every morning it starts returning 504s.


(Sam Saffron) #20

Can you update to the latest tests-passed build, which includes:

And try again; please report back about memory usage ASAP!
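
For reference, getting onto the latest tests-passed build is either the web updater at /admin/upgrade or a full rebuild from the host (assuming app.yml still points at the default version: tests-passed):

cd /var/discourse
git pull
./launcher rebuild app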