Getting 500 errors when writing to the database (post, reply, heart, etc)

(Quincy Larson) #1

We’re getting 500s every time we try to write to the database (post, reply, heart, etc).

We haven’t had a new post in more than an hour (usually during this time, there would be a post every 5 - 10 minutes) so it looks like this started about an hour ago. Could anyone help us diagnose this real quick?

(Quincy Larson) #2

Update: I seem to be able to write to the database again (through replying, etc.) and I’m no longer getting 500s. I still think there was a partial outage though, because so much time passed without any posts despite having ~70 concurrent users on the Discourse instance.

(Daniela) #3

Check the error logs on your site.

(Jay Pfaffman) #4

My guess is disk space.

(Jeff Atwood) #5

As @trash said, check /logs in your web browser while logged in as admin.

(Quincy Larson) #6

Here’s the log from the time we experienced this:

Times were PST.

(Jeff Atwood) #7

Hmm, not seeing any 500s there. Anyplace else to look in the logs @sam in this case?

(Sam Saffron) #8

Having a look: disk space looks fine, CPU is fine…
I can’t see any 500s in the logs, but there is this:

Job exception: FATAL:  the database system is in recovery mode

So somehow the DB got into recovery mode.

@ossia, it looks like it magically recovered. Let us know if this happens again.
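
For anyone who hits the same message, here is a minimal sketch for counting how often it shows up in a log file. The function name is made up for illustration, and the path in the usage comment is an assumption about a standard Docker install, so adjust it to your setup.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count log lines reporting that PostgreSQL was in
# recovery mode. Takes the path of the log file to scan.
count_recovery_errors() {
  local logfile="$1"
  # grep -c prints the number of matching lines (0 if none).
  # "|| true" keeps grep's non-zero exit status on zero matches from
  # aborting scripts that run with "set -e".
  grep -c 'the database system is in recovery mode' "$logfile" || true
}

# Usage (the path is an assumption; adjust to your install):
# count_recovery_errors /var/discourse/shared/standalone/log/rails/production.log
```

A non-zero count only tells you the database restarted or crashed at some point. It does not say why, so it is a starting point rather than a diagnosis.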

(Jeff Atwood) #9

Is this because we have the DB on the slow external disks at Digital Ocean? That is increasingly looking more and more like a liability.

(Sam Saffron) #10

No, the DB is on the SSD not on the external volume

(Quincy Larson) #11

OK - this seems to have happened again today: Error 500 posting a message - Contributors - freeCodeCamp Forum

(Sam Saffron) #12

Are you sure that is a report from today and not a leftover from yesterday? I’m not seeing anything new in the logs.

(Quincy Larson) #13

This is from a regular. I can’t see why he would write such a post nearly a day after he experienced the issue.

I’ve asked him to be sure.

(Sam Saffron) #14

Redis is not saving properly for some reason. I am just doing a rebuild to see what happens. Sorry for the downtime.

Note: the NGINX logs show that you are being flooded by certain IP addresses, and we are doing what we can to slow them down.

EDIT: The rebuild appears to have corrected the Redis issue.
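
To see which addresses are responsible for a flood like this, a rough sketch along these lines can help. The function name and its parameters are hypothetical, and which field holds the client IP depends on the `log_format` your NGINX uses, so treat the default as an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the busiest client addresses in an access log.
#   $1 = log file
#   $2 = whitespace-separated field holding the client IP (default 1,
#        which matches NGINX's stock "combined" format; a custom
#        log_format may put it elsewhere)
#   $3 = how many addresses to show (default 10)
top_client_ips() {
  local logfile="$1" field="${2:-1}" limit="${3:-10}"
  # Extract the chosen field, count identical values, and sort by count.
  awk -v f="$field" '{ print $f }' "$logfile" \
    | sort | uniq -c | sort -rn | head -n "$limit"
}

# Usage (the path is an assumption; adjust to your install):
# top_client_ips /var/log/nginx/access.log 1 20
```

Each output line is a request count followed by the address, busiest first.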

(Quincy Larson) #15

Hi Sam,

I just checked and we’re getting 500 errors right now - I can’t even access the forum.

(Sam Saffron) #16

I’m on holidays atm, can you try a rebuild?

(Daniela) #17

Your forum is online (I can see the safe-mode page). Disable all the unofficial plugins in your yml file, then try to clean up and rebuild:

apt-get autoclean
apt-get autoremove
cd /var/discourse
./launcher cleanup

and then

git pull
./launcher rebuild app

(Jeff Atwood) #18

The database was killed by the kernel’s out-of-memory (OOM) killer.

The underlying problem IMO is that the droplet was not configured with any swap space, and I’ve fixed that by adding swap space, but I don’t understand why the usage patterns are so bizarre on this particular Discourse instance.
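
For others who want to check for the same thing, here is a minimal sketch, not necessarily the exact steps used on this droplet. All three function names are hypothetical. The swap file path and size are placeholder defaults, and the last function must be run as root.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print kernel log lines that look like OOM-killer
# activity. Reads the file given as $1, or stdin if no file is given,
# so it can be fed "dmesg" output or a saved kernel log.
find_oom_kills() {
  grep -iE 'out of memory|oom-killer|killed process' "$@" || true
}

# Hypothetical helper: print the configured swap size in kB.
# $1 defaults to /proc/meminfo; 0 means no swap is configured.
swap_total_kb() {
  awk '/^SwapTotal:/ { print $2 }' "${1:-/proc/meminfo}"
}

# Hypothetical helper: create and enable a swap file (run as root).
#   $1 = path (default /swapfile), $2 = size in MB (default 2048)
create_swapfile() {
  local path="${1:-/swapfile}" size_mb="${2:-2048}"
  dd if=/dev/zero of="$path" bs=1M count="$size_mb"
  chmod 600 "$path"   # swap files should not be readable by other users
  mkswap "$path"
  swapon "$path"
}

# Usage:
# dmesg | find_oom_kills
# swap_total_kb
```

A swap file enabled this way does not survive a reboot unless it is also listed in /etc/fstab.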

(Daniela) #19

To be honest, something similar happened to me last night, but the process that got killed was just the backup.
We had 4 GB of RAM (without swap) and never even approached the 4 GB limit until yesterday (we’ve always been under 2 GB). Today we added 4 GB of swap to prevent another out-of-memory event.

(Jeff Atwood) #21

Hmm, so the backup process specifically triggered this on your site?