Bursts of 502 Bad Gateway errors, pointers to debug


(Soumith Chintala) #1

Hi,

I run discuss.pytorch.org, hosted on DigitalOcean. A couple of nights ago, the website became REALLY slow, and we were seeing nginx “502 Bad Gateway” pages.
Plenty of disk space was available, and nothing had changed recently.
htop showed plenty of processing headroom; the cores were not fully occupied.
The only thing I remember changing on the machine in a recent time window was an apt dist-upgrade a week earlier to pick up the Spectre/Meltdown patches.

After looking around on the forums and finding no relevant references or pointers, I decided to take a backup and then nuke the machine.

I created a fresh machine, updated 16.04 to the latest kernel and patches, and deployed a new Discourse install. It was REALLY slow and was 502-ing. I presumed that the upgrades were probably the issue.

So, I created yet another fresh machine, didn’t update 16.04 and deployed a new discourse install. Then I restored from backup. The site was back to normal performance.

However, this afternoon, for a period of an hour or so, users saw 502 Bad Gateway errors, and then the site came back to normal performance.

My question is:

  • can I get any pointers on handling such situations from a sysadmin perspective? What exactly should I look for, and is there any log parsing that would help?
  • are there any recent known issues across the community with Discourse installs showing elevated 502 errors?

(Jeff Atwood) #2

Sounds to me like out of memory.


(Soumith Chintala) #3

Sorry, I should’ve mentioned that I also monitored memory. Plenty of memory is available (about 50% occupied). I got the 4 CPU / 8 GB memory / 160 GB disk instance, and it almost never peaks on any of those resources.


(Sam Saffron) #4

This is going to cost you in performance; it’s unavoidable.

What version of Discourse are you running? tests-passed branch?


(Soumith Chintala) #5

I am running whatever came (at that time) with the auto-updater, i.e. v2.0.0.beta2 +144 – tests-passed.

The flakiness hasn’t repeated since, so I think the thread is good to close.
But any pointers on what you (as the core devs) would hunt down in such situations – which log files or other things to check – would be helpful, so that I can bother you less and become self-sufficient :slight_smile:


(Jeff Atwood) #6

Any ideas on where to look, @mpalmer? In a general sense? There might be a good post you wrote about this somewhere on meta?


(Matt Palmer) #7

I can’t think of anything I’ve written that covers tracing a request through the layers. I’d suggest starting with @Sam’s topic on where the wild logs are and working from the outside in – find a failing request in the logs for the outermost proxy (either nginx in the container, or the host’s nginx/apache if it’s set up that way), then find the corresponding entry at the next hop (unicorn, or container nginx), and so on.
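As a sketch of that outside-in walk (the paths in the comments assume a standard Docker-based install and are worth double-checking on your own machine), the first step is to pull the failing requests out of the outermost access log; a sample line stands in for the real file here:

```shell
# Hypothetical outside-in trace, step 1: find the 502s at the outermost layer.
# On a standard Docker install the real files live inside the container, e.g.
# /var/log/nginx/access.log and /var/www/discourse/log/unicorn.stderr.log
# (paths are an assumption; check your own setup). A sample line in nginx's
# default "combined" log format stands in for the real log:
printf '%s\n' '10.0.0.1 - - [17/Feb/2018:19:44:21 +0000] "GET /latest HTTP/1.1" 502 568 "-" "curl/7.47"' > access.sample.log

# Print the timestamp and path of every request nginx answered with a 502,
# so you can then look for the matching entry at the next hop (unicorn).
awk '$9 == 502 {print $4, $7}' access.sample.log
```

From there, grep the unicorn/container logs for entries around the same timestamp, and repeat one hop at a time until the failing layer is found.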


(Soumith Chintala) #8

Thanks a lot. I’ve been seeing 502s again, and I see full CPU usage. I’ll start following the wild logs and see how far I get.


(Soumith Chintala) #9

The post with the pointer to logs is super helpful.

Lots of workers are timing out:

unicorn.stderr.log

E, [2018-02-17T19:44:21.208533 #800] ERROR -- : worker=4 PID:29923 timeout (31s > 30s), killing
I, [2018-02-17T19:44:29.211582 #29979]  INFO -- : worker=5 ready
I, [2018-02-17T19:44:29.699253 #29990]  INFO -- : worker=7 ready
I, [2018-02-17T19:44:31.985875 #30008]  INFO -- : worker=2 ready
I, [2018-02-17T19:44:46.317249 #30050]  INFO -- : worker=4 ready
E, [2018-02-17T19:45:57.690129 #800] ERROR -- : worker=5 PID:29979 timeout (36s > 30s), killing
E, [2018-02-17T19:46:07.212510 #800] ERROR -- : worker=3 PID:29952 timeout (35s > 30s), killing
E, [2018-02-17T19:46:07.221215 #800] ERROR -- : worker=7 PID:29990 timeout (33s > 30s), killing
E, [2018-02-17T19:46:07.783812 #800] ERROR -- : worker=7 PID:29990 timeout (33s > 30s), killing
I, [2018-02-17T19:46:44.135563 #30165]  INFO -- : worker=3 ready
I, [2018-02-17T19:46:44.840466 #30156]  INFO -- : worker=5 ready
I, [2018-02-17T19:46:46.952119 #30174]  INFO -- : worker=7 ready
E, [2018-02-17T19:46:59.823120 #800] ERROR -- : worker=2 PID:30008 timeout (31s > 30s), killing
E, [2018-02-17T19:47:43.274893 #800] ERROR -- : worker=7 PID:30174 timeout (31s > 30s), killing
E, [2018-02-17T19:47:46.393225 #800] ERROR -- : worker=1 PID:29767 timeout (33s > 30s), killing
E, [2018-02-17T19:47:46.579485 #800] ERROR -- : worker=5 PID:30156 timeout (31s > 30s), killing
I, [2018-02-17T19:47:46.686074 #30254]  INFO -- : worker=2 ready
E, [2018-02-17T19:47:46.894589 #800] ERROR -- : worker=1 PID:29767 timeout (33s > 30s), killing
E, [2018-02-17T19:47:49.736359 #800] ERROR -- : worker=3 PID:30165 timeout (32s > 30s), killing
I, [2018-02-17T19:48:14.864398 #30307]  INFO -- : worker=5 ready
I, [2018-02-17T19:48:16.177603 #30336]  INFO -- : worker=3 ready
I, [2018-02-17T19:48:16.345038 #30322]  INFO -- : worker=1 ready
I, [2018-02-17T19:48:20.622346 #30296]  INFO -- : worker=7 ready
D, [2018-02-17T19:48:22.908873 #800] DEBUG -- : waiting 16.0s after suspend/hibernation
E, [2018-02-17T19:49:22.238502 #800] ERROR -- : worker=5 PID:30307 timeout (31s > 30s), killing
E, [2018-02-17T19:49:31.480543 #800] ERROR -- : worker=4 PID:30050 timeout (32s > 30s), killing
E, [2018-02-17T19:49:31.588904 #800] ERROR -- : worker=1 PID:30322 timeout (33s > 30s), killing
E, [2018-02-17T19:49:32.221465 #800] ERROR -- : worker=1 PID:30322 timeout (34s > 30s), killing
E, [2018-02-17T19:49:45.783025 #800] ERROR -- : worker=7 PID:30296 timeout (32s > 30s), killing
E, [2018-02-17T19:49:50.598973 #800] ERROR -- : worker=3 PID:30336 timeout (31s > 30s), killing
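A quick tally of kill lines like the ones above shows whether a single worker is stuck or the whole pool is thrashing. Run here against a few sample lines in the same format, since the real file lives at /var/www/discourse/log/unicorn.stderr.log inside the container (path is an assumption for a standard Docker install):

```shell
# Write a few sample lines in unicorn.stderr.log's format.
cat > unicorn.sample.log <<'EOF'
E, [2018-02-17T19:44:21.208533 #800] ERROR -- : worker=4 PID:29923 timeout (31s > 30s), killing
E, [2018-02-17T19:45:57.690129 #800] ERROR -- : worker=5 PID:29979 timeout (36s > 30s), killing
E, [2018-02-17T19:46:07.212510 #800] ERROR -- : worker=3 PID:29952 timeout (35s > 30s), killing
E, [2018-02-17T19:46:07.221215 #800] ERROR -- : worker=5 PID:29990 timeout (33s > 30s), killing
EOF

# Count timeout kills per worker id, most-killed first. If the counts are
# spread evenly across workers (as in this thread), the whole pool is
# starved rather than one request being stuck.
grep -o 'worker=[0-9]*' unicorn.sample.log | sort | uniq -c | sort -rn
```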

After digging in, it looks like postgres is being hammered, with a lot of queries taking 700+ ms. One query took 50 seconds.
I’ve posted the last 1000 lines here: foo.log · GitHub
Edit: deleted log – security

If anything looks weird or non-standard, I’d appreciate if anyone lets me know.

For now I’ll try upgrading to an even beefier instance and see if that helps.
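One knob that helps catch queries like these is postgres’s built-in slow-query log, log_min_duration_statement. The real postgresql.conf is managed by the Discourse container templates, so the sample file below only stands in for the idea, not the actual path:

```shell
# Sketch: enable postgres's slow-query log so statements over 500 ms are
# written to the postgres log. The sample file below mimics the stock
# commented-out default; the real config is managed by the container.
cat > postgresql.sample.conf <<'EOF'
#log_min_duration_statement = -1
EOF

# Uncomment the setting and log any statement slower than 500 ms.
sed -i 's/^#log_min_duration_statement = -1/log_min_duration_statement = 500/' postgresql.sample.conf
grep '^log_min_duration_statement' postgresql.sample.conf
```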


(Jeff Atwood) #10

Can you quote the stats from your /about page? Did you migrate data from another forum software? It seems like there is something strange about your database.


(Soumith Chintala) #11

Did you migrate data from another forum software?

No. Started fresh with Discourse and nothing else.

Here are stats:



(Sam Saffron) #12

I am going to need to see the SQL log.


(Soumith Chintala) #13

@sam thanks for offering to look at the SQL logs – I sent you a message with them.


(Sam Saffron) #14

I had a look at the log there; nothing sticks out. It just looks like the server is totally overloaded.


(Sam Saffron) #15

Looking at that screenshot, the high CPU usage is coming from the app: the unicorns and sidekiq are all very high there. You can see that even htop is taking tons of CPU in that case. My guess is that you have some very bursty traffic headed your way.

Also, looking at your machine, running 7/8 unicorns seems like overkill. I would maybe run 5 in your current setup; otherwise there is a very high risk of a thundering herd taking everyone down.
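For reference, a sketch of where that would go, assuming the standard containers/app.yml layout of a Docker install (rebuild with ./launcher rebuild app afterwards to apply it):

```yaml
# containers/app.yml (fragment; assumes the standard Docker install layout)
env:
  # Drop the unicorn worker count to 5 to reduce the
  # thundering-herd risk on a 4-vCPU droplet.
  UNICORN_WORKERS: 5
```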

Can you include the expanded screenshot of detailed traffic reports? (bottom of dashboard)


(Soumith Chintala) #16

Yesterday, I upgraded to a VERY beefy instance: a CPU-Optimized Droplet with 16 vCPUs (previously a Standard Droplet with 4 vCPUs).
It’s overkill for the workload, but I wanted to upgrade first and then gradually scale down. It’s working extremely well.

And @sam, you were right about the bursty traffic. Looking at the traffic logs, I see a massive spike in server errors right around the crashes.

[The site was restored from backup mid-week last week to rule out instance-level issues, so the traffic report might be blank before that date.]