Bursts of 502 Bad Gateway errors, pointers to debug


(Soumith Chintala) #1

Hi,

I run discuss.pytorch.org, hosted on DigitalOcean. A couple of nights ago, the website became REALLY slow, and we were seeing nginx “502 Bad Gateway” pages.
Plenty of disk space was available, and nothing had changed recently.
htop showed plenty of processing headroom; the cores were not fully occupied.
The only thing I remember changing on the machine in a recent time window was an apt dist-upgrade a week earlier to pick up the Spectre/Meltdown patches.

After looking around on the forums and finding no relevant references or pointers, I decided to take a backup and then nuke the machine.

I created a fresh machine, updated 16.04 to the latest kernel and patches, and deployed a new Discourse install. It was REALLY slow and was 502-ing. I presumed that the upgrades were probably the issue.

So, I created yet another fresh machine, didn’t update 16.04 and deployed a new discourse install. Then I restored from backup. The site was back to normal performance.

However, this afternoon, for a period of an hour or so, users saw 502 Bad Gateway errors, and then the site came back to normal performance.

My question is:

  • can I get any pointers on handling such situations from a sysadmin perspective? What exactly should I look for, and is there any log parsing that would help?
  • are there any recent known issues across the community with Discourse installs showing elevated 502 errors?

(Jeff Atwood) #2

Sounds to me like out of memory.


(Soumith Chintala) #3

Sorry, I should’ve mentioned that I also monitored memory. Plenty of memory is available (about 50% occupied). I got the 4 CPU / 8 GB memory / 160 GB disk instance, and it almost never peaks on any of those resources.


(Sam Saffron) #4

This is going to cost you in performance; it’s unavoidable.

What version of Discourse are you running? tests-passed branch?


(Soumith Chintala) #5

I am running whatever came (at that time) with the auto-updater, i.e. v2.0.0.beta2 +144 – tests-passed.

The flakiness hasn’t repeated since, so I think the thread is good to close.
But any pointers on what you (as the core devs) would hunt down in such situations – which log files or other things to check – would be helpful, so that I can bother you less and become self-sufficient :slight_smile:


(Jeff Atwood) #6

Any ideas on where to look, @mpalmer? In a general sense? There might be a good post you wrote about this somewhere on meta?


(Matt Palmer) #7

I can’t think of anything I’ve written that covers tracing a request through the layers. I’d suggest starting with @Sam’s topic on where the wild logs are and working from the outside in – find a failing request in the logs for the outermost proxy (either nginx in the container, or the host’s nginx/apache if it’s set up that way), then find the corresponding entry at the next hop (unicorn, or container nginx), and so on.
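As a sketch of that outside-in walk (the paths in the comments assume a standard Docker-based install and are worth double-checking on your own machine), the first step is to pull the failing requests out of the outermost access log; a sample line stands in for the real file here:

```shell
# Hypothetical outside-in trace, step 1: find the 502s at the outermost layer.
# On a standard Docker install the real files live inside the container, e.g.
# /var/log/nginx/access.log and /var/www/discourse/log/unicorn.stderr.log
# (paths are an assumption; check your own setup). A sample line in nginx's
# default "combined" log format stands in for the real log:
printf '%s\n' '10.0.0.1 - - [17/Feb/2018:19:44:21 +0000] "GET /latest HTTP/1.1" 502 568 "-" "curl/7.47"' > access.sample.log

# Print the timestamp and path of every request nginx answered with a 502,
# so you can then look for the matching entry at the next hop (unicorn).
awk '$9 == 502 {print $4, $7}' access.sample.log
```

From there, grep the unicorn/container logs for entries around the same timestamp, and repeat one hop at a time until the failing layer is found.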


(Soumith Chintala) #8

Thanks a lot. I’ve been seeing 502s again, and I see full CPU usage. I’ll start following the wild logs and see how far I get.


(Soumith Chintala) #9

The post with the pointer to logs is super helpful.

Lots of workers are timing out:

unicorn.stderr.log

E, [2018-02-17T19:44:21.208533 #800] ERROR -- : worker=4 PID:29923 timeout (31s > 30s), killing
I, [2018-02-17T19:44:29.211582 #29979]  INFO -- : worker=5 ready
I, [2018-02-17T19:44:29.699253 #29990]  INFO -- : worker=7 ready
I, [2018-02-17T19:44:31.985875 #30008]  INFO -- : worker=2 ready
I, [2018-02-17T19:44:46.317249 #30050]  INFO -- : worker=4 ready
E, [2018-02-17T19:45:57.690129 #800] ERROR -- : worker=5 PID:29979 timeout (36s > 30s), killing
E, [2018-02-17T19:46:07.212510 #800] ERROR -- : worker=3 PID:29952 timeout (35s > 30s), killing
E, [2018-02-17T19:46:07.221215 #800] ERROR -- : worker=7 PID:29990 timeout (33s > 30s), killing
E, [2018-02-17T19:46:07.783812 #800] ERROR -- : worker=7 PID:29990 timeout (33s > 30s), killing
I, [2018-02-17T19:46:44.135563 #30165]  INFO -- : worker=3 ready
I, [2018-02-17T19:46:44.840466 #30156]  INFO -- : worker=5 ready
I, [2018-02-17T19:46:46.952119 #30174]  INFO -- : worker=7 ready
E, [2018-02-17T19:46:59.823120 #800] ERROR -- : worker=2 PID:30008 timeout (31s > 30s), killing
E, [2018-02-17T19:47:43.274893 #800] ERROR -- : worker=7 PID:30174 timeout (31s > 30s), killing
E, [2018-02-17T19:47:46.393225 #800] ERROR -- : worker=1 PID:29767 timeout (33s > 30s), killing
E, [2018-02-17T19:47:46.579485 #800] ERROR -- : worker=5 PID:30156 timeout (31s > 30s), killing
I, [2018-02-17T19:47:46.686074 #30254]  INFO -- : worker=2 ready
E, [2018-02-17T19:47:46.894589 #800] ERROR -- : worker=1 PID:29767 timeout (33s > 30s), killing
E, [2018-02-17T19:47:49.736359 #800] ERROR -- : worker=3 PID:30165 timeout (32s > 30s), killing
I, [2018-02-17T19:48:14.864398 #30307]  INFO -- : worker=5 ready
I, [2018-02-17T19:48:16.177603 #30336]  INFO -- : worker=3 ready
I, [2018-02-17T19:48:16.345038 #30322]  INFO -- : worker=1 ready
I, [2018-02-17T19:48:20.622346 #30296]  INFO -- : worker=7 ready
D, [2018-02-17T19:48:22.908873 #800] DEBUG -- : waiting 16.0s after suspend/hibernation
E, [2018-02-17T19:49:22.238502 #800] ERROR -- : worker=5 PID:30307 timeout (31s > 30s), killing
E, [2018-02-17T19:49:31.480543 #800] ERROR -- : worker=4 PID:30050 timeout (32s > 30s), killing
E, [2018-02-17T19:49:31.588904 #800] ERROR -- : worker=1 PID:30322 timeout (33s > 30s), killing
E, [2018-02-17T19:49:32.221465 #800] ERROR -- : worker=1 PID:30322 timeout (34s > 30s), killing
E, [2018-02-17T19:49:45.783025 #800] ERROR -- : worker=7 PID:30296 timeout (32s > 30s), killing
E, [2018-02-17T19:49:50.598973 #800] ERROR -- : worker=3 PID:30336 timeout (31s > 30s), killing
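A quick tally of kill lines like the ones above shows whether a single worker is stuck or the whole pool is thrashing. Run here against a few sample lines in the same format, since the real file lives at /var/www/discourse/log/unicorn.stderr.log inside the container (path is an assumption for a standard Docker install):

```shell
# Write a few sample lines in unicorn.stderr.log's format.
cat > unicorn.sample.log <<'EOF'
E, [2018-02-17T19:44:21.208533 #800] ERROR -- : worker=4 PID:29923 timeout (31s > 30s), killing
E, [2018-02-17T19:45:57.690129 #800] ERROR -- : worker=5 PID:29979 timeout (36s > 30s), killing
E, [2018-02-17T19:46:07.212510 #800] ERROR -- : worker=3 PID:29952 timeout (35s > 30s), killing
E, [2018-02-17T19:46:07.221215 #800] ERROR -- : worker=5 PID:29990 timeout (33s > 30s), killing
EOF

# Count timeout kills per worker id, most-killed first. If the counts are
# spread evenly across workers (as in this thread), the whole pool is
# starved rather than one request being stuck.
grep -o 'worker=[0-9]*' unicorn.sample.log | sort | uniq -c | sort -rn
```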

After digging in, it looks like postgres is being hammered, with a lot of queries taking 700+ ms. One query took 50 seconds.
I’ve posted the last 1000 lines here: foo.log · GitHub
Edit: deleted log – security

If anything looks weird or non-standard, I’d appreciate if anyone lets me know.

For now I’ll try upgrading to an even beefier instance and see if that helps.
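One knob that helps catch queries like these is postgres’s built-in slow-query log, log_min_duration_statement. The real postgresql.conf is managed by the Discourse container templates, so the sample file below only stands in for the idea, not the actual path:

```shell
# Sketch: enable postgres's slow-query log so statements over 500 ms are
# written to the postgres log. The sample file below mimics the stock
# commented-out default; the real config is managed by the container.
cat > postgresql.sample.conf <<'EOF'
#log_min_duration_statement = -1
EOF

# Uncomment the setting and log any statement slower than 500 ms.
sed -i 's/^#log_min_duration_statement = -1/log_min_duration_statement = 500/' postgresql.sample.conf
grep '^log_min_duration_statement' postgresql.sample.conf
```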


(Jeff Atwood) #10

Can you quote the stats from your /about page? Did you migrate data from another forum software? It seems like there is something strange about your database.


(Soumith Chintala) #11

Did you migrate data from another forum software?

No. Started fresh with Discourse and nothing else.

Here are stats:



(Sam Saffron) #12

I am going to need to see the SQL log.


(Soumith Chintala) #13

@sam thanks for offering to look at the SQL logs – I sent you a message with them.


(Sam Saffron) #14

I had a look at the log there; nothing sticks out. It just looks like the server is totally overloaded.


(Sam Saffron) #15

Looking at that screenshot, the high CPU usage is coming from the app: the unicorns and sidekiq are all very high there. You can see that even htop is taking tons of CPU in that case. My guess is that you have some very bursty traffic headed your way.

Also, looking at your machine, running 7/8 unicorns seems like overkill. I would maybe run 5 in your current setup; otherwise there is a very high risk of a thundering herd taking everyone down.
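For reference, a sketch of where that would go, assuming the standard containers/app.yml layout of a Docker install (rebuild with ./launcher rebuild app afterwards to apply it):

```yaml
# containers/app.yml (fragment; assumes the standard Docker install layout)
env:
  # Drop the unicorn worker count to 5 to reduce the
  # thundering-herd risk on a 4-vCPU droplet.
  UNICORN_WORKERS: 5
```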

Can you include the expanded screenshot of detailed traffic reports? (bottom of dashboard)


(Soumith Chintala) #16

Yesterday, I upgraded to a VERY beefy instance: a CPU-Optimized Droplet with 16 vCPUs (previously a Standard Droplet with 4 vCPUs).
It’s overkill for the workload, but I wanted to upgrade first and then gradually scale down. It’s working extremely well.

And @sam, you were right about the bursty traffic. Looking at the traffic logs, I see a massive spike in server errors right around the crashes.

[The site was restored from backup mid-week last week to rule out instance-level issues, so the traffic report might be blank before that date.]