Connection timed out while connecting to upstream on AWS

I’m hosting a Discourse instance (running latest) on AWS with 7.5 GB of RAM and 2 CPUs (it’s an m3.large).

Discourse sometimes works okay, but at other times requests seem to get stuck and people either see an infinite loading spinner or get a 502 from nginx.

Taking a look at our error.log from nginx, the following message pops up at these times:

upstream timed out (110: Connection timed out) while connecting to upstream, client: 187.62.216.182, server: _, request: "GET / HTTP/1.1", upstream: "http://127.0.0.1:3000/", host: "www.guj.com.br"

This seems strange, as most of the time the machine still has 2–3 GB of free memory, and neither the CPU nor RDS shows any excessive usage.
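For reference, here is a rough way to count how often this happens from the nginx error.log, grouped by hour (the path below assumes a standard Docker install; adjust it to wherever your error.log actually lives):

```
# count "upstream timed out" errors per hour, to see whether they correlate
# with traffic peaks, cron jobs, etc.
grep 'upstream timed out' /var/discourse/shared/standalone/log/var-log/nginx/error.log \
  | awk '{print $1, substr($2, 1, 2) ":00"}' \
  | sort | uniq -c | sort -rn | head
```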

Any ideas on tunings we could try, or on what we could be doing wrong?

With that amount of RAM, I would start by adding some cores.

I have an instance 5–10x bigger than GUJ, with 8 GB of RAM and 8 cores, using the standard install (everything in one container), and it’s under 50% at peak.

Maybe move to running PostgreSQL inside the container and increase the number of cores?
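As a rough sketch, on a standard install that tuning mostly lives in `containers/app.yml` (the variable names below come from the standard template; the actual values depend on your hardware):

```
cd /var/discourse
# the usual knobs: UNICORN_WORKERS (env section) and db_shared_buffers (params
# section); a common rule of thumb is roughly two workers per core (memory
# permitting) and about a quarter of the RAM for shared buffers
grep -E 'UNICORN_WORKERS|db_shared_buffers' containers/app.yml
# after editing containers/app.yml, rebuild so the new values take effect
./launcher rebuild app
```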

PS: We really could join forces to get the PT-BR translation up to date, couldn’t we?

Hi, thanks for your reply.

Are you running on AWS with that setup? With their machines, increasing the number of cores also increases the amount of memory (we already have 7.5 GB and don’t use it), so upgrading our instance could mean overpaying for memory we don’t need.

Re your PS: It’s in our plans to try to contribute more to Discourse.

I’m on a private cloud.

Looking through the AWS instance types, they aren’t that great. Maybe we should debug your problems better before spending money.

Since you’ve come from a big import (> 2M posts), the database may be struggling. Do you see the timings in the upper left when browsing the forum? Check there to find out whether SQL performance is the problem.

When opening the home page (/), the SQL query is what takes longest; just now, for instance, it ran in 300ms out of a total of 485ms.

But if it always took about that long, I can’t see how the request could time out because of the query.

The funny thing is, even though RDS sometimes reaches a peak of 80–90% CPU usage, plenty of times we have experienced these instabilities with the database using almost none of its CPU and memory. I don’t know if that’s an indication that it isn’t the direct cause, even if it could still contribute to the problem.

After refreshing a bunch of times, I got the query to take 10710.6ms once.

I’m not sure what would make it take that long (my RDS isn’t even at 7% of its CPU utilization). Could my database be missing important indexes or something similar?
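One cheap check I can run (the host, user, and database names below are placeholders for our RDS endpoint) is to look for tables that are being sequentially scanned a lot, which after a big import can point at missing indexes or stale planner statistics:

```
# tables with many sequential scans and stale/never-run autoanalyze are the
# usual suspects after a large import
psql -h your-rds-endpoint -U discourse -d discourse -c "
  SELECT relname, seq_scan, idx_scan, n_live_tup, n_dead_tup, last_autoanalyze
  FROM pg_stat_user_tables
  ORDER BY seq_scan DESC
  LIMIT 15;"
```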

Home on GUJ is /categories, right?

A GET /categories.json takes 262ms with only 24.2ms on 18 queries (9%) for me.

Your database seems to be performing very badly. People talk shit about RDS, but I didn’t know it was that bad.

You can try to VACUUM VERBOSE ANALYZE the entire database and see if that helps. If you expand the timings and look at the slow queries, you can target the slow tables first.
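For example, something along these lines against the RDS endpoint (connection details are placeholders); it can take a while on a 2M+ post database, so off-peak hours are a good idea:

```
# VACUUM VERBOSE ANALYZE prints per-table output, which also helps spot
# tables with lots of dead rows
psql -h your-rds-endpoint -U discourse -d discourse -c "VACUUM VERBOSE ANALYZE;"
```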

You are right, we’ve made /categories our home page (/).

I can try running VACUUM and checking if it outputs anything relevant. I’ll post the results here as soon as I get to it. Thanks :smile:

We have seen very slow database performance cripple a site before. It’s been reported at least twice here on meta with the reproduction steps being, “have a terribly slow database” :wink:

Yeah, it’s that bad. Even more fun is that it’s a lot better than it used to be. :scream: RDS also uses tiers of EBS performance that you can’t get on a regular instance, so I have no idea how they can manage to make it perform worse than running the DB on a regular instance. My mind, she is boggled.

@Falco, I would be really glad if it turns out that RDS is the one to blame. But we see this behavior even at 3am, with no one using the site…

If you browse the CircleCI Discourse forum at https://discuss.circleci.com/, you will also get the same 502 from nginx from time to time, even on such a small forum…

This seems like a problem that should be fairly simple to diagnose if you apply the scientific method. It definitely seems isolated to the application, so start by examining the application logs (/var/discourse/shared/<name>/logs/rails/production.log) and see what they report. The unicorn logs (in the same directory) might also show something like a worker timeout. From there, make a hypothesis as to what might be going wrong based on the data available, design an experiment, and run it. We can only guess wildly at what might be the problem, and while that’s fun (and, if you’re lucky, might give you the answer), it’s a really poor use of everyone’s time.
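For example, something like this as a starting point (using the paths above; replace `<name>` with your container’s shared directory name, and note the unicorn log filename assumes a typical install):

```
# recent application errors and slow-request warnings
tail -n 200 /var/discourse/shared/<name>/logs/rails/production.log

# worker timeouts usually show up in the unicorn logs in the same directory
# (unicorn.stderr.log in a typical install)
grep -iE 'timeout|killing' /var/discourse/shared/<name>/logs/rails/unicorn.stderr.log | tail
```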

My hypothesis is that there is a corner case in the latest versions of Discourse that generates heavy SQL queries for users with a lot of posts (migrated posts, by the way). We will try to investigate it further, take a look at the query speed, and maybe try New Relic. We will keep you posted if we find real data. The production.log did not offer any insights.