Connection timed out while connecting to upstream on AWS

I’m hosting a Discourse instance (running latest) on AWS with 7.5 GB of RAM and 2 CPUs (it’s an m3.large).

Discourse sometimes works okay, but at other times requests seem to get stuck and people either see an infinite loading spinner or get a 502 from nginx.

Taking a look at our error.log from nginx, the following message pops up at these times:

upstream timed out (110: Connection timed out) while connecting to upstream, client: 187.62.216.182, server: _, request: "GET / HTTP/1.1", upstream: "http://127.0.0.1:3000/", host: "www.guj.com.br"

This seems strange, as most of the time the machine still has 2–3 GB of free memory, and neither the CPU nor RDS shows any excessive usage.
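For reference, here is a rough way to count how often this happens from the nginx error.log, grouped by hour (the path below assumes a standard Docker install; adjust it to wherever your error.log actually lives):

```
# count "upstream timed out" errors per hour, to see whether they correlate
# with traffic peaks, cron jobs, etc.
grep 'upstream timed out' /var/discourse/shared/standalone/log/var-log/nginx/error.log \
  | awk '{print $1, substr($2, 1, 2) ":00"}' \
  | sort | uniq -c | sort -rn | head
```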

Any ideas on tunings we could try, or on what we could be doing wrong?

With that amount of RAM, I would start by adding some cores.

I have an instance 5–10x bigger than GUJ, with 8 GB of RAM and 8 cores, using the standard install (everything in one container), and it’s under 50% at peak.

Maybe move to running PostgreSQL inside the container and increase the number of cores?
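As a rough sketch, on a standard install that tuning mostly lives in `containers/app.yml` (the variable names below come from the standard template; the actual values depend on your hardware):

```
cd /var/discourse
# the usual knobs: UNICORN_WORKERS (env section) and db_shared_buffers (params
# section); a common rule of thumb is roughly two workers per core (memory
# permitting) and about a quarter of the RAM for shared buffers
grep -E 'UNICORN_WORKERS|db_shared_buffers' containers/app.yml
# after editing containers/app.yml, rebuild so the new values take effect
./launcher rebuild app
```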

PS: We really could join forces to get the PT-BR translation up to date, couldn’t we?

Hi, thanks for your reply.

Are you running on AWS with that setup? With their machines, increasing the number of cores also increases the amount of memory (we already have 7.5 GB and don’t use it), so upgrading our instance could mean overpaying for memory we don’t need.

Re your PS: It’s in our plans to try to contribute more to Discourse.

I’m on a private cloud.

Looking through the AWS instance types, they aren’t that great. Maybe we should debug your problems better before spending money.

Since you’ve come from a big import (> 2M posts), the database may be struggling. Do you see the timings in the upper left when browsing the forum? Check there to find out whether SQL performance is the problem.

When opening the home page (/), the SQL query is what takes longest; just now, for instance, it ran in 300ms out of a total of 485ms.

But if it always took about that long, I can’t see how the request could time out because of the query.

The funny thing is, even though RDS sometimes reaches a peak of 80–90% CPU usage, plenty of times we have experienced these instabilities with the database using almost none of its CPU and memory. I don’t know if that’s an indication that it isn’t the direct cause, even if it could still contribute to the problem.

After refreshing a bunch of times, I got the query to take 10710.6ms once.

I’m not sure what would make it take that long (my RDS isn’t even at 7% of its CPU utilization). Could my database be missing important indexes or something similar?
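One cheap check I can run (the host, user, and database names below are placeholders for our RDS endpoint) is to look for tables that are being sequentially scanned a lot, which after a big import can point at missing indexes or stale planner statistics:

```
# tables with many sequential scans and stale/never-run autoanalyze are the
# usual suspects after a large import
psql -h your-rds-endpoint -U discourse -d discourse -c "
  SELECT relname, seq_scan, idx_scan, n_live_tup, n_dead_tup, last_autoanalyze
  FROM pg_stat_user_tables
  ORDER BY seq_scan DESC
  LIMIT 15;"
```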

Home on GUJ is /categories, right?

A GET /categories.json takes 262ms with only 24.2ms on 18 queries (9%) for me.

Your database seems to be performing very badly. People talk shit about RDS, but I didn’t know it was that bad.

You can try to VACUUM VERBOSE ANALYZE the entire database and see if that helps. If you expand the timings and look at the slow queries, you can target the slow tables first.
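For example, something along these lines against the RDS endpoint (connection details are placeholders); it can take a while on a 2M+ post database, so off-peak hours are a good idea:

```
# VACUUM VERBOSE ANALYZE prints per-table output, which also helps spot
# tables with lots of dead rows
psql -h your-rds-endpoint -U discourse -d discourse -c "VACUUM VERBOSE ANALYZE;"
```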

You are right, we’ve made /categories our home page (/).

I can try running VACUUM and checking if it outputs anything relevant. I’ll post the results here as soon as I get to it. Thanks :smile:

We have seen very slow database performance cripple a site before. It’s been reported at least twice here on meta with the reproduction steps being, “have a terribly slow database” :wink:

Yeah, it’s that bad. Even more fun is that it’s a lot better than it used to be. :scream: RDS also uses tiers of EBS performance that you can’t get on a regular instance, so I have no idea how they can manage to make it perform worse than running the DB on a regular instance. My mind, she is boggled.

@Falco, I would be really glad if it turns out that RDS is the one to blame. But we see this behavior even at 3am, with no one using the site…

If you browse the CircleCI Discourse forum at https://discuss.circleci.com/, you will also get the same 502 from nginx from time to time, even on such a small forum…

This seems like a problem that should be fairly simple to diagnose if you apply the scientific method. It definitely seems isolated to the application, so start by examining the application logs (/var/discourse/shared/<name>/logs/rails/production.log) and see what they report. The unicorn logs (in the same directory) might also show something like a worker timeout. From there, make a hypothesis as to what might be going wrong based on the data available, design an experiment, and run it. We can only guess wildly at what might be the problem, and while that’s fun (and, if you’re lucky, might give you the answer), it’s a really poor use of everyone’s time.
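For example, something like this as a starting point (using the paths above; replace `<name>` with your container’s shared directory name, and note the unicorn log filename assumes a typical install):

```
# recent application errors and slow-request warnings
tail -n 200 /var/discourse/shared/<name>/logs/rails/production.log

# worker timeouts usually show up in the unicorn logs in the same directory
# (unicorn.stderr.log in a typical install)
grep -iE 'timeout|killing' /var/discourse/shared/<name>/logs/rails/unicorn.stderr.log | tail
```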

My hypothesis is that there is a corner case in the latest versions of Discourse that generates heavy SQL queries for users with a lot of posts (migrated posts, by the way). We will try to investigate it further, take a look at the query speed, and maybe try New Relic. We will keep you posted if we find real data. The production.log did not offer any insights.