Lots of HTTP 502 and 429 after updating to 3.4.0.beta1-dev

Discourse version: 3.4.0.beta1-dev (bf3d8a0a94)

Updated yesterday and had to disable Cloudflare’s minify feature, as suggested here:

However, since then, lots of users (me included) have experienced several instances of 502 (Bad Gateway) and 429 (Too Many Requests).

In an attempt to alleviate the problem, I followed this guide as well:

However, nothing seems to have changed in terms of frequency of those errors.

The update happened yesterday around 11AM. I performed a full rebuild because I wanted to disable a plugin as well.

I have a Prometheus + Grafana instance that monitors the server and Discourse, but the server seems fine in terms of load:

Discourse metrics (the drop in metrics around 11AM yesterday is the rebuild bringing down the container):

Again, I don’t see any strange pattern.

However, this is the browser console just a minute ago, after trying to send a PM to a user:

Anything else I might provide (logs of any kind) please ask away. Thank you.

Oh, if it helps: lots of “background” real-time operations are also clearly lagging behind.

Topics already read are not registered as being read, for example.

Just to make sure, you did this step as well, then rebuilt, correct?

Yes, sorry, I forgot to add that I had already added the Cloudflare template to the app.yml file a long time ago. We have always been behind Cloudflare, since the very first day.

This is a partial excerpt of the app.yml; we have our own certificates, renewed independently, which is why the Let’s Encrypt template is commented out:

## this is the all-in-one, standalone Discourse Docker container template
##
## After making changes to this file, you MUST rebuild
## /var/discourse/launcher rebuild app
##
## BE *VERY* CAREFUL WHEN EDITING!
## YAML FILES ARE SUPER SUPER SENSITIVE TO MISTAKES IN WHITESPACE OR ALIGNMENT!
## visit http://www.yamllint.com/ to validate this file as needed

templates:
  - "templates/postgres.template.yml"
  - "templates/redis.template.yml"
  - "templates/web.template.yml"
  - "templates/web.ratelimited.template.yml"
## Uncomment these two lines if you wish to add Lets Encrypt (https)
  - "templates/web.ssl.template.yml"
#  - "templates/web.letsencrypt.ssl.template.yml"
  - "templates/cloudflare.template.yml"

## which TCP/IP ports should this container expose?
## If you want Discourse to share a port with another webserver like Apache or nginx,
## see https://meta.discourse.org/t/17247 for details
expose:
  - "80:80"   # http
  - "443:443" # https

[...]

An excerpt of /logs:

I see this was moved to Installation; just to be clear, this is not a new installation.

This instance of Discourse has been running since March 2023 and has never had this specific problem.

There was an issue with some 429s in the past, but it has since been resolved.

I think it still fits into

Looks like your PostgreSQL is overwhelmed. It looks like most of your RAM is idle; I’d try tweaking the DB to use it and see how things fare after that.
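For example, under the standard standalone template the usual knobs live under `params:` and `env:` in app.yml. A minimal sketch, with placeholder values you would scale to your own hardware (the usual rule of thumb is at most ~25% of total RAM for `db_shared_buffers`), followed by a rebuild:

```yaml
## Sketch only — example values for a host with ~8 GB RAM; adjust to your hardware,
## then rebuild with: cd /var/discourse && ./launcher rebuild app
params:
  db_default_text_search_config: "pg_catalog.english"
  ## at most ~25% of total RAM for PostgreSQL shared buffers
  db_shared_buffers: "2GB"
  ## per-connection sort/hash memory; helps heavier queries, costs RAM per connection
  db_work_mem: "40MB"

env:
  ## more unicorn workers handle more concurrent requests at the cost of RAM
  UNICORN_WORKERS: 4
```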

What does /sidekiq/queues look like?
What version were you updating from?

From the latest stable as of the 6th of May, v3.2.1, to the latest tests-passed.

Sidekiq queues:

The Dead jobs section is this one, but it’s been the same job since the beginning of time, it seems.

Oldest entries:

The ones in the Retry set appear to be the same job being retried over and over.

But… why suddenly? After just an update of the application layer?

I am using the discourse-prometheus exporter plugin.
If I added a PostgreSQL exporter as another container on the VM, would it be possible to let it access the metrics of the Discourse PostgreSQL installation?
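I haven’t tried this myself, but in principle a stock postgres_exporter container can talk to the bundled PostgreSQL over its Unix socket, which (I believe) the standalone setup exposes on the host under /var/discourse/shared/standalone/postgres_run. A rough sketch as a compose file — image, socket path, user/database and authentication are all assumptions to verify against your own install (pg_hba.conf may need a dedicated monitoring role):

```yaml
## Sketch only — paths and credentials are assumptions to adapt.
services:
  postgres-exporter:
    image: prometheuscommunity/postgres-exporter
    volumes:
      ## host directory that (assumed) holds the .s.PGSQL.5432 socket of the Discourse DB
      - /var/discourse/shared/standalone/postgres_run:/var/run/postgresql
    environment:
      ## libpq-style DSN connecting over the Unix socket
      DATA_SOURCE_NAME: "postgresql:///discourse?host=/var/run/postgresql&user=discourse"
    ports:
      - "9187:9187"   # Prometheus scrape target
```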

Any more precise directions on how to fine-tune the DB for Discourse?

Not sure if it’s related, but it definitely started happening after the update: clicking the Dismiss button in the unread tab always returns a 503.

Well, since it seems there isn’t a solution, I’ll try to go back to the latest stable, as it’s supposed to be… you know, stable.

Crossing fingers that there isn’t a core dependency breaking the build process, like last time.

You can’t go back from tests-passed to stable unless there is a higher stable version available. So the next opportunity for you is when 3.4.0 is out; I figure that’s around or after Christmas…

Besides, you’ll have to bite the bullet someday.

Well, I just did. It seems to be working. We don’t care about any of the features in 3.3.0 anyway.

I’ll see if there are still issues. The worst that can happen is that we still get a plethora of 429s and 502s, which wouldn’t be much of a change.

I’d still appreciate directions on how to configure the PostgreSQL bundled with Discourse so it has more resources available, though.

Edit: Deployed version 3.2.5. System seems stable.
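For anyone else reading along, a minimal sketch of how such a pin can look in app.yml under the standard standalone template (the tag is whatever release you target; the “no going back” caveat above still applies to the database):

```yaml
## Sketch: pin the git revision Discourse builds from (default is tests-passed),
## then rebuild with: cd /var/discourse && ./launcher rebuild app
params:
  version: v3.2.5   # a release tag, or "stable" to follow the stable branch
```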

Please remind us that you did this when you’re posting your next issue :wink:

I always mention the version I am on when posting an issue.
I think it’s important to remember that, precisely because this is presented as open-source software, critical issues should be taken seriously instead of writing things like this:

This is yet another example of people who go out of their way to switch to the “stable” version encountering bugs that fall through the cracks because it’s not the most widely deployed version.

Stable should mean “stable”, not “legacy”.
The fact that core dependencies like discourse_docker are pushed without a tagging system should be reason enough to be a bit more humble when responding to users who are reporting an issue.

I was talking about mentioning the fact that you downgraded when you technically couldn’t.

I think it’s important to remember… that I do not work for Discourse and I am helping you in my own time, so I do not appreciate your tone, nor am I able to do anything with your feedback.
