Several 503 errors after update

mentalstring · October 21, 2023, 10:57am

We just updated from 3.0.6 to 3.1.2 and I’m seeing many 503 errors, mainly in three places:

Many avatars fail to load
Image uploads only work sometimes
Also seeing many errors for topics/timings

I have looked at the server logs, and most of the 503s don’t even show up in production.log, but the nginx log is full of them. Thinking it could be nginx rate limiting, I tried removing templates/web.ratelimited.template.yml, but it didn’t seem to help. I’m still seeing a high number of requests answered with 503, mostly for user_avatars/show, and from what I gather, these requests never reach production.log at all.
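A quick way to see which endpoints account for the 503s in the nginx log is to tally them per path. The following is a minimal sketch, not a definitive tool. It assumes each access-log line has the quoted request immediately followed by the status code, as in nginx's stock "combined" format. A custom log_format (which a Discourse container may use) would need a different regular expression. The function names are made up for illustration.

```python
import re
from collections import Counter

# Assumption: the quoted request is immediately followed by the status code,
# e.g.  "POST /topics/timings HTTP/1.1" 503 190
# Adjust this pattern if the server uses a custom log_format.
REQUEST_STATUS_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3})\b'
)


def endpoint(path, depth=2):
    """Drop the query string and keep only the first `depth` path segments."""
    path = path.split("?", 1)[0]
    parts = [p for p in path.split("/") if p]
    return "/" + "/".join(parts[:depth])


def tally_status(lines, status="503", depth=2):
    """Count requests that returned `status`, grouped by endpoint."""
    counts = Counter()
    for line in lines:
        match = REQUEST_STATUS_RE.search(line)
        if match and match.group("status") == status:
            counts[endpoint(match.group("path"), depth)] += 1
    return counts


# Hypothetical usage (the log path is an assumption, adjust to your setup):
#   with open("/var/log/nginx/access.log", errors="replace") as fh:
#       for name, n in tally_status(fh).most_common(10):
#           print(n, name)
```

Grouping by the first two path segments keeps /topics/timings and /uploads.json separate while collapsing the many distinct avatar URLs into one bucket.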

I don’t notice anything wrong in Sidekiq. However, /logs did show errors like this one:

'hijack user_avatars show ' is still running after 90 seconds on db default, this process may need to be restarted!

but these were from a couple of hours ago; I’ve rebuilt the instance a couple of times since then and they haven’t shown up again.

This instance is using SSO, so the avatar URLs come from there. We are using S3 for images.

I’m a bit puzzled as to what is causing this and I’m out of ideas.

Any clues where/what to look into?

pfaffman · October 21, 2023, 11:51am

How much RAM do you have? Have you rebooted the server lately?

mentalstring · October 21, 2023, 11:57am

The server has 16 GB and had been running fine for months before the update.

It’s an AWS instance and this one was started today, shortly before the update (Discourse data is on EBS volumes) to change some unrelated parameters.

Network traffic (in & out) has increased significantly after the update: avatars seem to work a few seconds after the initial 503, so I’m guessing there’s some process running the first time they are requested.

However, I’m at a loss as to why image uploads are randomly failing, as well as the topics/timings endpoint.

mentalstring · October 21, 2023, 1:13pm

Not sure if it may be related to this:

github.com/discourse/discourse

FEATURE: reduce avatar sizes to 6 from 20

discourse:main ← discourse:avatar_sizes

opened 02:57AM - 01 May 23 UTC by SamSaffron (+159 -95)

This PR introduces 3 changes:

1. SiteSetting.avatar_sizes now does what it says on the tin. Previously it would introduce a large number of extra sizes to allow for various DPIs. Instead we now trust the admin with the size list.
2. When `avatar_sizes` changes, we ensure consistency and remove resized avatars that are no longer allowed per site setting. This happens on the 12 hourly job and is limited out of the box to 20k cleanups per cycle, given this may reach out to AWS 20k times to remove things.
3. Our default avatar sizes are now "24|48|72|96|144|288". These sizes were very specifically picked to limit the amount of blurriness introduced by webkit. Our avatars are already blurry due to the 1px border, so this corrects old blur.

This change heavily reduces the storage required by forums, which simplifies site moves and more.

Could it be that this background avatar update process is hitting the 3,500 PUT/s rate limit on S3, causing regular uploads to fail while the avatars are being updated? /cc @sam

sam · October 23, 2023, 11:30pm

Possibly … it should clear up though. Did it clear up by now?

mentalstring · October 24, 2023, 11:18am

Yes, in part.

The update was done on the morning of the 21st. Inbound network traffic seems to be normalizing now. Outbound traffic is still higher than usual, but I assume that is because avatars are still being cached. The number of 503s for user_avatars/show is much smaller now. I’m guessing these will slowly get sorted out over time as more avatars are processed.

However, we are still seeing many 503 errors in the logs, mainly for two other endpoints:

POST /topics/timings

There are still many 503 errors for this endpoint, and some users report that visited topics are not being marked as read. I have not found any information about it, as the request doesn’t seem to be logged in production.log at all. /logs doesn’t show anything related either.

How would one go about debugging these 503s? Are there other logs I’m unaware of, or is there a way to make the logs more verbose (on a production system)?

POST /uploads.json?client_id=....

All I can find in production.log for these 503 errors is along these lines:

Extract from production.log

Started POST "/uploads.json?client_id=X" for x.x.x.x at 2023-10-24 10:24:55 +0000
Processing by UploadsController#create as JSON
Parameters: {"upload_type"=>"composer", "relativePath"=>"null", "name"=>"Screenshot 2023-10-24 at 11.22.32.png", "type"=>"image/png", "sha1_checksum"=>"d1f11731320437724003c3840c5dcc5f934ba25a", "file"=>#<ActionDispatch::Http::UploadedFile:0x00007f3c5e3c9898 @tempfile=#<Tempfile:/tmp/RackMultipart20231024-1991-b30vit.png>, @content_type="image/png", @original_filename="Screenshot 2023-10-24 at 11.22.32.png", @headers="Content-Disposition: form-data; name=\"file\"; filename=\"Screenshot 2023-10-24 at 11.22.32.png\"\r\nContent-Type: image/png\r\n">, "client_id"=>"X"}
Rendered text template (Duration: 0.0ms | Allocations: 1)
Completed 503 Service Unavailable in 10ms (Views: 0.4ms | ActiveRecord: 0.0ms | Allocations: 5007)
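To find every request in production.log that ended like the one above, the "Processing by" and "Completed" lines can be paired up. This is a rough sketch under a loud assumption: it attributes each "Completed" line to the most recent "Processing by" line, which can mislabel requests when several workers write interleaved entries to the same file. The function name is hypothetical.

```python
import re

# Line shapes taken from the extract above.
PROCESSING_RE = re.compile(r"Processing by (?P<action>\S+) as (?P<format>\S+)")
COMPLETED_RE = re.compile(r"Completed (?P<status>\d{3}) .*? in (?P<ms>\d+)ms")


def failed_actions(lines, status="503"):
    """Return (controller#action, duration_ms) for requests completed with `status`.

    Assumption: a "Completed" line belongs to the latest "Processing by" line.
    With concurrent workers the log can interleave, so treat results as a hint.
    """
    current = None
    hits = []
    for line in lines:
        started = PROCESSING_RE.search(line)
        if started:
            current = started.group("action")
            continue
        done = COMPLETED_RE.search(line)
        if done:
            if done.group("status") == status:
                hits.append((current or "unknown", int(done.group("ms"))))
            current = None
    return hits
```

Collecting the durations alongside the action makes it easy to check whether the other failures are also answered within a few milliseconds, as in the extract.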

Our users report that they have to retry a few times until it works. I’m able to reproduce the error more or less consistently if I start uploading another file while one is still uploading. Uploading files one by one seems less prone to it for some reason.

# free -h
               total        used        free      shared  buff/cache   available
Mem:            15Gi       3.8Gi       621Mi       1.1Gi        10Gi        10Gi

# lscpu --parse=core | egrep -v ^# | sort -u | wc -l
2

UNICORN_WORKERS: 4

db_shared_buffers: "1024MB"
