Several 503 errors after update

We just updated from 3.0.6 to 3.1.2 and I’m seeing many 503 errors, mainly in three places:

  • Many avatars fail to load
  • Image uploads only work sometimes
  • Also seeing many errors for topics/timings

I have looked at the server logs: most of the 503s don’t show up in production.log at all, but the nginx logs are full of them. Thinking it could be nginx rate limiting, I tried removing templates/web.ratelimited.template.yml, but it didn’t seem to help. I’m still seeing a large number of requests answered with 503, mostly user_avatars/show, and from what I gather they never reach production.log.
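
For anyone wanting to see which endpoints the nginx 503s cluster on, here is a rough sketch that tallies them per method and path prefix. It assumes each access log line contains a quoted request line (`"METHOD /path HTTP/x"`) followed later by the status code as a standalone three-digit field, which holds for nginx's default combined format; a custom `log_format` may need a different pattern. The function name and the `depth` parameter are made up for illustration.

```python
import re
import sys
from collections import Counter

# Quoted request line, e.g. "POST /topics/timings HTTP/1.1"
REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+"')
# A standalone three-digit token (assumed to be the status code)
STATUS_RE = re.compile(r'(?<!\S)(\d{3})(?!\S)')
# Any other quoted field (user agent, referer, ...) is dropped before
# looking for the status, so numbers inside those fields are ignored.
QUOTED_RE = re.compile(r'"[^"]*"')


def count_503_by_endpoint(lines, depth=2):
    """Count 503 responses per (method, first `depth` path segments)."""
    counts = Counter()
    for line in lines:
        req = REQUEST_RE.search(line)
        if not req:
            continue
        rest = QUOTED_RE.sub("", line[req.end():])
        status = STATUS_RE.search(rest)
        if not status or status.group(1) != "503":
            continue
        path = req.group("path").split("?", 1)[0]  # drop the query string
        segments = [s for s in path.split("/") if s]
        counts[(req.group("method"), "/" + "/".join(segments[:depth]))] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python count_503.py /path/to/nginx/access.log
    with open(sys.argv[1], errors="replace") as fh:
        for (method, prefix), n in count_503_by_endpoint(fh).most_common(20):
            print(f"{n:8d}  {method} {prefix}")
```

Comparing this tally with what appears in production.log gives a quick view of which 503s are answered before the request ever reaches Rails.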

I don’t notice anything wrong in Sidekiq. However, /logs did show errors like this:

'hijack user_avatars show ' is still running after 90 seconds on db default, this process may need to be restarted!

but these were from a couple of hours ago. I’ve rebuilt the instance a couple of times since then and they haven’t shown up again.

This instance uses SSO, so the avatar URLs come from there. We are using S3 for images.

I’m a bit puzzled as to what is causing this, and I’m out of ideas.

Any clues on where or what to look into?

How much RAM do you have? Have you rebooted the server lately?

The server has 16 GB and had been running fine for months before the update.

It’s an AWS instance, and this one was started today, shortly before the update, to change some unrelated parameters (Discourse data is on EBS volumes).

Network traffic (in & out) has increased significantly after the update: avatars seem to work a few seconds after the initial 503, so I’m guessing there’s some process running the first time they are requested.

However, I’m at a loss as to why image uploads are randomly failing, as well as the topics/timings endpoint.

Not sure if it may be related to this:

Could it be that this avatar background update process is hitting the 3500 PUT/s rate limit on AWS, causing regular uploads to fail while the avatars are updated? /cc @sam

Possibly … it should clear up though. Did it clear up by now?

Yes, in part.

The update was done on the morning of the 21st. Inbound network traffic seems to be normalizing now. Outbound is still higher than usual, but I assume that’s while avatars are getting cached. The number of 503s for user_avatars/show is much smaller now. I’m guessing these will slowly get sorted out over time as more avatars are processed.

However, we are still seeing many 503 errors in the logs, mainly for two other endpoints:

POST /topics/timings

There are still many 503 errors on this endpoint, and some users report that visited topics are not being marked as read. I have not found any information about it, as the request doesn’t seem to be logged in production.log at all. /logs doesn’t show anything related either.

Where would one start debugging these 503s? Are there other logs I’m unaware of, or is there a way to make the logs more verbose (on a production system)?

POST /uploads.json?client_id=....

All I find in production.log for these 503 errors is along these lines:

Extract from production.log

Started POST "/uploads.json?client_id=X" for x.x.x.x at 2023-10-24 10:24:55 +0000
Processing by UploadsController#create as JSON
Parameters: {"upload_type"=>"composer", "relativePath"=>"null", "name"=>"Screenshot 2023-10-24 at 11.22.32.png", "type"=>"image/png", "sha1_checksum"=>"d1f11731320437724003c3840c5dcc5f934ba25a", "file"=>#<ActionDispatch::Http::UploadedFile:0x00007f3c5e3c9898 @tempfile=#<Tempfile:/tmp/RackMultipart20231024-1991-b30vit.png>, @content_type="image/png", @original_filename="Screenshot 2023-10-24 at 11.22.32.png", @headers="Content-Disposition: form-data; name=\"file\"; filename=\"Screenshot 2023-10-24 at 11.22.32.png\"\r\nContent-Type: image/png\r\n">, "client_id"=>"X"}
Rendered text template (Duration: 0.0ms | Allocations: 1)
Completed 503 Service Unavailable in 10ms (Views: 0.4ms | ActiveRecord: 0.0ms | Allocations: 5007)
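
To get an overview of which controller actions produce these entries, a small sketch like the following can group the `Completed 503` lines by the preceding `Processing by` line. It relies only on the two line shapes visible in the extract above. The function name is invented for illustration, and the pairing is a simplification: with several workers writing to the same file, lines from different requests can interleave and be misattributed.

```python
import re
from collections import Counter

PROCESSING_RE = re.compile(r"Processing by (\S+) as")
COMPLETED_RE = re.compile(r"Completed (\d{3}) .*? in (\d+)ms")


def summarize_503(lines):
    """Return (count per action, list of durations in ms per action) for 503s."""
    counts = Counter()
    durations = {}
    current = None  # most recent "Processing by ..." action seen
    for line in lines:
        started = PROCESSING_RE.search(line)
        if started:
            current = started.group(1)
            continue
        done = COMPLETED_RE.search(line)
        if not done:
            continue
        if done.group(1) == "503":
            action = current or "(unknown)"
            counts[action] += 1
            durations.setdefault(action, []).append(int(done.group(2)))
        current = None  # the request is finished either way
    return counts, durations
```

The durations are worth a look: a 503 that completes in about 10ms with 0.0ms of ActiveRecord time, as in the extract, looks like it was rejected by the application early rather than timing out on slow work.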

Our users report that they retry a few times until it works. I’m able to reproduce the error more or less consistently if I start uploading another file while one is still uploading. Uploading them one by one seems less prone to it for some reason.
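
A rough way to make the overlapping-upload case repeatable is a tiny harness that releases several requests at the same moment and tallies the status codes. Everything here is an assumption for illustration: `send_request` is a hypothetical callable you would write yourself (for example, an authenticated POST to /uploads.json that returns the HTTP status), and `run_overlapping` is just a name picked for this sketch.

```python
import threading
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def run_overlapping(send_request, count=4):
    """Call send_request(i) from `count` threads at once; tally returned statuses."""
    barrier = threading.Barrier(count)

    def worker(i):
        barrier.wait()  # hold every worker until all are ready, so requests overlap
        return send_request(i)

    with ThreadPoolExecutor(max_workers=count) as pool:
        return Counter(pool.map(worker, range(count)))
```

Running it with `count=1` and then with a higher count shows whether the 503s really depend on uploads overlapping.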

# free -h
               total        used        free      shared  buff/cache   available
Mem:            15Gi       3.8Gi       621Mi       1.1Gi        10Gi        10Gi

# lscpu --parse=core | egrep -v ^# | sort -u | wc -l

db_shared_buffers: "1024MB"