When visiting my site, sometimes I’ll get a bunch of 500 errors on all of the JavaScript files, usually on the first page load (even if the files are, or should be, cached). Sometimes the browser reports NS_ERROR_CORRUPTED_CONTENT instead.
My theme is loading fine (the background was removed for visibility) but besides that there should be nothing causing this. It is fixed by waiting 30 seconds and reloading.
My site is up to date on the almost latest commit (2 behind) and there is nothing weird in the logs. My disk cluster is healthy. What is causing this and how do I fix it?
My initial guess was a Cloudflare optimization issue. Make sure Rocket Loader is disabled (which it probably is, but it's worth a check).
Also, I think you might want to set up S3-compatible object storage using Cloudflare R2 if you want to keep using your current hardware.
You mentioned in our chat that you are using some donated older SSDs that you installed shortly before this started happening, which is a smoking gun. The Dell firmware says they are bad, but smartctl shows them as OK. I think the Dell firmware is blinking orange because it detects high latency and erratic I/O responses, or unsupported firmware. Discourse asks the drives to read dozens of compiled JavaScript files all at the same time, and older unsupported SSDs can choke under this sudden burst of I/O. The storage controller then hangs trying to get the data and gives up after 30 seconds, which is a common default SCSI/block device timeout and matches the 30 seconds you have to wait before a reload works.
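If you want to check whether the 30-second figure matches your server, Linux exposes the per-device SCSI command timeout in sysfs at `/sys/block/<device>/device/timeout`. Here is a minimal Python sketch that lists it for every block device that has one. The function name `read_block_timeouts` is just something I made up for illustration, and devices that don't expose that file (NVMe drives, for example) are simply skipped.

```python
from pathlib import Path


def read_block_timeouts(sys_block="/sys/block"):
    """Return {device_name: timeout_seconds} for block devices that expose
    a SCSI command timeout under <sys_block>/<device>/device/timeout."""
    timeouts = {}
    for dev in sorted(Path(sys_block).glob("*")):
        timeout_file = dev / "device" / "timeout"
        if not timeout_file.is_file():
            continue  # not a SCSI-style device, nothing to report
        try:
            timeouts[dev.name] = int(timeout_file.read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable or unexpected content, skip it
    return timeouts


if __name__ == "__main__":
    for name, seconds in read_block_timeouts().items():
        print(f"{name}: {seconds}s")
```

If the drives in question report 30 here, that would be consistent with the "wait 30 seconds and reload" pattern, although it doesn't prove the drives are the cause on its own.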
Since the drives hang for 30 seconds, Rails or nginx inside Discourse crashes or times out trying to fetch the files and throws a 500 error. Cloudflare is likely catching the 500 errors, applying the wrong headers, and passing them on to the browser, which would cause the NS_ERROR_CORRUPTED_CONTENT you are seeing.
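To see whether a burst of simultaneous reads really stalls on those drives, you could time it directly. This is a rough Python sketch, not a proper benchmark: it reads every `*.js` file under a directory at once using a thread pool and reports the slowest reads. The names `timed_read` and `burst_read`, the 32-worker default, and the directory you point it at are all assumptions on my part. Keep in mind that files already in the Linux page cache will come back almost instantly, so only a cold run says anything about the disks.

```python
import sys
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def timed_read(path):
    """Read one file fully and return (path, size_in_bytes, elapsed_seconds)."""
    start = time.perf_counter()
    size = len(Path(path).read_bytes())
    return str(path), size, time.perf_counter() - start


def burst_read(directory, pattern="*.js", workers=32):
    """Read all files matching pattern under directory concurrently.

    Returns a list of (path, size, elapsed) tuples, slowest read first.
    """
    files = [p for p in Path(directory).rglob(pattern) if p.is_file()]
    if not files:
        return []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(timed_read, files))
    return sorted(results, key=lambda r: r[2], reverse=True)


# Only runs when a directory is passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    for path, size, elapsed in burst_read(sys.argv[1])[:10]:
        print(f"{elapsed:8.3f}s  {size:>10} bytes  {path}")
```

Reads that take whole seconds, or that cluster near the 30-second mark, would point at the drives. Consistently fast reads would suggest looking elsewhere, such as the Cloudflare settings.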
Replace those SSDs, or if you must use them, use an object storage bucket to offload Discourse assets and uploads. You can use Cloudflare R2, which is S3-compatible and has a free tier (I have this and it works well). Then your server won’t need to read thousands of small files from those failing SSDs, because web assets bypass the hardware bottleneck.
This isn’t something I had known before and definitely would explain it. Every single manual for the server everywhere says that a blinking orange LED means the drive is 100% dead but if these are giving I/O timeouts or otherwise behaving weirdly it’d make sense to me why Discourse gives the 500 error code.
I am very anti-subscription, and long-term it probably makes sense for me to bite the bullet and buy new SSDs once they are affordable again. I would much rather spend $150 on new SSDs that I own forever than pay Cloudflare $5 a month for the rest of eternity. If it’s free (which you say it is), I may look into it as a temporary alternative while I wait for this AI bubble to blow over.