I get an AWS CloudWatch alert at 9:09pm ET, along with texts from some friends asking “hey is discourse down?”
I can’t ssh into the AWS Lightsail instance, and all the metrics are hung/not reporting.
Eventually I give up and stop/restart the Lightsail instance.
I check the logs post service recovery, looking to learn.
I run Discourse as a single instance, so the error at 9:05 about a Redis network connection has me flummoxed.
I can’t sort out what happened other than “something” hung/failed for “some reason”.
Anyone who can explain or leave some breadcrumbs appreciated.
What are the server specs? Sounds like it’s running out of resources? Most likely CPU. Perhaps there is some daily task running at that time?
It’s a 1 vCPU, 1GB RAM, 40 GB SSD Lightsail instance.
Storage is about 60% consumed, and when I do cleanups it drops quite a bit.
AWS shows I am out of burstable CPU credits, which is only odd because the other metrics don’t support that.
It’s a pretty small community (20-30 active participants) so I will be surprised if there is a real CPU or RAM constraint.
No daily task I am aware of other than something discourse might schedule by default.
1GB with swap is the absolute minimum to run discourse.
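To confirm whether swap is actually configured on a box like this, something along these lines could be run on the server. It is only a minimal sketch: it assumes a Linux host exposing `/proc/meminfo`, and the helper names `parse_meminfo`, `memory_summary` and `read_summary` are made up for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field_name: value}.

    Most fields are reported in kB; plain counters have no unit.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_summary(text):
    """Return total RAM, total swap and swap in use, all in whole MB."""
    info = parse_meminfo(text)
    swap_total = info.get("SwapTotal", 0)
    swap_free = info.get("SwapFree", 0)
    return {
        "ram_mb": info.get("MemTotal", 0) // 1024,
        "swap_mb": swap_total // 1024,
        "swap_used_mb": (swap_total - swap_free) // 1024,
    }


def read_summary(path="/proc/meminfo"):
    """Read the live values from the host (Linux only, an assumption)."""
    with open(path) as handle:
        return memory_summary(handle.read())
```

Calling `read_summary()` on the instance and seeing `swap_mb` equal to 0 would mean no swap is set up, which on a 1GB machine leaves very little headroom.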
How long has this instance been up? How big is the database?
I’ll check the db size, not expecting it to be large (backups are all about 57 MB).
Uptime of the instance is not quite ten hours now, since recovery required stopping and restarting the virtual server; I could not get a shell or console connection.
Been running fine on this instance type since I built it (Feb 2021 as a guess).
This sounds like what happens when AWS moves your VM from one host to another and leaves it in a weird state because of it. Usually a reboot solves it.
Overall db size is 423MB.
Largest tables are
Second similar “high load” failure occurred.
Going to guess resource contention.
Has anyone tried to use the Lightsail snapshot to snapshot the instance, and restore it to a larger instance as an upgrade method?
You can try rebooting the AWS instance, that can fix a lot of issues.
I’ve moved, using a Lightsail snapshot, from a 1 CPU, 1GB RAM, 40GB SSD instance to a 2 CPU, 4GB RAM, 80GB SSD one.
Aside from having to detach the public IP and reattach it, which was straightforward enough, my remaining concern is “what have I missed”?
Is there anything (backups, email, S3 bucket config, etc) that I should check or do I need to re-run any initial install parameters to take advantage of the upgraded resources?
I’m thinking, based on this link, I could bump db_shared_buffers to at least 1GB.
Current app.yml says 128MB, also indicates auto adjust at bootstrap.
1GB is fine for a 4GB system. Make sure you also update unicorn_workers to 4.
Usual recommendation if you were moving between servers would be to re-run discourse-setup which would take care of the above automatically.
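For anyone who wants to sanity-check the numbers before re-running the script, here is a rough sketch of the sizing rule. The constants are my recollection of what discourse-setup does and should be treated as assumptions, not as the script's actual logic; `suggest_settings` is a hypothetical helper name. It does agree with the values in this thread (128MB on 1GB, and 1GB with 4 workers on 4GB with 2 CPUs).

```python
def suggest_settings(ram_gb, cpu_cores):
    """Suggest app.yml values from whole GB of RAM and CPU core count.

    Assumed heuristic (verify against discourse-setup itself):
      - db_shared_buffers: 128MB per GB up to 2GB, else 256MB per GB,
        capped at 4096MB
      - unicorn_workers: 2 per GB up to 2GB, else 2 per core, capped at 8
    """
    if ram_gb <= 2:
        shared_mb = 128 * ram_gb
        workers = 2 * ram_gb
    else:
        shared_mb = 256 * ram_gb
        workers = 2 * cpu_cores
    return {
        "db_shared_buffers": f"{min(shared_mb, 4096)}MB",
        "unicorn_workers": min(workers, 8),
    }


# The old 1GB / 1 vCPU instance and the new 4GB / 2 vCPU one.
old_box = suggest_settings(ram_gb=1, cpu_cores=1)
new_box = suggest_settings(ram_gb=4, cpu_cores=2)
```

Whatever values you settle on go into app.yml, and the container has to be rebuilt for them to take effect.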
Thanks. I’m now going down the Prometheus rabbit hole.