fuse
(Geoff Hughes)
October 13, 2022, 1:50am
1
I get an AWS CloudWatch alert at 9:09pm ET, along with some friends who text me “hey, is Discourse down?”
I can’t SSH into the AWS Lightsail instance, and all the metrics are hung/not reporting.
Eventually I give up and stop/restart the Lightsail instance.
Service recovered.
I check the logs post service recovery, looking to learn.
I run Discourse as a single instance, so the error at 9:05 about a Redis network connection has me flummoxed.
I can’t sort out what happened other than “something” hung/failed for “some reason”.
Anyone who can explain or leave some breadcrumbs appreciated.
Thank you!
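While I’m in the logs I’ll also check whether the kernel OOM killer fired, since a 1GB box that stops answering SSH often points at memory pressure. Roughly this (assuming the stock Ubuntu image Lightsail provides):

# look for OOM killer activity around the time of the hang
sudo dmesg -T | grep -iE 'out of memory|oom-kill'
sudo grep -i 'killed process' /var/log/syslog
# confirm swap exists and see current memory headroom
free -h
swapon --show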
MarcP
(MarcP)
October 13, 2022, 2:08am
2
What are the server specs? It sounds like it’s running out of resources, most likely CPU. Perhaps there is some daily task running at that time?
fuse
(Geoff Hughes)
October 13, 2022, 2:18am
3
It’s a 1 vCPU, 1GB RAM, 40GB SSD Lightsail instance.
Storage is about 60% consumed, and when I do cleanups it drops quite a bit.
AWS shows I am out of burstable CPU credits, which is only odd because the other metrics don’t support that.
It’s a pretty small community (20-30 active participants), so I would be surprised if there is a real CPU or RAM constraint.
No daily task I am aware of other than something discourse might schedule by default.
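If it matters, I believe the burst credit history is queryable too; something like this should show whether the capacity really hit zero around the outage (metric name from memory, instance name is mine):

# BurstCapacityPercentage should be the remaining burst capacity for the instance
aws lightsail get-instance-metric-data \
  --instance-name my-discourse \
  --metric-name BurstCapacityPercentage \
  --period 300 \
  --start-time 2022-10-13T00:00:00Z \
  --end-time 2022-10-13T02:00:00Z \
  --unit Percent \
  --statistics Average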
Stephen
(Stephen)
October 13, 2022, 2:33am
4
1GB with swap is the absolute minimum to run Discourse.
How long has this instance been up? How big is the database?
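If there’s no swap on the box yet, the standard 2GB swapfile from the install guide is quick to add, roughly:

# create and enable a 2GB swapfile, then persist it across reboots
sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab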
3 Likes
fuse
(Geoff Hughes)
October 13, 2022, 10:47am
5
I’ll check the db size, not expecting it to be large (backups are all about 57 MB).
Uptime of the instance is not quite ten hours now, since recovery required stopping and restarting the virtual server; I could not get a shell or console connection.
Been running fine on this instance type since I built it (Feb 2021 as a guess).
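For the db size I’ll probably just ask Postgres from inside the container, something along these lines (standard /var/discourse install assumed):

cd /var/discourse
./launcher enter app
# inside the container:
su postgres -c "psql discourse -c \"SELECT pg_size_pretty(pg_database_size('discourse'));\""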
Falco
(Falco)
October 13, 2022, 1:59pm
6
This sounds like what happens when AWS moves your VM from one host to another and leaves it in a weird state because of it. Usually a reboot solves it.
5 Likes
fuse
(Geoff Hughes)
October 13, 2022, 4:29pm
7
Overall db size is 423MB.
Largest tables are:
posts: 66MB
post_timings: 60MB
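A catalog query along these lines should reproduce the per-table numbers from inside the container, if anyone wants to compare:

# total relation size includes indexes and TOAST
su postgres -c "psql discourse -c 'SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) FROM pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC LIMIT 10;'"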
fuse
(Geoff Hughes)
October 16, 2022, 4:40pm
8
Second similar “high load” failure occurred.
Going to guess resource contention.
Has anyone tried to use the Lightsail snapshot to snapshot the instance, and restore it to a larger instance as an upgrade method?
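For reference, I think the same flow can also be scripted with the Lightsail CLI, roughly like this (names and bundle ID are placeholders; aws lightsail get-bundles lists the real IDs):

# snapshot the running instance
aws lightsail create-instance-snapshot \
  --instance-name my-discourse \
  --instance-snapshot-name my-discourse-pre-upgrade
# restore the snapshot onto a larger bundle (2 vCPU / 4GB in this example)
aws lightsail create-instances-from-snapshot \
  --instance-names my-discourse-large \
  --instance-snapshot-name my-discourse-pre-upgrade \
  --availability-zone us-east-1a \
  --bundle-id medium_2_0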
darkpixlz
(darkpixlz)
October 16, 2022, 5:06pm
9
You can try rebooting the AWS instance; that can fix a lot of issues.
fuse
(Geoff Hughes)
October 16, 2022, 5:10pm
10
I’ve moved, using a Lightsail snapshot, from a 1 vCPU / 1GB RAM / 40GB SSD instance to a 2 vCPU / 4GB RAM / 80GB SSD instance.
Aside from having to detach and reattach the public IP, which was straightforward enough, my remaining concern is “what have I missed?”
Is there anything (backups, email, S3 bucket config, etc) that I should check or do I need to re-run any initial install parameters to take advantage of the upgraded resources?
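In the meantime I’m sanity-checking the basics on the new instance with something like:

cd /var/discourse
./discourse-doctor               # checks the install and tries to send a test email
./launcher logs app | tail -n 50 # look for startup errors
# then confirm backups still land where expected (local or S3) from /admin/backups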
fuse
(Geoff Hughes)
October 16, 2022, 6:49pm
11
I’m thinking, based on this link, I could bump db_shared_buffers to at least 1GB.
The current app.yml says 128MB, and also indicates it is auto-adjusted at bootstrap.
When you install Discourse on an instance with 4GB or more you should consider the following:
Monitor your setup
If you elect to use a higher end setup we strongly recommend you set up monitoring using Newrelic or some other monitoring service. You will need to analyze the results of configuration changes to reach an optimal setup.
Out of the box Discourse Docker ships with 3 web workers
Web workers are served via unicorn, this process is capable of serving one request at a time, you should …
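Concretely I think that boils down to one line in containers/app.yml plus a rebuild (sketch only, value per the doc above):

cd /var/discourse
# in containers/app.yml, under params:
#   db_shared_buffers: "1GB"
./launcher rebuild app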
Stephen
(Stephen)
October 16, 2022, 7:01pm
12
1GB is fine for a 4GB system. Make sure you also update unicorn_workers to 4.
Geoff Hughes:
Is there anything (backups, email, S3 bucket config, etc) that I should check or do I need to re-run any initial install parameters to take advantage of the upgraded resources?
Usual recommendation if you were moving between servers would be to re-run discourse-setup which would take care of the above automatically.
##
## If we have lots of RAM or lots of CPUs, bump up the defaults to scale better
##
scale_ram_and_cpu() {
local changelog=/tmp/changelog.$PPID
# grab info about total system ram and physical (NOT LOGICAL!) CPU cores
avail_gb=0
avail_cores=0
os_type=$(check_OS)
if [ "$os_type" == "Darwin" ]; then
avail_gb=$(check_osx_memory)
avail_cores=`sysctl hw.ncpu | awk '/hw.ncpu:/ {print $2}'`
else
avail_gb=$(check_linux_memory)
avail_cores=`lscpu --parse=core | egrep -v ^# | sort -u | wc -l`
fi
echo "Found ${avail_gb}GB of memory and $avail_cores physical CPU cores"
1 Like
fuse
(Geoff Hughes)
October 16, 2022, 8:36pm
13
Thanks. I’m now going down the Prometheus rabbit hole.
Good stuff.
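First stop is probably the official discourse-prometheus plugin, which I believe installs like any other plugin: add the clone line to the after_code hook in app.yml and rebuild.

# in containers/app.yml, under hooks -> after_code -> exec -> cmd:
#   - git clone https://github.com/discourse/discourse-prometheus.git
cd /var/discourse
./launcher rebuild app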