Whole machine hangs during upgrade

uwe_keim · January 31, 2024, 6:21am

Ever since I tried out the AI plugins (and later removed them again), my machine totally hangs during /admin/upgrade.

Not every time but approximately 80% of the time.

Usually my whole EC2 instance freezes and I have to do a hard reboot through the AWS EC2 web UI.

Today it hangs again. To my surprise it doesn’t freeze completely. When opening the root URL it now shows:

Oops

The software powering this discussion forum encountered an unexpected problem. We apologize for the inconvenience.

Detailed information about the error was logged, and an automatic notification generated. We’ll take a look at it.

No further action is necessary. However, if the error condition persists, you can provide additional detail, including steps to reproduce the error, by posting a discussion topic in the site’s feedback category.

I will now try to reboot it again and do the usual sudo ./launcher rebuild app stuff, which has fixed it until now. Fingers crossed it works again today.

My question

Can anyone give me some hints on where I could take a look into log files or things like this to get at least an error message of why the hangs occur?

sam · January 31, 2024, 7:31am

The official AI plugin?

I would run this from the console and see where it gets stuck, share the logs.

uwe_keim · January 31, 2024, 8:00am

Yes, the official plugin.

I uninstalled it by removing the plugins from app.yml again and then rebuilding. Maybe this is not enough?

What is meant by “this”? The sudo ./launcher rebuild app?

merefield · January 31, 2024, 10:17am

What’s the spec of your server?

Online upgrades imho require a 4 GB server + 2 GB swap as a minimum these days.

uwe_keim · January 31, 2024, 10:33am

I’m using an AWS EC2 “t2.medium” with 2 vCPUs and 4 GiB RAM.

The HDD is 100 GiB with 60 GiB of free space.

If it helps, I can upgrade “t2.medium” to a larger instance type.

I’m just confused that this setup ran rock solid (for years) before I tested the official AI plugin, and that these hangs during upgrade have only occurred since I removed it.

Ed_S · January 31, 2024, 11:55am

Another thing has changed: the version of the software you are upgrading to. It has become more memory-hungry lately. So I think it could be either one.

A temporary and reversible upgrade to an instance with more RAM is probably the easiest way to test if memory shortage is the problem, although it does cost a couple of reboots. The other way is to add swap, which is also reversible.

pfaffman · January 31, 2024, 4:16pm

I would try adding swap.

uwe_keim · January 31, 2024, 4:27pm

Thanks, guys, I’ll google how to do this and then do it.

Update 1

I’ve now added 8 GiB of swap:

$ free -h
               total        used        free      shared  buff/cache   available
Mem:           3.8Gi       290Mi       2.9Gi       1.0Mi       677Mi       3.3Gi
Swap:          8.0Gi          0B       8.0Gi
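
The exact commands used to add the swap were not posted. One common way to do it is sketched below; the function name add_swapfile, the /swapfile path and the 8G default are assumptions for illustration, and the function must be run as root. On some filesystems a file created with fallocate may not be accepted by swapon, in which case creating the file with dd is the usual fallback.

```shell
# Hypothetical helper: create and enable a swap file (run as root).
# Defaults (/swapfile, 8G) are assumptions, not taken from the thread.
add_swapfile() {
  local path="${1:-/swapfile}" size="${2:-8G}"

  # Never overwrite an existing file.
  if [ -e "$path" ]; then
    echo "refusing to overwrite existing $path" >&2
    return 1
  fi

  fallocate -l "$size" "$path" || return 1   # reserve the space
  chmod 600 "$path" || return 1              # swap must not be world-readable
  mkswap "$path" || return 1                 # write the swap signature
  swapon "$path" || return 1                 # enable it immediately

  # Make the swap survive reboots.
  echo "$path none swap sw 0 0" >> /etc/fstab
}
```

After something like add_swapfile /swapfile 8G, running free -h should show the new Swap line.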

I’ll post an update here after the next few upgrades to say whether this helped.

Update 2

Just did an /admin/upgrade and monitored the RAM usage:

$ free -h
               total        used        free      shared  buff/cache   available
Mem:           3.8Gi       1.4Gi       1.5Gi        50Mi       891Mi       2.0Gi
Swap:          8.0Gi       200Mi       7.8Gi

And the upgrade ran through successfully. Hope it stays this way.

Update 3

Several days and upgrades later, I never experienced a hang again.

So I do think the swap was the solution. Thanks again to everyone who helped me with this issue.

Jagster · February 1, 2024, 8:54pm

This is a bit off topic, but I really would like to understand: why did swap, of which only 200 MB was used, help when there was 2 GB of free RAM?

(I understand that in the inch world the SI system can be confusing because it uses base 10, but why the heck Mi? I can kind of understand Gi if it is shortened from giga, but shouldn’t mega then be Me?)

Firepup650 · February 1, 2024, 9:02pm

Mi for Mibibytes I’d assume, and Gi for Gibibytes.

Jagster · February 1, 2024, 9:05pm

Thanks. I did not know that, obviously. But it is mebibyte.

And for others who didn’t know either

Ed_S · February 1, 2024, 10:15pm

I think the original problem was probably a process getting killed because the machine was out of memory (beware the OOM killer). Adding swap meant that memory was not exhausted. Those two outputs of free may not be telling the whole story, unless they were very carefully taken at the moment of most machine stress. It’s the peak swap usage that’s interesting, I think.
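
To check the OOM-killer theory (and to answer the original question about where to look for an error message), the kernel log can be searched for typical OOM messages. This is a minimal sketch: the function name filter_oom_lines is made up for illustration, and the search patterns are an assumption about how the kernel words these messages.

```shell
# Hypothetical helper: keep only lines on stdin that look like
# OOM-killer activity in a kernel log.
filter_oom_lines() {
  # "|| true" so that "no match" is not treated as a failure.
  grep -i -E 'out of memory|oom-kill|killed process' || true
}

# Usage (may need root):
#   dmesg -T | filter_oom_lines
#   journalctl -k -b -1 | filter_oom_lines   # previous boot
```

Note that dmesg only covers the current boot, so after a hard reboot the previous boot's kernel log is only available if the journal is stored persistently.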

But there’s also the question of the kernel tunable as mentioned in MKJ’s Opinionated Discourse Deployment Configuration, which I’ve got set correctly, but which maybe lots of people don’t have set correctly.

MKJ's Opinionated Discourse Deployment Configuration

Kernel configuration

Redis (one of the key components on which Discourse is built) strongly recommends disabling transparent huge pages, and I also allow memory overcommit.
echo 'sys.kernel.mm.transparent_hugepage.enabled=never' > /etc/sysctl.d/10-huge-pages.conf
echo 'vm.overcommit_memory=1' > /etc/sysctl.d/90-vm_overcommit_memory.conf
sysctl --system

Worth noting that the memory overcommit has nothing much to do with Redis. It’s just that Redis is kind enough to hint that it should be set correctly.

uwe_keim · February 2, 2024, 6:04am

Just started yet another /admin/upgrade and had a shell open to manually call free -h every second or so.

The highest memory usage values I could find were these:

$ free -h
               total        used        free      shared  buff/cache   available
Mem:           3.8Gi       3.2Gi       120Mi        80Mi       542Mi       266Mi
Swap:          8.0Gi       276Mi       7.7Gi

The upgrade succeeded.
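
Calling free by hand can miss the peak. A small sampler that reads /proc/meminfo once per second and reports new extremes might look like the sketch below; the function names meminfo_kib, swap_used_kib and watch_memory_peaks are made up for illustration.

```shell
# Print the value (in KiB) of one /proc/meminfo field, e.g. MemAvailable.
# The optional second argument is an alternative file, used for testing.
meminfo_kib() {
  local field="$1" file="${2:-/proc/meminfo}"
  awk -v f="$field:" '$1 == f { print $2 }' "$file"
}

# Swap currently in use, in KiB (SwapTotal minus SwapFree).
swap_used_kib() {
  local file="${1:-/proc/meminfo}"
  echo $(( $(meminfo_kib SwapTotal "$file") - $(meminfo_kib SwapFree "$file") ))
}

# Sample once per second until interrupted (Ctrl+C) and print a line
# whenever swap usage reaches a new high or MemAvailable a new low.
watch_memory_peaks() {
  local peak_swap=0 min_avail="" swap avail
  while true; do
    swap=$(swap_used_kib)
    avail=$(meminfo_kib MemAvailable)
    if [ "$swap" -gt "$peak_swap" ]; then
      peak_swap=$swap
      echo "$(date +%T) new peak swap used: ${peak_swap} KiB"
    fi
    if [ -z "$min_avail" ] || [ "$avail" -lt "$min_avail" ]; then
      min_avail=$avail
      echo "$(date +%T) new low MemAvailable: ${min_avail} KiB"
    fi
    sleep 1
  done
}
```

Starting watch_memory_peaks in a second shell before launching /admin/upgrade leaves the last printed lines as the extremes seen during the run.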

Jagster · February 2, 2024, 7:21am

So 4 GB is just on the edge at build time, if we suppose that the last screenshot shows the most stressful moment.

And that is another thing I can’t understand: why do others hit low memory while I, who use a lot of plugins and components, had zero issues? What makes that difference?

And I said “had” because nowadays I have 8 GB because of AI (for me the price difference was not so important, but that is another story).

Should this thread be moved somewhere else, or do we see this as an explanation of why using swap helped?

Anyway, for other beginners, here is one example where low memory and the reasons for it are discussed a little:

This is very much a FAQ when upgrading fails, but the reason is rarely explained.

Ed_S · February 2, 2024, 8:11am

@Jagster @uwe_keim please could you report the output of these commands

cat /proc/sys/vm/overcommit_memory 
cat /sys/kernel/mm/transparent_hugepage/enabled

On my systems I have

# cat /proc/sys/vm/overcommit_memory 
1
# cat /sys/kernel/mm/transparent_hugepage/enabled 
always madvise [never]

uwe_keim · February 2, 2024, 9:13am

$ cat /proc/sys/vm/overcommit_memory
0

and

$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never

Ed_S · February 2, 2024, 9:14am

Thanks @uwe_keim - I’m going to suppose that those kernel tunables are the reason you needed to add swap, even though it didn’t seem to be used. (The same would apply if you’d needed to add loads of RAM, because the total available memory is RAM+swap.)

uwe_keim · February 2, 2024, 9:18am

I can change server settings any time if you recommend doing so.

Jagster · February 2, 2024, 12:08pm

root@foorumi-hel:/var/discourse# cat /proc/sys/vm/overcommit_memory
0
root@foorumi-hel:/var/discourse# cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never

Ed_S · February 2, 2024, 1:13pm

I do recommend!

This will fix it for future reboots (note that it overwrites files without checking the current state):

echo 'sys.kernel.mm.transparent_hugepage.enabled=never' > /etc/sysctl.d/10-huge-pages.conf
echo 'vm.overcommit_memory=1' > /etc/sysctl.d/90-vm_overcommit_memory.conf
sysctl --system
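
To change the running kernel right away and confirm the result with the same two files queried earlier in the thread, something like the following sketch could be used. The function names thp_active_mode and apply_kernel_tunables_now are made up for illustration, and the second one needs root. One caveat, stated as my understanding rather than something confirmed in this thread: sysctl manages entries under /proc/sys, while the transparent huge pages switch lives under /sys, so it is worth checking after a reboot whether the huge-pages line above actually took effect.

```shell
# Print the active transparent-huge-pages mode (the bracketed word)
# from text such as "always [madvise] never" read on stdin.
thp_active_mode() {
  sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Hypothetical helper: apply both settings to the running kernel
# (requires root). Defined only; not invoked automatically.
apply_kernel_tunables_now() {
  sysctl -w vm.overcommit_memory=1 || return 1
  echo never > /sys/kernel/mm/transparent_hugepage/enabled || return 1

  # Verify using the same files that were checked earlier in the thread.
  cat /proc/sys/vm/overcommit_memory
  thp_active_mode < /sys/kernel/mm/transparent_hugepage/enabled
}
```

If both settings applied, calling apply_kernel_tunables_now should end by printing 1 and never.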
