Whole machine hangs during upgrade

Ever since I tried out the AI plugins (and later removed them again), my machine totally hangs during /admin/upgrade.

Not every time but approximately 80% of the time.

Usually my whole EC2 instance freezes and I have to do a hard reboot through the AWS EC2 web UI.

Today it hangs again. To my surprise, it doesn’t freeze completely this time. When I open the root URL, it now shows:

Oops

The software powering this discussion forum encountered an unexpected problem. We apologize for the inconvenience.

Detailed information about the error was logged, and an automatic notification generated. We’ll take a look at it.

No further action is necessary. However, if the error condition persists, you can provide additional detail, including steps to reproduce the error, by posting a discussion topic in the site’s feedback category.

I will now reboot it again and do the usual sudo ./launcher rebuild app, which has fixed it so far. Fingers crossed it does the trick again today.

My question

Can anyone give me some hints on where to look, log files or the like, so I can at least get an error message explaining why the hangs occur?

The official AI plugin?

I would run it from the console, see where it gets stuck, and share the logs.
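Something along these lines, for example (the log file name is just a suggestion):

cd /var/discourse
sudo ./launcher rebuild app 2>&1 | tee rebuild.log

On a standard install, the Rails logs under /var/discourse/shared/standalone/log/rails/ are also worth a look.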

1 Like

Yes, the official plugin.

I uninstalled it by removing the plugins from app.yml again and then rebuilding. Maybe that is not enough?
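For clarity, the hooks section of my app.yml now looks roughly like this (only a sketch; the remaining plugin list is illustrative):

hooks:
  after_code:
    - exec:
        cd: $home/plugins
        cmd:
          - git clone https://github.com/discourse/docker_manager.git
          # the git clone line for discourse-ai that used to be here is removed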

What is meant by “this”? The sudo ./launcher rebuild app?

1 Like

What’s the spec of your server?

Online upgrades, IMHO, require a 4 GB server plus 2 GB of swap as a minimum these days.

2 Likes

I’m using an AWS EC2 “t2.medium” with 2 vCPUs and 4 GiB RAM.

The HDD is 100 GiB with 60 GiB of free space.

If it helps, I can upgrade from the “t2.medium” to a larger instance type.

I’m just confused that this setup ran rock solid (for years) before my testing of the official AI plugin, and that these hangs during upgrades have only occurred since removing it.

1 Like

Another thing has changed: the version of the software you are upgrading to. It has become more memory-hungry lately. So I think it could be either one.

A temporary and reversible upgrade to an instance with more RAM is probably the easiest way to test if memory shortage is the problem, although it does cost a couple of reboots. The other way is to add swap, which is also reversible.
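For the record, the resize route can also be scripted with the AWS CLI, roughly like this (instance ID and target type are placeholders):

aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0
aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 --instance-type '{"Value": "t3.large"}'
aws ec2 start-instances --instance-ids i-0123456789abcdef0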

3 Likes

I would try adding swap.
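On most Linux systems that is roughly the following (file name and size are only examples):

sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab    # keep the swap across reboots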

2 Likes

Thanks, guys, I’ll google how to do this and then do it :slight_smile:.

Update 1

I’ve now added 8 GiB of swap:

$ free -h
               total        used        free      shared  buff/cache   available
Mem:           3.8Gi       290Mi       2.9Gi       1.0Mi       677Mi       3.3Gi
Swap:          8.0Gi          0B       8.0Gi

I’ll post an update here after the next few upgrades on whether this helped.

Update 2

Just did an /admin/upgrade and monitored the RAM usage:

$ free -h
               total        used        free      shared  buff/cache   available
Mem:           3.8Gi       1.4Gi       1.5Gi        50Mi       891Mi       2.0Gi
Swap:          8.0Gi       200Mi       7.8Gi

And the upgrade ran through successfully. :tada: I hope it stays that way.

Update 3

Several days and several upgrades later, I have not experienced a single hang again.

So I do think the swap was the solution. Thanks again to everyone who helped me with this issue.

2 Likes

This is a bit off topic, but I would really like to understand: why did swap, of which only 200 MB was used, help when there was 2 GB of free RAM?

(I understand that in the inch world the SI system can be confusing because it is based on powers of ten, but why the heck Mi? I can kind of understand Gi if it is shortened from giga, but shouldn’t mega then be Me?)

1 Like

Mi for Mibibytes I’d assume, and Gi for Gibibytes.
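(For the record: 1 MiB = 2^20 = 1,048,576 bytes, while 1 MB = 10^6 = 1,000,000 bytes; likewise GiB is 2^30 bytes versus 10^9 for GB.)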

1 Like

Thanks. I did not know that; obvious in hindsight. But it is mebibyte :wink:

And for others who didn’t know either :smirk:

2 Likes

I think the original problem was probably a process getting killed because the machine was out of memory (beware the OOM killer). Adding swap meant that memory was not exhausted. Those two outputs of free may not be telling the whole story, unless they were very carefully taken at the moment of most machine stress. It’s the peak swap usage that’s interesting, I think.
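If the OOM killer did fire, it normally leaves a trace in the kernel log, so something like this should show it (exact wording varies between kernels):

dmesg -T | grep -i -E 'out of memory|oom'
journalctl -k | grep -i oom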

But there’s also the question of the kernel tunable as mentioned in
MKJ’s Opinionated Discourse Deployment Configuration
which I’ve got set correctly, but which maybe lots of people don’t have set correctly.

Worth noting that the memory overcommit has nothing much to do with Redis; it’s just that Redis is kind enough to hint that it should be set correctly.

3 Likes

Just started yet another /admin/upgrade and had a shell open to manually run free -h every second or so.

The highest memory usage values I could catch were these:

$ free -h
               total        used        free      shared  buff/cache   available
Mem:           3.8Gi       3.2Gi       120Mi        80Mi       542Mi       266Mi
Swap:          8.0Gi       276Mi       7.7Gi

The upgrade succeeded.
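Next time I will probably just leave something like this running instead of re-typing the command by hand, although the true peak can still slip between samples:

watch -n 1 free -h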

1 Like

So 4 GB is right on the edge at build time, if we assume that last output captured the most stressful moment.

And that is another thing I can’t understand: why do others hit low memory while I, who use a lot of plugins and components, had zero issues :thinking: What makes the difference?

And I said had because nowadays I have 8 GB because of AI (and for me the price difference was not that important, but that is another story).

Should this thread move somewhere else, or are we treating it as an explanation of why using swap helped?

Anyway, for other beginners, here is one example where low memory and the reasons behind it are discussed a bit:

That is a very frequent question when upgrading fails, but the reason for it is rarely explained.

1 Like

@Jagster @uwe_keim please could you report the output of these commands

cat /proc/sys/vm/overcommit_memory 
cat /sys/kernel/mm/transparent_hugepage/enabled 

On my systems I have

# cat /proc/sys/vm/overcommit_memory 
1
# cat /sys/kernel/mm/transparent_hugepage/enabled 
always madvise [never]

1 Like

$ cat /proc/sys/vm/overcommit_memory
0

and

$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never

1 Like

Thanks @uwe_keim - I’m going to suppose that those kernel tunables are the reason you needed to add swap, even though it didn’t seem to be used. (The same would apply if you’d needed to add loads of RAM, because the total available memory is RAM+swap.)

1 Like

I can change server settings any time if you recommend doing so.

root@foorumi-hel:/var/discourse# cat /proc/sys/vm/overcommit_memory
0
root@foorumi-hel:/var/discourse# cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never

2 Likes

I do recommend!

This will fix it for future reboots (note that it overwrites files without checking their current state). Transparent hugepages are not a sysctl (they live under /sys, not /proc/sys), so they get a tmpfiles.d entry instead:

echo 'w /sys/kernel/mm/transparent_hugepage/enabled - - - - never' > /etc/tmpfiles.d/10-huge-pages.conf
echo never > /sys/kernel/mm/transparent_hugepage/enabled    # apply immediately as well
echo 'vm.overcommit_memory=1' > /etc/sysctl.d/90-vm_overcommit_memory.conf
sysctl --system

1 Like