During an upgrade, by far the greatest strain on memory (RAM+swap) comes when the ‘ember’ process runs. It seems to have been bigger each time I’ve run an update, and it is getting close to being unable to run on recommended-minimum sized computers.
It might be good to look into this before it actually fails. (Hopefully, for cost reasons, the answer will not be to increase the recommended-minimum size. Increasing swap would help, if disk space permits. In principle one could temporarily migrate to a more expensive larger-RAM instance.)
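For what it’s worth, adding a temporary swap file before the upgrade and removing it afterwards is straightforward, disk space permitting. A rough sketch (the path /swapfile-upgrade is just an example, and fallocate needs a filesystem that supports it; dd from /dev/zero does the same job otherwise):
# fallocate -l 1G /swapfile-upgrade
# chmod 600 /swapfile-upgrade
# mkswap /swapfile-upgrade
# swapon /swapfile-upgrade
and once the upgrade is done:
# swapoff /swapfile-upgrade
# rm /swapfile-upgrade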
I run two modest-size forums on small instances - both within the recommended minimums, I believe. In both cases RAM+swap = 3G: one is a Digital Ocean instance with 1G RAM and 2G swap, the other a Hetzner instance with 2G RAM and 1G swap.
Here are three snapshots of the ember process on the DO machine, using ps auxc:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
1000 10342 87.7 65.1 32930460 657936 ? Rl 16:57 2:23 ember
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
1000 10342 84.9 60.7 43572204 612668 ? Rl 16:57 2:57 ember
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
1000 10342 81.2 55.2 43405220 557128 ? Rl 16:57 3:40 ember
Obviously the 43GB process size (VSZ) isn’t all actually held in RAM or swap, as we only have 3G of those available; most of it is address space that is mapped but not resident. Using 65% of the RAM for the resident set (RSS) is impressive, but not in itself a problem. What matters is that the amount of free memory and free swap shows the machine is close to an Out of Memory (OOM) condition, which would most likely result in some process getting killed and an untidy end to the update.
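For the next run I’ll probably log these figures automatically rather than grabbing snapshots by hand; a simple loop along these lines should do (assuming the build keeps appearing under the process name ember, and that five-second sampling is fine):
# while sleep 5; do date; ps -o pid,vsz,rss,%mem,etime,comm -C ember; done >> ember-mem.log
Left running in another terminal, that records the date and ember’s VSZ/RSS every five seconds until it’s interrupted with Ctrl-C.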
Here’s free as a point-in-time snapshot:
# free
total used free shared buff/cache available
Mem: 1009140 863552 72768 6224 72820 34868
Swap: 2097144 1160628 936516
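A rough way to watch how close we are getting, without adding the columns up by eye, is to sum MemAvailable and SwapFree from /proc/meminfo (only an approximation of when the OOM killer would actually step in, but good enough as a warning light):
# awk '/MemAvailable|SwapFree/ {sum+=$2} END {printf "%.0f MiB of RAM+swap headroom\n", sum/1024}' /proc/meminfo
With the snapshot above that comes out at roughly 950M of headroom out of the 3G total.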
To try to catch the situation at its closest to failure, I used vmstat 5:
# vmstat 5 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 0 1392140 61200 11632 76432 41 32 117 93 0 1 2 1 97 0 0
1 1 1467220 63416 324 67284 8786 20499 13178 20567 2539 8924 77 13 0 10 0
0 2 1593340 57916 1096 53832 24262 46868 29986 46889 5377 18534 44 22 0 34 0
4 0 1155632 120680 2772 86280 39111 35424 54768 37824 6987 25174 38 27 0 35 0
3 0 1102988 74096 2852 85276 11261 246 12610 271 1879 6365 86 6 0 8 0
You’ll notice a lot of context switches (cs), a lot of disk activity (bi, bo) and a lot of swap activity (si, so), but the most important thing is the swap usage, up to 1.6G, with free memory down to about 60M and only about 54M of page cache. That means roughly 2.6G of the available 3G of virtual memory (RAM+swap) is in use, about 87% of capacity. (It might be a bit worse than that, as we’re only sampling every 5 seconds.)
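Next time I’ll probably also leave something logging at a one-second interval with timestamps, so the real peak isn’t missed between samples; something like:
# vmstat -t 1 >> vmstat-upgrade.log
which runs until interrupted and can be checked after the update for the worst swpd/free combination.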
Note that the situation was already worrying when I updated back in August, though at about 2G used it was not nearly as close to critical as today:
# vmstat 5 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 0 700404 62740 1956 48748 35 29 108 92 3 8 2 1 96 0 1
1 0 741000 65996 1880 44360 3708 11190 3982 11191 643 1437 92 4 0 3 1
1 0 834836 70452 1480 53856 528 18969 4274 18974 532 1575 93 6 0 1 0
4 1 1010144 82192 4644 44400 30065 38803 35455 39946 4432 19267 28 26 0 39 7
1 0 644116 307764 1644 55348 24406 21154 27724 21945 2551 8672 52 22 0 21 6