Ember-cli build memory usage risks failure (OOM) on minimum instance size

During an upgrade, by far the greatest strain on memory (RAM+swap) comes while the ‘ember’ process runs. Each time I’ve run an update its footprint seems to have been bigger than before, and it is getting close to being unable to run on recommended-minimum sized machines.

It might be good to look into this before it actually fails. (Hopefully, for cost reasons, the answer will not be to increase the recommended-minimum size. Increasing swap would help, if disk space permits. In principle one could temporarily migrate to a more expensive larger-RAM instance.)

I run two modest-size forums on small instances - both within the recommended minimums, I believe. In both cases, RAM+swap=3G. In one case a Digital Ocean instance with 1G RAM and 2G swap, in the other case a Hetzner instance with 2G RAM and 1G swap.

Here are three snapshots of the ember process, on the DO machine, using ps auxc:

USER       PID %CPU %MEM      VSZ    RSS TTY   STAT START   TIME COMMAND
1000     10342 87.7 65.1 32930460 657936 ?     Rl   16:57   2:23 ember

USER       PID %CPU %MEM      VSZ    RSS TTY   STAT START   TIME COMMAND
1000     10342 84.9 60.7 43572204 612668 ?     Rl   16:57   2:57 ember

USER       PID %CPU %MEM      VSZ    RSS TTY   STAT START   TIME COMMAND
1000     10342 81.2 55.2 43405220 557128 ?     Rl   16:57   3:40 ember

Obviously the 43GB virtual size (VSZ) isn’t all backed by real memory, as we only have 3G of RAM+swap available. Using 65% of the RAM for the resident set (RSS) is impressive, but not in itself a problem. The amount of free memory and free swap shows that the machine is close to an Out of Memory (OOM) condition, which would most likely result in some process being killed and an untidy end to the update.
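For anyone wanting to watch how close a build gets to that condition, here’s a small sketch (assuming a Linux host with /proc/meminfo) that reports how much of RAM+swap is committed; MemAvailable already discounts reclaimable cache:

```shell
#!/bin/sh
# Sketch, assuming Linux /proc/meminfo: report how much of RAM+swap is
# in use. All /proc/meminfo figures are in KiB.
awk '
  /^MemTotal:/     { mem_total  = $2 }
  /^MemAvailable:/ { mem_avail  = $2 }
  /^SwapTotal:/    { swap_total = $2 }
  /^SwapFree:/     { swap_free  = $2 }
  END {
    used  = (mem_total - mem_avail) + (swap_total - swap_free)
    total = mem_total + swap_total
    printf "virtual memory in use: %d MiB of %d MiB (%.0f%%)\n",
           used / 1024, total / 1024, 100 * used / total
  }' /proc/meminfo
```

Run it in a loop during the build and a percentage creeping into the 90s is the warning sign.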

Here’s free as a point-in-time snapshot:

# free
              total        used        free      shared  buff/cache   available
Mem:        1009140      863552       72768        6224       72820       34868
Swap:       2097144     1160628      936516

To try to catch the situation at its closest to failure, I used vmstat 5

# vmstat 5 5
procs -----------memory----------    ---swap-- -----io----  -system-- ------cpu-----
 r  b   swpd    free   buff  cache    si    so    bi    bo   in    cs us sy id wa st
 3  0 1392140  61200  11632  76432    41    32   117    93    0     1  2  1 97  0  0
 1  1 1467220  63416    324  67284  8786 20499 13178 20567 2539  8924 77 13  0 10  0
 0  2 1593340  57916   1096  53832 24262 46868 29986 46889 5377 18534 44 22  0 34  0
 4  0 1155632 120680   2772  86280 39111 35424 54768 37824 6987 25174 38 27  0 35  0
 3  0 1102988  74096   2852  85276 11261   246 12610   271 1879  6365 86  6  0  8  0

You’ll notice a lot of context switches (cs), a lot of disk activity (bi, bo) and a lot of swap activity (si, so), but the most important thing is the swap usage, up to 1.6G, with free memory down to 60M and only some 55M in buffers and cache. That means roughly 2.5G of the available 3G of virtual memory is in use - over 80% of capacity. (It might be a bit worse, as we’re only sampling every 5 seconds.)
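Rather than eyeballing the vmstat output, the worst case can be caught automatically. A sketch that tails vmstat and reports only new swap high-water marks (column 3, swpd, in KiB, assuming procps-style output):

```shell
# Sketch: tail vmstat and report only new swap high-water marks.
# Column 3 (swpd) is swap in use, in KiB, with procps-style vmstat;
# "-n" prints the two header lines once, and NR > 2 skips them.
# Runs until interrupted.
vmstat -n 1 | awk '
  NR > 2 && ($3 + 0) > max {
    max = $3
    print "new swpd high-water mark:", max, "KiB"
    fflush()
  }'
```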

Note that the situation was already worrying back in August when I last updated, although at about 2G used it was not nearly as close to critical as today:

# vmstat 5 5
procs -----------memory----------    ---swap-- -----io----  -system-- ------cpu-----
 r  b    swpd   free   buff  cache    si    so    bi    bo   in    cs us sy id wa st
 3  0  700404  62740   1956  48748    35    29   108    92    3     8  2  1 96  0  1
 1  0  741000  65996   1880  44360  3708 11190  3982 11191  643  1437 92  4  0  3  1
 1  0  834836  70452   1480  53856   528 18969  4274 18974  532  1575 93  6  0  1  0
 4  1 1010144  82192   4644  44400 30065 38803 35455 39946 4432 19267 28 26  0 39  7
 1  0  644116 307764   1644  55348 24406 21154 27724 21945 2551  8672 52 22  0 21  6

Hi @Ed_S - what version of Discourse were you using for these tests? We regularly update ember-cli and its addons, so I just want to be sure we’re looking at the same thing.

Also, how many CPU cores do your VMs have? 1? (you can check by running lscpu in the console)

So that we’re all working with the same data, could you try running:

/var/discourse/launcher enter app
cd /var/www/discourse/app/assets/javascripts/discourse
apt-get update && apt-get install time
NODE_OPTIONS='--max-old-space-size=2048' /usr/bin/time -v yarn ember build -prod
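As an aside, if you want to confirm what heap ceiling a given NODE_OPTIONS setting actually gives you, Node can report it itself. heap_size_limit comes back in bytes, and will be a little above the old-space figure because it includes the other heap spaces:

```shell
# Sketch: ask V8 for its effective heap ceiling under a given
# NODE_OPTIONS setting. Prints a figure slightly above 2048 MiB here.
NODE_OPTIONS='--max-old-space-size=2048' \
  node -e 'const v8 = require("v8");
           const mib = v8.getHeapStatistics().heap_size_limit / 1048576;
           console.log(Math.round(mib), "MiB")'
```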

On my test droplet (1 CPU, 1GB RAM, 2GB swap), I see this:

Command being timed: "yarn ember build -prod"
	User time (seconds): 369.74
	System time (seconds): 22.62
	Percent of CPU this job got: 81%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 8:02.73
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 774912
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 253770
	Minor (reclaiming a frame) page faults: 1158920
	Voluntary context switches: 519269
	Involuntary context switches: 383328
	Swaps: 0
	File system inputs: 7521784
	File system outputs: 316304
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

We’re using pretty standard ember tooling here, so I’m not sure there’s much we can do in terms of config to reduce memory usage. Our long-term aim is to move to using Embroider, which may give us more options.


Thanks @david - appreciate that ember is a thing in its own right.

I’ve just done those commands.

# /var/discourse/launcher enter app
x86_64 arch detected.

WARNING: We are about to start downloading the Discourse base image
This process may take anywhere between a few minutes to an hour, depending on your network speed

Please be patient

2.0.20220720-0049: Pulling from discourse/base
Digest: sha256:7ff397003c78b64c9131726756014710e2e67568fbc88daad846d2b368a02364
Status: Downloaded newer image for discourse/base:2.0.20220720-0049
docker.io/discourse/base:2.0.20220720-0049

This is a production installation so, as of yesterday, it was up to date. Presently reporting:

Installed 2.9.0.beta12 (8f5936871c)

It’s a one-CPU instance, like yours it’s 1G of RAM and 2G of swap.

The result of the time command was:

Done in 303.21s.
	Command being timed: "yarn ember build -prod"
	User time (seconds): 222.71
	System time (seconds): 17.17
	Percent of CPU this job got: 78%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 5:04.15
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 702292
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 348190
	Minor (reclaiming a frame) page faults: 1152689
	Voluntary context switches: 617736
	Involuntary context switches: 774189
	Swaps: 0
	File system inputs: 5001936
	File system outputs: 318280
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

Immediately prior, I’d updated the host and rebooted, so everything in the container would have been freshly restarted.

The worst of the memory usage as reported by a vmstat running in a different window:

# vmstat 1
procs  -----------memory----------    ---swap--  -----io----   -system-- ------cpu-----
 r  b    swpd   free   buff  cache    si     so    bi     bo    in    cs us sy id wa st
 2  0  704000 136044  24136 158144  1517   3503  8256   4377   886  3564 43  8 43  6  0
...
 5  0 1451436  71604   1248  50196 55016 110236 73204 121060 13152 45971 29 60  0 10  1

Looks like we explicitly increased Node’s allowable heap from 500M to 2G - possibly this is a step too far, and 1.5G would be better.

It’s worth noting that ember isn’t the only thing running on the machine, and we’re up against the global limit of RAM+swap. So the machine’s history, and the needs of all the other running processes, come into play. My reboot might have helped here, reaching a lower high-water mark than yesterday’s.

The pull request above was referenced in “Failed to upgrade discourse instance to Feb 15 2022”, where we also note that someone had a memory shortage which was resolved by a reboot.

It’s unfortunate that the time command doesn’t report peak total memory usage: its “Maximum resident set size” only counts pages in RAM, not anything swapped out. On a machine with at least 3G of RAM and no swap, the RSS count would tell us ember’s peak usage. Or possibly we could use another tactic - several are outlined here and there are some ideas here too.
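One such tactic, sketched here assuming a Linux /proc filesystem: poll VmRSS and VmSwap for the process and keep the highest sum seen. Unlike time’s figure, this counts swapped-out pages too. (The script name is made up for illustration.)

```shell
#!/bin/sh
# Sketch, assuming Linux /proc: track the peak RSS+swap of a process.
# time(1)'s "Maximum resident set size" misses pages that were swapped
# out, so we add VmSwap to VmRSS and keep the largest sum seen.
# Usage: ./peakmem.sh <pid>    (peakmem.sh is a hypothetical name)
pid=$1
peak=0
while [ -d "/proc/$pid" ]; do
  now=$(awk '/^VmRSS:/  { rss  = $2 }
             /^VmSwap:/ { swap = $2 }
             END        { print rss + swap }' "/proc/$pid/status" 2>/dev/null)
  [ "${now:-0}" -gt "$peak" ] && peak=$now
  sleep 0.5   # fractional sleep: GNU coreutils, not strict POSIX
done
echo "peak RSS+swap: $peak KiB"
```

A half-second interval can still miss a short spike, but it is much finer-grained than the 5-second vmstat samples above.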

What’s awkward is that we really are interested in total memory use (RAM+swap) here, whereas in many cases people are interested in RAM use, which is a different question.


The reason we added that flag was that Node was aborting the build with its own out-of-memory error - 500M wasn’t enough. Happy to try tweaking it to 1.5G - I just tried it on my droplet and it seems to work ok. In fact, it seems even 1.0G is enough.

I tried tracking the memory usage with different max_heap sizes:

(while(true); do (free -m -t | grep Total | awk '{print $3}') && sleep 0.5; done) | tee 1000mb.csv
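Since each sample that sampler writes is just a single used-memory figure, the high-water mark for a run can be pulled out of the file afterwards with a one-liner:

```shell
# The sampler logs one "used MiB" figure per line; the peak for the run
# is simply the largest value in the file.
sort -n 1000mb.csv | tail -1
```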

Shows this usage during the build:

There was very little difference in build time, but the 1GB and 1.5GB limits clearly produce less overall usage. As expected, the time output shows significantly fewer “Major page faults” when the node limit is lower.

It’s curious that the difference between 1.5GB and 1GB is so small… :face_with_monocle:

In any case, I agree decreasing the limit is a good idea. To make sure it doesn’t affect the build performance on higher-spec machines, I think we should only override the limit when we know it’s too low. Otherwise, we can let Node use its default.
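A minimal sketch of that idea, where the 2G threshold and 1G cap are illustrative values rather than the ones actually chosen:

```shell
#!/bin/sh
# Hypothetical sketch of "only cap Node's heap when the machine is small":
# leave Node's default alone on well-provisioned hosts, cap it otherwise.
# The 2G threshold and 1G cap here are illustrative, not Discourse's values.
mem_kib=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
if [ "$mem_kib" -lt 2097152 ]; then    # less than 2 GiB of RAM
  export NODE_OPTIONS='--max-old-space-size=1024'
  echo "low-memory host: capping Node heap at 1G"
fi
yarn ember build -prod
```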

Here’s a PR - we’ll try and get this merged soon. Thanks for raising this @Ed_S!
