RBoy
(RBoy)
14 May 2019, 10:51pm
1
For the last few weeks I’ve noticed that system memory usage creeps up every day until it maxes out.
Historically, memory usage has been about 50% - 55% (on a 3GB system). Now, after an update, it starts out at 50%, but over the next few days it slowly creeps up to 85% and then starts using up swap.
Is there a way to find out what in Discourse is creeping up and taking memory? The task manager only shows Ruby slowly increasing the amount of memory it’s consuming. Each Ruby process seems to be taking up 350M and growing (it starts at under 200M after an update).
I just updated to v2.3.0.beta9 +392 two days ago; it’s already gone from 50% to 75% and doesn’t seem to be stabilizing.
3 Likes
david
(David Taylor)
14 May 2019, 10:53pm
2
Try updating again. We noticed the same issue and applied a fix a few hours ago (commit 1, commit 2).
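(A minimal sketch of a command-line update, assuming the standard discourse_docker install at /var/discourse; updating via /admin/upgrade in the browser works too:)

```bash
# Sketch: update a standard discourse_docker install from the shell.
cd /var/discourse
git pull                 # refresh the discourse_docker scripts themselves
./launcher rebuild app   # pull the latest Discourse code and rebuild the container
```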
11 Likes
RBoy
(RBoy)
14 May 2019, 11:03pm
3
Okay, updated, and it restarted at 47%; I’ll keep an eye on it. Thanks for the quick response.
3 Likes
RBoy
(RBoy)
14 May 2019, 11:46pm
4
It’s already crept back up to 61%, now 64%; the Ruby processes are all now in the range of 310M-340M. I’ll watch it for a day and report back.
Not sure if it’s related but I’m seeing this every night for the past week or so around 1am in the logs:
Sidekiq is consuming too much memory (using: 502.99M)
3 Likes
david
(David Taylor)
15 May 2019, 8:14am
5
You could try enabling the Sidekiq logs and then looking for which job is causing the problem. Some information on those logs can be found in this commit message:
committed 11:19AM - 05 Mar 19 UTC
By default, this does nothing. Two environment variables are available:
- `DISCOURSE_LOG_SIDEKIQ`
Set to `"1"` to enable logging. This will log all completed jobs to `log/rails/sidekiq.log`, along with various db/redis/network statistics. This is useful to track down poorly performing jobs.
- `DISCOURSE_LOG_SIDEKIQ_INTERVAL`
(seconds) Check running jobs periodically, and log their current duration. They will appear in the logs with `status:pending`. This is useful to track down jobs which take a long time, then crash sidekiq before completing.
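A minimal sketch of how these variables might be set on a standard discourse_docker install (the app.yml location and the interval value below are assumptions, not from the commit):

```bash
# Sketch, assuming the standard discourse_docker layout at /var/discourse.
cd /var/discourse

# Add the variables to the env: section of containers/app.yml, for example:
#   env:
#     DISCOURSE_LOG_SIDEKIQ: "1"
#     DISCOURSE_LOG_SIDEKIQ_INTERVAL: "30"   # example value, in seconds
nano containers/app.yml

# Rebuild so the container picks up the new environment variables.
./launcher rebuild app

# Follow the job log from inside the container (path as given in the commit message).
./launcher enter app
cd /var/www/discourse && tail -f log/rails/sidekiq.log
```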
7 Likes
RBoy
(RBoy)
15 May 2019, 1:18pm
6
The memory utilization is back up to 73% and doesn’t seem to be slowing down. It’s now beginning to take up swap space.
I’m not sure how to do this and would need some guidance. I had a look at the commit, and it talks about setting two environment variables. How do I set these? I’m not familiar with Ruby/Docker and don’t want to mess anything up, as this is a live site.
Is there anything else I can look at to see why the memory utilization is creeping up?
I’m also seeing a new error in the logs after the update (2 since yesterday):
Job exception: post_revision_id
Falco
(Falco)
15 May 2019, 1:55pm
7
RBoy:
Okay updated
Did you do a rebuild? Are you on the default branch of tests-passed?
3 Likes
RBoy
(RBoy)
15 May 2019, 2:21pm
8
Yes, and yes I assume; I’m using the default setup. (Is there a way to select a different branch?)
Stephen
(Stephen)
15 May 2019, 2:38pm
9
There is, but that’s the right release to be getting any fixes.
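(For reference, a sketch of where the branch is selected on a standard discourse_docker install; the exact commented-out line may differ in your app.yml:)

```bash
# Sketch, assuming the standard discourse_docker layout.
# The Git revision the container tracks is set in containers/app.yml,
# typically via a commented-out line such as:
#   #version: tests-passed
# Leaving it commented means the default branch, tests-passed, is used.
grep -n "version" /var/discourse/containers/app.yml
```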
3 Likes
RBoy
(RBoy)
16 May 2019, 1:05am
10
@sam is this commit related to this issue? If so, is it stable enough to update to?
committed 11:50PM - 15 May 19 UTC
v8 forking is not supported and can lead to memory leaks.
This commit handles the most common case which is the unicorn master forking
There are still some cases related to backup where we fork, however those
forks are usually short lived so the memory leak is not severe, burning
the contexts in the master process could break sidekiq or web process that
do the actual forking
2 Likes
sam
(Sam Saffron)
16 May 2019, 1:07am
11
The issue itself was fixed days ago; it is stable enough to upgrade.
4 Likes
RBoy
(RBoy)
16 May 2019, 1:34am
12
Okay, updated; I’ll keep an eye on it. Hopefully this will fix it.
I didn’t get what you meant by the issue being fixed days ago; the memory consumption as of this evening is still creeping up.
2 Likes
Does this fix require a rebuild, or can I just upgrade via the UI?
sam
(Sam Saffron)
16 May 2019, 8:52am
14
Via the UI should be fine
2 Likes
RBoy
(RBoy)
16 May 2019, 8:13pm
15
Okay, so I did an update and rebuild last night. The memory usage is back up to 71% and still growing. The only way to reduce it is to restart Discourse, at which point it drops back down to under 50% and then starts working its way up again. The CPU utilization is about 1% on average.
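(For reference, a restart on a standard discourse_docker install is roughly:)

```bash
# Sketch, assuming the standard discourse_docker install at /var/discourse.
cd /var/discourse
./launcher restart app   # stop and start the container without rebuilding it
```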
sam
(Sam Saffron)
16 May 2019, 8:15pm
16
What process is growing? Sidekiq? Unicorn worker? Redis? PG?
3 Likes
RBoy
(RBoy)
16 May 2019, 8:35pm
17
That’s a good question, and it’s exactly what I was asking earlier: how do I find out what’s taking up memory within Discourse? All I can see is the task manager, which shows Ruby taking up more memory over time (all of the Ruby instances are growing in memory consumption).
sam
(Sam Saffron)
16 May 2019, 8:36pm
18
As root, run `ps aux` and repeat it every few hours.
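A minimal sketch of one way to capture this on a schedule (the log file path and the interval are arbitrary choices):

```bash
# Sketch: append a timestamped snapshot of the top memory consumers every few hours.
while true; do
  {
    date
    ps aux --sort=-%mem | head -n 15
    echo
  } >> /root/discourse-mem.log
  sleep 3h   # GNU sleep accepts the "h" suffix; use 10800 on other systems
done
```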
3 Likes
RBoy
(RBoy)
16 May 2019, 8:55pm
19
Okay, when it was taking up 71% of memory, the top 14 consumers (by %MEM) were:
PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
49458 0.3 8.0 938568 326016 ? Sl May15 4:16 unicorn worker[2] -E production -c config/unicorn.conf.rb
49418 0.6 8.0 1041604 324192 ? SNl May15 7:19 sidekiq 5.2.7 discourse [0 of 5 busy]
49448 0.3 7.9 938056 321148 ? Sl May15 4:22 unicorn worker[1] -E production -c config/unicorn.conf.rb
49504 0.3 7.9 943692 319948 ? Sl May15 4:16 unicorn worker[7] -E production -c config/unicorn.conf.rb
49495 0.3 7.9 928328 319480 ? Sl May15 4:21 unicorn worker[6] -E production -c config/unicorn.conf.rb
49476 0.3 7.9 933448 318464 ? Sl May15 4:20 unicorn worker[4] -E production -c config/unicorn.conf.rb
49486 0.3 7.8 946768 315236 ? Sl May15 4:07 unicorn worker[5] -E production -c config/unicorn.conf.rb
49467 0.3 7.8 928840 315108 ? Sl May15 4:05 unicorn worker[3] -E production -c config/unicorn.conf.rb
49439 0.3 7.7 928328 313640 ? Sl May15 4:14 unicorn worker[0] -E production -c config/unicorn.conf.rb
49317 0.1 4.8 485628 196588 ? Sl May15 2:03 unicorn master -E production -c config/unicorn.conf.rb
49311 0.0 2.4 1263836 96848 ? Ss May15 0:08 postgres: 10/main: checkpointer process
49293 0.0 1.3 1263704 54864 ? S May15 0:11 /usr/lib/postgresql/10/bin/postmaster -D /etc/postgresql/10/main
1226 0.0 1.2 280508 49016 tty7 Ssl+ May15 0:21 /usr/lib/xorg/Xorg -core :0 -seat seat0 -auth /var/run/lightdm/root/:0 -nolisten tcp vt7 -novtswitch
After a restart and a couple of minutes to settle, it’s showing 50%, and the top memory consumers are:
PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
17466 17.2 7.5 913964 304276 ? Sl 16:47 0:09 unicorn worker[1] -E production -c config/unicorn.conf.rb
17494 18.5 7.5 917036 302308 ? Sl 16:47 0:09 unicorn worker[4] -E production -c config/unicorn.conf.rb
17475 17.8 7.4 913964 301368 ? Sl 16:47 0:09 unicorn worker[2] -E production -c config/unicorn.conf.rb
17457 15.7 7.3 909244 297984 ? Sl 16:47 0:08 unicorn worker[0] -E production -c config/unicorn.conf.rb
17522 19.1 7.3 906168 297556 ? Sl 16:47 0:09 unicorn worker[7] -E production -c config/unicorn.conf.rb
17484 16.7 7.3 906168 297244 ? Sl 16:47 0:08 unicorn worker[3] -E production -c config/unicorn.conf.rb
17503 18.6 7.3 899000 294548 ? Sl 16:47 0:09 unicorn worker[5] -E production -c config/unicorn.conf.rb
17512 18.4 7.2 896952 292200 ? Sl 16:47 0:09 unicorn worker[6] -E production -c config/unicorn.conf.rb
17303 13.0 4.8 477436 194544 ? Sl 16:46 0:13 unicorn master -E production -c config/unicorn.conf.rb
17435 0.9 4.5 554280 182640 ? SNl 16:47 0:00 sidekiq 5.2.7 discourse [0 of 5 busy]
17267 0.0 1.4 1263704 57740 ? S 16:46 0:00 /usr/lib/postgresql/10/bin/postmaster -D /etc/postgresql/10/main
1226 0.0 1.2 280508 48464 tty7 Ssl+ May15 0:22 /usr/lib/xorg/Xorg -core :0 -seat seat0 -auth /var/run/lightdm/root/:0 -nolisten tcp vt7 -novtswitch
1447 0.3 1.2 776896 48360 ? Ssl May15 5:57 /usr/bin/dockerd -H fd://
Looks like Sidekiq, some of the unicorn workers, and Postgres.
Let me know if you would like me to collect any other data.
sam
(Sam Saffron)
16 May 2019, 8:58pm
20
You are running too many unicorns; those numbers look right to me, as 300-500MB per worker is in the normal range.
Cut the unicorn count down by 3.
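(A sketch of where the worker count is set on a standard discourse_docker install; the value below is only an example:)

```bash
# Sketch, assuming the standard discourse_docker layout at /var/discourse.
# Lower the worker count in the env: section of containers/app.yml, e.g.:
#   env:
#     UNICORN_WORKERS: 5   # example: the current 8 workers minus 3, per the suggestion above
cd /var/discourse
nano containers/app.yml
./launcher rebuild app   # rebuild so the new worker count takes effect
```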
4 Likes