Memory creep in the last couple of updates

For the last few weeks I’ve noticed that the system memory usage creeps up every day until it maxes out.

Historically the memory usage has been about 50-55% (on a 3 GB system). Now, after an update, it starts out at 50% but over the next few days it slowly creeps up to 85% and then starts eating into swap.

Is there a way to find out what in Discourse is taking up the memory? The task manager only shows Ruby slowly increasing the amount of memory it’s consuming. Each Ruby process seems to be using about 350 MB and growing (it starts at under 200 MB after an update).

I just updated to v2.3.0.beta9 +392 two days ago; it’s already gone from 50% to 75% and doesn’t seem to be stabilizing.

3 Likes

Try updating again. We noticed the same issue and applied a fix a few hours ago (commit 1, commit 2).
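For anyone following along: on a standard Docker-based install, pulling the latest fixes from the command line usually looks like this (assuming the default /var/discourse install directory):

cd /var/discourse
git pull
./launcher rebuild app

Upgrading from the /admin/upgrade page should also pick up these commits, since they land on the tests-passed branch.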

11 Likes

Okay, updated and it’s restarted at 47%; I’ll keep an eye on it. Thanks for the quick response.

3 Likes

It’s already crept back up to 61%, now 64%, and the Ruby processes are all in the range of 310-340 MB. I’ll watch it for a day and report back.

Not sure if it’s related, but for the past week or so I’ve been seeing this in the logs every night around 1 am:

Sidekiq is consuming too much memory (using: 502.99M)

3 Likes

You could try enabling the Sidekiq logs and then looking for which job is causing the problem. Some information on those logs can be found in this commit message:

https://github.com/discourse/discourse/commit/8963f1af30cd72627e35ce04b317b775435c6f22
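As a rough sketch of how that might be wired up, assuming the two variables referenced in that commit are DISCOURSE_LOG_SIDEKIQ and DISCOURSE_LOG_SIDEKIQ_INTERVAL (check the commit message for the exact names and values), they would go in the env section of /var/discourse/containers/app.yml, followed by a rebuild:

env:
  # ...existing entries stay as they are...
  DISCOURSE_LOG_SIDEKIQ: 1
  DISCOURSE_LOG_SIDEKIQ_INTERVAL: 30

cd /var/discourse
./launcher rebuild app

The resulting per-job log should then show up alongside the other Rails logs (e.g. under shared/standalone/log/rails/).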

7 Likes

The memory utilization is back up to 73% and doesn’t seem to be slowing down. It’s now beginning to take up swap space.

I’m not sure how to do this and would need some guidance. I had a look at the commit and it talks about setting two environment variables. How do I do this? I’m not familiar with Ruby/Docker and don’t want to mess anything up, as this is a live site.

Is there anything else I can look at to see why the memory utilization is creeping up?

I’m also seeing a new error in the logs after the update (two occurrences since yesterday):

Job exception: post_revision_id

Did you do a rebuild? Are you on the default tests-passed branch?

3 Likes

Yes, and yes I assume; I’m using the default setup. (Is there a way to select a different branch?)

There is, but that’s the right branch to be on for getting any fixes.
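For reference, the branch (or tag) is controlled by the version setting in /var/discourse/containers/app.yml; the default template ships with it commented out, which means tests-passed. A hypothetical example of pinning it explicitly, applied with a rebuild:

params:
  ## Which Git revision should this container use? (default: tests-passed)
  version: tests-passed

cd /var/discourse
./launcher rebuild app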

3 Likes

@sam, is this commit related to this issue? If so, is it stable enough to update to?

https://github.com/discourse/discourse/commit/76173dea87115589414a200ad7fc1f08a4deb280

2 Likes

The issue itself was fixed days ago; it is stable enough to upgrade.

4 Likes

Okay, updated. I’ll keep an eye on it; hopefully this will fix it.

I didn’t quite get what you meant by the issue being fixed days ago. As of this evening the memory consumption is still creeping up.

2 Likes

Does this fix require a rebuild, or can I just upgrade via the UI?

Via the UI should be fine.

2 Likes

Okay, so I did an update and rebuild last night. The memory usage is back up to 71% and still growing. The only way to reduce it is to restart Discourse, at which point it drops back down to under 50% and then starts working its way up again. The CPU utilization is about 1% on average.

What process is growing? Sidekiq? Unicorn worker? Redis? PG?

3 Likes

That’s a good question, and exactly what I asked earlier: how do I find out what’s taking up memory within Discourse? I can only see the task manager, which shows Ruby taking up more memory over time (all of the Ruby processes are growing in memory consumption).

As root, run ps aux and repeat it every few hours.
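For example, a minimal way to grab a snapshot of the top memory consumers, sorted by %MEM (memwatch.log is just an arbitrary file name for this sketch):

ps aux --sort=-%mem | head -n 15

# or, to collect one snapshot per hour in the background:
while true; do date; ps aux --sort=-%mem | head -n 15; sleep 3600; done >> /root/memwatch.log &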

3 Likes

Okay, when memory usage was at 71%, the top 14 consumers (by %MEM) were:

PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
49458  0.3  8.0 938568 326016 ?       Sl   May15   4:16 unicorn worker[2] -E production -c config/unicorn.conf.rb
49418  0.6  8.0 1041604 324192 ?      SNl  May15   7:19 sidekiq 5.2.7 discourse [0 of 5 busy]
49448  0.3  7.9 938056 321148 ?       Sl   May15   4:22 unicorn worker[1] -E production -c config/unicorn.conf.rb
49504  0.3  7.9 943692 319948 ?       Sl   May15   4:16 unicorn worker[7] -E production -c config/unicorn.conf.rb
49495  0.3  7.9 928328 319480 ?       Sl   May15   4:21 unicorn worker[6] -E production -c config/unicorn.conf.rb
49476  0.3  7.9 933448 318464 ?       Sl   May15   4:20 unicorn worker[4] -E production -c config/unicorn.conf.rb
49486  0.3  7.8 946768 315236 ?       Sl   May15   4:07 unicorn worker[5] -E production -c config/unicorn.conf.rb
49467  0.3  7.8 928840 315108 ?       Sl   May15   4:05 unicorn worker[3] -E production -c config/unicorn.conf.rb
49439  0.3  7.7 928328 313640 ?       Sl   May15   4:14 unicorn worker[0] -E production -c config/unicorn.conf.rb
49317  0.1  4.8 485628 196588 ?       Sl   May15   2:03 unicorn master -E production -c config/unicorn.conf.rb
49311  0.0  2.4 1263836 96848 ?       Ss   May15   0:08 postgres: 10/main: checkpointer process   
49293  0.0  1.3 1263704 54864 ?       S    May15   0:11 /usr/lib/postgresql/10/bin/postmaster -D /etc/postgresql/10/main
1226  0.0  1.2 280508 49016 tty7     Ssl+ May15   0:21 /usr/lib/xorg/Xorg -core :0 -seat seat0 -auth /var/run/lightdm/root/:0 -nolisten tcp vt7 -novtswitch

After a restart and a couple of minutes’ grace, it’s showing 50%, and the top memory consumers are:

PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
17466 17.2  7.5 913964 304276 ?       Sl   16:47   0:09 unicorn worker[1] -E production -c config/unicorn.conf.rb
17494 18.5  7.5 917036 302308 ?       Sl   16:47   0:09 unicorn worker[4] -E production -c config/unicorn.conf.rb
17475 17.8  7.4 913964 301368 ?       Sl   16:47   0:09 unicorn worker[2] -E production -c config/unicorn.conf.rb
17457 15.7  7.3 909244 297984 ?       Sl   16:47   0:08 unicorn worker[0] -E production -c config/unicorn.conf.rb
17522 19.1  7.3 906168 297556 ?       Sl   16:47   0:09 unicorn worker[7] -E production -c config/unicorn.conf.rb
17484 16.7  7.3 906168 297244 ?       Sl   16:47   0:08 unicorn worker[3] -E production -c config/unicorn.conf.rb
17503 18.6  7.3 899000 294548 ?       Sl   16:47   0:09 unicorn worker[5] -E production -c config/unicorn.conf.rb
17512 18.4  7.2 896952 292200 ?       Sl   16:47   0:09 unicorn worker[6] -E production -c config/unicorn.conf.rb
17303 13.0  4.8 477436 194544 ?       Sl   16:46   0:13 unicorn master -E production -c config/unicorn.conf.rb
17435  0.9  4.5 554280 182640 ?       SNl  16:47   0:00 sidekiq 5.2.7 discourse [0 of 5 busy]
17267  0.0  1.4 1263704 57740 ?       S    16:46   0:00 /usr/lib/postgresql/10/bin/postmaster -D /etc/postgresql/10/main
1226  0.0  1.2 280508 48464 tty7     Ssl+ May15   0:22 /usr/lib/xorg/Xorg -core :0 -seat seat0 -auth /var/run/lightdm/root/:0 -nolisten tcp vt7 -novtswitch
1447  0.3  1.2 776896 48360 ?        Ssl  May15   5:57 /usr/bin/dockerd -H fd://

Looks like Sidekiq, some of the Unicorn workers, and Postgres.

Let me know if you would like me to collect any other data.

You are running too many Unicorn workers; those numbers look right to me, 300-500 MB per worker is in the normal range.

Cut the Unicorn worker count down by 3.
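If it helps, the Unicorn worker count is normally set via UNICORN_WORKERS in the env section of /var/discourse/containers/app.yml. The ps output above shows 8 workers, so cutting it down by 3 would look roughly like this, followed by a rebuild to apply:

env:
  # ...existing entries stay as they are...
  UNICORN_WORKERS: 5   # was 8; each worker sits at roughly 300-500 MB

cd /var/discourse
./launcher rebuild app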

4 Likes