Hi,
After upgrading Discourse to v2.5.0.beta4-399-gbf8085e436 (from 2.4 something, using command line bootstrap from discourse_docker git, just before the Postgres 12 upgrade commits), there is notably more CPU churn on the server (even if no one is connected).
“top” tells that several “ruby” processes are active quite often. I have understood these are the unicorn sidekiq workers. Running redis-cli monitor I see that they make blocking BRPOP requests with a timeout of 2 s, to see if there is work to be done. So this amounts to a dozen of redis commands per second. Also there are statistics being updated every 5 seconds. redis-cli info stats tells there are in average 15-20 commands processed per second.
I don’t know if this activity can build up to the 2-4% CPU utilization that we see. With the old version it was less than 0.5%. I didn’t look at the same redis-cli stats back then.
I would be curious to try out a longer timeout to see if that helps, but I haven’t figured out how to do it.
Cool hint. I didn’t see anything alarming there, no jobs running and none with long execution (only a few over 100ms). Dashboard says ~1000 processed per day, same as before the upgrade.
Probably off-topic, but interesting also the Processed graph over 6 months, seems like there are two plateaus (one twice as high as the other) that it toggles between with weeks between. It does not correlate with reboots (and we didn’t update this stuff in that time either).
Is there an easy way to change that 2s BRPOP timeout? I didn’t find it in the code. It would at least indicate whether the work polling loops are the problem.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1024 discour+ 20 0 333584 202888 21660 S 1.3 2.5 3:35.81 ruby
1156 discour+ 20 0 715904 249416 24532 S 1.3 3.1 3:26.50 ruby
1178 discour+ 20 0 726664 251468 24564 S 1.3 3.1 3:26.11 ruby
1189 discour+ 20 0 714368 247912 24444 S 1.3 3.1 3:24.56 ruby
1200 discour+ 20 0 709760 249708 24632 S 1.3 3.1 3:22.96 ruby
1234 discour+ 20 0 713344 250288 24636 S 1.3 3.1 3:30.24 ruby
1167 discour+ 20 0 712832 247928 24436 S 1.0 3.1 3:24.75 ruby
188658 me 20 0 10424 4240 3576 R 0.7 0.1 0:00.36 top
448 root 20 0 1748444 84900 45884 S 0.3 1.1 6:09.98 dockerd
with no one connected…
The 3.5 minutes accumulated CPU time each is after 35 hours up, with very little user activity.
I also tried to use rbtrace, but it complains: (pid is not listening for messages, did you require “rbtrace”)
Do I need to rebuild the image to fix this? Or can I just adjust or rebuild something inside the container?