Higher idle process activity after upgrade

Hi,
After upgrading Discourse to v2.5.0.beta4-399-gbf8085e436 (from a 2.4.x version, using a command-line bootstrap from discourse_docker git, just before the Postgres 12 upgrade commits), there is noticeably more CPU churn on the server, even when no one is connected.

“top” shows that several “ruby” processes become active quite often; as I understand it, these are the Unicorn Sidekiq workers. Running redis-cli monitor, I can see them making blocking BRPOP requests with a 2 s timeout to check whether there is work to be done, which adds up to about a dozen Redis commands per second. On top of that, statistics are updated every 5 seconds. redis-cli info stats reports an average of 15-20 commands processed per second.
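
To illustrate what I mean by polling: each idle worker seems to sit in a loop like the sketch below (a simplification in plain Ruby with the redis gem, not Sidekiq’s actual code; the queue name is made up). With a 2 s timeout, seven idle workers alone generate a few Redis commands per second.

    require "redis"

    redis = Redis.new

    # Simplified idle loop of one worker: block on the queue for up to 2 s,
    # then loop around and block again. With nothing enqueued this is pure polling.
    loop do
      _queue, payload = redis.brpop("queue:default", timeout: 2)
      next if payload.nil? # timed out, nothing to do
      # ... process the job here ...
    end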

I don’t know whether this activity could account for the 2-4% CPU utilization we see now; with the old version it was below 0.5%. I didn’t look at the same redis-cli stats back then, though.

I would be curious to try out a longer timeout to see if that helps, but I haven’t figured out how to do it.

Hmm, would this be related to your work, @eviltrout?

Which work were you thinking of? I don’t think anything I worked on recently would be likely to cause this.

My bad, I was thinking of ember-cli.

Check forum.example.com/sidekiq - there might be some background reprocessing jobs going on after the upgrade.

Highly unlikely it touched redis. I’ll never say never because I’ve been burned before but I suspect this is something else :slight_smile:

Cool hint. I didn’t see anything alarming there: no jobs currently running, and none with long execution times (only a few over 100 ms). The dashboard says ~1000 jobs processed per day, the same as before the upgrade.

Probably off-topic, but the Processed graph over the last 6 months is also interesting: it seems to toggle between two plateaus (one twice as high as the other), staying on each for weeks at a time. This does not correlate with reboots (and we didn’t update anything in that period either).

Is there an easy way to change that 2 s BRPOP timeout? I didn’t find it in the code. Trying a longer value would at least indicate whether the work-polling loops are the problem.
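
The closest lead I have is the TIMEOUT constant that Sidekiq’s BasicFetch appears to pass to BRPOP. Assuming that is where the 2 s comes from, a quick experiment might be an initializer like the sketch below (untested; the file name is just an example):

    # config/initializers/sidekiq_poll_timeout.rb (example path)
    # Assumes Sidekiq::BasicFetch still defines a TIMEOUT constant used for BRPOP.
    require "sidekiq/fetch"

    Sidekiq::BasicFetch.send(:remove_const, :TIMEOUT)
    Sidekiq::BasicFetch.const_set(:TIMEOUT, 10) # block for 10 s instead of 2 s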

Here is what I sometimes see in top:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                     
   1024 discour+  20   0  333584 202888  21660 S   1.3   2.5   3:35.81 ruby                                                                        
   1156 discour+  20   0  715904 249416  24532 S   1.3   3.1   3:26.50 ruby                                                                        
   1178 discour+  20   0  726664 251468  24564 S   1.3   3.1   3:26.11 ruby                                                                        
   1189 discour+  20   0  714368 247912  24444 S   1.3   3.1   3:24.56 ruby                                                                        
   1200 discour+  20   0  709760 249708  24632 S   1.3   3.1   3:22.96 ruby                                                                        
   1234 discour+  20   0  713344 250288  24636 S   1.3   3.1   3:30.24 ruby                                                                        
   1167 discour+  20   0  712832 247928  24436 S   1.0   3.1   3:24.75 ruby                                                                        
 188658 me        20   0   10424   4240   3576 R   0.7   0.1   0:00.36 top                                                                         
    448 root      20   0 1748444  84900  45884 S   0.3   1.1   6:09.98 dockerd                                                                     

with no one connected…

The roughly 3.5 minutes of accumulated CPU time per process is after 35 hours of uptime, with very little user activity.
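
(For scale: 3 min 35 s over 35 h of uptime is about 215 s / 126,000 s ≈ 0.17% average CPU per worker, or roughly 1.2% summed over the seven workers.)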

I also tried to use rbtrace, but it complains:

    pid is not listening for messages, did you require “rbtrace”

Do I need to rebuild the image to fix this? Or can I just adjust or rebuild something inside the container?
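
From what I can tell, rbtrace can only attach if the target process itself has loaded the gem at boot, which is what that error is hinting at. A minimal sketch of what I imagine is needed, assuming the gem is available in the container’s bundle (the file name and env guard are just illustrative):

    # config/initializers/000-rbtrace.rb (illustrative path)
    # rbtrace only answers the CLI if the running process has loaded the gem,
    # which is what the "not listening for messages" error refers to.
    require "rbtrace" if ENV["ENABLE_RBTRACE"] == "1"

With that in place and the workers restarted, attaching should look something like rbtrace -p <pid> --firehose from inside the container (again, untested on my side).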