Sidekiq stops after some time

Hello all,

I have an installation with three OpenShift deployments: one for Redis (7.0.10), one for PostgreSQL (13.10), and one for Discourse (stable 3.0.3). Everything works fine when first deployed; however, after a couple of hours or days, the Sidekiq processes (UNICORN_SIDEKIQS=3) stop. One thing I noticed is that under /shared/log/rails there is no sidekiq.log being generated, which I believe is related to why Sidekiq does not restart automatically:

root@discourse-b9f766dcf-52zjq:/var/www/discourse# ls -laF /shared/log/rails/
total 32
drwxr-xr-x. 2 nobody www-data  4096 Jun  9 08:57 ./
drwxr-xr-x. 3 root   root      4096 May 30 06:16 ../
-rw-r--r--. 1 nobody www-data 16082 Jun  9 09:28 production.log
-rw-r--r--. 1 nobody www-data  1345 Jun  9 09:02 unicorn.stderr.log
-rw-r--r--. 1 nobody www-data   204 Jun  9 09:02 unicorn.stdout.log

When sidekiq stops I see the following message in the host/logs:

Info:
Sidekiq is consuming too much memory (using: 530.35M) for 'discourse.internal.odencluster.com', restarting

backtrace:
config/unicorn.conf.rb:163:in `check_sidekiq_heartbeat'
config/unicorn.conf.rb:243:in `master_sleep'
unicorn-6.1.0/lib/unicorn/http_server.rb:295:in `join'
unicorn-6.1.0/bin/unicorn:128:in `<top (required)>'
/var/www/discourse/vendor/bundle/ruby/3.2.0/bin/unicorn:25:in `load'
/var/www/discourse/vendor/bundle/ruby/3.2.0/bin/unicorn:25:in `<main>'

Then I can see in the discourse pod log the message:

(48) Reopening logs
(48) Reopening logs
(48) Reopening logs

However, since there is no sidekiq.log under /shared/log/rails/, it does not restart.

My Rails knowledge is close to zero, so I find it difficult to troubleshoot, but I can see that Sidekiq is not paused:

[1] pry(main)> Sidekiq.paused?
=> false

And when I start it manually, it works:

2023-06-09T09:47:15.556Z pid=195386 tid=449q INFO: Booting Sidekiq 6.5.8 with Sidekiq::RedisConnection::RedisAdapter options {:host=>"redis", :port=>6379, :namespace=>"sidekiq"}
2023-06-09T09:47:20.528Z pid=195386 tid=449q INFO: Booted Rails 7.0.4.3 application in production environment
2023-06-09T09:47:20.528Z pid=195386 tid=449q INFO: Running in ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux]
2023-06-09T09:47:20.528Z pid=195386 tid=449q INFO: See LICENSE and the LGPL-3.0 for licensing details.
2023-06-09T09:47:20.528Z pid=195386 tid=449q INFO: Upgrade to Sidekiq Pro for more features and support: https://sidekiq.org

There are a couple of things that I imagine would help me solve this issue:

  1. How do I get it to create /shared/log/rails/sidekiq.log?
  2. How do I allow Sidekiq to use more than 530M of memory?

If anyone has a suggestion, please let me know. Thank you in advance for taking the time to help with this!

Have a great one! :slight_smile:

Hi Denis,

I will focus on how to increase the RSS limit for Sidekiq.

For this, have a look at the UNICORN_SIDEKIQ_MAX_RSS environment variable (FYI: discourse/unicorn.conf.rb at 89d7b1861d1625352e82e82c19f93e7272c965ef · discourse/discourse · GitHub). Setting it to, e.g., UNICORN_SIDEKIQ_MAX_RSS: 1000 will allow Sidekiq to use more memory. I would also recommend decreasing the UNICORN_SIDEKIQS value to 1 or 2, unless you have many jobs queued up for long periods.
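Since you are on OpenShift, here is a minimal sketch of how those two variables could be set on the Discourse container in your Deployment; the container name and values are placeholders, adjust them to your setup:

spec:
  template:
    spec:
      containers:
        - name: discourse                 # placeholder container name
          env:
            - name: UNICORN_SIDEKIQS
              value: "1"                  # run fewer Sidekiq processes
            - name: UNICORN_SIDEKIQ_MAX_RSS
              value: "1000"               # allow ~1000 MB RSS before the master restarts Sidekiq

You could also apply the same with something like oc set env deployment/discourse UNICORN_SIDEKIQS=1 UNICORN_SIDEKIQ_MAX_RSS=1000, assuming your deployment is named discourse.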

I don't know what is causing your Sidekiq reboots, though; normally it just restarts in the background after hitting the memory limit (according to discourse/unicorn.conf.rb at 89d7b1861d1625352e82e82c19f93e7272c965ef · discourse/discourse · GitHub), but check your-forum.com/logs for more info. Hope this helps.

Cheers,
Ismael

Hello @trobiyo, thanks so much for the quick and straightforward support!

Yes, my Sidekiq is restarting due to OOM. I've now followed your advice: I have reduced UNICORN_SIDEKIQS to 1 and allocated more memory to RSS using the UNICORN_SIDEKIQ_MAX_RSS environment variable.

I hope that will help and prevent Sidekiq from restarting.

Do you have any idea why sidekiq is not generating any logs in /shared/log/rails/sidekiq.log?

Thanks once again and have a great one! :slight_smile:

Cheers,
Denis

Hi,

If I’m not mistaken, you have to set the DISCOURSE_LOG_SIDEKIQ environment variable to 1, as per discourse/base.rb at 7c768a2ff99f45ab5008c16cf6982652d576a0e2 · discourse/discourse · GitHub; otherwise the write_to_log function will return without writing the log (see discourse/base.rb at 7c768a2ff99f45ab5008c16cf6982652d576a0e2 · discourse/discourse · GitHub).
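If it helps, a minimal sketch of adding it to the Discourse container on OpenShift (deployment and container names are placeholders):

env:
  - name: DISCOURSE_LOG_SIDEKIQ
    value: "1"                            # Sidekiq then writes /shared/log/rails/sidekiq.log

or, equivalently, something like oc set env deployment/discourse DISCOURSE_LOG_SIDEKIQ=1.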

Hope this helps.

Hello,

Yes, DISCOURSE_LOG_SIDEKIQ=1 helped and I see a /shared/log/rails/sidekiq.log. That is great!

I’ve also seen that Sidekiq has been running for a while and hasn’t restarted due to OOM since I increased the memory allowance and reduced it to a single process.

That seems to be the solution to my Sidekiq problem. I’ll keep monitoring it and will update here if I still see any issues with Sidekiq.

In the meantime, I really appreciate your help @trobiyo, awesome support from your side!

Have a great one!

:slight_smile:

That’s excellent news :clap:, glad this helped solve the issue!

Cheers,
Ismael

Hello again @trobiyo,

Unfortunately my Sidekiq is still stopping; it looks like the changes were not enough. =/

I see in the logs the following error:

info:
Job exception: FinalDestination: all resolved IPs were disallowed

backtrace:
/var/www/discourse/lib/final_destination/ssrf_detector.rb:104:in `lookup_and_filter_ips'
/var/www/discourse/lib/final_destination/http.rb:13:in `connect'
/usr/local/lib/ruby/3.2.0/net/http.rb:1248:in `do_start'
/usr/local/lib/ruby/3.2.0/net/http.rb:1237:in `start'
/usr/local/lib/ruby/3.2.0/net/http.rb:687:in `start'
/var/www/discourse/lib/final_destination.rb:511:in `safe_session'
/var/www/discourse/lib/final_destination.rb:450:in `safe_get'
/var/www/discourse/lib/final_destination.rb:161:in `get'
/var/www/discourse/lib/retrieve_title.rb:81:in `fetch_title'
/var/www/discourse/lib/retrieve_title.rb:7:in `crawl'
/var/www/discourse/lib/inline_oneboxer.rb:76:in `lookup'
/var/www/discourse/lib/cooked_processor_mixin.rb:310:in `process_inline_onebox'
/var/www/discourse/lib/cooked_processor_mixin.rb:39:in `block in post_process_oneboxes'
/var/www/discourse/lib/oneboxer.rb:213:in `block in apply'
/var/www/discourse/lib/oneboxer.rb:161:in `block in each_onebox_link'
nokogiri-1.14.2-x86_64-linux/lib/nokogiri/xml/node_set.rb:235:in `block in each'
nokogiri-1.14.2-x86_64-linux/lib/nokogiri/xml/node_set.rb:234:in `upto'
nokogiri-1.14.2-x86_64-linux/lib/nokogiri/xml/node_set.rb:234:in `each'
/var/www/discourse/lib/oneboxer.rb:161:in `each_onebox_link'
/var/www/discourse/lib/oneboxer.rb:212:in `apply'
/var/www/discourse/lib/cooked_processor_mixin.rb:9:in `post_process_oneboxes'
/var/www/discourse/lib/cooked_post_processor.rb:41:in `block in post_process'
/var/www/discourse/lib/distributed_mutex.rb:53:in `block in synchronize'
/var/www/discourse/lib/distributed_mutex.rb:49:in `synchronize'
/var/www/discourse/lib/distributed_mutex.rb:49:in `synchronize'
/var/www/discourse/lib/distributed_mutex.rb:34:in `synchronize'
/var/www/discourse/lib/cooked_post_processor.rb:38:in `post_process'
/var/www/discourse/app/jobs/regular/process_post.rb:28:in `block in execute'
/var/www/discourse/lib/distributed_mutex.rb:53:in `block in synchronize'
/var/www/discourse/lib/distributed_mutex.rb:49:in `synchronize'
/var/www/discourse/lib/distributed_mutex.rb:49:in `synchronize'
/var/www/discourse/lib/distributed_mutex.rb:34:in `synchronize'
/var/www/discourse/app/jobs/regular/process_post.rb:8:in `execute'
/var/www/discourse/app/jobs/base.rb:249:in `block (2 levels) in perform'
rails_multisite-4.0.1/lib/rails_multisite/connection_management.rb:80:in `with_connection'
/var/www/discourse/app/jobs/base.rb:236:in `block in perform'
/var/www/discourse/app/jobs/base.rb:232:in `each'
/var/www/discourse/app/jobs/base.rb:232:in `perform'
sidekiq-6.5.8/lib/sidekiq/processor.rb:202:in `execute_job'
sidekiq-6.5.8/lib/sidekiq/processor.rb:170:in `block (2 levels) in process'
sidekiq-6.5.8/lib/sidekiq/middleware/chain.rb:177:in `block in invoke'
/var/www/discourse/lib/sidekiq/pausable.rb:134:in `call'
sidekiq-6.5.8/lib/sidekiq/middleware/chain.rb:179:in `block in invoke'
sidekiq-6.5.8/lib/sidekiq/middleware/chain.rb:182:in `invoke'
sidekiq-6.5.8/lib/sidekiq/processor.rb:169:in `block in process'
sidekiq-6.5.8/lib/sidekiq/processor.rb:136:in `block (6 levels) in dispatch'
sidekiq-6.5.8/lib/sidekiq/job_retry.rb:113:in `local'
sidekiq-6.5.8/lib/sidekiq/processor.rb:135:in `block (5 levels) in dispatch'
sidekiq-6.5.8/lib/sidekiq.rb:44:in `block in <module:Sidekiq>'
sidekiq-6.5.8/lib/sidekiq/processor.rb:131:in `block (4 levels) in dispatch'
sidekiq-6.5.8/lib/sidekiq/processor.rb:263:in `stats'
sidekiq-6.5.8/lib/sidekiq/processor.rb:126:in `block (3 levels) in dispatch'
sidekiq-6.5.8/lib/sidekiq/job_logger.rb:13:in `call'
sidekiq-6.5.8/lib/sidekiq/processor.rb:125:in `block (2 levels) in dispatch'
sidekiq-6.5.8/lib/sidekiq/job_retry.rb:80:in `global'
sidekiq-6.5.8/lib/sidekiq/processor.rb:124:in `block in dispatch'
sidekiq-6.5.8/lib/sidekiq/job_logger.rb:39:in `prepare'
sidekiq-6.5.8/lib/sidekiq/processor.rb:123:in `dispatch'
sidekiq-6.5.8/lib/sidekiq/processor.rb:168:in `process'
sidekiq-6.5.8/lib/sidekiq/processor.rb:78:in `process_one'
sidekiq-6.5.8/lib/sidekiq/processor.rb:68:in `run'
sidekiq-6.5.8/lib/sidekiq/component.rb:8:in `watchdog'
sidekiq-6.5.8/lib/sidekiq/component.rb:17:in `block in safe_thread'

Do you have any idea what is going wrong based on this?

Thanks once again for your support.

:slight_smile:

Hi Denis,

This looks to me like a sockets/DNS issue that is timing out. Do you have anything set under the allowed internal hosts setting?

Looking at the stack trace, when the IPs are looked up the resulting list seems to be empty (see discourse/lib/final_destination/ssrf_detector.rb at main · discourse/discourse · GitHub, triggered from discourse/lib/final_destination/http.rb at tests-passed · discourse/discourse · GitHub), so I’m tempted to say this could be related to your installation (perhaps the Sidekiq pods cannot resolve or reach the IPs they need?).

Also check whether you have any NetworkPolicy installed in your cluster; that could be another cause.
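If a NetworkPolicy is in place, here is a minimal sketch of the egress the Discourse pods would need for these onebox/title fetches; the namespace and pod label are assumptions, and you would merge this with whatever rules already cover Redis and PostgreSQL:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: discourse-allow-onebox-egress     # placeholder name
  namespace: discourse                    # placeholder namespace
spec:
  podSelector:
    matchLabels:
      app: discourse                      # placeholder label for the Discourse pods
  policyTypes:
    - Egress
  egress:
    - ports:                              # DNS, needed before FinalDestination can resolve and filter IPs
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    - ports:                              # outbound HTTP/HTTPS for oneboxes and title retrieval
        - protocol: TCP
          port: 80
        - protocol: TCP
          port: 443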

Cheers

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.