Sidekiq stops after a while

Hi everyone,

I have an installation with 3 OpenShift deployments: one for Redis (7.0.10), one for Postgres (13.10), and one for Discourse (stable 3.0.3). Everything works fine right after deployment, but after a few hours or days the sidekiq processes (UNICORN_SIDEKIQS=3) stop. One thing I've noticed is that under /shared/log/rails no sidekiq.log is generated, and I think that's why sidekiq can't restart automatically:

root@discourse-b9f766dcf-52zjq:/var/www/discourse# ls -laF /shared/log/rails/
total 32
drwxr-xr-x. 2 nobody www-data  4096 Jun  9 08:57 ./
drwxr-xr-x. 3 root   root      4096 May 30 06:16 ../
-rw-r--r--. 1 nobody www-data 16082 Jun  9 09:28 production.log
-rw-r--r--. 1 nobody www-data  1345 Jun  9 09:02 unicorn.stderr.log
-rw-r--r--. 1 nobody www-data   204 Jun  9 09:02 unicorn.stdout.log

When sidekiq stops, I see the following message at host/logs:

Info:
Sidekiq is consuming too much memory (using: 530.35M) for 'discourse.internal.odencluster.com', restarting

backtrace:
config/unicorn.conf.rb:163:in `check_sidekiq_heartbeat'
config/unicorn.conf.rb:243:in `master_sleep'
unicorn-6.1.0/lib/unicorn/http_server.rb:295:in `join'
unicorn-6.1.0/bin/unicorn:128:in `<top (required)>'
/var/www/discourse/vendor/bundle/ruby/3.2.0/bin/unicorn:25:in `load'
/var/www/discourse/vendor/bundle/ruby/3.2.0/bin/unicorn:25:in `<main>'

Then I see this message in the discourse pod logs:

(48) Reopening logs
(48) Reopening logs
(48) Reopening logs

However, since there is no sidekiq.log under /shared/log/rails/, it never restarts.

My knowledge of Rails is close to zero, which makes this hard to troubleshoot, but I can see that sidekiq is not paused:

[1] pry(main)> Sidekiq.paused?
=> false

And when I start it manually, it works:

2023-06-09T09:47:15.556Z pid=195386 tid=449q INFO: Booting Sidekiq 6.5.8 with Sidekiq::RedisConnection::RedisAdapter options {:host=>"redis", :port=>6379, :namespace=>"sidekiq"}
2023-06-09T09:47:20.528Z pid=195386 tid=449q INFO: Booted Rails 7.0.4.3 application in production environment
2023-06-09T09:47:20.528Z pid=195386 tid=449q INFO: Running in ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux]
2023-06-09T09:47:20.528Z pid=195386 tid=449q INFO: See LICENSE and the LGPL-3.0 for licensing details.
2023-06-09T09:47:20.528Z pid=195386 tid=449q INFO: Upgrade to Sidekiq Pro for more features and support: https://sidekiq.org

A couple of things that I think would help me fix this:

  1. How can I get it to create /shared/log/rails/sidekiq.log?
  2. How can I let sidekiq use more than 530M of memory?

If anyone has any suggestions, please let me know. Thanks in advance for taking the time to help!

Have a great day! :slight_smile:


Hi Denis,

I can give you some information on how to increase Sidekiq's RSS.

For that, take a look at the UNICORN_SIDEKIQ_MAX_RSS environment variable (discourse/config/unicorn.conf.rb at 89d7b1861d1625352e82e82c19f93e7272c965ef · discourse/discourse · GitHub), which will let you allot more memory. In any case, I'd suggest lowering UNICORN_SIDEKIQS to 1 or 2 unless you have a huge backlog of jobs.
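In case it helps, here's a minimal sketch of how those variables could be set on an OpenShift/Kubernetes Deployment; the deployment and container names are placeholders for your own manifests. If I'm reading unicorn.conf.rb right, UNICORN_SIDEKIQ_MAX_RSS is taken in megabytes and defaults to 500, which matches the ~530M in your log:

# Sketch only: env vars on the Discourse container; all names are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: discourse
spec:
  template:
    spec:
      containers:
        - name: discourse
          env:
            - name: UNICORN_SIDEKIQS          # number of sidekiq processes
              value: "1"
            - name: UNICORN_SIDEKIQ_MAX_RSS   # restart threshold, in MB
              value: "1000"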

I don't know what's causing your sidekiq restarts; normally it just restarts in the background after the OOM (as per discourse/config/unicorn.conf.rb at 89d7b1861d1625352e82e82c19f93e7272c965ef · discourse/discourse · GitHub). Check your-forum.com/logs for more information. I hope this helps.

Best regards,
Ismael


Hi @trobiyo, thank you very much for the quick and to-the-point support!

Yes, my sidekiq was restarting due to OOM (out of memory), but I've now followed your advice: I reduced it to UNICORN_SIDEKIQS=1 and allotted more memory for the RSS with the UNICORN_SIDEKIQ_MAX_RSS environment variable.

I hope this helps and keeps sidekiq from restarting.

Do you know why sidekiq doesn't write any log to /shared/log/rails/sidekiq.log?

Thanks again, and all the best! :slight_smile:

Regards,
Denis

Hi,

If I remember correctly, you need to set the DISCOURSE_LOG_SIDEKIQ environment variable to 1; as per discourse/app/jobs/base.rb at 7c768a2ff99f45ab5008c16cf6982652d576a0e2 · discourse/discourse · GitHub, the write_to_log function returns without dumping the log otherwise (see discourse/app/jobs/base.rb at 7c768a2ff99f45ab5008c16cf6982652d576a0e2 · discourse/discourse · GitHub).
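As a sketch, in an OpenShift deployment like yours that would be one more entry in the container's env list (same placeholder manifest as above):

# Sketch only: enable sidekiq job logging so sidekiq.log gets written.
env:
  - name: DISCOURSE_LOG_SIDEKIQ
    value: "1"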

Hope this helps.


Hi,

Yes, DISCOURSE_LOG_SIDEKIQ=1 helped and I can now see /shared/log/rails/sidekiq.log. That's great!

I've also noticed that sidekiq has been up for a while now; since I increased the memory limit and reduced it to a single process, it hasn't restarted due to OOM.

This looks like the solution to my sidekiq problem. I'll keep monitoring it and will update here if I still see sidekiq-related issues.

In the meantime, thank you very much for your help @trobiyo, your support has been great!

All the best!

:slight_smile:


Awesome :clap:, glad it helped solve the issue!

Regards,
Ismael


Hello again @trobiyo,

Unfortunately my sidekiq still stops, so it looks like those changes weren't enough. =/

I see the following error in the logs:

info:
Job exception: FinalDestination: all resolved IPs were disallowed

backtrace:
/var/www/discourse/lib/final_destination/ssrf_detector.rb:104:in `lookup_and_filter_ips'
/var/www/discourse/lib/final_destination/http.rb:13:in `connect'
/usr/local/lib/ruby/3.2.0/net/http.rb:1248:in `do_start'
/usr/local/lib/ruby/3.2.0/net/http.rb:1237:in `start'
/usr/local/lib/ruby/3.2.0/net/http.rb:687:in `start'
/var/www/discourse/lib/final_destination.rb:511:in `safe_session'
/var/www/discourse/lib/final_destination.rb:450:in `safe_get'
/var/www/discourse/lib/final_destination.rb:161:in `get'
/var/www/discourse/lib/retrieve_title.rb:81:in `fetch_title'
/var/www/discourse/lib/retrieve_title.rb:7:in `crawl'
/var/www/discourse/lib/inline_oneboxer.rb:76:in `lookup'
/var/www/discourse/lib/cooked_processor_mixin.rb:310:in `process_inline_onebox'
/var/www/discourse/lib/cooked_processor_mixin.rb:39:in `block in post_process_oneboxes'
/var/www/discourse/lib/oneboxer.rb:213:in `block in apply'
/var/www/discourse/lib/oneboxer.rb:161:in `block in each_onebox_link'
nokogiri-1.14.2-x86_64-linux/lib/nokogiri/xml/node_set.rb:235:in `block in each'
nokogiri-1.14.2-x86_64-linux/lib/nokogiri/xml/node_set.rb:234:in `upto'
nokogiri-1.14.2-x86_64-linux/lib/nokogiri/xml/node_set.rb:234:in `each'
/var/www/discourse/lib/oneboxer.rb:161:in `each_onebox_link'
/var/www/discourse/lib/oneboxer.rb:212:in `apply'
/var/www/discourse/lib/cooked_processor_mixin.rb:9:in `post_process_oneboxes'
/var/www/discourse/lib/cooked_post_processor.rb:41:in `block in post_process'
/var/www/discourse/lib/distributed_mutex.rb:53:in `block in synchronize'
/var/www/discourse/lib/distributed_mutex.rb:49:in `synchronize'
/var/www/discourse/lib/distributed_mutex.rb:49:in `synchronize'
/var/www/discourse/lib/distributed_mutex.rb:34:in `synchronize'
/var/www/discourse/lib/cooked_post_processor.rb:38:in `post_process'
/var/www/discourse/app/jobs/regular/process_post.rb:28:in `block in execute'
/var/www/discourse/lib/distributed_mutex.rb:53:in `block in synchronize'
/var/www/discourse/lib/distributed_mutex.rb:49:in `synchronize'
/var/www/discourse/lib/distributed_mutex.rb:49:in `synchronize'
/var/www/discourse/lib/distributed_mutex.rb:34:in `synchronize'
/var/www/discourse/app/jobs/regular/process_post.rb:8:in `execute'
/var/www/discourse/app/jobs/base.rb:249:in `block (2 levels) in perform'
rails_multisite-4.0.1/lib/rails_multisite/connection_management.rb:80:in `with_connection'
/var/www/discourse/app/jobs/base.rb:236:in `block in perform'
/var/www/discourse/app/jobs/base.rb:232:in `each'
/var/www/discourse/app/jobs/base.rb:232:in `perform'
sidekiq-6.5.8/lib/sidekiq/processor.rb:202:in `execute_job'
sidekiq-6.5.8/lib/sidekiq/processor.rb:170:in `block (2 levels) in process'
sidekiq-6.5.8/lib/sidekiq/middleware/chain.rb:177:in `block in invoke'
/var/www/discourse/lib/sidekiq/pausable.rb:134:in `call'
sidekiq-6.5.8/lib/sidekiq/middleware/chain.rb:179:in `block in invoke'
sidekiq-6.5.8/lib/sidekiq/middleware/chain.rb:182:in `invoke'
sidekiq-6.5.8/lib/sidekiq/processor.rb:169:in `block in process'
sidekiq-6.5.8/lib/sidekiq/processor.rb:136:in `block (6 levels) in dispatch'
sidekiq-6.5.8/lib/sidekiq/job_retry.rb:113:in `local'
sidekiq-6.5.8/lib/sidekiq/processor.rb:135:in `block (5 levels) in dispatch'
sidekiq-6.5.8/lib/sidekiq.rb:44:in `block in <module:Sidekiq>'
sidekiq-6.5.8/lib/sidekiq/processor.rb:131:in `block (4 levels) in dispatch'
sidekiq-6.5.8/lib/sidekiq/processor.rb:263:in `stats'
sidekiq-6.5.8/lib/sidekiq/processor.rb:126:in `block (3 levels) in dispatch'
sidekiq-6.5.8/lib/sidekiq/job_logger.rb:13:in `call'
sidekiq-6.5.8/lib/sidekiq/processor.rb:125:in `block (2 levels) in dispatch'
sidekiq-6.5.8/lib/sidekiq/job_retry.rb:80:in `global'
sidekiq-6.5.8/lib/sidekiq/processor.rb:124:in `block in dispatch'
sidekiq-6.5.8/lib/sidekiq/job_logger.rb:39:in `prepare'
sidekiq-6.5.8/lib/sidekiq/processor.rb:123:in `dispatch'
sidekiq-6.5.8/lib/sidekiq/processor.rb:168:in `process'
sidekiq-6.5.8/lib/sidekiq/processor.rb:78:in `process_one'
sidekiq-6.5.8/lib/sidekiq/processor.rb:68:in `run'
sidekiq-6.5.8/lib/sidekiq/component.rb:8:in `watchdog'
sidekiq-6.5.8/lib/sidekiq/component.rb:17:in `block in safe_thread'

Based on this error, can you think of anything that might be going wrong?

Thanks again for your support.

:slight_smile:

Hi Denis,

To me this looks like a socket/DNS timeout issue? Have you set anything under the "allowed internal hosts" setting?

From the stack trace, this list appears to be empty when the IPs are looked up (see discourse/lib/final_destination/ssrf_detector.rb at main · discourse/discourse · GitHub), triggered at discourse/lib/final_destination/http.rb at tests-passed · discourse/discourse · GitHub, so I'm inclined to think it may be something about your installation (the sidekiq pod can't reach the IPs?).

Also check whether you're using any NetworkPolicy in the cluster, which could be another cause.
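For instance, if a NetworkPolicy restricts egress from the Discourse pod, DNS lookups and outbound HTTP from the sidekiq workers would fail in just this way. A sketch of an egress policy that would rule that out (selectors and names are placeholders for your cluster):

# Sketch only: allow DNS resolution and outbound HTTP(S), which oneboxing needs.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: discourse-allow-egress
spec:
  podSelector:
    matchLabels:
      app: discourse
  policyTypes:
    - Egress
  egress:
    - ports:              # DNS
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    - ports:              # outbound HTTP/HTTPS
        - protocol: TCP
          port: 80
        - protocol: TCP
          port: 443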

Best regards
