"Ensure sidekiq is running." when it is definitely running

In the past few weeks, the Admin page has shown:

Visiting the …/sidekiq page shows that it is definitely running and processing jobs. I just tried rebuilding the app, but it remains in a frowny-face status.

Any idea what’s going on here or why the admin page isn’t seeing Sidekiq?

Is this a dev instance, or is it running in Docker? I’ve come across that in my dev instance, where it always shows that message, but my prod Docker instance correctly reports Sidekiq as running.

It’s Docker. The last time this happened, I rebuilt Discourse and it started to find Sidekiq again. However, this time it won’t.

Oops. Wrong link. I meant this problem:

Seems unlikely, unless you think it is running an old version of Discourse again. Sidekiq, I think, is baked into the Docker image, so it should be running within the same container that runs Discourse.

However, we’re reaching my level of expertise on this…

Hmm… I’m not sure why you’re seeing that. Can you open the rails console and get the output of DiscourseUpdates.check_version.as_json for me?

root@dfp-neil:~$ cd /var/discourse/
root@dfp-neil:/var/discourse$ ./launcher enter app
root@dfp-neil-app:/$ rails c
[1] pry(main)> DiscourseUpdates.check_version.as_json
=> {"latest_version"=>"1.5.0.beta1", "critical_updates"=>false, "installed_version"=>"1.5.0.beta1", "installed_sha"=>"ecd93f7efb98c41e79077d025c2215c98f1c912d", "installed_describe"=>"v1.5.0.beta1 +121\n", "missing_versions_count"=>0, "updated_at"=>"2015-09-29T15:40:47.449Z"}
[2] pry(main)>

That command will show the result of the last check for updates.

Here’s what it shows:

root@moxie-app:/# rails c
[1] pry(main)> DiscourseUpdates.check_version.as_json
=> {"latest_version"=>"1.5.0.beta1",
 "critical_updates"=>false,
 "installed_version"=>"1.5.0.beta1",
 "installed_sha"=>"0f7aaf5ab19f593416a0012f223e2f91e6cb0329",
 "installed_describe"=>"v1.5.0.beta1 +128\n",
 "missing_versions_count"=>0,
 "updated_at"=>"2015-09-25T00:41:29.854Z"}
[2] pry(main)> 

¯\_(ツ)_/¯

Ummm… What does the timestamp at the bottom of your dashboard say? Dashboard last updated: September 30, 2015 11:33 AM

Dashboard last updated: September 30, 2015 11:19 AM

Any debugging thoughts here?

Could it be that a failed backup is killing the Sidekiq heartbeat test? I don’t even really know what the Sidekiq heartbeat test is. Anyhow, I’m seeing this now:

For the second error, I see:

/var/www/discourse/lib/scheduler/schedule_info.rb:79:in `schedule!'
/var/www/discourse/lib/scheduler/manager.rb:221:in `schedule_next_job'
/var/www/discourse/lib/scheduler/manager.rb:199:in `block in tick'
/var/www/discourse/lib/scheduler/manager.rb:246:in `block in lock'
/var/www/discourse/lib/distributed_mutex.rb:21:in `synchronize'
/var/www/discourse/lib/scheduler/manager.rb:245:in `lock'
/var/www/discourse/lib/scheduler/manager.rb:198:in `tick'
/var/www/discourse/config/initializers/sidekiq.rb:35:in `block (2 levels) in <top (required)>'

For the heartbeat warning, I see:

/var/www/discourse/vendor/bundle/ruby/2.0.0/gems/logster-1.0.0.3.pre/lib/logster/logger.rb:74:in `add_with_opts'
/var/www/discourse/vendor/bundle/ruby/2.0.0/gems/logster-1.0.0.3.pre/lib/logster/logger.rb:35:in `add'
/usr/local/lib/ruby/2.0.0/logger.rb:445:in `warn'
config/unicorn.conf.rb:129:in `check_sidekiq_heartbeat'
config/unicorn.conf.rb:146:in `master_sleep'
/var/www/discourse/vendor/bundle/ruby/2.0.0/gems/unicorn-4.9.0/lib/unicorn/http_server.rb:295:in `join'
/var/www/discourse/vendor/bundle/ruby/2.0.0/gems/unicorn-4.9.0/bin/unicorn:126:in `<top (required)>'
/var/www/discourse/vendor/bundle/ruby/2.0.0/bin/unicorn:23:in `load'
/var/www/discourse/vendor/bundle/ruby/2.0.0/bin/unicorn:23:in `<main>'

Sidekiq shows this on the statistics tab. The orange line was the Jobs::RunHeartbeat job, which no longer seems to be running.

Everything seems to be working normally except for failed backups.
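(From skimming config/unicorn.conf.rb, which the heartbeat warning’s backtrace points at, the test looks to be roughly this: Sidekiq periodically runs a Jobs::RunHeartbeat job that just records a timestamp in Redis, and the unicorn master warns if that timestamp gets too stale. A rough sketch of the idea, with a made-up key name, not the actual Discourse code:)

# Rough sketch of the heartbeat idea; "heartbeat_last_run" is a made-up key name.
# Sidekiq side: the recurring job records "I'm alive" in Redis.
def run_heartbeat(redis)
  redis.set("heartbeat_last_run", Time.now.to_i)
end

# Unicorn-master side: warn if the last beat is too old.
def check_sidekiq_heartbeat(redis, max_age = 180)
  age = Time.now.to_i - redis.get("heartbeat_last_run").to_i
  warn "Sidekiq heartbeat test failed (last beat #{age}s ago)" if age > max_age
end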

Related to this?

Are you running out of disk space? What does df -m report on the server for disk usage?

You can check to see how many backups you have and disk space at the bottom left of the /admin dashboard, e.g.

Here’s what we get with df -m (note that the final entry is an external hard drive we use for backing up other data; it’s irrelevant to Discourse):

clayh@moxie:/var/discourse$ df -m
Filesystem     1M-blocks    Used Available Use% Mounted on
/dev/sda1          60004    3919     53015   7% /
none                   1       0         1   0% /sys/fs/cgroup
udev               16041       1     16041   1% /dev
tmpfs               3211       2      3209   1% /run
none                   5       0         5   0% /run/lock
none               16051       1     16050   1% /run/shm
none                 100       1       100   1% /run/user
/dev/sda5          59949    5507     51374  10% /usr
/dev/sda6          59950      65     56817   1% /tmp
/dev/sdb1         328612   68395    243503  22% /home
/dev/sdb5         610032  214308    364714  38% /var
/dev/sdc1        3755441 2706304    858349  76% /media/clayh/a876d7df-5839-4591-a7e0-61a42f03033c

As for disk space and backups (which work fine when done manually):

By the way, when I manually click the backup button from the admin panel, Discourse backs up and the backup file is uploaded to Amazon, as expected. That whole process seems fine. It appears as if Sidekiq is ‘sick’ and failing. While the errors make it look related to the backup process, the backup itself seems fine. Perhaps the message we’re sending to Sidekiq to tell it to perform the backup is corrupt?
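One way to check that theory from the rails console would be to list what Sidekiq actually has enqueued, scheduled, and retrying, and see whether a stale or malformed backup job stands out. A sketch using Sidekiq’s public API:

# Sketch: dump Sidekiq's queued, scheduled, and retrying jobs so a stale or
# malformed backup entry would stand out. Run from the rails console.
require 'sidekiq/api'

Sidekiq::Queue.all.each do |q|
  q.each { |job| puts "queued #{q.name}: #{job.klass} #{job.args.inspect}" }
end

Sidekiq::ScheduledSet.new.each { |job| puts "scheduled: #{job.klass} at #{job.at}" }
Sidekiq::RetrySet.new.each { |job| puts "retrying: #{job.klass} (#{job['error_message']})" }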

I also want to add that this error is showing up in the logs once per second:

Job exception: undefined method `every' for Jobs::CreateBackup:Class

/var/www/discourse/lib/scheduler/schedule_info.rb:79:in `schedule!'
/var/www/discourse/lib/scheduler/manager.rb:221:in `schedule_next_job'
/var/www/discourse/lib/scheduler/manager.rb:199:in `block in tick'
/var/www/discourse/lib/scheduler/manager.rb:246:in `block in lock'
/var/www/discourse/lib/distributed_mutex.rb:21:in `synchronize'
/var/www/discourse/lib/scheduler/manager.rb:245:in `lock'
/var/www/discourse/lib/scheduler/manager.rb:198:in `tick'
/var/www/discourse/config/initializers/sidekiq.rb:35:in `block (2 levels) in <top (required)>'
hostname	        moxie-app
process_id	        6732
application_version	2c9058ab00e191b729173e7a10cb2d54f7df29ed
current_db	        default
current_hostname	moxie.rtp.rti.org
message	                While ticking scheduling manager
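For context, the scheduler works out a job’s next run by asking the job class for its interval, so a recurring job is declared roughly like the sketch below (illustrative only, with a made-up class name, not the actual Discourse source). The “undefined method `every'” error suggests the scheduler found a leftover entry for Jobs::CreateBackup in Redis even though the class in the running code doesn’t declare an interval that way.

# Illustrative sketch of the shape of recurring job the scheduler expects;
# Jobs::SomeRecurringJob is a made-up name, not real Discourse code.
module Jobs
  class SomeRecurringJob < Jobs::Scheduled
    every 1.day   # class-level interval the scheduler reads back when rescheduling

    def execute(args)
      # periodic work goes here
    end
  end
end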

@neil

It turns out that this also seems to be leading to Discourse not checking Gmail for “reply by email” posts. It’s definitely sending out emails, though…

Is there a way I can force Discourse to check for reply-by-email messages? Or test to see where/how that is failing? I’m very eager to fix this Sidekiq problem, but not sure what to do… backup Discourse, totally wipe it clean, and start fresh by restoring the backup?
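Is something like this from the rails console the right way to run the mail check by hand? A sketch, assuming the incoming-mail check is still the Jobs::PollMailbox job, run inline since the scheduler is stuck:

# Sketch: run the mail-polling job inline from the rails console, bypassing
# the stuck scheduler. Assumes the job is named Jobs::PollMailbox.
Jobs::PollMailbox.new.execute({})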

Any other troubleshooting suggestions? We’ve worked very hard to get buy-in from internal people who aren’t used to communicating in this manner, and it’s backfiring now that their email replies aren’t being posted as they expected.

p.s. Clearly my understanding of Sidekiq is poor… now that I look at the Scheduler tab, I can see that none of the recurring jobs are running, probably because Jobs::RunHeartbeat is not running. I get a "Forbidden" message when I try to trigger them manually.

I’m thinking that this is related. I tried the Sidekiq fix, but so far nothing is happening.

Something is broken about your install but it is unclear what it is.

I’ve seen internal DNS failures (where, internally, we can’t resolve, say, google.com) cause all kinds of bizarre problems in the past. Have you checked that?

Did you check discourse.example.com/logs in the browser to see if there were any unusual errors?
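A quick way to confirm from inside the container that the app process itself can resolve names (a sketch, run from the rails console):

# Sketch: check DNS resolution from the Rails process inside the container.
require 'resolv'
puts Resolv.getaddress('google.com')  # raises Resolv::ResolvError if resolution fails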

Yeah, I checked that. I can ping from inside the container to any address.

The logs are full of errors about the backup scheduler failing - that seems to be choking the Jobs::RunHeartbeat task, causing all of the recurring tasks to fail.

I just turned off backups in the settings and rebuilt the app, but the same problem is happening. I also tried the trick that @sam mentioned here, but it doesn’t seem to have helped.

@codinghorror @neil @sam

I fixed it. Here’s what I had to do:

cd /var/discourse
./launcher ssh app
rails c
Sidekiq.redis { |r| puts r.flushall }

Then I had to exit the container and run ./launcher restart app.

That cleared the Sidekiq Redis queue and everything started back up.

If I had to guess, I upgraded Discourse when there was a backup job in the queue in Redis and somehow, upon restarting the app, the name of a class or something associated with Redis, Sidekiq, or backing up had changed, so the item remaining in the Redis queue was invalid and triggering an error in the logs. That’s total speculation.
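For anyone hitting this later: flushall wipes everything in that Redis (Discourse’s cache, rate limits, and so on), not just Sidekiq’s data, so a narrower cleanup might be worth trying first. A sketch of the same idea with a smaller hammer, using Sidekiq’s public API from the rails console:

# Sketch: clear only Sidekiq's own queues and sets instead of flushing all
# of Redis, which Discourse also uses for its cache and rate limiting.
require 'sidekiq/api'

Sidekiq::Queue.all.each(&:clear)
Sidekiq::ScheduledSet.new.clear
Sidekiq::RetrySet.new.clear
Sidekiq::DeadSet.new.clear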


Aha, so rebooting would have also worked, essentially clearing Redis. Great detective work!