"Ensure sidekiq is running." when it is definitely running

In the past few weeks, the Admin page has shown:

Visiting the …/sidekiq page shows that it is definitely running and processing jobs. I just tried rebuilding the app, but it remains in a frowny-face status.

Any idea what’s going on here or why the admin page isn’t seeing Sidekiq?

Is this a dev instance, or is it running in Docker? I’ve come across that in my dev instance, where it always shows that message, but my prod Docker instance correctly reports Sidekiq as running.

It’s Docker. The last time this happened, I rebuilt Discourse and it started to find Sidekiq again. However, this time it won’t.

Oops. Wrong link. I meant this problem:

Seems unlikely, unless you think it is running an old version of Discourse again. Sidekiq, I think, is baked into the Docker image, so it should be running within the same container that runs Discourse.

However, we’re reaching my level of expertise on this…

Hmm… I’m not sure why you’re seeing that. Can you open the rails console and get the output of DiscourseUpdates.check_version.as_json for me?

root@dfp-neil:~$ cd /var/discourse/
root@dfp-neil:/var/discourse$ ./launcher enter app
root@dfp-neil-app:/$ rails c
[1] pry(main)> DiscourseUpdates.check_version.as_json
=> {"latest_version"=>"1.5.0.beta1", "critical_updates"=>false, "installed_version"=>"1.5.0.beta1", "installed_sha"=>"ecd93f7efb98c41e79077d025c2215c98f1c912d", "installed_describe"=>"v1.5.0.beta1 +121\n", "missing_versions_count"=>0, "updated_at"=>"2015-09-29T15:40:47.449Z"}
[2] pry(main)>

That command will show the result of the last check for updates.

Here’s what it shows:

root@moxie-app:/# rails c
[1] pry(main)> DiscourseUpdates.check_version.as_json
=> {"latest_version"=>"1.5.0.beta1",
 "critical_updates"=>false,
 "installed_version"=>"1.5.0.beta1",
 "installed_sha"=>"0f7aaf5ab19f593416a0012f223e2f91e6cb0329",
 "installed_describe"=>"v1.5.0.beta1 +128\n",
 "missing_versions_count"=>0,
 "updated_at"=>"2015-09-25T00:41:29.854Z"}
[2] pry(main)> 

¯\_(ツ)_/¯

Ummm… What does the timestamp at the bottom of your dashboard say? Dashboard last updated: September 30, 2015 11:33 AM

Dashboard last updated: September 30, 2015 11:19 AM

Any debugging thoughts here?

Could it be that a failed backup is killing the Sidekiq heartbeat test? I don’t even really know what the Sidekiq heartbeat test is. Anyhow, I’m seeing this now:

For the second error, I see:

/var/www/discourse/lib/scheduler/schedule_info.rb:79:in `schedule!'
/var/www/discourse/lib/scheduler/manager.rb:221:in `schedule_next_job'
/var/www/discourse/lib/scheduler/manager.rb:199:in `block in tick'
/var/www/discourse/lib/scheduler/manager.rb:246:in `block in lock'
/var/www/discourse/lib/distributed_mutex.rb:21:in `synchronize'
/var/www/discourse/lib/scheduler/manager.rb:245:in `lock'
/var/www/discourse/lib/scheduler/manager.rb:198:in `tick'
/var/www/discourse/config/initializers/sidekiq.rb:35:in `block (2 levels) in <top (required)>'

For the heartbeat warning, I see:

/var/www/discourse/vendor/bundle/ruby/2.0.0/gems/logster-1.0.0.3.pre/lib/logster/logger.rb:74:in `add_with_opts'
/var/www/discourse/vendor/bundle/ruby/2.0.0/gems/logster-1.0.0.3.pre/lib/logster/logger.rb:35:in `add'
/usr/local/lib/ruby/2.0.0/logger.rb:445:in `warn'
config/unicorn.conf.rb:129:in `check_sidekiq_heartbeat'
config/unicorn.conf.rb:146:in `master_sleep'
/var/www/discourse/vendor/bundle/ruby/2.0.0/gems/unicorn-4.9.0/lib/unicorn/http_server.rb:295:in `join'
/var/www/discourse/vendor/bundle/ruby/2.0.0/gems/unicorn-4.9.0/bin/unicorn:126:in `<top (required)>'
/var/www/discourse/vendor/bundle/ruby/2.0.0/bin/unicorn:23:in `load'
/var/www/discourse/vendor/bundle/ruby/2.0.0/bin/unicorn:23:in `<main>'

Sidekiq shows this on the statistics tab. The orange line was the Jobs::RunHeartbeat job, which no longer seems to be running.

Everything seems to be working normally except for failed backups.
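(From skimming config/unicorn.conf.rb, which the heartbeat warning’s backtrace points at, the test looks to be roughly this: Sidekiq periodically runs a Jobs::RunHeartbeat job that just records a timestamp in Redis, and the unicorn master warns if that timestamp gets too stale. A rough sketch of the idea, with a made-up key name, not the actual Discourse code:)

# Rough sketch of the heartbeat idea; "heartbeat_last_run" is a made-up key name.
# Sidekiq side: the recurring job records "I'm alive" in Redis.
def run_heartbeat(redis)
  redis.set("heartbeat_last_run", Time.now.to_i)
end

# Unicorn-master side: warn if the last beat is too old.
def check_sidekiq_heartbeat(redis, max_age = 180)
  age = Time.now.to_i - redis.get("heartbeat_last_run").to_i
  warn "Sidekiq heartbeat test failed (last beat #{age}s ago)" if age > max_age
end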

Related to this?

Are you running out of disk space? What does df -m report on the server for disk usage?

You can check to see how many backups you have and disk space at the bottom left of the /admin dashboard, e.g.

Here’s what we get with df -m (note that the final entry is an external hard drive we use for backing up other data; it’s irrelevant to Discourse):

clayh@moxie:/var/discourse$ df -m
Filesystem     1M-blocks    Used Available Use% Mounted on
/dev/sda1          60004    3919     53015   7% /
none                   1       0         1   0% /sys/fs/cgroup
udev               16041       1     16041   1% /dev
tmpfs               3211       2      3209   1% /run
none                   5       0         5   0% /run/lock
none               16051       1     16050   1% /run/shm
none                 100       1       100   1% /run/user
/dev/sda5          59949    5507     51374  10% /usr
/dev/sda6          59950      65     56817   1% /tmp
/dev/sdb1         328612   68395    243503  22% /home
/dev/sdb5         610032  214308    364714  38% /var
/dev/sdc1        3755441 2706304    858349  76% /media/clayh/a876d7df-5839-4591-a7e0-61a42f03033c

As for disk space and backups (which work fine when done manually):

By the way, when I manually click the backup button from the admin panel, Discourse backs up and the backup file is uploaded to Amazon, as expected. That whole process seems fine. It appears as if Sidekiq is ‘sick’ and failing. While the errors make it look related to the backup process, the backup itself seems fine. Perhaps the message we’re sending to Sidekiq to tell it to perform the backup is corrupt?
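One way to check that theory from the rails console would be to list what Sidekiq actually has enqueued, scheduled, and retrying, and see whether a stale or malformed backup job stands out. A sketch using Sidekiq’s public API:

# Sketch: dump Sidekiq's queued, scheduled, and retrying jobs so a stale or
# malformed backup entry would stand out. Run from the rails console.
require 'sidekiq/api'

Sidekiq::Queue.all.each do |q|
  q.each { |job| puts "queued #{q.name}: #{job.klass} #{job.args.inspect}" }
end

Sidekiq::ScheduledSet.new.each { |job| puts "scheduled: #{job.klass} at #{job.at}" }
Sidekiq::RetrySet.new.each { |job| puts "retrying: #{job.klass} (#{job['error_message']})" }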

I also want to add that this error is showing up in the logs once per second:

Job exception: undefined method `every' for Jobs::CreateBackup:Class

/var/www/discourse/lib/scheduler/schedule_info.rb:79:in `schedule!'
/var/www/discourse/lib/scheduler/manager.rb:221:in `schedule_next_job'
/var/www/discourse/lib/scheduler/manager.rb:199:in `block in tick'
/var/www/discourse/lib/scheduler/manager.rb:246:in `block in lock'
/var/www/discourse/lib/distributed_mutex.rb:21:in `synchronize'
/var/www/discourse/lib/scheduler/manager.rb:245:in `lock'
/var/www/discourse/lib/scheduler/manager.rb:198:in `tick'
/var/www/discourse/config/initializers/sidekiq.rb:35:in `block (2 levels) in <top (required)>'
hostname	        moxie-app
process_id	        6732
application_version	2c9058ab00e191b729173e7a10cb2d54f7df29ed
current_db	        default
current_hostname	moxie.rtp.rti.org
message	                While ticking scheduling manager
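For context, the scheduler works out a job’s next run by asking the job class for its interval, so a recurring job is declared roughly like the sketch below (illustrative only, with a made-up class name, not the actual Discourse source). The “undefined method `every'” error suggests the scheduler found a leftover entry for Jobs::CreateBackup in Redis even though the class in the running code doesn’t declare an interval that way.

# Illustrative sketch of the shape of recurring job the scheduler expects;
# Jobs::SomeRecurringJob is a made-up name, not real Discourse code.
module Jobs
  class SomeRecurringJob < Jobs::Scheduled
    every 1.day   # class-level interval the scheduler reads back when rescheduling

    def execute(args)
      # periodic work goes here
    end
  end
end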

@neil

It turns out that this also seems to be leading to Discourse not checking Gmail for “reply by email” posts. It’s definitely sending out emails, though…

Is there a way I can force Discourse to check for reply-by-email messages? Or test to see where/how that is failing? I’m very eager to fix this Sidekiq problem, but not sure what to do… backup Discourse, totally wipe it clean, and start fresh by restoring the backup?
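Is something like this from the rails console the right way to run the mail check by hand? A sketch, assuming the incoming-mail check is still the Jobs::PollMailbox job, run inline since the scheduler is stuck:

# Sketch: run the mail-polling job inline from the rails console, bypassing
# the stuck scheduler. Assumes the job is named Jobs::PollMailbox.
Jobs::PollMailbox.new.execute({})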

Any other troubleshooting suggestions? We’ve worked very hard to get buy-in from internal people who aren’t used to communicating in this manner, and it’s backfiring now that their email replies aren’t being posted as they expected.

p.s. Clearly my understanding of Sidekiq is poor… now that I look at the Scheduler tab, I can see that none of the recurring jobs are running, probably because Jobs::RunHeartbeat is not running. I get a "Forbidden" message when I try to trigger them manually.

I’m thinking that this is related. I tried the Sidekiq fix, but so far nothing is happening.

Something is broken about your install but it is unclear what it is.

I’ve seen internal DNS failures (where, internally, we can’t resolve, say, google.com) cause all kinds of bizarre problems in the past. Have you checked that?

Did you check discourse.example.com/logs in the browser to see if there were any unusual errors?
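A quick way to confirm from inside the container that the app process itself can resolve names (a sketch, run from the rails console):

# Sketch: check DNS resolution from the Rails process inside the container.
require 'resolv'
puts Resolv.getaddress('google.com')  # raises Resolv::ResolvError if resolution fails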

Yeah, I checked that. I can ping from inside the container to any address.

The logs are full of errors about the backup scheduler failing - that seems to be choking the Jobs::RunHeartbeat task, causing all of the recurring tasks to fail.

I just turned off backups in the settings and rebuilt the app, but the same problem is happening. I also tried the trick that @sam mentioned here, but it doesn’t seem to have helped.

@codinghorror @neil @sam

I fixed it. Here’s what I had to do:

cd /var/discourse
./launcher ssh app
rails c
Sidekiq.redis { |r| puts r.flushall }

Then I had to exit the container and run ./launcher restart app.

That cleared the Sidekiq Redis queue and everything started back up.

If I had to guess, I upgraded Discourse when there was a backup job in the queue in Redis and somehow, upon restarting the app, the name of a class or something associated with Redis, Sidekiq, or backing up had changed, so the item remaining in the Redis queue was invalid and triggering an error in the logs. That’s total speculation.
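For anyone hitting this later: flushall wipes everything in that Redis (Discourse’s cache, rate limits, and so on), not just Sidekiq’s data, so a narrower cleanup might be worth trying first. A sketch of the same idea with a smaller hammer, using Sidekiq’s public API from the rails console:

# Sketch: clear only Sidekiq's own queues and sets instead of flushing all
# of Redis, which Discourse also uses for its cache and rate limiting.
require 'sidekiq/api'

Sidekiq::Queue.all.each(&:clear)
Sidekiq::ScheduledSet.new.clear
Sidekiq::RetrySet.new.clear
Sidekiq::DeadSet.new.clear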


Aha, so rebooting would have also worked, essentially clearing Redis. Great detective work!