Visiting the …/sidekiq page shows that it is definitely running and processing jobs. I just tried rebuilding the app, but it remains in a frowny-face status.
Any idea what’s going on here or why the admin page isn’t seeing Sidekiq?
Is this a dev instance or a Docker install? I’ve run into that on my dev instance, where it always shows that message, but on my production Docker instance it correctly reports Sidekiq as running.
That seems unlikely, unless you think it is running an old version of Discourse again. As far as I understand, Sidekiq is bundled in the Docker image, so it should be running within the same container as Discourse itself.
However, we’re reaching the limits of my expertise here…
Could it be that a failed backup is killing the Sidekiq heartbeat test? I don’t even really know what the Sidekiq heartbeat test is. Anyhow, I’m seeing this now:
By the way, when I manually click the backup button in the admin panel, Discourse backs up and uploads the backup file to Amazon as expected, so that whole process seems fine. It appears as if Sidekiq itself is ‘sick’ and failing. The errors make it look related to the backup process, but the backup itself seems fine. Perhaps the message we’re sending to Sidekiq to tell it to perform the backup is corrupt?
I also want to add that this error is showing up in the logs once per second:
Job exception: undefined method `every' for Jobs::CreateBackup:Class
/var/www/discourse/lib/scheduler/schedule_info.rb:79:in `schedule!'
/var/www/discourse/lib/scheduler/manager.rb:221:in `schedule_next_job'
/var/www/discourse/lib/scheduler/manager.rb:199:in `block in tick'
/var/www/discourse/lib/scheduler/manager.rb:246:in `block in lock'
/var/www/discourse/lib/distributed_mutex.rb:21:in `synchronize'
/var/www/discourse/lib/scheduler/manager.rb:245:in `lock'
/var/www/discourse/lib/scheduler/manager.rb:198:in `tick'
/var/www/discourse/config/initializers/sidekiq.rb:35:in `block (2 levels) in <top (required)>'
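From poking around in lib/scheduler/schedule_info.rb, my guess (and it is only a guess) is that the scheduler keeps a list of recurring job classes and calls an `every` class method on each one to find out how often it should run. Here is a toy model of what I think is happening, not Discourse’s actual code, just to illustrate why a class without that declaration would produce exactly this error:

# toy model of a recurring-job scheduler (NOT Discourse's real code)
module Scheduled
  # class macro: `every 86400` records the interval; `every` with no
  # argument reads it back, which is what the scheduler does on each tick
  def every(seconds = nil)
    @every = seconds if seconds
    @every
  end
end

class ScheduleBackup
  extend Scheduled
  every 24 * 60 * 60   # a recurring job declares how often it runs
end

class CreateBackup
  # a plain one-off job has no `every` declaration at all
end

[ScheduleBackup, CreateBackup].each do |klass|
  puts "#{klass} runs every #{klass.every} seconds"
end
# prints "ScheduleBackup runs every 86400 seconds", then raises
# NoMethodError (undefined method `every' for CreateBackup:Class)

So if Jobs::CreateBackup somehow ended up in the recurring list without an `every` declaration, every scheduler tick would blow up with the error above, which would explain the once-per-second flood in the logs.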
It turns out that this also seems to be preventing Discourse from checking Gmail for “reply by email” posts. It’s definitely sending out emails, though…
Is there a way I can force Discourse to check for reply-by-email messages? Or test to see where/how that is failing? I’m very eager to fix this Sidekiq problem, but I’m not sure what to do… back up Discourse, wipe it completely, and start fresh by restoring the backup?
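For what it’s worth, I assume the reply-by-email check is just another recurring Sidekiq job. If so, something along these lines from the Rails console inside the container might run it by hand; Jobs::PollMailbox is only my guess at the class name, so please correct me if that’s wrong:

# run the mail poll synchronously, bypassing the (broken) scheduler
Jobs::PollMailbox.new.execute({})

# or enqueue it through Sidekiq, though that probably won't help while the scheduler is stuck
Jobs.enqueue(:poll_mailbox)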
Any other troubleshooting suggestions? We’ve worked very hard to get buy-in from internal people who aren’t used to communicating this way, and it’s backfiring now that their email replies aren’t being posted as they expect.
p.s. Clearly my understanding of Sidekiq is poor… now that I look at the Scheduler tab, I can see that none of the recurring jobs are running, probably because Jobs::RunHeartbeat is not running. I get a “Forbidden” message when I try to trigger them manually.
I’m thinking that this is related. I tried the Sidekiq fix, but so far nothing is happening.
Something is broken about your install but it is unclear what it is.
I’ve seen internal DNS failures (where the container can’t resolve, say, google.com) cause all kinds of bizarre problems in the past. Have you checked that?
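For example, something like this from a Rails console inside the container is a quick way to verify (plain Ruby, nothing Discourse-specific):

# resolve a public hostname from inside the container;
# prints an IP address if DNS works, raises Resolv::ResolvError if not
require "resolv"
puts Resolv.getaddress("google.com")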
Yeah, I checked that. I can ping any address from inside the container.
The logs are full of errors about the backup scheduler failing, which seems to be choking the Jobs::RunHeartbeat task and causing all of the recurring tasks to fail.
I just turned off backups in the settings and rebuilt the app, but the same problem is happening. I also tried the trick @sam mentioned here, but it doesn’t seem to have helped.
cd /var/discourse
./launcher ssh app
rails c
Sidekiq.redis { |r| puts r.flushall }
Then I had to exit the container and run ./launcher restart app.
That flushed everything in Redis (not just the Sidekiq queues), and everything started back up.
If I had to guess, I upgraded Discourse while there was a backup job sitting in the Redis queue, and somehow, after the app restarted, the name of a class (or something else associated with Redis, Sidekiq, or the backup process) had changed, so the leftover item in the queue was invalid and kept triggering errors in the logs. That’s total speculation.
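If anyone else hits this and wants to check for a stale job like that before resorting to flushall, I believe Sidekiq’s own API lets you peek at what is sitting in its queues and in the scheduled/retry sets from the Rails console, roughly like this (untested here, and Discourse’s scheduler also keeps its own bookkeeping in Redis, so a bad entry might not show up in these lists):

# list what Sidekiq currently has queued, scheduled, and waiting to retry
require "sidekiq/api"

Sidekiq::Queue.all.each do |q|
  puts "queue #{q.name}: #{q.size} jobs"
  q.each { |job| puts "  #{job.klass} #{job.args.inspect}" }
end

Sidekiq::ScheduledSet.new.each { |job| puts "scheduled: #{job.klass} at #{job.at}" }
Sidekiq::RetrySet.new.each     { |job| puts "retrying:  #{job.klass}" }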