Goodbye Sidetiq

(Sam Saffron) #1

We have been plagued with problems related to our scheduler, Sidetiq.

In particular we have had a few major recurring issues related to high CPU

We have been working to address them upstream. However, even once they are addressed, there are some design issues that we need fixed as well.

Recently, I was debugging an erratic perf issue with @supermathie here and found a second serious issue.

Sidetiq always schedules every job at exactly the same time. As a result, the same job runs on every site we host at the same moment, which leads to network saturation.
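
One common way to avoid this kind of synchronized spike is to give each site a stable offset inside the job's interval. The sketch below is an illustration only and is not taken from the new scheduler; `staggered_offset`, `next_run_at`, and the site names are hypothetical.

```ruby
require "zlib"

# Hypothetical helper: derive a stable per-site, per-job offset (in seconds)
# inside the job's interval, so the same job does not fire on every site at once.
def staggered_offset(site, job_name, interval_seconds)
  Zlib.crc32("#{site}:#{job_name}") % interval_seconds
end

# Hypothetical helper: start of the next interval plus the site's offset.
def next_run_at(site, job_name, interval_seconds, now = Time.now)
  interval_start = now.to_i - (now.to_i % interval_seconds)
  Time.at(interval_start + interval_seconds +
          staggered_offset(site, job_name, interval_seconds))
end
```

Because the offset is derived from a checksum of the site and job name, it stays the same across restarts, while different sites tend to land on different seconds within the interval.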

I have been building a lightweight scheduler to address the perf and scheduling issues. I expect to have it done today.

So, this is just an upfront warning that the scheduler is changing. If you notice any new issues with scheduled jobs, please let me know.

Commit that activates this is:

(Sam Saffron) #3

Note: the UI is much improved over what sidetiq was giving us:

(Kane York) #4

Wow, that looks really nice! Will you be adding a link to it from the admin panel?

(Sam Saffron) #5

Sure, open to that. Where would you link this?

(Ambethia) #6

This looks like something that’s a candidate for extracting into its own library. Thoughts?

(Régis Hanol) #7

Will require some real world usage first :wink:

(Robin Ward) #8

Nice, I’m glad to ditch a freedom patch too.

(Sam Saffron) #9

I am still holding back to see if Tobias takes in all the features we need.

(Jordan Hofker) #10

What’s the upgrade impact (if any) of this change and should the install/upgrade doc(s) be updated?

(Jeff Atwood) #11

We’re going to switch over to the Docker install instructions as the new standard next week. @sam I know that is on your list, but make sure.

The no-Docker, old style install will be available but not our primary supported or recommended install any more. So we’ll move it around a bit.

(Jordan Hofker) #12

What’s the transition from an environment running no-Docker to Docker like?

(Sorry if I’m being annoying. Just wondering how much time I need to budget for this transition. :smile: (I have a Bitnami-created VM hosted on Azure))

(Jeff Atwood) #13

Yes. We also want to have a “howto convert from regular to Docker install” posted soon. @sam has done a number of conversions from old-style Discourse install to new Docker installs now and has documented the process internally.

@zogstrip and @sam are working on necessary import/export improvements and once that is ready, it should go up.

(Michael - #14

We’re having issues.

The sidekiq/scheduler page shows that all recurring jobs are ‘Not scheduled yet’

Hitting the ‘Trigger’ button on, for instance, ‘Version Checks’ results in a screen showing “Internal Server Error” and this in the logs:

    ArgumentError - comparison of Class with Class failed:
            /var/www/discourse/lib/scheduler/web.rb:10:in `sort'
            /var/www/discourse/lib/scheduler/web.rb:10:in `block (2 levels) in registered'
            /var/www/discourse/vendor/gems/rails_multisite/lib/rails_multisite/connection_management.rb:45:in `with_connection'
            /var/www/discourse/lib/scheduler/web.rb:8:in `block in registered'

From that moment on, the scheduler screen is broken (it keeps showing ‘Internal Server Error’ until Sidekiq is restarted).
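
For context, this is the message Ruby raises when `sort` is asked to order classes that are unrelated by inheritance, because `Module#<=>` returns `nil` for them. The sketch below only reproduces the message with hypothetical `DemoJobs` classes; it is not the code at `lib/scheduler/web.rb:10`, and sorting by name is just one possible workaround.

```ruby
# Hypothetical stand-ins for scheduled job classes.
module DemoJobs
  class VersionCheck; end
  class Heartbeat; end
end

klasses = [DemoJobs::VersionCheck, DemoJobs::Heartbeat]

# Unrelated classes have no ordering: Module#<=> returns nil, so sort raises.
error_message =
  begin
    klasses.sort
    nil
  rescue ArgumentError => e
    e.message
  end

# Sorting on a comparable attribute, such as the class name, works.
sorted_names = klasses.sort_by(&:name).map(&:name)
```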

(Sam Saffron) #15

What version of Ruby are you on?

(Michael - #16

ruby 2.0.0p353 (2013-11-22 revision 43784) [x86_64-linux]

(Sam Saffron) #18

Is there anything custom in your initializer? You are going to need to trace through the Sidekiq initializer to see why it’s not doing an initial enqueue of the jobs.

(Michael - #19

(Your redis question apparently disappeared, but flushing redis does make it recover from the “Internal Server Error” on sidekiq/scheduler)

The initializer is unmodified; in fact, this is a clean v0.9.8.4 checkout.

(Sam Saffron) #20

Can you do a clean checkout of master?

(Michael - #21

Sidekiq does seem to work: I see a ‘stat’ job coming by. It’s only the scheduled jobs that are failing.

Hitting the “Trigger” button on “Version check” does seem to run the version check; the following keys appear in redis:

     1) "default:last_installed_version"
     4) "default:missing_versions_count"
     5) "default:critical_updates_available"
     8) "default:last_version_check_at"
     9) "default:discourse_latest_version"
    10) "default:_scheduler_Jobs::VersionCheck"
    14) "default:_scheduler_queue_"

Removing the key “default:_scheduler_Jobs::VersionCheck” fixes the Internal Server Error:

    > get "default:_scheduler_Jobs::VersionCheck"
    "{\"next_run\":1391979510,\"prev_run\":1391893854,\"prev_duration\":327,\"prev_result\":\"OK\"}"
    > del "default:_scheduler_Jobs::VersionCheck"
    (integer) 1
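
The stored value is a JSON document, so it can be inspected before the key is deleted. This is a minimal sketch using the string returned by `get` above; treating `next_run` and `prev_run` as Unix timestamps in seconds is an assumption based on their names and magnitudes.

```ruby
require "json"

# Value copied from the redis session, with the outer quoting unescaped.
raw = '{"next_run":1391979510,"prev_run":1391893854,"prev_duration":327,"prev_result":"OK"}'

info = JSON.parse(raw)

# Assumption: next_run and prev_run are Unix timestamps in seconds.
next_run = Time.at(info["next_run"]).utc
prev_run = Time.at(info["prev_run"]).utc

puts "previous run: #{prev_run} (#{info['prev_result']})"
puts "next run:     #{next_run}"
```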