Sidekiq has a lot of errors and queued jobs

We had some advice show up in the admin console regarding our Sidekiq queued jobs - we have nearly 200,000 jobs enqueued, and are unsure how to resolve the issue. We’re running version 3.2.0.

The jobs are all in the ultra_low queue and all appear to be Jobs::ProcessPost, with arguments similar to this:

	[{"bypass_bump"=>true, "cooking_options"=>nil, "new_post"=>false, "post_id"=>729508, "skip_pull_hotli... 
{"bypass_bump"=>true, "cooking_options"=>nil, "new_post"=>false, "post_id"=>729508, "skip_pull_hotlinked_images"=>false, "current_site_id"=>"default"}

How can we get these jobs to process, and how do we do it in a way that ensures that they don’t impact performance significantly (or is that inevitable)?
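
In case it helps: we read those numbers off the Sidekiq web UI, but the same queue can also be inspected from a Rails console. A minimal sketch, assuming the standard Sidekiq API:

    # Sketch: inspect the ultra_low backlog from a Rails console (standard Sidekiq API)
    require "sidekiq/api"

    q = Sidekiq::Queue.new("ultra_low")
    q.size      # => number of enqueued jobs (~200,000 for us)
    q.latency   # => age, in seconds, of the oldest job in the queue

    # Peek at the first job without running or deleting it
    job = q.first
    job.klass   # => "Jobs::ProcessPost"
    job.args    # => [{"bypass_bump"=>true, ..., "post_id"=>729508, ...}]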

1 Like

Two days ago, it was about 195K; now it’s up to 211K. Could really use some advice on what to look at and how to resolve this issue.

We’ve noticed 5 dead jobs that look like this (IP addresses obfuscated):

Jobs::HandledExceptionWrapper: Wrapped Errno::ENETUNREACH: Failed to open TCP connection to x.x.x.x:443 (Network is unreachable - connect(2) for "x.x.x.x" port 443)

The address is for a server that’s in our infrastructure but not used by the forums (it’s a source code repository server). I don’t see that host referenced anywhere in the configuration, so I’m not sure why Discourse thinks it needs to connect to it. But I wonder if this is related to the growing number of enqueued jobs in the ultra_low queue.
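
For completeness, here’s a sketch of how those dead jobs can be listed from a Rails console (assuming the standard Sidekiq API; the same details show up in the web UI):

    # Sketch: list dead jobs with their error class and message (standard Sidekiq API)
    require "sidekiq/api"

    Sidekiq::DeadSet.new.each do |job|
      puts "#{job.klass}: #{job["error_class"]} - #{job["error_message"]}"
    end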

1 Like

Another 7,000 jobs added to the queue in the last ~23 hours.

This is also holding up a few implementation changes - enabling auto-close and adding the solutions plugin - since I don’t want to make changes that might be affected by this issue, or compound it by changing something so significant in the system setup.

The 7K jobs added don’t align with how many new posts we’ve had in the last day, so I’m not sure what’s driving this at all.

What other information would be helpful to get this issue resolved?

1 Like

I can’t help, except by suggesting that more replies and more likes to posts on this thread might bring it the attention it seems to deserve!

Edit: I’ve also mentioned this problem in the thread about the default order-by-hotness, which I think is hiding this thread very effectively, and very unhelpfully.

1 Like

Can you share the full backtrace? I assume that’s coming from our onebox code, but would like to be sure.

3 Likes

I assume we have to do this from the console - I’ll have to ask someone with access to do that. Can you tell me where to pull it from so I can pass that along?

Thanks!

No, no - the backtrace is shown in a tab on the /logs page when you select the error.

1 Like

Ah, thanks for the pointer. Here’s what it shows (IP address obfuscated):

Message (10 copies reported)

Job exception: Failed to open TCP connection to x.x.x.x:443 (Network is unreachable - connect(2) for "x.x.x.x" port 443)


Backtrace

/usr/lib64/ruby/gems/3.2.0/gems/net-http-0.4.1/lib/net/http.rb:1603:in `initialize'
/usr/lib64/ruby/gems/3.2.0/gems/net-http-0.4.1/lib/net/http.rb:1603:in `open'
/usr/lib64/ruby/gems/3.2.0/gems/net-http-0.4.1/lib/net/http.rb:1603:in `block in connect'
/usr/lib64/ruby/gems/3.2.0/gems/timeout-0.4.1/lib/timeout.rb:186:in `block in timeout'
/usr/lib64/ruby/gems/3.2.0/gems/timeout-0.4.1/lib/timeout.rb:193:in `timeout'
/usr/lib64/ruby/gems/3.2.0/gems/net-http-0.4.1/lib/net/http.rb:1601:in `connect'
/srv/www/vhosts/discourse/lib/final_destination/http.rb:27:in `block in connect'
/srv/www/vhosts/discourse/lib/final_destination/http.rb:17:in `each'
/srv/www/vhosts/discourse/lib/final_destination/http.rb:17:in `each_with_index'
/srv/www/vhosts/discourse/lib/final_destination/http.rb:17:in `connect'

I did notice that this is different from the output in the “backtrace” tab - that output doesn’t look like a backtrace, while what’s here (captured with the copy button) does. Let me know if other info is needed for this one.

I’ll also see if I can get more info about the 220K+ enqueued tasks this way as well - I was unaware of the /logs endpoint.

No extra lines on the bottom part of the backtrace?

That was all the Copy button picked up - here’s the full output from the backtrace tab:

net-http-0.4.1/lib/net/http.rb:1603:in `initialize' 
net-http-0.4.1/lib/net/http.rb:1603:in `open' 
net-http-0.4.1/lib/net/http.rb:1603:in `block in connect' 
timeout-0.4.1/lib/timeout.rb:186:in `block in timeout' 
timeout-0.4.1/lib/timeout.rb:193:in `timeout' 
net-http-0.4.1/lib/net/http.rb:1601:in `connect' 
/srv/www/vhosts/discourse/lib/final_destination/http.rb:27:in `block in connect' 
/srv/www/vhosts/discourse/lib/final_destination/http.rb:17:in `each' 
/srv/www/vhosts/discourse/lib/final_destination/http.rb:17:in `each_with_index' 
/srv/www/vhosts/discourse/lib/final_destination/http.rb:17:in `connect' 
net-http-0.4.1/lib/net/http.rb:1580:in `do_start' 
net-http-0.4.1/lib/net/http.rb:1569:in `start' 
net-http-0.4.1/lib/net/http.rb:1029:in `start' 
/srv/www/vhosts/discourse/lib/final_destination.rb:551:in `safe_session' 
/srv/www/vhosts/discourse/lib/final_destination.rb:486:in `safe_get' 
/srv/www/vhosts/discourse/lib/final_destination.rb:170:in `get' 
/srv/www/vhosts/discourse/lib/retrieve_title.rb:90:in `fetch_title' 
/srv/www/vhosts/discourse/lib/retrieve_title.rb:12:in `crawl' 
/srv/www/vhosts/discourse/lib/inline_oneboxer.rb:76:in `lookup' 
/srv/www/vhosts/discourse/lib/cooked_processor_mixin.rb:310:in `process_inline_onebox' 
/srv/www/vhosts/discourse/lib/cooked_processor_mixin.rb:39:in `block in post_process_oneboxes' 
/srv/www/vhosts/discourse/lib/oneboxer.rb:214:in `block in apply' 
/srv/www/vhosts/discourse/lib/oneboxer.rb:162:in `block in each_onebox_link' 
nokogiri-1.16.0/lib/nokogiri/xml/node_set.rb:235:in `block in each' 
nokogiri-1.16.0/lib/nokogiri/xml/node_set.rb:234:in `upto' 
nokogiri-1.16.0/lib/nokogiri/xml/node_set.rb:234:in `each' 
/srv/www/vhosts/discourse/lib/oneboxer.rb:162:in `each_onebox_link' 
/srv/www/vhosts/discourse/lib/oneboxer.rb:213:in `apply' 
/srv/www/vhosts/discourse/lib/cooked_processor_mixin.rb:9:in `post_process_oneboxes' 
/srv/www/vhosts/discourse/lib/cooked_post_processor.rb:42:in `block in post_process' 
/srv/www/vhosts/discourse/lib/distributed_mutex.rb:53:in `block in synchronize' 
/srv/www/vhosts/discourse/lib/distributed_mutex.rb:49:in `synchronize' 
/srv/www/vhosts/discourse/lib/distributed_mutex.rb:49:in `synchronize' 
/srv/www/vhosts/discourse/lib/distributed_mutex.rb:34:in `synchronize' 
/srv/www/vhosts/discourse/lib/cooked_post_processor.rb:38:in `post_process' 
/srv/www/vhosts/discourse/app/jobs/regular/process_post.rb:28:in `block in execute' 
/srv/www/vhosts/discourse/lib/distributed_mutex.rb:53:in `block in synchronize' 
/srv/www/vhosts/discourse/lib/distributed_mutex.rb:49:in `synchronize' 
/srv/www/vhosts/discourse/lib/distributed_mutex.rb:49:in `synchronize' 
/srv/www/vhosts/discourse/lib/distributed_mutex.rb:34:in `synchronize' 
/srv/www/vhosts/discourse/app/jobs/regular/process_post.rb:8:in `execute' 
/srv/www/vhosts/discourse/app/jobs/base.rb:297:in `block (2 levels) in perform' 
rails_multisite-5.0.0/lib/rails_multisite/connection_management.rb:82:in `with_connection'
/srv/www/vhosts/discourse/app/jobs/base.rb:284:in `block in perform' 
/srv/www/vhosts/discourse/app/jobs/base.rb:280:in `each' 
/srv/www/vhosts/discourse/app/jobs/base.rb:280:in `perform' 
sidekiq-6.5.12/lib/sidekiq/processor.rb:202:in `execute_job' 
sidekiq-6.5.12/lib/sidekiq/processor.rb:170:in `block (2 levels) in process' 
sidekiq-6.5.12/lib/sidekiq/middleware/chain.rb:177:in `block in invoke' 
/srv/www/vhosts/discourse/lib/sidekiq/pausable.rb:132:in `call' 
sidekiq-6.5.12/lib/sidekiq/middleware/chain.rb:179:in `block in invoke' 
sidekiq-6.5.12/lib/sidekiq/middleware/chain.rb:182:in `invoke' 
sidekiq-6.5.12/lib/sidekiq/processor.rb:169:in `block in process' 
sidekiq-6.5.12/lib/sidekiq/processor.rb:136:in `block (6 levels) in dispatch' 
sidekiq-6.5.12/lib/sidekiq/job_retry.rb:113:in `local' 
sidekiq-6.5.12/lib/sidekiq/processor.rb:135:in `block (5 levels) in dispatch' 
sidekiq-6.5.12/lib/sidekiq/rails.rb:14:in `block in call' 
activesupport-7.0.8/lib/active_support/execution_wrapper.rb:92:in `wrap' 
activesupport-7.0.8/lib/active_support/reloader.rb:72:in `block in wrap' 
activesupport-7.0.8/lib/active_support/execution_wrapper.rb:92:in `wrap' 
activesupport-7.0.8/lib/active_support/reloader.rb:71:in `wrap' 
sidekiq-6.5.12/lib/sidekiq/rails.rb:13:in `call' 
sidekiq-6.5.12/lib/sidekiq/processor.rb:131:in `block (4 levels) in dispatch' 
sidekiq-6.5.12/lib/sidekiq/processor.rb:263:in `stats' 
sidekiq-6.5.12/lib/sidekiq/processor.rb:126:in `block (3 levels) in dispatch' 
sidekiq-6.5.12/lib/sidekiq/job_logger.rb:13:in `call' 
sidekiq-6.5.12/lib/sidekiq/processor.rb:125:in `block (2 levels) in dispatch' 
sidekiq-6.5.12/lib/sidekiq/job_retry.rb:80:in `global' 
sidekiq-6.5.12/lib/sidekiq/processor.rb:124:in `block in dispatch' 
sidekiq-6.5.12/lib/sidekiq/job_logger.rb:39:in `prepare' 
sidekiq-6.5.12/lib/sidekiq/processor.rb:123:in `dispatch' 
sidekiq-6.5.12/lib/sidekiq/processor.rb:168:in `process' 
sidekiq-6.5.12/lib/sidekiq/processor.rb:78:in `process_one' 
sidekiq-6.5.12/lib/sidekiq/processor.rb:68:in `run' 
sidekiq-6.5.12/lib/sidekiq/component.rb:8:in `watchdog' 
sidekiq-6.5.12/lib/sidekiq/component.rb:17:in `block in safe_thread' 

Sorry for not grabbing the whole output. I thought it was strange that the Copy button didn’t pick up more, but I made a bad assumption that it captured everything that was needed.

Oh that’s much better!

This means that there are one or more inline links with a URL that resolves to the IP address of that server in your infrastructure. If the server can’t be reached, we do log the error, but the markdown cook process should still proceed - it just skips the “link prettify” step that the inline onebox would normally apply.
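
If you want to track down which posts contain that link, something along these lines from a Rails console might work - a rough sketch that assumes the host or IP appears literally in the raw post markdown (replace the placeholder with the real hostname):

    # Rough sketch: find posts whose raw markdown references the unreachable host
    # "x.x.x.x" is a placeholder - substitute the real hostname or IP
    Post.where("raw LIKE ?", "%x.x.x.x%").pluck(:id, :topic_id)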

Another thing: how did you install Discourse? This doesn’t look like an install that followed our official install guide.

I didn’t actually perform the installation - I believe our infrastructure team has deployed it using a CI/CD pipeline, but I don’t know the details.

Sounds like the failed messages aren’t an issue - the large number of enqueued tasks in the ultra_low queue seems to be the bigger problem here. I’m not finding anything about them in the logs (which makes sense to me, since those jobs haven’t actually run yet).

If you manually force a few to run, what happens?

I don’t see an option to force them to run in the web UI - I have an option to “show all” arguments, or to delete the tasks individually.
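
(Presumably one could be run by hand from a Rails console instead - a sketch, assuming the job classes can be invoked directly with the arguments shown in the queue:)

    # Sketch: run a single queued job by hand from the Rails console,
    # using a post_id copied from the job's arguments in the Sidekiq UI
    Jobs::ProcessPost.new.execute(post_id: 729508, bypass_bump: true)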

Hm. We don’t have any queued ones on the forum I help run, so I apologize for trying to suggest a button that isn’t there.

For reference: the button I was thinking of is on the Retries page - I had assumed the Queued page had the same option.

No worries - yeah, these here show up in the Enqueued page (at /sidekiq/queues/ultra_low).

Looking at the queues page itself, I see that the latency of the ultra_low queue is about a year. So this apparently has gone on for a while, and we only recently got an alert about it in the dashboard.

1 Like

Digging a little further into the Jobs::ProcessPost jobs, it looks like the post_id values are counting down. I pulled the post IDs from the first and last jobs in the queue; the dates correspond to a post from 2012 (pre-migration to Discourse, though with an updated_at timestamp from 2022, which might have been our migration date - I’ll have to check that) and a post from January 2023.

I do see occasional jobs for generating topic thumbnails, but those are pretty rare. With over 8,800 pages of queue, I can’t check everything, but it looks like it’s mostly Jobs::ProcessPost stuff that’s getting added in, and it’s going back in time through what looks to be every single reply and initial post to a topic (the earliest one was mid-thread, the latest was a root post in a topic).
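
If it would help, I could tally the whole queue from a Rails console instead of paging through the web UI - a sketch using the standard Sidekiq API (walking ~200K jobs will take a while):

    # Sketch: tally job classes and the post_id range across the whole queue
    require "sidekiq/api"

    counts = Hash.new(0)
    post_ids = []
    Sidekiq::Queue.new("ultra_low").each do |job|
      counts[job.klass] += 1
      post_ids << job.args.first["post_id"] if job.klass == "Jobs::ProcessPost"
    end
    puts counts.inspect          # e.g. {"Jobs::ProcessPost"=>..., ...}
    puts post_ids.minmax.inspect # lowest and highest post IDs in the backlog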

1 Like

I’ve pulled a backup from the production system and loaded it into a test environment set up locally. The queue there doesn’t seem to be getting backed up at all - in fact, I don’t see Jobs::ProcessPost show up in that queue at all while watching it (I’m going to let the system run on its own while recording the queue screen, to see if I’m just missing it coming through).

This leads me to two questions:

  1. What causes the ProcessPost job to be triggered?
  2. What would cause the ultra_low queue to not be processed?

I’m no expert in any of the technologies used in Discourse (Redis, Sidekiq, or Rails), but I’m very comfortable learning and trying things in my sandboxed environment in order to understand what I need to have someone look at in production.
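
One concrete thing I can try in the sandbox (and then have someone run in production) is checking which queues the running Sidekiq processes are actually subscribed to - a sketch, assuming the standard Sidekiq API:

    # Sketch: show which queues each running Sidekiq process is listening to
    require "sidekiq/api"

    Sidekiq::ProcessSet.new.each do |process|
      puts "#{process["hostname"]} pid=#{process["pid"]} queues=#{process["queues"].inspect}"
    end
    # If "ultra_low" never appears in any of the queue lists, nothing will ever drain it.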

The good news is that this issue doesn’t seem to be causing any problems for our users (yet). A third question would be whether simply purging that queue would help, or whether never letting those jobs run would harm the system in some way.

ETA: I’ve just seen a batch of the jobs come through in my test environment, and they are processing there. So it just seems that the queue is not processing on the production server for some reason.

It looks like we’ve come to a resolution. The installation we’re using is from the package built for openSUSE (the instance we’ve been talking about is the one for the openSUSE forums), so it’s essentially using the “build from source” installation process.

We have development and production environments, and the ultra_low queue was mistakenly excluded from the production environment’s Sidekiq configuration. This has been corrected, and the queue is now emptying out - the latency has already dropped from 1 year to 4 weeks.
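
For anyone who hits the same thing with a from-source install: the fix amounted to making sure ultra_low is listed in the Sidekiq configuration that the production workers actually load. As an illustrative sketch only (the file location and weights depend on your packaging - ours is the openSUSE one):

    # config/sidekiq.yml - illustrative; the real path and weights depend on the setup
    :queues:
      - [critical, 8]
      - [default, 4]
      - [low, 2]
      - [ultra_low, 1]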

I think we can consider this resolved now. Thanks to everyone who provided input - it helped put me on the right track to figuring out what the issue was.

2 Likes

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.