Hi! I have the same problem. I found this post a few days ago while searching, so following your feedback, @sam, we decided to wait before taking any action on the forum. But after 4 days the convert process is still killing our machine and taking down the forum (it slows down and then starts returning 502s for every request).
Our setup is a one-click DigitalOcean image, and the problem started last Friday, 2019-02-22, at 10:00, after upgrading the forum to the latest version, from 2.2.0beta1 +20 to 2.3.0.beta2.
Thanks for answering. “A lot” is relative, but yes, I think it has some image-heavy threads created by the community.
I assumed that was the reason, but it looks strange to me after 4 days at 100% CPU; I assumed this kind of background job normally runs in queues or something similar to keep the load under control.
I’m just thinking out loud and speculating; I don’t know (yet) how Discourse’s internals work.
Thanks for answering, @sam. Lowering the rebake setting to 40 helps a little: it now throws fewer 502s, but still fails 20–30 minutes after the last server restart.
Regarding the jobs: yes, there are a lot of them, and they are either decreasing very slowly or just stuck at the same number of pending tasks.
This is odd as the convert process runs at a very low priority. I had this setting cranked way up and had our 12 core server maxed out on CPU for close to two weeks without any noticeable slowdown. Is your server short on RAM, or do you have a very slow hard disk?
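For anyone checking the two things asked about here (RAM headroom and disk speed), both can be inspected from the droplet with standard Linux tools. A sketch, assuming a stock Ubuntu droplet; `iostat` comes from the `sysstat` package:

```shell
# RAM headroom and swap usage (values in MiB); heavy swap use explains slowdowns
free -m

# Disk utilization: sustained %util near 100 or high await suggests a slow
# or overloaded disk (falls back to a hint if sysstat is not installed)
command -v iostat >/dev/null && iostat -x 1 3 || echo "install sysstat to get iostat"
```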
I’ve now lowered the rebake setting to 10; let’s see. I want to rebake everything, as before the crisis, and then move back to my previous droplet.
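For reference, on a standard Docker-based install the setting can also be inspected and changed from a Rails console inside the container. A sketch, assuming the default install path used by the DigitalOcean one-click image; the lines after `rails c` are Ruby console commands, shown as comments:

```shell
# Enter the Discourse container
cd /var/discourse
./launcher enter app

# Open a Rails console; site setting changes apply without a rebuild
rails c
#   > SiteSetting.rebake_old_posts_count        # current per-cycle throttle
#   > SiteSetting.rebake_old_posts_count = 10   # lower it to reduce load
```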
I think it’s under control now, so thank you so much, everyone, for your time and suggestions. They put me on the right track, and I learned a bit more about how to manage these cases with Discourse.
Is it necessary to restart unicorn, sidekiq, and/or redis for the changed value to take effect? I turned rebake_old_posts_count down to 1 to try to clear sidekiq but the enqueued count is going up, not down, so it’s not clear that the setting is being honored. Or is there some other reason for over 14K Jobs::CrawlTopicLink jobs enqueued, and growing? I don’t know whether that’s the right job for that setting.
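In case it is useful to others hitting the same confusion: as far as I can tell, lowering the setting only throttles future enqueueing; it does not remove jobs already sitting in the queue, which may be why the count keeps growing for a while. The queue can be watched from a Rails console inside the container, and services restarted with runit. A sketch assuming a standard Docker install; the lines after `rails c` are Ruby:

```shell
./launcher enter app

# Inspect Sidekiq from a Rails console
rails c
#   > Sidekiq::Stats.new.enqueued          # total enqueued jobs
#   > Sidekiq::Queue.new("default").size   # size of one named queue

# Restart the app processes inside the container if needed (runit)
sv restart unicorn
```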
I made this change because we’re seeing something that looks superficially like this on forum.makerforums.info (hosted on DO) after importing about 37.5K topics with about 260K total posts, many of which are image-heavy, with a total of about 33 GB of images. We had a CDN configured and functional before the import; it looks like creating the posts at import time didn’t use the CDN configuration, and maybe it is slowly rebaking to point at the CDN? The reprocessing is definitely taking much longer than the initial import, which really surprised me.
Update: It took over 12 hours to recover, but the sidekiq queue has cleared.