"convert" process eating CPU

Hi.

I’ve upgraded to the latest version of Discourse.
I’ve changed the CDN URL and rebaked the posts.
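
For reference, a full rebake can be done with rake posts:rebake from inside the app container, or from the Rails console; a rough sketch of the console version (not necessarily the exact commands I ran):

# Rails console sketch (./launcher enter app, then rails c): re-cook every post
# so its cooked HTML picks up the new CDN URL. Post#rebake! rebakes one post;
# find_each walks the posts table in batches.
Post.find_each do |post|
  post.rebake!
end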

After that, I have lots of processes running one after the other with this command:

convert jpeg:/var/www/discourse/tmp/download_cache/002ca318720dd3e60e31eddddf2c12fca64df1d3.jpg[0] -auto-orient -gravity center -background transparent -thumbnail 1035x502^ -extent 1035x502 -interpolate catrom -unsharp 2x0.5+0.7+0 -interlace none -quality 98 -profile /var/www/discourse/vendor/data/RT_sRGB.icm jpeg:/tmp/discourse-thumbnail20181105-5725-s2szhw.jpeg

Any thoughts?

It should stop in a few hours; we had to re-download gravatars.

3 Likes

Hi! I have the same problem. I searched and found this post a few days ago, so based on your feedback, @sam, we decided to wait before taking any action on the forum, but after 4 days the convert processes are killing our machine and taking the forum down (it starts getting slower and then starts throwing 502s for every request).

This is the top output at this moment:

%CPU %MEM     TIME+ COMMAND
51.5  3.0   0:07.73 ruby
27.9  7.2   3:42.61 postmaster
14.3  7.4   0:09.51 convert
13.6  7.3   1:52.68 postmaster
13.0  3.6   0:02.20 convert
11.3  3.8   0:02.41 convert
10.3  3.9   0:02.20 convert
 8.3  6.3   0:17.23 convert
 8.3  9.5   0:09.11 convert
 8.3  1.0   0:10.39 convert
 8.3  3.7   0:01.96 convert
 7.3  0.7   0:09.91 convert
 7.0  7.1   0:01.69 convert
 4.0  3.3   0:05.59 convert
 1.3  0.0   0:53.90 kswapd0
 0.7  0.0   0:00.40 kworker/u4:1

Our setup is a one-click DigitalOcean image, and the problem started last Friday, 2019-02-22 at 10:00, after upgrading the forum from 2.2.0.beta1 +20 to the latest version, 2.3.0.beta2.

Actions taken:

  • Waiting for the process to end
  • Clearing old containers + images and rebuilding
  • Restarting the machine 3-4 times
  • Crying

Any idea?

Thank you very much and sorry for bothering you.

Does your forum have a lot of images?

1 Like

Thanks for answering. “A lot” is relative, but I think so, yes: the community has some image-heavy threads.

I assumed that was the reason, but it looks strange to me after 4 days at 100% CPU, and I assumed this kind of background job normally runs through a queue or something similar to keep it under control.

I’m just thinking out loud and speculating; I don’t know (yet) how Discourse’s internals work.

Maybe try halving “rebake old posts count” to 40 and see how you are doing? Have a look at /sidekiq: are you backed up with tons of jobs?
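
If it helps, both things can be done from the Rails console inside the container as well; a rough sketch (the setting’s internal name is rebake_old_posts_count, matching the admin UI label):

# Rails console sketch: lower the rebake batch size and check the Sidekiq backlog.
require "sidekiq/api"

SiteSetting.rebake_old_posts_count = 40   # same effect as changing it in the admin UI

stats = Sidekiq::Stats.new
puts "enqueued: #{stats.enqueued}  scheduled: #{stats.scheduled_size}  retries: #{stats.retry_size}"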

2 Likes

Thanks for answering, @sam. Changing the rebake setting to 40 helps a little bit; it now throws fewer 502s, but it still fails 20-30 minutes after the last server restart.

Regarding the jobs: yes, there are a lot of them, and they are decreasing very slowly or just stuck at the same number of pending tasks.


(It’s in Spanish; if someone needs a translation, just ping me and I’ll update this message with the translation.)

This is the top output at this moment, after the last change and restarting the server 20 minutes ago.

%CPU %MEM     TIME+ COMMAND
43.9 16.5   0:11.38 convert
20.3  9.0   0:11.19 convert
19.9  4.1   0:12.63 convert
13.3  7.5   0:11.52 ruby
 8.0  7.6   0:11.69 ruby
 7.0  5.6   0:01.75 postmaster
 6.3  7.6   0:10.88 ruby
 2.7  3.4   0:14.12 redis-server
 2.7  5.0   0:01.11 postmaster
 1.0  3.8   0:00.80 postmaster
 0.7  4.8   0:13.04 ruby
 0.7  8.1   0:06.26 ruby
 0.3  0.0   0:01.88 rcu_sched
 0.3  0.3   0:02.34 dockerd
 0.3  0.1   0:05.23 nginx
 0.3  0.2   0:00.77 postmaster
 0.3  0.2   0:00.01 jpegoptim
 0.0  0.0   0:01.19 init
 0.0  0.0   0:00.00 kthreadd
 0.0  0.0   0:01.08 ksoftirq

2 Likes

Change “rebake old posts count” to 10 or lower to let sidekiq clear the queue and then slowly increase it.
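
To confirm the queue is actually draining once you lower it, something like this from the Rails console gives a quick read (just a sketch):

# Print the Sidekiq backlog once a minute; the enqueued number should trend down.
require "sidekiq/api"

5.times do
  puts "#{Time.now} enqueued: #{Sidekiq::Stats.new.enqueued}"
  sleep 60
end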

2 Likes

This is odd, as the convert process runs at a very low priority. I had this setting cranked way up and had our 12-core server maxed out on CPU for close to two weeks without any noticeable slowdown. Is your server short on RAM, or do you have a very slow hard disk?

2 Likes

It’s a DO droplet with 4 GB Memory / 60 GB Disk / Ubuntu Discourse on 14.04

This is the performance over the last 24 hours.


(those dips are me restarting the server, or the site being down when I was not able to restart it)

More performance screenshots with details 📈



I would like to mention that we’ve never had performance issues (memory or disk) until now with the convert process.

I followed your suggestion and performance is better (as expected, there are fewer things to do), but the queue is still growing instead of shrinking.

Thanks everyone for your time.

What’s in the queue? Have you tried setting it to 0 to let it cool down?


(translation: the value should be between 1 and 2000000000)

It’s 1 now.

And at 1 everything is fine?

2 Likes

Sorry for the delay; I was waiting to see if the last changes had an effect, and they did.

Update:

  • Upgraded the DO droplet to 16 GB Memory / 60 GB Disk / LON1 - Ubuntu Discourse on 14.04
  • Changed rebake to 1
  • Increased workers from 5 to 8, following your suggestion in another post

10 hours later :blush:

Now I’ve updated the rebake setting to 10, and let’s see. I want to rebake everything as it was before the crisis and then go back to my previous droplet.

I think it’s under control now, so thank you so much, everyone, for your time and suggestions; they put me on the right track, and I learned a little more about how to manage these cases with Discourse.

2 Likes

Is it necessary to restart Unicorn, Sidekiq, and/or Redis for the changed value to take effect? I turned rebake_old_posts_count down to 1 to try to clear Sidekiq, but the enqueued count is going up, not down, so it’s not clear that the setting is being honored. Or is there some other reason for over 14K Jobs::CrawlTopicLink jobs enqueued, and growing? I don’t know whether that’s the right job for that setting. :grimacing:
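
For anyone else digging into this, a rough way to see which job classes are actually filling the queue, from the Rails console (a sketch, nothing Discourse-specific beyond the job names it prints):

# Count enqueued Sidekiq jobs by class across all queues and print the top ten.
require "sidekiq/api"

counts = Hash.new(0)
Sidekiq::Queue.all.each do |queue|
  queue.each { |job| counts[job.klass] += 1 }
end
counts.sort_by { |_, n| -n }.first(10).each { |klass, n| puts "#{n}\t#{klass}" }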

I made this change because we’re seeing something that looks superficially like this on forum.makerforums.info (hosted on DO) after importing about 37.5K topics with about 260K total posts, many of which are image-heavy, with about 33 GB of images in total. We had the CDN configured and functional before the import; it looks like the posts baked at import time didn’t use the CDN configuration, and maybe Discourse is slowly rebaking them to point at the CDN? The reprocessing is definitely taking much longer than the initial import, which really surprised me.
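
If waiting turns out not to be an option, the stuck posts could in principle be rebaked selectively; a sketch only, where the LIKE pattern is my assumption about what the non-CDN cooked HTML contains, so adjust it for your site:

# Rails console sketch (hypothetical pattern): rebake only posts whose cooked
# HTML still references uploads on the origin host instead of the CDN.
Post.where("cooked LIKE ?", "%//forum.makerforums.info/uploads/%").find_each(&:rebake!)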

Update: It took over 12 hours to recover, but the sidekiq queue has cleared.

1 Like