Strange CPU usage since latest upgrade

Ever since the latest update to v2.3.0.beta6+148, CPU usage has gone up by around 30%. I have no idea what’s causing it.

Network traffic:

Disk activity:

Any ideas?

Quick puzzle: When did I upgrade?

2 Likes

More detailed CPU data:

It seems to start exactly 15 minutes past the hour, every hour, and lasts almost exactly 20 minutes.

EDIT: The peak CPU usage periods coincide with high disk READ activity (gigabytes of disk read).

So:

(1) Something is reading gigabytes of data off the disk every hour

(2) Something is doing a lot of processing every hour

(3) For around 15min.
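To pin down the "same minute every hour" pattern from raw monitoring data rather than by eye, something like the following could work. It is only a sketch: the sample data is made up, and `spike_windows` and the 80% threshold are names and values chosen for illustration, not anything Discourse or a monitoring tool provides.

```ruby
# Groups consecutive samples at or above the threshold into spike windows.
# samples: array of [Time, cpu_percent] pairs, assumed sorted by time.
def spike_windows(samples, threshold)
  samples
    .chunk_while { |a, b| (a[1] >= threshold) == (b[1] >= threshold) }
    .select { |chunk| chunk.first[1] >= threshold }
    .map { |chunk| { start: chunk.first[0], finish: chunk.last[0] } }
end

# Made-up data: one sample per minute for two hours,
# busy from minute 15 through minute 34 of each hour.
base = Time.utc(2000, 1, 1, 10, 0, 0)
samples = (0...120).map do |m|
  t = base + m * 60
  [t, (15..34).cover?(t.min) ? 95.0 : 10.0]
end

windows = spike_windows(samples, 80.0)
windows.each do |w|
  minutes = ((w[:finish] - w[:start]) / 60).round + 1
  puts "spike starts at minute #{w[:start].min} past the hour, about #{minutes} min long"
end
```

If every window starts at the same minute of the hour, a scheduled job is a much more likely suspect than user traffic.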

1 Like

When I ran glances during high CPU activity, the process showed 98% CPU with main: discourse discourse.

However, when I ran top, it showed postmaster with user lxd.

There was an upgrade a while back that caused all images to be re-processed. That’s the likely cause.

3 Likes

You can visit /sidekiq/scheduler/history and see which jobs run during the same time window as the CPU spikes.

6 Likes

Could be!

I guess I’ll wait for a few more days to see…

The only thing that is hundreds of ms is Jobs::PeriodicalUpdates, and that’s only 400-500ms. Everything else is <100ms.

Well, I get it. During the exact time of the CPU spike, only one Sidekiq task is active, Jobs::CleanUpUploads.

And the next task does not appear until exactly at end of the CPU spike.
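The matching done by eye above (which job was active during the spike) can be sketched in code. This is a rough illustration only: `JobRun` is a hypothetical record, the timestamps are made up, and a blank duration is assumed to mean the job was still running.

```ruby
# Hypothetical record of one scheduled job run (name, start time, duration in seconds).
JobRun = Struct.new(:name, :started_at, :duration_s)

# Returns the names of jobs whose run interval overlaps the spike window.
# A nil duration (blank in the history) is assumed to mean "still running",
# so such a run is treated as lasting until the end of the window.
def jobs_overlapping(runs, spike_start, spike_end)
  overlapping = runs.select do |run|
    run_end = run.duration_s ? run.started_at + run.duration_s : spike_end
    run.started_at < spike_end && run_end > spike_start
  end
  overlapping.map(&:name)
end

# Made-up history entries for illustration.
runs = [
  JobRun.new("Jobs::PeriodicalUpdates", Time.utc(2000, 1, 1, 10, 14, 0), 0.45),
  JobRun.new("Jobs::CleanUpUploads",    Time.utc(2000, 1, 1, 10, 15, 0), nil),
  JobRun.new("Jobs::PeriodicalUpdates", Time.utc(2000, 1, 1, 10, 35, 30), 0.5),
]

spike_start = Time.utc(2000, 1, 1, 10, 15, 0)
spike_end   = Time.utc(2000, 1, 1, 10, 35, 0)

suspects = jobs_overlapping(runs, spike_start, spike_end)
puts suspects.inspect
```

With these sample entries, only the job with the blank duration falls inside the window.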

And the duration field in the history is blank. Perhaps it overflowed the number field?

Could this be stuck in an infinite loop?

(1) Seems like Jobs::CleanUpUploads is the culprit

(2) It runs for 20min non-stop, holding off all other Sidekiq tasks

(3) It reads 1-2GB worth of data from the disk

(4) It doesn’t write much data to the disk

(5) It does NOT incur any network traffic (all my uploads are stored in an Azure Blob storage)

(6) It keeps running for 20min. EVERY SINGLE HOUR. I don’t have that many uploads.

It almost feels like the task reads the list of uploads from the database, decides that all of them require processing, then tries to process each upload one by one, only to fail every time because it can’t find the file(s) on the disk. One hour later, it repeats.

3 Likes

Oh something in the plugin can trigger an odd code path here.

https://github.com/discourse/discourse/blob/master/app/jobs/scheduled/clean_up_uploads.rb

As we are not running this plugin in production, it’s not as rock solid as our S3 code.

I recommend paying attention to the list of server processes when this is happening. It should be either a PostgreSQL query or the job trying to find the non-local uploads on the disk.
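One way to capture that process list while the spike is in progress is to save the output of `ps -eo pid,user,pcpu,comm` and rank it by CPU. The sketch below assumes that four-column layout; `top_cpu_processes` is a hypothetical helper and the sample output is made up.

```ruby
# Parses the text output of `ps -eo pid,user,pcpu,comm`
# and returns the n busiest processes, highest CPU first.
def top_cpu_processes(ps_output, n = 5)
  rows = ps_output.lines.drop(1).reject { |line| line.strip.empty? }.map do |line|
    pid, user, pcpu, command = line.strip.split(/\s+/, 4)
    { pid: pid.to_i, user: user, cpu: pcpu.to_f, command: command }
  end
  rows.sort_by { |row| -row[:cpu] }.first(n)
end

# Made-up sample output for illustration.
sample = <<~PS
  PID USER      %CPU COMMAND
    1 root       0.0 runsvdir
  812 postgres  97.5 postmaster
  640 discourse  1.2 ruby
PS

busiest = top_cpu_processes(sample, 2)
busiest.each { |row| puts "#{row[:cpu]}% #{row[:command]} (pid #{row[:pid]}, user #{row[:user]})" }

# On a live Linux host, the input could come from the command itself:
#   top_cpu_processes(`ps -eo pid,user,pcpu,comm`)
```

If postmaster is at the top, the time is going into a PostgreSQL query; if a ruby process is, it is more likely the job itself.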

5 Likes

This is something @tgxworld should maybe have a peek at if it’s an Azure edge case?

1 Like

Hmm, the clean up uploads job doesn’t attempt to process any uploads, since it only deletes orphaned uploads. Could you run the following manually to see if it triggers the same spike?

cd /var/discourse
./launcher enter app
rails c
Jobs::CleanUpUploads.new.execute({})
6 Likes
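To get an actual duration out of the manual run (the scheduler history showed a blank one), the call can be wrapped in a timer. This is a minimal sketch using Ruby's standard `Benchmark` module; `time_job` is a hypothetical helper, not part of Discourse.

```ruby
require "benchmark"

# Runs the given block, prints how long it took, and returns the elapsed seconds.
def time_job(label)
  elapsed = Benchmark.realtime { yield }
  puts format("%s took %.1f s", label, elapsed)
  elapsed
end

# Stand-in workload so the sketch runs outside Discourse.
elapsed = time_job("sleep") { sleep 0.05 }

# In the Discourse rails console, the same helper could wrap the real job:
#   time_job("CleanUpUploads") { Jobs::CleanUpUploads.new.execute({}) }
```

Running it a second time straight after the first shows whether the job has less work to do on a repeat run or starts from scratch each time.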

Yes, it runs for a long time occupying CPU:

Strangely it looks like it magically sorted itself out… @schungx should we close this?

2 Likes

OK!

Yeah, just like MAGIC!

2 Likes