Postgres using 100% of CPU


#1

Hi there, I’m seeing an issue on my docker install. It’s on Linode on Ubuntu 14.04.

Postgres is using 100% of the CPU. The forum itself is working just fine, version v1.6.0.beta7 +156. Doesn’t seem any slower or anything.

Here’s a screen capture from “htop”:

Any pointers on how I can diagnose this and remedy it?

I’ve tried rebuilding with ./launcher rebuild and even cleaning up with ./launcher clean. I’ve tried rebooting the server as well.

I have plenty of disk space left (10GB free), and Discourse is the only thing running on this server.

Thanks in advance for any pointers.


(Dean Taylor) #2

Are there any jobs running?

Try opening the following URL for your site:
/sidekiq


#3

Doesn’t look like anything else is running.


#4

Ok, there is one job in the “scheduler” tab that is running.

Jobs::EnsureDbConsistency 41 minutes ago RUNNING -1ms _scheduler_localhost-app:29119:1:29c83b7cc82551f6a2b801e82a1acb38

Should I just wait that out? Seems like it’s checking the database?


(Dean Taylor) #5

To see exactly what queries are being executed you can run something like the following from the console:

cd /var/discourse/
./launcher enter app
sudo -u postgres psql -c "SELECT * FROM pg_stat_activity WHERE state='active';"

Try running this a couple of times with a 3-4 minute wait in between, capturing the output each time.


#6

Ok, looks like it’s running 3 copies of this query (with 3 different pids)

DELETE FROM user_actions USING user_actions ua2 
WHERE (  user_actions.action_type = ua2.action_type AND
  user_actions.user_id = ua2.user_id AND
  user_actions.acting_user_id = ua2.acting_user_id AND
  user_actions.target_post_id = ua2.target_post_id AND
  user_actions.target_post_id > 0 AND 
  user_actions.id > ua2.id 
  )  

Full copy/paste here: datid | datname | pid | usesysid | usename | application_name - Pastebin.com
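
For what it’s worth, that DELETE looks like a de-duplication pass over user_actions: for each pair of rows that match on (action_type, user_id, acting_user_id, target_post_id), it removes the one with the higher id. A read-only count with the same join condition (my guess at an equivalent check, not something from the job itself) can estimate how many duplicates are still pending:

```sql
-- Count duplicate user_actions rows still pending deletion
-- (read-only; same conditions as the DELETE above).
SELECT count(*)
FROM user_actions
JOIN user_actions ua2
  ON  user_actions.action_type    = ua2.action_type
  AND user_actions.user_id        = ua2.user_id
  AND user_actions.acting_user_id = ua2.acting_user_id
  AND user_actions.target_post_id = ua2.target_post_id
WHERE user_actions.target_post_id > 0
  AND user_actions.id > ua2.id;
```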

I’ll let it run for a while and see what happens. Maybe I’m just being impatient :slight_smile:


(Matt Palmer) #7

Yes, you’ll need to let those queries run, because otherwise the scheduler will just keep spawning them and you’ll be playing whack-a-mole with them forever.


#8

Do they start every x minutes or something like that? I currently have 5 of them running, up from 3 a few hours ago.


#9

Well, this morning we’re up to a load average of 12 :slight_smile:

But it does appear to have moved on from the “DELETE” queries, and is running others. I assume this is some sort of database check.

The forum is still responsive with this load, so I’ll let it proceed for a while. Seems a bit crazy though.


(Sam Saffron) #10

Are you running on our latest docker image? How many posts does your forum have?


#11

Yes, I updated the docker image yesterday, which is when this started happening.

I did

git pull
./launcher cleanup
./launcher rebuild app
./launcher cleanup

As for posts and topics, it has 17001 topics, with 178297 posts. They were imported from a phpBB install last year.

Anyway, it eventually got up to 16 instances of that query running, so I just killed them off. They haven’t returned in a few hours. I’ll check if it starts using CPU when that scheduled task starts again.


#12

So it was running fine for most of yesterday after I killed the processes. This morning they’re back, with 8 of those queries running, and a load average of 8.

Is there something I can do about this? Either not have that scheduled task run, or somehow run it manually? If it’s trying to verify the integrity of the database, can I just do that manually so it doesn’t spawn all these processes when the scheduled task runs?
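
If you do end up killing them by hand again, terminating the backends from inside Postgres is safer than killing the OS processes. A sketch, assuming the same pg_stat_activity view used earlier in this thread:

```sql
-- Cleanly terminate every backend currently running the stuck DELETE.
-- Safer than kill -9 on the postgres processes themselves.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND query LIKE 'DELETE FROM user_actions%';
```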


(Rafael dos Santos Silva) #13

If the deletes are taking this long, your tables may be bloated.

EXPLAIN ANALYZE
SELECT * FROM user_actions , user_actions ua2 
WHERE (  user_actions.action_type = ua2.action_type AND
  user_actions.user_id = ua2.user_id AND
  user_actions.acting_user_id = ua2.acting_user_id AND
  user_actions.target_post_id = ua2.target_post_id AND
  user_actions.target_post_id > 0 AND 
  user_actions.id > ua2.id 
  )

The query above may show you where time is being spent the most:

"Nested Loop  (cost=0.84..52037.91 rows=174200175 width=96) (actual time=8612.056..8612.056 rows=0 loops=1)"
"  ->  Index Scan using index_user_actions_on_target_post_id on user_actions  (cost=0.42..3752.75 rows=68381 width=48) (actual time=6.663..38.637 rows=74415 loops=1)"
"        Index Cond: (target_post_id > 0)"
"  ->  Index Scan using index_user_actions_on_user_id_and_action_type on user_actions ua2  (cost=0.42..0.70 rows=1 width=48) (actual time=0.115..0.115 rows=0 loops=74415)"
"        Index Cond: ((user_id = user_actions.user_id) AND (action_type = user_actions.action_type))"
"        Filter: ((user_actions.id > id) AND (user_actions.acting_user_id = acting_user_id) AND (user_actions.target_post_id = target_post_id))"
"        Rows Removed by Filter: 224"
"Total runtime: 8612.127 ms"

The above ran on a 130k-post Discourse instance in 8 seconds.
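
Before vacuuming, you can confirm the bloat by looking at dead-tuple statistics (assuming the default statistics collector is on, which it normally is):

```sql
-- A large n_dead_tup relative to n_live_tup suggests bloat
-- that VACUUM can reclaim.
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'user_actions';
```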

Maybe running this can help:

VACUUM VERBOSE ANALYZE user_actions;
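
On a standard Docker install that would look something like this from the host, assuming the database has the default name discourse:

```sql
cd /var/discourse
./launcher enter app
sudo -u postgres psql discourse -c "VACUUM VERBOSE ANALYZE user_actions;"
```

Note the explicit database name: running it against the default postgres database will just report that the table does not exist.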

#14

That seems to have done the trick.

I killed the (16!) running processes, ran that query, and then manually triggered the Jobs::EnsureDbConsistency job from the Sidekiq web interface. This time it took 5 seconds to run.

Fingers crossed.

Thanks everyone, for your assistance. Your time and attention is definitely appreciated!


(Alexander Semyoshin) #15

Hi to everyone,

I have the same problem, and the trick with VACUUM didn’t help.

I have a lot of connections running “DELETE FROM user_actions…” and it seems that this SQL query never ends.

The table is rather big, over 9 million rows.

Any help or advice is highly appreciated.


#16

So I’m getting the exact same problem again, 2 years later :slight_smile:

It’s still getting stuck on the EnsureDbConsistency task, and spawning new ones until the server load is very high.

All of the connections are the same as before:

DELETE FROM user_actions USING user_actions ua2    
WHERE (user_actions.action_type = ua2.action_type AND
 user_actions.user_id = ua2.user_id AND user_actions.acting_user_id = ua2.acting_user_id AND 
 user_actions.target_post_id > 0 AND user_actions.id > ua2.id
 )

I’ve tried the previous solution, the vacuum command… except now it tells me this:

postgres=# VACUUM VERBOSE ANALYZE user_actions;
ERROR:  relation "user_actions" does not exist

Any help would be appreciated. Is there a command I can run on the Postgres command line that will clean up the whole database?

This keeps spawning new jobs but they never seem to finish.


#17

Replying to my own post, because I realized my mistake :slight_smile:

VACUUM VERBOSE ANALYZE only works if I select the discourse database, of course :slight_smile:
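
For anyone landing here later, the working sequence from inside psql is (database name assumed to be the default discourse):

```sql
-- Connect to the Discourse database first, then vacuum:
\c discourse
VACUUM VERBOSE ANALYZE user_actions;
```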