Extreme load times - low disk perf?


(James Figgle) #1

I experience very high page load times on my Discourse install. During times of high load, the site dies completely until I restart the app via the launcher. My instance has 14 available processors and 48 GB of RAM. Even during times of low load, I'm still looking at very high page load times in the thousands of milliseconds, peaking at around 6700 ms. I've read everything I can about tuning Discourse for performance and tried to think of everything possible. Not sure what to do.

Some values I’ve changed in the yaml:

db_shared_buffers: "10GB"
I have this set but it doesn't seem to take effect: Postgres still uses an utterly inconsequential amount of memory (there's enough free RAM to keep my whole DB in RAM)

db_work_mem: "40MB"
UNICORN_WORKERS: 28

RUBY_GC_MALLOC_LIMIT: 90000000

The output of my miniprofiler for my homepage:

What my flamegraph looks like:
http://imgur.com/nF0uHZ0

Numbers from profile-gc:
http://pastebin.com/bBGaaZ7c

profile-gc-time and profile-gc-times did not seem to work; both just returned the same output as profile-gc.

Any ideas?


(Rafael dos Santos Silva) #2

Can you share some numbers about your load?

How many active visitors per week, API requests per day, etc…

I mean, this should be enough raw power for most loads.

And it would be good if you could also share the load numbers from mini-profiler for a slow request (the numbers in the upper left).


(James Figgle) #3

~50k API requests a day.
~565 users a day; about 4k in the past 7 days.

I'll wait for a good slow request and post the miniprofiler numbers.


(James Figgle) #4

Here's a 1297.3 ms request:

I will post better examples as they come; it seems to be running better than usual right now.


(Kane York) #5

Errr, I don’t think that’s the setting you want to change.

Have you tried increasing the UNICORN_WORKERS count?

  ## TODO: How many concurrent web requests are supported?
  ## With 2GB we recommend 3-4 workers, with 1GB only 2
  ## If you have lots of memory, use one or two workers per logical CPU core
  #UNICORN_WORKERS: 3
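That rule of thumb ("one or two workers per logical CPU core") can be sketched as a quick shell check; this assumes GNU coreutils' `nproc` is available, and it's only a starting point, not a guarantee:

```shell
# Suggest a unicorn worker count from the logical core count,
# per the "one or two workers per logical CPU core" comment in app.yml.
cores=$(nproc)
echo "UNICORN_WORKERS: $(( cores * 2 ))"
```

Whatever value this prints, the real ceiling is usually whatever the database can keep up with.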

(James Figgle) #6

Yes, I've currently increased it to 28; that's the setting for the current level of problems. Should I go higher?

Also keep in mind that 33% to 45% of any given request on my site is spent in SQL, so some sort of database optimization here would remove a significant bottleneck.
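For reference, one way to find out which statements are eating that SQL time is the `pg_stat_statements` extension. This is a sketch: it assumes the extension is available and preloaded in postgresql.conf, and the column names are from the PostgreSQL 9.x era:

```sql
-- Requires pg_stat_statements in shared_preload_libraries (postgresql.conf)
-- and a Postgres restart before it records anything.
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top 10 statements by total time spent (PostgreSQL 9.x column names).
SELECT calls, total_time, mean_time, left(query, 80) AS query
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;
```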


(Rafael dos Santos Silva) #7

491 ms to render the index is a lot.

Click on the "39 sql" link to see which query is slow; it's the one with the red background.


(James Figgle) #8

This is the one that’s consistently red:

lib/freedom_patches/active_record_base.rb:7:in `exec_sql'
lib/sql_builder.rb:67:in `exec'
lib/sql_builder.rb:104:in `map_exec'
app/models/topic_tracking_state.rb:136:in `report'
app/controllers/application_controller.rb:345:in `preload_current_user_data'
app/controllers/application_controller.rb:199:in `preload_json'
lib/middleware/anonymous_cache.rb:129:in `call'
config/initializers/100-quiet_logger.rb:10:in `call_with_quiet_assets'
config/initializers/100-silence_logger.rb:26:in `call'
lib/middleware/request_tracker.rb:73:in `call'
lib/scheduler/defer.rb:85:in `process_client'
lib/middleware/unicorn_oobgc.rb:95:in `process_client'
SELECT
u.id AS user_id,
topics.id AS topic_id,
topics.created_at,
highest_post_number,
last_read_post_number,
c.id AS category_id,
tu.notification_level
FROM topics
JOIN users u on u.id = 2
JOIN user_stats AS us ON us.user_id = u.id
JOIN user_options AS uo ON uo.user_id = u.id
JOIN categories c ON c.id = topics.category_id
LEFT JOIN topic_users tu ON tu.topic_id = topics.id AND tu.user_id = u.id
WHERE u.id = 2 AND
topics.archetype <> 'private_message' AND
((1=0) OR ("topics"."deleted_at" IS NULL AND topics.created_at >= GREATEST(CASE
WHEN COALESCE(uo.new_topic_duration_minutes, 2880) = -1 THEN u.created_at
WHEN COALESCE(uo.new_topic_duration_minutes, 2880) = -2 THEN COALESCE(u.previous_visit_at,u.created_at)
ELSE ('2016-03-06 14:31:20.986157'::timestamp - INTERVAL '1 MINUTE' * COALESCE(uo.new_topic_duration_minutes, 2880))
END, us.new_since, '2016-03-01 23:25:20.000000') AND tu.last_read_post_number IS NULL AND COALESCE(tu.notification_level, 2) >= 2)) AND
(topics.visible OR u.admin OR u.moderator) AND
topics.deleted_at IS NULL AND
( NOT c.read_restricted OR u.admin OR category_id IN (
SELECT c2.id FROM categories c2
JOIN category_groups cg ON cg.category_id = c2.id
JOIN group_users gu ON gu.user_id = 2 AND cg.group_id = gu.group_id
WHERE c2.read_restricted )
)
AND NOT EXISTS( SELECT 1 FROM category_users cu
WHERE last_read_post_number IS NULL AND
cu.user_id = 2 AND
cu.category_id = topics.category_id AND
cu.notification_level = 0)


UNION ALL

SELECT
u.id AS user_id,
topics.id AS topic_id,
topics.created_at,
highest_post_number,
last_read_post_number,
c.id AS category_id,
tu.notification_level
FROM topics
JOIN users u on u.id = 2
JOIN user_stats AS us ON us.user_id = u.id
JOIN user_options AS uo ON uo.user_id = u.id
JOIN categories c ON c.id = topics.category_id
LEFT JOIN topic_users tu ON tu.topic_id = topics.id AND tu.user_id = u.id
WHERE u.id = 2 AND
topics.archetype <> 'private_message' AND
(("topics"."deleted_at" IS NULL AND tu.last_read_post_number < topics.highest_post_number AND COALESCE(tu.notification_level, 1) >= 2) OR (1=0)) AND
(topics.visible OR u.admin OR u.moderator) AND
topics.deleted_at IS NULL AND
( NOT c.read_restricted OR u.admin OR category_id IN (
SELECT c2.id FROM categories c2
JOIN category_groups cg ON cg.category_id = c2.id
JOIN group_users gu ON gu.user_id = 2 AND cg.group_id = gu.group_id
WHERE c2.read_restricted )
)
AND NOT EXISTS( SELECT 1 FROM category_users cu
WHERE last_read_post_number IS NULL AND
cu.user_id = 2 AND
cu.category_id = topics.category_id AND
cu.notification_level = 0)
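To see where the time in that query actually goes, a reasonable next step would be to run it under EXPLAIN from psql. This is a sketch: ANALYZE executes the query for real, and BUFFERS separates cache hits from disk reads:

```sql
-- Paste the full query captured by mini-profiler after this prefix.
EXPLAIN (ANALYZE, BUFFERS)
SELECT ...;
-- In the output, "Buffers: shared read=N" counts 8 kB pages that came
-- from disk; large read counts point at storage rather than the planner.
```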

Frequently, category list queries like this one also show up with a lighter red background, but they never use as much time as the previous one:

app/models/category_list.rb:48:in `find_relevant_topics'
app/models/category_list.rb:17:in `initialize'
app/controllers/categories_controller.rb:22:in `new'
app/controllers/categories_controller.rb:22:in `index'
lib/middleware/anonymous_cache.rb:129:in `call'
config/initializers/100-quiet_logger.rb:10:in `call_with_quiet_assets'
config/initializers/100-silence_logger.rb:26:in `call'
lib/middleware/request_tracker.rb:73:in `call'
lib/scheduler/defer.rb:85:in `process_client'
lib/middleware/unicorn_oobgc.rb:95:in `process_client'
SELECT "topics".* FROM "topics" WHERE ("topics"."deleted_at" IS NULL) AND "topics"."id" IN (10320, 10322, 10, 8192, 9, 10364, 5291, 10273, 7, 9897, 8897, 4645, 10361, 9372, 9678, 9327, 8195, 5104, 10307, 10311, 10365, 6, 10314, 10319, 9395, 5, 4290, 10303, 3775, 10313, 7782, 4, 10027, 8504, 10116, 10286, 9988, 9851, 9964, 10317, 10316, 7614, 10321, 10324, 5961, 10212, 562, 10302, 8655, 10323, 9314, 4547, 10198, 10290, 10181, 3255, 9695, 9570, 10283, 10292, 5102, 10315, 10312, 996, 4004, 3849, 10141, 10309, 10326, 8652, 2296, 9841, 10304, 10299, 10271, 10325, 10024, 10310, 1276, 10228, 94, 10298, 9959, 10194, 2656)

(Rafael dos Santos Silva) #9

How many users and topics do you have?

Looks like your database is slow; maybe a VACUUM can solve this.

Are your disks any good?


(James Figgle) #10

3688 topics all time, 1419 users. Not a lot, though topics can often get a good thousand replies.

I'm not sure about the disk quality, but I think they're not SSDs; I'm seeing a SCSI PERC 5/i in lshw. Is there any way to get the database to cache more in memory? There's plenty of room for it to never hit the disk.
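One rough way to check whether reads are actually being served from memory is the buffer-cache hit ratio (a sketch, run from psql against the discourse database):

```sql
-- Database-wide buffer-cache hit ratio; on a dataset this small,
-- anything well below ~0.99 suggests reads are going to disk.
SELECT sum(heap_blks_hit)::float /
       nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0) AS hit_ratio
FROM pg_statio_user_tables;
```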


(Dean Taylor) #11

This query has been reported here:


(Rafael dos Santos Silva) #12

Linux will cache and use all idle memory on its own.

Can you run a VACUUM VERBOSE ANALYZE?

Your topics and users tables are very small to be getting these timings…


(James Figgle) #13

What's the best way to do that and capture the output?


(Rafael dos Santos Silva) #14

Are you using the docker install?

You need to enter the container (launcher enter or launcher ssh) and then connect with psql. Are you familiar with those concepts?
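Roughly, assuming the standard Docker install (paths and database name per the default setup), the session looks like this:

```text
# On the host:
cd /var/discourse
./launcher enter app

# Inside the container:
su postgres -c 'psql discourse'

-- Inside psql: \o copies all query output to a file until \o is repeated.
\o /tmp/vacuum.txt
VACUUM VERBOSE ANALYZE topics;
\o
```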


(James Figgle) #15

Sorry, yes. I was unsure of the best way to write psql's interactive output to a file, so I just used screen's copy-and-paste function.
Here are the results for the topics table, which I assume is the one I should have run it on:

INFO:  vacuuming "public.topics"
INFO:  scanned index "topics_pkey" to remove 467 row versions
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  scanned index "idx_topics_front_page" to remove 467 row versions
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.01 sec.
INFO:  scanned index "idx_topics_user_id_deleted_at" to remove 467 row versions
DETAIL:  CPU 0.01s/0.02u sec elapsed 0.05 sec.
INFO:  scanned index "index_topics_on_bumped_at" to remove 467 row versions
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  scanned index "index_topics_on_created_at_and_visible" to remove 467 row versions
DETAIL:  CPU 0.01s/0.00u sec elapsed 0.02 sec.
INFO:  scanned index "index_topics_on_id_and_deleted_at" to remove 467 row versions
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.01 sec.
INFO:  scanned index "index_topics_on_lower_title" to remove 467 row versions
DETAIL:  CPU 0.00s/0.03u sec elapsed 0.03 sec.
INFO:  scanned index "index_topics_on_pinned_at" to remove 467 row versions
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  scanned index "index_topics_on_pinned_globally" to remove 467 row versions
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "topics": removed 467 row versions in 126 pages
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "topics_pkey" now contains 6031 row versions in 38 pages
DETAIL:  192 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "idx_topics_front_page" now contains 6031 row versions in 180 pages
DETAIL:  467 index row versions were removed.
1 index pages have been deleted, 1 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "idx_topics_user_id_deleted_at" now contains 6012 row versions in 116 pages
DETAIL:  463 index row versions were removed.
37 index pages have been deleted, 36 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "index_topics_on_bumped_at" now contains 6031 row versions in 35 pages
DETAIL:  30 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "index_topics_on_created_at_and_visible" now contains 3691 row versions in 56 pages
DETAIL:  433 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "index_topics_on_id_and_deleted_at" now contains 6031 row versions in 54 pages
DETAIL:  113 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "index_topics_on_lower_title" now contains 6031 row versions in 200 pages
DETAIL:  467 index row versions were removed.
1 index pages have been deleted, 1 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "index_topics_on_pinned_at" now contains 5 row versions in 2 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "index_topics_on_pinned_globally" now contains 0 row versions in 2 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "topics": found 576 removable, 5421 nonremovable row versions in 746 out of 925 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 10645 unused item pointers.
0 pages are entirely empty.
CPU 0.09s/0.19u sec elapsed 0.32 sec.
INFO:  vacuuming "pg_toast.pg_toast_40056"
INFO:  index "pg_toast_40056_index" now contains 0 row versions in 1 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "pg_toast_40056": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  analyzing "public.topics"
INFO:  "topics": scanned 925 of 925 pages, containing 6031 live rows and 0 dead rows; 6031 rows in sample, 6031 estimated total rows
VACUUM

(James Figgle) #16

In addition to topics, I vacuumed categories and category_users because they're also referenced in the slow query. No improvement to page load afterwards; the same query is still causing the issue.

INFO:  vacuuming "public.categories"
INFO:  scanned index "categories_pkey" to remove 26 row versions
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  scanned index "index_categories_on_email_in" to remove 26 row versions
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  scanned index "index_categories_on_topic_count" to remove 26 row versions
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  scanned index "unique_index_categories_on_name" to remove 26 row versions
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "categories": removed 26 row versions in 2 pages
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "categories_pkey" now contains 8 row versions in 2 pages
DETAIL:  26 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "index_categories_on_email_in" now contains 8 row versions in 2 pages
DETAIL:  26 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "index_categories_on_topic_count" now contains 8 row versions in 2 pages
DETAIL:  26 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "unique_index_categories_on_name" now contains 8 row versions in 2 pages
DETAIL:  26 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "categories": found 5 removable, 8 nonremovable row versions in 2 out of 2 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 85 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  vacuuming "pg_toast.pg_toast_40001"
INFO:  index "pg_toast_40001_index" now contains 0 row versions in 1 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "pg_toast_40001": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  analyzing "public.categories"
INFO:  "categories": scanned 2 of 2 pages, containing 8 live rows and 0 dead rows; 8 rows in sample, 8 estimated total rows
VACUUM
INFO:  vacuuming "public.category_users"
INFO:  scanned index "category_users_pkey" to remove 1 row versions
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  scanned index "idx_category_users_u1" to remove 1 row versions
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  scanned index "idx_category_users_u2" to remove 1 row versions
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "category_users": removed 1 row versions in 1 pages
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "category_users_pkey" now contains 7 row versions in 2 pages
DETAIL:  1 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "idx_category_users_u1" now contains 7 row versions in 2 pages
DETAIL:  1 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "idx_category_users_u2" now contains 7 row versions in 2 pages
DETAIL:  1 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "category_users": found 1 removable, 7 nonremovable row versions in 1 out of 1 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  analyzing "public.category_users"
INFO:  "category_users": scanned 1 of 1 pages, containing 7 live rows and 0 dead rows; 7 rows in sample, 7 estimated total rows
VACUUM

(Rafael dos Santos Silva) #17

Okay, so this is trickier than expected.

First, let's get some more concrete info:

  1. What's the CPU model? And what's your host? A VPS?

  2. 28 unicorns are a lot of unicorns, and they may be putting too much load on your (already slow) DB. Let's try 10.

  3. Let's test the disk perf:

dd bs=1M count=512 if=/dev/zero of=test conv=fdatasync
hdparm -Tt /dev/sda
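If it's easier to iterate on, here's a rough Python equivalent of the dd test (a hypothetical helper, same fdatasync idea; spinning disks typically land near ~100 MB/s, SSDs several times higher):

```python
import os
import time
import tempfile

def fsync_write_mb_per_s(size_mb=64, block_kb=1024):
    """Write size_mb of zeros, fsync, and report throughput in MB/s.

    A rough stand-in for `dd bs=1M count=N if=/dev/zero of=test
    conv=fdatasync`: the fsync forces the data to actually reach disk
    before the clock stops, so page cache speed doesn't flatter the result.
    """
    block = b"\0" * (block_kb * 1024)
    blocks = size_mb * 1024 // block_kb
    fd, path = tempfile.mkstemp()
    try:
        start = time.time()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())
        elapsed = max(time.time() - start, 1e-9)
        return size_mb / elapsed
    finally:
        os.remove(path)

if __name__ == "__main__":
    print(f"{fsync_write_mb_per_s():.1f} MB/s")
```

Run it from inside the container (or on the host against the same volume) so it measures the disk the database actually lives on.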

Also, at work I use one server to host Nginx (web), one to host Rails, Redis, and Sidekiq (app), and a separate server to host PostgreSQL (DB). I think separating the data and web containers would be a good start.


#18

He has 14 cores / 28 threads. There's only one Xeon that has that many.


(Rafael dos Santos Silva) #19

Many slow cores isn't the kind of hardware Discourse likes, but I don't think that's what's holding his instance back.


#20

It's true; the guy might find higher performance from a dual-core G3258 running at 4 GHz+.