We’re facing huge transaction loads on our storage from time to time. We couldn’t find any schedule or time pattern for these events, but they happen at least once a day, and their duration varies from 10 minutes to several hours.
During these loads our whole installation behaves slightly strangely, e.g. read topics aren’t registered, so they still appear in “New” and/or “Unread”.
It seems like Discourse is moving huge amounts of files; READ operations in particular increase. We’ve already checked whether external traffic increases as well, but it doesn’t. Only Discourse <-> storage traffic is affected.
We first noticed this behaviour after upgrading from Discourse 2.4.0.beta9 to 2.4.0.beta10, but we’re not sure whether it already happened before. We’re now running 2.5.0.beta4.
Our Discourse installation runs in an Azure environment with Premium Storage attached via SMBv3, which normally performs pretty well.
Can someone explain what is going on? At first we suspected the Sidekiq job MigrateUploadScheme, but if that job were responsible for these transactions we should see these high loads far more often than we do. Besides that, we haven’t found any other job that could be responsible.
Because of “burst IOPS” you can see a peak at ~800k transactions/30 min. Once those credits are used up, it gets throttled down to ~250k transactions/30 min. So please disregard this peak; it’s just a limited, credit-based bonus of the Azure storage tier.
Normally we have 5k to 40k transactions per 30 min.
At this point we don’t know where to look, and any idea or hint is appreciated.
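One way to narrow down which process is issuing the reads would be to watch the mount point directly while a spike is running. A sketch (the mount path is a placeholder for wherever the file share is mounted on the VM):

```shell
# List processes holding files open under the SMB mount.
# Note: +D walks the whole tree, so on a share with ~800k files
# this command is itself slow; run it only during a spike.
sudo lsof +D /mnt/discourse-share | awk 'NR > 1 { print $1, $2 }' | sort | uniq -c | sort -rn

# Or watch per-process I/O live while a spike is happening:
sudo iotop -o -b -n 5
```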
WITH x AS (SELECT
u.id user_id,
SUM(CASE WHEN p.id IS NOT NULL AND t.id IS NOT NULL AND ua.action_type = 2 THEN 1 ELSE 0 END) likes_received,
SUM(CASE WHEN p.id IS NOT NULL AND t.id IS NOT NULL AND ua.action_type = 1 THEN 1 ELSE 0 END) likes_given,
COALESCE((SELECT COUNT(topic_id) FROM topic_views AS v WHERE v.user_id = u.id AND v.viewed_at > '2019-10-28 23:52:24.911261'), 0) topics_entered,
COALESCE((SELECT COUNT(id) FROM user_visits AS uv WHERE uv.user_id = u.id AND uv.visited_at > '2019-10-28 23:52:24.911261'), 0) days_visited,
COALESCE((SELECT SUM(posts_read) FROM user_visits AS uv2 WHERE uv2.user_id = u.id AND uv2.visited_at > '2019-10-28 23:52:24.911261'), 0) posts_read,
SUM(CASE WHEN t2.id IS NOT NULL AND ua.action_type = 4 THEN 1 ELSE 0 END) topic_count,
SUM(CASE WHEN p.id IS NOT NULL AND t.id IS NOT NULL AND ua.action_type = 5 THEN 1 ELSE 0 END) post_count
FROM users AS u
LEFT OUTER JOIN user_actions AS ua ON ua.user_id = u.id AND COALESCE(ua.created_at, '2019-10-28 23:52:24.911261') > '2019-10-28 23:52:24.911261'
LEFT OUTER JOIN posts AS p ON ua.target_post_id = p.id AND p.deleted_at IS NULL AND p.post_type = 1 AND NOT p.hidden
LEFT OUTER JOIN topics AS t ON p.topic_id = t.id AND t.archetype = 'regular' AND t.deleted_at IS NULL AND t.visible
LEFT OUTER JOIN topics AS t2 ON t2.id = ua.target_topic_id AND t2.archetype = 'regular' AND t2.deleted_at IS NULL AND t2.visible
LEFT OUTER JOIN categories AS c ON t.category_id = c.id
WHERE u.active
AND u.silenced_till IS NULL
AND u.id > 0
GROUP BY u.id)
UPDATE directory_items di SET
likes_received = x.likes_received,
likes_given = x.likes_given,
topics_entered = x.topics_entered,
days_visited = x.days_visited,
posts_read = x.posts_read,
topic_count = x.topic_count,
post_count = x.post_count
FROM x
WHERE
x.user_id = di.user_id AND
di.period_type = 5 AND (
di.likes_received <> x.likes_received OR
di.likes_given <> x.likes_given OR
di.topics_entered <> x.topics_entered OR
di.days_visited <> x.days_visited OR
di.posts_read <> x.posts_read OR
di.topic_count <> x.topic_count OR
di.post_count <> x.post_count )
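For what it’s worth, where this query spends its time can be confirmed by wrapping it in EXPLAIN. A sketch (ANALYZE really executes the UPDATE, so run it inside a transaction and roll back, or use a staging copy of the database; the no-op statement below is only a harmless stand-in for the full query above):

```sql
BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
  -- paste the full WITH x AS (...) UPDATE directory_items ... statement here;
  -- this no-op merely stands in for it
  UPDATE directory_items SET likes_given = likes_given WHERE false;
ROLLBACK;
```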
May I provide some context so that you can assess it better:
We have ~430k users, 1.6 million topics (excluding deleted) with 8.4 million posts (excluding deleted) in 241 categories, and 12 million user_actions.
But I still don’t understand why slow queries should cause such a high number of READ operations on the storage (/uploads). Am I missing something?
This does not sound right. I am confused: if you are on Azure, how are you storing files? Is this a single-container setup? How are uploads set up?
The directory update is very slow. If you cannot afford the cost of updating it, you could disable the directory (https://meta.discourse.org/u). We have some very concrete plans to add user search to full page search, so you could get by without the directory.
Sorry for the confusion. I’ll try to explain how we’ve set up Discourse.
First of all, it’s not a single-container setup. We’ve split it up to use Azure’s own services for Redis, PostgreSQL and storage.
There are 3 VMs running Discourse + nginx. A separate Azure File Share is mounted via SMBv3 on these 3 VMs, and this mount point is attached to the Discourse containers as a volume.
That’s where /public/uploads, /tmp/javascript-cache and /tmp/stylesheet-cache are stored.
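Roughly, the chain looks like this (storage account, share name and paths are illustrative, not our real ones):

```shell
# On each VM: mount the Azure File Share over SMBv3
sudo mount -t cifs //examplestorage.file.core.windows.net/discourse-share /mnt/discourse-share \
  -o vers=3.0,credentials=/etc/smbcredentials/examplestorage.cred,serverino

# Then pass the mount point into the Discourse container as a volume, e.g.
#   -v /mnt/discourse-share/uploads:/var/www/discourse/public/uploads
```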
VM disks, storage and database are separated from each other, so DB load won’t (shouldn’t) affect storage or VM performance, and we can make use of the benefits of those services (like the database statistics on the PostgreSQL instance you mentioned above, and performance recommendations).
This setup also allows us to monitor each service/part on its own, and we’ve seen that our Azure File Share, where /uploads is located, receives a very high number of transactions (as you can see in my first post). These transactions are mostly READ operations.
As this storage (File Share) is only used by Discourse itself, we’ve tried to figure out which process/job is responsible for these events, which occur 1-2 times a day and last from several minutes up to several hours.
Apart from those huge transaction counts, this setup works pretty well, except for some slow-running queries which affect performance only in a few cases (e.g. the activity summary pages of a small number of users can take up to 15 seconds to load).
I hope this explains why I was wondering how DB performance could have an effect on the transaction counts of static files.
Kind Regards and thanks for your efforts so far
Sascha
P.S.
We’re using a custom Docker image in our setup, and I totally understand that you can’t/won’t offer support for custom solutions.
The only thing we’d like to know is which process/job/setting could cause these storage transaction counts, which partially slow down the whole setup, and what we can do to avoid it.
I think your best bet perf-wise is to switch upload storage to S3 or an S3-compatible storage engine plus a CDN. Using an SMB share for uploads is not something we ever tested; my guess is that we are checking uploads size daily, which is lightning fast on local storage and very slow over SMB.
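For a standard discourse_docker install, that switch is mostly a set of environment variables in app.yml. A sketch with placeholder values (DISCOURSE_S3_ENDPOINT is only needed for S3-compatible, non-AWS storage):

```yaml
env:
  DISCOURSE_USE_S3: true
  DISCOURSE_S3_REGION: "us-east-1"                  # placeholder
  DISCOURSE_S3_ENDPOINT: "https://s3.example.com"   # only for S3-compatible storage
  DISCOURSE_S3_ACCESS_KEY_ID: "changeme"
  DISCOURSE_S3_SECRET_ACCESS_KEY: "changeme"
  DISCOURSE_S3_BUCKET: "example-discourse-uploads"
  DISCOURSE_S3_CDN_URL: "https://uploads-cdn.example.com"
```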
Thanks for the clarification and your advice. Indeed, SMB can be very slow when it comes to accessing a lot of files. Most of the time it makes no difference, as frequently accessed files get cached by nginx (we regularly apply the changes made to the Discourse nginx sample config).
But when those peaks occur, performance drops.
We’ve been looking into other storage solutions for some time now. Using an external S3(-compatible) storage might break parts of our security concept: every instance/service involved (DB, VM, storage, …) is bound to a private network and inaccessible from the public internet. All public traffic is managed by an Azure Application Gateway.
Unfortunately Azure Blob Storage is not S3-compatible, but perhaps we should invest some time in making use of it.
Current possible solutions are the Discourse Blob Storage plugin or using blobfuse directly inside the container.
Anyway, thanks for your time and help. Is there a reason why the upload size is checked daily, and is there a way to turn that off?
Thank you very much, this will help a lot. du will be very expensive over an SMB share, at least for us: we’re hosting ~800k files (38 GB) on this file share.
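That cost is easy to demonstrate by timing the same walk over the share versus a local disk (paths are placeholders):

```shell
# Over SMB every stat() is a network round-trip, so walking ~800k files is slow:
time du -sh /mnt/discourse-share/uploads

# The same kind of walk on a local disk is served largely from the page cache:
time du -sh /var/discourse/shared
```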
We’ve already disabled pulling_hotlinked_images due to potential legal/copyright issues.
I think it’s a bit too invasive to alias du, though patching it via a plugin is a good idea. Could we just apply a git patch while building the image, with something like:
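(A sketch of what we have in mind; the patch path, its name and its contents are hypothetical.)

```shell
# In our custom image build, after the Discourse source is in place:
cd /var/www/discourse
git apply /tmp/patches/skip-uploads-disk-usage.patch
```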