Huge amount of storage transactions

Hi,

we’re facing huge transaction loads on our storage from time to time. We couldn’t find any schedule or time-pattern for this event but it happens at least once a day. Even the duration changes from 10min to several hours.
While these loads our whole installation behaves slight strange, e.g. reading topics won’t be recognised so they still appear in “New” and/or “Unread”.

It seems like Discourse is moving huge amounts of files. Especially READ - Operations are increasing. We’ve already checked if external traffic is increasing as well but it doesn’t. Only Discourse <-> Storage traffic is affected.
We’ve recognized this behaviour the first time after upgrading from Discourse 2.4.0.beta9 to 2.4.0.beta10 but we’re not sure if it already happened before. We’re now running 2.5.0.beta4.

Our Discourse installation is running in a Azure environment with a Premium Storage attached via SMBv3 which normally runs pretty well.

Can someone explain what is going on? At first we’ve suspected the sidekiq-job MigrateUploadScheme but if this job is responsible for these transactions we should see these high loads way more often than we do. Besides we haven’t found any other job which could be responsible.


Because of “burst IOPS” you can see this peek at ~800k Transactions/30min. After these credits are used up it gets throttled down to ~250k Transaction/30min. So please don’t mind this peek as it’s just a limited/credited bonus of the azure storage tier.
Normally we have 5k - 40k Transaction per 30min.

At this point we don’t know where to look at and any idea/hint is appreciated.

Kind Regards
Sascha

Do you have automated backups enabled? Check the automatic backups enabled and backup frequency site settings

Hi.
No, backups are disabled completely. We’re making use of backup-retention of the PSQL-Instance itself and (not automated) storage snapshots.

Can you enable stats in Postgres, hunting for long running queries or repeated queries?

You need to enable pg stats statement and dig in to the stats it creates

2 Likes

We’ve enabled stats a while ago to iron out some bottlenecks. That’s where my previous Database performance recommendations (by Azure PSQL) post came from.

Here are the top 10 long running queries of the past week:

If you need the full queries, please let me know and it would be interesting why this affects the storage usage.

2 Likes

Probably the first full query for sure, that one is big at 1:00 duration and 14 executions.

Hi. The first query will be triggered/executed via DirectoryItem.refresh_period (i guess).

So, here is the actual query:

Summary
WITH x AS (SELECT
			u.id user_id,
			SUM(CASE WHEN p.id IS NOT NULL AND t.id IS NOT NULL AND ua.action_type = 2 THEN 1 ELSE 0 END) likes_received,
			SUM(CASE WHEN p.id IS NOT NULL AND t.id IS NOT NULL AND ua.action_type = 1 THEN 1 ELSE 0 END) likes_given,
			COALESCE((SELECT COUNT(topic_id) FROM topic_views AS v WHERE v.user_id = u.id AND v.viewed_at > '2019-10-28 23:52:24.911261'), 0) topics_entered,
			COALESCE((SELECT COUNT(id) FROM user_visits AS uv WHERE uv.user_id = u.id AND uv.visited_at > '2019-10-28 23:52:24.911261'), 0) days_visited,
			COALESCE((SELECT SUM(posts_read) FROM user_visits AS uv2 WHERE uv2.user_id = u.id AND uv2.visited_at > '2019-10-28 23:52:24.911261'), 0) posts_read,
			SUM(CASE WHEN t2.id IS NOT NULL AND ua.action_type = 4 THEN 1 ELSE 0 END) topic_count,
			SUM(CASE WHEN p.id IS NOT NULL AND t.id IS NOT NULL AND ua.action_type = 5 THEN 1 ELSE 0 END) post_count
			FROM users AS u
			LEFT OUTER JOIN user_actions AS ua ON ua.user_id = u.id AND COALESCE(ua.created_at, '2019-10-28 23:52:24.911261') > '2019-10-28 23:52:24.911261'
			LEFT OUTER JOIN posts AS p ON ua.target_post_id = p.id AND p.deleted_at IS NULL AND p.post_type = 1 AND NOT p.hidden
			LEFT OUTER JOIN topics AS t ON p.topic_id = t.id AND t.archetype = 'regular' AND t.deleted_at IS NULL AND t.visible
			LEFT OUTER JOIN topics AS t2 ON t2.id = ua.target_topic_id AND t2.archetype = 'regular' AND t2.deleted_at IS NULL AND t2.visible
			LEFT OUTER JOIN categories AS c ON t.category_id = c.id
			WHERE u.active
			AND u.silenced_till IS NULL
			AND u.id > 0
			GROUP BY u.id)
	UPDATE directory_items di SET
		 likes_received = x.likes_received,
		 likes_given = x.likes_given,
		 topics_entered = x.topics_entered,
		 days_visited = x.days_visited,
		 posts_read = x.posts_read,
		 topic_count = x.topic_count,
		 post_count = x.post_count
	FROM x
	WHERE
	x.user_id = di.user_id AND
	di.period_type = 5 AND (
	di.likes_received <> x.likes_received OR
	di.likes_given <> x.likes_given OR
	di.topics_entered <> x.topics_entered OR
	di.days_visited <> x.days_visited OR
	di.posts_read <> x.posts_read OR
	di.topic_count <> x.topic_count OR
	di.post_count <> x.post_count )

May I provide some context so that you can asses it better:
We have ~430k users, 1.6mio topics (without deleted) with 8.4mio posts (without deleted) in 241 categories and 12mio user_actions.

But I still don’t understand why slow queries should cause this high amount of READ operations on the storage (/uploads). May I miss something?

1 Like

This does not sound right. I am confused, if you are on Azure how are you storing files? Is this a single container setup? How are uploads setup?

Directory update is very slow. If you can not afford the cost of updating it you could disable directory https://meta.discourse.org/u we have some very concrete plans to add user search to full page search so you can get away without directory.

3 Likes

Sorry for the confusion. I’ll try to explain how we’ve setup discourse.

So, first of all, it’s not a single container setup. We’ve split it up to use azure’s own services for redis, postgresql and storage.

There are 3 VMs running discourse + nginx. A separate Azure File Share is mounted via SMBv3 on these 3 VMs and this mount-point is attached to the discourse-containers as a volume.
That’s where /public/uploads, /tmp/javascript-cache and /tmp/stylesheet-cache will be stored.

Additionally we make use of Azure Cache for Redis and Azure Database for PostgreSQL.

VM Disks, Storage and Database are separated from each other. Therefore DB load won’t (shouldn’t) affect storage or VM performance and we can make use of the benefits of those services (like database statistic on the PostgreSQL instance you’ve mentioned above and performance recommendations).

This setup also allows us to monitor each service/part on it’s own and we’ve seen that our Azure File Share, where uploads is located, receives a very high amount of transactions (like you can see in my first post). These transactions are mostly READ operations.
So as this storage (File Share) is only used by Discourse itself we’ve tried to figure out which process/job is responsible for these events which occur 1-2 times a day for several minutes up to several hours.

Besides those huge transaction counts this setup works pretty well except some slow running queries which affects performance only in a few cases (e.g. a small amount of users activity summary pages can take up to 15sek to load).

I hope I could explain why I was wondering how DB performance should have an effect on transaction-counts of static files.

Kind Regards and thanks for your efforts so far
Sascha

P.S.
We’re using a custom docker-image on our setup and I totally understand that you can’t/won’t offer support for custom solutions.
The only thing we like to know is which process/job/setting could cause these storage transaction counts which partially slows down the whole setup and what we can do to avoid it.

I think your best bet perf wise is to switch upload storage to s3 or an s3 compatible storage engine + cdn, using an smb share for uploads is not something we ever tested, my guess is that we are checking uploads size daily, which is lightning fast on local, and very slow on smb

2 Likes

Thanks for clarification and your advice. indeed SMB can be very slow if it comes to accessing a lot of files. Most of the time it makes no difference as frequently accessed files got cached by nginx (we frequently apply the changes made to the discourse nginx sample config).
But when it comes to those peeks the performance drops.

We’re looking for other storage solutions for some time now. Using an external s3 (compatible) storage would may break parts of our security concept. Every instance/service involved (db, vm, storage, …) is bound to a private network and inaccessible from public internet. All public traffic is managed by an Azure Application Gateway.
Unfortunately Azure Blob Storage is not S3 compatible but we may should invest some time making use of it.
Current possible solutions are the Discourse Blob Storage plugin or using blobfuse directly inside the container.

Anyway. Thanks for your time and help. Is there a reason why upload size will be checked daily and is there a way to turn that off?

Kind Regards

1 Like

This is likely coming from here:

I guess you can disable pulling hotlinked images or create a monkey patch in a plugin that disables this:

Or just alias du in you container to be a no-op by amending your container config.

2 Likes

Thank you very much, this will help a lot. du will be very expensive using an SMB share at least we’re hosting ~800k files (38GB) on this file share.

We’ve disabled pulling_hotlinked_images already due to potential legal/copyright issues.

I think it’s a little bit to invasive to alias du while it’s a good idea to patch that using a plugin. May we just apply a git-patch while building the image with something like:

def self.used(path)
    output = Discourse::Utils.execute_command('df', '-Pk', path)
    size_line = output.split("\n")[1]
    size_line.split(/\s+/)[2].to_i * 1024
end

As du may be more reliable/accurate I think df should fit our needs and shouldn’t break any other functionality.

4 Likes