Topic.reset_all_highest exhausts all available disk space

pgcd · October 31, 2024, 2:00pm

While working on a local container where I have imported data from a SMF2 forum with 20+ years of activity, I have stumbled across a showstopper bug with Topic.reset_all_highest.
After the data import, my db shows around 60k regular topics and around 400k private_message topics, and the queries in Topic.reset_all_highest cause some sort of geometric growth of the rows, with the result that my disk runs out of space (with 120Gb free to start with).
I am currently trying to split the queries in manageable chunks and running them directly in Postgres but that’s of course suboptimal (and I’m not sure that it even works and ends up having the correct results).
I have tried to see if anybody else had this kind of problem but I couldn’t find anything so I’m wondering if this could be somehow related to my own setup - I’m using the latest docker version, for the record.

pfaffman · October 31, 2024, 3:42pm

I have done a modest sized import lately that is also hanging apparently indefinitely onm Topic.reset_all_highest and I’ve had to kill the query in Postgres to continue. I haven’t had this problem before and thought that perhaps it was just that my postgres server was overloaded (it’s got a bunch of sites connected).

My next step was to move to another postgres server, but I haven’t gotten around to it yet.

pgcd · October 31, 2024, 4:44pm

After the first two bits of my “split query” experiment went smoothly (X and Y for public topics), I tried with the Z one, and it froze - that is, the query was active according to postgres activity, and top showed the process running at 100%.
So I looked again at the SQL and I found the problem: both queries end like this

      WHERE
        topics.archetype <> 'private_message' AND
        X.topic_id = topics.id AND
        Y.topic_id = topics.id AND 
          (
          topics.highest_staff_post_number <> X.highest_post_number OR
          topics.highest_post_number <> Y.highest_post_number OR
          topics.last_posted_at <> Y.last_posted_at OR
          topics.posts_count <> Y.posts_count OR
          topics.word_count <> Z.word_count
        )

(the other has ‘private_message’ as archetype, of course)
Which means the query’s missing
Z.topic_id = topics.id - which causes the whole geometric increase.

Changing the queries WHERE clause to

      WHERE
        topics.archetype <> 'private_message' AND
        X.topic_id = topics.id AND
        Y.topic_id = topics.id AND 
        Z.topic_id = topics.id AND 
          (
          topics.highest_staff_post_number <> X.highest_post_number OR
          topics.highest_post_number <> Y.highest_post_number OR
          topics.last_posted_at <> Y.last_posted_at OR
          topics.posts_count <> Y.posts_count OR
          topics.word_count <> Z.word_count
        )

fixed the problem for me.

Should I open a PR?

pfaffman · October 31, 2024, 6:14pm

I’d think so. If you can find a commit that broke that, it’d be even more complelling.

pgcd · November 1, 2024, 6:00am

I have opened a PR for this, with some unfortunate limitations (ie. I can’t imagine how to test this change).

github.com/discourse/discourse

FIX: Prevents Topic.reset_all_highest from consuming all resources

discourse:main ← pgcd:fixes_reset_all_highest_geometric_growth

opened 05:57AM - 01 Nov 24 UTC

pgcd

+4 -2

This commit fixes an undesirable geometric query resource usage introduced in b…908abe35a0121e7df69ab3bbced259b6a2b7104 I have been unable to figure out a way of testing this usefully (as I fear it would require creating several thousands of objects), but existing tests pass and a manual test with ~400k topics succeeds after the fix, while it would hang indefinitely and/or consume all disk space before the fix. I have reported the initial problem and my findings in https://meta.discourse.org/t/topic-reset-all-highest-exhausts-all-available-disk-space/333837

sam · November 3, 2024, 9:59pm

This change looks correct to, merging it.

(test wise there should be coverage and a simple test would be plenty to validate it, we just need to confirm no regression)

Topic		Replies	Views
"Resetting topic counters" takes forever and crashes Migration	1	30	July 13, 2024
Improving Instance Performance (Megatopics, Database Size and Extreme Load) Installation	60	4831	October 13, 2020
PostgreSQL runaway IO Installation	37	2947	March 17, 2021
Performance improvements on long topics? Support	30	5423	August 20, 2018
Slow (over 3 sec) SQL query Hosting	4	1370	January 20, 2017

Topic.reset_all_highest exhausts all available disk space

Related topics