Update takes ~20 minutes and progress update breaks (temporarily) at step multisite:migrate

I just did the weekly update for our Discourse site and found that instead of the usual 1-2 minutes, it took about 20 minutes this time. Also, the progress display temporarily broke with timeout errors in the Chrome console at the step $ SKIP_POST_DEPLOYMENT_MIGRATIONS=1 bundle exec rake multisite:migrate (screenshot below).

I could see on the server that something was happening (load on postmaster and occasionally on ruby), and eventually progress updates recovered, but overall it seems like there might be an issue with the current migrations.

From the update logs, it seems that the migration responsible for the long runtime was DropTrgmIndexesOnUsers. I have included the relevant part of the log directly in this post below (the full log is attached as a text file in case it is needed for further analysis).

********************************************************
*** Please be patient, next steps might take a while ***
********************************************************
...
$ LOAD_PLUGINS=0 bundle exec rake plugin:pull_compatible_all
discourse-bbcode is already at latest compatible version
discourse-data-explorer is already at latest compatible version
docker_manager is already at latest compatible version
$ SKIP_POST_DEPLOYMENT_MIGRATIONS=1 bundle exec rake multisite:migrate
Multisite migrator is running using 1 threads

Migrating default
== 20240912061702 DropUserSearchSimilarResultsSiteSetting: migrating ==========
-- execute("DELETE FROM site_settings WHERE name = 'user_search_similar_results';\n")
   -> 0.0006s
== 20240912061702 DropUserSearchSimilarResultsSiteSetting: migrated (0.0015s) =

== 20240912061806 DropTrgmIndexesOnUsers: migrating ===========================
-- execute("DROP INDEX IF EXISTS index_users_on_username_lower_trgm;\nDROP INDEX IF EXISTS index_users_on_name_trgm;\n")
   -> 1290.7163s
== 20240912061806 DropTrgmIndexesOnUsers: migrated (1290.7169s) ===============

== 20240912212253 IncreaseExternalAvatarUrlLimitTo2000: migrating =============
-- change_column(:single_sign_on_records, :external_avatar_url, :string, {:limit=>2000})
   -> 0.0011s
== 20240912212253 IncreaseExternalAvatarUrlLimitTo2000: migrated (0.0017s) ====

Seeding default
*** Bundling assets. This will take a while *** 
$ bundle exec rake themes:update assets:precompile
Building
Environment: production
building...
...

Full log: upgrade-log-2024-09-15.txt (124.7 KB)

I am not sure there is anything we can do now.

We need to drop the index, and if your forum happened to be super busy at the time, it can take a while.
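
For anyone hitting something similar: a quick way to check whether the drop is simply waiting on locks held by active sessions (rather than being hung) is to look at pg_stat_activity while the migration runs. This is just a sketch, assuming you have psql access to the Discourse database:

```sql
-- Sketch: list sessions currently waiting on a lock and the backends
-- blocking them. Run in psql against the Discourse database while the
-- migration appears stuck.
SELECT pid,
       state,
       wait_event_type,
       wait_event,
       pg_blocking_pids(pid) AS blocked_by,
       left(query, 80)       AS query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
```

If the migration's DROP INDEX shows up there with a non-empty blocked_by list, it is waiting for other transactions to finish rather than doing heavy work itself.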

I assume all is well now?

Yes, things are back to normal now, so no immediate problem.

The main reason I reported it here was that the progress indicator was temporarily broken, which could lead people to believe the update has crashed (not sure how much can really be done about that).

The forum was not busy at the time, though. The server is a test server (we still need to do quite a bit of prep) and I usually run updates after midnight, so I was probably the only user.

So the other question that came to my mind afterwards is: does Discourse effectively require an SSD for communities like ours (~6k users, ~2.5M posts)? The current server is decent (AMD Ryzen 5 3600, 64GB RAM), but it has classical spinning disks, and if that's a problem, we would need to switch to one with SSDs before we go live.

Dropping that index on a server with only 6k users should take milliseconds, even on a spinning metal disk; it's a strong indicator that your DB is starved for resources.

Ok, thanks! I wasn’t aware that DB tuning would be required (we’re using the single container setup) and there’s not much else running on the machine (Discourse, Mailcow, Traefik, Crowdsec). But I’ll have a closer look.
Any pointers to relevant documentation in this regard? The only thing I’m aware of right now is Configure Discourse Docker on servers with more RAM and CPU, and our db_shared_buffers is already at 4096MB.
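
In case it helps with the diagnosis, I can also confirm from inside the container that PostgreSQL actually picked those values up. A minimal check, assuming psql access to the Discourse database (the values reported should correspond to the app.yml settings after a rebuild):

```sql
-- Sketch: show the memory settings the running PostgreSQL instance uses.
SHOW shared_buffers;        -- should match db_shared_buffers (here roughly 4GB)
SHOW work_mem;
SHOW effective_cache_size;
```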

I suspect this is because we didn’t drop the index concurrently.

A normal DROP INDEX acquires an ACCESS EXCLUSIVE lock on the table, blocking other accesses until the index drop can be completed.

This should be fixed in PERF: Update migration to drop index concurrently. by tgxworld · Pull Request #28955 · discourse/discourse · GitHub.
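
For illustration only, since this is the general PostgreSQL behaviour rather than the actual code from that PR:

```sql
-- A plain DROP INDEX takes an ACCESS EXCLUSIVE lock on users and blocks
-- reads and writes on the table until it completes:
DROP INDEX IF EXISTS index_users_on_username_lower_trgm;

-- DROP INDEX CONCURRENTLY instead waits for in-flight queries to finish
-- without blocking new ones, but cannot run inside a transaction block:
DROP INDEX CONCURRENTLY IF EXISTS index_users_on_username_lower_trgm;
```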

@schneeland Thank you for taking the time to report the issue :+1:
