Just an FYI for people: don’t try this upgrade during prime time.
The 2.6.0b2 upgrade has been running on our server for well over 40 minutes, when normally it takes just a couple of minutes and is usually finished before you check back. I was concerned it was broken, but after logging into postgres I can see a huge update running; it looks like it’s changing the post search data for private messages.
Hopefully it isn’t broken. I guess I’ll find out. Really don’t want to kill it or restart the container mid-upgrade.
Query running:
postgres=# SELECT pid, age(clock_timestamp(), query_start), usename, query
FROM pg_stat_activity
WHERE query != '<IDLE>' AND query NOT ILIKE '%pg_stat_activity%'
ORDER BY query_start desc;

  pid  |       age       |  usename  | query
-------+-----------------+-----------+---------------------------------------------------------------
   698 |                 |           |
   701 |                 | postgres  |
   699 |                 |           |
   697 |                 |           |
   696 |                 |           |
 14572 | 00:10:31.484201 | discourse | UPDATE post_search_data
       |                 |           | SET private_message = X.private_message
       |                 |           | FROM
       |                 |           | (
       |                 |           |   SELECT post_id,
       |                 |           |     CASE WHEN t.archetype = 'private_message' THEN TRUE ELSE FALSE END private_message
       |                 |           |   FROM posts p
       |                 |           |   JOIN post_search_data pd ON pd.post_id = p.id
       |                 |           |   JOIN topics t ON t.id = p.topic_id
       |                 |           |   WHERE pd.private_message IS NULL OR
       |                 |           |     pd.private_message <> CASE WHEN t.archetype = 'private_message' THEN TRUE ELSE FALSE END
       |                 |           |   LIMIT 3000000
       |                 |           | ) X
       |                 |           | WHERE X.post_id = post_search_data.post_id
 14573 | 00:47:02.814489 | discourse | SELECT pg_try_advisory_lock(2859260972035668690)
(7 rows)
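In case anyone else hits this and wonders whether the query is hung or just slow, here’s a rough check (assuming a reasonably recent Postgres, 9.6 or later, where pg_stat_activity exposes wait events): watch the backend’s state and wait events, and as long as the row stays active and the wait events keep changing between runs, it’s still grinding away rather than stuck.

-- run in a psql session as a superuser or the discourse DB user
SELECT pid, state, wait_event_type, wait_event,
       now() - query_start AS runtime
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY runtime DESC;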
Yep, I suggest putting a note in the changelog warning people this one will probably take much longer than most updates, and not to kill it or do anything drastic because that’s expected.
My local site didn’t have that many posts and it was still quite slow. Not 40 minutes slow, but noticeably slower than previous upgrades, maybe 3x-4x?
Thanks @Wingtip, I thought it was just happening to us!
Actually, I had to cancel the rebuild and restart the app because I thought it had gotten stuck on that query you mention. We have 6 million posts and after about 45 minutes it still hadn’t completed, so I guess I’ll have to be prepared for at least an hour of rebuild and warn our users beforehand.
I also don’t use the docker manager and prefer to rebuild from the command line. That way I can watch the log in case anything goes wrong, and I think it’s faster too.
I had a large site that repeatedly failed to bootstrap. It’s a 2-container install, so the old container continued to run while the bootstrap was migrating. I eventually solved the problem by setting SKIP_POST_DEPLOYMENT_MIGRATIONS=1 for the bootstrap and then running the migrations after the new container came up. With SKIP_POST_DEPLOYMENT_MIGRATIONS=1 in the ENV section, the migration was very fast, and the site was able to function normally (though perhaps more slowly) while the migrations ran, which took more than 20 minutes, I think.
I think, but haven’t yet tested, that using this same trick would work to minimize downtime on a single-container installation as well. If I’m right, what you’d do is:
add SKIP_POST_DEPLOYMENT_MIGRATIONS=1 to your app.yml (see the sketch after this list)
./launcher rebuild app
./launcher enter app
SKIP_POST_DEPLOYMENT_MIGRATIONS=0 rake db:migrate
undo the edit in app.yml, unless you plan to remember to do the migrations after every upgrade
maybe rebuild again to make sure nothing is broken while you still remember what you changed; in 4 months when you try again you’ll have no idea, and it’ll be hard for anyone to guess what the problem might be
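For anyone who wants it spelled out, here’s a rough sketch of what those steps look like end to end. It assumes a standard /var/discourse install with the usual env: section in containers/app.yml, so adapt the paths and names to your own setup:

# in containers/app.yml, add this under the existing env: section
env:
  SKIP_POST_DEPLOYMENT_MIGRATIONS: 1   # defer the slow post-deploy migrations

# rebuild with the slow migrations deferred, then run them inside the container
cd /var/discourse
./launcher rebuild app
./launcher enter app
SKIP_POST_DEPLOYMENT_MIGRATIONS=0 rake db:migrate

The rake step runs inside the container, which is why you enter it first; once the deferred migrations finish, take the line back out of app.yml (or accept the manual step after every future upgrade).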
If there were a way to have ./launcher pass SKIP_POST_DEPLOYMENT_MIGRATIONS=1 into things without requiring an edit to app.yml, it’d make things less fiddly for the editor-challenged.
If I do manage to do the work on this that I think I will, I’ll create a new topic reporting what I’ve found. Though the smoke has me sequestered into a room where I don’t have my big monitor. (The pandemic wasn’t enough?! We have to have smoke too? And I’m not even especially close to the darn fire.)
That is fantastic news! (Two other projects got in the way today and somehow my multisite instance is no longer working and playing well with S3.) Thanks very much.
Is there a technical reason why post-upgrade DB changes need to be blocking by default? Is there any way to change that behavior so future upgrades get the site back up quickly and then run the post-upgrade stuff in the background?
IMO anything essential for the upgraded application to function, like DDL, should really be part of the upgrade itself, not deferred to post-upgrade scripts.