2.6.0b2 upgrade VERY slow

Just a FYI for people-- don’t try this upgrade during prime time.

The 2.6.0b2 upgrade has been running on our server for well over 40 minutes when normally it takes just a couple of minutes, usually over before you check back. I was concerned it was broken, but logging into postgres I can see a huge update running, looks like it’s changing the post search data for private messages.

Hopefully it isn’t broken. I guess I’ll find out. Really don’t want to kill it or restart the container mid-upgrade.

Query running:

postgres=# SELECT pid, age(clock_timestamp(), query_start), usename, query 
FROM pg_stat_activity 
WHERE query != '<IDLE>' AND query NOT ILIKE '%pg_stat_activity%' 
ORDER BY query_start desc;
  pid  |       age       |  usename  |                                            query                          
                   
-------+-----------------+-----------+---------------------------------------------------------------------------
-------------------
   698 |                 |           | 
   701 |                 | postgres  | 
   699 |                 |           | 
   697 |                 |           | 
   696 |                 |           | 
 14572 | 00:10:31.484201 | discourse | UPDATE post_search_data                                                   
                  +
       |                 |           | SET private_message = X.private_message                                   
                  +
       |                 |           | FROM                                                                      
                  +
       |                 |           | (                                                                         
                  +
       |                 |           |   SELECT post_id,                                                         
                  +
       |                 |           |     CASE WHEN t.archetype = 'private_message' THEN TRUE ELSE FALSE END pri
vate_message      +
       |                 |           |   FROM posts p                                                            
                  +
       |                 |           |   JOIN post_search_data pd ON pd.post_id = p.id                           
                  +
       |                 |           |   JOIN topics t ON t.id = p.topic_id                                      
                  +
       |                 |           |   WHERE pd.private_message IS NULL OR                                     
                  +
       |                 |           |     pd.private_message <> CASE WHEN t.archetype = 'private_message' THEN T
RUE ELSE FALSE END+
       |                 |           |   LIMIT 3000000                                                           
                  +
       |                 |           | ) X                                                                       
                  +
       |                 |           | WHERE X.post_id = post_search_data.post_id                                
                  +
       |                 |           | 
 14573 | 00:47:02.814489 | discourse | SELECT pg_try_advisory_lock(2859260972035668690)
(7 rows)
8 Likes

The upgrade just finished successfully as my panic started to hit its peak. Good times, good times.

4 Likes

The upgrade was quite slow for me on fast colocated hardware as well. Not entirely sure why, but definitely something to be aware of.

4 Likes

Yep, I suggest putting a note in the changelog warning people this one will probably take much longer than most updates, and not to kill it or do anything drastic because that’s expected.

2 Likes

@Wingtip Can I check the number of posts that you have on your forum? Unfortunately, this is going to be slow on sites with a large number of posts.

4 Likes

Yes, we have over 5 million.

1 Like

My local site didn’t have that many posts and it was still quite slow. Not 40 minutes slow but noticeably slower than previous upgrades by 3x-4x maybe?

2 Likes

FWIW:

Just rebuilt and now running 2.6.0.beta2 ( 2aa1482421 )

The build process was not noticeably slower on our server.

2 Likes

Thanks @Wingtip , I thought it was just happening with us!

Actually, I had to cancel the rebuild and restart the app because I thought it had got stuck on that query you mention. We have 6 million posts and after about 45 minutes was still not completed, so I guess I’ll have to be prepared for at least one hour of rebuild and warn our users before it.

3 Likes

A note on the extended Docker Manager and/or rebuild via SSH time has been added to the release notes.

4 Likes

Just pulled out a stop watch and tested a rebuild (site with around 1M posts only) from 2.6.0b1 to b2; and from start to finish, it took 170 seconds.

Screen Shot 2020-08-21 at 6.34.31 PM

This is my second rebuild today, going from b1 to b2 and it seems very fine with no noticeable difference in build speed.

Note: We always upgrade from the command line and do not use the UI.

5 Likes

I also don’t use the docker manager and prefer to use the rebuild from command line. It’s better this way to see the log in case anything goes wrong. I also think it’s faster.

3 Likes

Yes, seems it’s mostly a problem on forums with a lot of posts.

1 Like

I (foolishly) upgraded via the web console, so I didn’t have an up to date log the whole time. Last time I make that mistake.

2 Likes

I had a large site that repeatedly failed to bootstrap. It’s a 2-container install, so the old container continued to run while the bootstrap was migrating. I eventually solved the problem by turning on SKIP_POST_DEPLOYMENT_MIGRATIONS=1 for the bootstrap and then running the migrations after the new container got cranked up. With SKIP_POST_DEPLOYMENT_MIGRATIONS=1 in the ENV section, the migration was very fast, the site was then able to function normally (though perhaps more slowly) while the migrations ran, which took > 20 minutes, I think

I think, but haven’t yet tested, that using this same trick would work to minimize down time during a single-container installation as well. If I’m right, what you’d do is

  • add SKIP_POST_DEPLOYMENT_MIGRATIONS=1 to your app.yml
  • ./launcher rebuild app
  • ./launcher enter app
  • SKIP_POST_DEPLOYMENT_MIGRATIONS=0 rake db:migrate
  • un-do the edit in app.yml unless you plan to remember to to the migrations after every upgrade
  • maybe rebuild again to make sure that you don’t have any problem while you still remember what might have broken your site, as in 4 months when you try again, you’ll have no idea and it’ll be hard for anyone to guess what the problem might be

If there were a way to have ./launcher pass SKIP_POST_DEPLOYMENT_MIGRATIONS=1 into things without requiring an edit to the app.yml it’d make things less fiddly for the editor-challenged.

If I do manage to do the work on this that I think I will, I’ll create a new topic reporting what I’ve found. Though the smoke has me sequestered into a room where I don’t have my big monitor. (The pandemic wasn’t enough?! We have to have smoke too? And I’m not even especially close to the darn fire.)

7 Likes

Good idea, I went and implemented that:

https://github.com/discourse/discourse_docker/pull/481

10 Likes

That is fantastic news! (Two other projects got in the way today and somehow my multisite instance is no longer working and playing well with S3. :crying_cat_face:) Thanks very much.

3 Likes

Is there a technical reason why post-upgrade DB changes need to be blocking by default? Is there any way to change that behavior so future upgrades get the site back up quickly and then run the post-upgrade stuff in the background?

IMO anything essential for the upgraded application to function like DDL should really be part of the upgrade itself, not in post-upgrade scripts.

2 Likes

We started our rebuild from the command line seven hours ago! and is still going… It’s still here:

Any ideas?

Edit: meanwhile I’ve killed the process to have the site live again for our users. But there has to be a better way to do this update.

3 Likes

Do you have a big database?

1 Like