Migrate:multisite fails on second site when migrating from scratch

I’ve just started to spin up our discourse instance, and when using multisite:migrate it failed on the second database:

https://gist.github.com/wryun/e1e38f7065e75d46edde10d399bd194c

This may be caused by the changes in 40fa96777ddad4698df20d6b8a61afcf87743ed7, where a refactoring removed a lot of Discourse.reset_active_record_cache, which would clear the leftover column information from the first migration (it appears to think a whole bunch of columns exist that don’t).

Or maybe I just did something wrong!

For now, I manually changed the failing migration to add a reset_active_record_cache and it succeeded, but I suspect this will recur when I next add a site.

Discourse version used: de47b35b2d096d0fc9ac22e8cb85a38bba39bd70

1 Like

This migration from 2013 is a problem.

https://github.com/discourse/discourse/blob/1cacfddda45c414216e148ea72cc40c897271ace/db/migrate/20130221215017_add_description_to_categories.rb#L11-L15

I am patching it so it is less obnoxious:

Can you re-test? There may be more of these which we need to patch

3 Likes

Why are we carrying migrations from six years ago?

1 Like

That is how it works … in fact there is no simple tool for flattening migrations in existence afaik, it would be brand new territory for us. If we did not add the description column on the category table in 2013 it simply would not be there.

We are an old project now and it may make sense to fold a big pile of migrations into one but we are talking weeks of high risk work to get it perfectly right.

4 Likes

You don’t want to do that. It’s a LOT of work and it would bring you nothing. The beauty of the RoR migrations mechanism is that you can upgrade /any/ version of Discourse to the latest. That would break as well.

Pointless; statistically nobody is running Discourse 1.0 out there. We know because we see the version pings…

1 Like

Maybe not running - but last month we did upgrade someone with an archived 1.5 export which they kept in a drawer on the attic. And the beauty is that we just could do that.

1 Like

Yes but 1.5 is years ahead of 1.0. We see the version ping metrics so we know what’s in use. And trust me, 1.0 is … not… in use.

2 Likes

Note, I am not in principle agains making our No database -> Yes database process somewhat faster and take say 200ms instead of 3 seconds like it is today. What I don’t want to do though is spend weeks on making this happen, so many other :lion:s to tame.

2 Likes

Thanks for your prompt fix! I should get a chance to test this later today, but no guarantees (running on Australian time here).

However, I’m not entirely convinced that patching every migration is the best option - though you almost certainly should ignore me, since I have no Rails experience. Isn’t it a principled decision to reset the cache at the start of every multisite migration to make sure that previous runs can’t interfere? Or - even better - disable the cache entirely for the migration (possible?), or somehow make every migration entirely separate? It also seems particularly dangerous to me to patch ‘old’ migrations that you know have worked properly (for some value of properly); you really don’t want to make a mistake…

The other fun thing with the current solution is that a random migration is resetting the cache in certain paths, which could easily cause inconsistent failures (i.e. you have old category data, you get the cache reset, everything works later, and if you don’t it fails…).

We got to catch every occurrence of corruption of our schema cache, cause its like a little timebomb waiting to explode.

  • migration 1 … corrupts schema cache
    3 years pass
  • migration N … incorrectly uses active record models (which is not something we should be doing) and then explodes.

The best thing is simply never ever to touch AR models in migrations. I would be far more open to a fix that hard bans that from migrations and deal with the pain of fixing this in the existing migrations we have.

4 Likes

True, a hard ban would be nice, but in practice I think what happened here is more likely to recur: that it works fine to spin the first db from scratch (more likely to be exercised by tests/caught by the first person attempting the migration?), but the multisite migration is corrupted (and it seems clear that this is not run frequently by either tests or users…). Note that the multisite migration fails even if the first migration is completely finished, so I assume there is a source of cache corruption outside actual migration steps.

EDIT: will it ever fail in non-multisite? I would have assumed that ActiveRecord would correctly detect changes in column information caused via the connection (rather than db switches, which seem inherently unexpected)?

2 Likes

Ran the multisite:migrate with that patched in. Worked fine, so no other bad activerecord uses, I guess? Thanks.

3 Likes

The more we use bad patterns the higher the risk that it will fail, even in a single site.

Looks like we are safe for now, so closing… thanks for reporting!

2 Likes