What's the fastest way to replace strings with multiple regexes in 1 million posts?

After a vBulletin import, I need to fix a lot of things in the imported messages.

I need to modify, replace, or delete many old BBcode tags.

I was looking at this: Replace a string in all posts

I certainly don’t want to make any mistakes, since I have 1.6M posts.

  1. Is there a way to target a single post to run some tests before replacing the strings in all the posts?
    I created a “test” post in my forum with a bunch of BBcode tags in various contexts.

  2. How long would a string replacement take across 1M+ posts?
    If it takes a lot of time, is there a faster way? Maybe editing the text directly in the database?

  3. Is there a way to make multiple replacements at once (for example, adding new lines before and after [quote] tags, replacing [b] and [i] with ** and *, removing [color] and [indent], etc.)?

Edit: would applying these modifications to the raw post contents via Rails, then rebaking all posts, do the trick?
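
For question 3, one possible way to apply several simple tag swaps in a single pass is Ruby's `String#gsub` with a hash of replacements. This is only a sketch on a plain string: the tag list and the names `TAG_MAP`, `TAG_PATTERN`, and `swap_simple_tags` are made up for illustration, and it only covers literal tags without attributes.

```ruby
# Hypothetical mapping of literal BBCode tags to Markdown (illustration only).
TAG_MAP = {
  "[b]" => "**", "[/b]" => "**",
  "[i]" => "*",  "[/i]" => "*",
  "[indent]" => "", "[/indent]" => ""
}.freeze

# Regexp.union escapes the literal strings, so "[" and "]" are matched as-is.
TAG_PATTERN = Regexp.union(TAG_MAP.keys)

# Replace every known tag in one pass over the text.
def swap_simple_tags(raw)
  raw.gsub(TAG_PATTERN, TAG_MAP)
end

puts swap_simple_tags("[b]bold[/b] and [i]italic[/i] [indent]text[/indent]")
```

As written, the match is case-sensitive, so uppercase variants such as `[B]` would need their own entries or a different pattern.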

Looking for the same solution as well.

Hi @Canapin

FYI from our 1M post migration from vB3 to Discourse around six months ago:

When we migrated from vB we did all this data cleansing and bbcode tag refactoring with a lot of custom Ruby code in the migration script.

We found that worked best for us: cleaning everything up by running a lot of Ruby gsub regex substitutions on the vB posts before they were inserted into the Discourse DB.
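
A minimal sketch of what that kind of pre-processing could look like, assuming the raw vB text is available as a plain Ruby string before it is handed to the importer. The rule list and the names `CLEANUP_RULES` and `clean_raw` are hypothetical examples, not the code used in the migration described here.

```ruby
# Hypothetical cleanup rules, applied in order (patterns are examples only).
CLEANUP_RULES = [
  # Drop [color=...] and [/color], keeping the text inside.
  [/\[\/?color(?:=[^\]]*)?\]/i, ""],
  # Put [quote] and [/quote] tags on their own lines.
  [/\s*(\[\/?quote(?:=[^\]]*)?\])\s*/i, "\n\\1\n"],
  # Collapse runs of three or more newlines left behind by earlier rules.
  [/\n{3,}/, "\n\n"]
].freeze

# Run every rule over the raw text in order and return the cleaned copy.
def clean_raw(raw)
  CLEANUP_RULES.reduce(raw) { |text, (pattern, replacement)| text.gsub(pattern, replacement) }.strip
end

puts clean_raw("Hi [color=red]there[/color][quote]old[/quote]bye")
```

Keeping the rules in one ordered list makes it easy to run the same function over a single test post first and compare the result by eye before committing to a full import.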

Otherwise, you will need to run a lot of PostgreSQL queries against the Discourse raw posts and recook the posts.

After extensive testing, we decided to do all the preprocessing during migration (and not after migration). We found this to be the “faster” way to get a perfect migration done.

HTH

Do you know if there is a way to reimport only the posts without affecting already-imported Discourse data, especially data that is linked to post IDs (users, topics, attachments, permalinks, etc.)?

The import takes so long that if I forget something and have to reimport everything, it will take days again.

When the script was importing the posts, it was creating about 50,000 Discourse posts per hour. Why does a rebake take so much longer?

Hi @Canapin

I know your pain. After our original migration, we wrote a lot of custom Ruby code to clean up over a decade of code copied and pasted from every corner of the planet, not to mention all the mojibake and strange character sets, and the various nested bbcodes, which needed a lot of Ruby-isms to clean up nicely during migration.

It was non-trivial to do and took a long time (and countless migration attempts), but we were, and are, primarily a coding forum, so if your forum is mostly text (and not code), it should be more straightforward.

Our team had to write a lot of “somewhat complex escape this and that” regexes as well, because converting all that legacy nested bbcode to markdown is not trivial when people posted so much nested bbcode over the years. We also stripped out a lot of bbcode that we thought added little value years after it was posted.
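
To illustrate why nesting makes this harder, here is a small sketch that unwraps nested tags from the inside out by repeating a substitution until the text stops changing. The `[size]` tag is only an assumed example of a tag one might strip, and `INNERMOST_SIZE` and `strip_nested_size` are hypothetical names, not part of any importer.

```ruby
# Matches an innermost [size=...]...[/size] pair:
# the body may not contain another opening "[size=".
INNERMOST_SIZE = /\[size=[^\]]*\]((?:(?!\[size=).)*?)\[\/size\]/im

# Unwrap [size] pairs from the inside out until nothing changes.
def strip_nested_size(raw)
  text = raw
  loop do
    unwrapped = text.gsub(INNERMOST_SIZE, "\\1")
    return unwrapped if unwrapped == text
    text = unwrapped
  end
end

puts strip_nested_size("[size=3]big [size=1]small[/size] big[/size]")
```

A single `gsub` pass would only remove the inner pair here; the loop is what handles the outer one. Unbalanced tags are left in place by this sketch.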

To answer your original question above:

Yes, you can just comment out the methods you want to skip in the migration script.

We actually completely rewrote many of the migration methods, but they were a very good starting point and very useful, considering that vB3 migration was and remains unsupported.

I know that, but if I simply start the migration script, it will skip existing posts.

If I delete all my imported Discourse posts before restarting the migration, I suppose that importing them again will assign different Discourse post IDs than the previous migration did, and all references to these posts will likely be broken.

Yes, there are many ways to “skin this cat”, as they say.

I am very happy we only had to do it once in my life and that it is done!

:slight_smile:

If it’s at all possible, it’s much, much easier to get the importer to fix up those things. If you’ve gone live already, it’s just going to be long and painful. You could contrive to have an import script that modifies the already-imported posts, but I’m not aware of any models for how to do it.

So I guess I’ll modify the importer to do my regexes before the import and be as patient as possible then!

Thanks for your replies.
