Import Script High Memory Usage

I’m trying to do an import test from our existing board. We have about 25 million posts to import (normal posts + private conversations), so to speed this up, I have created multiple copies of the import script to run simultaneously and split up the topic load. This worked fine for a couple of days, and over time I noticed the memory usage for each process increasing slowly up to around 2 GB each. Then, the server finally ran out of memory and killed the MySQL source database around the 16 million post mark.

I’ve increased the system memory from 24 GB to 32 GB, but now when I attempt to restart even 1 import process and pick up where it left off, that process is consuming about 10 GB of memory out of the gate before it even starts importing posts. Where before I was able to run 8 simultaneous import processes, I can now only fit 2 into a larger memory pool. Why is there this huge discrepancy between memory usage from a clean install and memory usage when restarting an import after a failure? Is there any way I can reduce this memory footprint so I can speed up the import process again? A server with 128 GB - 256 GB of memory will be prohibitively expensive (and not needed after the import) and running with only 2 import processes will mean the import will take weeks to complete.

That sounds like a regex stuck in a loop or something like that. Print debug messages and skip the problematic row for the import.

The memory usage appears to all occur during the “Loading existing posts…” (or topics) sections of the import script startup, not during the actual post processing. From what I can see, that section is pulling post and topic info from the database and there shouldn’t be any regex involved.

The @posts and @topics variables appear to be used for things like the “topic_lookup_from_imported_post_id” method. That makes sense, except that during the initial run the memory usage never got anywhere close to what I’m seeing now, yet those methods still worked.
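As far as I can tell, that startup phase works roughly like this. This is only an illustrative sketch, not the actual base.rb code; the PostCustomField/import_id query and the exact hash contents are my assumptions, while @posts, @topics, and topic_lookup_from_imported_post_id are the names from the script:

```ruby
# Illustrative sketch only -- not the actual base.rb code.
# On restart, every previously imported post is read back into memory so
# that lookups like topic_lookup_from_imported_post_id keep working.
@posts  = {}   # original (source) post ID -> new Discourse post ID
@topics = {}   # new post ID               -> topic info for that post

puts "Loading existing posts..."
PostCustomField.where(name: "import_id").pluck(:value, :post_id).each do |import_id, post_id|
  @posts[import_id] = post_id
end

puts "Loading existing topics..."
Post.pluck(:id, :topic_id, :post_number).each do |post_id, topic_id, post_number|
  @topics[post_id] = { topic_id: topic_id, post_number: post_number }
end

def topic_lookup_from_imported_post_id(import_id)
  post_id = @posts[import_id]
  @topics[post_id] if post_id
end
```

If that’s roughly right, the preload cost scales with everything already in the Discourse database (all ~16 million imported posts), not with what a single process has imported so far.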

Did you look at the bulk import scripts? They might have a smaller memory footprint.

But it is true that the import scripts keep in memory a map of old user, topic, and post IDs to the new ones, so that’s a fair amount of RAM, especially if you want to run multiple copies.

You understand that after you run the initial import you’ll run it again to import just the new data and it’ll run much faster, right? So after you wait a month to do the initial import the final one won’t take as long.

I’m not aware of any bulk import scripts. Are these in the import_scripts directory somewhere?

Yes, this is where I’m having trouble. Things were comparatively fine during the initial import running 8 import processes, up until the system ran out of memory. Now when I attempt to restart the import process and pick up where it left off, each process is using about 5 times the memory it was using when it crashed the first time.

We need to get a full import completed to have a proper test and to set expectations for when this migration may happen for real. Right now I still don’t have a clear understanding of what to expect regarding things like performance. I’ve also noticed that even at the 16 million post mark, the database size is already over 50% larger than our current database - that is a bit of a surprise. The long import time doesn’t make this impossible to do, but it would certainly be much more convenient if the expectation were framed in days instead of weeks.

See https://github.com/discourse/discourse/tree/master/script/bulk_import.

For topics and posts it’s not really feasible to run parallel imports anyway, since you can’t import a post into a topic unless the topic and all previous posts have already been imported. I suppose you could have parallel processes for users and topics, but not for posts, unless you rewrite the script so that each process pulls in all the posts for its topics, which would allow parallel imports; that is certainly doable, but it’s not the way any of the scripts I’ve used work. And you’d still have the problem of each process keeping an old-to-new ID map in RAM.

Importing 25M posts isn’t convenient. :slight_smile:

This is what I’m doing. I’ve split up the topics so that each topic is only handled by one process. It’s not a perfect split, but it’s several times faster than a single linear process.
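One simple way to get that kind of per-process split is a modulo bucket on the source topic ID. This is only an illustrative sketch, not the actual script: the WORKER_INDEX/WORKER_COUNT environment variables, the connection details, and the source_topics table and column names are all placeholders.

```ruby
# Illustrative sketch of a per-process topic split -- not the actual import script.
# WORKER_INDEX / WORKER_COUNT and the source_topics table/columns are placeholders.
require "mysql2"

worker_index = Integer(ENV.fetch("WORKER_INDEX"))  # 0 .. worker_count - 1
worker_count = Integer(ENV.fetch("WORKER_COUNT"))  # e.g. 8

client = Mysql2::Client.new(host: "localhost", username: "import", database: "source_forum")

# Each worker only sees topics whose source ID falls in its modulo bucket,
# so a topic (and all of its posts) is handled by exactly one process.
topic_rows = client.query(<<~SQL)
  SELECT topicid, title, userid, dateline
  FROM source_topics
  WHERE topicid % #{worker_count} = #{worker_index}
  ORDER BY topicid
SQL
```

Keying the split on the source topic ID keeps the partition stable across restarts, so a given topic always lands on the same worker.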

This is the most confusing part. I assume it is doing this as it goes along, and that’s why I saw the memory usage for each process increase over 3 days. They were getting up to around 2 - 2.5 GB of memory each before the database connection was lost.

Would each process only be maintaining the map for the posts it has imported? If that’s the case, it would explain why the memory usage exploded after restarting the import.

I think so. And the others won’t work correctly because they don’t have the IDs that were imported by the other processes. I don’t think that what you’re doing will work.

You’ll need to either look at the bulk import scripts or rewrite base.rb to keep track of the links to the import IDs some other way.
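One possible direction for that rewrite, as a sketch only and not a drop-in change to base.rb: look imported post IDs up on demand from post_custom_fields, where the standard import scripts record the original ID under the “import_id” name, and keep only a small bounded cache in memory rather than the full map.

```ruby
# Sketch of an on-demand lookup instead of preloading every mapping.
# Assumes the original IDs are stored in post_custom_fields under the
# "import_id" name (which is where the standard import scripts put them).
require "lru_redux"

POST_ID_CACHE = LruRedux::Cache.new(100_000)  # small bounded cache; size is arbitrary

def post_id_from_imported_post_id(import_id)
  key = import_id.to_s
  cached = POST_ID_CACHE[key]
  return cached if cached

  post_id = PostCustomField.where(name: "import_id", value: key).pluck(:post_id).first
  POST_ID_CACHE[key] = post_id if post_id  # don't cache misses; the post may be imported later
  post_id
end
```

This trades RAM for extra queries, so it’s only workable if post_custom_fields has an index covering (name, value); the lru_redux gem here is just one convenient bounded cache and could be swapped for anything similar.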

There’s a good chance you’ll spend more weeks debugging your code than you would have spent just waiting. Single-process CPU speed is your best way to speed things up.

I haven’t seen any issues with this approach so far, though I could see some cropping up in the steps that run after the post import. I’ll probably want to make sure everything stops at that point and then run a single-threaded pass to make sure the rest is handled cleanly.

That said, your suggestion to handle the import ID links differently is probably good in any case. Holding an arbitrarily large amount of data in some variables for the duration of the script isn’t very efficient.