Guys, i’m just trying to migrate my phpBB3 forum to discourse. That forum has over 1 milion of posts and import is gonna take about 50 hours. That’s a huge down-time for my forum. You think that, for example, running several import tasks in parallell would speed things even a bit? Will that even work ?
Why don’t you run an initial import and then run an incremental backup for the data created in the meantime?
I’m not sure what you exactly mean. Are those features you’ve provided (
initial import and
incremental backup) something that comes with the importer? Is that documented anywhere? (or ideally - is such attempt to importing documented anywhere?)
Sorry, i’ve installed discourse yesterday to check how the import will work and don’t know much about it nor it’s importers yet.
Most of our importers support incremental imports, meaning you can run a first import (which I called initial import) and then you can run another imports using the same database dump but with new data and it will be smart enough to only import new stuff. And it’ll be much faster too
That way, you can take multiple days for the initial import without having to put your current forum in readonly mode. You only have to put it in readonly mode for the duration of an incremental import.
Hmm, sounds good but i still feel quite unconfident. Is that a scenario we’re talking about?:
- Initial database dump of phpbb
- “initial import” (what’s actually that thing? Is that some kind of import of just the crucial data or something like that?)
- Regular import described in original post in that thread (without disabling phpbb)
- When regular import is done, i’m making another dump from original phpbb database
- I’m running importer again and it imports just the new stuff
Is that what you’re describing? I feel like i don’t know about something Especially why that would be faster than 50 hours in my case.
- You take a database dump of your phpbb
- You run an import (this is your initial import). It can take as long as it needs since your current forum is still operating
- You put your phpbb in readonly mode
- You take another database dump of your phpbb
- You run another import (this is your incremental import). Since you already imported most of the data during your initial import, it’ll go much much faster
- You put some redirections in place to redirect your users to your new forum
You can do steps 4 & 5 as many times as you want
With decent hardware it will be a lot faster. I get more than 1000 posts/minute on my development machine. Not sure how it will scale with a larger forum, since I’ve never done a large import, but it shouldn’t be that much slower.
Be warned: There’s a small risk of data loss when you do an incremental import, because existing posts and users aren’t updated during incremental imports. You’ll lose post edits that occurred between database dumps as well as changes made to user profiles.
oh, right, didn’t realised that. Profile changes lost is not a big deal, post edits is worst but still, i think i can live with that. Thanks for a warning. Actually, i think that it wouldn’t be too hard to implement a new edits check for importer (by comparing updated_at dates). But i don’t know much about the importing framework so i’m not sure that’s even possible. I need to digg it more some day.
I have just a small concern about that whole importing thing - My forum has 12869 users. I know that discourse sends a digest mails for users that was not online for some time. Does it mean that, after turning discourse/sidekiq on after import, it will send 12k digest mails to everyone each week? Or is there any “last active at” limit for user digests (i mean, user will receive digest only if he was active in less than, say, 2 months) ?
The thing is that forum was pretty active some time ago but it died a bit (mainly because of using a phpbb hehe) so the user base is huge but there are not so many active users right now. I don’t want to waste my money and mandrill usage limits just to send digest to users that was not online since 2006.
After 2 days of importing, importer has imported 45% of posts and topics and it seems that the longer it runs, the slower it gets. At the beginning, 0.01% was taking about minute or so to be imported.
Right now, when i’m at 45%, 0.01% takes about…an hour. Yes, an hour (even after restarting the importer and restarting the whole VM. That’s a horrible thing, it let’s me think it would take weeks to import that forum.
I’ve just started a local virtual machine (virtualbox) which is stronger than my web server, hopefully import will be faster. But, so far, it seems that the speed is more or less the same.
Is there ANYTHING i can do to make that process faster? Some code changes in importer, params passed, moving to mruby or anything like that? I would really appreciate any help and ideas.
ps. another horrible thing is that even restarting importer takes a huge amount of time. Running importer again and waiting for skipped 1000’s takes about 2 hours each time.
- Try running MYISAM over INNODB
- GIve PG more memory, consider disabling fsync and stuff like that
- Run it on a faster computer … Ideally on bare metal … not on a VM
It would be great to know where all the time is spent during the import. Unfortunately my database dumps are all quite small. The largest one has 15,000 posts.
I’d really like to look into the performance issues, but I can’t do much unless someone gives me access to a larger database.
It’s possible that we’ve got a Shlemiel The Painter algorithm going on here.
Can you share your DB in confidence with @gerhard? We trust him if that helps
If the MySQL queries are slowing down than it could be the
LIMIT ... OFFSET in the queries.
Using OFFSET is quite bad for query performance.
I have an experimental version of the importer that doesn’t use it anymore. However, on my small database replacing
OFFSET didn’t make a difference on runtime…
I’m already on MyISAM
I’m gonna try some stuff from this doc and let you guys know: PostgreSQL: Documentation: 10: 14.5. Non-Durable Settings
My production machine has 2 GB of RAM and 2k MHz processor. My VirtualMachine is running on MacBook and i gave it ~6 GB of RAM and 2 cores. Anyway, importing process speed on both machines seems to be exactly the same. At least on 4% progress.
Running it on faster computer is my last hope if everything will fail so i’m keeping it for later
Can i debug that somehow? Turning on some verbose mode or something like that?
Oh yes, i would love to do that but i’m not the guy who can make such decision. I mean, i’m “working” (it’s rather a friendly favor) for some guys who runs that phpBB forum. It’s the biggest forum about the topic it is about in the entire country so it would be kind of risky to share it’s users database just like that. Would that be even legal? (trust me, i would love to make things simple and just send it to gerhard but, as you hopefully understand, i just can’t simply do that. Will speak with forum owner and let you know. Maybe replacing user emails (and other data) with some random ones will do the trick)
Is that available anywhere?
edit: ok, assume it’s here https://github.com/gschlager/discourse/tree/phpbb3-importer-no-offset
Pretty sure there’s something we can do here to improve that part. Skipping already imported content should take minutes, not hours… Unfortunately, without the database dump, it’s hard to do. Hopefully you’ll be able to share it.
Yeah, I’m not interested in any private data, so removing email addresses, names, IP addresses and private messages would be fine. I’m assuming that the forum is public: IANAL, but I guess there shouldn’t be a legal issue with me seeing the topics and posts in the database.
Yes, that’s it.
What about completely empty’ing the users table? Is it even necessary?
Woah! I need to say that using your fork improved “skipping” part dramatically. It took about ~10 minutes instead of 2 hours, awesome!
If you empty the users table, you’ll end up with all your posts being affected to the
system user. While it’s fine for testing purposes, it will not exercice the code that is dealing with importing users and will give you slightly wrong time estimates.