Importing from phpBB3


(Gerhard Schlager) #145

You could do the import on your local machine (maybe by installing Discourse in a virtual machine) and upload the backup afterwards…

(Michael Corliss) #146

Of course, duh! Thanks @gerhard, I will give that a try.

(Benjamin Freeman) #147

You could also load your SQL dump into another database and use SQL to remove every post older than …

(Jay Pfaffman) #148

I sometimes build a development environment on DigitalOcean and do imports there. That way you can have a “big” server for just the time you need it.

(Greg Gurr) #149

What’s the best way to restart the import process from scratch (i.e., start over) and not invoke an incremental import? Initially I’ve got a lot of false starts and errors I need to work through. I’ve got a baseline Discourse backup I can return to at any time, but I’m also leaving my Discourse app container STOPPED for the time being (which I assume means no Sidekiq activity?). So maybe I should delete the import container, or maybe I can just delete some …/shared files? I know I saw some mention of this in a thread, but I can’t find the search terms to pull it up again.

(Jay Pfaffman) #150

Something like this:

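# set the environment once for all following commands
export RAILS_ENV=development
# remove uploads left over from the previous import attempt
rm -fr public/uploads
sleep 5
# drop and recreate the database, then run the migrations
bundle exec rake db:drop db:create db:migrate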

I think those instructions may have you using a production database, so you’d change development to production, maybe.

(Gerhard Schlager) #151

I’d execute the following commands in order to restore a Discourse backup and start a new import.

# make sure neither the app nor the import container are running
/var/discourse/launcher stop import
/var/discourse/launcher stop app

# delete the data that is used by the app and import container
rm -r /var/discourse/shared/standalone

# recreate the app container
/var/discourse/launcher rebuild app

Now you can use the Discourse web interface to restore your backup.

# recreate the import container
/var/discourse/launcher rebuild import

Then, start the import as described in 1.5. Executing the import script

(Greg Gurr) #152

Awesome, and simple. I’ve gone through those steps and it’s successfully chugging away on a clean import. :success:

I’d probably recommend:

  • go back as far as 1.3 Configuring the importer to review your import configuration, source files, etc.
  • remember to have a copy of your backup somewhere other than .../standalone/backups
  • keep a copy of your settings.yml handy as well, since it won’t be in .../standalone/import either

Thanks Gerhard !
Thanks Jay ! (RAILS_ENV is production for this by the way).

(Leo Davidson) #153

Hi, I’m new to Discourse and have been using “v1.8.0.beta2 +72” and this importer and guide to import a phpBB 3.1.10 forum over.

I’ve run into three problems, which I think are bugs. One I think I’ve solved & have a suggested change for. Another I understand, and one I’m not sure about yet.

  1. I think there is a bug in a recent commit of text_processor.rb which corrupts any URLs that are rewritten for the new site: the match incorrectly swallows the ) or ] after the URL, as well as any word or punctuation directly following it. There’s a ) if using Markdown and a ] if BBCode-to-Markdown conversion is disabled, and the commit made it so that the regex goes past them.

    I’m not a Ruby dev, but I think this fixes it. Change this line near the bottom from:
    link_regex = "http(?:s)?://#{host}/viewtopic\\.php\\?(?:\\S*)(?:t=(\\d+)|p=(\\d+)(?:#p\\d+)?)(?:\\S*)"
    to:
    link_regex = "http(?:s)?://#{host}/viewtopic\\.php\\?(?:\\S*)(?:t=(\\d+)|p=(\\d+)(?:#p\\d+)?)(?:[^\\s\\)\\]]*)"

    I’m happy with that so not asking for help there, just want to share the possible fix.

  2. The URL re-writing only seems to succeed for URLs which point to topics older than the one being processed. Our forum has topics which were later edited to add links to topics which did not exist when they were first created, and all of these are left pointing to the old phpBB URLs.

    My Ruby skills and knowledge of Discourse aren’t strong enough to fix this myself, but it looks like the import code only re-writes the URLs if it can find the new topic IDs in a map, and I’m guessing the map has only been filled in up to the post being converted, when it needs to be filled for all the threads first, and then the conversion done?

    Or perhaps I can re-run the conversion at a later stage, as a post-import/post-process thing?

  3. Whether or not I enable BBCode-to-Markdown conversion, any post that uses nested “[ u l ]” lists becomes a big mess. It seems that the code involved just doesn’t cope with them.

    If I edit everything by hand I can see that Discourse supports nested lists OK, so it seems the importer just goes wrong somewhere.
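
The effect of the two link_regex lines in item 1 can be checked outside the importer. This is a minimal sketch, assuming a made-up host (forum.example.com) and a made-up sample sentence; only the two patterns come from the post above.

```ruby
# Hypothetical host of the old phpBB forum; escaped so the dots match literally.
host = Regexp.escape("forum.example.com")

# The two patterns quoted in item 1, compiled unchanged.
old_regex = Regexp.new("http(?:s)?://#{host}/viewtopic\\.php\\?(?:\\S*)(?:t=(\\d+)|p=(\\d+)(?:#p\\d+)?)(?:\\S*)")
new_regex = Regexp.new("http(?:s)?://#{host}/viewtopic\\.php\\?(?:\\S*)(?:t=(\\d+)|p=(\\d+)(?:#p\\d+)?)(?:[^\\s\\)\\]]*)")

# A Markdown link followed by punctuation (made-up sample text).
text = "See [this topic](http://forum.example.com/viewtopic.php?f=2&t=123), please."

old_match = text[old_regex]
new_match = text[new_regex]

puts old_match  # http://forum.example.com/viewtopic.php?f=2&t=123),
puts new_match  # http://forum.example.com/viewtopic.php?f=2&t=123
```

With the original pattern the match runs on through the closing parenthesis and the comma, so a replacement would eat them. With the changed pattern the match stops at the end of the URL, and the captured topic ID is 123 in both cases.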

Any advice about 2 and 3 would be greatly appreciated, and I apologise if I’ve missed an obvious answer somewhere. Most of my time so far has been spent on 1, which I think is now solved (for me, at least).

Thanks for your time!

(Sebastian) #154

What’s the status on phpbb 3.2 compatibility?

(Gerhard Schlager) #155

Thanks for the improved regex!

Yes, that’s an unfortunate restriction. We’d need to add an additional step to the importer that rewrites unmapped URLs at the end of the import process. However, it shouldn’t be a problem as long as permalinks are generated.
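
Such an additional step could look roughly like the sketch below. It is not the importer’s code: the post data, the new URLs and the simplified pattern are all hypothetical, and it only illustrates the idea of filling the complete ID map first and rewriting the links afterwards.

```ruby
# Hypothetical posts keyed by old phpBB topic ID; topic 1 links forward to topic 2.
old_topics = {
  1 => "Newer discussion: http://forum.example.com/viewtopic.php?f=3&t=2 (added in a later edit).",
  2 => "Follow-up to http://forum.example.com/viewtopic.php?t=1",
  3 => "Dead link: http://forum.example.com/viewtopic.php?t=99"
}

# Simplified pattern for this sketch, not the importer's link_regex.
LINK_REGEX = %r{https?://forum\.example\.com/viewtopic\.php\?(?:\S*?&)?t=(\d+)[^\s)\]]*}

# Pass 1: build the complete old-ID => new-URL map before touching any text.
# The "/t/<number>" URLs are invented for illustration.
url_map = old_topics.keys.each_with_index.to_h { |old_id, i| [old_id, "/t/#{100 + i}"] }

# Pass 2: rewrite links; URLs without a mapping are left untouched.
rewritten = old_topics.transform_values do |raw|
  raw.gsub(LINK_REGEX) { |url| url_map.fetch(Regexp.last_match(1).to_i, url) }
end

rewritten.each_value { |raw| puts raw }
```

Because the map is complete before any text is rewritten, the forward link in topic 1 is resolved just like the backward link in topic 2, while the link to the unknown topic 99 stays as it was.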

Yeah, the BBCode to Markdown conversion isn’t perfect yet. The ruby-bbcode-to-md gem will need lots of changes in order to handle nested BBCodes properly.

(Gerhard Schlager) #156

I haven’t tested it with phpBB 3.2 yet. Compatibility depends on how much the database schema changed between 3.1 and 3.2.

(Sebastian) #157

This goes a bit beyond my skills, but I think this list is a good place to start: phpBB • Changelog

(Gerhard Schlager) #158

I’m sorry to say, but importing from phpBB 3.2 is currently not supported.
The latest release of phpBB changed the storage format of the posts’ raw text, which leads to a lot of wrongly formatted posts after the import.

:bulb: I recommend you don’t upgrade to 3.2 if you are thinking about migrating your phpBB forum to Discourse and your forum is still running phpBB 3.0 or 3.1.

(Mitchell Krog) #159

Well done. I’m sure a lot of people are going to need this.

(Sebastian) #160

Damn, too late :slight_smile: Well maybe I can downgrade first or something…

(Leo Davidson) #161

(Not sure if this is specific to the phpBB importer, or something that could affect all of them. I only have knowledge/experience of the phpBB importer, so I thought I’d mention it here first.)

Executive summary:

  • If the site-wide setting clean orphan uploads grace period hours is shorter than the time it takes to do the import, you can wind up losing a lot of attachments and avatar images in the process.

That’s my theory for what happened, at least. I reduced the grace period from the default 48 hours to just 4 hours (figuring it’d be enough, oops!). Then I ran the import, which took about 10 hours with our hardware and amount of data.

The next day we found a huge number of posts with missing attachments, and users with broken avatars, with thousands of files in the uploads/tombstone directory.

I think what happens is the import script creates all the attachments at the start (very quick), then builds the posts (can take many hours). There’s also the long sidekiq processing done once the import script is done and the forum restarts, which may be the real culprit. (If so, this probably does affect other importers.)

If the background task that looks for orphaned attachments kicks off before all the posts are in the database, all the attachments that are for pending posts will look orphaned, and they get deleted if they’re older than the grace period. Then the posts are added, and their attachments are broken.
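
The theory can be illustrated with a toy model. This is only a sketch of the reasoning above, not Discourse’s actual clean-up job: the Upload struct, the orphaned_uploads method and the timestamps are all made up.

```ruby
# Toy model of an upload record: checksum plus creation time.
Upload = Struct.new(:sha1, :created_at)

# An upload counts as orphaned when no post references it
# and it is older than the grace period.
def orphaned_uploads(uploads, referenced_sha1s, now, grace_period_hours)
  cutoff = now - grace_period_hours * 3600
  uploads.select { |u| !referenced_sha1s.include?(u.sha1) && u.created_at < cutoff }
end

import_start = Time.utc(2017, 1, 1, 0, 0, 0)

# All attachments are created right at the start of the import.
uploads = [Upload.new("aaa", import_start), Upload.new("bbb", import_start)]

# Six hours in, only the post that uses "aaa" has been created so far.
referenced = ["aaa"]
now = import_start + 6 * 3600

short_grace = orphaned_uploads(uploads, referenced, now, 4).map(&:sha1)
long_grace = orphaned_uploads(uploads, referenced, now, 48).map(&:sha1)

puts short_grace.inspect  # ["bbb"]
puts long_grace.inspect   # []
```

With a 4-hour grace period the attachment of the still-pending post already looks orphaned six hours into the import. With the 48-hour default it is left alone.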

Mea culpa on my part for messing with a setting before doing the import instead of afterwards. That was silly of me; imports are complex and it’s best to do them under vanilla conditions, then start changing things. I’ve repaired everything now (and learned a lot more SQL and Ruby in the process!). But I wanted to feed back my theory in case it can help avoid the same happening to someone else.

(Gerhard Schlager) #162

Yeah, that’s a general problem that affects all importers. And that’s probably why the default was changed to 2 days in August 2016 :slight_smile:

I guess I should add a warning to the Howto anyway.

BTW: Deleted files are stored in a different directory (uploads/tombstone) before they get removed completely. You have 30 days before a file gets deleted from that directory too. You can always move those files back to the uploads directory if a background job deletes some files during the import.

(Leo Davidson) #163

That makes sense. I wonder if the attachment-clean-up job should automatically postpone itself until the sidekiq queue is idle or something? Or import scripts could maybe show an error and exit if the setting is too low. (I guess few forums will take more than 2 days!)

Part of what made fixing it hard was that the clean-up job also removes attachments from the uploads table, at least as I understand things. I think if the files were moved back they would end up being tombstoned again, unless the clean-up only considers the uploads table itself?

Discourse comes with a rake script for repairing images broken this way, discussed in Old image uploads become broken images. That is quite a slow process but will move the tombstone images back over, add them back to the uploads table and re-bake the posts to make sure they point to the images again.

Unfortunately, it doesn’t do avatars or non-image file attachments, and it’s probably impossible for it to completely fix file attachments as their original filenames are (AFAIK) lost once they are removed from the uploads table. The files in uploads (live or tombstone) all have sha1 checksums as their filenames on disk, and only the database table makes them download to the correct name when the user clicks on them.

Making the rake script put file attachments back without their proper names was a fairly easy mod. (Something I need to clean up and feed back to that thread. I’m slowly working through a list of feedback.) I did that before noticing the name problem, then wrote some custom code to get a mapping from the phpBB database of SHA1 -> real file names, then some more custom code to generate a database insert script to fix all the filenames in Discourse.
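
A rough sketch of that last idea, under loud assumptions: the attachment data is invented, the original custom code is not shown in this thread, and the table and column names in the generated SQL (uploads, original_filename, sha1) are assumptions that should be checked against the real schema before anything is run.

```ruby
require "digest/sha1"

# Hypothetical attachment data: real file name => file contents.
attachments = {
  "holiday photo.jpg" => "fake jpeg bytes",
  "O'Brien report.pdf" => "fake pdf bytes"
}

# Map the SHA1 of each file's contents to its real name.
sha1_to_name = attachments.to_h { |name, content| [Digest::SHA1.hexdigest(content), name] }

# Generate one UPDATE per file, doubling single quotes for SQL.
# Table and column names are assumptions, not verified against Discourse.
statements = sha1_to_name.map do |sha1, name|
  "UPDATE uploads SET original_filename = '#{name.gsub("'", "''")}' WHERE sha1 = '#{sha1}';"
end

puts statements
```

Generating a script instead of writing to the database directly makes it possible to review the statements, and to try them on a copy of the database, before applying them.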

(Alberto) #172

I ran into a similar problem… the solution was to remove the import folder:

rm -R /var/discourse/shared/standalone/import