I’ve successfully imported my vBulletin 5 forum into Discourse. While the overall import script works fine, I’m encountering errors when importing attachments. The attachments are stored in my database and include the following extensions: java, html, jpg, png, txt, rtf, zip, js, and xml.
Upon debugging the import_attachment action, I discovered that only attachments with the java extension are being imported correctly. The script fails for attachments with other extensions.
Has anyone else in the community faced issues when importing attachments with these file extensions? Does anyone have insights on why the script might be failing with these particular file types?
Here’s a brief overview of the issue:
The first three files in my database have the java extension and are imported without problems.
The script fails when it encounters a file with the jpg extension.
@pfaffman Any advice or solutions would be greatly appreciated!
begin
  upl_obj = create_upload(post.user.id, filename, real_filename)
  if upl_obj&.persisted?
    html = html_for_upload(upl_obj, real_filename)
    # Append the upload markup only if the post does not already contain it.
    if !post.raw[html]
      post.raw += "\n\n#{html}\n\n"
      post.save!
      UploadReference.ensure_exist!(upload_ids: [upl_obj.id], target: post)
    end
  else
    # upl_obj can be nil here, so guard the error lookup as well.
    errors = upl_obj&.errors&.full_messages&.join(", ")
    puts "Failed to create upload for #{filename}: #{errors}"
    next
  end
rescue => e
  puts "Error processing file #{filename}: #{e.message}"
  next
end
My guess, which could be wrong, is that a problem with newline encoding makes the data in the binary files wrong because a newline character is encoded as data. If the only files that work are ASCII, it’s a good bet.
Almost. It is not newline encoding: the files are being treated as text, and are corrupted as a result.
EF BF BD is the UTF-8 byte sequence for ‘REPLACEMENT CHARACTER’ (U+FFFD). This is indicative of a file being treated as text instead of binary.
A JPEG image starts with ff d8 ff e0 xx xx 4a 46 49 46 00
You can see that the first four bytes have each been replaced with EF BF BD.
So your images are indeed corrupted. This is not a problem with the importer, this is a problem with the database, as @pfaffman already said. If you have copied this database from another server, you might want to check if this is already an issue in the original database. This could also only be happening on the oldest images (if this happened a long time ago). Just remove the exit line and see what happens.
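To check which rows are affected before running the import again, something like the sketch below might help. It is only a heuristic and not part of the import script: `diagnose_blob` and the constants are names I made up, and a legitimate text file could contain U+FFFD on its own, so treat a hit as a hint rather than proof.

```ruby
# Hypothetical helper, not part of the Discourse import script.
# It looks at the raw bytes of an attachment and reports whether they
# look like they went through a lossy text conversion.

REPLACEMENT_BYTES = "\xEF\xBF\xBD".b   # UTF-8 encoding of U+FFFD
JPEG_MAGIC        = "\xFF\xD8\xFF".b   # start of every JPEG file
PNG_MAGIC         = "\x89PNG\r\n\x1A\n".b

def diagnose_blob(data)
  bytes = data.b # binary copy, so comparisons are byte-wise

  # Any EF BF BD sequence suggests the blob was decoded as text at some point.
  return :text_mangled if bytes.include?(REPLACEMENT_BYTES)
  return :jpeg if bytes.start_with?(JPEG_MAGIC)
  return :png if bytes.start_with?(PNG_MAGIC)

  :unknown
end

# Simulate the corruption: converting binary data to UTF-8 with replacement
# turns every byte above 0x7F into U+FFFD.
healthy = "\xFF\xD8\xFF\xE0\x00\x10JFIF\x00".b
mangled = healthy.encode("UTF-8", invalid: :replace, undef: :replace)

puts diagnose_blob(healthy)
puts diagnose_blob(mangled)
```

Running this over the attachment data in both the original and the copied database should show whether the damage was already there before the copy.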
@RGJ Thanks for the help. I tried importing a new database with the correct images, and while it imported, not all attachments were fully imported. I’m encountering errors like this:
If I recall correctly, all [ATTACH] tags are removed from the posts since they are superfluous. That probably doesn’t work here because it does not expect JSON data in it. It would be a matter of looking up the place where they are being removed and modifying that code to account for the JSON data inside of the tag.
Before importing attachments, I notice that posts with images contain [ATTACH] tags. After the import, some of these tags are correctly filled while others are left empty. Why is that?
I think some vBulletin versions attach them by adding them to the database and some include bbcode like that. I think that I’ve modified the importer to handle those before.
Oh, I didn’t notice the json before. Do you expect these json files to be attachments embedded into the posts? What do those posts look like in vBulletin?
I believe the other errors are because those posts were not imported for some reason (like the parent topic was deleted or otherwise not imported)
I don’t think these are json files, it’s json metadata.
It looks like vBulletin changed their encoding of attachment locations from
[attach]123[/attach]
to [attach=json]{"data-attachmentid":123}[/attach]
and the importer cannot handle that. It should attach the attachments anyway, these tags are only for positioning them within the post. But the deletion of the tag only happens when they contain a numeric id.
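A rough sketch of what handling both forms could look like is below. This is not the importer's actual code: `ATTACH_TAG`, `attachment_id_from_tag` and `strip_attach_tags` are hypothetical names, and I am assuming the JSON sits in the tag exactly as shown above. If the stored text HTML-escapes the quotes, it would need unescaping first, which this does not do.

```ruby
require "json"

# Hypothetical pattern covering [attach]123[/attach] and
# [attach=json]{"data-attachmentid":123}[/attach], case-insensitively.
ATTACH_TAG = %r{\[attach(?:=(?<mode>[^\]]*))?\](?<body>.*?)\[/attach\]}im

# Returns the attachment id as an Integer, or nil if none can be extracted.
def attachment_id_from_tag(mode, body)
  if mode.to_s.strip.downcase == "json"
    data = JSON.parse(body)
    return nil unless data.is_a?(Hash)
    Integer(data["data-attachmentid"], exception: false)
  else
    body[/\A\s*(\d+)\s*\z/, 1]&.to_i
  end
rescue JSON::ParserError
  nil
end

# Removes every tag whose id could be extracted and returns
# [cleaned_text, ids]. Tags that cannot be parsed are left in place.
def strip_attach_tags(raw)
  ids = []
  cleaned = raw.gsub(ATTACH_TAG) do
    match = Regexp.last_match
    id = attachment_id_from_tag(match[:mode], match[:body])
    if id
      ids << id
      ""
    else
      match[0]
    end
  end
  [cleaned, ids]
end

raw = %(Before [attach]123[/attach] middle [ATTACH=JSON]{"data-attachmentid":456}[/ATTACH] after)
cleaned, ids = strip_attach_tags(raw)
puts ids.inspect
puts cleaned
```

Returning the ids alongside the cleaned text leaves room to use them for positioning later instead of just throwing the tags away.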
A lot of other errors in the screenshot above are independent of this issue.
I thought that I’d seen that sometimes the database linked them to the post and sometimes the bbcode did, and I guess sometimes they both do? And sometimes they live in the database and sometimes they are external files (but I might be remembering some other system on that).