将 vBulletin 4 论坛迁移到 Discourse

I’m only a recent discourse convert, so after a lot of trial and error I’ve combined everything above into a full command by command list (thanks @titusca and @enigmaty).

Hopefully this will help (or at least accelerate) fellow newcomers go from start to finish. Would like to incorporate this into the first post given the updates to mysql->mariadb that I think have thrown a lot of confusion into the process.

Background:

  • 1.6 million post transfer.
  • Utilized Digital Ocean Droplet (CPU Optimized 4 vCPU/8GB)

#1 - Install Digital Ocean Discourse 1-click droplet

#2 - Finish discourse install through SSH by following prompts

Open SSH console
root
(yourrootpassword)
(enter)
(yourdomain).com
(etc…)

#3 - Login to SFTP to upload database dump

sftp root@XXX.XXX.XX.XX
y
yes
(yourrootpassword)
put db.sql /var/discourse/shared/standalone/db.sql

#4 - Login to new discourse website to setup admin account

#5 - Login to SSH - begin process

ssh root@XXX.XXX.XX.XX
cd /var/discourse
./launcher start app
docker exec -it app bash
sudo apt-get update
sudo apt-get upgrade
y

#6 - Install MariaDB (replacement for mysql)

apt-get update && apt-get install mariadb-server-10.3 libmariadbd-dev
y

#7 - Mysql Database Setup

service mysql start
mysql -u root -p
password
create database vbulletin;
exit;

#8 - Vbulletin → Mysql Database Transfer

mysql -u root -p vbulletin < /shared/db.sql
password

#9 - GEM File

echo “gem ‘mysql2’” >>Gemfile
echo “gem ‘mysql2’, require: false” >> /var/www/discourse/Gemfile
echo “gem ‘php_serialize’, require: false” >> /var/www/discourse/Gemfile
cd /var/www/discourse
su discourse -c ‘bundle install --no-deployment --without test --without development --path vendor/bundle’
(Ignore red text result)

#10 - Configure install script

vi /var/www/discourse/script/import_scripts/vbulletin.rb

#10.a - Make edits to text file as needed

DB_HOST ||= ENV[‘DB_HOST’] || “localhost”
DB_NAME ||= ENV[‘DB_NAME’] || “vbulletin”
DB_PW ||= ENV[‘DB_PW’] || “password”
DB_USER ||= ENV[‘DB_USER’] || “root”
TIMEZONE ||= ENV[‘TIMEZONE’] || “America/Los_Angeles”
TABLE_PREFIX ||= ENV[‘TABLE_PREFIX’] || “”
ATTACHMENT_DIR ||= ENV[‘ATTACHMENT_DIR’] || ‘/shared/attachments/’

#10.c - End edits

:wq

#11 - Bundle Config

bundle config set path ‘vendor/bundle’
bundle config set without ‘development:test’
bundle config unset deployment
su discourse -c ‘bundle install’

#12 - Mysql config (may be possible to do this with previous)

mysql --version
sudo mysql -u root -p
password
ALTER USER 'root'@'localhost' IDENTIFIED BY 'password';
FLUSH PRIVILEGES;
exit

#13 - Install Script

su discourse -c ‘bundle exec ruby script/import_scripts/vbulletin.rb’

Good luck!

8 个赞

Just wanted to leave feedback after our migration from vB4:

  • FIXED Soft-Deleted posts where not properly hidden: https://github.com/discourse/discourse/pull/12057
  • [ul] + [li] and nested [LIST] were not migrated properly and the BBcode plugin doesn’t seem to handle this either → This seems to be expected: CommonMark testing started here! (Quote: Core will not implement [ul] [ol] and [li] support for BBCode cause it is a recipe for failure.) → I will need to build some RegEx magic post-fixup for this.
  • We made an initial migration using the normale importer (took > 3 days) and restarted the migration with newer DB snapshots a couple of times to keep the import “fresh” and reduce the downtime to effectively 30 minutes. This procedure worked quite well, except for everything that was edited after we initially imported the threads, posts. We need to manually rework this information now.
  • Creating Plugins for Discourse is really hard due to lack of documentation and a big picture of how the folder structure works. Though it is getting nicer and better after you understand how it works.

Questions that i have left:

  • I not not sure how the importer maps already imported posts and how to match the old vB4 post_id to the new Discourse post_id to hide those “soft-deleted” post. If someone can give me a hint that would be very welcome! Found it: import_id inside the post_custom_fields table. Nice. Now i need to write some handy script to fix this :slight_smile: → Edit: An even better way is to use the importer script, which maps all imported id’s for easy use.
2 个赞

Unfortunately I can’t edit my previous post :slight_smile:

I found another issue: Every attachment that is not linked into a post, will not be available to Discourse.

My draft PR for fixing this issue: https://github.com/discourse/discourse/pull/12187

Thanks!

3 个赞

Just a quick followup on my issue list. I fixed the visibility problem.

Dump all affected posts from your old vBulletin database:

SELECT postid
FROM `vb4_post`
WHERE `visible` > '1'
ORDER BY postid

Make an imported_post_ids.txt file which has all the postid’s line by line

Create a new file for the fixing script:

nano script/import_scripts/fix_visibility.rb 

Content:

require_relative '../../config/environment'
require_relative 'base/lookup_container'

@lookup = ImportScripts::LookupContainer.new

broken_postids = []
broken_real_postids = []

File.foreach("imported_post_ids.txt") do |line|
  broken_postids.append(line.to_i)
end

broken_postids.each do |id|
  broken_real_postids.append(@lookup.post_id_from_imported_post_id(id))
end

broken_real_postids.each do |id|
  puts id
  Post.find(id).trash!
end

Run the script:

su discourse -c 'bundle exec ruby script/import_scripts/fix_visibility.rb'

The script will use the logic from the importer to map the imported post_id’s to the read discourse post_id’s which we want to hide.

4 个赞

大家好,

我有一个脚本正在运行 vb3 迁移。我一次只做一个步骤,目前正在以每分钟 330 个用户的速度处理 122k 用户。然后我们还有 250 万个帖子要处理。

我们正在生产服务器上进行此操作。没有人使用 discourse 站点,我们刚刚设置好它,并且它位于一个匿名的 URL。如果我登录,我可以看到新的用户通知在增加。这可能是一个愚蠢的问题,但我想知道如果我们暂停或以某种方式禁用实时站点,迁移是否会处理得更快?

1 个赞

这取决于您的生产服务器上的负载和 CPU 数量。您可以尝试停止 Web 服务器 5 分钟,看看导入是否会更快。

3 个赞

导入需要很长时间。据我所知,批量导入器应该更快。我们从备份中进行了第一次导入,然后在我们的强大开发机器上进行了增量导入,从另一个备份进行了增量导入,以便将 Discourse 的停机时间缩短到半小时。请注意增量更新时可能出现的错误 :slight_smile: (请参见:Migrate a vBulletin 4 forum to Discourse - #132 by paresy

paresy

3 个赞

我看到一个核心被占用,我认为这是服务器正在摄取更新后的数据,另一个核心在运行导入脚本时被占用。我确实没有领域知识来判断这两个进程争夺数据库资源是否会减慢导入器的速度,我也无法判断在不关闭容器的情况下是否可以停止摄取。摄取无论如何都必须发生,所以我想最安全的方法就是让它继续运行。

给未来读者的一点建议是,我看到我们有 27k(22%!)的用户是被禁止的垃圾邮件机器人。我们将在源端清除它们,然后再进行最终导入。

[补充] 我上面没有提到的一项必要编辑:

--- a/script/import_scripts/vbulletin.rb
+++ b/script/import_scripts/vbulletin.rb
@@ -134,6 +133,7 @@ EOM
        , usertitle
        , usergroupid
        , joindate
+       , lastvisit
        , email
        , password
        , salt

以及一项可能是 vb3 特定的编辑:

--- a/script/import_scripts/vbulletin.rb
+++ b/script/import_scripts/vbulletin.rb
@@ -987,7 +989,7 @@ EOM
   end

   def parse_timestamp(timestamp)
-    Time.zone.at(@tz.utc_to_local(timestamp))
+    Time.zone.at(@tz.utc_to_local(Time.at(timestamp)))
   end

[补充] 导入正在 Oracle Cloud 的 4 核 Ampere 实例上运行。作为比较,我在 M1 MacBook Air 上本地安装了一个 Discourse 开发服务器,令我惊讶的是,导入过程运行得明显更慢。

6 个赞

你是否遇到了预先存在的脚本错误?我因为那个丢失了我们所有旧 vBulletin 4 帖子的日期和时间信息。如果这是一个修复,我很想知道是否值得在所有帖子都复制过来后重新导入。

2 个赞

是的,脚本会因为将一个整数馈送给时间函数而报错。

3 个赞

不。脚本会跳过已经导入的帖子。

3 个赞

你好,

你找到如何修复这个问题了吗?

我们两个主要的/底部的论坛的 parentid = -1(我认为这是因为我们以前从 v3 转换过来的)。

我不确定如何进行,我是否应该在转换脚本中将它们设置为 0(如果它们是 -1)?假设 0 是主要的 discourse 分类?

实际上,现在查看 discourse 网站;这两个似乎是唯一被导入的?

 importing top level categories...
         2 / 2 (100.0%)  [211 items/min]  in]
 importing children categories...
 Traceback (most recent call last):
         5: from script/import_scripts/vbulletin.rb:1003:in `<main>'
         4: from /var/www/discourse/script/import_scripts/base.rb:47:in `perform'
         3: from script/import_scripts/vbulletin.rb:84:in `execute'
         2: from script/import_scripts/vbulletin.rb:287:in `import_categories'
         1: from script/import_scripts/vbulletin.rb:287:in `each'
script/import_scripts/vbulletin.rb:289:in `block in import_categories': undefined method `[]' for nil:NilClass (NoMethodError)
1 个赞

可能吧。从那以后我做了很多 vBulletin 导入。 :person_shrugging:

你只需要试试看会发生什么。它看起来和我描述的一样。

我只会修改脚本来…… 做某事 ……如果那个东西是 nil。

1 个赞

当然可以,但我对论坛的运作方式了解不够,不知道应该设置为多少。
如果我将它们设置为像 0 这样的随机数,论坛会怎么做?或者我应该找到数据库中已有的某个分类编号并将其设置为该编号?

我对 Ruby 不太熟悉,你觉得这个可以吗?

        if categories.detect { |c| c["forumid"] == cc["parentid"] }["parentid"].nil?
          cc["parentid"] = 52
        else
          cc["parentid"] = categories.detect { |c| c["forumid"] == cc["parentid"] }["parentid"]
        end

实际上,似乎有很多已删除的论坛,它们的 parentid 已不存在。

编辑
我刚刚将它们全部设置为一个父主题,以后再修复。

1 个赞

终于到了附件导入部分,进度大约是 1.9%,现在出现了这个错误

    67406 / 3550728 (  1.9%)  Traceback (most recent call last):
        23: from script/import_scripts/vbulletin.rb:1006:in `
        22: from /var/www/discourse/script/import_scripts/base.rb:47:in `perform'
        21: from script/import_scripts/vbulletin.rb:88:in `execute'
        20: from script/import_scripts/vbulletin.rb:610:in `import_attachments'
        19: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activerecord-6.1.4.1/lib/active_record/querying.rb:22:in `find_each'
        18: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activerecord-6.1.4.1/lib/active_record/relation/batches.rb:70:in `find_each'
        17: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activerecord-6.1.4.1/lib/active_record/relation/batches.rb:137:in `find_in_batches'
        16: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activerecord-6.1.4.1/lib/active_record/relation/batches.rb:229:in `in_batches'
        15: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activerecord-6.1.4.1/lib/active_record/relation/batches.rb:229:in `loop'
        14: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activerecord-6.1.4.1/lib/active_record/relation/batches.rb:245:in `block in in_batches'
        13: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activerecord-6.1.4.1/lib/active_record/relation/batches.rb:138:in `block in find_in_batches'
        12: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activerecord-6.1.4.1/lib/active_record/relation/batches.rb:71:in `block in find_each'
        11: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activerecord-6.1.4.1/lib/active_record/relation/batches.rb:71:in `each'
        10: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activerecord-6.1.4.1/lib/active_record/relation/batches.rb:71:in `block (2 levels) in find_each'
         9: from script/import_scripts/vbulletin.rb:651:in `block in import_attachments'
         8: from script/import_scripts/vbulletin.rb:651:in `each'
         7: from script/import_scripts/vbulletin.rb:659:in `block (2 levels) in import_attachments'
         6: from /var/www/discourse/script/import_scripts/base.rb:873:in `html_for_upload'
         5: from /var/www/discourse/script/import_scripts/base/uploader.rb:40:in `html_for_upload'
         4: from /var/www/discourse/lib/upload_markdown.rb:10:in `to_markdown'
         3: from /var/www/discourse/lib/upload_markdown.rb:19:in `image_markdown'
         2: from /var/www/discourse/app/models/upload.rb:206:in `short_url'
         1: from /var/www/discourse/app/models/upload.rb:534:in `short_url_basename'
/var/www/discourse/app/models/upload.rb:270:in `base62_sha1': undefined method `hex' for nil:NilClass (NoMethodError)

undefined method `hex’ for nil:NilClass (NoMethodError)

有人知道怎么修复这个问题吗?

是不是在尝试读取 short_url_basename,它返回 nil;所以 .hex 失败了?

1 个赞

我猜,在不看代码的情况下,文件丢失了,或者可能有一个 filename 字段但它是空的?我可能会在 import_attachments 中放一个 puts,看看它试图导入的记录中有哪些内容。

1 个赞

谢谢你的帮助!我是 Ruby 新手,这样做对吗?

      unless mapping[post.id].nil? || mapping[post.id].empty?
        mapping[post.id].each do |attachment_id|
          upload, filename = find_upload(post, attachment_id)
          unless upload
            fail_count += 1
            next
          end

          puts "{short_url_basename}"

          # internal upload deduplication will make sure that we do not import attachments again
          html = html_for_upload(upload, filename)
          if !new_raw[html]
            new_raw += "\n\n#{html}\n\n"
          end
        end
      end

啊哈,short_url_basename 是一个函数,所以行不通。

是仅仅 puts “{post}” 吗?它会输出 post 对象的所有内容吗?

这似乎是 upload.rb 中导致崩溃的那一行:

upload_markdown 19
“![#{@upload.original_filename}|#{@upload.width}x#{@upload.height}](#{@upload.short_url})”

upload.rb 534
“#{Upload.base62_sha1(sha1)}#{extension.present? ? ".#{extension}" : ""}"

upload.rb 270
Base62.encode(sha1.hex)

所以要么是 upload.original_filename、upload.width、upload.height 或者是 upload.short_url。

那么,如果在 upload_markdown 中进行 nil 检查,是否可以防止错误?

它需要 shortURL 才能工作吗?我能自己创建一个随机的 shortURL 吗?

2 个赞

我认为问题就在这里。它找不到上传的文件,所以返回 nil。也许文件丢失或无效。

1 个赞

但这不会捕获它吗?

unless upload
  fail_count += 1
  next
end

还是说 unless 不检查 nil?

或者是因为它创建了 upload 对象,但 upload 对象中的 upload.short_url 属性丢失了?

1 个赞

抱歉。对了。这样就能抓到了。恐怕这就是为什么这种程度的调试不适合在论坛上进行。:person_shrugging:

不过你思路是对的。继续努力。看起来你已经知道足够多的东西可以弄明白了。我在学会 Ruby 之前,至少写了几个导入器。

1 个赞