大型 Drupal 论坛迁移，导入器错误和限制

rahim123 · 2022 年11 月 28 日 14:26

您好，此主题提供了我正在缓慢规划和测试的迁移的一些背景信息。上周五，我终于在测试环境 VPS 上尝试了 Drupal 导入器，结合使用了这个和这个。导入器在我输入时仍在运行，所以我还无法实际测试测试站点的功能，但它很快就会完成。

我面临的最大问题是，在约 80,000 个节点（Discourse 中的主题等效项）中，有 8 个似乎是随机的节点出现了“重复键值”错误。以防万一存在某种非常奇怪的 Y2K 式数学错误，这里是具体的 nid 号码：

42081, 53125, 57807, 63932, 66756, 76561, 78250, 82707

每次重新运行导入器时，都会在这些相同的 nid 上发生相同的错误：

Traceback (most recent call last):
	19: from script/import_scripts/drupal.rb:537:in `<main>'
	18: from /var/www/discourse/script/import_scripts/base.rb:47:in `perform'
	17: from script/import_scripts/drupal.rb:39:in `execute'
	16: from script/import_scripts/drupal.rb:169:in `import_forum_topics'
	15: from /var/www/discourse/script/import_scripts/base.rb:916:in `batches'
	14: from /var/www/discourse/script/import_scripts/base.rb:916:in `loop'
	13: from /var/www/discourse/script/import_scripts/base.rb:917:in `block in batches'
	12: from script/import_scripts/drupal.rb:195:in `block in import_forum_topics'
	11: from /var/www/discourse/script/import_scripts/base.rb:224:in `all_records_exist?'
	10: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activerecord-7.0.3.1/lib/active_record/transactions.rb:209:in `transaction'
	 9: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activerecord-7.0.3.1/lib/active_record/connection_adapters/abstract/database_statements.rb:316:in `transaction'
	 8: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activerecord-7.0.3.1/lib/active_record/connection_adapters/abstract/transaction.rb:317:in `within_new_transaction'
	 7: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activesupport-7.0.3.1/lib/active_support/concurrency/load_interlock_aware_monitor.rb:21:in `synchronize'
	 6: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activesupport-7.0.3.1/lib/active_support/concurrency/load_interlock_aware_monitor.rb:21:in `handle_interrupt'
	 5: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activesupport-7.0.3.1/lib/active_support/concurrency/load_interlock_aware_monitor.rb:25:in `block in synchronize'
	 4: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activesupport-7.0.3.1/lib/active_support/concurrency/load_interlock_aware_monitor.rb:25:in `handle_interrupt'
	 3: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activerecord-7.0.3.1/lib/active_record/connection_adapters/abstract/transaction.rb:319:in `block in within_new_transaction'
	 2: from /var/www/discourse/script/import_scripts/base.rb:231:in `block in all_records_exist?'
	 1: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/rack-mini-profiler-3.0.0/lib/patches/db/pg.rb:56:in `exec'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/rack-mini-profiler-3.0.0/lib/patches/db/pg.rb:56:in `exec': ERROR:  duplicate key value violates unique constraint "import_ids_pkey" (PG::UniqueViolation)
DETAIL:  Key (val)=(nid:42081) already exists.
	20: from script/import_scripts/drupal.rb:537:in `<main>'
	19: from /var/www/discourse/script/import_scripts/base.rb:47:in `perform'
	18: from script/import_scripts/drupal.rb:39:in `execute'
	17: from script/import_scripts/drupal.rb:169:in `import_forum_topics'
	16: from /var/www/discourse/script/import_scripts/base.rb:916:in `batches'
	15: from /var/www/discourse/script/import_scripts/base.rb:916:in `loop'
	14: from /var/www/discourse/script/import_scripts/base.rb:917:in `block in batches'
	13: from script/import_scripts/drupal.rb:195:in `block in import_forum_topics'
	12: from /var/www/discourse/script/import_scripts/base.rb:224:in `all_records_exist?'
	11: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activerecord-7.0.3.1/lib/active_record/transactions.rb:209:in `transaction'
	10: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activerecord-7.0.3.1/lib/active_record/connection_adapters/abstract/database_statements.rb:316:in `transaction'
	 9: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activerecord-7.0.3.1/lib/active_record/connection_adapters/abstract/transaction.rb:317:in `within_new_transaction'
	 8: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activesupport-7.0.3.1/lib/active_support/concurrency/load_interlock_aware_monitor.rb:21:in `synchronize'
	 7: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activesupport-7.0.3.1/lib/active_support/concurrency/load_interlock_aware_monitor.rb:21:in `handle_interrupt'
	 6: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activesupport-7.0.3.1/lib/active_support/concurrency/load_interlock_aware_monitor.rb:25:in `block in synchronize'
	 5: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activesupport-7.0.3.1/lib/active_support/concurrency/load_interlock_aware_monitor.rb:25:in `handle_interrupt'
	 4: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activerecord-7.0.3.1/lib/active_record/connection_adapters/abstract/transaction.rb:319:in `block in within_new_transaction'
	 3: from /var/www/discourse/script/import_scripts/base.rb:243:in `block in all_records_exist?'
	 2: from /var/www/discourse/script/import_scripts/base.rb:243:in `ensure in block in all_records_exist?'
	 1: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/rack-mini-profiler-3.0.0/lib/patches/db/pg.rb:56:in `exec'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/rack-mini-profiler-3.0.0/lib/patches/db/pg.rb:56:in `exec': ERROR:  current transaction is aborted, commands ignored until end of transaction block (PG::InFailedSqlTransaction)

我唯一能让它继续进行的方法是修改 SQL 条件：

...
	 LEFT JOIN node_counter nc ON nc.nid = n.nid
         WHERE n.type = 'forum'
           AND n.status = 1
AND n.nid != 42081
AND n.nid != 53125
AND n.nid != 57807
AND n.nid != 63932
AND n.nid != 66756
AND n.nid != 76561
AND n.nid != 78250
AND n.nid != 82707
         LIMIT #{BATCH_SIZE}
        OFFSET #{offset};
...

我检查了第一个失败的节点以及它在源 Drupal 数据库中前后相邻的 nid，但没有发现任何问题。nid 被设置为主键，并且具有 AUTO_INCREMENT，原始 Drupal 站点运行正常，因此源数据库的完整性不存在根本性问题。

除了上述错误之外，脚本还存在以下限制：

固定链接： 看起来导入器脚本会为以前的节点 URL example.com/node/XXXXXXX 创建固定链接。但我还需要维护指向这些节点内特定评论的链接，其格式为：example.com/comment/YYYYYYY#comment-YYYYYYY（YYYYYYY 在两种情况下都相同）。Drupal 的 URL 方案不包含评论所属的节点 ID，而 Discourse 则包含（example.com/t/topic-keywords/XXXXXXX/YY），这似乎是一个主要的复杂问题。
用户名限制： Drupal 允许用户名中包含空格。我理解 Discourse 不允许，至少不允许新用户以这种方式创建。此帖子表明导入器脚本会自动“转换”有问题的用户名，但我没有在 /import_scripts/drupal.rb 中看到任何相关代码。 更新： 实际上，看起来 Discourse 已自动以正确的方式处理了这个问题。
被封禁用户： 脚本似乎会导入所有用户，包括被封禁的帐户。我或许可以很容易地在 SQL 选择 WHERE status = 1 中添加一个条件，只导入活动用户帐户，但我不知道这是否会导致记录序列化出现问题。最重要的是，我希望永久阻止这些先前被封禁的帐户名称及其关联的电子邮件地址，以免相同的问题用户再次在 Discourse 上注册。
用户个人资料字段： 有没有人知道其他导入器中是否有导入用户帐户个人资料中个人信息字段的示例代码？我只有一个个人资料字段（“地点”）需要导入。
头像（非 Gravatar）： Drupal 导入器中有导入 Gravatar 的代码，但没有导入更常用的本地帐户头像图片的代码，这似乎有点奇怪。
私人消息： 几乎所有 Drupal 7 论坛可能都会使用第三方 privatemsg 模块（Drupal 没有官方的 PM 功能）。导入器不支持导入 PM。在我的情况下，我需要导入大约 150 万条。

感谢您的帮助以及提供 Drupal 导入器脚本。

pfaffman · 2022 年11 月 28 日 20:07

这组问题对于大型导入来说很常见。编写它的人可能并不在意（也许没注意到）你描述的问题。

这听起来像是 Drupal 或数据库本身的错误（不应该出现重复 ID）。我可能会修改脚本来测试或捕获重复项时的错误，但你的方法奏效了（除非还有更多）。

你可以查看其他创建帖子永久链接的导入脚本。import_id 位于每个帖子的 PostCustomField 中。

它在 base.rb 或用户名建议器中。它基本上可以正常工作，你无法对其进行太多更改。

你可能不想这样做。问题是这些用户创建的帖子将归 system 所有。你可以查看其他脚本以了解如何禁用它们。fluxbb 有一个 suspend_users 脚本，应该会有帮助。

fluxbb（我碰巧正在处理它）可以做到这一点。你只需在导入用户脚本中添加类似以下内容：

          location: user['location'],

Gravatar 由 discourse core 处理，因此脚本不执行任何导入操作；它只是可以正常工作。你可以在其他脚本中搜索“avatar”以查找如何执行此操作的示例。

查找示例。。。ipboard 有 import_private_messages。

rahim123 · 2022 年11 月 28 日 20:10

感谢您的回复。我认为这不是 Drupal 数据库的问题，因为我检查了源数据库，找不到任何重复的 nid 键。

啊哈，所以它在 drupal.rb 之外还有那个功能。现在我正在查看测试导入站点，它实际上处理用户名转换得非常好。谢谢！

rahim123 · 2022 年12 月 1 日 21:29

启用 Unicode 用户名导入的最简单方法是什么（无需转换，即保留用户名 Narizón 而不是将其转换为 Narizon）？

我对一个没有配置 Web GUI 的实例进行了第一次 Drupal 导入器测试，因此我没有设置 Discourse 选项来允许 Unicode 用户名。如果设置了该选项，导入器会尊重它吗？为我进行生产迁移时，启用此功能的推荐方法是什么？

同时，对于我当前的测试实例，是否有任何 rake 命令可以将全名应用于用户名？（我已经激活了 prioritize username in ux，但由于我的测试用户习惯了 Drupal，它只支持用户名登录（而不是电子邮件地址），我认为最好保留他们的生产用户名，至少在全名字段中保留。）

pfaffman · 2022 年12 月 2 日 08:25

可能吧？

你可以在脚本开头设置站点设置。

我认为更改用户名是个坏主意，但如果你不喜欢它们，可以更改传递给用户名生成器的内容。

rahim123 · 2022 年12 月 2 日 12:40

谢谢，你的意思是导入完成后更改用户名吗？

pfaffman · 2022 年12 月 2 日 13:39

我意思是除非在旧系统中用户名是隐藏的，只显示真实姓名，否则根本不应更改它们。

如果后者是这种情况，那么我将更改脚本，使用户名成为他们的真实姓名。问题在于，如果他们不知道自己的电子邮件地址，他们将无法找到自己的帐户。

rahim123 · 2022 年12 月 2 日 13:44

明白了。在 Drupal 论坛上只有系统用户名，没有单独的真实姓名。此外，Drupal 不允许使用电子邮件地址登录，只能使用用户名。因此，在我的情况下，尽可能保留用户名非常重要。（仍然会有些用户名被转换，例如带空格的用户名。）所以我需要研究如何在导入脚本的开头设置 Discourse 设置。

pfaffman · 2022 年12 月 2 日 15:58

但 Discourse 可以，所以如果他们知道自己的电子邮件地址，他们就可以用它来重置密码，这可能是你应该告诉大家的做法，因为你无法猜测谁猜不到他们的用户名。

我想我会设置 SiteSetting.unicode_username=true 在导入脚本中并再次运行它，看看是否有效。你可以在 rails 控制台中尝试测试一下，看看是否可行。这可能会告诉你：

  User.create(username: 'Narizón', email: 'user@x.com',password: 'xxx', active: true)

嗯，我认为这仍然不会调用用户名创建器，所以你需要调用它

  UserNameSuggester.suggest("Narizón")

不。这仍然无法获得 Unicode 用户名。你可能需要找到 UsernameSuggester 并对其进行调整。

但如果你真的想更改用户名，现在更改而不是修复脚本可能是你想要的。你需要确保你更改用户名的做法会更新所有帖子中的用户名。如果你正在使用 rake 任务，它肯定会做到这一点。

rahim123 · 2022 年12 月 2 日 19:53

太棒了，非常感谢 Jay！下次运行导入器时我会试试这个方法。

pfaffman · 2022 年12 月 2 日 20:35

我认为你不应该费心：

这在 lib/user_name_suggester.rb 中，但也许您想要 User.normalize_username

rahim123 · 2022 年12 月 20 日 14:36

果然，你说对了。这甚至算不上一个 bug，结果发现是 Drupal 处理移动主题的一种奇怪方式，它会在前一个主题类别中留下一个面包屑。它只是在许多表中创建了一个重复的行，所有这些表最终都会被拉入一个完整的 Drupal 主题。所以看起来我需要弄清楚如何将 DISTINCT 应用于其中一个被选中的表……

pfaffman · 2022 年12 月 20 日 21:35

是的。令人惊讶的是，每次导入都是独一无二的，而且你的论坛居然是第一个遇到这个问题的（当然，很多人可能已经解决了问题但未能提交 PR 来更新）。或者他们可能忽略了错误？

rahim123 · 2022 年12 月 20 日 22:15

啊哈。我怀疑它不是一个很常用的功能，当一个帖子被移动到一个新的类别时，有一个可选的复选框可以在旧类别中留下一个“已移动到…”的链接。

有问题的重复项在 forum_index 的 nid 列中。所以看起来我可以用 GROUP BY nid 来修复它，对吗？

        SELECT fi.nid nid,
               fi.title title,
               fi.tid tid,
               n.uid uid,
               fi.created created,
               fi.sticky sticky,
               f.body_value body,
	       nc.totalcount views
          FROM forum_index fi
	 LEFT JOIN node n ON fi.nid = n.nid
	 LEFT JOIN field_data_body f ON f.entity_id = n.nid
	 LEFT JOIN node_counter nc ON nc.nid = n.nid
         WHERE n.type = 'forum'
           AND n.status = 1
         GROUP BY nid

这看起来很有希望，因为当我运行带有 GROUP BY nid 的查询时，行数减少了 8 行。

pfaffman · 2022 年12 月 20 日 22:30

这也许可行。我认为表中应该有一个值表明它已被移动，并且您可以仅选择那些没有该值的。

rahim123 · 2022 年12 月 20 日 22:40

这绝对是最合乎逻辑的设计方式。我猜这是 Drupal 的一种特性……

它唯一做的就是更改 tid（类别 ID）。这遵循了我在与 Drupal 数据库打交道的过程中学到的风格。我对数据库设计一无所知，但我的印象是，您可以显式存储数据，或者省略某些内容，然后通过程序逻辑来弄清楚；Drupal 似乎完全属于后者。

rahim123 · 2022 年12 月 30 日 21:23

Well, it looks like I’m almost there. Thanks a lot to Jay for the guidance.

Thanks, this was key, it was actually as simple as copying the permalink part of the Drupal import script itself and changing it to run on posts instead of topics:

    ## I added permalinks for each Drupal comment (reply) link: /comment/DIGITS#comment-DIGITS
    Post.find_each do |post|
      begin
        pcf = post.custom_fields
        if pcf && pcf['import_id']
          cid = pcf['import_id'][/cid:(\d+)/, 1]
          slug = "/comment/#{cid}" # The #comment-DIGITS part breaks the permalink and isn't needed
          Permalink.create(url: slug, post_id: post.id)
        end
      rescue => e
        puts e.message
        puts "Permalink creation failed for cid #{post.id}"
      end
    end

I was stuck for a while with my original attempt that included the relative page #comment-DIGITS part of the original Drupal link, which completely breaks the permalink in Discourse. then I realized that of course the # part of a link doesn’t actually get passed to the webserver and was only needed for Drupal to make it scroll to the part of the page where the specific comment was located. So it works fine without that in Discourse even if coming from an external web page with an old /comment/YYYYYY#comment-YYYYY link, it simply looks like this in Discourse: /comment/YYYYYY → /t/topic-title-words/123456/X and the URL bar shows like: /t/topic-title-words/123456/X#comment-YYYYYY , it doesn’t appear to care about the bogus #comment-YYYYYY part.

For some forums I suspect that the stock Drupal importer postprocess_posts function might actually be enough. It should be noted that it needs to be adjusted for each forum, there’s a rather sloppy hard-coded regexp replace for site.com → community.site.com. But after adjusting that it does a good job of rewriting internal forum links for nodes → topics as well as comments → replies. But I do have a fair number of external websites linking to individual comments (replies) on my forum and it’s worth conserving those. Plus Google indexes most of the 1.7M /comment-YYYYYY URLs and it would probably hurt my ranking if those all disappeared. I hope it won’t cause any problems for Discourse to have ~2M permalinks though?

Thanks a lot, I lifted that function almost without modifications, just had to adjust a few column names. Works great.

  def suspend_users
    puts '', "updating banned users"

    banned = 0
    failed = 0
    total = mysql_query("SELECT COUNT(*) AS count FROM users WHERE status = 0").first['count']

    system_user = Discourse.system_user

    mysql_query("SELECT name username, mail email FROM users WHERE status = 0").each do |b|
      user = User.find_by_email(b['email'])
      if user
        user.suspended_at = Time.now
        user.suspended_till = 200.years.from_now

        if user.save
          StaffActionLogger.new(system_user).log_user_suspend(user, "banned during initial import")
          banned += 1
        else
          puts "Failed to suspend user #{user.username}. #{user.errors.try(:full_messages).try(:inspect)}"
          failed += 1
        end
      else
        puts "Not found: #{b['email']}"
        failed += 1
      end

      print_status banned + failed, total
    end
  end

Also worked! I did have to deal with Drupal’s diffuse DB schema and LEFT JOIN profile_value location ON users.uid = location.uid to correlate another table that contains the profile data, but very cool that it’s so easy to add on the Discourse side of things. It’s worth noting that this process runs about 50% slower than stock, I suspect it’s due to the LEFT JOIN. But I can live with it, as I only have about 80K users.

This was fairly hard, once again due to Drupal’s disjointed database schema. I ended up using jforum.rb as the basis with a little help from the Vanilla importer too. The original script was rather paranoid with checking at every single variable pass to make sure the avatar filename isn’t null, so I removed most of those checks to make the code less messy. The worst that can happen is that the script could crash, but with the SQL query I used I don’t think even that could go wrong.

  def import_users
    puts "", "importing users"

    user_count = mysql_query("SELECT count(uid) count FROM users").first["count"]

    last_user_id = -1
    
    batches(BATCH_SIZE) do |offset|
      users = mysql_query(<<-SQL
          SELECT users.uid,
                 name username,
                 mail email,
                 created,
                 picture,
                 location.value location
            FROM users
             LEFT JOIN profile_value location ON users.uid = location.uid
           WHERE users.uid > #{last_user_id}
        ORDER BY uid
           LIMIT #{BATCH_SIZE}
      SQL
      ).to_a

      break if users.empty?

      last_user_id = users[-1]["uid"]

      users.reject! { |u| @lookup.user_already_imported?(u["uid"]) }

      create_users(users, total: user_count, offset: offset) do |row|
        if row['picture'] > 0
        	q = mysql_query("SELECT filename FROM file_managed WHERE fid = #{row['picture']};").first
        	avatar = q["filename"]
        end
        email = row["email"].presence || fake_email
        email = fake_email if !EmailAddressValidator.valid_value?(email)

        username = @htmlentities.decode(row["username"]).strip

        {
          id: row["uid"],
          name: username,
          email: email,
          location: row["location"],
          created_at: Time.zone.at(row["created"]),
	  	post_create_action: proc do |user|
		    import_avatar(user, avatar)
		end
        }
      end 
    end
  end
  def import_avatar(user, avatar_source)
    return if avatar_source.blank?

    path = File.join(ATTACHMENT_DIR, avatar_source)

      @uploader.create_avatar(user, path)
  end

After your paid help with the SQL query I ended up trying to hack it into the script for Discuz, IPboard, and Xenforo. I kept getting hitting dead ends with each one, I got closest with the Discuz model which appears to have a very similar database schema, but I couldn’t get past a bug with the @first_post_id_by_topic_id instance variable. After tons of trial and error I finally realized that it was improperly initialized at the beginning of the Discuz script (I tried to put it in the same location in the Drupal script) and this finally fixed it:

  def initialize
    super
    
    @first_post_id_by_topic_id = {}

    @htmlentities = HTMLEntities.new

    @client = Mysql2::Client.new(
      host: "172.17.0.3",
      username: "user",
      password: "pass",
      database: DRUPAL_DB
    )
  end

…

def import_private_messages
	puts '', 'creating private messages'

	pm_indexes = 'pm_index'
	pm_messages = 'pm_message'
	total_count = mysql_query("SELECT count(*) count FROM #{pm_indexes}").first['count']

	batches(BATCH_SIZE) do |offset|
		results = mysql_query("
SELECT pi.mid id, thread_id, pi.recipient to_user_id, pi.deleted deleted, pm.author user_id, pm.subject subject, pm.body message, pm.format format, pm.timestamp created_at FROM pm_index pi LEFT JOIN pm_message pm ON pi.mid=pm.mid WHERE deleted = 0
             LIMIT #{BATCH_SIZE}
            OFFSET #{offset};")

		break if results.size < 1

		# next if all_records_exist? :posts, results.map {|m| "pm:#{m['id']}"}

		create_posts(results, total: total_count, offset: offset) do |m|
			skip = false
			mapped = {}
			mapped[:id] = "pm:#{m['id']}"
			mapped[:user_id] = user_id_from_imported_user_id(m['user_id']) || -1
			mapped[:raw] = preprocess_raw(m['message'],m['format'])
			mapped[:created_at] = Time.zone.at(m['created_at'])
			thread_id = "pm_#{m['thread_id']}"
			if is_first_pm(m['id'], m['thread_id'])
				# find the title from list table
				#          pm_thread = mysql_query("
				#                SELECT thread_id, subject
				#                  FROM #{table_name 'ucenter_pm_lists'}
				#                 WHERE plid = #{m['thread_id']};").first
				mapped[:title] = m['subject']
				mapped[:archetype] = Archetype.private_message

          # Find the users who are part of this private message.
          import_user_ids = mysql_query("
                SELECT thread_id plid, recipient user_id
                  FROM pm_index
                 WHERE thread_id = #{m['thread_id']};
              ").map { |r| r['user_id'] }.uniq
          mapped[:target_usernames] = import_user_ids.map! do |import_user_id|
            import_user_id.to_s == m['user_id'].to_s ? nil : User.find_by(id: user_id_from_imported_user_id(import_user_id)).try(:username)
          end.compact
          if mapped[:target_usernames].empty? # pm with yourself?
            skip = true
            puts "Skipping pm:#{m['id']} due to no target"
          else
            @first_post_id_by_topic_id[thread_id] = mapped[:id]
          end
        else
          parent = topic_lookup_from_imported_post_id(@first_post_id_by_topic_id[thread_id])
          if parent
            mapped[:topic_id] = parent[:topic_id]
          else
            puts "Parent post pm thread:#{thread_id} doesn't exist. Skipping #{m["id"]}: #{m["message"][0..40]}"
            skip = true
          end
        end
        skip ? nil : mapped
      end

    end
end
# search for first pm id for the series of pm
def is_first_pm(pm_id, thread_id)
	result = mysql_query("
          SELECT mid id
            FROM pm_index
           WHERE thread_id = #{thread_id}
        ORDER BY id")
	result.first['id'].to_s == pm_id.to_s
end

Oh, and for most of these queries it also requires running this in the MySQL container to disable a strict mode SQL sanity check:
mysql -u root -ppass -e "SET GLOBAL sql_mode=(SELECT REPLACE(@@sql_mode,'ONLY_FULL_GROUP_BY',''));"

Another thing I realized was missing were a few thousand Drupal nodes of type poll . I first tried to just include WHERE type = 'forum' OR type = 'poll' in the import_topics function, but there is some seriously janky stuff going on in the original Drupal database that causes it to miss many of them. So I ended up copying the import_topics into a new import_polls function:

    def import_poll_topics
    puts '', "importing poll topics"

    polls = mysql_query(<<-SQL
      SELECT n.nid nid, n.title title, n.uid uid, n.created created, n.sticky sticky, taxonomy_index.tid tid, node_counter.totalcount views
        FROM node n
        LEFT JOIN taxonomy_index ON n.nid = taxonomy_index.nid
        LEFT JOIN node_counter ON n.nid = node_counter.nid
       WHERE n.type = 'poll'
         AND n.status = 1
    SQL
    ).to_a

    create_posts(polls) do |topic|
      {
        id: "nid:#{topic['nid']}",
        user_id: user_id_from_imported_user_id(topic['uid']) || -1,
        category: category_id_from_imported_category_id(topic['tid']),
        raw: "### You can see the archived poll results on the Wayback Machine:\n**https://web.archive.org/web/1234567890/http://myforum.com/node/#{topic['nid']}**",
        created_at: Time.zone.at(topic['created']),
        pinned_at: topic['sticky'].to_i == 1 ? Time.zone.at(topic['created']) : nil,
        title: topic['title'].try(:strip),
        views: topic['views'],
        custom_fields: { import_id: "nid:#{topic['nid']}" }
      }
    end
  end

I don’t care too much about importing the actual poll results, and it would require re-coding the entire algorithm that Drupal uses to tally up all the votes and eliminates duplicates. I mainly just want to import the followup comments in the poll thread. But just in case anyone wants to see the original poll results I made it write out a direct link to the original forum node in the Wayback Machine.

So the code is not at all elegant and probably isn’t very efficient, but for a one-shot deal that should get the job done.

Sorry for the walls of code, let me know if that irritates anyone and I can move them to a pastebin URL.

pfaffman · 2023 年1 月 2 日 22:39

大多数情况都是这样。这个主题是其他人可以效仿的一个很好的范例（前提是他们的技能水平与你相似）。

你估计花了多少时间进行定制？

恭喜！

rahim123 · 2023 年1 月 3 日 01:11

谢谢 Jay！感谢您的鼓励。

唉，我宁愿不去想那件事。在您用 SQL 查询指引我走上正轨后，大概花了 15 到 20 个小时。

如果您有什么想法，我很想听听您的意见：

在非常强大的 VPS 上使用生产数据进行完整的试运行大约需要 70 个小时。我想尽快让我的用户重新开始互动，即使帖子和私人消息的导入仍未完成。或者，我想到的另一个替代方案是禁用 preprocess_posts 函数，我也对其进行了大量修改，添加了额外的 gsub 正则表达式替换，并将所有帖子和私人消息通过 Pandoc 处理，根据原始帖子是 Textile 标记还是纯 HTML，使用一两个不同的命令。如果我禁用整个 preprocess_posts 例程，可能会将导入时间缩短近一半，然后我可以在所有原始数据导入后将所有格式化内容添加到 postprocess_posts 部分。但缺点是，事后我将无法轻松访问显示每个帖子源格式（Textile 或 HTML）的原始数据库列，而这是我进行 Pandoc 操作的一个条件。或者，我可以在每个帖子中添加一个自定义字段，将其标记为 textile 或 html，然后在稍后的后处理过程中检索它？不知道，只是在自言自语。

pfaffman · 2023 年1 月 3 日 05:32

当您再次运行导入脚本，只导入新数据时，它会运行得更快，因为它不会重复导入数据。因此，只需要几个小时。每次后续运行都会更快，因为需要导入的数据会越来越少。

您可以通过修改查询，使其仅返回特定时间之后的新数据来加快此过程。我接触过的大多数脚本都有一个 import_after 设置，正是为此目的（也为了允许通过导入一小部分数据来加快开发速度）。

话题		回复	浏览量
Migrate a phpBB3 forum to Discourse Migrating to Discourse how-to	467	98424	2025 年9 月 4 日
phpBB 3 Importer (old) Feature	90	17203	2014 年10 月 8 日
Importers for large forums Announcements	50	9488	2023 年12 月 1 日
Migrate a vBulletin 4 forum to Discourse Sysadmins how-to	183	30448	2025 年11 月 9 日
Migrate a Vanilla forum to Discourse Sysadmins how-to	44	16218	2023 年1 月 30 日

大型 Drupal 论坛迁移，导入器错误和限制

相关话题