Large Drupal forum migration, importer errors and limitations

Oh… I never even thought about running the importer more than once on the production data. Thanks a lot for that idea.

So basically I would do a production Drupal database dump and run a full import for however long that takes while leaving the Drupal forum open to the public, then I would put the Drupal forum in offline mode and do another database dump and load that one into the importer MySQL instance and re-run the importer script?

The only disadvantage I can see there is that any Drupal post edits or post deletions or existing user profile changes during that interval (lets call it 3 days) from initial import until the second import would not get imported, only completely new users and posts during that interval, correct?

That’s right. Data that are edited will be lost. The alternative is taking the forum offline for that time. You could warn people. In practice, no one has complained.

1 Like

Cool, that sounds like a pretty reasonable compromise to minimize downtime. Thanks again for the idea.

1 Like

Just after I committed to the final production import run I was randomly checking some of the imported forum threads when I discovered a big issue. As has been the recurring theme here, it’s not really the fault of Discourse or the importer script. But it turns out that when a topic reply (“comment” in Drupal parlance) is edited it sometimes changes the created timestamp. As far as I can tell it should change the changed timestamp instead. But despite this bug in Drupal it still keeps the comment threading in order. But the way the Drupal importer script for Discourse works it apparently sorts the replies by their created timestamp (although I don’t see any ORDER clauses in the drupal.rb script). From my testing in phpMyAdmin on the Drupal database it looks like it simply needs an ORDER BY c.cid ASC to maintain the correct threading order via the original Drupal comment ID cid which is is sequential and never changes. But I’m not sure if the Discourse importer will allow sequential replies to have dates that are out of order, and/or if it will go ahead and do its own sorting by post date? I’d like to get the opinion of the original creator of the drupal.rb importer (and anybody else too obviously), would this work and would there be any unintended consequences?

I think (but don’t really know, drupal’s database makes no sense to me), that you can adjust the query to pull the created at from the original post and not the edited one.

I think that there is a table that contains the original post/time and another with the edits.

I think that ording by cid won’t hurt anything.

1 Like

It really doesn’t make any sense. The problem is like this, the two highlighted rows should be the 2nd and 3rd replies in the thread, but in Discourse they’re somewhere after the 500th position because that’s where their timestamp is.

And of course Drupal decided to set the created and changed to the same thing… And it only happens sometimes, I can’t reproduce it myself by editing old posts. But I have this same problem in multiple long-running threads where the OP of the topic posted it and then immediately posted one or two follow-up comments with “reserved” in the body for adding additional information later, which he did several years later.

That would make sense, but here’s the same problematic edited comment spread across the two tables it uses:


1 Like

OK, it looks like this works, it fixes the two jumbled threads I found and doesn’t seem to hurt anything else.

  def import_replies

    batches(BATCH_SIZE) do |offset|
      results = mysql_query(<<-SQL
        SELECT c.cid, c.pid, c.nid, c.uid, c.created,
               f.comment_body_value body,
               f.comment_body_format format
          FROM comment c,
               field_data_comment_body f,
               node n
         WHERE c.cid = f.entity_id
           AND n.nid = c.nid
           AND c.status = 1
           AND n.type IN ('poll', 'forum')
           AND n.status = 1
           AND c.created > UNIX_TIMESTAMP(STR_TO_DATE('#{IMPORT_AFTER}', '%Y-%m-%d'))
         ORDER BY c.cid ASC  #<--- Fixed
         LIMIT #{BATCH_SIZE}
        OFFSET #{offset}
      SQL
      ).to_a
1 Like

Hmm, it looks like this came back to bite me. Since the postprocess_posts function replaces old internal links with the new Discourse URL, I had made an exception in the code for the https://web.archive.org/web/20230101093741/https://MyOldForum.com/node/98765 links that my importer created for the old Drupal polls in the Wayback Machine. But apparently something went wrong, because I just noticed in the production migrated site that the links ended up like https://web.archive.org/web/20230101093741/https://MyOldForum.com/t/-/12345 .

So now that I’m no longer in the context of a migration container is the custom field with the original original Drupal node nid still available in the Discourse topics DB table? If so it would seem possible to do a string replacement in the Rails console on all topics with the first post that contains View this poll on the Wayback Machine and then replace
https://web.archive.org/web/20230101093741/https://MyOldForum.com/t/-/[01234567890]*
with
https://web.archive.org/web/20230101093741/http://MyOldForum.com/node/$original_nid

Here’s my original poll import function:

    def import_poll_topics
    puts '', "importing poll topics"

    polls = mysql_query(<<-SQL
      SELECT n.nid nid, n.title title, n.uid uid, n.created created, n.sticky sticky, taxonomy_index.tid tid, node_counter.totalcount views
        FROM node n
        LEFT JOIN taxonomy_index ON n.nid = taxonomy_index.nid
        LEFT JOIN node_counter ON n.nid = node_counter.nid
       WHERE n.type = 'poll'
         AND n.status = 1
    SQL
    ).to_a

    create_posts(polls) do |topic|
      {
        id: "nid:#{topic['nid']}",
        user_id: user_id_from_imported_user_id(topic['uid']) || -1,
        category: category_id_from_imported_category_id(topic['tid']),
        # Use TEMPmyoldforum.com or else postprocess_posts() will try to convert the Wayback Machine /node/YYY link
        raw: "### View this poll on the Wayback Machine:\n**https://web.archive.org/web/20230101093741/http://TEMPmyoldforum.com/node/#{topic['nid']}**",
        created_at: Time.zone.at(topic['created']),
        pinned_at: topic['sticky'].to_i == 1 ? Time.zone.at(topic['created']) : nil,
        title: topic['title'].try(:strip),
        views: topic['views'],
        custom_fields: { import_id: "nid:#{topic['nid']}" }
      }
    end
  end

Yes.

t=Topic.find(1234)
t.custom_fields
t.custom_fields['import_id']
1 Like