Large Drupal forum migration, importer errors and limitations

Excellent, thanks very much Jay! I’ll try this the next time I run the importer.

I don’t think you should bother:

That’s in lib/user_name_suggester.rb, but maybe you want User.normalize_username

1 Like

Sure enough, you were right. It’s not even a bug per se; it turned out to be a quirk in the way Drupal handles moved topics when it leaves a breadcrumb behind in the former category. It simply creates a duplicate row in one of the many tables that all get pulled into what eventually becomes a complete Drupal topic. So it looks like I need to figure out how to apply DISTINCT to only one of the tables in the SELECT…

1 Like

Yeah. It’s amazing how every import is a snowflake, and somehow yours is the first forum to have had that issue (of course, many people might have solved the problem and not managed to submit a PR with the update). Or maybe they ignored the errors?

Aha. I suspect it’s not a very commonly used function; when a thread is moved to a new category there is an optional checkbox to leave a “Moved to…” link in the old category.

The offending duplicate is in the nid column from forum_index. So it looks like I can fix it with a GROUP BY nid, right?

        SELECT fi.nid nid,
               fi.title title,
               fi.tid tid,
               n.uid uid,
               fi.created created,
               fi.sticky sticky,
               f.body_value body,
               nc.totalcount views
          FROM forum_index fi
     LEFT JOIN node n ON fi.nid = n.nid
     LEFT JOIN field_data_body f ON f.entity_id = n.nid
     LEFT JOIN node_counter nc ON nc.nid = n.nid
         WHERE n.type = 'forum'
           AND n.status = 1
      GROUP BY nid

It looks promising: when I run the query with the GROUP BY nid there are 8 fewer rows.
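If anyone wants to double-check the same thing on their own data, something like this should list exactly which nids are duplicated in forum_index (standard SQL, shown here through the importer’s mysql_query helper, but the same query works in phpMyAdmin):

    dupes = mysql_query(<<-SQL
      SELECT nid, COUNT(*) AS copies
        FROM forum_index
       GROUP BY nid
      HAVING COUNT(*) > 1
    SQL
    ).to_a
    puts "#{dupes.size} duplicated nids in forum_index"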

That might work. I’d think that there would be some value in that table that said it had been moved and you could select only those without that value.

That would definitely be the most logical way to design it. I guess it’s a Drupal thing…

The only thing it does is change the tid (category ID). That fits the pattern I’ve learned during this ordeal with the Drupal database. I don’t know much about DB design, but I get the impression that you can either store data explicitly, or else leave some things implicit and figure them out later with programmatic logic; Drupal seems to fall squarely into the latter camp.

2 Likes

Well, it looks like I’m almost there. Thanks a lot to Jay for the guidance.

Thanks, this was key. It was actually as simple as copying the permalink part of the Drupal import script itself and changing it to run on posts instead of topics:

    ## I added permalinks for each Drupal comment (reply) link: /comment/DIGITS#comment-DIGITS
    Post.find_each do |post|
      begin
        pcf = post.custom_fields
        if pcf && pcf['import_id']
          cid = pcf['import_id'][/cid:(\d+)/, 1]
          slug = "/comment/#{cid}" # The #comment-DIGITS part breaks the permalink and isn't needed
          Permalink.create(url: slug, post_id: post.id)
        end
      rescue => e
        puts e.message
        puts "Permalink creation failed for cid #{post.id}"
      end
    end

I was stuck for a while because my original attempt included the relative #comment-DIGITS part of the original Drupal link, which completely breaks the permalink in Discourse. Then I realized that of course the # fragment of a link doesn’t actually get passed to the webserver; it was only needed so Drupal would scroll to the part of the page where the specific comment was located. So it works fine without it in Discourse: even when coming from an external web page with an old /comment/YYYYYY#comment-YYYYYY link, the permalink /comment/YYYYYY resolves to /t/topic-title-words/123456/X, the URL bar shows /t/topic-title-words/123456/X#comment-YYYYYY, and Discourse doesn’t appear to care about the bogus #comment-YYYYYY part.

For some forums I suspect the stock Drupal importer’s postprocess_posts function might actually be enough. It should be noted that it needs to be adjusted for each forum; there’s a rather sloppy hard-coded regexp replace for site.com / community.site.com. But after adjusting that it does a good job of rewriting internal forum links for nodes → topics as well as comments → replies. I do have a fair number of external websites linking to individual comments (replies) on my forum, though, and it’s worth preserving those. Plus Google indexes most of the 1.7M /comment-YYYYYY URLs, and it would probably hurt my ranking if those all disappeared. I hope it won’t cause any problems for Discourse to have ~2M permalinks though?
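Just to illustrate what I mean by the hard-coded part: conceptually the rewrite is a domain-specific gsub from old /node/NID links to the imported topics. This is an illustrative sketch only, not the actual drupal.rb code; it assumes topics were imported with nid:… import IDs (as the functions below do) and uses myoldforum.com as a stand-in for the hard-coded domain:

    def rewrite_node_links(raw)
      # Map old /node/NID links to the imported Discourse topics.
      # Links that don't resolve are left untouched.
      raw.gsub(%r{https?://(?:www\.)?myoldforum\.com/node/(\d+)}i) do
        nid = Regexp.last_match(1)
        topic = topic_lookup_from_imported_post_id("nid:#{nid}")
        topic ? "#{Discourse.base_url}/t/#{topic[:topic_id]}" : Regexp.last_match(0)
      end
    end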


Thanks a lot, I lifted that function almost without modification; I just had to adjust a few column names. Works great.

  def suspend_users
    puts '', "updating banned users"

    banned = 0
    failed = 0
    total = mysql_query("SELECT COUNT(*) AS count FROM users WHERE status = 0").first['count']

    system_user = Discourse.system_user

    mysql_query("SELECT name username, mail email FROM users WHERE status = 0").each do |b|
      user = User.find_by_email(b['email'])
      if user
        user.suspended_at = Time.now
        user.suspended_till = 200.years.from_now

        if user.save
          StaffActionLogger.new(system_user).log_user_suspend(user, "banned during initial import")
          banned += 1
        else
          puts "Failed to suspend user #{user.username}. #{user.errors.try(:full_messages).try(:inspect)}"
          failed += 1
        end
      else
        puts "Not found: #{b['email']}"
        failed += 1
      end

      print_status banned + failed, total
    end
  end

Also worked! I did have to deal with Drupal’s diffuse DB schema and add a LEFT JOIN profile_value location ON users.uid = location.uid to pull in the separate table that contains the profile data, but it’s very cool that it’s so easy to add on the Discourse side of things. It’s worth noting that this process runs about 50% slower than stock; I suspect it’s due to the LEFT JOIN. But I can live with it, as I only have about 80K users.


This was fairly hard, once again due to Drupal’s disjointed database schema. I ended up using jforum.rb as the basis, with a little help from the Vanilla importer too. The original script was rather paranoid, checking at every step that the avatar filename isn’t null, so I removed most of those checks to make the code less messy. The worst that could happen is that the script crashes, but with the SQL query I used I don’t think even that can go wrong.

  def import_users
    puts "", "importing users"

    user_count = mysql_query("SELECT count(uid) count FROM users").first["count"]

    last_user_id = -1
    
    batches(BATCH_SIZE) do |offset|
      users = mysql_query(<<-SQL
          SELECT users.uid,
                 name username,
                 mail email,
                 created,
                 picture,
                 location.value location
            FROM users
             LEFT JOIN profile_value location ON users.uid = location.uid
           WHERE users.uid > #{last_user_id}
        ORDER BY uid
           LIMIT #{BATCH_SIZE}
      SQL
      ).to_a

      break if users.empty?

      last_user_id = users[-1]["uid"]

      users.reject! { |u| @lookup.user_already_imported?(u["uid"]) }

      create_users(users, total: user_count, offset: offset) do |row|
        if row['picture'] > 0
          q = mysql_query("SELECT filename FROM file_managed WHERE fid = #{row['picture']};").first
          avatar = q["filename"]
        end
        email = row["email"].presence || fake_email
        email = fake_email if !EmailAddressValidator.valid_value?(email)

        username = @htmlentities.decode(row["username"]).strip

        {
          id: row["uid"],
          name: username,
          email: email,
          location: row["location"],
          created_at: Time.zone.at(row["created"]),
          post_create_action: proc do |user|
            import_avatar(user, avatar)
          end
        }
      end
    end
  end

  def import_avatar(user, avatar_source)
    return if avatar_source.blank?

    path = File.join(ATTACHMENT_DIR, avatar_source)

    @uploader.create_avatar(user, path)
  end

After your paid help with the SQL query I ended up trying to hack it into the scripts for Discuz, IPboard, and Xenforo. I kept hitting dead ends with each one; I got closest with the Discuz model, which appears to have a very similar database schema, but I couldn’t get past a bug with the @first_post_id_by_topic_id instance variable. After tons of trial and error I finally realized that it is initialized at the beginning of the Discuz script, so I put it in the same location in the Drupal script, and that finally fixed it:

  def initialize
    super
    
    @first_post_id_by_topic_id = {}

    @htmlentities = HTMLEntities.new

    @client = Mysql2::Client.new(
      host: "172.17.0.3",
      username: "user",
      password: "pass",
      database: DRUPAL_DB
    )
  end

  def import_private_messages
    puts '', 'creating private messages'

    pm_indexes = 'pm_index'
    pm_messages = 'pm_message'
    total_count = mysql_query("SELECT count(*) count FROM #{pm_indexes}").first['count']

    batches(BATCH_SIZE) do |offset|
      results = mysql_query("
          SELECT pi.mid id, thread_id, pi.recipient to_user_id, pi.deleted deleted,
                 pm.author user_id, pm.subject subject, pm.body message,
                 pm.format format, pm.timestamp created_at
            FROM pm_index pi
       LEFT JOIN pm_message pm ON pi.mid = pm.mid
           WHERE deleted = 0
           LIMIT #{BATCH_SIZE}
          OFFSET #{offset};")

      break if results.size < 1

      # next if all_records_exist? :posts, results.map {|m| "pm:#{m['id']}"}

      create_posts(results, total: total_count, offset: offset) do |m|
        skip = false
        mapped = {}
        mapped[:id] = "pm:#{m['id']}"
        mapped[:user_id] = user_id_from_imported_user_id(m['user_id']) || -1
        mapped[:raw] = preprocess_raw(m['message'], m['format'])
        mapped[:created_at] = Time.zone.at(m['created_at'])
        thread_id = "pm_#{m['thread_id']}"

        if is_first_pm(m['id'], m['thread_id'])
          # find the title from list table
          #   pm_thread = mysql_query("
          #         SELECT thread_id, subject
          #           FROM #{table_name 'ucenter_pm_lists'}
          #          WHERE plid = #{m['thread_id']};").first
          mapped[:title] = m['subject']
          mapped[:archetype] = Archetype.private_message

          # Find the users who are part of this private message.
          import_user_ids = mysql_query("
                SELECT thread_id plid, recipient user_id
                  FROM pm_index
                 WHERE thread_id = #{m['thread_id']};
              ").map { |r| r['user_id'] }.uniq

          mapped[:target_usernames] = import_user_ids.map! do |import_user_id|
            import_user_id.to_s == m['user_id'].to_s ? nil : User.find_by(id: user_id_from_imported_user_id(import_user_id)).try(:username)
          end.compact

          if mapped[:target_usernames].empty? # pm with yourself?
            skip = true
            puts "Skipping pm:#{m['id']} due to no target"
          else
            @first_post_id_by_topic_id[thread_id] = mapped[:id]
          end
        else
          parent = topic_lookup_from_imported_post_id(@first_post_id_by_topic_id[thread_id])
          if parent
            mapped[:topic_id] = parent[:topic_id]
          else
            puts "Parent post pm thread:#{thread_id} doesn't exist. Skipping #{m["id"]}: #{m["message"][0..40]}"
            skip = true
          end
        end

        skip ? nil : mapped
      end
    end
  end

  # search for first pm id for the series of pm
  def is_first_pm(pm_id, thread_id)
    result = mysql_query("
          SELECT mid id
            FROM pm_index
           WHERE thread_id = #{thread_id}
        ORDER BY id")
    result.first['id'].to_s == pm_id.to_s
  end

Oh, and most of these queries also require running this in the MySQL container first, to disable the ONLY_FULL_GROUP_BY strict-mode sanity check:

    mysql -u root -ppass -e "SET GLOBAL sql_mode=(SELECT REPLACE(@@sql_mode,'ONLY_FULL_GROUP_BY',''));"
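If changing the global server setting is undesirable, it should also be possible (untested, but it mirrors the command above) to drop the check for just the importer’s own connection, by running this once right after the Mysql2 client is created in initialize:

    # Per-connection alternative: only the importer's session loses the check.
    @client.query("SET SESSION sql_mode=(SELECT REPLACE(@@sql_mode,'ONLY_FULL_GROUP_BY',''))")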


Another thing I realized was missing was a few thousand Drupal nodes of type poll. I first tried to just include WHERE type = 'forum' OR type = 'poll' in the import_topics function, but there is some seriously janky stuff going on in the original Drupal database that causes it to miss many of them. So I ended up copying import_topics into a new import_poll_topics function:

  def import_poll_topics
    puts '', "importing poll topics"

    polls = mysql_query(<<-SQL
      SELECT n.nid nid, n.title title, n.uid uid, n.created created, n.sticky sticky, taxonomy_index.tid tid, node_counter.totalcount views
        FROM node n
        LEFT JOIN taxonomy_index ON n.nid = taxonomy_index.nid
        LEFT JOIN node_counter ON n.nid = node_counter.nid
       WHERE n.type = 'poll'
         AND n.status = 1
    SQL
    ).to_a

    create_posts(polls) do |topic|
      {
        id: "nid:#{topic['nid']}",
        user_id: user_id_from_imported_user_id(topic['uid']) || -1,
        category: category_id_from_imported_category_id(topic['tid']),
        raw: "### You can see the archived poll results on the Wayback Machine:\n**https://web.archive.org/web/1234567890/http://myforum.com/node/#{topic['nid']}**",
        created_at: Time.zone.at(topic['created']),
        pinned_at: topic['sticky'].to_i == 1 ? Time.zone.at(topic['created']) : nil,
        title: topic['title'].try(:strip),
        views: topic['views'],
        custom_fields: { import_id: "nid:#{topic['nid']}" }
      }
    end
  end

I don’t care too much about importing the actual poll results; that would require re-coding the entire algorithm Drupal uses to tally up all the votes and eliminate duplicates. I mainly just want to import the follow-up comments in the poll threads. But just in case anyone wants to see the original poll results, I made it write out a direct link to the original forum node in the Wayback Machine.


So the code is not at all elegant and probably isn’t very efficient, but for a one-shot deal that should get the job done.

Sorry for the walls of code, let me know if that irritates anyone and I can move them to a pastebin URL.

1 Like

And that’s how most of them go. This topic is a great exemplar for someone else to follow (given that they start with a similar skill set to yours).

Do you have an estimate of how much time you spent doing your customizations?

Congrats!

1 Like

Thanks Jay! Appreciate the encouragement.

Ugh, I’d prefer to not think about that. :stuck_out_tongue_winking_eye: It was probably upwards of 15 or 20 hours after you put me on the right path with the SQL query.

I’d like to pick your brain on this if you have any thoughts:

It took around 70 hours to do a complete trial run with production data on a very powerful VPS, and I’d like to get my users interacting again ASAP, even if the posts and PMs import is still incomplete. Another alternative I thought about would be to disable the preprocess_posts function, which I heavily modified with additional gsub regexp replacements and to pass all the posts and PMs through Pandoc with one of two different commands, depending on whether the original post was Textile markup or pure HTML. If I disable the entire preprocess_posts routine it would probably cut the import time almost in half, and then I could add all that formatting work into the postprocess_posts section once all the raw data is imported. The downside is that, after the fact, I wouldn’t be able to easily access the original database column that shows the source format (Textile or HTML) for each post, which my Pandoc handling branches on. Or could I add a custom field to each post labeling it as textile or html and then retrieve that later during post-processing? Dunno, just thinking out loud here.
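For what it’s worth, that last idea should be workable with post custom fields, the same mechanism import_id already uses. A rough, untested sketch (original_format is a made-up field name; the format column is whatever the Drupal body table exposes):

    # Inside the create_posts block during the fast import pass, alongside the
    # other mapped fields -- stash the Drupal format column on the post:
    #   custom_fields: { import_id: "cid:#{row['cid']}", original_format: row['format'] }
    #
    # Then, in a later post-processing pass:
    Post.find_each do |post|
      format = post.custom_fields['original_format']
      next if format.blank?
      # run the Textile or HTML Pandoc conversion here, depending on `format`
    end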

When you run the import script again with just the new data it will run much faster, since it won’t be importing the old data again. So it’ll take only a few hours, and each subsequent run will be faster since there will be less data to import.

You can then speed that up by modifying the queries to return only data newer than a certain time. Most scripts I have touched have an import_after setting for just this purpose (but also to allow faster development by importing a small subset of the data).
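The pattern is typically just a cut-off constant plus one extra WHERE condition per query. Roughly (the date and the table are illustrative; this mirrors the IMPORT_AFTER filter used later in this thread):

    IMPORT_AFTER = "2023-06-01"   # only pull rows newer than this

    new_count = mysql_query(
      "SELECT COUNT(*) count FROM forum_index
        WHERE created > UNIX_TIMESTAMP(STR_TO_DATE('#{IMPORT_AFTER}', '%Y-%m-%d'))"
    ).first['count']
    puts "rows to import since #{IMPORT_AFTER}: #{new_count}"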

1 Like

Oh… I never even thought about running the importer more than once on the production data. Thanks a lot for that idea.

So basically I would do a production Drupal database dump and run a full import for however long that takes while leaving the Drupal forum open to the public, then I would put the Drupal forum in offline mode and do another database dump and load that one into the importer MySQL instance and re-run the importer script?

The only disadvantage I can see there is that any Drupal post edits, post deletions, or existing user profile changes during that interval (let’s call it 3 days) between the initial import and the second import would not get imported; only completely new users and posts from that interval would, correct?

That’s right. Data that are edited will be lost. The alternative is taking the forum offline for that time. You could warn people. In practice, no one has complained.

1 Like

Cool, that sounds like a pretty reasonable compromise to minimize downtime. Thanks again for the idea.

1 Like

Just after I committed to the final production import run I was randomly checking some of the imported forum threads when I discovered a big issue. As has been the recurring theme here, it’s not really the fault of Discourse or the importer script. It turns out that when a topic reply (“comment” in Drupal parlance) is edited, it sometimes changes the created timestamp; as far as I can tell it should change the changed timestamp instead. Despite this bug, Drupal itself still keeps the comment threading in order. But the Drupal importer script for Discourse apparently ends up ordering the replies by their created timestamp (although I don’t see any ORDER clauses in the drupal.rb script). From my testing in phpMyAdmin on the Drupal database it looks like it simply needs an ORDER BY c.cid ASC to maintain the correct threading order via the original Drupal comment ID cid, which is sequential and never changes. But I’m not sure whether the Discourse importer will allow sequential replies to have dates that are out of order, and/or whether it will go ahead and do its own sorting by post date. I’d like to get the opinion of the original creator of the drupal.rb importer (and anybody else too, obviously): would this work, and would there be any unintended consequences?
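For anyone wanting to gauge how widespread this is before changing the ordering, something like this should list the affected nodes: comments that were posted later (higher cid) but carry an earlier created timestamp than another comment on the same node (illustrative, and assumes the standard Drupal 7 comment table):

    out_of_order = mysql_query(<<-SQL
      SELECT DISTINCT c1.nid
        FROM comment c1
        JOIN comment c2
          ON c1.nid = c2.nid
         AND c1.cid > c2.cid
         AND c1.created < c2.created
    SQL
    ).to_a
    puts "nodes with out-of-order comment timestamps: #{out_of_order.size}"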

I think (but don’t really know; Drupal’s database makes no sense to me) that you can adjust the query to pull the created timestamp from the original post and not the edited one.

I think that there is a table that contains the original post/time and another with the edits.

I think that ordering by cid won’t hurt anything.

1 Like

It really doesn’t make any sense. The problem is this: the two highlighted rows should be the 2nd and 3rd replies in the thread, but in Discourse they end up somewhere after the 500th position, because that’s where their timestamps place them.

And of course Drupal decided to set the created and changed to the same thing… It only happens sometimes; I can’t reproduce it myself by editing old posts. But I have this same problem in multiple long-running threads where the OP posted the topic and then immediately posted one or two follow-up comments with “reserved” in the body for adding additional information later, which he did several years later.

That would make sense, but here’s the same problematic edited comment spread across the two tables it uses:


1 Like

OK, it looks like this works, it fixes the two jumbled threads I found and doesn’t seem to hurt anything else.

  def import_replies

    batches(BATCH_SIZE) do |offset|
      results = mysql_query(<<-SQL
        SELECT c.cid, c.pid, c.nid, c.uid, c.created,
               f.comment_body_value body,
               f.comment_body_format format
          FROM comment c,
               field_data_comment_body f,
               node n
         WHERE c.cid = f.entity_id
           AND n.nid = c.nid
           AND c.status = 1
           AND n.type IN ('poll', 'forum')
           AND n.status = 1
           AND c.created > UNIX_TIMESTAMP(STR_TO_DATE('#{IMPORT_AFTER}', '%Y-%m-%d'))
         ORDER BY c.cid ASC  #<--- Fixed
         LIMIT #{BATCH_SIZE}
        OFFSET #{offset}
      SQL
      ).to_a

1 Like

Hmm, it looks like this came back to bite me. Since the postprocess_posts function replaces old internal links with the new Discourse URL, I had made an exception in the code for the https://web.archive.org/web/20230101093741/https://MyOldForum.com/node/98765 links that my importer created for the old Drupal polls in the Wayback Machine. But apparently something went wrong, because I just noticed in the production migrated site that the links ended up like https://web.archive.org/web/20230101093741/https://MyOldForum.com/t/-/12345 .

So now that I’m no longer in the context of a migration container, is the custom field with the original Drupal node nid still available in the Discourse topics DB table? If so, it seems possible to do a string replacement in the Rails console on all topics whose first post contains View this poll on the Wayback Machine, and then replace
https://web.archive.org/web/20230101093741/https://MyOldForum.com/t/-/[01234567890]*
with
https://web.archive.org/web/20230101093741/http://MyOldForum.com/node/$original_nid

Here’s my original poll import function:

  def import_poll_topics
    puts '', "importing poll topics"

    polls = mysql_query(<<-SQL
      SELECT n.nid nid, n.title title, n.uid uid, n.created created, n.sticky sticky, taxonomy_index.tid tid, node_counter.totalcount views
        FROM node n
        LEFT JOIN taxonomy_index ON n.nid = taxonomy_index.nid
        LEFT JOIN node_counter ON n.nid = node_counter.nid
       WHERE n.type = 'poll'
         AND n.status = 1
    SQL
    ).to_a

    create_posts(polls) do |topic|
      {
        id: "nid:#{topic['nid']}",
        user_id: user_id_from_imported_user_id(topic['uid']) || -1,
        category: category_id_from_imported_category_id(topic['tid']),
        # Use TEMPmyoldforum.com or else postprocess_posts() will try to convert the Wayback Machine /node/YYY link
        raw: "### View this poll on the Wayback Machine:\n**https://web.archive.org/web/20230101093741/http://TEMPmyoldforum.com/node/#{topic['nid']}**",
        created_at: Time.zone.at(topic['created']),
        pinned_at: topic['sticky'].to_i == 1 ? Time.zone.at(topic['created']) : nil,
        title: topic['title'].try(:strip),
        views: topic['views'],
        custom_fields: { import_id: "nid:#{topic['nid']}" }
      }
    end
  end

Yes.

    t = Topic.find(1234)
    t.custom_fields
    t.custom_fields['import_id']
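So a console pass along these lines might do it (untested sketch; it assumes the poll topics kept their nid:… import_id as above, that the broken links all share that one Wayback prefix, and that rewriting raw plus a rebake is acceptable):

    prefix = "https://web.archive.org/web/20230101093741"

    Topic.find_each do |t|
      import_id = t.custom_fields['import_id'].to_s
      next unless import_id =~ /\Anid:(\d+)\z/
      nid = Regexp.last_match(1)

      post = t.first_post
      next unless post && post.raw.include?("#{prefix}/https://MyOldForum.com/t/")

      # Rewrite the mangled /t/ link back to the original Wayback /node/ URL.
      fixed = post.raw.gsub(%r{#{Regexp.escape(prefix)}/https://MyOldForum\.com/t/[^\s*]+},
                            "#{prefix}/http://MyOldForum.com/node/#{nid}")
      next if fixed == post.raw

      post.update_columns(raw: fixed)
      post.rebake!
    end
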
1 Like