Large Drupal forum migration, importer errors and limitations

Hi there, this topic gives some background on the migration that I’m slowly planning and testing. I finally tried the Drupal importer last Friday on a testbed VPS using a combination of this and this. The importer is still running as I type this, so I haven’t been able to actually test the functionality of the test site yet, but it should be finished soon.

The biggest issue I’m facing is a “duplicate key value” error on 8 apparently random nodes (the equivalent of topics in Discourse) out of ~80,000 total nodes. These are the specific nid numbers, just in case there’s some really weird Y2K-esque math bug at play:

42081, 53125, 57807, 63932, 66756, 76561, 78250, 82707

This same error always happens on those same nids when re-running the importer:

Traceback (most recent call last):
	19: from script/import_scripts/drupal.rb:537:in `<main>'
	18: from /var/www/discourse/script/import_scripts/base.rb:47:in `perform'
	17: from script/import_scripts/drupal.rb:39:in `execute'
	16: from script/import_scripts/drupal.rb:169:in `import_forum_topics'
	15: from /var/www/discourse/script/import_scripts/base.rb:916:in `batches'
	14: from /var/www/discourse/script/import_scripts/base.rb:916:in `loop'
	13: from /var/www/discourse/script/import_scripts/base.rb:917:in `block in batches'
	12: from script/import_scripts/drupal.rb:195:in `block in import_forum_topics'
	11: from /var/www/discourse/script/import_scripts/base.rb:224:in `all_records_exist?'
	10: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activerecord-7.0.3.1/lib/active_record/transactions.rb:209:in `transaction'
	 9: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activerecord-7.0.3.1/lib/active_record/connection_adapters/abstract/database_statements.rb:316:in `transaction'
	 8: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activerecord-7.0.3.1/lib/active_record/connection_adapters/abstract/transaction.rb:317:in `within_new_transaction'
	 7: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activesupport-7.0.3.1/lib/active_support/concurrency/load_interlock_aware_monitor.rb:21:in `synchronize'
	 6: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activesupport-7.0.3.1/lib/active_support/concurrency/load_interlock_aware_monitor.rb:21:in `handle_interrupt'
	 5: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activesupport-7.0.3.1/lib/active_support/concurrency/load_interlock_aware_monitor.rb:25:in `block in synchronize'
	 4: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activesupport-7.0.3.1/lib/active_support/concurrency/load_interlock_aware_monitor.rb:25:in `handle_interrupt'
	 3: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activerecord-7.0.3.1/lib/active_record/connection_adapters/abstract/transaction.rb:319:in `block in within_new_transaction'
	 2: from /var/www/discourse/script/import_scripts/base.rb:231:in `block in all_records_exist?'
	 1: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/rack-mini-profiler-3.0.0/lib/patches/db/pg.rb:56:in `exec'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/rack-mini-profiler-3.0.0/lib/patches/db/pg.rb:56:in `exec': ERROR:  duplicate key value violates unique constraint "import_ids_pkey" (PG::UniqueViolation)
DETAIL:  Key (val)=(nid:42081) already exists.
	20: from script/import_scripts/drupal.rb:537:in `<main>'
	19: from /var/www/discourse/script/import_scripts/base.rb:47:in `perform'
	18: from script/import_scripts/drupal.rb:39:in `execute'
	17: from script/import_scripts/drupal.rb:169:in `import_forum_topics'
	16: from /var/www/discourse/script/import_scripts/base.rb:916:in `batches'
	15: from /var/www/discourse/script/import_scripts/base.rb:916:in `loop'
	14: from /var/www/discourse/script/import_scripts/base.rb:917:in `block in batches'
	13: from script/import_scripts/drupal.rb:195:in `block in import_forum_topics'
	12: from /var/www/discourse/script/import_scripts/base.rb:224:in `all_records_exist?'
	11: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activerecord-7.0.3.1/lib/active_record/transactions.rb:209:in `transaction'
	10: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activerecord-7.0.3.1/lib/active_record/connection_adapters/abstract/database_statements.rb:316:in `transaction'
	 9: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activerecord-7.0.3.1/lib/active_record/connection_adapters/abstract/transaction.rb:317:in `within_new_transaction'
	 8: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activesupport-7.0.3.1/lib/active_support/concurrency/load_interlock_aware_monitor.rb:21:in `synchronize'
	 7: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activesupport-7.0.3.1/lib/active_support/concurrency/load_interlock_aware_monitor.rb:21:in `handle_interrupt'
	 6: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activesupport-7.0.3.1/lib/active_support/concurrency/load_interlock_aware_monitor.rb:25:in `block in synchronize'
	 5: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activesupport-7.0.3.1/lib/active_support/concurrency/load_interlock_aware_monitor.rb:25:in `handle_interrupt'
	 4: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/activerecord-7.0.3.1/lib/active_record/connection_adapters/abstract/transaction.rb:319:in `block in within_new_transaction'
	 3: from /var/www/discourse/script/import_scripts/base.rb:243:in `block in all_records_exist?'
	 2: from /var/www/discourse/script/import_scripts/base.rb:243:in `ensure in block in all_records_exist?'
	 1: from /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/rack-mini-profiler-3.0.0/lib/patches/db/pg.rb:56:in `exec'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/rack-mini-profiler-3.0.0/lib/patches/db/pg.rb:56:in `exec': ERROR:  current transaction is aborted, commands ignored until end of transaction block (PG::InFailedSqlTransaction)

The only way I could make it proceed was by hacking the SQL conditions:

...
          LEFT JOIN node_counter nc ON nc.nid = n.nid
         WHERE n.type = 'forum'
           AND n.status = 1
           AND n.nid NOT IN (42081, 53125, 57807, 63932, 66756, 76561, 78250, 82707)
         LIMIT #{BATCH_SIZE}
        OFFSET #{offset};
...

I inspected the first failed node, as well as the nids immediately before and after it, in the source Drupal database, and I can’t see anything wrong. The nid column is the primary key with AUTO_INCREMENT, and the original Drupal site works fine, so there can’t be any fundamental integrity issue with the source database.


Apart from the above bug, these are the limitations I’m running into with the script:

  1. Permalinks: It looks like the importer script will create permalinks for the former node URLs example.com/node/XXXXXXX. But I also need to maintain links to specific comments within those nodes, which have the format example.com/comment/YYYYYYY#comment-YYYYYYY (YYYYYYY is the same in both occurrences). The Drupal URL scheme does not include the node ID that the comment is associated with, whereas Discourse does (example.com/t/topic-keywords/XXXXXXX/YY), so that looks like a major complication.

  2. Username limitations: Drupal allows spaces in usernames. I understand that Discourse does not, or at least it doesn’t allow new users to create them that way. This post suggests that the importer script will automatically “convert” the problematic usernames, but I can’t see any code for that in /import_scripts/drupal.rb. Update: it actually looks like Discourse handled this automatically and correctly.

  3. Banned users: It appears that the script imports all users, including banned accounts. I could probably add a WHERE status = 1 condition to the SQL SELECT fairly easily to import only active accounts, but I’m not sure whether that would cause issues with the serialization of the records. Above all, I would prefer to keep the previously banned usernames and their associated email addresses permanently blocked, so the same problem users can’t sign up again on Discourse.

  4. User profile fields: Does anybody know if there is example code in another one of the importers for importing personal information fields from user account profiles? I have just one profile field (“Location”) that I need to import.

  5. Avatars (not Gravatars): It seems kind of strange that there is code in the Drupal importer to import Gravatars but none for the far more commonly used local account avatar pictures.

  6. Private messages: Almost all Drupal 7 forums are probably using the third-party privatemsg module (there is no official Drupal PM functionality), but the importer doesn’t support importing PMs. In my case I need to import about 1.5M of them.

Thanks in advance for your help and for making the Drupal importer script available.

This set of issues is pretty much par for the course for a large import. Whoever the script was originally written for didn’t care about (or perhaps didn’t notice) the issues you describe.

That sounds like a bug in Drupal or the database itself (duplicate ids shouldn’t happen). I would likely have modified the script to test for or catch the error when there are duplicates, but your way worked (unless more duplicates turn up).

You can look at other import scripts that create post permalinks. The import_id is in the PostCustomField of each post.
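For example (rails console; the nid:/cid: prefixes are just what the Drupal script happens to store):

    post = Post.last
    post.custom_fields['import_id']  # => e.g. "nid:12345" for a topic, "cid:67890" for a reply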

It’s either in base.rb or the username suggester. It mostly just works and there isn’t much you can do to change it.

You probably don’t want to do that. The issue is that the posts created by those users would then be owned by the system user. You can look at other scripts for examples of how to deactivate them; fluxbb has a suspend_users function, which should help.

fluxbb (which I happen to be working on now) does that. You just add something like this to the import user script:

          location: user['location'],

Gravatars are handled by Discourse core, so the script does nothing to import them; they just work. You can grep the other scripts for “avatar” to find examples of how to import local avatars.

Look for examples; ipboard has import_private_messages.


Thanks for the reply. I don’t think this is a problem with the Drupal database, because I inspected the source database and I can’t find any duplicate nid keys.

Ahhh, so it has that functionality outside of drupal.rb. Now that I’m poking around the test import site, it actually looks like it handled the username conversions very well. Thanks!


What would be the easiest way to enable importing Unicode usernames (without converting them, i.e. keeping the username Narizón instead of converting it to Narizon)?

I did my first test of the Drupal importer on an instance with no web GUI configured, so I hadn’t set the Discourse option to allow Unicode usernames. If that had been set, would the importer have respected it? What’s the recommended way to enable this when I run my production migration?

And meanwhile, for my current testbed instance, is there any rake command to apply the full name to the username? (I already activated prioritize username in ux, but since my test users are accustomed to Drupal, which only supports usernames for login [not email addresses], I think it would be best to maintain their production usernames, which at least were preserved in the full name field.)

Probably?

You can set the site setting at the beginning of the script.
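For the Drupal script that would be something like this at the top of execute (untested sketch; it assumes your Discourse version has the unicode_usernames site setting):

      def execute
        # Allow Unicode usernames before any users are created
        SiteSetting.unicode_usernames = true

        import_users
        # ... the rest of the existing import steps
      end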

I think changing usernames is a bad idea, but if you don’t like them you could change what gets passed to the username generator.

Thanks, you mean changing them after the import is completed?

I think I mean changing them at all, unless on the old system usernames were invisible and users saw only real names.

If the latter is the case, then I would change the script to make the username their real name. The problem with that is that if they don’t know their email address, they won’t be able to find their account.

Gotcha. On the Drupal forum there are only system usernames and no separate real names. And additionally Drupal doesn’t allow logging in with the email address, just the username. So that’s why it’s very important in my case to maintain the usernames as much as possible. (There will still be some converted usernames, such as the ones with spaces.) So I need to look into how to set Discourse settings at the beginning of the import script.

But Discourse does, so if they know their email address, they can use that to reset the password, which is probably what you should tell everyone to do, since you can’t guess who can’t guess their username, I guess.

I think what I’d do is set SiteSetting.unicode_usernames = true in the import script and run it again to see if it works. You might contrive to test it in the rails console first. This might tell you:

  User.create(username: 'Narizón', email: 'user@x.com', password: 'xxx', active: true)

Well, I think that might not call the username creator thingy, so you’ll need to call it yourself:

  UserNameSuggester.suggest("Narizón")

No. That still doesn’t get you a unicode username. You’ll need to find the UserNameSuggester and tweak it, I guess.

But if you really want to change the usernames, changing them now rather than fixing the script may be what you want to do. You need to make sure that the way you do it updates the username in all of the posts. If you’re using a rake task, it’ll definitely do that.


Excellent, thanks very much Jay! I’ll try this the next time I run the importer.

I don’t think you should bother:

That’s in lib/user_name_suggester.rb, but maybe you want User.normalize_username
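A quick way to check both in the rails console before committing to another long run (sketch; assumes your build has the unicode_usernames setting):

    SiteSetting.unicode_usernames = true
    UserNameSuggester.suggest("Narizón")  # should keep the ó rather than suggesting "Narizon"
    User.normalize_username("Narizón")    # the normalized form Discourse uses for uniqueness checks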


Sure enough, you were right. It’s not even a bug per se; it turned out to be a weird way that Drupal handles moved topics while leaving a breadcrumb in the former topic category. It simply creates a duplicate row in one of the many tables that all get pulled in to form what eventually becomes a complete Drupal topic. So it looks like I need to figure out how to apply DISTINCT to only one of the tables that gets selected…


Yeah. It’s amazing how every import is a snowflake, and somehow yours is the first forum to have had that issue (of course, many people might have solved the problem and not managed to submit a PR with the update). Or maybe they ignored the errors?

Aha. I suspect it’s not a very commonly used function: when a thread is moved to a new category, there is an optional checkbox to leave a “Moved to…” link in the old category.

The offending duplicate is in the nid column from forum_index. So it looks like I can fix it with a GROUP BY nid, right?

        SELECT fi.nid nid,
               fi.title title,
               fi.tid tid,
               n.uid uid,
               fi.created created,
               fi.sticky sticky,
               f.body_value body,
               nc.totalcount views
          FROM forum_index fi
          LEFT JOIN node n ON fi.nid = n.nid
          LEFT JOIN field_data_body f ON f.entity_id = n.nid
          LEFT JOIN node_counter nc ON nc.nid = n.nid
         WHERE n.type = 'forum'
           AND n.status = 1
         GROUP BY nid

It looks promising: when I run the query with the GROUP BY nid, it returns 8 fewer rows.

That might work. I’d think that there would be some value in that table indicating it had been moved, so you could select only the rows without that value.

That would definitely be the most logical way to design it. I guess it’s a Drupal thing…

The only thing the move changes is the tid (category ID). That follows the pattern I’ve learned during this ordeal with the Drupal database: I don’t know much about DB design, but I get the impression that you can either explicitly store data or leave some things implicit and figure them out via programmatic logic, and Drupal falls squarely into the latter camp.
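In case it helps anyone else, this one-off check (a hypothetical snippet using the script’s existing mysql_query helper) confirms which nids are duplicated in forum_index:

    # List nids that appear more than once in forum_index
    mysql_query(<<-SQL
      SELECT nid, COUNT(*) c
        FROM forum_index
    GROUP BY nid
      HAVING COUNT(*) > 1
    SQL
    ).each { |r| puts "nid #{r['nid']} appears #{r['c']} times" }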


Well, it looks like I’m almost there. Thanks a lot to Jay for the guidance.

Thanks, this was key. It was actually as simple as copying the permalink section of the Drupal import script itself and changing it to run on posts instead of topics:

    ## I added permalinks for each Drupal comment (reply) link: /comment/DIGITS#comment-DIGITS
    Post.find_each do |post|
      begin
        pcf = post.custom_fields
        if pcf && pcf['import_id']
          cid = pcf['import_id'][/cid:(\d+)/, 1]
          next unless cid # topics have nid: import_ids, so only handle comments here
          slug = "/comment/#{cid}" # The #comment-DIGITS part breaks the permalink and isn't needed
          Permalink.create(url: slug, post_id: post.id)
        end
      rescue => e
        puts e.message
        puts "Permalink creation failed for post #{post.id}"
      end
    end

I was stuck for a while because my original attempt included the relative page fragment #comment-DIGITS from the original Drupal link, which completely breaks the permalink in Discourse. Then I realized that of course the # part of a link doesn’t actually get passed to the webserver; it was only needed for Drupal to scroll to the part of the page where the specific comment was located. So it works fine without that part in Discourse, even when coming from an external web page with an old /comment/YYYYYY#comment-YYYYYY link: the permalink /comment/YYYYYY redirects to /t/topic-title-words/123456/X, the URL bar shows /t/topic-title-words/123456/X#comment-YYYYYY, and Discourse doesn’t appear to care about the bogus #comment-YYYYYY fragment.

For some forums, I suspect the stock Drupal importer’s postprocess_posts function might actually be enough. It should be noted that it needs to be adjusted for each forum; there’s a rather sloppy hard-coded regexp replace for site.com → community.site.com. But after adjusting that, it does a good job of rewriting internal forum links for nodes → topics as well as comments → replies. I do, however, have a fair number of external websites linking to individual comments (replies) on my forum, and it’s worth preserving those. Plus, Google indexes most of the 1.7M /comment/YYYYYY URLs, and it would probably hurt my ranking if those all disappeared. I hope it won’t cause any problems for Discourse to have ~2M permalinks, though?


Thanks a lot, I lifted that function almost without modifications, just had to adjust a few column names. Works great.

  def suspend_users
    puts '', "updating banned users"

    banned = 0
    failed = 0
    total = mysql_query("SELECT COUNT(*) AS count FROM users WHERE status = 0").first['count']

    system_user = Discourse.system_user

    mysql_query("SELECT name username, mail email FROM users WHERE status = 0").each do |b|
      user = User.find_by_email(b['email'])
      if user
        user.suspended_at = Time.now
        user.suspended_till = 200.years.from_now

        if user.save
          StaffActionLogger.new(system_user).log_user_suspend(user, "banned during initial import")
          banned += 1
        else
          puts "Failed to suspend user #{user.username}. #{user.errors.try(:full_messages).try(:inspect)}"
          failed += 1
        end
      else
        puts "Not found: #{b['email']}"
        failed += 1
      end

      print_status banned + failed, total
    end
  end

Also worked! I did have to deal with Drupal’s diffuse DB schema and add LEFT JOIN profile_value location ON users.uid = location.uid to pull in the separate table that contains the profile data, but it’s very cool that it’s so easy to add on the Discourse side of things. It’s worth noting that this step runs about 50% slower than stock, which I suspect is due to the LEFT JOIN. But I can live with that, as I only have about 80K users.


This was fairly hard, once again due to Drupal’s disjointed database schema. I ended up using jforum.rb as the basis, with a little help from the Vanilla importer too. The original script was rather paranoid about checking at every step that the avatar filename isn’t null, so I removed most of those checks to make the code less messy. The worst that can happen is that the script crashes, and with the SQL query I used I don’t think even that can go wrong.

  def import_users
    puts "", "importing users"

    user_count = mysql_query("SELECT count(uid) count FROM users").first["count"]

    last_user_id = -1

    batches(BATCH_SIZE) do |offset|
      users = mysql_query(<<-SQL
          SELECT users.uid,
                 name username,
                 mail email,
                 created,
                 picture,
                 location.value location
            FROM users
            LEFT JOIN profile_value location ON users.uid = location.uid
           WHERE users.uid > #{last_user_id}
        ORDER BY uid
           LIMIT #{BATCH_SIZE}
      SQL
      ).to_a

      break if users.empty?

      last_user_id = users[-1]["uid"]

      users.reject! { |u| @lookup.user_already_imported?(u["uid"]) }

      create_users(users, total: user_count, offset: offset) do |row|
        # Look up the avatar filename only if the user has one (picture holds a fid, 0 if unset)
        avatar = nil
        if row["picture"].to_i > 0
          q = mysql_query("SELECT filename FROM file_managed WHERE fid = #{row['picture']};").first
          avatar = q["filename"] if q
        end

        email = row["email"].presence || fake_email
        email = fake_email if !EmailAddressValidator.valid_value?(email)

        username = @htmlentities.decode(row["username"]).strip

        {
          id: row["uid"],
          name: username,
          email: email,
          location: row["location"],
          created_at: Time.zone.at(row["created"]),
          post_create_action: proc do |user|
            import_avatar(user, avatar)
          end
        }
      end
    end
  end

  def import_avatar(user, avatar_source)
    return if avatar_source.blank?

    path = File.join(ATTACHMENT_DIR, avatar_source)
    @uploader.create_avatar(user, path)
  end

After your paid help with the SQL query, I ended up trying to hack it into the scripts for Discuz, IPBoard, and Xenforo. I kept hitting dead ends with each one. I got closest with the Discuz model, which appears to have a very similar database schema, but I couldn’t get past a bug with the @first_post_id_by_topic_id instance variable. After tons of trial and error, I finally realized that the Discuz script initializes it at the beginning, in its initialize method; I put the same thing in the corresponding place in the Drupal script, and this finally fixed it:

  def initialize
    super
    
    @first_post_id_by_topic_id = {}

    @htmlentities = HTMLEntities.new

    @client = Mysql2::Client.new(
      host: "172.17.0.3",
      username: "user",
      password: "pass",
      database: DRUPAL_DB
    )
  end

def import_private_messages
  puts '', 'creating private messages'

  pm_indexes = 'pm_index'
  pm_messages = 'pm_message'
  total_count = mysql_query("SELECT count(*) count FROM #{pm_indexes}").first['count']

  batches(BATCH_SIZE) do |offset|
    results = mysql_query("
        SELECT pi.mid id,
               thread_id,
               pi.recipient to_user_id,
               pi.deleted deleted,
               pm.author user_id,
               pm.subject subject,
               pm.body message,
               pm.format format,
               pm.timestamp created_at
          FROM pm_index pi
          LEFT JOIN pm_message pm ON pi.mid = pm.mid
         WHERE deleted = 0
         LIMIT #{BATCH_SIZE}
        OFFSET #{offset};")

    break if results.size < 1

    # next if all_records_exist? :posts, results.map { |m| "pm:#{m['id']}" }

    create_posts(results, total: total_count, offset: offset) do |m|
      skip = false
      mapped = {}
      mapped[:id] = "pm:#{m['id']}"
      mapped[:user_id] = user_id_from_imported_user_id(m['user_id']) || -1
      mapped[:raw] = preprocess_raw(m['message'], m['format'])
      mapped[:created_at] = Time.zone.at(m['created_at'])
      thread_id = "pm_#{m['thread_id']}"

      if is_first_pm(m['id'], m['thread_id'])
        mapped[:title] = m['subject']
        mapped[:archetype] = Archetype.private_message

        # Find the users who are part of this private message.
        import_user_ids = mysql_query("
              SELECT thread_id plid, recipient user_id
                FROM pm_index
               WHERE thread_id = #{m['thread_id']};
            ").map { |r| r['user_id'] }.uniq
        mapped[:target_usernames] = import_user_ids.map! do |import_user_id|
          import_user_id.to_s == m['user_id'].to_s ? nil : User.find_by(id: user_id_from_imported_user_id(import_user_id)).try(:username)
        end.compact
        if mapped[:target_usernames].empty? # pm with yourself?
          skip = true
          puts "Skipping pm:#{m['id']} due to no target"
        else
          @first_post_id_by_topic_id[thread_id] = mapped[:id]
        end
      else
        parent = topic_lookup_from_imported_post_id(@first_post_id_by_topic_id[thread_id])
        if parent
          mapped[:topic_id] = parent[:topic_id]
        else
          puts "Parent post pm thread:#{thread_id} doesn't exist. Skipping #{m["id"]}: #{m["message"][0..40]}"
          skip = true
        end
      end
      skip ? nil : mapped
    end
  end
end

# Find the first pm id for the series of pms in a thread
def is_first_pm(pm_id, thread_id)
  result = mysql_query("
        SELECT mid id
          FROM pm_index
         WHERE thread_id = #{thread_id}
      ORDER BY id")
  result.first['id'].to_s == pm_id.to_s
end

Oh, and most of these queries also require running this in the MySQL container to disable a strict-mode SQL sanity check:

    mysql -u root -ppass -e "SET GLOBAL sql_mode=(SELECT REPLACE(@@sql_mode,'ONLY_FULL_GROUP_BY',''));"
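An alternative I didn’t test: MySQL 5.7+ has ANY_VALUE(), which lets a GROUP BY query pass the ONLY_FULL_GROUP_BY check without changing the global sql_mode, e.g.:

    SELECT fi.nid nid,
           ANY_VALUE(fi.title) title,
           ANY_VALUE(fi.tid) tid
      FROM forum_index fi
  GROUP BY fi.nid;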


Another thing I realized was missing: a few thousand Drupal nodes of type poll. I first tried to just add WHERE type = 'forum' OR type = 'poll' to the import_forum_topics function, but there is some seriously janky stuff going on in the original Drupal database that causes it to miss many of them. So I ended up copying import_forum_topics into a new import_poll_topics function:

  def import_poll_topics
    puts '', "importing poll topics"

    polls = mysql_query(<<-SQL
      SELECT n.nid nid, n.title title, n.uid uid, n.created created, n.sticky sticky, taxonomy_index.tid tid, node_counter.totalcount views
        FROM node n
        LEFT JOIN taxonomy_index ON n.nid = taxonomy_index.nid
        LEFT JOIN node_counter ON n.nid = node_counter.nid
       WHERE n.type = 'poll'
         AND n.status = 1
    SQL
    ).to_a

    create_posts(polls) do |topic|
      {
        id: "nid:#{topic['nid']}",
        user_id: user_id_from_imported_user_id(topic['uid']) || -1,
        category: category_id_from_imported_category_id(topic['tid']),
        raw: "### You can see the archived poll results on the Wayback Machine:\n**https://web.archive.org/web/1234567890/http://myforum.com/node/#{topic['nid']}**",
        created_at: Time.zone.at(topic['created']),
        pinned_at: topic['sticky'].to_i == 1 ? Time.zone.at(topic['created']) : nil,
        title: topic['title'].try(:strip),
        views: topic['views'],
        custom_fields: { import_id: "nid:#{topic['nid']}" }
      }
    end
  end

I don’t care too much about importing the actual poll results, since that would require re-coding the entire algorithm that Drupal uses to tally up the votes and eliminate duplicates. I mainly just want to import the follow-up comments in the poll threads. But just in case anyone wants to see the original poll results, I made it write out a direct link to the original forum node on the Wayback Machine.


So the code is not at all elegant and probably isn’t very efficient, but for a one-shot deal it should get the job done.

Sorry for the walls of code, let me know if that irritates anyone and I can move them to a pastebin URL.


And that’s how most of them go. This topic is a great exemplar for someone else to follow (given that they start with a similar skill set to yours).

Do you have an estimate of how much time you spent doing your customizations?

Congrats!


Thanks Jay! Appreciate the encouragement.

Ugh, I’d prefer not to think about that. :stuck_out_tongue_winking_eye: It was probably upwards of 15 or 20 hours after you put me on the right path with the SQL query.

I’d like to pick your brain on this if you have any thoughts:

It took around 70 hours to do a complete trial run with production data on a very powerful VPS. I’d like to get my users interacting again ASAP, even if the posts and PMs import is still incomplete. An alternative idea I’ve been considering is to disable the preprocess_posts function, which I heavily modified with additional gsub regexp replacements and with a pass of every post and PM through Pandoc, using one of two different commands depending on whether the original post was Textile markup or pure HTML. Disabling the entire preprocess_posts routine would probably cut the import time almost in half, and I could then move all that formatting work into the postprocess_posts section once all the raw data is imported. The downside is that, after the fact, I wouldn’t have easy access to the original database column that records the source format (Textile or HTML) of each post, which my Pandoc step branches on. Or could I add a custom field to each post labeling it as Textile or HTML and then retrieve that later during post-processing? Dunno, just thinking out loud here.
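Sketching out that last idea (untested; source_format is a made-up custom field name):

    # During import, stash the original Drupal text format on each post, e.g.:
    #   custom_fields: { import_id: "cid:#{row['cid']}", source_format: row['format'] }
    #
    # Then the post-processing pass can read it back and pick the right Pandoc command:
    Post.find_each do |post|
      fmt = post.custom_fields['source_format']
      next if fmt.blank?
      # run the Textile or HTML Pandoc pipeline here depending on fmt
    end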

When you run the import script again with just the new data, it will run much faster, since it won’t be importing the old data again. So it’ll take only a few hours. And each subsequent run will be faster still, since there will be less data to import.

You can speed that up further by modifying the queries to return only data newer than a certain time. Most scripts I have touched have an import_after setting for just this purpose (but also to allow faster development by importing a small subset of the data).
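For the Drupal script that could look something like this (sketch; IMPORT_AFTER is a constant you’d add yourself, and Drupal’s created column is epoch seconds):

    IMPORT_AFTER = Time.parse('2022-01-01').to_i

    # each SELECT then gains one extra condition, e.g.:
    #   WHERE n.type = 'forum'
    #     AND n.status = 1
    #     AND n.created > #{IMPORT_AFTER}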
