Oh… I never even thought about running the importer more than once on the production data. Thanks a lot for that idea.
So basically I would do a production Drupal database dump and run a full import for however long that takes while leaving the Drupal forum open to the public, then I would put the Drupal forum in offline mode and do another database dump and load that one into the importer MySQL instance and re-run the importer script?
The only disadvantage I can see there is that any Drupal post edits or post deletions or existing user profile changes during that interval (lets call it 3 days) from initial import until the second import would not get imported, only completely new users and posts during that interval, correct?
That’s right. Data that are edited will be lost. The alternative is taking the forum offline for that time. You could warn people. In practice, no one has complained.
Just after I committed to the final production import run I was randomly checking some of the imported forum threads when I discovered a big issue. As has been the recurring theme here, it’s not really the fault of Discourse or the importer script. But it turns out that when a topic reply (“comment” in Drupal parlance) is edited it sometimes changes the created timestamp. As far as I can tell it should change the changed timestamp instead. But despite this bug in Drupal it still keeps the comment threading in order. But the way the Drupal importer script for Discourse works it apparently sorts the replies by their created timestamp (although I don’t see any ORDER clauses in the drupal.rb script). From my testing in phpMyAdmin on the Drupal database it looks like it simply needs an ORDER BY c.cid ASC to maintain the correct threading order via the original Drupal comment ID cid which is is sequential and never changes. But I’m not sure if the Discourse importer will allow sequential replies to have dates that are out of order, and/or if it will go ahead and do its own sorting by post date? I’d like to get the opinion of the original creator of the drupal.rb importer (and anybody else too obviously), would this work and would there be any unintended consequences?
I think (but don’t really know, drupal’s database makes no sense to me), that you can adjust the query to pull the created at from the original post and not the edited one.
I think that there is a table that contains the original post/time and another with the edits.
It really doesn’t make any sense. The problem is like this, the two highlighted rows should be the 2nd and 3rd replies in the thread, but in Discourse they’re somewhere after the 500th position because that’s where their timestamp is.
And of course Drupal decided to set the created and changed to the same thing… And it only happens sometimes, I can’t reproduce it myself by editing old posts. But I have this same problem in multiple long-running threads where the OP of the topic posted it and then immediately posted one or two follow-up comments with “reserved” in the body for adding additional information later, which he did several years later.
That would make sense, but here’s the same problematic edited comment spread across the two tables it uses:
OK, it looks like this works, it fixes the two jumbled threads I found and doesn’t seem to hurt anything else.
def import_replies
…
batches(BATCH_SIZE) do |offset|
results = mysql_query(<<-SQL
SELECT c.cid, c.pid, c.nid, c.uid, c.created,
f.comment_body_value body,
f.comment_body_format format
FROM comment c,
field_data_comment_body f,
node n
WHERE c.cid = f.entity_id
AND n.nid = c.nid
AND c.status = 1
AND n.type IN ('poll', 'forum')
AND n.status = 1
AND c.created > UNIX_TIMESTAMP(STR_TO_DATE('#{IMPORT_AFTER}', '%Y-%m-%d'))
ORDER BY c.cid ASC #<--- Fixed
LIMIT #{BATCH_SIZE}
OFFSET #{offset}
SQL
).to_a
Hmm, it looks like this came back to bite me. Since the postprocess_posts function replaces old internal links with the new Discourse URL, I had made an exception in the code for the https://web.archive.org/web/20230101093741/https://MyOldForum.com/node/98765 links that my importer created for the old Drupal polls in the Wayback Machine. But apparently something went wrong, because I just noticed in the production migrated site that the links ended up like https://web.archive.org/web/20230101093741/https://MyOldForum.com/t/-/12345 .
So now that I’m no longer in the context of a migration container is the custom field with the original original Drupal node nid still available in the Discourse topics DB table? If so it would seem possible to do a string replacement in the Rails console on all topics with the first post that contains View this poll on the Wayback Machine and then replace https://web.archive.org/web/20230101093741/https://MyOldForum.com/t/-/[01234567890]*
with https://web.archive.org/web/20230101093741/http://MyOldForum.com/node/$original_nid
Here’s my original poll import function:
def import_poll_topics
puts '', "importing poll topics"
polls = mysql_query(<<-SQL
SELECT n.nid nid, n.title title, n.uid uid, n.created created, n.sticky sticky, taxonomy_index.tid tid, node_counter.totalcount views
FROM node n
LEFT JOIN taxonomy_index ON n.nid = taxonomy_index.nid
LEFT JOIN node_counter ON n.nid = node_counter.nid
WHERE n.type = 'poll'
AND n.status = 1
SQL
).to_a
create_posts(polls) do |topic|
{
id: "nid:#{topic['nid']}",
user_id: user_id_from_imported_user_id(topic['uid']) || -1,
category: category_id_from_imported_category_id(topic['tid']),
# Use TEMPmyoldforum.com or else postprocess_posts() will try to convert the Wayback Machine /node/YYY link
raw: "### View this poll on the Wayback Machine:\n**https://web.archive.org/web/20230101093741/http://TEMPmyoldforum.com/node/#{topic['nid']}**",
created_at: Time.zone.at(topic['created']),
pinned_at: topic['sticky'].to_i == 1 ? Time.zone.at(topic['created']) : nil,
title: topic['title'].try(:strip),
views: topic['views'],
custom_fields: { import_id: "nid:#{topic['nid']}" }
}
end
end