[bounty] Google+ (private) communities: export screenscraper + importer

I have successfully imported users, topics, and posts. I am now tweaking the formatting translation, and I still have to test that categories are translated correctly. I expect to post a PR with the code soon.

6 Likes

My post formatting is generally looking clean, to the point that I don’t notice that it was originally authored outside of Discourse. This makes me happy.

I don’t blame Discourse for not accepting the data attribute; that didn’t work, and it’s probably a good idea that it was blocked. But then I realized that since I have the ID and name for every plus-mention, I can just create shadow users for every referenced user, even if they have not authored a post or comment, so there will be no need for this workaround.

I may have one problem.

I can’t find examples of importing into pre-existing categories in the import scripts already written. This importer is written to insist that all categories already exist, as a requirement for the site that we’re importing into. My importer is failing to assign topics to categories. At least, I’m in the middle of a very large import into my development instance, and while my tags are correctly applied, I don’t see the posts in the categories that I tried to assign them to.

Using the base importer, is putting topics into categories something that happens after all the topics and posts have been created? Or am I doing something wrong? The relevant code does essentially this (simplified to show the bare essentials):

category = Category.where(name: "Some Name")
...
mapped[:category] = category.ids
...
create_post(mapped, mapped[:id])

I’m not seeing any error messages. Should this be category.ids.to_i instead?

[Update: Solved: even though ids printed out as a single number, it’s really a list. Using ids[0] works, and now I’m successfully importing categories.]
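
For anyone else who hits this, a minimal illustration of the pitfall (names are placeholders from my snippet above):

category = Category.where(name: "Some Name")
category.ids     # => [42]  an Array, even when only one category matches
category.ids[0]  # => 42    the bare integer the importer actually needs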

Update: Here is working code

https://github.com/johnsonm/discourse/tree/friendsmegplus

I’ll do more tests, and might force-push an amended commit to my branch and/or rebase before I open a PR, but this code has run successfully for multiple community imports in a development instance, and I’m open to comments on the code.

I’d like to thank the Discourse team not only for making an Open Source forum system, and making it usable and responsive, but also for making a system in which I, as a novice to Ruby and Discourse, could come up to speed and build a working importer in spare moments across two weekends and a few evenings. I’ve done enough software development to know that this is evidence not only of thoughtful design but also of ongoing diligence in maintaining the system. Well done, folks!

10 Likes

The traditional “importer way” is to pass the import_id of the category when you create it. Ning (and I think Google Groups) has no category ID, so you can pass the name as the id when you create the category and then use category_id_from_imported_category_id(category_name) to find it. This has the advantage of working with any other references to categories in the code that you started with.
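
Something like this, as a rough sketch (the names are placeholders; create_category is the helper in base.rb and the lookup lives in base/lookup_container.rb):

# create the category once, passing its name as the import id
create_category({ name: "Some Name" }, "Some Name")
# ...later, resolve every reference through the same import id
mapped[:category] = category_id_from_imported_category_id("Some Name")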

If this is the only place you look up categories, then you can do it your way. .where() returns a collection (an ActiveRecord relation), so you need to pull out the first (and presumably only) category; I think the way to fix what you’ve got is to replace the middle line above with

mapped[:category] = category.first.id

But that assumes that the category exists; I think that you’ve got logic that tests for that already? But if not, something like this:

category = Category.where(name: "Some Name")
if category.count < 1
  new_category = Category.new(
    name: opts[:name],
    user_id: opts[:user_id] || opts[:user].try(:id) || Discourse::SYSTEM_USER_ID,
    position: opts[:position],
    parent_category_id: opts[:parent_category_id],
    color: opts[:color] || category_color(opts[:parent_category_id]),
    text_color: opts[:text_color] || random_category_color,
    read_restricted: opts[:read_restricted] || false
  )
  # here's how you'd add the `import_id` that the lookup function uses
  new_category.custom_fields["import_id"] = "Some Name" # this'll be a variable!
  new_category.save!
end
3 Likes

Thanks @pfaffman !

I see that where().first is more idiomatic Ruby than my Python-inspired [0] later on, and the lack of .first is why my nil? error check would never fire even when categories didn’t exist. That I can fix.
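
To spell out the failure mode for anyone following along (a minimal illustration, using a placeholder name):

Category.where(name: "No Such Category").nil?        # => false; a relation is never nil
Category.where(name: "No Such Category").first.nil?  # => true when nothing matches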

Google has UUIDs for categories, so now I know that I could use import_id if I want to import new categories. For my current purposes, I actually want the categories to have been created and organized first, because I’m back-filling. I could add a conditional for whether to create a category or error on an unknown category, depending on which would actually be used. That doesn’t look hard, but I probably won’t bother unless someone here says they would actually use it. Google+ category names are in practice often long and would make the Discourse UI look a little off, so I think in this case organizing the categories up front is likely to be worth the effort. Or, if you are going to run some of these imports on a professional basis, I’m happy to hand that off to you!

Does providing the G+-provided UUID when creating a topic/post mean that as long as I’m providing the same UUID, create_post will see that it exists on the system and not re-create it? If so, that would make the script safe to run for incremental imports, which would be awesome!

Actual current version of my script:
https://github.com/johnsonm/discourse/blob/friendsmegplus/script/import_scripts/friendsmegplus.rb

3 Likes

Yes. And that’s nice since if you rename a category for some reason, they are still linked by the UUIDs.

Yes! That’s exactly how it works. Just stick that UUID in the id field and it’ll go into an import_id custom field. You can then use the lookup functions (topic_lookup_from_imported_post_id and friends in base/lookup_container.rb) to locate topics, users, posts, and categories by the UUIDs from the import.
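
As a rough sketch of the round trip (gplus_uuid is just a stand-in for whatever variable holds the exported ID):

create_post(mapped, gplus_uuid)  # gplus_uuid lands in the import_id custom field
# ...later, find where that post ended up
topic = topic_lookup_from_imported_post_id(gplus_uuid)
# topic is a hash with things like topic_id and post_number, or nil if never imported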

2 Likes

Just to make sure I understand correctly: create_post will not create duplicates as long as I give it fixed ids, and I can use topic_lookup_from_imported_post_id etc if I need to look them up for any other reason?

Thanks so much for the help with this!

1 Like

You got it!

You’re very welcome.

Sounds like you’re pretty close!

1 Like

It’s generally working, and the .first fix made my error checking work. I also added an option to save a list of upload URLs to a file. Those changes are now in the link above.

However, incremental import doesn’t work. Instead of just adding new posts to an existing topic, I ended up with two topics: one with fewer comments, from the initial import, and another with all the comments, from the later import with more data. I looked through the data to make sure that the IDs didn’t change in the source; they are identical, but the posts are duplicated. Looking in the database, I see the IDs in post_custom_fields, so they are definitely being written.
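
If I were to chase this, my first guess would be an explicit guard using the lookup helpers from base/lookup_container.rb; a minimal sketch (gplus_uuid is a placeholder for the exported ID):

# skip anything whose import_id is already recorded in post_custom_fields
next if post_id_from_imported_post_id(gplus_uuid)
create_post(mapped, gplus_uuid)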

Since I had intended to do one-shot imports in the first place, and the idea of incremental updates was an unexpected bonus from my point of view, I’m not sure it’s worth my time to debug this. I’m pretty happy with where I got to, at this point. Is a PR for “working but with a known issue” even worth it if I’m not sure I’d have the time to even try to resolve the problem? I’ve signed the CLA, so if someone else is interested in running with this and improving it, they’ll be OK to base their work on mine.

Update: my earlier code was failing to upload images from Google+ comments that it should have been able to upload. Sadly, I found this bug in production, and meanwhile other content had arrived, so I couldn’t just restore from backup; I had to delete the imported posts. (The bug was due to an inconsistency in the F+MG+E data, now worked around.)

I suppose I should clean out post_custom_fields, even though I didn’t see a de-duplication function. Looks like rails c and PostCustomField.where(name: 'import_id').destroy_all is my friend; there were no other imported posts in the instance.

5 Likes

I am considering my work on this essentially complete. Incremental updates are not working, but they are beyond the scope I care about. You can see the fidelity of the conversion by visiting makerforums and looking at the Herculien category and its sub-categories — the imported posts are tagged “gplus”. I’ll be importing lots more content, but I no longer expect to need to modify the importer substantially.

At this time, given the bug that makes incremental import not work, I don’t plan to submit a PR, but I’m willing to do so if the Discourse devs request it and are OK with a known bug. I have signed the CLA, so it could be accepted.

@adorfer @erlend_sh @notriddle it’s probably worth each of you declaring whether you would pay out the bounty to someone who takes my work, fixes at least the known problem with incremental updates, and brings it to a PR. Also, Google tells us that in the next week or two we should expect meaningful Google Takeout exports for communities to be available, and it would be worth declaring whether you would pay out for being able to import that takeout.

3 Likes
  • I’m not paying a bounty unless it gets merged into discourse/discourse. Whoever lands the PR gets it.
  • If two scripts are written, I’ll pay out the bounty twice (yes, totalling $400).
2 Likes

@notriddle, if you’ll send me your data (or a subset of it) that works with what @mcdanlj has, and @mcdanlj will get me the script (e.g., post it to https://gist.github.com/ and send me the link in public or private), I’ll see about fixing the incremental updates and submitting a PR.

NOTE: Given that this script will be worthless in another month or two, it doesn’t make much sense to add it to core, but I suppose they can just delete it after a while.

EDIT: Sorry I didn’t notice your link above, @mcdanlj.

3 Likes

@pfaffman my current script is what I linked to above, but maybe the onebox made it unclear that that was a link to the actual script, so here it is in the middle of a line where it won’t be turned into a onebox: https://github.com/johnsonm/discourse/blob/friendsmegplus/script/import_scripts/friendsmegplus.rb . I have updated the script since I first linked it, so the onebox above is likely stale.

1 Like

@pfaffman That’s a good point, yeah. Then it probably doesn’t need to be in core. As for actually sending you exports to fix up the migrations:

I was using Google Plus Communities for beta tester groups on a couple of Android apps. I just went into the Google+ web app, and they’re not in my list any more. Apparently, they auto-migrated me to the new Google Play beta tester feedback system… so, yeah, I wrote that list of rules in an attempt to be fair and not renege, but I don’t actually need Google+ migration any more :blush: I’m a little worried that the old posts will be lost, but whatever.

3 Likes

The schema docs for Friends+Me Google+ Exporter are at:

1 Like
  1. @mcdanlj I highly appreciate the results you have already accomplished. This looks very effective; I am amazed at the speed with which you picked up the pieces and fitted everything together.
  2. The criterion for choosing between the Google export and the Friends+Me G+ exporter, for me, is: whichever gives the best import results. (I have purchased a license for the Friends+Me exporter.)
  3. Incremental imports are not relevant for me.
  4. Later imports (“next year”) into fresh or existing Discourse installations are not unlikely. I expect people will import existing export dumps after realizing that “import to WordPress” was not suitable.
  5. If I understand correctly, you do not want the bounty for yourself.
  6. Since I see that the main task is done, I would offer that you receive my promised 200 USD/€ part and decide what to do with it, now or later (e.g. either hand it to somebody who implements additional features or fixes bugs, or donate it to the general Discourse project, or give it to a charity of your choice, or split it between those options…).
3 Likes

@adorfer actually doing a migration is still a project, and I still do 1000 posts at a time looking for bugs. The problem with me not knowing Ruby, Rails, ActiveRecord, or Discourse two weeks ago is that I don’t have the experience to know what in my code might be fragile. Because I don’t know what to expect to break, I do full imports of all the data into a development copy of the real site before I do any import into the active site. Even though I back up the real site before starting an import, it’s active online with others, so a database rollback would be a bad thing.

I’m also not actually running the import on the system that holds the F+M data, so I have the script write out a list of the files it would upload; I then copy those files into a directory and make a copy of the image CSV mapping file with the paths updated to point to their final location on the server where I’ll run the real migration. So there’s some real site-specific scripting work needed to run a migration for real.

I measure the fidelity of the import by whether the data ported, and how close it looks to having been originally written in Discourse. In large part, I feel that I achieved the goal of looking like native content. Here are the limitations I’m aware of:

  • Losing “+1”s, which are not in the Friends+Me export (had they been there, I would have looked for how to convert them to likes)
  • There is no real album feature that corresponds to G+ albums as far as I know, so it’s just a set of pictures uploaded
  • The F+M data does not include videos uploaded directly to G+ (instead of, say, youtube)

For the bounty, I’d be happiest for it to go to @pfaffman when he takes the script and applies the benefit of his longer experience to make it more robust as well as just fixing known bugs. He has been encouraging and helpful while I’ve been hacking at this. My purpose and pleasure here has been rescuing data from Google while preserving control by the individual authors over their ported data. I’m pleased that Discourse provided that opportunity.

4 Likes

What are the odds of asking the Friends+Me G+ exporter author to include the +1 likes in the dump (perhaps even anonymized, if the data is not easily retrievable)?

I asked, but I think he’s been busy keeping up with other requests and Google changing their UI in ways that break the exporter, because it’s super important to update a product right before you axe it. (I’m not bitter or anything…)

And if he gets to it, it will probably be too late for my imports anyway, because I’m hoping to be mostly done by the end of this coming week.

I will say that if he does get to it, I would expect and hope for it to be in the same format as the post/comment author records, including name and Google ID. Adding it would mean updating the part of the script that finds all the users to check for (and possibly create) up front, so that it finds those records too.

@adorfer Update 11 March: I just heard back from him that he specifically plans not to do it because it would substantially slow down the already-slow import process. So it looks like the odds are truly low!

2 Likes

Hey, @adorfer. If you want to get me your data or a subset of it, I’ll take a pass at seeing if any fix-ups are needed with the importer as it exists. I’ll have a look at the incremental import. You say that it’s not important to you, but unless you have only small data sets, the incremental feature is pretty important. If you have more than a few thousand posts, there are many reasons that you might need to stop and restart a migration.

1 Like

A now-known issue is that many of the imported images are distorted until after a rebake. We’re just going to import everything into makerforums and then rebake after it’s all done, but maybe it’s something you’ll understand and know how to fix for others without that problem. As of this writing, you can see this at Hand tapping while watching football. - Build Logs - The Maker Forums — until somebody rebakes it or the author edits the text. :slight_smile:
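
If anyone needs to rebake just the imported posts from the Rails console, a minimal sketch (assuming the import_id custom field the importer writes):

# rebake only posts that carry an import_id custom field, i.e. imported posts
PostCustomField.where(name: "import_id").find_each do |field|
  field.post.rebake!
end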