[bounty] Google+ (private) communities: export screenscraper + importer

(mcdanlj) #16

I think I’ve worked out how to do message formatting. I expect that the Post.cook_methods[:regular] will be able to handle embedded bits of html like <b> <i> and <s> which I found I really needed to use with my jekyll import because nesting doesn’t work the same between markdown and what G+ produced.

The only things I know left for message formatting are:

  1. Turning plus-references into at-references that will resolve internally; there can be references outside the community to users not imported, so I’ll have to only conditionally turn them into at-references if the user exists on the system already or in the import. Update: I think I have this figured out. There might be a better way than "<a class="mention" href="/u/#{user.name}">@#{user.name}</a>" though?
  2. Coming up with a reasonable title for a topic from a post. I have an algorithm but I’ll have to validate that it generates OK titles.
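A minimal sketch of one way a topic title could be derived from a post body. The thread does not describe the actual algorithm, so everything here is an assumption: the helper name `title_from_post`, the rule (first non-blank line, simple inline tags stripped, cut at a word boundary), and the 80-character default are illustrative only.

```ruby
# Hypothetical helper: NOT the importer's actual title algorithm.
def title_from_post(raw, max_length: 80)
  # Take the first non-blank line of the post.
  first_line = raw.to_s.each_line.map(&:strip).find { |line| !line.empty? }
  return "Untitled post" if first_line.nil?

  # Drop simple inline HTML tags such as <b>, <i>, and <s>.
  plain = first_line.gsub(/<\/?[a-z][^>]*>/i, "").strip
  return "Untitled post" if plain.empty?
  return plain if plain.length <= max_length

  # Cut at the last space within the limit, then mark the truncation.
  # The result may be up to three characters longer than max_length.
  cut = plain[0, max_length]
  cut = cut[0, cut.rindex(" ")] if cut.include?(" ")
  "#{cut.rstrip}..."
end

puts title_from_post("<b>Hello</b> world\nThe rest of the post body")
```

Generated titles would still need to be checked against real posts, as noted above.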

(Gerhard Schlager) #17

Using @username should be enough. Discourse will create the proper HTML when the post gets cooked. I guess you figured out a way to find existing users by their plus-reference?

BTW: Are you writing a proper import script by using our base importer? Depending on how you import the users, you should be able to use find_user_by_import_id(google_plus_user_id) to find existing users.

(mcdanlj) #18

Oh, great! That’s easier. Glad to know I was trying too hard.

I do want to be able to “fix up” references later to users who don’t exist, so if I don’t find them either in the import or on the system, I was thinking of something like "<b data-google-plus-id="123456789">+GooglePlusName</b>"

I’m using the base importer, using ning.rb as a pattern since it seemed to be the closest analog. I’m using find_user_by_import_id but not all users will be in the import; I’ll have multiple imports across different G+ communities and many people will have already signed in so I also need to look in GoogleUserInfo to resolve the mapping. My plan is actually to import all the users across all the imports, and then to import the posts across all the imports, so that cross-import user references are all resolved correctly. A lot of the users interacted in a lot of related communities on G+ and I want to create a familiar experience for them.
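The two-step lookup described here (first the users in the import, then accounts that already exist on the site) can be sketched as follows. The two hashes are hypothetical stand-ins for find_user_by_import_id and for a GoogleUserInfo query; the helper names are invented for illustration and are not part of Discourse.

```ruby
# Hypothetical stand-ins: a real importer would consult the base importer's
# lookup and the site's Google account records instead of plain hashes.
def resolve_username(gplus_id, imported_users:, existing_users:)
  imported_users[gplus_id] || existing_users[gplus_id]
end

# Render a plus-mention as an at-mention when the user is known; otherwise
# keep it as plain text so it never points at a missing account.
def render_mention(gplus_id, display_name, imported_users:, existing_users:)
  username = resolve_username(gplus_id,
                              imported_users: imported_users,
                              existing_users: existing_users)
  username ? "@#{username}" : "+#{display_name}"
end

imported = { "111" => "alice" } # users created by this import
existing = { "222" => "bob" }   # users who already signed in with Google

puts render_mention("111", "Alice A", imported_users: imported, existing_users: existing)
puts render_mention("222", "Bob B", imported_users: imported, existing_users: existing)
puts render_mention("333", "Carol C", imported_users: imported, existing_users: existing)
```

Importing all users before any posts means the first hash is complete by the time mentions are rendered, which is what lets cross-community references resolve.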

Now, whether it’s a “proper” import script others can judge! :wink: Thanks for putting up with half-formed thoughts from this new-to-Discourse-and-new-to-Ruby guy, and for help pointing me in the right direction. I’ll say that it’s clear to me that some real attention has been paid to making imports work in Discourse, which is motivational for trying to add an importer!


(mcdanlj) #19

@gerhard — A related question… I now notice that the ning.rb importer has special handling for youtube links. Should I expect that youtube (and vimeo, etc?) links will be recognized and transformed into iframes with a viewer when the post is cooked, just like at-references to users? (The python code I wrote for static site imports has substantial regexp handling for youtube that I could pretty much lift intact, but if Discourse does it better I’d rather just trust Discourse to do the right thing.)

(Gerhard Schlager) #20

The Ning importer removes iframes so that Discourse can detect the links and create oneboxes. You don’t have to do anything special in your import script as long as the link is on a line itself.
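As a rough sketch of that idea, the helper below swaps an embedded iframe for its bare URL and surrounds the URL with blank lines so it ends up on a line by itself. This is not the Ning importer's code; `iframes_to_bare_links` and its regular expression are assumptions for illustration, and the sketch does not check whether a given URL form is one that onebox recognizes.

```ruby
# Hypothetical helper: replace each <iframe src="..."></iframe> with the
# bare URL on its own line.
def iframes_to_bare_links(html)
  html.gsub(/<iframe\b[^>]*\bsrc=["']([^"']+)["'][^>]*>\s*<\/iframe>/i) { "\n\n#{$1}\n\n" }
      .gsub(/\n{3,}/, "\n\n") # collapse runs of blank lines
      .strip
end

post = 'Watch this: <iframe width="560" src="https://www.youtube.com/embed/abc123"></iframe> Enjoy!'
puts iframes_to_bare_links(post)
```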

(mcdanlj) #21

Progress report: I have written an importer that looks like it covers all the current requirements in general design. (It doesn’t translate G+ +1’s into likes, because the exporter does not represent the +1s.) I haven’t actually imported data with it yet. :roll_eyes:

My script takes as arguments paths to files containing Friends+Me Google+ Exporter JSON export files, Friends+Me Google+ Exporter CSV image map files, and a single JSON file that maps Google+ Category IDs to Discourse categories, subcategories, and tags. It does not create new categories; if a category is missing it complains and bails out after writing a new file in which to fill the missing information to complete the import. It expects all the categories to have been created already.
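The "complain and bail out after writing a file to fill in" behavior could look roughly like this. The mapping format shown (G+ category ID mapped to a category name and tags) and the file name are assumptions; the thread does not show the real file layout.

```ruby
require "json"
require "tmpdir"

# Hypothetical mapping format:
#   { "<G+ category id>" => { "category" => "...", "tags" => [...] } }
# Returns the G+ category IDs from the export that the mapping does not cover.
def missing_category_ids(export_category_ids, mapping)
  export_category_ids.uniq.reject { |id| mapping.key?(id) }
end

# Write a copy of the mapping with empty entries for the missing IDs,
# ready to be filled in by hand before the next run.
def write_template(path, mapping, missing_ids)
  template = mapping.dup
  missing_ids.each { |id| template[id] = { "category" => "", "tags" => [] } }
  File.write(path, JSON.pretty_generate(template))
end

mapping = { "cat-a" => { "category" => "General", "tags" => ["gplus"] } }
missing = missing_category_ids(%w[cat-a cat-b cat-b], mapping)
template_path = File.join(Dir.tmpdir, "categories-to-fill.json")

unless missing.empty?
  write_template(template_path, mapping, missing)
  warn "Missing category mappings: #{missing.join(', ')}; see #{template_path}"
  # A real script would stop here, for example with `exit 1`.
end
```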

The idea is to do all the imports into a single discourse in one pass, so that all possible “plus-mentions” of other G+ users across multiple G+ communities turn into “at-mentions” in Discourse, even for people not active in the community in which they are mentioned, as long as they wrote some post or comment somewhere in the whole set of data being imported. This is because so far it looks like I’ll be importing about 10 communities, with about 300MB of input JSON and about 40GB of images.

It is intended to work on a Discourse instance that already has users referenced in the import by google ID, and that already has content and categories created. I hope that it will also make it possible for people to log in with google OAuth2 after the import and automatically own their content because their google auth ID is tied to the fake account holding the data, so that their ability to own their own content is preserved.

I expect that a 431-line file that has never seen an interpreter will be loads of fun to debug, especially when written and being tested by someone who has never written any ruby before. I don’t pretend that writing this script is the largest part of the work. I’ll share it now or any later time with anyone seeking the bounty, as long as you’ll share your fixes with me regardless of bounty progress; just PM me. I’ll share it myself under GPLv3 at such time as I get it working. In the meantime, I’m considering this work my contribution toward someone else claiming the bounty, to make it more likely to be worth the time for whoever takes it to completion, because of the comment above that the bounty is smaller than typical.

(mcdanlj) #22

I have successfully imported users, topics, and posts. I am tweaking formatting translation. I have to test that categories are correctly translated. I expect to post a PR with the code soon.

(mcdanlj) #23

My post formatting is generally looking clean, to the point that I don’t notice that it was originally authored outside of Discourse. This makes me happy.

I don’t blame Discourse for not accepting the data attribute; that didn’t work, and it’s probably a good idea that it was blocked. But then I realized that since I have the ID and name for every plus-mention, I can just create shadow users for every referenced user, even if they have not authored a post or comment. So there will be no need for this workaround.

I may have one problem.

I can’t find examples of importing into pre-existing categories in the import scripts already written. This importer is written to insist that all categories already exist, as a requirement for the site that we’re importing into. My importer is failing to assign topics to categories. At least, I’m in the middle of a very large import into my development instance, and while my tags are correctly applied, I don’t see the posts in the categories that I tried to assign them to.

Using the base import, is putting topics into categories something that happens after all the topics and posts have been created? Or am I doing something wrong? The relevant code does essentially this (simplified to try to show the bare essentials):

category = Category.where(name: "Some Name")
mapped[:category] = category.ids
create_post(mapped, mapped[:id])

I’m not seeing any error messages. Should this be category.ids.to_i instead?

[Update: Solved: even though ids printed out as a singular number, it’s really a list. I switched to ids[0] and now I’m successfully importing into categories.]
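One plausible reason a one-element list can look like a single number: Ruby's `puts` prints each array element on its own line, without brackets. The snippet uses a plain array as a stand-in for the value returned by `ids`; how the value was actually printed is not shown in the thread, so that part is an assumption.

```ruby
ids = [42]                # stand-in for the list returned by `category.ids`

puts ids                  # prints: 42   (looks like a bare number)
p ids                     # prints: [42] (inspect reveals the array)

category_id = ids.first   # the single integer inside the list
puts category_id.class    # prints: Integer
```

Printing with `p` (or calling `inspect`) while debugging makes this kind of mismatch visible right away.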

Update: Here is working code

I’ll do more tests, and might force-push an amended commit to my branch and/or rebase before I open a PR, but this code has run successfully for multiple community imports in a development instance, and I’m open to comments on the code.

I’d like to thank the Discourse team not only for making an Open Source forum system, and making it usable and responsive, but also making a system in which I as a novice to Ruby and Discourse could come up to speed and build a working importer in spare moments across two weekends and a few evenings. I’ve done enough software development to know that is evidence not only of thoughtful design but also ongoing diligence in maintaining the system. Well done folks!

(Jay Pfaffman) #24

The traditional “importer way” is to pass the import_id of the category when you create a category. Ning (and I think Google groups) has no category ID, so you can pass the name as the id when you create the category and then use category_id_from_imported_category_id(category_name) to find the category. This has the advantage of working with any other references to categories in the code that you started with.

If this is the only time to look up categories, then you can do it your way. .where() returns a collection of matching records rather than a single record, so you need to pull out the first (and presumably only) category, so I think the way to fix what you’ve got is to replace the middle line above with

mapped[:category] = category.first.id

But that assumes that the category exists; I think that you’ve got logic that tests for that already? But if not, something like this:

category = Category.where(name: "Some Name")
if category.count < 1
  new_category = Category.new(
    name: opts[:name],
    user_id: opts[:user_id] || opts[:user].try(:id) || Discourse::SYSTEM_USER_ID,
    position: opts[:position],
    parent_category_id: opts[:parent_category_id],
    color: opts[:color] || category_color(opts[:parent_category_id]),
    text_color: opts[:text_color] || random_category_color,
    read_restricted: opts[:read_restricted] || false
  )
  # here's how you'd add the `import_id` that the lookup function uses
  new_category.custom_fields["import_id"] = "Some Name" # this'll be a variable!
  new_category.save! # persist the category together with its custom field
end

(mcdanlj) #25

Thanks @pfaffman !

I see where().first is more idiomatic Ruby than my Python-inspired [0] later on, and lack of .first is why my error checking for nil? would never fire even when categories didn’t exist. That I can fix.
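A small illustration of why the nil? check never fired. A plain empty array stands in for a query that matched nothing (an assumption made to keep the snippet self-contained; the real value is a database query result, but it behaves the same way for this check).

```ruby
matches = []               # stand-in for a lookup that found no category

puts matches.nil?          # false: an empty collection is still an object
puts matches.first.nil?    # true: there is no first element

category = matches.first
if category.nil?
  warn "Category not found; refusing to import into an unknown category"
end
```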

Google has UUIDs for categories, so now I know that I could use import_id if I want to import new categories. For my current purposes, I actually want the categories to have been created and organized first because I’m back-filling. I could make a conditional for whether to create category or error on unknown category, depending on whether it would actually be used. If someone asks, that looks not hard. I probably won’t bother unless someone here says they would actually use it, though. Google+ category names are in practice often long and would make the Discourse UI look a little off, so I think in this case it’s likely to be worth the effort up front. Or, if you are going to run some of these imports on a professional basis, happy to hand off to you for that!

Does providing the G+-provided UUID when creating a topic/post mean that as long as I’m providing the same UUID, create_post will see that it exists on the system and not re-create it? If so, that would make the script safe to run for incremental imports, which would be awesome!

Actual current version of my script:

Turn off the +- fancypants formatting

(Jay Pfaffman) #26

Yes. And that’s nice since if you rename a category for some reason, they are still linked by the UUIDs.

Yes! That’s exactly how it works. Just stick that UUID in the id field and it’ll go into an import_id custom field. You can then use the lookup functions (topic_lookup_from_imported_post_id and friends in base/lookup_container.rb) to locate topics, users, posts, and categories by the UUIDs from the import.
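The general idea of skipping items whose source ID has already been imported can be shown with a toy class. This is not Discourse's base importer and makes no claim about how it behaves in practice; `ToyImporter` and its methods are invented for illustration.

```ruby
# Toy importer: remembers which source IDs it has already created so that
# a second run over a larger export only adds the new items.
class ToyImporter
  attr_reader :posts

  def initialize
    @posts = []                 # created posts, in order
    @post_id_by_import_id = {}  # source UUID => local post id
  end

  def post_id_from_import_id(import_id)
    @post_id_by_import_id[import_id]
  end

  # Returns :skipped when the source ID is already known, :created otherwise.
  def create_post(import_id, raw)
    return :skipped if post_id_from_import_id(import_id)

    @posts << raw
    @post_id_by_import_id[import_id] = @posts.length
    :created
  end
end

importer = ToyImporter.new
first_run  = [["uuid-1", "Hello"], ["uuid-2", "World"]]
second_run = first_run + [["uuid-3", "New comment"]]

first_run.each { |id, raw| importer.create_post(id, raw) }
results = second_run.map { |id, raw| importer.create_post(id, raw) }

p results                   # [:skipped, :skipped, :created]
puts importer.posts.length  # 3
```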

(mcdanlj) #27

Just to make sure I understand correctly: create_post will not create duplicates as long as I give it fixed ids, and I can use topic_lookup_from_imported_post_id etc if I need to look them up for any other reason?

Thanks so much for the help with this!

(Jay Pfaffman) #28

You got it!

You’re very welcome.

Sounds like you’re pretty close!

(mcdanlj) #29

It’s generally working, and the .first fix made my error checking work. I also added optionally saving a list of upload URLs to a file. Those changes are now in the link above.

However, incremental import doesn’t work. Instead of just adding new posts to an existing topic, I ended up with two topics, one with fewer comments from the initial import, and another with all the comments from the later import with more data. I looked through the data to make sure that the IDs didn’t change in the source. The IDs in the source are identical, but the posts are duplicated. Looking in the database, I see the IDs in post_custom_fields so they are definitely being written.

Since I had intended to do one-shot imports in the first place, and the idea of incremental updates was an unexpected bonus from my point of view, I’m not sure it’s worth my time to debug this. I’m pretty happy with where I got, at this point. Is PR for “working but with known issue” even worth it if I’m not sure I’d have the time to even try to resolve the problem? I’ve signed the CLA so if someone else is interested in running with this and improving it they’ll be OK to base on my work.

Update: my earlier code was failing to upload images from Google+ comments that it should be able to upload. Sadly, I found this bug in production, and meanwhile other content had arrived so I couldn’t just restore from backup; I had to delete the imported posts. (The bug was due to an inconsistency in the F+MG+E data, now worked around.)

I suppose I should clean out post_custom_fields even though I didn’t see de-duplication working. Looks like rails c and PostCustomField.where(name: 'import_id').destroy_all is my friend; there were no other imported posts in the instance.

(mcdanlj) #30

I am considering my work on this essentially complete. Incremental updates are not working, but they are beyond the scope I care about. You can see the fidelity of the conversion by visiting makerforums and looking at the Herculien category and its sub-categories — the imported posts are tagged “gplus”. I’ll be importing lots more content, but I no longer expect to need to modify the importer substantially.

At this time, given the bug that is making incremental import not work, I plan not to submit a PR but am willing to do so if discourse devs request and are ok with a known bug. I have signed the CLA so it could be accepted.

@adorfer @erlend_sh @notriddle it’s probably worth each declaring whether you would pay out the bounty to someone who takes my work and fixes at least the known problem with incremental updates not working and brings it to PR. Also, in the next week or two, google tells us that we should expect meaningful google takeout for a community to be available, and it would be worth declaring whether you would pay out for being able to import the google takeout.

(Michael Howell) #31

  • I’m not paying a bounty unless it gets merged into discourse/discourse. Whoever lands the PR gets it.
  • If two scripts are written, I’ll pay out the bounty twice (yes, totalling $400).

(Jay Pfaffman) #32

@notriddle, if you’ll send me your data (or a subset of it) that works with what @mcdanlj has, and @mcdanlj will get me the script (e.g., post it to https://gist.github.com/ and send me the link in public or private), I’ll see about fixing the incremental updates and submitting a PR.

NOTE: Given that this script will be worthless in another month or two, it doesn’t make much sense to add it to core, but I suppose they can just delete it after a while.

EDIT: Sorry I didn’t notice your link above, @mcdanlj.

(mcdanlj) #33

@pfaffman my current script is what I linked to above, but maybe the onebox made it not clear that it was a link to the actual script, so discourse/friendsmegplus.rb at friendsmegplus · johnsonm/discourse · GitHub hiding in the middle of the line not to be obscured by a onebox might help? I have updated since I first linked, so the onebox is likely stale.

(Michael Howell) #34

@pfaffman That’s a good point, yeah. Then it probably doesn’t need to be in Core. As for actually sending you exports to fix up the migrations:

I was using Google Plus Communities for beta tester groups on a couple of Android apps. I just went into the Google+ web app, and they’re not in my list any more. Apparently, they auto-migrated me to using the new Google Play Beta Tester feedback system… so, yeah, I wrote that list of rules in an attempt to be fair and not renege, but I don’t actually need Google+ migration any more :blush: I’m a little worried that the old posts will be lost, but whatever.

(mcdanlj) #35

The schema docs for Friends+Me Google+ Exporter are at: