[bounty] Google+ (private) communities: export screenscraper + importer

Also, there are at least two other possible sources of data for G+ info to be imported into Discourse.

  1. Google has promised (¯\_(ツ)_/¯) to provide a community takeout option scant days before starting to delete the entire site. No schema has been documented or discussed for this data.
  2. There is an open source migration tool that could probably be “borrowed” — the K40 community tried this tool and it did not succeed, but they have over 5000 posts. It might work for smaller communities, either as a long term move directly, or as a way to preserve a community while working on an import from their appscript clone into Discourse for a more long-term solution.
  3. [Update: added] https://github.com/FiXato/Plexodus-Tools — also open source — has code both for working with takeout and talking to the G+ API (which has about two weeks of intermittent life left)
2 Likes

I should be clearer: If someone else wants to pursue the bounty, such as it is, I’m happy for you to do it and will contribute what I’ve learned so far towards the project, without asking to share the bounty. I just want this to exist.

I see that GoogleUserInfo has google_user_id on which I suppose I can join for G+ users who have already logged in via google auth. That’s the same ID that’s in the export.

I’m wondering if I could set just the user_id and google_user_id fields in GoogleUserInfo for the fake users I create because they haven’t logged in yet via google auth, and then if they log in later via google auth, the google_oauth2_authenticator will merge the users automatically and fix their user name and email from the oauth2 response?

1 Like

If you can share enough data that I can write an importer, I can see what it would take.

I would be seeking the bounty. This is my day job. I would also offer to run imports for people with budgets, as each one is surprisingly different. Once I’ve run an import to get the code to work with one site, I’d share the code. I’ve got at least one person interested in an import, which could make it worth my while.

I’ve written several importers.

I will share the code.

There is no issue with long term maintenance! This game will be over in a few weeks.

If you can get me the dump you have I can see about helping match those Google ids. I think that what you suggest might be possible, but I’d need to see the data to tell for sure.

4 Likes

This morning, I read ActiveRecord docs, so GoogleUserInfo.find_by looks like my friend. I’m a Go/Python/C developer so I just have to look everything up as I go along… :slight_smile:

I have a local test environment running in which to do test imports, so that’s not so hard.

Here’s the information I have outside Discourse so far:

  1. The author of the Friends+Me exporter has a commented exemplar for schema doc at Google+ Export JSON structure - Google Docs
  2. Anthony Bolgar / K40 · GitLab is an import I did into a Jekyll site before I realized that this was really an option, with the exported feed JSON, a mapping of URLs to image files in the repository, the Python script I hacked together to build the Jekyll site, and 4807 images checked into the repository.

I’m having fun hacking at this for a bit. If you get to the point where you are ready to actually start work on it, drop me a PM and I’ll provide what I have so far, though I don’t promise to stop playing with it myself at that point. I don’t want to use this topic as an ugly form of source code management. :wink:

I expect that outside of markup, most of the code will be re-usable for importing google community takeout archives when google actually releases them, so this might be a head start on being able to do more imports for people who have just waited until the last minute.

3 Likes

Here’s my untested attempt to import users:

    def import_author_user(author)
      id = author["id"]
      unless @users[id].present?
        # Reuse the existing account if this Google ID has already logged in.
        google_user_info = ::GoogleUserInfo.find_by(google_user_id: id)
        if google_user_info.nil?
          name = author["name"]
          email = "gplus-#{name.gsub(/\s+/, "")}-#{id}@example.com"
          # Attributes for a shadow user; as written, this hash is built but not yet used.
          {
            id: id,
            email: email,
            name: name,
            post_create_action: proc do |newuser|
              newuser.approved = true
              newuser.approved_by_id = @system_user.id
              newuser.approved_at = newuser.created_at
              newuser.save
              # Tie the shadow user to the Google ID for a later merge on login.
              ::GoogleUserInfo.create(user_id: newuser.id, google_user_id: id)
            end
          }
        else
          email = google_user_info.email
        end
        @users[id] = email
      end
    end

I explicitly intend by @example.com to prevent hostile takeover, and for the google_user_id to allow later automatic user merge when they log in.

Again, I have no idea if that will work.

3 Likes

I suggest you use a .invalid domain in order to prevent outgoing emails. Something like this:

email = "#{id}@gplus.invalid"
6 Likes

That’s a much better idea, I had forgotten about .invalid and it’s exactly right for the purpose. Thank you!

(example.com shouldn’t result in outgoing emails either by specification, but its purpose is primarily documentation.)
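A minimal sketch of the placeholder address scheme suggested above, in plain Ruby. The helper name `placeholder_email` is made up for illustration and is not part of Discourse; the only fact relied on is that `.invalid` is a reserved top-level domain (RFC 2606), so such an address can never be delivered.

```ruby
# Hypothetical helper: builds a non-deliverable placeholder address for a
# shadow user. The .invalid TLD is reserved, so mail to it cannot be routed.
def placeholder_email(google_user_id)
  "#{google_user_id}@gplus.invalid"
end

puts placeholder_email("123456789")
```

Keying the address on the Google ID alone also avoids having to sanitize display names into something that is valid in an email address.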

5 Likes

I think I’ve worked out how to do message formatting. I expect that the Post.cook_methods[:regular] will be able to handle embedded bits of html like <b> <i> and <s> which I found I really needed to use with my jekyll import because nesting doesn’t work the same between markdown and what G+ produced.

The only things I know left for message formatting are:

  1. Turning plus-references into at-references that will resolve internally; there can be references outside the community to users not imported, so I’ll have to only conditionally turn them into at-references if the user exists on the system already or in the import. Update: I think I have this figured out. There might be a better way than "<a class="mention" href="/u/#{user.name}">@#{user.name}</a>" though?
  2. Coming up with a reasonable title for a topic from a post. I have an algorithm but I’ll have to validate that it generates OK titles.
3 Likes

Using @username should be enough. Discourse will create the proper HTML when the post gets cooked. I guess you figured out a way to find existing users by their plus-reference?
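The conditional conversion of plus-references could look roughly like this in plain Ruby. The helper name `render_mention` and the `known_users` hash (Google user ID to Discourse username) are assumptions for illustration, as is the plain-text `+Name` fallback for people who are not on the system.

```ruby
# Hypothetical helper: turn one G+ plus-mention into post markup.
# known_users maps Google user IDs to Discourse usernames (an assumption
# about how the importer tracks the users it has created or found).
def render_mention(gplus_id, display_name, known_users)
  username = known_users[gplus_id]
  # A bare @username is enough; Discourse builds the link when cooking.
  username ? "@#{username}" : "+#{display_name}"
end

known_users = { "111" => "alice" }
puts render_mention("111", "Alice Example", known_users)
puts render_mention("222", "Bob Example", known_users)
```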

BTW: Are you writing a proper import script by using our base importer? Depending on how you import the users, you should be able to use find_user_by_import_id(google_plus_user_id) to find existing users.

3 Likes

Oh, great! That’s easier. Glad to know I was trying too hard.

I do want to be able to “fix up” references later to users who don’t exist, so if I don’t find them either in the import or on the system, I was thinking of something like "<b data-google-plus-id="123456789">+GooglePlusName</b>"

I’m using the base importer, using ning.rb as a pattern since it seemed to be the closest analog. I’m using find_user_by_import_id but not all users will be in the import; I’ll have multiple imports across different G+ communities and many people will have already signed in so I also need to look in GoogleUserInfo to resolve the mapping. My plan is actually to import all the users across all the imports, and then to import the posts across all the imports, so that cross-import user references are all resolved correctly. A lot of the users interacted in a lot of related communities on G+ and I want to create a familiar experience for them.

Now, whether it’s a “proper” import script others can judge! :wink: Thanks for putting up with half-formed thoughts from this new-to-Discourse-and-new-to-Ruby guy, and for help pointing me in the right direction. I’ll say that it’s clear to me that some real attention has been paid to making imports work in Discourse, which is motivational for trying to add an importer!

Thanks!

4 Likes

@gerhard — A related question… I now notice that the ning.rb importer has special handling for youtube links. Should I expect that youtube (and vimeo, etc?) links will be recognized and transformed into iframes with a viewer when the post is cooked, just like at-references to users? (The python code I wrote for static site imports has substantial regexp handling for youtube that I could pretty much lift intact, but if Discourse does it better I’d rather just trust Discourse to do the right thing.)

2 Likes

The Ning importer removes iframes so that Discourse can detect the links and create oneboxes. You don’t have to do anything special in your import script as long as the link is on a line by itself.
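As a rough sketch of what "on a line by itself" means for an import script, the helper below appends each attached link as its own paragraph. The name `append_links` and the idea that the export provides links separately from the post text are assumptions for illustration; the URL is a placeholder.

```ruby
# Hypothetical helper: put each attached link on its own line, separated
# from the text by blank lines, so it can be oneboxed when the post is cooked.
def append_links(body, urls)
  ([body.strip] + urls).reject(&:empty?).join("\n\n")
end

puts append_links("Watch this:", ["https://www.youtube.com/watch?v=VIDEO_ID"])
```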

https://meta.discourse.org/t/what-is-a-onebox/78060

7 Likes

Progress report: I have written something vaguely resembling a Ruby importer that looks like it covers all the current requirements in general design. (It doesn’t translate G+ +1’s into likes, because the exporter does not represent the +1s.) I haven’t run it at all, let alone actually imported data with it yet. :roll_eyes:

My script takes as arguments paths to files containing Friends+Me Google+ Exporter JSON export files, Friends+Me Google+ Exporter CSV image map files, and a single JSON file that maps Google+ Category IDs to Discourse categories, subcategories, and tags. It does not create new categories; if a category is missing it complains and bails out after writing a new file in which to fill the missing information to complete the import. It expects all the categories to have been created already.
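The "complain and write a file to fill in" behaviour might be sketched like this in plain Ruby. The map layout (G+ category ID to a `category` name and a `tags` list) and both helper names are hypothetical, not the format the actual script uses; a real script would write the template to a file and exit rather than print it.

```ruby
require "json"

# Hypothetical map format:
#   { "<G+ category ID>" => { "category" => "...", "tags" => [...] } }
# Returns the G+ category IDs used by posts that have no mapping yet.
def missing_category_ids(category_map, used_ids)
  used_ids.uniq.reject { |id| category_map.key?(id) }
end

# Builds a template to complete by hand: the existing entries plus an
# empty placeholder for every unmapped ID.
def template_with_placeholders(category_map, missing)
  placeholders = missing.map { |id| [id, { "category" => "", "tags" => [] }] }.to_h
  JSON.pretty_generate(category_map.merge(placeholders))
end

category_map = JSON.parse('{"cat-1": {"category": "Lasers", "tags": ["k40"]}}')
missing = missing_category_ids(category_map, ["cat-1", "cat-2", "cat-2"])
unless missing.empty?
  warn "Unmapped G+ categories: #{missing.join(', ')}"
  puts template_with_placeholders(category_map, missing)
end
```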

The idea is to do all the imports into a single discourse in one pass, so that all possible “plus-mentions” of other G+ users across multiple G+ communities turn into “at-mentions” in Discourse, even for people not active in the community in which they are mentioned, as long as they wrote some post or comment somewhere in the whole set of data being imported. This is because so far it looks like I’ll be importing about 10 communities, with about 300MB of input JSON and about 40GB of images.
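A sketch of the first pass, gathering every author across all exports before any posts are imported. The export shape used here (posts with an `author` and a list of `comments`, each with its own `author`) is a simplified assumption for illustration and is not the documented Friends+Me schema; `collect_authors` is a made-up name.

```ruby
# Hypothetical, simplified export shape: each export is an array of posts;
# each post has an "author" and a list of "comments" with their own "author".
def collect_authors(exports)
  authors = {}
  exports.each do |posts|
    posts.each do |post|
      ([post] + post.fetch("comments", [])).each do |item|
        author = item["author"]
        # Keep the first name seen for each Google user ID.
        authors[author["id"]] ||= author["name"]
      end
    end
  end
  authors
end

community_a = [{ "author" => { "id" => "1", "name" => "Ann" },
                 "comments" => [{ "author" => { "id" => "2", "name" => "Ben" } }] }]
community_b = [{ "author" => { "id" => "2", "name" => "Ben" }, "comments" => [] }]

# Pass 1: every author from every community, deduplicated by Google ID.
authors = collect_authors([community_a, community_b])
puts authors.inspect
```

With the full set of users known up front, the second pass can resolve a mention in one community to an author who only ever posted in another.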

It is intended to work on a Discourse instance that already has users referenced in the import by google ID, and that already has content and categories created. I hope that it will also make it possible for people to log in with google OAuth2 after the import and automatically own their content because their google auth ID is tied to the fake account holding the data, so that their ability to own their own content is preserved.

I expect that a 431-line file that has never seen an interpreter will be loads of fun to debug, especially when written and being tested by someone who has never written any ruby before. I don’t pretend that writing this script is the largest part of the work. I’ll share it now or any later time with anyone seeking the bounty, as long as you’ll share your fixes with me regardless of bounty progress; just PM me. I’ll share it myself under GPLv3 at such time as I get it working. In the meantime, I’m considering this work my contribution toward someone else claiming the bounty, to make it more likely to be worth the time for whoever takes it to completion, because of the comment above that the bounty is smaller than typical.

11 Likes

I have successfully imported users, topics, and posts. I am tweaking formatting translation. I have to test that categories are correctly translated. I expect to post a PR with the code soon.

6 Likes

My post formatting is generally looking clean, to the point that I don’t notice that it was originally authored outside of Discourse. This makes me happy.

I don’t blame Discourse for not accepting the data attribute; that didn’t work, and it’s probably a good idea that it was blocked. But then I realized that since I have the ID and name for every plus-mention, I can just create shadow users for every referenced user, even if they have not authored a post or comment. So there will be no need for this workaround.

I may have one problem.

I can’t find examples of importing into pre-existing categories in the import scripts already written. This importer is written to insist that all categories already exist, as a requirement for the site that we’re importing into. My importer is failing to assign topics to categories. At least, I’m in the middle of a very large import into my development instance, and while my tags are correctly applied, I don’t see the posts in the categories that I tried to assign them to.

Using the base import, is putting topics into categories something that happens after all the topics and posts have been created? Or am I doing something wrong? The relevant code does essentially this (simplified to try to show the bare essentials):

category = Category.where(name: "Some Name")
...
mapped[:category] = category.ids
...
create_post(mapped, mapped[:id])

I’m not seeing any error messages. Should this be category.ids.to_i instead?

[Update: Solved. Even though ids printed out as a single number, it’s really a list. Using ids[0] fixes it, and I’m now successfully importing into categories.]

Update: Here is working code

https://github.com/johnsonm/discourse/tree/friendsmegplus

I’ll do more tests, and might force-push amended commits to my branch and/or rebase before I open a PR, but this code has run successfully for multiple community imports in a development instance, and I’m open to comments on the code.

I’d like to thank the Discourse team not only for making an Open Source forum system, and making it usable and responsive, but also making a system in which I as a novice to Ruby and Discourse could come up to speed and build a working importer in spare moments across two weekends and a few evenings. I’ve done enough software development to know that is evidence not only of thoughtful design but also ongoing diligence in maintaining the system. Well done folks!

10 Likes

The traditional “importer way” is to pass the import_id of the category when you create a category. Ning (and I think Google groups) has no category ID, so you can pass the name as the id when you create the category and then use category_id_from_imported_category_id(category_name) to find the category. This has the advantage of working with any other references to categories in the code that you started with.

If this is the only time to look up categories, then you can do it your way. .where() returns a collection (an ActiveRecord relation) rather than a single record, so you need to pull out the first (and presumably only) category, so I think the way to fix what you’ve got is to replace the middle line above with

mapped[:category] = category.first.id

But that assumes that the category exists; I think that you’ve got logic that tests for that already? But if not, something like this:

category = Category.where(name: "Some Name")
if category.count < 1
  new_category = Category.new(
    name: opts[:name],
    user_id: opts[:user_id] || opts[:user].try(:id) || Discourse::SYSTEM_USER_ID,
    position: opts[:position],
    parent_category_id: opts[:parent_category_id],
    color: opts[:color] || category_color(opts[:parent_category_id]),
    text_color: opts[:text_color] || random_category_color,
    read_restricted: opts[:read_restricted] || false
  )
  # here's how you'd add the `import_id` that the lookup function uses
  new_category.custom_fields["import_id"] = "Some Name" # this'll be a variable!
  new_category.save!
end
3 Likes

Thanks @pfaffman !

I see where().first is more idiomatic Ruby than my Python-inspired [0] later on, and lack of .first is why my error checking for nil? would never fire even when categories didn’t exist. That I can fix.
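The nil-check pitfall can be reproduced in plain Ruby. Here an ordinary array stands in for the collection returned by a query (an assumption made so the snippet runs without Rails); the behaviour of `nil?` and `first` on an empty collection is the point being illustrated.

```ruby
# A plain array stands in for the result of a query that matched nothing.
no_matches = []

# An empty collection is still an object, so this check never fires.
puts no_matches.nil?

# .first yields nil when nothing matched, so this check does fire.
puts no_matches.first.nil?
```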

Google has UUIDs for categories, so now I know that I could use import_id if I want to import new categories. For my current purposes, I actually want the categories to have been created and organized first because I’m back-filling. I could make a conditional for whether to create category or error on unknown category, depending on whether it would actually be used. If someone asks, that looks not hard. I probably won’t bother unless someone here says they would actually use it, though. Google+ category names are in practice often long and would make the Discourse UI look a little off, so I think in this case it’s likely to be worth the effort up front. Or, if you are going to run some of these imports on a professional basis, happy to hand off to you for that!

Does providing the G+-provided UUID when creating a topic/post mean that as long as I’m providing the same UUID, create_post will see that it exists on the system and not re-create it? If so, that would make the script safe to run for incremental imports, which would be awesome!

Actual current version of my script:
https://github.com/johnsonm/discourse/blob/friendsmegplus/script/import_scripts/friendsmegplus.rb

3 Likes

Yes. And that’s nice since if you rename a category for some reason, they are still linked by the UUIDs.

Yes! That’s exactly how it works. Just stick that UUID in the id field and it’ll go into an import_id custom field. You can then use the lookup functions (topic_lookup_from_imported_post_id and friends in base/lookup_container.rb) to locate topics, users, posts, and categories by the UUIDs from the import.

2 Likes

Just to make sure I understand correctly: create_post will not create duplicates as long as I give it fixed ids, and I can use topic_lookup_from_imported_post_id etc if I need to look them up for any other reason?
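The idea of skipping anything whose import ID has already been seen can be illustrated with a toy class. This is deliberately not the Discourse base importer and does not mirror its internals; `ToyImporter` and its methods are made up to show why fixed IDs make reruns safe.

```ruby
# Toy stand-in for an importer, NOT the Discourse base importer: it
# remembers import IDs and skips any it has already seen.
class ToyImporter
  attr_reader :posts

  def initialize
    @posts = {} # import_id => post attributes
  end

  # Returns true if a post was created, false if it was skipped.
  def create_post(attrs, import_id)
    return false if @posts.key?(import_id)
    @posts[import_id] = attrs
    true
  end

  # Finds a previously imported post by the ID it had in the source data.
  def lookup(import_id)
    @posts[import_id]
  end
end

importer = ToyImporter.new
puts importer.create_post({ raw: "Hello" }, "uuid-1") # first run creates
puts importer.create_post({ raw: "Hello" }, "uuid-1") # rerun skips
```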

Thanks so much for the help with this!

1 Like

You got it!

You’re very welcome.

Sounds like you’re pretty close!

1 Like