[bounty] Google+ (private) communities: export screenscraper + importer


(Andreas Dorfer) #1

Continuing the discussion from Importing from Google Groups:

I am looking for a way to salvage existing Google+ communities.

  • the groups are “closed” (non-public) ones, so they are not (fully) exportable with the tools I know of
    in other words: the export has to run in the user context.
    (the most drastic approach would be “browser screenscraping via Selenium”)
  • the import into Discourse should include images
  • it should include comments on the posts and comments on individual images
  • it should preserve as much detail as possible

Bounty:

  • I would put €200 or US$200 (as you like) on the table, paid via PayPal.
  • this will probably not cover the cost of a full development
  • the idea is that more people join
  • the code will be made publicly available (hopefully that does not turn people away from joining)

The export screenscraper has to meet the existing timeline:
https://support.google.com/plus/answer/9195133
(in other words: it has to be operational within one month at the latest, by the end of February 2019)


(Erlend Sogge Heggen) #3

We’d be happy to put up another €200 for this, payable only via PayPal.


(Michael Howell) #4

I’ll add another $200, again paid via PayPal.


(Stephane Buisson) #5

Hi Guys,

Some tools are available, see below, but for the owner/admin export from G+ we should wait until early March.
"Google+ Communities
To download data for Communities where you’re an owner or moderator, select Google+ Communities. You will get:

Names and links to Google+ profiles of community owners, moderators, members, applicants, banned members, and invitees
Links to posts shared with the community
Community metadata, including community picture, community settings, content control settings, your role, and community categories
Important: Starting early March 2019, you will also be able to download additional details from public communities, including author, body, and photos for every community post."

Now here is the tool I am thinking about: https://gplus-exporter.friendsplus.me/

Stéphane is the owner of https://plus.google.com/communities/118113483589382049502 and also interested in Discourse.


(Jay Pfaffman) #6

But it looks like you don’t get email addresses, so it’ll be difficult to allow users to be connected with their posts.

I’ll have a look at the tool linked above. $600 is still short of what I generally charge for a new importer (a recent project was over $3000), and it looks like it’ll have a short window of usefulness.

I can have a look next week at the resources linked and see how much time they’ll save.


(Andreas Dorfer) #7

This is a bit of a hassle. An admin needs to manually join/merge the accounts “to be claimed” later.
But at least for groups with just a few dozen really active members it’s feasible.

The Google exporter may be a good starting point and may even take off a lot of the pressure “to get it completed fast”.
(But I have not looked into the format of the G+ exporter yet, and I am not sure whether all necessary/relevant details are covered. Now would perhaps be the time to get in contact with the author to ask for supplemental data to fetch, especially since there seem to be nearly daily releases.)


(mcdanlj) #8

Just FYI, I’ve been hacking at this for the K40 community and maybe some other communities, starting from a friendsplus.me JSON export, and didn’t see this post until a moment ago. I have used the friendsplus.me exporter to create static Jekyll archives of several communities, so I’m already up to speed with their JSON format.

I’ve never touched ruby before, so I’m learning Discourse and ruby at the same time, but it doesn’t look too bad. I took the route of creating suspended @example.com (so that they cannot be validated by an attacker) fake users for admins to merge later, though I’ll probably add the ability to provide a JSON map to known existing users at the time of the import.

It looks like the ning importer has dealt with many of the same general needs, so I’ve been reading it as a pattern for this work.

I’m not seeking the bounty and I do intend to share my work. I intend to be involved with running it a limited number of times and intend not to be a long-term maintainer for the script. I intend to offer a PR when it works, but given the limited lifetime of the script it might be reasonable not to merge the PR, especially if it’s not up to quality expectations, and instead leave it as documentation.


(mcdanlj) #9

Also, there are at least two other possible sources of data for G+ info to be imported into Discourse.

  1. Google has promised (¯\(ツ)/¯) to provide a community takeout option scant days before starting to delete the entire site. No schema has been documented or discussed for this data.
  2. There is an open source migration tool that could probably be “borrowed” — the K40 community tried this tool and it did not succeed, but they have over 5000 posts. It might work for smaller communities, either as a long term move directly, or as a way to preserve a community while working on an import from their appscript clone into Discourse for a more long-term solution.
  3. [Update: added] GitHub - FiXato/Plexodus-Tools: A collection of tools to process the Google Plus-related data from Google Takeout. — also open source — has code both for working with takeout and talking to the G+ API (which has about two weeks of intermittent life left)

(mcdanlj) #10

I should be clearer: If someone else wants to pursue the bounty, such as it is, I’m happy for you to do it and will contribute what I’ve learned so far towards the project, without asking to share the bounty. I just want this to exist.

I see that GoogleUserInfo has google_user_id on which I suppose I can join for G+ users who have already logged in via google auth. That’s the same ID that’s in the export.

I’m wondering if I could set just the user_id and google_user_id fields in GoogleUserInfo for the fake users I create because they haven’t logged in yet via google auth, and then if they log in later via google auth, the google_oauth2_authenticator will merge the users automatically and fix their user name and email from the oauth2 response?


(Jay Pfaffman) #11

If you can share enough data that I can write an importer, I can see what it would take.

I would be seeking the bounty. This is my day job. I would also offer to run imports for people with budgets, as each one is surprisingly different. Once I’ve run an import to get the code to work with one site, I’d share the code. I’ve got at least one person interested in an import, which could make it worth my while.

I’ve written several importers.

I will share the code.

There is no issue with long term maintenance! This game will be over in a few weeks.

If you can get me the dump you have I can see about helping match those Google ids. I think that what you suggest might be possible, but I’d need to see the data to tell for sure.


(mcdanlj) #12

This morning, I read ActiveRecord docs, so GoogleUserInfo.find_by looks like my friend. I’m a Go/Python/C developer so I just have to look everything up as I go along… :slight_smile:

I have a local test environment running in which to do test imports, so that’s not so hard.

Here’s the information I have outside Discourse so far:

  1. The author of the Friends+Me exporter has a commented exemplar for schema doc at Google+ Export JSON structure - Google Docs
  2. Anthony Bolgar / K40 · GitLab is an import I did into a Jekyll site before I realized that this was really an option, with the exported feed JSON, a mapping of URLs to image files in the repository, the Python script I hacked together to build the Jekyll site, and 4807 images checked into the repository.

I’m having fun hacking at this for a bit. If you get to the point where you are ready to actually start work on it, drop me a PM and I’ll provide what I have so far, though I don’t promise to stop playing with it myself at that point. I don’t want to use this topic as an ugly form of source code management. :wink:

I expect that outside of markup, most of the code will be re-usable for importing google community takeout archives when google actually releases them, so this might be a head start on being able to do more imports for people who have just waited until the last minute.


(mcdanlj) #13

Here’s my untested attempt to import users:

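    def import_author_user(author)
      id = author["id"]
      # Skip authors that were already handled in this run.
      return if @users[id].present?

      new_user = nil
      google_user_info = ::GoogleUserInfo.find_by(google_user_id: id)
      if google_user_info.nil?
        # Nobody has logged in with this Google ID yet: describe a placeholder user.
        name = author["name"]
        email = "gplus-#{name.gsub(/\s+/, "")}-#{id}@example.com"
        new_user = {
          id: id,
          email: email,
          name: name,
          post_create_action: proc do |newuser|
            newuser.approved = true
            newuser.approved_by_id = @system_user.id
            newuser.approved_at = newuser.created_at
            newuser.save
            # Record the Google ID; the intent is that a later Google login matches this user.
            ::GoogleUserInfo.create({
              user_id: newuser.id,
              google_user_id: id,
            })
          end
        }
      else
        # The Google ID is already known: reuse the email on record.
        email = google_user_info.email
      end
      @users[id] = email
      # Hash describing the user to create, or nil when the Google ID was already known.
      new_user
    end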

I explicitly intend by @example.com to prevent hostile takeover, and for the google_user_id to allow later automatic user merge when they log in.

Again, I have no idea if that will work.


(Gerhard Schlager) #14

I suggest you use a .invalid domain in order to prevent outgoing emails. Something like this:

    email = "#{id}@gplus.invalid"

(mcdanlj) #15

That’s a much better idea, I had forgotten about .invalid and it’s exactly right for the purpose. Thank you!

(example.com shouldn’t result in outgoing emails either by specification, but its purpose is primarily documentation.)


(mcdanlj) #16

I think I’ve worked out how to do message formatting. I expect that the Post.cook_methods[:regular] will be able to handle embedded bits of html like <b> <i> and <s> which I found I really needed to use with my jekyll import because nesting doesn’t work the same between markdown and what G+ produced.

The only things I know left for message formatting are:

  1. Turning plus-references into at-references that will resolve internally; there can be references outside the community to users not imported, so I’ll have to only conditionally turn them into at-references if the user exists on the system already or in the import. Update: I think I have this figured out. There might be a better way than "<a class="mention" href="/u/#{user.name}">@#{user.name}</a>" though?
  2. Coming up with a reasonable title for a topic from a post. I have an algorithm but I’ll have to validate that it generates OK titles.
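For the second item, here is a minimal sketch of one possible title heuristic. It is not the algorithm mentioned above (which is not shown in this thread); the helper name `title_from_post`, the 80-character limit, and the fallback text are all assumptions made for illustration. It strips inline HTML, takes the first non-blank line, prefers the first sentence, and truncates at a word boundary.

```ruby
# Hypothetical helper (not part of Discourse): derive a topic title
# from the raw text of a G+ post.
def title_from_post(text, max_length: 80, fallback: "Untitled Google+ post")
  # Drop inline HTML tags such as <b>, <i> and <s>.
  plain = text.to_s.gsub(/<[^>]+>/, "")

  # Use the first line that is not blank.
  first_line = plain.each_line.map(&:strip).find { |line| !line.empty? }
  return fallback if first_line.nil?

  # Prefer the first sentence when the line contains terminal punctuation.
  sentence = first_line[/\A.+?[.!?](?=\s|\z)/] || first_line
  return sentence if sentence.length <= max_length

  # Otherwise cut at the last space inside the limit and mark the truncation.
  cut = sentence[0, max_length]
  last_space = cut.rindex(" ")
  cut = cut[0...last_space] if last_space
  cut.rstrip + "..."
end
```

Whatever heuristic is used, the generated titles would still need to be checked against real posts, since very short or link-only posts may produce poor titles.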

(Gerhard Schlager) #17

Using @username should be enough. Discourse will create the proper HTML when the post gets cooked. I guess you figured out a way to find existing users by their plus-reference?

BTW: Are you writing a proper import script by using our base importer? Depending on how you import the users, you should be able to use find_user_by_import_id(google_plus_user_id) to find existing users.


(mcdanlj) #18

Oh, great! That’s easier. Glad to know I was trying too hard.

I do want to be able to “fix up” references later to users who don’t exist, so if I don’t find them either in the import or on the system, I was thinking of something like "<b data-google-plus-id="123456789">+GooglePlusName</b>"

I’m using the base importer, using ning.rb as a pattern since it seemed to be the closest analog. I’m using find_user_by_import_id but not all users will be in the import; I’ll have multiple imports across different G+ communities and many people will have already signed in so I also need to look in GoogleUserInfo to resolve the mapping. My plan is actually to import all the users across all the imports, and then to import the posts across all the imports, so that cross-import user references are all resolved correctly. A lot of the users interacted in a lot of related communities on G+ and I want to create a familiar experience for them.
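A rough sketch of that mention fallback, outside of Discourse so it can be run on its own. The two hashes are hypothetical stand-ins for the real lookups (`find_user_by_import_id` for imported users and `GoogleUserInfo` for people who already signed in), and `render_mention` is a made-up name:

```ruby
HTML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

# Minimal HTML escaping for the fallback markup.
def escape_html(value)
  value.to_s.gsub(/[&<>"]/, HTML_ESCAPES)
end

# imported: Google+ ID => Discourse username for users created by the import
# existing: Google+ ID => Discourse username for users already known on the site
# Both hashes are assumptions standing in for the real Discourse lookups.
def render_mention(google_id, display_name, imported:, existing:)
  username = imported[google_id] || existing[google_id]
  # Known user: a plain @username is enough, Discourse cooks it into a mention.
  return "@#{username}" if username

  # Unknown user: keep the Google+ ID in the markup so it can be fixed up later.
  %(<b data-google-plus-id="#{escape_html(google_id)}">+#{escape_html(display_name)}</b>)
end
```

For example, `render_mention("123", "Ann", imported: { "123" => "ann" }, existing: {})` gives `@ann`, while an ID found in neither hash falls back to the `<b data-google-plus-id=...>` form.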

Now, whether it’s a “proper” import script others can judge! :wink: Thanks for putting up with half-formed thoughts from this new-to-Discourse-and-new-to-Ruby guy, and for help pointing me in the right direction. I’ll say that it’s clear to me that some real attention has been paid to making imports work in Discourse, which is motivational for trying to add an importer!

Thanks!


(mcdanlj) #19

@gerhard — A related question… I now notice that the ning.rb importer has special handling for youtube links. Should I expect that youtube (and vimeo, etc?) links will be recognized and transformed into iframes with a viewer when the post is cooked, just like at-references to users? (The python code I wrote for static site imports has substantial regexp handling for youtube that I could pretty much lift intact, but if Discourse does it better I’d rather just trust Discourse to do the right thing.)


(Gerhard Schlager) #20

The Ning importer removes iframes so that Discourse can detect the links and create oneboxes. You don’t have to do anything special in your import script as long as the link is on a line by itself.
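As an illustration only: if a G+ post carries its link separately from the body text (an assumption about the export, not something confirmed here), a hypothetical helper like the following would place that link on a line by itself at the end of the raw post:

```ruby
# Hypothetical helper: append an attached link on its own line,
# separated from the body by a blank line.
def append_link_on_own_line(body, url)
  return body if url.nil? || url.empty?

  "#{body.rstrip}\n\n#{url}\n"
end
```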


(mcdanlj) #21

Progress report: I have written something vaguely resembling a Ruby importer that looks like it covers all the current requirements in general design. (It doesn’t translate G+ +1’s into likes, because the exporter does not represent the +1s.) I haven’t run it at all or actually imported data with it yet. :roll_eyes:

My script takes as arguments paths to files containing Friends+Me Google+ Exporter JSON export files, Friends+Me Google+ Exporter CSV image map files, and a single JSON file that maps Google+ Category IDs to Discourse categories, subcategories, and tags. It does not create new categories; if a category is missing it complains and bails out after writing a new file in which to fill the missing information to complete the import. It expects all the categories to have been created already.
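To make the “complain and bail out” step concrete, here is a small standalone sketch. The shape of the map file (Google+ category ID mapped to `category`, `subcategory`, and `tags`) and both helper names are assumptions for illustration; the real script may differ.

```ruby
require "json"

# Return the Google+ category IDs seen in the export that the map does not cover.
def missing_category_ids(seen_ids, category_map)
  seen_ids.uniq.reject { |id| category_map.key?(id) }
end

# Write a copy of the map with blank entries for the missing IDs,
# so they can be filled in by hand before the import is rerun.
def write_category_template(path, category_map, missing_ids)
  template = category_map.dup
  missing_ids.each do |id|
    template[id] = { "category" => "", "subcategory" => "", "tags" => [] }
  end
  File.write(path, JSON.pretty_generate(template))
end
```

An importer could call `missing_category_ids` before creating any posts, and if the result is not empty, write the template and exit with an error instead of importing into the wrong place.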

The idea is to do all the imports into a single discourse in one pass, so that all possible “plus-mentions” of other G+ users across multiple G+ communities turn into “at-mentions” in Discourse, even for people not active in the community in which they are mentioned, as long as they wrote some post or comment somewhere in the whole set of data being imported. This is because so far it looks like I’ll be importing about 10 communities, with about 300MB of input JSON and about 40GB of images.
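The first pass of that plan could look roughly like this. The export shape used here (a `posts` array whose entries have an `author` and optional `comments`, each with their own `author`) is a simplified assumption, not the documented Friends+Me schema, and `collect_authors` is a made-up name:

```ruby
# Pass 1: gather every author (of posts and of comments) across all exports,
# keyed by Google+ ID, before any post is imported.
def collect_authors(exports)
  authors = {}
  exports.each do |export|
    export.fetch("posts", []).each do |post|
      # Treat the post and each of its comments the same way.
      [post, *post.fetch("comments", [])].each do |item|
        author = item["author"]
        # Keep the first name seen for each ID.
        authors[author["id"]] ||= author["name"] if author
      end
    end
  end
  authors
end
```

Once every author from every community is known, a second pass over the same exports can import the posts, and a plus-mention can resolve even if that person only ever wrote in a different community.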

It is intended to work on a Discourse instance that already has users referenced in the import by google ID, and that already has content and categories created. I hope that it will also make it possible for people to log in with google OAuth2 after the import and automatically own their content because their google auth ID is tied to the fake account holding the data, so that their ability to own their own content is preserved.

I expect that a 431-line file that has never seen an interpreter will be loads of fun to debug, especially when written and being tested by someone who has never written any ruby before. I don’t pretend that writing this script is the largest part of the work. I’ll share it now or any later time with anyone seeking the bounty, as long as you’ll share your fixes with me regardless of bounty progress; just PM me. I’ll share it myself under GPLv3 at such time as I get it working. In the meantime, I’m considering this work my contribution toward someone else claiming the bounty, to make it more likely to be worth the time for whoever takes it to completion, because of the comment above that the bounty is smaller than typical.