[bounty] Google+ (private) communities: export screenscraper + importer

(mcdanlj) #81

@pfaffman I don’t see where that code creates the categories; that looks like the code I already have.

1 Like
(Jay Pfaffman) #82

Oops. It seems that I didn’t pay attention to what I charged.

1 Like
(Jay Pfaffman) #83

I’ll try again here. I’ve got a site that’s down (that heated up just as I was merging your code with mine). Here is what I meant to share before. I’ll still try to submit a PR Real Soon Now (as I’m about to run this import again with your new code...)

  def map_categories
    puts "", "Mapping categories from Google+ to Discourse..."

    @categories.each do |id, cat|
      if cat["parent"].present?
        # Two separate sub-categories can have the same name, so identify by parent.
        # Sub-categories are only looked up here, never created.
        Category.where(name: cat["category"]).each do |category|
          parent = Category.find_by(id: category.parent_category_id)
          # parent is nil for a top-level category with the same name, so guard with &.
          @cats[id] = category if parent&.name == cat["parent"]
        end
      elsif (category = Category.find_by(name: cat["category"]))
        @cats[id] = category
      else
        # No existing top-level category with this name, so create it.
        puts "Creating #{cat['category']}"
        @cats[id] = create_category({ name: cat["category"], id: id }, id)
      end
      raise "Could not find category #{cat["category"]} for #{cat}" if @cats[id].nil?
    end
  end
1 Like
(mcdanlj) #84

Thanks!

Looks like that doesn’t handle sub-categories; am I missing something?

1 Like
(Jay Pfaffman) #85

Err. Yeah. I think that’d be true. I just let your script build the categories.json file and went with it. I’ll let the site owner clean that up later.

I think that there’s a clever way to have the part at the top add the parent category to [params], so that my create-category block appears just once.

It’s not clear that anyone other than the two of us cares about this. You can decide whether to reject the code or add a warning. I think it’ll just crap out if the JSON file has a sub-category, which is what it was doing for categories before.

1 Like
#86

@mcdanlj With a recent import, approximately 1,345 users of the 4,672 in the G+ community were brought in. Does the importer only include those users that have +1’d, commented or posted?

If that’s the case, I’m fine with it, but if not I’d like to know if there’s a bug or if there’s some specific user criteria being evaluated on import. Many thanks for all your effort on this project!

2 Likes
(mcdanlj) #87

It imports only users who have posted or commented. There’s no support for +1, sadly. If you have a list of users from a Google takeout it wouldn’t be hard to recognize them and add them, but it’s not clear how that would be valuable. What would you use it for?

2 Likes
(Jay Pfaffman) #88

This is actually a feature. Having a bunch of users with no posts and bogus email addresses wouldn’t help anyone.

2 Likes
(mcdanlj) #89

It occurs to me that I don’t know what the script does with a plus-mention of a user who never posted. I’d have to read it and I’m at work right now but I don’t remember handling that.

Update: Oh look, I did something right. (“Don’t worry. I won’t let it go to my head.”)

  • import_author_user imports the users who create posts or comments.
  • import_message_users imports the users who were plus-mentioned.

So that should just work correctly.
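To make the user counts above concrete, here is a minimal sketch of why only a subset of community members ends up imported. It is not the code from the import script: the function name `users_to_import` and the field names `author`, `comments`, and `mentions` are hypothetical stand-ins for whatever the exporter's JSON really contains. The point is only that users are collected from post authors, comment authors, and plus-mentions, so a member who only clicked +1 never appears.

```ruby
require "set"

# Hypothetical export shape (assumed for illustration only):
#   post    = { "author" => { "id" => ... }, "mentions" => [...], "comments" => [...] }
#   comment = { "author" => { "id" => ... }, "mentions" => [...] }
def users_to_import(posts)
  ids = Set.new
  posts.each do |post|
    ids << post["author"]["id"]                                # post authors
    Array(post["mentions"]).each { |user| ids << user["id"] }  # plus-mentioned users
    Array(post["comments"]).each do |comment|
      ids << comment["author"]["id"]                           # comment authors
      Array(comment["mentions"]).each { |user| ids << user["id"] }
    end
  end
  ids
end

posts = [
  {
    "author" => { "id" => "alice" },
    "mentions" => [{ "id" => "bob" }],
    "comments" => [{ "author" => { "id" => "carol" } }]
  }
]
puts users_to_import(posts).to_a.inspect
```

Anyone absent from all three places is simply never seen by a collector like this, which matches the behaviour described above.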

I should re-clarify that although Google+ can display all the users who clicked +1 if you click all the way through, the exporter only exports the count of plus-ones, not the users who clicked +1, which means (as discussed way back in this thread) we can’t import them into Discourse in any meaningful way.

1 Like
(Keith John Hutchison) #90

Assuming each subcategory links to only one parent category, and each category has a parent property, the JSON file could be processed in two passes: parents first, then subcategories.
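The two-pass idea can be sketched in plain Ruby. This is only an illustration under assumptions: the hash shape mirrors the `"category"` and `"parent"` keys that `map_categories` reads, and `creation_plan` is a hypothetical helper that returns the order in which categories would be created rather than calling Discourse itself.

```ruby
# Assumed input shape: { id => { "category" => name, "parent" => parent name, nil, or "" } }
def creation_plan(categories)
  # Split into top-level categories and sub-categories.
  top_level, subs = categories.partition do |_id, cat|
    cat["parent"].nil? || cat["parent"].empty?
  end

  ids_by_name = {} # top-level name => id, filled in during the first pass
  plan = []

  # Pass 1: parents.
  top_level.each do |id, cat|
    ids_by_name[cat["category"]] = id
    plan << { id: id, name: cat["category"], parent_id: nil }
  end

  # Pass 2: sub-categories, each resolved against a parent from pass 1.
  subs.each do |id, cat|
    parent_id = ids_by_name.fetch(cat["parent"]) do
      raise ArgumentError, "Unknown parent #{cat["parent"].inspect} for #{cat["category"].inspect}"
    end
    plan << { id: id, name: cat["category"], parent_id: parent_id }
  end

  plan
end
```

Because every parent is handled before any sub-category, the order of entries in the JSON file stops mattering, and a sub-category whose parent is missing fails with a clear error instead of silently being skipped.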

1 Like
(Jay Pfaffman) #91

I think that those approvals might have to do with the G+ stuff and might be a bug. See Needs Approval feature (Nothing to Approve) and SSO user unnecessary approval with 403 error.

(mcdanlj) #92

Friends+Me Google+ Exporter 1.8.4 — released just this morning — is able to find images and videos that 1.8.3 and earlier did not find.

I strongly suggest downloading it and running it right now before Google turns off the lights:

  • Recommended: Disable tor. It is reportedly no longer required, and with the service disabled the download will be faster and put less load on the Tor network.
    • Click the gear icon at the upper right
    • Disable tor
    • Exit the exporter and restart it with tor disabled.
  • Press the REFRESH ALL button.
  • Download all images, and if it fails to download some, do it again until they are all downloaded or you aren’t making progress.
  • Download all videos, repeatedly until all downloaded

I will write a new importer script, based on this exporter, which is intended to be OK to run directly in production (after taking a backup, please). For each already-imported and not subsequently deleted topic or post, it rewrites the post so that all the images and videos are included. It will also fix up ## tags if you ran the importer back when it had the bug of double-hashing tags, repair some missing oneboxes by putting more URLs on their own lines, and fix any nested markup failures. If I find other bugs in the representation between now and then I’ll try to fix them up too. In general, it will use the same formatting code as the existing exporter, and if the formatting produces a different result from what is stored, it will use the new result.
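Two of the text fixups described above can be approximated with plain string handling. This is a rough sketch, not the update script's code: `fix_double_hash_tags` and `urls_on_own_lines` are hypothetical names, and the rules (only a trailing URL is moved, only `##` directly followed by a word character is collapsed) are simplifying assumptions.

```ruby
# Collapse "##tag" back to "#tag". A "## Heading" line is left alone because
# the hashes there are followed by a space, not a word character.
def fix_double_hash_tags(raw)
  raw.gsub(/(?<![#\w])##(?=\w)/, "#")
end

# Move a URL that ends a line of text onto its own line, which is the form
# Discourse needs before it will try to onebox the link.
def urls_on_own_lines(raw)
  raw.split("\n", -1).map do |line|
    line.sub(%r{\A(.*\S)\s+(https?://\S+)\s*\z}) { "#{$1}\n#{$2}" }
  end.join("\n")
end

raw = "New ##laser build log https://example.com/build"
puts urls_on_own_lines(fix_double_hash_tags(raw))
```

Lines that already consist of a single URL do not match the pattern and are returned unchanged, so running the fixup twice gives the same result as running it once.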

If your Discourse is currently live or will be live before this work is complete, I recommend that you immediately add to your Discourse a pinned or banner topic similar to this pinned topic on makerforums:

Just to set expectations, it’s more than likely that it will be a week or three before I have this ready. It’s not fundamentally hard, but being confident that the testing is sufficient will take some thought, and I haven’t started writing it yet.

Update: I have this working in test, but gaining confidence that it is working fully correctly is hard. These edits are not something that can be easily backed out.

4 Likes
(Celeste Weingartner) #93

Waiting patiently for this new version… Some stats on the level of import we’re using it for:

Over 4000 users, and over 15,000 posts.

We already did one import with a week-or-so-old version of the importer. We are now running a new export with 1.8.4 and waiting for the new version of mcdanlj’s import script.

Thank you mcdanlj and Jay Pfaffman for your hard work on this.

5 Likes
(mcdanlj) #94

Update: After talking to Alois Bělaška of Friends+Me, it does make sense to PR this import script. Google is maintaining Google+ for GSuite users, and GSuite users of Google+ might determine at some point that a migration to Discourse would better meet their corporate needs. Given that, it would be better for this script to be incorporated into Discourse in case this happens.

I’ll incorporate optional category creation before submitting a PR.

4 Likes
(mcdanlj) #95

I have published some changes to the script, but @Celeste_W I doubt they affect you.

  • I added the ability to provide a mapping from Google user ID to Discourse @-handle for cases where you know about a specific user to map to a Discourse user, or for cases where (as I have) you have merged the imported user into a non-imported user, breaking the import association.
  • I added a --first-date option to import only posts made after a certain date. We have some communities that we would like to import that were not cleaned of spam before a new moderator took over and started paying attention.
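The two additions above amount to a lookup and a filter. The sketch below shows the shape of each in plain Ruby. The names `resolve_username` and `filter_by_first_date`, the `createdAt` field, and the choice to keep posts dated on the cutoff day itself are all assumptions made for illustration; the script's real option parsing and data format are not shown in this thread.

```ruby
require "date"
require "time"

# Prefer a hand-written Google ID => Discourse username mapping, and fall
# back to whatever association the importer recorded on an earlier run.
def resolve_username(google_id, manual_map, imported_map)
  manual_map[google_id] || imported_map[google_id]
end

# Keep only posts created on or after first_date (a "YYYY-MM-DD" string).
# Timestamps are assumed to be ISO 8601 strings and are compared in UTC.
def filter_by_first_date(posts, first_date)
  cutoff = Date.parse(first_date)
  posts.select { |post| Time.iso8601(post["createdAt"]).utc.to_date >= cutoff }
end

posts = [
  { "id" => "spammy-era", "createdAt" => "2015-03-01T12:00:00Z" },
  { "id" => "moderated-era", "createdAt" => "2018-07-04T09:30:00Z" }
]
puts filter_by_first_date(posts, "2018-01-01").map { |post| post["id"] }.inspect
```

A manual mapping like this also covers the merged-user case: once the imported account has been merged away, the importer's own record no longer points anywhere useful, so the hand-written entry has to win.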

The import script only writes new posts; it never modifies old posts.

The update script that I’m working on only fixes up old posts; it does not import additional posts.

5 Likes
(mcdanlj) #96

I have (successfully, as far as I can tell) used the update script I wrote. Added about 6GB of images I missed the first time around, and fixed up formatting and tags.

I’ll commit it to my current branch. I do not expect to include it when I submit a PR; I’ll be rebasing my work onto a new branch for the PR.

My branch now has category creation from @pfaffman in it, based on a documented boolean at the top of the script.

I have neither tested the change nor merged up to latest, because I don’t currently have imports to run and we’re still waiting for the dust to settle on the beta6 changes before moving forward on makerforums. Let me know if you have any problems.

Update 8 April:
PR opened

The fixup script seemed to work, but I missed something, and all the images and videos I uploaded were clearly not marked as uploaded and were tombstoned and garbage-collected. So 7.5GB of images and videos that I restored have gone away. The only exceptions I find are posts that I edited by hand.

I believe that the problem was that I saved the post and tried to schedule a rebake by setting rebake_version to nil instead of calling Post.revise()
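As a rough illustration of the safer path suspected above: a minimal sketch, assuming `post` behaves like Discourse's Post model in that it responds to `raw` and `revise(editor, changes)`. `apply_fixup` and `FakePost` are hypothetical names that exist only so the example runs outside Rails; this is not the code from the update script.

```ruby
# Apply a rewritten body through the post's own revise method instead of
# saving the record directly and nudging it to rebake.
def apply_fixup(post, editor, new_raw)
  return false if post.raw == new_raw # nothing changed, so create no revision

  post.revise(editor, { raw: new_raw })
  true
end

# Stand-in for a Discourse post, for demonstration only.
FakePost = Struct.new(:raw, :revision_count) do
  def revise(_editor, changes)
    self.revision_count += 1
    self.raw = changes[:raw]
    true
  end
end

post = FakePost.new("old body", 0)
apply_fixup(post, :system, "old body with image restored")
puts post.revision_count
```

Skipping unchanged posts matters here because every real revision is visible in the post history, and fewer revisions means less noise to review afterwards.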

@Celeste_W @adorfer @Rafa @lapwingg @irek Please note that I have pushed a new version of the script/import_scripts/friendsmegplus-update.rb script that I believe will fix this bug. If you have already run the friendsmegplus-update.rb script you will want to run it again with the updated version. I’ll keep testing it, and I might come up with more fixes; I’ll plan to update when I have more information. It’s at least more promising in testing so far.

My current version, because it uses Post.revise, sends users a system-edit notification for every modification. In one sense that’s OK, but because this is happening to the majority of over 50000 topics, the most active users will have their notification widget spammed and made useless for the duration of the update. Wish I knew a better way!

I finished running it successfully. My latest edit handles posts to deleted topics correctly. It seems to have worked right. It shows revisions so I can compare to see what changes it made, and this gives me confidence that 7.5GB of images and videos won’t be tombstoned this time. :slight_smile:

I just wrote up the process I used, after realizing that more people might benefit from it. Hopefully this satisfies the requests for documentation I’ve seen somewhere earlier in this massive thread! Feel free to ask questions here and I’ll plan to update my documenting blog post with additional details.

5 Likes
(Celeste Weingartner) #97

Going to be giving this a shot… We noticed that the import script does not handle banned users/emails well and kind of puked, not sure if that was due to the banned user… or what just yet.

(Celeste Weingartner) #98

The new script and updater worked a treat. Thank you.

3 Likes
(mcdanlj) #99

Was that only with an older version of the script, and the latest version resolved the problem for you?

1 Like
(mcdanlj) #100

I have fixed the lint failures, so Travis is now green for the PR.

Is there someone who can review it? :slight_smile:

…Thanks @sam!

4 Likes