[bounty] Google+ (private) communities: export screenscraper + importer

(Jay Pfaffman) #42

Is there something in the original raw that includes sizing of the images?


Not sure where that size is coming from. It’s conceivable (but not entirely likely) that it’s a bug in Discourse, but I’d first check what the original message looks like.

Are you on latest?

(mcdanlj) #43

I am pretty sure that I had rebased on master well before I ran that import. Reading the code, the bug is in embedded_image_html which doesn’t preserve aspect ratio:

    def embedded_image_html(upload)
      image_width = [upload.width, SiteSetting.max_image_width].compact.min
      image_height = [upload.height, SiteSetting.max_image_height].compact.min
      upload_name = upload.short_url || upload.url

There’s not any multiplication or “square rootification” in there anywhere. About 20 years ago, my quick hack shell script to create indexes of images did this:

        eval $(identify -format 'OLDWIDTH=%w;OLDHEIGHT=%h' $file)
        NEWWIDTH=$(echo "sqrt(($AREA*$OLDWIDTH)/$OLDHEIGHT)" | bc)
        NEWHEIGHT=$(echo "sqrt(($AREA*$OLDHEIGHT)/$OLDWIDTH)" | bc)

That formula is a few millennia old, so I don’t claim any credit, and similar logic could keep aspect ratio while honoring the most restrictive constraint.
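A minimal sketch of that "most restrictive constraint" idea in Ruby, for anyone following along. `fit_within` is a hypothetical helper, not something in Discourse, and it assumes both limits are set (the quoted code uses `compact` to cope with a missing value, which this sketch does not):

```ruby
# Scale (width, height) down so it fits inside (max_width, max_height)
# while keeping the aspect ratio. Never scales up.
# Hypothetical helper for illustration; not part of Discourse.
def fit_within(width, height, max_width, max_height)
  # The smallest ratio is the most restrictive constraint; 1.0 means "no change".
  scale = [max_width.to_f / width, max_height.to_f / height, 1.0].min
  [(width * scale).round, (height * scale).round]
end
```

With limits of 690x500, a 1000x500 upload would come out as 690x345 rather than 690x500, which is the distortion described above.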

(mcdanlj) #44

Actually, wouldn’t it make more sense to just leave off the width and height, or leave it off if either exceeds the SiteSetting? Then I would expect the site settings to just override and do the right thing. It seems odd to have any width by height encoded in the markdown at all… I pushed a workaround to my branch that creates a method that doesn’t set width or height at all. I’ll let baking take care of that based on site settings.

Looks like I can fix this up with rake posts:remap:

Here’s what works within irb:

irb(main):001:0> s = '![9b1cf6ecdc9e99cba74cd19a7a46a150.jpeg|690x500](upload://oogg153c3ECDwmgszXxhe9Zt2Vr.jpeg)'
=> "![9b1cf6ecdc9e99cba74cd19a7a46a150.jpeg|690x500](upload://oogg153c3ECDwmgszXxhe9Zt2Vr.jpeg)"
irb(main):002:0> s.gsub(/(\!\[[^|]+)\|\d+x\d+\]/, '\1]')
=> "![9b1cf6ecdc9e99cba74cd19a7a46a150.jpeg](upload://oogg153c3ECDwmgszXxhe9Zt2Vr.jpeg)"

Regardless of whether we use " or ' in the shell, we need rather a lot of \ to call rake and have it work:

$ bin/rake "posts:remap[(\\\!\\\[[^|]+)\\\|\\\d+x\\\d+\], \\\1], regex]"
warning: regular expression has ']' without escape: /(\!\[[^|]+)\|\d+x\d+]/

Yes. Yes it does. :slight_smile:
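For reference, here is the same substitution as a small plain-Ruby sketch with the closing `]` escaped, so it runs without that warning. The constant and method names are made up for illustration, and it only handles the plain `|WIDTHxHEIGHT` suffix shown above, nothing else that might follow the `|`:

```ruby
# Matches the "|WIDTHxHEIGHT" suffix inside an image's alt text, e.g.
# "![name.jpeg|690x500](upload://...)". The closing "]" is escaped.
# Names here are hypothetical, for illustration only.
IMAGE_DIMENSIONS = /(!\[[^|\]]+)\|\d+x\d+\]/

def strip_image_dimensions(raw)
  # '\1]' keeps the "![name" part and puts the closing bracket back.
  raw.gsub(IMAGE_DIMENSIONS, '\1]')
end
```

Because it uses `gsub`, every image in a post is rewritten, and images that have no size suffix are left alone.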

Update: That posts:remap was effective on my test system with around 14K topics on it, but on the production server with 37.6K topics and 263K posts, as well as an additional 25GB or so of images, it failed partway through. It appears to have run out of memory; I no longer have the error message to reference and my recollection is hazy. It appears to have set off a storm of image reprocessing. The system is running a constant load of convert and optipng with a smattering of other image processing tools, to the point where it is making the UI slow.

I’m mentioning this operational issue here only for the benefit of the person three years in the future who finds this post from a search and tries a similar change… :slight_smile:


Is there any way that the G+ (private) community importer functionality could be made to work with a Discourse-hosted account? If so, what would be the necessary process to make it happen, step by step?

(mcdanlj) #46

@Rafa are you talking about google takeout data? If so, someone would have to write an import script for that format.

Alternatively, you could purchase the Friends+Me exporter ($20 or maybe $30 now), which has the ability to export private as well as public posts, in which case this script can do the import part of the process. That’s not a step-by-step, that’s a way to get data. As @pfaffman pointed out, the import process takes some knowledge to execute. Can’t give you a step-by-step, sorry.

(mcdanlj) #47

Another potential improvement to the script would be to recognize silenced users and drop them from the import. We are importing related G+ communities that have been imperfectly cleared of spam and that have overlapping spammers, and if we silence a spammer after importing one, it would be a good idea to not import spam posts from the same spammer in the next import…
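Roughly what I have in mind, as a plain-Ruby sketch. The post hashes, the `:author_id` key, and the method name are all assumptions for illustration; collecting the IDs of already-silenced users from Discourse is left out here:

```ruby
require 'set'

# Drop posts whose author is in the list of known spammers.
# `posts` is assumed to be an array of hashes with an :author_id key;
# that shape is hypothetical, not the exporter's actual format.
def drop_blacklisted(posts, blacklisted_ids)
  blocked = blacklisted_ids.to_set
  posts.reject { |post| blocked.include?(post[:author_id]) }
end
```

The same list could then be reused across imports of related communities, so a spammer silenced after one import never gets posts created in the next.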

(mcdanlj) #48

@pfaffman it’s not create_post that ignores duplicate import IDs, it’s create_posts plural. I used the ning importer as a pattern, which doesn’t use create_posts probably for the same reason it would be a pain to use it in this importer.

I fixed incremental imports, added blacklist capability, and added progress indication for topic/post import and it’s all there on my branch.

(Jay Pfaffman) #49

Aha! That sounds right. Sorry if I led you astray.

That’s fantastic! I’ve been working on a new installation/deployment system for Discourse that’s taken all of my attention lately.

(mcdanlj) #50

If only Discourse were open source, I could have read it and figured it out myself! Oh wait, it is! :wink: In all seriousness, I just mentioned it in case someone is ever reading this thread for context on some other importer they are writing.

While I was driving home from work the other day, it suddenly occurred to me that the lack of a progress indicator probably meant I was doing it wrong. As soon as I asked the question the right way, the pattern was obvious. No wasted time.

(mcdanlj) #51

I originally approved the @gplus.invalid users so that they could “just work” with google login. Sadly, it appears that this has been causing Discourse to send emails to these domains, and as we’re using the Digital Ocean configuration with mailgun, mailgun is now throttling us. For this reason, I have removed the code to approve the @gplus.invalid users from my branch.

I also originally made the blacklist both suspend and silence blacklisted users. However, in order to make it possible for blacklisted users to appeal the blacklisting, I have changed it to only silence them.

Finally, I found that for whatever reason, I am seeing a lot of completely empty posts. No message, no image, no images from Google+. While I don’t know why they are appearing, they are definitely empty in the source data and are noise in the import, so I added a minimum raw post length and defaulted to 12 characters, which is probably too low a limit anyway. That includes links to images, so having even one image with no text will exceed the minimum and be imported.
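The check is about as simple as it sounds. A sketch, assuming the length is measured on the whitespace-stripped raw text (the stripping and the names are my assumptions for this example, not necessarily what the script does):

```ruby
# Default minimum mentioned above; the constant name is hypothetical.
MIN_RAW_LENGTH = 12

# True when the raw post body is long enough to be worth importing.
# nil and whitespace-only bodies count as empty.
def long_enough?(raw, min_length = MIN_RAW_LENGTH)
  raw.to_s.strip.length >= min_length
end
```

A bare image link such as `![a.jpeg](upload://x.jpeg)` is already longer than 12 characters, so image-only posts still pass.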

(Jay Pfaffman) #52

You need to deactivate those users. That will stop them from getting mail. You’ll still have to do something to get them to fix their address when they do log in.

(mcdanlj) #53

Yeah. I tried to deactivate only the users who hadn’t actually logged in, which should also change their primary email by merging with their google auth response:

users = User.where(trust_level: 1)
users.each do |u|
  if u.email.ends_with?("@gplus.invalid")

There was probably a better way to do it, but sooner was better than perfect.

(mcdanlj) #54

Well, yes, that’s the big problem I have now. I was hoping that the google login path would fix up the address when it merges in the authentication record, but no such luck. And I don’t see an option I can set for all these users to force them to change their email address on sign-in.

So right now, I have fixed the problem where we were hammering mailgun with @gplus.invalid addresses, but have now completely broken the ability to log in via google auth for all imported users who haven’t already logged in with google auth. :confused:

It looks like with the imported users disabled as described, they can never log in, because Discourse does not update their email with the email provided in the google auth response. It doesn’t even set it as a secondary email. That defeats the whole point of the mapping.

Could I perhaps enable them but set their bounce score high, reset in 1000 years?

Here’s hoping that I didn’t just break things too badly:

users = User.where(trust_level: 1)
users.each do |u|
  if u.email.ends_with?("@gplus.invalid")
    s = UserStat.where(user_id: u.id).first
    s.bounce_score = 5.0
    s.reset_bounce_score_after = 1000.years.from_now
    s.save # assumed: the snippet as posted is cut off here; persist the change
  end
end

If this works, I’ll change the import script to use this instead of disabling the bad-email users.

It still won’t fix the fact that the google login info email is ignored, but I am hoping it at least works around the most immediate problem.

I’m guessing this is the only importer that has to deal with not having valid email addresses for users, so this is a weird problem.

(Andreas Dorfer) #55

Perhaps I am missing the point, but what about disabling email on the system (like for ML imports) during the import phase?

(mcdanlj) #56

@adorfer It does disable mail during the import. But then it tries to send some emails after the import.

The import script now sets up activated users with the address marked as bouncing. So far, it looks OK.

I wish that google authentication set, or could be configured to set, the account primary email to the email provided in the google authentication payload, but it looks from the code like that was considered and roundly rejected. I guess I could generate a PM from the administrative user to each imported user asking them to set their email? I’m unlikely to add this to the importer though! I’ve probably imported 90% of the content I’ll ever import with it…


@mcdanlj Please forgive me for not being very clear. I’m talking about using the Friends+Me exporter to get the data from a private Google+ Community, and then using your script to import that data to a community hosted on Discourse.org (as opposed to a self-install on Digital Ocean, etc.).
When I wrote step by step, I meant a high-level overview so that I have an idea of how to hire someone to handle the process for me. That is, unless you’re available! :smiley:

(mcdanlj) #58

@Rafa that would take someone else. My only experience with Discourse (other than slight experience as an end user) has been the past three weeks, and that was only setting up a development environment on my laptop and doing an import on a Digital Ocean droplet.

The entries in the google-plus-image-list.csv file from F+M G+E are absolute paths, so you either have to put the files at the same path on the Discourse server or edit the file to point to the location to which you copy them on the server. I edited it. I did at most 1000 posts at a time, and made packs of those posts with only the image files referenced in those posts, using the upload-paths.txt feature of the script running in my development Discourse on the system on which I ran F+M G+E. I copied just those files to the appropriate location on the server and ran the importer there against only those copied files, 1000 or fewer posts at a time. The 3D Printing community I imported was 23 packs, because it was almost 23,000 posts. I made sure that the import packs contained the spammer blacklist file so that spam posts weren’t imported.
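To give the general shape of the pack-building without my system-specific details, here is a plain-Ruby sketch. Everything in it is an assumption for illustration: the post hashes, the `:images` key, the helper names, and the example prefixes are made up, and it does not read the real CSV or upload-paths.txt formats:

```ruby
PACK_SIZE = 1000 # posts per pack, as described above

# Rewrite an absolute path from the export machine to its server location.
# Paths that do not start with old_prefix are returned unchanged.
def relocate(path, old_prefix, new_prefix)
  return path unless path.start_with?(old_prefix)
  new_prefix + path.delete_prefix(old_prefix)
end

# Split posts into packs and list only the image files each pack references.
# `posts` is assumed to be an array of hashes like
#   { id: "p1", images: ["/abs/path/a.jpg"] }  (hypothetical shape)
def build_packs(posts, old_prefix, new_prefix, pack_size = PACK_SIZE)
  posts.each_slice(pack_size).map do |slice|
    images = slice.flat_map { |post| post[:images] }.uniq
    {
      post_ids: slice.map { |post| post[:id] },
      # Each entry is [path on the export machine, path on the server].
      files: images.map { |path| [path, relocate(path, old_prefix, new_prefix)] }
    }
  end
end
```

Each pack then carries just the files it needs, which is what keeps the copy to the server small.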

The details of how I constructed the files are specific to my system and the layout of the DO discourse install, so not useful beyond the general description.

I would think that anyone who has ever done a discourse import of any kind would be able to figure out this importer. The only oddity really is that because I’m trying to avoid creating duplicate or typo categories, I intentionally left category creation out of the importer and insist that the categories exist already. Anyone who has run an import can figure out how to create them; anyone who has written an importer can figure out how to create them dynamically if they care, since almost every importer written already does that.

If @pfaffman finishes his current task, he’d be an excellent resource, since he’s been following along this whole time and helping out, and it’s what he actually does for a living.

(Jay Pfaffman) #59

I’ve done dozens of imports, written several importers, and done several imports for Discourse.org hosted customers. See Discourse Migration – Literate Computing.

Apologies if I’ve sent this to you already. This is a long topic!

(mcdanlj) #60

My latest change added a --last-date=ISO8601DATEHERE option for importing communities that as of a certain date were abandoned by all but spammers. I found a community with no real posts or comments after a certain date but with lots of great content before that date and only spammers afterward, so this enables me to preserve it. :slight_smile:
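The cutoff itself amounts to a date comparison. A sketch in plain Ruby, assuming post timestamps are ISO 8601 strings and that the cutoff is inclusive (both are assumptions for this example, as are the names):

```ruby
require 'time'

# True when the post was created at or before the cutoff.
# Both arguments are ISO 8601 strings, e.g. "2018-06-01T00:00:00Z".
def on_or_before?(created_at, last_date)
  Time.iso8601(created_at) <= Time.iso8601(last_date)
end

# Keep only posts up to the cutoff. `posts` is assumed to be an array of
# hashes with a :created_at key (hypothetical shape).
def drop_after(posts, last_date)
  posts.select { |post| on_or_before?(post[:created_at], last_date) }
end
```

Anything the spammers posted after the chosen date is simply never handed to the importer.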

(David Taylor) #61

Emails to .invalid addresses are now skipped: FEATURE: Skip sending emails to domains on the `.invalid` TLD (#7162) · discourse/discourse@420c6f8 · GitHub

We don’t always replace emails for social logins, because users might deliberately want different email addresses for different services. However, in this case it would certainly be better to replace an invalid email address with a real one, so I’ve created this PR which is pending review: FEATURE: Fetch email from auth provider if current user email is invalid by davidtaylorhq · Pull Request #7163 · discourse/discourse · GitHub. As part of this, users will automatically be activated.

So, the best state for you to import users is active: false, email:"something@something.invalid", and there is no need to mess with the bounce score.

Also, you may have noticed that we migrated google_user_infos into user_associated_accounts, so you will need to replace GoogleUserInfo references with UserAssociatedAccount.where(provider_name:"google_oauth2")