[bounty] Google+ (private) communities: export screenscraper + importer

(mcdanlj) #62

@david Thanks for the new feature and for the heads-up. I hadn’t seen either change yet. (Have I mentioned enough how awesome this community is?)

Update: I have merged up to the commit we’re currently running in production, 4a00772c199b, which claims to be 2.3.0.beta4 but doesn’t match the tagged 2ee02cb6c7fe. (I just want to make sure my dev matches prod as closely as I can.) c93c9f17818c on my branch adjusts to UserAssociatedAccount and seems to be working in dev though I haven’t yet tried it in production.

(Andreas Dorfer) #63

A Readme would be helpful for those who do not start imports on a daily basis.
In case it’s similar to other importers, perhaps link to them or their respective forum threads.

(Irek) #64

@mcdanlj Kudos to you and your hard work!
A few days ago I asked you (on G+) whether it’s possible to customize the mapping between G+ posts and Discourse topics and posts. I’m now importing data from my community, exported with the “Google+ Exporter Tool”. My initial attempts failed because I was trying to import into a production version of a Discourse instance and got errors. It turned out that one has to install a development version of Discourse and copy all your migration tool files into the Discourse project. Done this way, everything finished successfully and I have my community imported.

At this point I would like to ask another question. Is it possible to use your migration script on a production version (this would be the fastest way)? Or should I run it on a development version as described, then make a standard Discourse backup and restore that backup on my production instance? I haven’t tried the latter yet, but I assume the production and development versions have to match, for example both 2.3.0.beta5 (the latest right now), to keep the database schemas compatible.

(mcdanlj) #65

@irek Thanks for reporting success! That makes me happy to hear!

I use the script in development to do test imports, then copy the source data to the production server and run the same import with the script in production. I’m not doing a database dump/restore or a Discourse merge. People keep using makerforums and contributing during an import; the only downside I know of is that email gets turned off during the import, so people trying to set up a new account aren’t sent activation emails, and notification emails aren’t sent. I have a pinned post with details about that, and we have a summary in the site banner.

So I don’t know what problems you are experiencing in production; all I can say is that all of my imports into the production instance have been live imports using the script while the site is running.

Edit: Oh, if I remember correctly now you are importing into a Discourse.org-hosted forum. That is something that I have no experience with. :slight_smile:

(mcdanlj) #66

I have had a bug in the importer that previously caused it to miss importing link elements in the data. See this bug report from Reddit for an example post that had the problem:

Update: My branch has what I hope is a fix for this and has also been updated past beta5 to keep up with the current version on makerforums. I found 8558 posts in my import data that were potentially affected, out of over 27K that had the link element (in the rest, it was redundant), and some of those 8558 were blacklisted as spam, so I had fewer to fix up. I have tested that fixed posts look right, and when rebaked they onebox properly.

I wrote a quick Python script (sorry, Ruby devs) to identify affected G+ posts and write a JSON file of G+ IDs and missing URLs. I wrote a separate import script to apply those changes to a running instance. I’m hoping this didn’t actually affect @irek, and I haven’t heard of anyone else actually doing an import yet. @irek, if you need any of this, please PM me and I’ll give you what I have to fix it up.
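
For readers who want to run a similar check, here is a minimal Ruby sketch of that kind of scan (the original Python script was not posted, so this is not it). The nesting down to categories follows the read_categories code posted later in this thread; the "posts", "id", "message", and "link"/"url" keys, the plain-string message, and the name find_missing_links are all assumptions for illustration.

```ruby
require "json"

# Walk a parsed export and collect posts whose link URL does not appear
# in the message text. Key names below "categories" are ASSUMPTIONS
# about the export JSON, not taken from the exporter documentation.
def find_missing_links(feed)
  missing = {}
  feed.fetch("accounts", []).each do |account|
    account.fetch("communities", []).each do |community|
      community.fetch("categories", []).each do |category|
        category.fetch("posts", []).each do |post|
          url = post.dig("link", "url")
          next if url.nil? || url.empty?
          # If the URL is already in the text, the link element is redundant.
          next if post["message"].to_s.include?(url)
          missing[post["id"]] = url
        end
      end
    end
  end
  missing
end

# Tiny made-up feed to exercise the scan.
feed = {
  "accounts" => [{
    "communities" => [{
      "categories" => [{
        "posts" => [
          { "id" => "p1", "message" => "Look at this", "link" => { "url" => "https://example.com/a" } },
          { "id" => "p2", "message" => "See https://example.com/b", "link" => { "url" => "https://example.com/b" } },
          { "id" => "p3", "message" => "No link here" },
        ],
      }],
    }],
  }],
}

# Emit G+ ID => missing URL as JSON, the shape a fix-up script could consume.
puts JSON.pretty_generate(find_missing_links(feed))
```

Only p1 is reported here: p2 already carries its URL in the text and p3 has no link element.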

(Andreas Dorfer) #67

It looks like the data structure has been changed a bit?

UPDATE version 1.8.0 / 15-3-2019 :

  • NEW uploaded videos are detected in downloaded posts and available for download.
  • NEW count of +1s is downloaded for posts and comments.
  • NEW 2 times faster communication to Google+ servers means faster posts and comments download.
  • FIX JSON structure description (Google+ Export JSON structure - Google Docs) was updated to describe video download URL and +1 counts.
(mcdanlj) #68

That’s only a count of +1s. I asked to be sure; Alois said that he can’t scrape a complete list. This surprised me but he knows the limitations there better than I do. It can’t turn into “likes” because it’s not user records.

I don’t have room on my hard drive to refresh with 1.8.0 and download all the videos that were posted to groups I have already downloaded, so I’m probably not even going to try adding that to the importer.

It’s probably not hard to extend it; it’s just going to be one more case near image, images, and links.

I did implement a whitelist.json feature that, when used, imports G+ posts only by user IDs contained in the whitelist, and then imports all comments to those posts, except for comments posted by user IDs in the blacklist (if any). I did this for at least one community that was not only spam-choked but also had lots of off-topic content I didn’t want to import, yet had some quality content I did want to import, and it was typical that some users stayed on topic and others didn’t.
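
The filtering rule described above might be sketched in Ruby roughly as follows. This is not the importer’s code: the method names are hypothetical, and two behaviors are assumptions (with no whitelist, every non-blacklisted author’s posts are imported; a whitelisted author’s posts are imported even if that author is also blacklisted).

```ruby
require "set"

# Decide whether a top-level G+ post should be imported.
# With a whitelist, only whitelisted authors' posts are imported.
# ASSUMPTION: without a whitelist, anything not blacklisted is imported.
def import_post?(author_id, whitelist, blacklist)
  return whitelist.include?(author_id) unless whitelist.nil?
  !blacklist.include?(author_id)
end

# Comments on an imported post are kept unless the commenter is blacklisted.
def import_comment?(author_id, blacklist)
  !blacklist.include?(author_id)
end

whitelist = Set["u1", "u2"]
blacklist = Set["spammer"]
posts = [
  { author: "u1", comments: [{ author: "u3" }, { author: "spammer" }] },
  { author: "u3", comments: [{ author: "u1" }] },
]

# Keep whitelisted posts, then drop blacklisted comments from each.
kept = posts
  .select { |post| import_post?(post[:author], whitelist, blacklist) }
  .map do |post|
    post.merge(comments: post[:comments].select { |c| import_comment?(c[:author], blacklist) })
  end

puts kept.inspect
```

Note that the comment by u3 survives even though u3 is not whitelisted: the whitelist gates posts only, which matches the description above.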

(Andreas Dorfer) #69

A rather clunky way would be to generate a list of users “anonlike00” to “anonlike99” (assuming that no post in the batch has more than 99 likes) and use those “users”.
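
A rough Ruby sketch of that idea, purely illustrative: the method names are hypothetical, and actually creating users or likes in Discourse is left out entirely.

```ruby
# Build the pool of placeholder usernames, "anonlike00" .. "anonlike99".
def ghost_usernames(count = 100)
  (0...count).map { |i| format("anonlike%02d", i) }
end

# Pick which ghosts would "like" a post with the given +1 count.
# Raises if the pool is too small, rather than silently undercounting.
def ghosts_for(plus_ones, pool)
  raise ArgumentError, "need #{plus_ones} ghosts, have #{pool.size}" if plus_ones > pool.size
  pool.first(plus_ones)
end

pool = ghost_usernames
puts ghosts_for(3, pool).inspect
```

Always taking the first N ghosts keeps the result deterministic, so a re-run of the import would assign the same placeholder likes.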

(mcdanlj) #70

@adorfer While importing users, you could also find the highest like count and find or create your ghosts. It would make the import slower. I won’t be implementing it, in any case. It definitely doesn’t fit my goal of making the content look as if it had been originally authored in Discourse.

(mcdanlj) #71

Status update

I believe that I have finished all the development work I can see for this plugin, though I’m not promising to go away.

It is of relatively time-limited utility. Lacking automated tests, as mine does, import scripts will tend to rot when the rest of Discourse changes, so instead of being a useful pattern for others it would gradually and insensibly decay into (or, more into?) a counter-example for the next author of an importer. I don’t expect to be an ongoing active developer here, even though I’m grateful to those who are, so I’m unlikely to catch such decay as it happens. Therefore, I won’t submit a PR.

That said, I have signed the CLA, and I support anyone else who does feel that it would bring value to submit a PR, with or without a rebase.

Using this importer

For anyone running imports with this code, I’m sorry, but you’ll just have to read this massive thread. I’m not going to try to summarize it into a README. For better or worse, reading this wall of text is still much less work than actually running an import.

When all your imports are done, you’ll want to rebake. For example, I have seen that oneboxes didn’t show up until after a rebake.

If you are doing an offline import before making content live, a simple rebake is the easiest thing. However, if — like we had for https://forum.makerforums.info/ — you want to import posts into a live site, you probably would rather rebake as a background task. To do that, I recommend following this example:

Post.in_batches.update_all('baked_version = NULL')

When you do that, make sure that rebake_old_posts_count is set to a value that your server can support. It’s the number of posts rebaked every 15 minutes. I suggest starting small and watching sidekiq jobs to make sure you are not getting behind. (https://yoursite/sidekiq/ after logging into https://yoursite/ will show this.)

Automatically rebaking old posts starts with the newest posts and rebakes backwards in time, so the most immediately relevant content will rebake first.
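
To get a feel for what a given setting means, here is a small back-of-the-envelope Ruby sketch. It assumes only what is stated above (rebake_old_posts_count posts per 15-minute run); the post total and the candidate settings are made-up numbers, and rebake_hours is a hypothetical helper, not part of Discourse.

```ruby
# Estimate how long a background rebake takes when
# rebake_old_posts_count posts are rebaked every 15 minutes.
def rebake_hours(total_posts, rebake_old_posts_count)
  unless rebake_old_posts_count.positive?
    raise ArgumentError, "rebake_old_posts_count must be positive"
  end
  # Number of 15-minute runs needed, rounding a partial run up.
  runs = (total_posts.to_f / rebake_old_posts_count).ceil
  runs * 15 / 60.0
end

total = 100_000 # hypothetical number of imported posts
[100, 500, 2000].each do |count|
  puts format("%5d per run: about %.1f hours", count, rebake_hours(total, count))
end
```

The estimate ignores whether the server can actually keep up, which is why watching the sidekiq queue still matters.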

(Jay Pfaffman) #72

I ran it yesterday and modified it so that it’ll create the categories. I’ll push those changes when I get a chance.

From what I can tell, it looks pretty great! I’m waiting to hear back from the person whose group it is.

(Kamil) #73

@mcdanlj I’m trying to migrate a forum using a production version of Discourse. I have added all the scripts that were on your branch, but Ruby still throws missing-file errors.
Could you tell me what I should add to the production version before I run your script?

(mcdanlj) #74

@lapwingg The only file on my branch is the one importer script. It runs against the current beta and depends on a recent beta; you can see the base version by looking at my repository clone. More than that goes beyond what I know.

(Jay Pfaffman) #75

I did this on a production instance yesterday (from inside the container)

cp /data/friendsmegplus.rb script/import_scripts/friendsmegplus.rb
ruby ./script/import_scripts/friendsmegplus.rb /data//in.json /data//categories.json 

and it ran fine (after I added code to create the categories).

(mcdanlj) #76

I don’t want to auto-create categories for my imports, and it turns out that I’m going to import more after all. Did you make a class-level boolean to control whether to create them, or just make it always create them?

Did you have other changes?

(Jay Pfaffman) #77

It creates only the categories in the categories.json file, so I don’t think that it should break anything.

I think that’s all I’ve done so far. I haven’t heard back from the person I was doing this for about whatever I was going to do next.

I’ll try to have a look in a few hours.

(mcdanlj) #78

I want to protect myself from typos when editing piles of JSON, and I have made this mistake in practice. My eyes glaze over and I read what I thought I wrote. I’m more sensitive to this because all my real imports are into a live site, which isn’t the normal case.

I have also received a suggestion to be able to drop whole G+ categories, so I’ll probably start spitting out a default 'import': true in the JSON which can be changed to false to disable import of a whole category that collected spam or simply isn’t the target of the import.
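
A minimal Ruby sketch of how such a flag could be honored when reading the categories file. The "import" key comes from the description above; treating a missing key as true, the sample entries, and the name importable_categories are assumptions.

```ruby
require "json"

# Return only the category entries that should be imported.
# ASSUMPTION: a missing "import" key means true, so files written
# before the flag existed keep working; only an explicit false disables.
def importable_categories(categories)
  categories.select { |_id, entry| entry.fetch("import", true) != false }
end

# Made-up categories data in the shape of the JSON file.
categories = JSON.parse(<<~DATA)
  {
    "cat1": { "name": "General", "category": "general", "import": true },
    "cat2": { "name": "Spam magnet", "category": "", "import": false },
    "cat3": { "name": "Entry without the flag", "category": "projects" }
  }
DATA

puts importable_categories(categories).keys.inspect
```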

I added handling for the new video and videos tags added in 1.8.0 of the exporter.

I tried to upload .mp4 files exactly the same way that I’m successfully uploading images, but the upload seems not to succeed. I’m doing exactly what I would expect to work:


What displays is just “54932d87535f5e2be951af6cfd63692b.mp4” and not as a link. However, if after I do a test import, I edit the affected post and do an upload of the video through the UI, the video works correctly, though the way it’s oneboxed is not great.

This feels like a bug. ![description](upload://foo.mp4) does not work, nor does ![description](https://sitename/uploads/original/3X/0/1/0123456789.mp4) but a bare line https://sitename/uploads/original/3X/0/1/0123456789.mp4 does work and is what happens when you upload a video, unlike uploading an image. It’s ugly and requires me to provide a configuration setting for the site base URL for the import script. I even tried "<video width='100%' height='100%' controls><source src='#{upload.url}'><a href='#{upload.url}'>#{upload.url}</a></video>" which is how the video is rendered, hoping that would make the relative URL work, but no such luck. It just shows as a link.

So I’m really not super happy about where I ended up, but it’s better than leaving videos behind.
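
The workaround described above could be sketched in Ruby like this. It is not the importer’s code: UploadStub is a hypothetical stand-in for an upload record, upload_markdown and the base_url parameter are illustrative names, and only the .mp4 extension mentioned in this thread is treated as video. The update that follows reports that even the bare URL did not render reliably, so treat this as a record of the approach rather than a fix.

```ruby
# Hypothetical stand-in for an upload record; the real importer works
# with Discourse's own upload objects, which are not loaded here.
UploadStub = Struct.new(:url, :short_url, :original_filename)

VIDEO_EXTENSIONS = %w[.mp4].freeze # only the type discussed in this thread

# Build the Markdown for one upload. Images use the upload:// image
# syntax; videos get a bare absolute URL on a line of its own, which is
# why a site base URL setting is needed.
def upload_markdown(upload, base_url)
  if VIDEO_EXTENSIONS.include?(File.extname(upload.original_filename).downcase)
    "\n#{base_url.chomp('/')}#{upload.url}\n"
  else
    "![#{upload.original_filename}](#{upload.short_url})"
  end
end

video = UploadStub.new("/uploads/original/3X/0/1/0123456789.mp4", "upload://foo.mp4", "clip.mp4")
image = UploadStub.new("/uploads/original/3X/a/b/abcdef.png", "upload://bar.png", "photo.png")

puts upload_markdown(video, "https://sitename/")
puts upload_markdown(image, "https://sitename/")
```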

Update: The videos seem to work fine in preview, but not when viewing. Also, editing the URL to the same URL that works if I do an upload from the browser doesn’t fix it. I think I’m done trying. At least the files are uploaded, and maybe some day the URLs can be fixed. It’s not particularly awesome though. Makes me sad, because I thought I was going to be able to resurrect something like 500 videos uploaded to G+ groups that are going away. I guess there’s a tiny chance that it will work in prod but not dev, so I’ll test that with an import containing only a few videos and hope.

(mcdanlj) #79

My script doesn’t correctly handle a single piece of text that is, say, bold italic. If it’s bold with an italic partial string inside, it will probably be fine, but in general it doesn’t handle more than one format applied to the same substring. I’m not sure how many of those there really are, but I’m fixing that too.
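
One way to handle several formats on the same substring, sketched in Ruby under assumptions: the flag names ("bold", "italic", "strikethrough"), the flags-hash shape, and format_segment are all hypothetical and are not taken from the export format or from my branch.

```ruby
# Markdown delimiter for each format flag. Flag names are ASSUMPTIONS.
MARKERS = { "bold" => "**", "italic" => "_", "strikethrough" => "~~" }.freeze

# Wrap text in every delimiter whose flag is set. Leading and trailing
# whitespace is kept outside the delimiters, because Markdown emphasis
# is not recognized when the delimiters touch whitespace on the inside.
def format_segment(text, flags)
  active = MARKERS.select { |name, _marker| flags[name] }.values
  return text if active.empty? || text.strip.empty?

  lead = text[/\A\s*/]
  tail = text[/\s*\z/]
  # Close delimiters in reverse order so the nesting stays balanced.
  "#{lead}#{active.join}#{text.strip}#{active.reverse.join}#{tail}"
end

puts format_segment("bold italic ", { "bold" => true, "italic" => true })
puts format_segment("plain", {})
```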

(Jay Pfaffman) #80

Here’s read_categories that creates categories, if you want it.

  def read_categories
    @feeds.each do |feed|
      feed["accounts"].each do |account|
        account["communities"].each do |community|
          community["categories"].each do |category|
            if !@categories[category["id"]].present?
              # Create empty entries to write and fill in manually
              @categories[category["id"]] = {
                "name" => category["name"],
                "community" => community["name"],
                "category" => "",
                "parent" => nil,
                "tags" => [],
              }
            elsif !@categories[category["id"]]["community"].present?
              # Backfill the community name on entries that lack one
              @categories[category["id"]]["community"] = community["name"]
            end
          end
        end
      end
    end
  end

Maybe add a @create_categories = false at the top and add that to the if I added. . . .
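
A standalone Ruby sketch of what such a flag might look like. It does not touch Discourse: CategoryResolver is a hypothetical class, the list of existing names stands in for the site’s categories, and appending to that list stands in for actually creating a category.

```ruby
# Resolve category names for an import, creating missing ones only when
# explicitly allowed. Defaulting the flag to false means a typo in the
# JSON fails loudly instead of quietly creating a new category.
class CategoryResolver
  def initialize(existing_names, create_categories: false)
    @existing = existing_names.dup
    @create_categories = create_categories
  end

  # Return the category name to import into.
  def resolve(name)
    return name if @existing.include?(name)
    unless @create_categories
      raise ArgumentError, "unknown category #{name.inspect} (typo?)"
    end
    @existing << name # stand-in for really creating the category
    name
  end
end

strict = CategoryResolver.new(["General", "Projects"])
puts strict.resolve("Projects")
begin
  strict.resolve("Projcts")
rescue ArgumentError => e
  puts "refused: #{e.message}"
end

permissive = CategoryResolver.new(["General"], create_categories: true)
puts permissive.resolve("Projects")
```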

(mcdanlj) #81

@pfaffman I don’t see where that code creates the categories; that looks like the code I already have.
