[bounty] Google+ (private) communities: export screenscraper + importer

(mcdanlj) #62

@david Thanks for the new feature and for the heads-up. I hadn’t seen either change yet. (Have I mentioned enough how awesome this community is?)

Update: I have merged up to the commit we’re currently running in production, 4a00772c199b, which claims to be 2.3.0.beta4 but doesn’t match the tagged 2ee02cb6c7fe. (I just want to make sure my dev matches prod as closely as I can.) c93c9f17818c on my branch adjusts to UserAssociatedAccount and seems to be working in dev though I haven’t yet tried it in production.

(Andreas Dorfer) #63

A Readme would be helpful for those who do not start imports on a daily basis.
In case it’s similar to other importers, perhaps link to them/their respective forum threads.

(Irek) #64

@mcdanlj Kudos to you and your hard work!
A few days ago I’ve asked you (at g+) if it’s possible to customize a mapping between g+ posts and discourse topics and posts. Right now I’m already importing data from my community exported thanks to the “Google+ Exporter Tool”. My initial attempts to import were wrong because I was trying to import to a production version of a discourse instance and I got errors. It turned out that one have to install a development version of Discourse and copy all your migration tool files over Discourse project. This way it all finished with success and I have my community imported. And at this point I would like to ask another question. Is it possible to use your migration script on production version (this would be the fastest way)? Or should I use it like I mentioned on development version, then make a standard discourse backup and then restore this backup on my production instance of Discourse? I didn’t try the latter way yet, but I think I should assume production and development versions compatibility. I mean they both should be for example. 2.3.0beta5 (latest right now) to keep database schema compatibility.

(mcdanlj) #65

@irek Thanks for reporting success! That makes me happy to hear!

I use the script in development to do test imports, and then I copy the source data to the production server and run the same import with the script in production. I’m not doing a database dump/restore or a Discourse merge. People keep using makerforums and contributing during an import; the only downside I know of is that email gets turned off during the import, so people trying to set up a new account aren’t sent activation emails, and notification emails aren’t sent. I have a pinned post with details about that, and we have a summary in the site banner.

So I don’t know what problems you are experiencing in production; all I can say is that all of my imports into the production instance have been live imports using the script while the site is running.

Edit: Oh, if I remember correctly now you are importing into a Discourse.org-hosted forum. That is something that I have no experience with. :slight_smile:

(mcdanlj) #66

I had a bug in the importer that previously caused it to miss importing link elements in the data. See this bug report from Reddit for an example post that had the problem:

Update: My branch has what I hope is a fix for this and has also been updated past beta5 to keep up with the current version on makerforums. I found 8558 posts in my import data that were potentially affected, out of over 27K that had the link element (in the rest, it was redundant), and some of those 8558 were blacklisted as spam, so I had fewer to fix up. I have tested that fixed posts look right, and when rebaked they onebox properly.

I wrote a quick python script (sorry, Ruby devs) to identify affected G+ posts and write a JSON file of G+ IDs and missing URLs. I wrote a separate import script to apply those changes to a running instance. I’m hoping this didn’t actually affect @irek and I haven’t heard of anyone else actually doing an import yet. @irek if you need any of this, please PM me and I’ll give you what I have to fix it up.
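For Ruby devs following along, here is a minimal sketch of that fix-up identification pass, translated into Ruby. The field names (`posts`, `id`, `link`/`url`, and the shape of the export) are assumptions about the exporter JSON for illustration, not its documented structure; the actual script was in Python and is available from the author by PM.

```ruby
require 'json'

# Sketch: find exported G+ posts whose link URL never made it into the
# imported post body, and collect G+ ID => missing URL pairs.
# All field names here are illustrative assumptions.
def find_missing_links(export, imported_text_by_id)
  export['posts'].each_with_object({}) do |post, missing|
    url = post.dig('link', 'url')
    next unless url
    body = imported_text_by_id[post['id']]
    # Record the post only if the imported body never picked up the URL.
    missing[post['id']] = url if body && !body.include?(url)
  end
end

export = { 'posts' => [
  { 'id' => 'gplus-1', 'link' => { 'url' => 'https://example.com/a' } },
  { 'id' => 'gplus-2', 'link' => { 'url' => 'https://example.com/b' } }
] }
imported = { 'gplus-1' => 'text without the link',
             'gplus-2' => 'see https://example.com/b' }

# Writes a JSON file of affected G+ IDs and their missing URLs;
# here only gplus-1 is reported, since gplus-2 already contains its URL.
puts JSON.pretty_generate(find_missing_links(export, imported))
```

A second script would then read that JSON and apply the URLs to the matching posts on the running instance.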

(Andreas Dorfer) #67

It looks like the data structure has been changed a bit?

UPDATE version 1.8.0 / 15-3-2019 :

  • NEW uploaded videos are detected in downloaded posts and available for download.
  • NEW count of +1s is downloaded for posts and comments.
  • NEW 2 times faster communication to Google+ servers means faster posts and comments download.
  • FIX JSON structure description was updated Google+ Export JSON structure - Google Docs to describe video download URL and +1 counts.

(mcdanlj) #68

That’s only a count of +1s. I asked to be sure; Alois said that he can’t scrape a complete list. This surprised me, but he knows the limitations there better than I do. It can’t be turned into “likes” because there are no user records.

I don’t have room on my hard drive to refresh with 1.8.0 and download all the videos that were posted to groups I have already downloaded, so I’m probably not even going to try adding that to the importer.

It’s probably not hard to extend it; it’s just going to be one more case near image, images, and links.
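To illustrate what “one more case” might look like, here is a hedged sketch of converting an exported post to Markdown with a video case added alongside the existing ones. The `message`, `image`, `link`, and `video` keys and their sub-fields are guesses for illustration, not the exporter’s documented schema, and this is not the actual importer code.

```ruby
# Sketch: build Markdown from one exported G+ post, with a hypothetical
# "video" element handled the same way as the existing link case.
# A bare URL on its own line oneboxes in Discourse after a rebake.
def activity_to_markdown(post)
  parts = [post['message'].to_s]
  parts << "![](#{post['image']['url']})" if post['image']
  parts << post.dig('link', 'url') if post['link']
  # The new case: emit the downloaded video's URL as its own paragraph.
  parts << post.dig('video', 'url') if post['video']
  parts.reject(&:empty?).join("\n\n")
end
```

The real work would be uploading the downloaded video file to Discourse and emitting its upload URL, which this sketch skips.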

I did implement a whitelist.json feature that, when used, imports G+ posts only by user IDs contained in the whitelist, and then imports all comments to those posts, except for comments posted by user IDs in the blacklist (if any). I did this for at least one community that was not only spam-choked but also had lots of off-topic content I didn’t want to import, yet had some quality content I did want to import, and it was typical that some users stayed on topic and others didn’t.
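The selection rule described above can be sketched in a few lines. This is an illustrative reimplementation of the behavior, not the importer’s actual code, and the data shapes (`:author_id`, `:comments`) are invented for the example.

```ruby
# Sketch of the whitelist.json behavior: keep only posts by whitelisted
# author IDs, then keep all comments on those posts except ones written
# by blacklisted author IDs.
def filter_posts(posts, whitelist, blacklist)
  posts.select { |p| whitelist.include?(p[:author_id]) }
       .map do |p|
         kept = p[:comments].reject { |c| blacklist.include?(c[:author_id]) }
         p.merge(comments: kept)
       end
end
```

This matches the described semantics: whitelisting gates which topics are created, while blacklisting only prunes individual comments.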

(Andreas Dorfer) #69

A rather clunky way would be to generate a list of users “anonlike00” to “anonlike99” (assuming that no post has more than 99 likes in the batch) and use those “users”.

(mcdanlj) #70

@adorfer while importing users, you could also find the highest like count and find/create your ghosts. It would make the import slower. I won’t be implementing it, in any case. It definitely doesn’t fit my goal of making the content look as if it had been originally authored in Discourse.

(mcdanlj) #71

Status update

I believe that I have finished all the development work I can see for this plugin, though I’m not promising to go away.

It is of relatively time-limited utility. Import scripts that lack automated tests, as mine does, tend to rot as the rest of Discourse changes, so instead of being a useful pattern for others it would gradually and insensibly decay into (or further into?) a counter-example for the next author of an importer. I don’t expect to be an ongoing active developer here, even though I’m grateful to those who are, so I’m unlikely to catch such decay as it happens. Therefore, I won’t submit a PR.

That said, I have signed the CLA, and I support anyone else who does feel that it would bring value to submit a PR, with or without a rebase.

Using this importer

For anyone running imports with this code, I’m sorry, but you’ll just have to read this massive thread. I’m not going to try to summarize it into a README. For better or worse, reading this wall of text is still much less work than actually running an import.

When all your imports are done, you’ll want to rebake. For example, I have seen that oneboxes didn’t show up until after a rebake.

If you are doing an offline import before making content live, a simple rebake is the easiest thing. However, if you want to import posts into a live site, as we did for https://forum.makerforums.info/, you probably would rather rebake as a background task. To do that, I recommend following this example:

Post.in_batches.update_all('baked_version = NULL')

When you do that, make sure that rebake_old_posts_count is set to a value that your server can support. It’s the number of posts rebaked every 15 minutes. I suggest starting small and watching sidekiq jobs to make sure you are not getting behind. (https://yoursite/sidekiq/ after logging into https://yoursite/ will show this.)
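As a back-of-envelope check before picking a value, you can estimate how long the background rebake will take, assuming (as described above) that rebake_old_posts_count posts are rebaked every 15 minutes:

```ruby
# Estimate total background-rebake time in hours, given the number of
# posts with baked_version cleared and the rebake_old_posts_count setting
# (posts rebaked per 15-minute cycle).
def rebake_eta_hours(total_posts, rebake_old_posts_count)
  cycles = (total_posts.to_f / rebake_old_posts_count).ceil
  cycles * 15 / 60.0
end

puts rebake_eta_hours(27_000, 250)  # 108 cycles of 15 minutes => 27.0 hours
```

If the estimate is too long for your taste, raise the setting gradually while watching sidekiq, rather than jumping straight to a large value.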

Automatically rebaking old posts starts with the newest posts and rebakes backwards in time, so the most immediately relevant content will rebake first.

(Jay Pfaffman) #72

I ran it yesterday and modified it so that it’ll create the categories. I’ll push those changes when I get a chance.

From what I can tell, it looks pretty great! I’m waiting to hear back from the person whose group it is.

(Kamil) #73

@mcdanlj I’m trying to migrate a forum using the production version of Discourse. I have added all the scripts that were on your branch, but Ruby still throws missing-file errors.
Could you tell me what I should add to the production version before I run your script?

(mcdanlj) #74

@lapwingg the only file on my branch is the one importer script. It runs against the current beta; you can see the base version by looking at my repository clone. It depends on a recent beta. More than that goes beyond what I know.

(Jay Pfaffman) #75

I did this on a production instance yesterday (from inside the container)

cp /data/friendsmegplus.rb script/import_scripts/friendsmegplus.rb
ruby ./script/import_scripts/friendsmegplus.rb /data//in.json /data//categories.json 

and it ran fine (after I added code to create the categories).