Migration of Google Groups to Discourse

import

(Marcus Baw) #1

Here is a ‘working prototype’ of a google group migration script that attempts to be more ‘all-in-one’ and reduce the number of steps, complexity and difficulty of doing google group imports.

https://github.com/pacharanero/discourse/blob/master/script/import_scripts/googlegroups.rb

Having finally gotten around to doing a pro-bono google group import I promised to do for the wonderful Valentina Project, I had to re-familiarise myself with the available google group scraping tools, as suggested to me by @erlend_sh, who pointed me in the direction of a few google group scraping libraries and Discourse’s own mbox importer script. So while it’s all in my head I thought I’d have a go at a more user-friendly import script, and some documentation for it.

Thanks to @steinwaywhw for his work, both in the linked topic and in An Importer for Google Groups, without which I wouldn’t have been able to do this.

#Notes

  • I think I may have solved the problem of email addresses being redacted by Google Groups (it doesn’t happen if you are logged into Google Groups with a ‘Manager’ level account), so the mbox.rb importer can now create users in Discourse with the correct email addresses. The rightful owners of those users would then only need to do a password reset to log into their new account on Discourse, and all their Google Group posts would be correctly assigned to them. I’ve tested this on a real migration and it works; I’m keen to get feedback from other testers.
  • I’ve tried to use the OO design pattern established in the other import scripts, for example ImportScripts::Mbox subclasses ImportScripts::Base. In my case, I wanted to use a lot of functionality from ImportScripts::Mbox, so my script is a subclass of ImportScripts::Mbox.
  • It works… but it’s definitely not finished yet and I’d appreciate constructive criticism, pull requests and amusing emoji.
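
The inheritance chain described in the notes can be sketched roughly as below. This is only an illustration: the Base and Mbox classes here are stand-in stubs rather than the real Discourse classes, and scrape_group and steps are hypothetical names. Only the class names and the perform/execute methods appear elsewhere in this topic.

```ruby
# Stand-in stubs so the sketch runs on its own; the real classes live in
# Discourse's script/import_scripts directory and do far more than this.
module ImportScripts
  class Base
    # In this stub, perform simply calls execute.
    def perform
      execute
    end
  end

  class Mbox < Base
    def execute
      steps << :import_mbox   # placeholder for indexing and importing mbox files
    end

    # Hypothetical helper that records which steps ran, for demonstration only.
    def steps
      @steps ||= []
    end
  end

  # The Google Groups importer reuses everything in Mbox and puts a
  # scraping step in front of it.
  class GoogleGroups < Mbox
    def initialize(group_name)
      @group_name = group_name
    end

    def execute
      scrape_group            # hypothetical name for the download step
      super                   # hand over to the inherited mbox import
    end

    private

    def scrape_group
      steps << :scrape
    end
  end
end

importer = ImportScripts::GoogleGroups.new("name-of-your-google-group")
importer.perform
p importer.steps
```

Subclassing Mbox rather than Base means the scraper only has to add the download step and can leave user and post creation to the inherited code.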

#How to use it

You will need to be a little bit familiar with the Linux command line, SSH and stuff like that. I’ve tried to make the step-by-step instructions as clear as possible, but there might be slight variations in the output of certain commands. Please reply to the thread if you are having problems.

  1. Cookies. In order to extract users’ email addresses correctly from the Google Group, you will need Manager access to the Google Group. Having logged into Google Groups (on your normal computer) with this Manager account, export the Google cookies from your browser. (I used the cookies.txt Chrome extension to get the cookies.txt file.) Without this step, the scrape will work BUT the email addresses are truncated/redacted by Google Groups so they look like this: marcu....@gmail.com, and of course this messes up creation of new users on Discourse.

  2. Upload cookies.txt. Once you have the cookies.txt file, the easiest way to get it into your Docker container from your computer is to upload it as an attachment to any post in your Discourse forum. You will need the file path for the next step; you can get the URL from the post. It will be something like: /uploads/default/original/1X/245aa0cdc6847cf59647e1c7102e253e99d40b69.txt

  3. SSH into your server
    $ ssh user@your-discourse-server

  4. Change directory into the Discourse directory
    $ cd /var/discourse

  5. Enter the Discourse Docker container, using the ./launcher tool.
    $ ./launcher enter app

  6. Copy cookies.txt to /tmp/ so that the import script can find it. Prepend /var/www/discourse/public to the URL from the previous step, this gives you the full file path, to use with the cp (unix copy) command:
    # cp '/var/www/discourse/public/uploads/default/original/1X/245aa......40b69.txt' /tmp/

  7. Install some stuff that’s needed by mbox.rb, the importer script, for its index
    # apt install sqlite3 libsqlite3-dev
    # gem install sqlite3

  8. Change into the import scripts directory with the cd command.
    # cd /var/www/discourse/script/import_scripts

  9. Get the google group script (and the monkeypatched version of mbox.rb it requires)
    # wget https://raw.githubusercontent.com/pacharanero/discourse/master/script/import_scripts/googlegroups.rb
    # wget https://raw.githubusercontent.com/pacharanero/discourse/master/script/import_scripts/mbox.rb

  10. Change user to the discourse user so that you can make changes to the database
    # su discourse

  11. Run the script!
    $ ruby googlegroups.rb name-of-your-google-group
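
Step 6 is easy to get wrong by hand, so here is a small sketch of how the path inside the container is put together from the upload URL. The helper name upload_path is made up for illustration; the prefix and the example URL are the ones given in the steps above.

```ruby
# Directory inside the container that the upload URL is relative to (step 6).
PUBLIC_DIR = "/var/www/discourse/public"

# Hypothetical helper: turn the URL shown on the post into a path usable with cp.
def upload_path(upload_url)
  # File.join collapses the doubled slash between the two parts.
  File.join(PUBLIC_DIR, upload_url)
end

url = "/uploads/default/original/1X/245aa0cdc6847cf59647e1c7102e253e99d40b69.txt"
puts upload_path(url)
```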

Any problems, please feel free to discuss in the replies.

Marcus


(Marcus Baw) #2

By way of illustration, this is what an original topic looks like in Google Groups:

And this is the unmodified appearance of the same post, imported/migrated into Discourse using the script ImportScripts::GoogleGroups and the monkey-patched ImportScripts::Mbox as described above.

M


(Steinway Wu) #3

Hi @pacharanero, thanks for this easier script! Just one question, how did you make the cookie work? I tried to use a manager cookie with both wget and curl, but it never worked. It only works in my browser. For uploading cookies, may I suggest using scp? Uploading a cookie as attachment may be insecure. Thanks again.


(Marcus Baw) #4

Hi @steinwaywhw, thanks and it’s no problem.

I used the cookies.txt extension (I think the same one that you used?) and in my testing I did indeed use scp to copy over into /tmp/ on the cloud server. But then you have to copy this cookie into /tmp/ on the Docker container, using docker ps to get the Docker container ID and then docker cp to copy the cookie from the server.

But for the purposes of the HOWTO above, I thought that it would be easiest to tell people to upload it as an attachment to a post (which can be a private message), because it was simpler and one less step. I don’t get how it could be insecure though? If it’s an attachment to a private message, then only the recipients of the private message (me) and people with access to the Docker container (me) can get that cookie file. And in transport it’s secure if we’re all using (I hope) HTTPS :slight_smile:

I’m not sure why the cookie.txt method didn’t work for you, I did 2 migrations over the weekend (2 different Google Groups into 2 different Discourses) with the correct Manager cookie.txt for each and they both came over with full email addresses, meaning the Users are created correctly in Discourse and can be accessed by someone in possession of that email address, simply by doing a password reset.

Nice work on the pattern matching email fixer workaround, though. It’s entirely possible that Google would, in the future, change something so that migrating with the cookie wouldn’t work. Nice to know we’d have an option if that happened!
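
For anyone wondering what a pattern-matching fix for redacted addresses might look like, here is a rough sketch. It is not @steinwaywhw’s code: the method names and the list of known addresses are invented for illustration, and it assumes the redacted form keeps the start of the local part, then a run of three or more dots, then the full domain (the shape of marcu....@gmail.com).

```ruby
# Matches a visible prefix, three or more dots, then "@" and the domain.
REDACTED = /\A(?<prefix>[^@]*?)\.{3,}@(?<domain>.+)\z/

def redacted?(address)
  REDACTED.match?(address)
end

# Return the single known address that fits the redacted one. Addresses that
# are not redacted come back unchanged; no match or several matches give nil.
def restore(address, known_addresses)
  m = REDACTED.match(address)
  return address unless m

  candidates = known_addresses.select do |known|
    local, domain = known.split("@", 2)
    domain == m[:domain] && local.start_with?(m[:prefix])
  end
  candidates.size == 1 ? candidates.first : nil
end

# Made-up addresses for the demonstration.
known = ["marcus@example.org", "someone@example.org"]
p restore("marcu....@example.org", known)
```

Returning nil for ambiguous matches is a deliberate choice in this sketch: assigning posts to the wrong user seems worse than leaving an address unresolved.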

Marcus


(Steinway Wu) #5

Hi @pacharanero, I finally found time and tried it on my own machine, and I found several problems (not yours though :slight_smile:, but I think it is crucial to point them out so that others won’t run into the same problems I had).

  1. The cookie trick still doesn’t work for me. I really don’t know why. So for myself, I have to add back the cross match of user emails. I think it may depend on settings on the Google Group side.
  2. The topic/reply wiring logic in mbox.rb has some problems. In short, it fails to identify all topics and results in incomplete imports.

For #2, I detail the problem here and give a working solution.

In create_email_indices, email_date is directly extracted from email thread, a string in this case, and saved to SQLite. The string is in a format that SQLite doesn’t really understand. It causes sorting problems later in message_indices. Here I used DateTime.parse(email_date).to_s to patch it.
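
To illustrate why the raw date strings sort badly and what the DateTime.parse(email_date).to_s patch changes, here is a minimal sketch. The sample dates are made up, and it works on a plain array rather than the importer’s SQLite table.

```ruby
require "date"

# Raw Date headers as they might appear in mbox files (made-up examples).
raw_dates = [
  "Tue, 21 Mar 2017 22:27:52 +0000",
  "Mon, 2 Jan 2017 09:15:00 +0000",
  "Fri, 10 Feb 2017 18:00:00 +0000",
]

# Sorting the raw text compares the weekday names first, not the time.
p raw_dates.sort.first

# DateTime.parse(...).to_s yields ISO 8601 text, which sorts chronologically
# as plain text when all values carry the same offset.
normalized = raw_dates.map { |d| DateTime.parse(d).to_s }
p normalized.sort.first
```

If the messages carry different time zone offsets, converting to UTC first with DateTime.parse(d).new_offset(0).to_s should keep the text ordering chronological; that step is an addition in this sketch, not part of the patch described above.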

In message_indices, the way it identifies “topics” and “replies” is not reliable. Looking at the raw mbox files, I see that some “topics” do have a “reply_to” field, while some of the “replies” are missing the “reply_to” field, or reply to another “reply” instead of the original “topic”. This results in an incomplete import: I have 800+ topics in the Google Group, but I only see 60+ topics in Discourse after import. So what I came up with is to group all records by title and sort by date, then choose the first one as the topic, reset its “reply_to” to null, and set all the others’ “reply_to” to this topic’s msg_id. This way, I’m able to import all topics correctly.
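
The regrouping idea can be sketched in a few lines. This is not the linked snippet: it works on an in-memory array of made-up rows instead of the SQLite index, the field names msg_id, title, email_date and reply_to simply follow the wording above, and rewire is a hypothetical helper name.

```ruby
require "date"

# Made-up index rows; the real importer keeps these in SQLite.
records = [
  { msg_id: "b@x", title: "Install help", email_date: "2017-01-03T10:00:00+00:00", reply_to: "zzz@x" },
  { msg_id: "a@x", title: "Install help", email_date: "2017-01-02T09:00:00+00:00", reply_to: "old@x" },
  { msg_id: "c@x", title: "Release plan", email_date: "2017-02-01T12:00:00+00:00", reply_to: nil },
  { msg_id: "d@x", title: "Install help", email_date: "2017-01-05T08:00:00+00:00", reply_to: nil },
]

# Group by title, sort each group by date, treat the earliest message as the
# topic and point every later message at it.
def rewire(records)
  records.group_by { |r| r[:title] }.flat_map do |_title, group|
    topic, *replies = group.sort_by { |r| DateTime.parse(r[:email_date]) }
    [topic.merge(reply_to: nil)] +
      replies.map { |r| r.merge(reply_to: topic[:msg_id]) }
  end
end

rewire(records).each { |r| puts "#{r[:msg_id]} -> #{r[:reply_to].inspect}" }
```

One trade-off of grouping purely by title is that unrelated topics sharing a title end up merged, and replies whose subject gained a prefix such as "Re:" would need their titles normalised first; neither case is handled in this sketch.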

Also, the SPLIT_AT constant is discarded. I don’t know what it does, but in my google group case, I think it is unnecessary. I disabled it.

Here is the code I use, Google Group => Discourse Importer - Ruby Snippet - glot.io. Before using, you need to comment out the very last line of mbox.rb.


#6

Hey @pacharanero , just got this mostly working! There’s one issue I’m running into which is that I’m getting some kind of weird encoding error for some of my posts. Here’s what the error message looks like. It looks like some kind of encoding issue since some = signs are getting replaced by =3D.

WARNING: Could not parse (and so ignoring) '<p class=3D"MsoNormal" style=3D"margin-bottom:0in;margin-bottom:.0001pt;tex='
WARNING: Could not parse (and so ignoring) '<p class=3D"MsoNormal" style=3D"margin-bottom:0in;margin-bottom:.0001pt;tex='
WARNING: Could not parse (and so ignoring) '<p class=3D"MsoNormal" style=3D"margin-bottom:0in;

Here’s the log:

/.rvm/gems/ruby-2.0.0-p648/gems/mail-2.6.4/lib/mail/utilities.rb:239:in `to_crlf': Interrupt
/.rvm/gems/ruby-2.0.0-p648/gems/mail-2.6.4/lib/mail/message.rb:1998:in `raw_source='
/.rvm/gems/ruby-2.0.0-p648/gems/mail-2.6.4/lib/mail/message.rb:2121:in `init_with_string'
/.rvm/gems/ruby-2.0.0-p648/gems/mail-2.6.4/lib/mail/message.rb:129:in `initialize'
/.rvm/gems/ruby-2.0.0-p648/gems/mail-2.6.4/lib/mail/mail.rb:51:in `new'
/.rvm/gems/ruby-2.0.0-p648/gems/mail-2.6.4/lib/mail/mail.rb:51:in `new'
/.rvm/gems/ruby-2.0.0-p648/gems/mail-2.6.4/lib/mail/mail.rb:188:in `read_from_string'
/sites/discourse/script/import_scripts/mbox.rb:86:in `block in all_messages'
/sites/discourse/script/import_scripts/mbox.rb:68:in `each'
/sites/discourse/script/import_scripts/mbox.rb:68:in `each_with_index'
/sites/discourse/script/import_scripts/mbox.rb:68:in `all_messages'
/sites/discourse/script/import_scripts/mbox.rb:190:in `create_email_indices'
/sites/discourse/script/import_scripts/mbox.rb:23:in `execute'
  from googlegroups.rb:49:in `execute'
/sites/discourse/script/import_scripts/base.rb:45:in `perform'
  from googlegroups.rb:92:in `<main>'

(Mittineague) #7

Because “3D” is the hex for the equal sign I’m guessing that is part of the problem

AFAIK “MsoNormal” is from Microsoft Word.
(great for print documents, bad for code)

Not much help I know, but it will hopefully give you a clue about where to look.

eg. Maybe the Word curly quotes are getting removed, the equals are getting changed to its hex value, and the result is a messed up parse.
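
For reference, =3D is how quoted-printable encoding writes a literal equals sign, and an = at the very end of a line is a soft line break, which fits the shape of the warnings above. A minimal sketch of decoding such text with Ruby’s built-in unpack("M") directive follows; the sample string is made up, and whether the importer should decode at this point has not been checked here.

```ruby
# Quoted-printable text: "=3D" stands for "=", and "=" at the end of a line
# is a soft line break that joins the next line.
encoded = "<p class=3D\"MsoNormal\">Hello wor=\nld</p>"

# The "M" directive of String#unpack decodes quoted-printable.
decoded = encoded.unpack("M").first
puts decoded
```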


(Marcus Baw) #8

@spacewaffle I think I had a similar error for just one post out of the large test GG import I did with this script. The mbox importer reported the error in the command line output, but it then continued the import to completion.

I’m not sure I have any ‘proper’ advice for how to fix this error, I agree with @Mittineague it does sound like a character encoding thing though. A bit of a hack, but can you identify the 3 culprit .mbox files with grep or something, and edit them manually so the parse works?
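
If grep isn’t to hand, the same search can be sketched in Ruby. The helper name files_with_marker is made up, and the directory to scan is an assumption: point it at wherever the scraped message files ended up.

```ruby
# List files under a directory whose contents include the "=3D" marker.
def files_with_marker(dir, marker = "=3D")
  Dir.glob(File.join(dir, "**", "*")).select do |path|
    # binread avoids encoding errors on files with mixed character sets.
    File.file?(path) && File.binread(path).include?(marker)
  end.sort
end

# Usage: ruby find_marker.rb path/to/downloaded/files
puts files_with_marker(ARGV[0]) if ARGV[0]
```

Many perfectly valid emails contain =3D, so this narrows the search rather than pinpointing the broken files.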

M


(Deferred Procrastination) #9

Cool, I take it this moved to: GitHub - pacharanero/google_group.to_discourse: Import script from a private Google Group into a Discourse forum


#10

Thanks so much for putting this together!

Unfortunately I’m hitting an error before anything meaningful happens.
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/activesupport-4.2.7.1/lib/active_support/dependencies.rb:274:in 'require': No such file to load -- ./lib/onebox/discourse_onebox_sanitize_config (LoadError)

Any ideas? I read through the script and I noticed it isn’t creating the directory /tmp/google-group-crawler/

(My knowledge of discourse & ruby / gems is pretty limited :grimacing:)

Partial fix: I downgraded Discourse from latest-tests to 1.7.2 and then it worked!


(Marcus Baw) #11

@spencermcc glad it worked for you. I do know that the Discourse import scripts stuff has changed since I wrote the scraper. I keep meaning to update it, but as ever, time is the limiting factor.

What I’d like to do is refactor the whole thing so that it would be able to sit in the discourse standard importer ‘library’, at which point I could submit a PR and donate the whole thing to Discourse Core. Unfortunately, it required a patched/amended mbox.rb in order to work, and although mbox.rb has changed since I did the original script, I haven’t yet taken this change into consideration.

As with the Discourse Team and their prioritisation of implementation of new features, if there is anyone out there wanting paid Google Groups export/import to Discourse, then become a ‘customer’ and I will update the script :slight_smile:

M


(Rafael dos Santos Silva) #13

If you do this on a development machine that can’t send e-mails (the default) they won’t get any notifications.


(Marcus Baw) #15

The google_group.to_discourse script currently doesn’t work with the latest version of Discourse - I haven’t had time to update it since mbox.rb in Discourse was changed.


(blaumeer) #17

As a workaround, I would install an older version of Discourse to do the import, and then switch to the latest.


(Marcus Baw) #19

I’ve never actually had to downgrade to a previous version of Discourse myself so I’m not sure how to do it. My only concern about downgrading an install of Discourse ‘latest’ back to a previous version is that some database migrations might not work properly going backwards. The only way to be sure is, I guess, to try it.

Sorry I can’t give a timeline for updating the script for Discourse latest. Someone did contact me this week regarding some paid work doing a GG=>Discourse migration, if that work comes off then I will have to update the script. I will always publish my current latest version in the public repo on GH, so you’ll get what I am using.

M


(Russ Ennis) #23

For step 2, using ‘docker cp cookies.txt app:/cookies.txt’ might be easier than going via an attachment.


(Marcus Baw) #24

@Russ_Ennis that’s true, you could do this. You would still have to get the cookies.txt onto that server from your development machine, so it would actually be a 2-step process, something like:
(performed in the context of the Discourse server, but outside the Docker container)

scp marcus@mydevlaptop:/cookies.txt /cookies.txt     #copy file to server
docker cp /cookies.txt app:/cookies.txt              #copy file into Docker container

I decided for my howto that using an attachment is slightly faster, as well as more accessible for those less comfortable with unixy command line tools. But feel free to use whichever suits you best.

Marcus


(Russ Ennis) #25

Fair enough. I looked at doing the attachment route first, but fell back to cmdline when it rejected my attempt to attach a non-{gif,jpg,png} file.


(Marcus Baw) #26

Ah yes! @Russ_Ennis you make an excellent point here which is that for uploading the cookies.txt via an attachment to work, you need to have enabled that file type in the Site Settings. The relevant setting is at http://YourSiteBaseURL/admin/site_settings/category/all_results?filter=authorized.

So my ‘1 step process’ may not in fact be as simple as I thought. Thanks for pointing this out. I think I’ll add your way of doing this to the Howto next time I revise it.

Marcus


#27

Thank you for this excellent work, unfortunately I am running into a problem where there doesn’t seem to be any data:

root@ubuntu-test-discourse4:/var/discourse# ./launcher enter app
root@ubuntu-test-discourse4-app:/var/www/discourse# cp /var/www/discourse/public/uploads/default/original/1X/5189a928f732b1ca761a61c14be8998ffa6e871e.txt /tmp/cookies.txt
root@ubuntu-test-discourse4-app:/var/www/discourse# apt install sqlite3 libsqlite3-dev
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Suggested packages:
  sqlite3-doc
The following NEW packages will be installed:
  libsqlite3-dev sqlite3
0 upgraded, 2 newly installed, 0 to remove and 0 not upgraded.
Need to get 1,023 kB of archives.
After this operation, 3,637 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu xenial/main amd64 libsqlite3-dev amd64 3.11.0-1ubuntu1 [508 kB]
Get:2 http://archive.ubuntu.com/ubuntu xenial/main amd64 sqlite3 amd64 3.11.0-1ubuntu1 [515 kB]
Fetched 1,023 kB in 0s (1,488 kB/s)
Selecting previously unselected package libsqlite3-dev:amd64.
(Reading database ... 34533 files and directories currently installed.)
Preparing to unpack .../libsqlite3-dev_3.11.0-1ubuntu1_amd64.deb ...
Unpacking libsqlite3-dev:amd64 (3.11.0-1ubuntu1) ...
Selecting previously unselected package sqlite3.
Preparing to unpack .../sqlite3_3.11.0-1ubuntu1_amd64.deb ...
Unpacking sqlite3 (3.11.0-1ubuntu1) ...
Setting up libsqlite3-dev:amd64 (3.11.0-1ubuntu1) ...
Setting up sqlite3 (3.11.0-1ubuntu1) ...
root@ubuntu-test-discourse4-app:/var/www/discourse# gem install sqlite3
Fetching: sqlite3-1.3.13.gem (100%)
Building native extensions.  This could take a while...
Successfully installed sqlite3-1.3.13
1 gem installed
root@ubuntu-test-discourse4-app:/var/www/discourse# cd /var/www/discourse/script/import_scripts
root@ubuntu-test-discourse4-app:/var/www/discourse/script/import_scripts# git clone https://github.com/pacharanero/google_group.to_discourse.git
Cloning into 'google_group.to_discourse'...
remote: Counting objects: 165, done.
remote: Total 165 (delta 0), reused 0 (delta 0), pack-reused 165
Receiving objects: 100% (165/165), 55.12 KiB | 0 bytes/s, done.
Resolving deltas: 100% (89/89), done.
Checking connectivity... done.
root@ubuntu-test-discourse4-app:/var/www/discourse/script/import_scripts# mv google_group.to_discourse/* .
root@ubuntu-test-discourse4-app:/var/www/discourse/script/import_scripts# su discourse
discourse@ubuntu-test-discourse4-app:/var/www/discourse/script/import_scripts$ ruby googlegroups.rb test-discourse-migration
loading existing groups...
loading existing users...
loading existing categories...
loading existing posts...
loading existing topics...
######## Your Google Group name is test-discourse-migration
######## SoI'm expecting the Google Group URL to be https://groups.google.com/forum/#!forum/test-discourse-migration
######## Clone the Google Group Crawler from icy
Cloning into '/tmp/google-group-crawler'...
remote: Counting objects: 258, done.
remote: Total 258 (delta 0), reused 0 (delta 0), pack-reused 258
Receiving objects: 100% (258/258), 59.12 KiB | 0 bytes/s, done.
Resolving deltas: 100% (98/98), done.
Checking connectivity... done.
######## Start the first pass collection of topics
mkdir: created directory './test-discourse-migration'
mkdir: created directory './test-discourse-migration//threads/'
mkdir: created directory './test-discourse-migration//msgs/'
mkdir: created directory './test-discourse-migration//mbox/'
:: Creating './test-discourse-migration//threads/t.0' with 'forum/test-discourse-migration'
:: Fetching data from 'https://groups.google.com/forum/?_escaped_fragment_=forum/test-discourse-migration'...
--2017-03-21 22:27:52--  https://groups.google.com/forum/?_escaped_fragment_=forum/test-discourse-migration
Resolving groups.google.com (groups.google.com)... 209.85.144.102, 209.85.144.139, 209.85.144.101, ...
Connecting to groups.google.com (groups.google.com)|209.85.144.102|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘STDOUT’

-                                  [ <=>                                               ]     590  --.-KB/s    in 0s      

2017-03-21 22:27:52 (47.2 MB/s) - written to stdout [590]

######## Iterate through topics to get messages

creating indices
0 records couldn't be associated with parents
0 replies to wire up

importing users
/var/www/discourse/script/import_scripts/base.rb:223:in `all_records_exist?': private method `exec' called for nil:NilClass (NoMethodError)
	from /var/www/discourse/script/import_scripts/mbox.rb:279:in `block in import_users'
	from /var/www/discourse/script/import_scripts/base.rb:801:in `block in batches'
	from /var/www/discourse/script/import_scripts/base.rb:800:in `loop'
	from /var/www/discourse/script/import_scripts/base.rb:800:in `batches'
	from /var/www/discourse/script/import_scripts/mbox.rb:276:in `import_users'
	from /var/www/discourse/script/import_scripts/mbox.rb:26:in `execute'
	from googlegroups.rb:49:in `execute'
	from /var/www/discourse/script/import_scripts/base.rb:45:in `perform'
	from googlegroups.rb:91:in `<main>'
discourse@ubuntu-test-discourse4-app:/var/www/discourse/script/import_scripts$ 

I dug down into the code and eventually worked out that I need a URL like this to get the raw MBOX:

https://groups.google.com/forum/message/raw?msg=test-discourse-migration-2/cat-1/3HVQZHxSiek

The problem is that even when I just navigate to this in my web browser while I am authenticated in the group, the request returns a blank file, there literally is no data being returned by Google!
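
Because the import only fails later, with a NoMethodError that doesn’t mention the missing data, it may help to check the download before importing. Here is a small sketch that counts empty and non-empty files in a directory; the helper name download_summary is made up, and the mbox subdirectory in the usage line is taken from the mkdir output in the log above.

```ruby
# Summarise downloaded files: how many have content and how many are empty.
def download_summary(dir)
  files = Dir.glob(File.join(dir, "*")).select { |p| File.file?(p) }
  empty, filled = files.partition { |p| File.zero?(p) }
  { total: files.size, empty: empty.size, filled: filled.size }
end

# Usage: ruby check_download.rb ./test-discourse-migration/mbox
if ARGV[0]
  summary = download_summary(ARGV[0])
  puts summary
  warn "Nothing usable was downloaded, so the import would have no records." if summary[:filled].zero?
end
```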

Have Google killed this? Or have I done something wrong somewhere?

Cheers
Kel