Migration of Google Groups to Discourse

import
#6

Hey @pacharanero, just got this mostly working! There’s one issue I’m running into: I’m getting a weird encoding error for some of my posts. Here’s what the error message looks like. It looks like an encoding issue, since some = signs are getting replaced by =3D.

WARNING: Could not parse (and so ignoring) '<p class=3D"MsoNormal" style=3D"margin-bottom:0in;margin-bottom:.0001pt;tex='
WARNING: Could not parse (and so ignoring) '<p class=3D"MsoNormal" style=3D"margin-bottom:0in;margin-bottom:.0001pt;tex='
WARNING: Could not parse (and so ignoring) '<p class=3D"MsoNormal" style=3D"margin-bottom:0in;

Here’s the log:

/.rvm/gems/ruby-2.0.0-p648/gems/mail-2.6.4/lib/mail/utilities.rb:239:in `to_crlf': Interrupt
/.rvm/gems/ruby-2.0.0-p648/gems/mail-2.6.4/lib/mail/message.rb:1998:in `raw_source='
/.rvm/gems/ruby-2.0.0-p648/gems/mail-2.6.4/lib/mail/message.rb:2121:in `init_with_string'
/.rvm/gems/ruby-2.0.0-p648/gems/mail-2.6.4/lib/mail/message.rb:129:in `initialize'
/.rvm/gems/ruby-2.0.0-p648/gems/mail-2.6.4/lib/mail/mail.rb:51:in `new'
/.rvm/gems/ruby-2.0.0-p648/gems/mail-2.6.4/lib/mail/mail.rb:51:in `new'
/.rvm/gems/ruby-2.0.0-p648/gems/mail-2.6.4/lib/mail/mail.rb:188:in `read_from_string'
/sites/discourse/script/import_scripts/mbox.rb:86:in `block in all_messages'
/sites/discourse/script/import_scripts/mbox.rb:68:in `each'
/sites/discourse/script/import_scripts/mbox.rb:68:in `each_with_index'
/sites/discourse/script/import_scripts/mbox.rb:68:in `all_messages'
/sites/discourse/script/import_scripts/mbox.rb:190:in `create_email_indices'
/sites/discourse/script/import_scripts/mbox.rb:23:in `execute'
  from googlegroups.rb:49:in `execute'
/sites/discourse/script/import_scripts/base.rb:45:in `perform'
  from googlegroups.rb:92:in `<main>'
0 Likes

(Mittineague) #7

Because “3D” is the hex code for the equals sign, I’m guessing that is part of the problem.

AFAIK “MsoNormal” is from Microsoft Word.
(great for print documents, bad for code)

Not much help I know, but it will hopefully give you a clue about where to look.

e.g. maybe the Word curly quotes are getting removed, the equals signs are getting changed to their hex value, and the result is a messed-up parse.
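
To illustrate the point about hex values: `=3D` is how quoted-printable transfer encoding writes a literal `=` (0x3D is its ASCII code), and a lone `=` at the end of a line is a soft line break. The sketch below uses only Ruby's core `String#unpack` to decode such a fragment. The sample string is made up to resemble the warning above, and `decode_quoted_printable` is a hypothetical helper name, not part of mbox.rb or the mail gem.

```ruby
# Decode a quoted-printable fragment with Ruby's built-in "M" unpack directive.
# The helper name and the sample text are illustrative assumptions only.
def decode_quoted_printable(text)
  text.unpack("M").first
end

# "=3D" stands for "=", and "=\n" is a soft line break that joins two lines.
encoded = "<p class=3D\"MsoNormal\" style=3D\"margin-bottom:0in;tex=\nt-align:left\">"
decoded = decode_quoted_printable(encoded)
puts decoded
```

If the decoded text looks like ordinary HTML, the message body was probably quoted-printable and was parsed without being decoded first. That is only a guess at the cause, not a confirmed diagnosis.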

0 Likes

(Marcus Baw) #8

@spacewaffle I think I had a similar error for just one post out of the large test GG import I did with this script. The mbox importer reported the error in the command line output, but it then continued the import to completion.

I’m not sure I have any ‘proper’ advice for how to fix this error. I agree with @Mittineague that it does sound like a character encoding thing, though. A bit of a hack, but can you identify the 3 culprit .mbox files with grep or something, and edit them manually so the parse works?
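
A minimal sketch of that search in Ruby, in case grep isn't to hand. It assumes the downloaded messages sit as plain files somewhere under one directory and that the marker string from the warning appears verbatim in the culprit files. `files_containing` is a hypothetical helper, not part of the import script.

```ruby
# List files under a directory whose raw bytes contain a marker string.
# The default marker is copied from the warning text; the directory layout
# is an assumption for illustration.
def files_containing(dir, marker = 'class=3D"MsoNormal"')
  Dir.glob(File.join(dir, "**", "*"))
     .select { |path| File.file?(path) }
     .select { |path| File.binread(path).include?(marker) } # binread avoids encoding errors
     .sort
end
```

Calling `files_containing` with the directory that holds the downloaded messages returns the paths to inspect and edit by hand.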

M

0 Likes

(Deferred Procrastination) #9

Cool, I take it this moved to: GitHub - pacharanero/google_group.to_discourse: Import script from a private Google Group into a Discourse forum

0 Likes

#10

Thanks so much for putting this together!

Unfortunately I’m hitting an error before anything meaningful happens.
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/activesupport-4.2.7.1/lib/active_support/dependencies.rb:274:in 'require': No such file to load -- ./lib/onebox/discourse_onebox_sanitize_config (LoadError)

Any ideas? I read through the script and I noticed it isn’t creating the directory /tmp/google-group-crawler/

(My knowledge of discourse & ruby / gems is pretty limited :grimacing:)

Partial fix: I downgraded Discourse from latest-tests to 1.7.2 and then it worked!

2 Likes

(Marcus Baw) #11

@spencermcc glad it worked for you. I do know that the Discourse import scripts stuff has changed since I wrote the scraper. I keep meaning to update it, but as ever, time is the limiting factor.

What I’d like to do is refactor the whole thing so that it would be able to sit in the discourse standard importer ‘library’, at which point I could submit a PR and donate the whole thing to Discourse Core. Unfortunately, it required a patched/amended mbox.rb in order to work, and although mbox.rb has changed since I did the original script, I haven’t yet taken this change into consideration.

As with the Discourse Team and their prioritisation of implementation of new features, if there is anyone out there wanting paid Google Groups export/import to Discourse, then become a ‘customer’ and I will update the script :slight_smile:

M

3 Likes

(Rafael dos Santos Silva) #13

If you do this on a development machine that can’t send e-mails (the default) they won’t get any notifications.

0 Likes

(Marcus Baw) #15

The google_group.to_discourse script currently doesn’t work with the latest version of Discourse - I haven’t had time to update it since mbox.rb in Discourse was changed.

1 Like

(blaumeer) #17

As a workaround, I would install an older version of Discourse to do the import, and then switch to the latest.

1 Like

(Marcus Baw) #19

I’ve never actually had to downgrade to a previous version of Discourse myself so I’m not sure how to do it. My only concern about downgrading an install of Discourse ‘latest’ back to a previous version is that some database migrations might not work properly going backwards. The only way to be sure is, I guess, to try it.

Sorry I can’t give a timeline for updating the script for Discourse latest. Someone did contact me this week regarding some paid work doing a GG=>Discourse migration, if that work comes off then I will have to update the script. I will always publish my current latest version in the public repo on GH, so you’ll get what I am using.

M

2 Likes

(Russ Ennis) #23

For step 2, using ‘docker cp cookies.txt app:/cookies.txt’ might be easier than going via an attachment.

1 Like

(Marcus Baw) #24

@Russ_Ennis that’s true, you could do this. You would still have to get the cookies.txt onto that server from your development machine, so it would actually be a 2-step process, something like:
(performed in the context of the Discourse server, but outside the Docker container)

scp marcus@mydevlaptop:/cookies.txt /cookies.txt     #copy file to server
docker cp cookies.txt app:/cookies.txt               #copy file into Docker container

I decided for my howto that using an attachment is slightly faster, as well as more accessible for those less comfortable with unixy command line tools. But feel free to use whichever suits you best.

Marcus

1 Like

(Russ Ennis) #25

Fair enough. I looked at doing the attachment route first, but fell back to cmdline when it rejected my attempt to attach a non-{gif,jpg,png} file.

0 Likes

(Marcus Baw) #26

Ah yes! @Russ_Ennis you make an excellent point here which is that for uploading the cookies.txt via an attachment to work, you need to have enabled that file type in the Site Settings. The relevant setting is at http://YourSiteBaseURL/admin/site_settings/category/all_results?filter=authorized.

So my ‘1 step process’ may not in fact be as simple as I thought. Thanks for pointing this out. I think I’ll add your way of doing this to the Howto next time I revise it.

Marcus

0 Likes

#27

Thank you for this excellent work, unfortunately I am running into a problem where there doesn’t seem to be any data:

root@ubuntu-test-discourse4:/var/discourse# ./launcher enter app
root@ubuntu-test-discourse4-app:/var/www/discourse# cp /var/www/discourse/public/uploads/default/original/1X/5189a928f732b1ca761a61c14be8998ffa6e871e.txt /tmp/cookies.txt
root@ubuntu-test-discourse4-app:/var/www/discourse# apt install sqlite3 libsqlite3-dev
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Suggested packages:
  sqlite3-doc
The following NEW packages will be installed:
  libsqlite3-dev sqlite3
0 upgraded, 2 newly installed, 0 to remove and 0 not upgraded.
Need to get 1,023 kB of archives.
After this operation, 3,637 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu xenial/main amd64 libsqlite3-dev amd64 3.11.0-1ubuntu1 [508 kB]
Get:2 http://archive.ubuntu.com/ubuntu xenial/main amd64 sqlite3 amd64 3.11.0-1ubuntu1 [515 kB]
Fetched 1,023 kB in 0s (1,488 kB/s)
Selecting previously unselected package libsqlite3-dev:amd64.
(Reading database ... 34533 files and directories currently installed.)
Preparing to unpack .../libsqlite3-dev_3.11.0-1ubuntu1_amd64.deb ...
Unpacking libsqlite3-dev:amd64 (3.11.0-1ubuntu1) ...
Selecting previously unselected package sqlite3.
Preparing to unpack .../sqlite3_3.11.0-1ubuntu1_amd64.deb ...
Unpacking sqlite3 (3.11.0-1ubuntu1) ...
Setting up libsqlite3-dev:amd64 (3.11.0-1ubuntu1) ...
Setting up sqlite3 (3.11.0-1ubuntu1) ...
root@ubuntu-test-discourse4-app:/var/www/discourse# gem install sqlite3
Fetching: sqlite3-1.3.13.gem (100%)
Building native extensions.  This could take a while...
Successfully installed sqlite3-1.3.13
1 gem installed
root@ubuntu-test-discourse4-app:/var/www/discourse# cd /var/www/discourse/script/import_scripts
root@ubuntu-test-discourse4-app:/var/www/discourse/script/import_scripts# git clone https://github.com/pacharanero/google_group.to_discourse.git
Cloning into 'google_group.to_discourse'...
remote: Counting objects: 165, done.
remote: Total 165 (delta 0), reused 0 (delta 0), pack-reused 165
Receiving objects: 100% (165/165), 55.12 KiB | 0 bytes/s, done.
Resolving deltas: 100% (89/89), done.
Checking connectivity... done.
root@ubuntu-test-discourse4-app:/var/www/discourse/script/import_scripts# mv google_group.to_discourse/* .
root@ubuntu-test-discourse4-app:/var/www/discourse/script/import_scripts# su discourse
discourse@ubuntu-test-discourse4-app:/var/www/discourse/script/import_scripts$ ruby googlegroups.rb test-discourse-migration
loading existing groups...
loading existing users...
loading existing categories...
loading existing posts...
loading existing topics...
######## Your Google Group name is test-discourse-migration
######## SoI'm expecting the Google Group URL to be https://groups.google.com/forum/#!forum/test-discourse-migration
######## Clone the Google Group Crawler from icy
Cloning into '/tmp/google-group-crawler'...
remote: Counting objects: 258, done.
remote: Total 258 (delta 0), reused 0 (delta 0), pack-reused 258
Receiving objects: 100% (258/258), 59.12 KiB | 0 bytes/s, done.
Resolving deltas: 100% (98/98), done.
Checking connectivity... done.
######## Start the first pass collection of topics
mkdir: created directory './test-discourse-migration'
mkdir: created directory './test-discourse-migration//threads/'
mkdir: created directory './test-discourse-migration//msgs/'
mkdir: created directory './test-discourse-migration//mbox/'
:: Creating './test-discourse-migration//threads/t.0' with 'forum/test-discourse-migration'
:: Fetching data from 'https://groups.google.com/forum/?_escaped_fragment_=forum/test-discourse-migration'...
--2017-03-21 22:27:52--  https://groups.google.com/forum/?_escaped_fragment_=forum/test-discourse-migration
Resolving groups.google.com (groups.google.com)... 209.85.144.102, 209.85.144.139, 209.85.144.101, ...
Connecting to groups.google.com (groups.google.com)|209.85.144.102|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘STDOUT’

-                                  [ <=>                                               ]     590  --.-KB/s    in 0s      

2017-03-21 22:27:52 (47.2 MB/s) - written to stdout [590]

######## Iterate through topics to get messages

creating indices
0 records couldn't be associated with parents
0 replies to wire up

importing users
/var/www/discourse/script/import_scripts/base.rb:223:in `all_records_exist?': private method `exec' called for nil:NilClass (NoMethodError)
	from /var/www/discourse/script/import_scripts/mbox.rb:279:in `block in import_users'
	from /var/www/discourse/script/import_scripts/base.rb:801:in `block in batches'
	from /var/www/discourse/script/import_scripts/base.rb:800:in `loop'
	from /var/www/discourse/script/import_scripts/base.rb:800:in `batches'
	from /var/www/discourse/script/import_scripts/mbox.rb:276:in `import_users'
	from /var/www/discourse/script/import_scripts/mbox.rb:26:in `execute'
	from googlegroups.rb:49:in `execute'
	from /var/www/discourse/script/import_scripts/base.rb:45:in `perform'
	from googlegroups.rb:91:in `<main>'
discourse@ubuntu-test-discourse4-app:/var/www/discourse/script/import_scripts$ 

I dug down into the code and eventually worked out that I need a URL like this to get the raw MBOX:

https://groups.google.com/forum/message/raw?msg=test-discourse-migration-2/cat-1/3HVQZHxSiek

The problem is that even when I just navigate to this in my web browser while I am authenticated in the group, the request returns a blank file, there literally is no data being returned by Google!
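
A quick way to confirm that nothing was fetched, before the importer runs, is to count the files the crawler wrote and flag any zero-byte ones. This is only a sketch: `download_summary` is a hypothetical helper, and it assumes the crawler writes one file per message directly into the directory you pass in.

```ruby
# Summarise a download directory: how many files exist and which are zero bytes.
# Nothing here is part of the crawler or of Discourse; it only inspects files.
def download_summary(dir)
  files = Dir.glob(File.join(dir, "*")).select { |path| File.file?(path) }
  empty = files.select { |path| File.zero?(path) }
  { total: files.size, empty: empty.sort }
end
```

For the run logged above, `download_summary("./test-discourse-migration/mbox")` would be the call to try. A `total` of 0, or a long `empty` list, means the failure happened during the fetch rather than in the import step.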

Have Google killed this? Or have I done something wrong somewhere?

Cheers
Kel

0 Likes

(Marcus Baw) #28

Nope, Google haven’t killed it. I did a full export/migration last week for a customer and it was fine.

My script is still not up to date, for which I apologise, but I haven’t had time to refactor it for changes to Discourse after 1.7.

It sounds to me like your GG ‘logged-in-manager’ cookie isn’t valid or is not being found by the script.
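
One way to rule out the simplest cookie problems is to check that cookies.txt actually contains Google cookies and that they have not expired. The sketch below assumes the file is in the usual Netscape format, with seven tab-separated fields per line: domain, subdomain flag, path, secure flag, expiry as Unix time, name, value. `cookies_for` is a hypothetical helper. It cannot tell whether Google still accepts the session, only whether the file looks plausible.

```ruby
# Parse Netscape-format cookies.txt content and report cookies for a domain.
# Returns an array of hashes with the cookie name, its domain and an expired flag.
def cookies_for(text, domain_suffix, now = Time.now.to_i)
  rows = text.each_line.map do |line|
    # Some tools prefix HttpOnly cookies with "#HttpOnly_"; treat them as normal rows.
    line = line.chomp.sub(/\A#HttpOnly_/, "")
    next if line.empty? || line.start_with?("#")

    fields = line.split("\t", -1)
    next unless fields.size == 7

    domain, _flag, _path, _secure, expiry, name, _value = fields
    next unless domain.end_with?(domain_suffix)

    expires_at = expiry.to_i
    # An expiry of 0 marks a session cookie, which has no fixed end date.
    { name: name, domain: domain, expired: expires_at != 0 && expires_at < now }
  end
  rows.compact
end
```

For example, `cookies_for(File.read("/tmp/cookies.txt"), "google.com")` returning an empty array, or only entries with `expired: true`, would point to the cookie file rather than the script.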

I am hoping to be able to fix the script in the next few weeks but can’t guarantee it.

M

7 Likes

#29

Oops! Meant to reply to this the other day, but as a newbie I’d exceeded my maximum number of posts so couldn’t!

Thanks for the reply, very much appreciated. I am still no further, so I’ve PM’d you.

Cheers
Kel

0 Likes

#30

So, thinking a little abstractly, I thought: why not try from my local machine? If I can get the MBOX files, then I can start to do the import just using the default Discourse MBOX script.

The original Icy script does exactly the same thing I described above, and I also tried this Python script with the same result - i.e. no results!!

This one has a very nice login feature using Lynx (bit of a blast from the past!) so I know it’s not a cookies problem and it claims to work with public groups too!

I’m pretty convinced I’m not doing anything wrong and there is something amiss here…

Feeling totally depressed about this now, was expecting a day or two and I am on day 6 now :sob:

3 Likes

#31

Little update, @pacharanero very kindly had a look at my server to see where I was going wrong and it turns out that it wasn’t working because the GG had categories and neither script handles those :persevere:

Phew… so I wasn’t doing anything wrong per se, but now I know what the problem is I can start approaching it again with a fresh mind and new things to try!

Big ups for @pacharanero, anyone looking for a migration service should get in touch with him as he very obviously knows what he is doing and he’s a very nice chap too :smiley:

6 Likes
