(Superseded) Importing from Google Groups

:warning: This Howto is out-of-date.

Please check out Migrate a mailing list to Discourse (mbox, Listserv, Google Groups, etc) for updated instructions.


@pacharanero’s fantastic post on Migration of Google Groups to Discourse only works with Discourse v1.7.2, so I wanted to share the steps I followed to import Google Groups into Discourse v1.8.x (the latest stable version at the time of writing).

If you’ve installed and configured Discourse before and are comfortable on the command line, I hope this guide will be as helpful to you as @pacharanero’s guide has been to me.

If you run into any problems with these steps, just ask below and I’m glad to help. And if you have suggestions on how I can improve the guide, please share so we can make this process easier for everyone!

A few warnings before we start

Moving a community from Google Groups to Discourse is time-consuming, fragile, and annoying. If you can find someone who has done it before to do it for you, I’d strongly recommend that option!

If you’d like to try a move yourself, please read through all the steps before attempting it, and try it on a test server before actually moving your community. Again, it’s a fragile process.

The rough steps are to install and configure Discourse, install some prerequisites, scrape all your Google Groups messages (you’ll need to be a “Manager” to do this) into mbox files, then import those mbox files into Discourse.

Scraping the messages from Google Groups is slow, so give yourself a few hours. I’d also recommend putting the Google Group in read-only mode so you don’t miss any new messages while doing the import.

The import is CPU and RAM intensive. I’d recommend running it on the biggest machine you can find, backing up the Discourse data after the import, and then restoring that backup to a production machine.

I’ve had trouble getting this guide to work on recent v1.9.x versions of Discourse, but it works well on v1.8.x. To be safe, on your import machine edit /var/discourse/containers/app.yml, set version: stable in the params section, rebuild with ./launcher rebuild app, and use that version for the import.
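
If you rebuild import machines often, the version pin can be scripted. This is only a minimal sketch: it assumes the container definition file has a single (possibly commented-out) version: line, as the stock template does, and the function name pin_discourse_version is made up for illustration.

```ruby
# Matches a "version:" line, commented out or not, and captures its indentation.
VERSION_LINE = /^([ \t]*)#?[ \t]*version:.*$/

# Returns a copy of the YAML text with the version line set to the given value.
# Raises if no version line exists, so a silent no-op is not mistaken for success.
def pin_discourse_version(yml_text, version = "stable")
  raise ArgumentError, "no version: line found" unless yml_text.match?(VERSION_LINE)

  yml_text.sub(VERSION_LINE) { "#{Regexp.last_match(1)}version: #{version}" }
end
```

You would read the file, pass its contents through the function, write the result back, and then rebuild the container as usual. Editing the file by hand works just as well.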

1. Install and configure Discourse

  1. Install Discourse!

  2. Skip Setup by clicking Maybe Later. Then in Settings…

    • Disable emails so you don’t spam anyone with a digest email or notification
      • Note that with emails disabled, you won’t get the notification with the download link when you do a backup
    • Grant yourself trust level 4 and moderator

2. Install prerequisites inside Docker

  1. Log into the server. You might want to use mosh or tmux instead of plain SSH, since the process involves many long-running tasks.

    ssh user@your-discourse-server 
    cd /var/discourse
    ./launcher enter app
    
  2. Install pngout and pngquant. These are needed to compress images on import.

    cd /tmp
    wget http://static.jonof.id.au/dl/kenutils/pngout-20150319-linux-static.tar.gz
    tar zxvf pngout-20150319-linux-static.tar.gz
    cp pngout-20150319-linux-static/i686/pngout-static /usr/local/bin/pngout
    
    apt-get install build-essential libpng16-dev -y
    git clone --recursive https://github.com/pornel/pngquant.git
    cd pngquant
    make && make install
    
  3. Install sqlite3 for mbox import.

    apt-get install sqlite3 libsqlite3-dev -y
    
  4. Prep the Gemfile.

    cp /var/www/discourse/Gemfile /tmp/Gemfile
    
  5. Add sqlite3 to /tmp/Gemfile. The beginning of the file will look like this:

    source 'https://rubygems.org'
    # if there is a super emergency and rubygems is playing up, try
    #source 'http://production.cf.rubygems.org'
    
    gem 'sqlite3'
    
    # does not install in linux ATM, so hack this for now
    gem 'bootsnap', require: false
    
  6. Install the needed gems. There’s likely a better way of installing gems in deployment mode, but this is how I did it.

    cd /tmp/
    bundle install
    cp /tmp/Gemfile /var/www/discourse/Gemfile
    cp /tmp/Gemfile.lock /var/www/discourse/Gemfile.lock
    cd /var/www/discourse/
    bundle install
    
  7. Edit /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/discourse_image_optim-0.24.5/lib/image_optim/worker/jhead.rb to fix compression errors during import. The beginning of the file will look like this:

    require 'image_optim/worker'
    require 'exifr/jpeg'
    

Scrape messages from Google Groups

  1. Download @pacharanero’s awesome google_group.to_discourse script.

    cd /var/www/discourse/script/import_scripts/
    wget https://raw.githubusercontent.com/pacharanero/google-group-to-discourse-migration-script/master/googlegroup.rb
    
  2. Edit /var/www/discourse/script/import_scripts/googlegroup.rb to comment out import_to_discourse. The end of the file will look like this:

    setup
    scrape_google_group_to_mbox
    # import_to_discourse
    
  3. To scrape the emails from Google Groups, you will need a Google account with “Manager”-level access to the Google Group.

    Use Chrome to log into the Google Group using that manager account. Then with the cookies.txt extension, get a valid cookie file.

    SID, HSID, SSID from .google.com are all that is needed, so trim the large cookie file and put the resulting three lines at /var/www/discourse/script/import_scripts/cookies.txt. The file will look like this:

    .google.com TRUE  / FALSE 1568431805  HSID  gwB8B0z7IH8QPgYVz
    .google.com TRUE  / TRUE  1568431805  SSID  MPo7SOfkphRl9uqG0
    .google.com TRUE  / FALSE 1569505294  SID mEGqexZoGBVnTyO1NgPkdKI3zl10O6MmEGqexZoGBVnTyO1NgPkdKI3zl10O6MmGmDcN3G2
    
  4. Scrape messages from the Google Group. Depending on the size of your group, this step will take a few hours. Since it is a long process, it is wise to back up the googlegroup-export folder after the scrape has finished.

    cd /var/www/discourse/script/import_scripts/
    RAILS_ENV=production bundle exec ruby googlegroup.rb my-list /var/www/discourse/script/import_scripts/cookies.txt
    
  5. Set up your data directory for the import.

    mkdir -p /var/www/discourse/script/import_scripts/mbox-import
    cd /var/www/discourse/script/import_scripts/mbox-import
    
  6. Edit /var/www/discourse/script/import_scripts/mbox/settings.yml to point to the data directory. The file will look like this:

    data_dir: /var/www/discourse/script/import_scripts/mbox-import
    default_trust_level: 1
    split_regex: "/^From (.*) at/"
    
  7. Copy all the my-list emails into a my-category folder inside the data directory. The my-category category will be created by the import script automatically.

    cp -r /var/www/discourse/script/import_scripts/googlegroup-export/my-list/mbox my-category
    chmod -R 777 /var/www/discourse/script/import_scripts/mbox-import 
    
  8. Edit /var/www/discourse/script/import_scripts/mbox/importer.rb to make sure users are not staged. The create_users method will look like this:

    create_users(rows, total: total_count, offset: offset) do |row|
      {
        id: row['email'],
        email: row['email'],
        name: row['name'],
        trust_level: @settings.trust_level,
        staged: false,
        created_at: to_time(row['date_of_first_message'])
      }
    end
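
Trimming the cookie file by hand (step 3 above) is easy to get wrong. Here is a minimal Ruby sketch that keeps only the SID, HSID, and SSID lines for .google.com. It assumes the export is in the usual whitespace-separated cookies.txt layout with the cookie name in the sixth column, and that some exporters prefix HttpOnly cookies with #HttpOnly_ (stripped here so the output matches the three-line format shown in step 3). The function name filter_google_cookies is hypothetical.

```ruby
WANTED_COOKIES = %w[SID HSID SSID].freeze

# Returns only the cookie lines the scrape needs, as a single string.
def filter_google_cookies(text)
  text.each_line.map do |raw|
    # Assumption: HttpOnly cookies may carry this prefix in some exports.
    line = raw.sub(/\A#HttpOnly_/, "")
    next if line.start_with?("#") || line.strip.empty?

    # Columns: domain, subdomain flag, path, secure, expiry, name, value.
    fields = line.split(/\s+/)
    next unless fields[0] == ".google.com" && WANTED_COOKIES.include?(fields[5])

    line.end_with?("\n") ? line : "#{line}\n"
  end.compact.join
end
```

For example, `File.write("cookies.txt", filter_google_cookies(File.read("full-cookies.txt")))` would write the trimmed file, where full-cookies.txt is a placeholder name for the raw export. Check that the result has exactly three lines before starting the scrape.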
    

3. Import mbox files into Discourse

  1. Run the import. Depending on the size of your group, this step may take a few hours. The running time seems to depend mostly on how many images need to be compressed by pngquant/pngout.

    su - discourse
    cd /var/www/discourse/script/import_scripts/
    RAILS_ENV=production bundle exec ruby mbox-experimental.rb mbox/settings.yml
    
  2. That’s it! All the messages from my-list should now be in my-category!

Frequently asked questions

  • Import is failing on a particular topic, how do I recover?

    The import may fail when creating a particular topic. Usually, it’s because there are some HTML or Unicode characters that can’t be parsed. Make a note of the number (e.g., 123) where the import failed, then run sqlite3 index.db to get into the import database. A query like this will show you the bad message.

    SELECT msg_id, from_email, from_name, subject, email_date, attachment_count
    FROM email
    WHERE in_reply_to IS NULL
    ORDER BY DATE(email_date)
    LIMIT 1
    OFFSET 123;
    

    Remove the bad message from /var/www/discourse/script/import_scripts/mbox-import, delete index.db, and re-run the import.
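
Finding the file that holds the bad message can be tedious by hand. Below is a minimal Ruby sketch, assuming each scraped message sits in its own file under the data directory and that the msg_id value returned by the query also appears in that file's Message-ID header. The function name files_with_message_id is made up for illustration.

```ruby
require "find"

# Returns the paths of all files under dir that have a Message-ID header
# line containing msg_id (angle brackets and letter case are ignored).
def files_with_message_id(dir, msg_id)
  needle = msg_id.delete("<>").downcase.b

  Find.find(dir).select do |path|
    next false unless File.file?(path)

    # Read as binary so odd encodings in old emails cannot raise errors.
    File.foreach(path, mode: "rb").any? do |line|
      lowered = line.downcase
      lowered.start_with?("message-id:") && lowered.include?(needle)
    end
  end
end
```

Calling it with the data directory and the msg_id from the query returns the candidate files. Inspect them before deleting anything, since a forwarded or attached email could carry the same header.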

  • How do I add members who signed up for the Google Group, but didn’t send messages?

    The scrape only creates accounts for members who have sent messages to the Google Group.

    If you’d like to create accounts for your “silent” members, you’ll need to copy and paste the names, emails, and sign up dates from the Google Group’s member list. You must copy and paste because Export Members doesn’t work on large groups, so scroll to the bottom of the member list and then copy all.

    Then clean up the data as you see fit and load it into the users table. The data should look like this:

    ssmith@gmail.com, Sally Smith, 2012-05-18T00:00:01-00:00
    molly.mathers@gmail.com, Molly Mathers, 2010-01-05T00:00:01-00:00
    

    Finally, edit /var/www/discourse/script/import_scripts/mbox/importer.rb to only import users.

    def execute
      # index_messages
      # import_categories
      import_users
      # import_posts
    end
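
Since the member list in this answer is pasted by hand, it is worth checking each line before loading it. The following is a minimal Ruby sketch that parses the email, name, date format shown above and fails loudly on a malformed line. It does not write to the users table, and the function name parse_members is hypothetical.

```ruby
require "time"

# Parses "email, name, ISO 8601 date" lines into hashes.
# The first field is the email and the last field is the date, so a name
# that itself contains a comma is still read correctly.
# Time.iso8601 raises ArgumentError when the date is missing or invalid.
def parse_members(text)
  text.each_line.map(&:strip).reject(&:empty?).map do |line|
    email, rest = line.split(",", 2)
    name, _sep, joined = rest.to_s.rpartition(",")
    {
      email: email.strip.downcase,
      name: name.strip,
      created_at: Time.iso8601(joined.strip)
    }
  end
end
```

Running this over the cleaned-up list first means a stray line from the copy and paste shows up as an error here rather than as a half-created account later.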
    
  • How do I update the Google Group scrape with new messages?

    Modify the googlegroup.rb script so instead of running the wget command, you run the update command. The file will look like this:

    puts "This stage takes longer than the first pass and can take hours, depending on the size of your Google Group\n\n".blue
    
    system './crawler.sh -rss > update.sh'
    system 'chmod +x ./update.sh'
    system './update.sh'
    
    system "chmod -R 777 #{ENV["_GROUP"]}"
    

Thanks @yanokwa for the name-check on my google-group migration script! This is a great HOWTO you’ve done here.

I keep meaning to update the importer script, but lack of time and a profusion of other projects keep getting in the way. I’ll get to it eventually. I can provide Google Groups to Discourse migration services, and as part of any future work I’ll update and publish the script, and hopefully add some tests and proper production values so it can be maintained and pulled into Discourse core.

I took a look at it recently, but I was a little confused by the current state of the mbox importer scripts. There are now two of them in discourse/script/import_scripts/, but I can’t find any documentation for them in the source code, on Meta, or in any README.md.

As far as I can tell, discourse/script/import_scripts/mbox.rb is the original mbox importer, which was contributed to by (among others) @eviltrout @pfaffman @sam. I tried to use it for a straight mbox archive import recently and it didn’t work on Discourse 1.9, but this may have been a problem with the mbox files I was importing.

Then we have discourse/script/import_scripts/mbox-experimental.rb, which has a little bit of documentation in the source code, but is not referenced anywhere in Meta apart from in this post. This script looks like it uses the code in discourse/script/import_scripts/mbox/ and has a yaml file for the import settings. It was contributed to by @gerhard @tgxworld @techAPJ, and would seem to have been updated much more recently.

It would be nice to have a bit more clarity around which one is the ‘official’ Discourse mbox importer (the older non-experimental one that works with 1.8 but not 1.9, or the newer ‘experimental’ one), and perhaps an updated/wikified howto for 1.9 mbox imports?


Both are “official”, but the “experimental” one – which was my playground for a little while :slight_smile: – will surely replace the older one in the next month or two.

I think this could be added as a setting to the import script. IMHO there shouldn’t be a need for manually editing code. You can send a PR if you want. :wink:


Thanks @gerhard, good to know what the roadmap is.

PR has been merged: https://github.com/discourse/discourse/commit/77a92e8878e550a3c9bb3e435f827c8941d7f8a3
