An Importer for Google Groups

Hi,

I would like to open a new thread to continue the discussion here about an importer for Google Groups. In short, I have a simple Python script, based on Scrapy, that scrapes all messages of a group into a JSON file. I hope someone who knows Ruby and the Discourse API can turn it into a real importer (JSON-to-Discourse).

That prior discussion mentioned an importer too, but it seems that one cannot scroll down to the bottom of a page to load complete messages, so it doesn't really work. My script, by contrast, follows links like "more topics" until there are none left, and therefore scrapes all messages.

Another trick: I found that the Google Groups URL has the format base_forum_url + range. Not sure if anyone has noticed, but it works pretty well for me. For example, you can use

https://groups.google.com/forum/#!forum/ats-lang-users[10-20]

to get the topics with indices 10 through 20. This gives you a way to iterate over all topics in a group.
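For illustration, here is a small Python sketch of that iteration. The group name, total topic count, and page size are whatever you choose; the URL pattern is the one described above:

```python
def group_page_urls(group, total, step=10):
    """Yield Google Groups topic-list URLs covering topic indices 1..total,
    using the base_forum_url + [start-end] range trick."""
    base = f"https://groups.google.com/forum/#!forum/{group}"
    for start in range(1, total + 1, step):
        end = min(start + step - 1, total)
        yield f"{base}[{start}-{end}]"
```

A scraper can then fetch each yielded URL in turn until the whole group is covered.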

Anyway, I would like to share my scripts as a starting point. I hope someone with Discourse experience can pick them up and make importing Google Groups into Discourse a one-click operation.

The GitHub repo is here: https://github.com/steinwaywhw/google-group-exporter
There is a Python script that depends on scrapy.org (that should be a clickable link, but I'm a new user and can't put more than two links in a post ...)
There is an example Google Group topic page as seen by the scraper.
There is an example output from my scripts.

BTW, "new users can only have two links in a post" - why is that? It's annoying.

3 Likes

It's a spam-protection measure; you can disable it in the site settings. On import you generally bypass all validation, so this should not be an issue.

4 Likes

If you want to check out a couple examples of importers that use JSON files as the starting point, there are at least a few precedents:

2 Likes

You are right. I just realized spam is all about posting links.

Those are really helpful. Thanks a lot, and I will take time to learn some new skills :slight_smile:

1 Like

Did you look into either of these projects?

They output mbox data, which Discourse already has a script for:

https://github.com/discourse/discourse/blob/master/script/import_scripts/mbox.rb

So as long as you can get either of those projects to output clean mbox data that the importer script can work with, we're golden :sunglasses:
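Before handing the data to the importer, it can help to sanity-check that the crawler's output actually parses as mbox. This is a minimal Python sketch using the standard-library `mailbox` module, not part of the Discourse tooling; it just counts messages and flags any that lack the headers an importer would likely need:

```python
import mailbox

def check_mbox(path):
    """Open an mbox file, count its messages, and return the indices of any
    messages missing a From or Message-ID header."""
    box = mailbox.mbox(path)
    problems = []
    for i, msg in enumerate(box):
        if msg["From"] is None or msg["Message-ID"] is None:
            problems.append(i)
    return len(box), problems
```

If `check_mbox` reports zero problems, the file is at least structurally usable; the importer may still have its own requirements.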

p.s. @pacharanero also took the scraper approach and successfully migrated several sites with it.

2 Likes

Thanks for the pointers. It largely works, but the mbox importer throws null-pointer exceptions at https://github.com/discourse/discourse/blob/master/script/import_scripts/mbox.rb#L210 and https://github.com/discourse/discourse/blob/master/script/import_scripts/base.rb#L218. It seems the connection to the SQLite db is somehow not working. I might need some time to figure it out.

Alright, I figured it out, and I hope this helps others.

I'm using https://github.com/icy/google-group-crawler with the latest version (b42f28d4c36eddce1df039566f3615b0753a2dc6) of the mbox import script.

Crawler Part

  1. `export _GROUP=your_group_name`
  2. `./crawler.sh -sh > wget.sh`
  3. `./wget.sh` (note that `chmod +x wget.sh` might be needed)

Import Part

  1. Install sqlite3:
    a. Inside the Docker container, find the Gemfile at `/var/www/discourse` and add a dependency on `sqlite3`.
    b. `apt-get install -y sqlite3 libsqlite3-dev`
    c. `bundle install --no-deployment` as root.
  2. Modify `mbox.rb` in `script/import_scripts`:
    a. Find `MBOX_DIR` and update it accordingly.
    b. Find `#{MBOX_DIR}/messages/*` in `all_messages` and change the subdirectory to `mbox`, as that is what the crawler names it.
    c. Comment out all calls to `all_records_exist?`, namely on lines 210, 252, and 317. These lines cause trouble connecting to the SQLite database. Not sure if you will see the same problem; you can experiment with them.
  3. `su discourse`
  4. `RAILS_ENV=production ruby ./mbox.rb`

Refresh your Discourse page in the browser, and you're good to go.

Actually, I hope someone with experience could tell me what's going on with that `all_records_exist?` error. The function comes from base.rb, and the exception basically says the connection object is nil.

Update (July 2016)

Hi, I just discussed this with the author of google-group-crawler, and he confirmed that the email addresses seen by the crawler are now only partial, not full addresses, probably for security reasons on Google's side. Therefore, I extended the importer script, as shown here on glot.io: Google Group Importer for Discourse - Ruby Snippet - glot.io. It loads a separate users-list CSV file (which a group owner/manager can export from Google Groups) and runs a cross scan to recover the full emails. Hope it helps.
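The glot.io snippet above is in Ruby; as a rough sketch of the same cross-scan idea in Python (the masked-address format `joh...@gmail.com` and the behavior on ambiguity are assumptions, not taken from the snippet):

```python
import re

def recover_email(masked, members):
    """Match a masked address like 'joh...@gmail.com' against a list of full
    member addresses. Returns the full address on a unique match, the input
    unchanged if it was not masked, or None if the match is ambiguous/absent."""
    m = re.match(r"(.*)\.\.\.(@.*)", masked)
    if not m:
        return masked  # already a full address
    prefix, domain = m.groups()
    hits = [e for e in members if e.startswith(prefix) and e.endswith(domain)]
    return hits[0] if len(hits) == 1 else None
```

In practice the `members` list would come from the exported users CSV; how that file is parsed depends on the export's column layout.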

10 Likes

Continued here: