An Importer for Google Groups

Hi,

I would like to open a new thread to continue the discussion here about an importer for Google Groups. In short, I have a simple Python script, based on Scrapy, that scrapes all messages of a group into a JSON file. I hope someone who knows Ruby and the Discourse API can turn it into a real importer (JSON-to-Discourse).

That prior discussion mentioned an importer too, but it seems that one cannot scroll down to the bottom of a page to load complete messages, so it doesn't really work. My script, by contrast, follows links like "more topics" until there are none left, and therefore scrapes all messages.

Another trick: I found that the Google Groups URL has the format base_forum_url + range. Not sure if anyone has noticed, but it works pretty well for me. For example, you can use

https://groups.google.com/forum/#!forum/ats-lang-users[10-20]

to get the topics with indices 10 through 20. This gives you a way to iterate over all topics in a group.
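For illustration, here is a small Python sketch of that iteration. The group name, total topic count, and page size are whatever you choose; the URL pattern is the one described above:

```python
def group_page_urls(group, total, step=10):
    """Yield Google Groups topic-list URLs covering topic indices 1..total,
    using the base_forum_url + [start-end] range trick."""
    base = f"https://groups.google.com/forum/#!forum/{group}"
    for start in range(1, total + 1, step):
        end = min(start + step - 1, total)
        yield f"{base}[{start}-{end}]"
```

A scraper can then fetch each yielded URL in turn until the whole group is covered.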

Anyway, I would like to share my scripts as a starting point. I hope someone with Discourse experience can pick them up and make importing Google Groups into Discourse a one-click operation.

The GitHub repo is here: https://github.com/steinwaywhw/google-group-exporter
There is a Python script that depends on scrapy.org (that should be a clickable link, but I'm a new user and can't put more than two links in a post ...)
There is an example Google Group topic page as seen by the scraper.
There is an example output from my scripts.

BTW, "new users can only have two links in a post" - why is that? It's annoying.

3 Likes

It's a spam-protection measure; you can disable it in the site settings. On import you generally bypass all validation, so this should not be an issue.

4 Likes

If you want to check out a couple examples of importers that use JSON files as the starting point, there are at least a few precedents:

2 Likes

You are right. I just realized spam is all about posting links.

Those are really helpful. Thanks a lot, and I will take time to learn some new skills :slight_smile:

1 Like

Did you look into either of these projects?

They output mbox data, which Discourse already has a script for:

https://github.com/discourse/discourse/blob/master/script/import_scripts/mbox.rb

So as long as you can get either of those projects to output clean mbox data that the importer script can work with, we're golden :sunglasses:
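Before handing the data to the importer, it can help to sanity-check that the crawler's output actually parses as mbox. This is a minimal Python sketch using the standard-library `mailbox` module, not part of the Discourse tooling; it just counts messages and flags any that lack the headers an importer would likely need:

```python
import mailbox

def check_mbox(path):
    """Open an mbox file, count its messages, and return the indices of any
    messages missing a From or Message-ID header."""
    box = mailbox.mbox(path)
    problems = []
    for i, msg in enumerate(box):
        if msg["From"] is None or msg["Message-ID"] is None:
            problems.append(i)
    return len(box), problems
```

If `check_mbox` reports zero problems, the file is at least structurally usable; the importer may still have its own requirements.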

p.s. @pacharanero also took the scraper approach and successfully migrated several sites with it.

2 Likes

Thanks for the pointers. It largely works, but the mbox importer throws null-pointer exceptions at https://github.com/discourse/discourse/blob/master/script/import_scripts/mbox.rb#L210 and https://github.com/discourse/discourse/blob/master/script/import_scripts/base.rb#L218. It seems the connection to the SQLite db is somehow not working. I might need some time to figure it out.

Alright, I figured it out, and I hope this helps others.

I'm using https://github.com/icy/google-group-crawler with the latest version (b42f28d4c36eddce1df039566f3615b0753a2dc6) of the mbox import script.

Crawler Part

  1. `export _GROUP=your_group_name`
  2. `./crawler.sh -sh > wget.sh`
  3. `./wget.sh` (note that `chmod +x wget.sh` might be needed)

Import Part

  1. Install sqlite3:
    a. Inside the Docker container, find the Gemfile at `/var/www/discourse` and add a dependency on `sqlite3`.
    b. `apt-get install -y sqlite3 libsqlite3-dev`
    c. `bundle install --no-deployment` as root.
  2. Modify `mbox.rb` in `script/import_scripts`:
    a. Find `MBOX_DIR` and update it accordingly.
    b. Find `#{MBOX_DIR}/messages/*` in `all_messages` and change the subdirectory to `mbox`, as that is what the crawler names it.
    c. Comment out all calls to `all_records_exist?`, namely on lines 210, 252, and 317. These lines cause trouble connecting to the SQLite database. Not sure if you will see the same problem; you can experiment with them.
  3. `su discourse`
  4. `RAILS_ENV=production ruby ./mbox.rb`

Refresh your Discourse page in the browser, and you're good to go.

Actually, I hope someone with experience could tell me what's going on with that `all_records_exist?` error. The function comes from base.rb, and the exception basically says the connection object is nil.

Update (July 2016)

Hi, I just discussed this with the author of google-group-crawler, and he confirmed that the email addresses seen by the crawler are now only partial, not full addresses, probably for security reasons on Google's side. Therefore, I extended the importer script, as shown here on glot.io: Google Group Importer for Discourse - Ruby Snippet - glot.io. It loads a separate users-list CSV file (which a group owner/manager can export from Google Groups) and runs a cross scan to recover the full emails. Hope it helps.
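The glot.io snippet above is in Ruby; as a rough sketch of the same cross-scan idea in Python (the masked-address format `joh...@gmail.com` and the behavior on ambiguity are assumptions, not taken from the snippet):

```python
import re

def recover_email(masked, members):
    """Match a masked address like 'joh...@gmail.com' against a list of full
    member addresses. Returns the full address on a unique match, the input
    unchanged if it was not masked, or None if the match is ambiguous/absent."""
    m = re.match(r"(.*)\.\.\.(@.*)", masked)
    if not m:
        return masked  # already a full address
    prefix, domain = m.groups()
    hits = [e for e in members if e.startswith(prefix) and e.endswith(domain)]
    return hits[0] if len(hits) == 1 else None
```

In practice the `members` list would come from the exported users CSV; how that file is parsed depends on the export's column layout.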

10 Likes

Continued here: