I would like to open a new thread to continue the discussion here about an importer for Google Groups. In short, I have a simple Python script based on Scrapy that scrapes all messages of a group into a JSON file. I hope someone who knows Ruby and the Discourse API could turn it into a real importer (a JSON-to-Discourse importer).
That prior discussion mentioned an importer too, but it seems that it cannot scroll down to the bottom of a page to load the complete messages, so it doesn't really work. My script can follow links like "more topics" until there are no more, thus scraping all messages.
Another trick: I found that the Google Groups URL follows the format base_forum_url + range. Not sure if anyone else noticed, but it works pretty well for me. For example, you can request the topics from index 10 to index 20 by appending that range to the forum URL, which gives you a way to iterate over all topics in a group.
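To make the iteration concrete, here is a minimal Ruby sketch of the range trick. The group name and the exact URL pattern below are assumptions for illustration; substitute whatever base_forum_url your group actually uses.

```ruby
# Minimal sketch: build range URLs of the form base_forum_url + "[lo-hi]" and
# walk through a group's topic list in steps of 10.
base_forum_url = "https://groups.google.com/forum/#!forum/your-group-name" # placeholder

(0...50).each do |page|
  lo = page * 10 + 1
  hi = lo + 9
  range_url = "#{base_forum_url}[#{lo}-#{hi}]" # e.g. "...[11-20]"
  puts range_url # feed each range URL to the scraper; stop once a page returns no topics
end
```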
Anyway, I would like to share my scripts as a starting point. I hope someone with Discourse experience can pick them up and make importing Google Groups into Discourse a one-click process.
The GitHub repo is here: https://github.com/steinwaywhw/google-group-exporter
There is a Python script that depends on Scrapy (scrapy.org; that should be a clickable link, but I'm a new user and can't put more than two links in a post).
There is an example Google Groups topic page as seen by the scraper.
There is an example of the output from my scripts.
BTW, "new users can only have two links in a post" - why is that? It's annoying.
It's a spam-protection measure; you can disable it in the site settings. On import you generally bypass all validation, so this should not be an issue.
Run ./wget.sh (note that chmod +x wget.sh might be needed first).
Import Part
1. Install sqlite3 (see the Gemfile sketch after this list):
   a. Inside the Docker container, find the Gemfile at /var/www/discourse and add a dependency on sqlite3.
   b. apt-get install -y sqlite3 libsqlite3-dev
   c. Run bundle install --no-deployment as root.
2. Modify mbox.rb in script/import_scripts (see the sketch after this list):
   a. Find MBOX_DIR and update it accordingly.
   b. Find #{MBOX_DIR}/messages/* in all_messages and change the subdirectory to mbox, since that is what the crawler names it.
   c. Comment out all calls to all_records_exist, namely at lines 210, 252, and 317. These lines caused trouble connecting to the sqlite3 database. Not sure if you will see the same problem; you can play with these.
3. Run the importer:
   su discourse
   RAILS_ENV=production ruby ./mbox.rb
4. Refresh your Discourse page in the browser, and you are good to go.
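For step 1a, the Gemfile change is just one added line; the version constraint is omitted here, so treat this as a sketch rather than the exact line used.

```ruby
# /var/www/discourse/Gemfile -- add the sqlite3 dependency used by the import script
gem 'sqlite3'
```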
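And for step 2, here is a paraphrased sketch of what the edited parts of mbox.rb might look like. MBOX_DIR and all_messages exist in the stock script, but the surrounding code below is illustrative only and will differ between Discourse versions, so use it as a map of the changes rather than a drop-in patch.

```ruby
# script/import_scripts/mbox.rb -- sketch of the step-2 edits (not a literal diff)

# 2a. Point MBOX_DIR at the directory the crawler wrote its output into.
MBOX_DIR = "/var/www/discourse/tmp/google-group" # adjust to your path

def all_messages
  # 2b. The stock glob reads "#{MBOX_DIR}/messages/*"; the crawler names the
  #     subdirectory "mbox", so the glob becomes:
  Dir["#{MBOX_DIR}/mbox/*"].each do |filename|
    raw = File.read(filename)
    yield raw, filename # illustrative shape; the real method does more per message
  end
end

# 2c. Calls of roughly this form (around lines 210, 252, and 317 in the version
#     I used) are commented out because they raised a nil-connection error
#     against the sqlite3 index database:
#
#     # return if all_records_exist?(:posts, message_ids)
```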
Actually, I hope someone with experience could tell me what's going on with that all_records_exist? error. That function comes from the base.rb file, and the exception basically says the connection object is nil.
Update (July 2016)
Hi, I just discussed with the author of google_group_crawler, and he confirmed that the email addresses seen by the crawler are now not full addresses, but only a portion of them, probably for security reasons on Google's side. Therefore, I extended the importer script as shown here on glot.io: Google Group Importer for Discourse - Ruby Snippet - glot.io. It loads a separate users-list CSV file (which can be exported from the Google Group by a group owner/manager) and runs a cross scan to recover the emails. Hope it helps.
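In case the snippet link goes stale, here is a rough Ruby sketch of the cross-scan idea, not the actual glot.io code. The CSV file name and the "Email address" column header are assumptions; adjust them to match the export from your group.

```ruby
require 'csv'

# Load the full addresses from the members list exported by a group owner/manager.
full_addresses = CSV.read("members.csv", headers: true)
                    .map { |row| row["Email address"] }
                    .compact

# Google Groups truncates addresses to something like "stein...@gmail.com":
# a visible prefix, an ellipsis, then the domain. Recover the full address by
# matching both parts against the exported list; fall back to the truncated one.
def recover_email(truncated, full_addresses)
  return truncated unless truncated.include?("...@")
  prefix, domain = truncated.split("...@", 2)
  full_addresses.find { |a| a.start_with?(prefix) && a.end_with?("@#{domain}") } || truncated
end

puts recover_email("stein...@gmail.com", full_addresses)
```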