Import from Google Groups to Discourse


(Joel Lamotte) #1

Continuing the discussion from Discourse in the news…:

I’m in the process of migrating my OSS project’s Google Group (which is used mostly by me and a few followers) to Discourse. As there is not a lot of history yet, and Google will keep it for some years I think (and I have a backup in my mail account), I planned to just install the Discourse forums and start from scratch.

However, it would be a bit nicer if I could import the Google Groups discussions into Discourse.
Does such an importer already exist?

Edit: There is now an officially supported tutorial on migrating from Google Groups to Discourse:


(Michael Poon) #2

afaik, there is not even a good way to export from Google Groups. The best I’ve come across are scrapers. If there is something out there, I, too, would be very interested.


(Erlend Sogge Heggen) #3

One weird idea:

Once Discourse supports posting by email, maybe a simple bot could be created that’s added to your mailinglist. This bot would forward every conversation to your forum. It’s hardly perfect, as it would be difficult to make separate posts for each email, assign authors and such, but at least you’d have the whole conversations and they’d be searchable, which is the most important part.

@uckelman since you seem to have a fair bit of knowledge on this subject, do you think this sounds feasible?


(Joel Uckelman) #4

[quote=“erlend_sh, post:3, topic:7307”]
@uckelman since you seem to have a fair bit of knowledge on this subject, do you think this sounds feasible?[/quote]
Yes. You have an address which you subscribe to the mailing list. Then you pipe all of the mail coming to that address to the script Discourse would use to process incoming mail. This isn’t so different from how some list archiving software works. But it wouldn’t help you archive old posts, only incoming new posts.


(Gabriel Mazetto) #5

I’ve searched, and the only feasible way is to write a scraping bot to retrieve all of the list’s content.
The robots.txt allows you to do it, but bear in mind that, depending on how big your forum history is, the job can take anywhere from many hours to several days.


(Tobias Eigen) #6

Actually @uckelman and @erlend_sh there might be a way to feed in old emails using an mbox file as the source, as explained in this post about FUD Forum. I haven’t tried it yet but will be looking at it at some stage.

http://fudforum.org/forum/index.php?t=msg&goto=21573&&srch=maillist.php+mbox
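
For anyone who does have their list history as an mbox file, here is a rough sketch of the first step: splitting the file into individual messages and reading a few headers. It does not talk to Discourse or FUDforum at all. The helper names `split_mbox` and `basic_headers` are made up for illustration, and the sketch assumes the classic mbox layout in which every message begins with a separator line starting with `From ` (and body lines that start that way have been escaped as `>From `).

```ruby
# Hypothetical helper: split the text of an mbox file into raw messages.
# Assumes each message starts with a separator line beginning with
# "From " (with a trailing space), which is dropped from the result.
def split_mbox(text)
  messages = []
  current = nil
  text.each_line do |line|
    if line.start_with?("From ")
      messages << current if current
      current = String.new # start collecting a new message
    elsif current
      current << line
    end
  end
  messages << current if current
  messages
end

# Hypothetical helper: read the headers of one raw message into a hash.
# Only single-line headers are handled; folded headers and MIME decoding
# are deliberately left out of this sketch.
def basic_headers(raw_message)
  header_block = raw_message.split(/\r?\n\r?\n/, 2).first.to_s
  headers = {}
  header_block.each_line do |line|
    name, value = line.chomp.split(":", 2)
    next unless value
    headers[name.strip.downcase] = value.strip
  end
  headers
end

sample = <<~MBOX
  From alice@example.org Mon Jan  6 10:00:00 2014
  From: alice@example.org
  Subject: First topic
  Date: Mon, 6 Jan 2014 10:00:00 +0000

  Hello list.

  From bob@example.org Mon Jan  6 11:00:00 2014
  From: bob@example.org
  Subject: Re: First topic
  Date: Mon, 6 Jan 2014 11:00:00 +0000

  Hello Alice.
MBOX

split_mbox(sample).each do |raw|
  h = basic_headers(raw)
  puts "#{h['date']} | #{h['from']} | #{h['subject']}"
end
```

Grouping the messages into topics (for example by subject or by the `In-Reply-To` header) and posting them would still need to be done on top of this.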


(Marcus Baw) #7

I am working on a scraper for Google Groups that exports to Discourse.

The difficulty with Google Groups is…
The difficultiES with Google Groups are:

  1. there is no API
  2. there is no API
  3. the entire content is rendered in-browser from JS so HTML requesting tools such as the Ruby Mechanize gem don’t work (you can log in but you can’t see any content)
  4. the HTML tags are (it seems deliberately) obfuscated - they are meaningless in English so it’s hard to work out what CSS selectors to go for when scraping the page
  5. I’m told there are Captchas if you go over a certain rate limit for page requests (although I didn’t encounter this problem)

So I have developed an approach using the Ruby Selenium bindings, automating a Firefox browser, so that the content can be rendered. I worked out which CSS/XPath selectors are required in order to get the right bits of content. I scrape it all one Topic at a time and pipe it into Discourse.

The code is open source (naturalmente) and MIT licence.
https://github.com/pacharanero/google_group.to_discourse
I’ll get round to writing proper documentation for it soon.

The code is hacky but it works: I migrated 246 Topics from a medium-sized, commercially important Google Group for a client. I’m happy to accept pull requests if you’ve improved it.

I’m also interested in taking on “We Scrape Any Google Group” contract work for this; PM me if interested.

Marcus


(Michael Downey) #8

This is fantastic progress. I assume the original dates are not preserved, though?


(Marcus Baw) #9

No, unfortunately not. I don’t think Discourse’s API allows one to create a post in ‘the past’. If that were added in the future, we might be able to preserve the time course properly. As it is, I include the Original Posting Date in a header string. Actually, it would be nice to have a Master API key that allowed #add_post_as_user() and #add_post_at_date() or something.

The date is stripped out of the Google Groups HTML element as a string, although I do know a way (which I’ve tried successfully) of getting a DateTime object that could be passed to an API.
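
As a rough illustration of the header approach (not the code from the repository), the sketch below parses a scraped date string into a DateTime and prepends a header line to the post body. The helper names, the header wording, and the date format `%d/%m/%Y %H:%M` are all assumptions; the format in particular needs adjusting to whatever Google Groups actually renders.

```ruby
require "date"

# Assumed format of the scraped date string; the real Google Groups
# markup may differ, so adjust this to match what the scraper sees.
SCRAPED_DATE_FORMAT = "%d/%m/%Y %H:%M"

# Hypothetical helper: turn the scraped string into a DateTime object.
def parse_scraped_date(date_string)
  DateTime.strptime(date_string, SCRAPED_DATE_FORMAT)
end

# Hypothetical helper: prepend a header recording the original author
# and posting date, since the imported post itself is created "now".
def with_import_header(body, author:, date_string:)
  posted_at = parse_scraped_date(date_string)
  header = "Imported from Google Groups. Originally posted by #{author} " \
           "on #{posted_at.strftime('%d %B %Y at %H:%M')}."
  "#{header}\n\n#{body}"
end

puts with_import_header("Has anyone tried the new release?",
                        author: "Jane Example",
                        date_string: "14/03/2014 09:30")
```

Keeping the parsed DateTime around, rather than only the string, means it could be handed to an API later if creating posts at a given date becomes possible.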

Here’s how the import looks as it is now - note the header with original import information


(Michael Downey) #10

I suppose that’s the next best thing to actual dates, unless there’s something hidden in the API that someone can share. Are the posts imported in chronological order so that the newest Groups threads are imported last?


(Marcus Baw) #11

They are actually not but I agree this is the way I should have done it. It’s as simple as changing a .each to a .reverse_each in the code.
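
To show what that one-word change does, here is a small sketch. The `topics` array and `import_topic` are hypothetical stand-ins rather than the scraper’s real data or methods, and the sketch assumes the scraped list is newest first.

```ruby
# Hypothetical stand-in for the scraped topic list, assumed to be in
# newest-first order.
topics = [
  { title: "Release 2.0 feedback",   last_post: "2014-03-01" },
  { title: "Build problems on OS X", last_post: "2013-11-12" },
  { title: "Welcome to the list",    last_post: "2013-06-30" }
]

# Hypothetical stand-in for the step that creates the topic in Discourse;
# here it only records the order in which topics would be created.
def import_topic(topic, log)
  log << topic[:title]
end

imported = []
# reverse_each walks the array from the last element to the first, so the
# oldest topic is created first and the newest one is created last.
topics.reverse_each { |topic| import_topic(topic, imported) }

puts imported
```

With plain `.each` the same loop would create the newest topic first, which is the behaviour described above.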


#12

I have a Google Group that I would like to have scraped and then imported into a forum. Can you do this for me to save me the pain?

Thanks

TudorUser


(Allen - Watchman Monitoring) #13

@TudorUser not sure who you’re asking… but I’d love to have a google group of mine loaded into a forum of mine.

Maybe someone who’s “specialized” in this can post their service, as I wouldn’t expect it to be free.


#14

I was trying to reach pacharanero, as I could not work out how to PM him/her.


(Zack Piper) #15

Using @THE_USERNAME will mention them and they’ll get a notification. You can PM people via their profile, but it looks like a restriction on new users prevents this, for obvious reasons such as PM spam, which can’t be moderated easily because people don’t (usually) look through other people’s PMs.


(Allen - Watchman Monitoring) #16

Thanks for that… it didn’t occur to me that @TudorUser was new, and couldn’t mention @pacharanero directly.

But, that’s taken care of now :wink:


(Marcus Baw) #17

Hello @TudorUser & @watchmanmonitor!

I have done a bit of Google Group scraping into Discourse, including commercially for a customer, for several different forums. Google Groups, as you would imagine, makes this as difficult as possible, but it can certainly be done and I’d be happy to quote for this service. Please PM me on Discourse Meta and we can discuss further.

Each scrape is unique, as different customers have different requirements: how much detail they want to retain, whether they want to save attachments, and simply the size of the group. There is a rate limit on requests, so a bigger Google Group simply takes longer.

[For those with the technical knowledge to do their own scraping, the full code for my automated Ruby scraper is on GitHub at https://github.com/pacharanero/google_group.to_discourse (an import script from a private Google Group into a Discourse forum), and I’m happy to accept PRs on improvements, especially ones which make it easier for non-technical people to scrape Google Groups.]


#18

I own a group with about 160 members. There are 350 topics that I would like to convert into threads. This is not a commercial venture. The group is for owners of a make of yacht. Having said that, I would pay to get this done!
I am having a website built with WordPress. The Google Group is private, so I need your email to give you access. Can you PM it to me?
Thanks
TU


(Erlend Sogge Heggen) #19

Thanks for sharing! Would you mind creating a new How-to topic dedicated to your script? Basically just a copy & paste of your README.md. You’ll have to leave out the parts about asking for donations since that verges on commercialisation which doesn’t belong in the howto category, but you could put up a separate topic about your services in the Marketplace.


(Marcus Baw) #20

OK, I will do this, but I can’t guarantee it will be soon! I’m a bit busy with other stuff. It’s a good idea though.

Marcus