Migration from Yahoo! Groups

A lot of Yahoo groups are being affected by Yahoo’s decision to get rid of uploaded content, and are looking for alternative providers. The obvious one is groups.io, which offers feature parity and will automagically migrate all the group content, but (1) costs $220 for the first year to do the migration (after that, it’s free if the storage requirements are < 1 GB), and (2) still uses the same email-centric format of Yahoo Groups. I see mentioned in another topic (Yahoo Groups to Discourse migration?) that there is a migration script, but from what I can see (admittedly not much at this point), it only migrates messages.

Since the Yahoo change that’s going to be driving groups to move is the loss of file and photo storage, this leaves two questions:

  • Is there an automated way to move uploaded content from a Yahoo Group into a Discourse instance?
  • What options are there in Discourse to make that content available/visible to other users? The only one I’m seeing would be for a user’s uploaded content to be put into one or more topics by that user, probably in distinct categories. For photos that would be straightforward enough; for other documents uploaded to the group (mostly PDFs, some .doc and .xls files) it looks like that would require some config changes but nothing too serious. But is there another option?
5 Likes

Interesting… I did not know about Yahoo Groups getting rid of uploaded content. I have an old Yahoo group I still maintain for my neighborhood and it’s been in the back of my mind to migrate it to a Discourse instance… maybe the time is coming.

The easiest way to migrate, methinks, would be to export the subscriber list to CSV and then use that to create your user base in Discourse; that should be fairly straightforward.

For the content, do you have a complete history of messages sent to the list in your email? If so, you could use an app like Thunderbird to download all the messages and save them to MBOX. Once you have that, there are scripts for importing. I think this recipe will help you: Importing mailing lists (mbox, Listserv, Google Groups, emails, ...)

Not sure about what you are describing as uploaded content; I haven’t used Yahoo Groups that way myself. I do not know what your options are for getting those files out of Yahoo and preparing them for Discourse. It may be a manual process… and maybe a good opportunity to get organized and get rid of stuff that is no longer needed.

But yes, Discourse is discussion-oriented, so all content is in topics. It is possible to set topics as wikis so that they can be maintained by a group, including adding/removing attachments. There are also personal messages, which could be used to talk to yourself or a handpicked selection of other people, where I suppose people could keep some content. It may be easier for you to look at another tool with SSO for file sharing. In my community we use WordPress, which has a plugin that handles SSO and works quite seamlessly. If you’re talking about a lot of files, you could set up a Nextcloud instance.

Have fun!

Update: whoa… it certainly looks like it’s time to move. They are really taking drastic action to limit the usefulness of Yahoo Groups, and very soon. No more new content as of 28 October, in two days! And content is to be removed on 14 December.

https://help.yahoo.com/kb/groups/SLN31010.html?impressions=true

4 Likes

I don’t, though I’m not an admin for any of the lists I’m dealing with. But migrating the messages would appear to be addressed by https://github.com/discourse/discourse/blob/master/script/import_scripts/yahoogroup.rb and https://github.com/jonbartlett/yahoo-groups-export. And a group admin would be able to export the group messages with the email addresses attached (which a normal user can’t). Although it doesn’t look quite pushbutton simple, the messages appear to be the least of the concerns at this point.

Yahoo groups provide (well, provided) storage space for photos (100 GB) and other files (2 GB) for the group. Groups I’m in use these for pictures of group members and items of interest to the group, and for various other files. Anything that would be sent privately among members in a Y group would likely have gone as an email, and Y wouldn’t have any record of that; I wouldn’t think migrating that would even be possible, much less a priority. But there is a lot of information stored there in many groups, which they’d want to preserve in a migration.

Could be. Again, there’s one (and only one, AFAIK) site that appears to be a turnkey replacement, but it maintains the old format of Y groups with only minor changes. I’m thinking that, if a group has to migrate anyway, it might be nice to migrate to something more modern, and Discourse still plays pretty nicely with email (which many other forum packages don’t), meaning that the old-school folks like me who are used to getting the email delivered to them, and replying by email, can still do it that way. And saving a few bucks would be nice as well.

4 Likes

So, further developments. This tool:
https://github.com/IgnoredAmbience/yahoo-group-archiver

seems to work quite nicely in bulk-downloading the contents of a group–it gets all the messages, files, attachments, etc. The messages are downloaded in two .json files each, one “raw” and the other HTML. The first looks like:

{
    "userId": 185744666,
    "authorName": "vhsproducts@aol.com",
    "from": "vhsproducts@...",
    "profile": "vhsproducts",
    "replyTo": "LIST",
    "senderId": "fc-T6L4xNaFRDleu_7gutRzgA_WWujKXanij68LOf7iz0WXh-BolDsmiqlo19adwRPTjwe0FpCYycg",
    "spamInfo": {
        "isSpam": false,
        "reason": "0"
    },
    "subject": "Re: [MicroTrak] Mint-Trak300 completed",
    "postDate": "1181013131",
    "msgId": 4,
    "canDelete": false,
    "contentTrasformed": false,
    "systemMessage": false,
    "headers": {
        "messageIdInHeader": "PGM3ZC5lNWZlOTFjLjMzOTYyZThiQGFvbC5jb20+"
    },
    "prevInTopic": 3,
    "nextInTopic": 6,
    "prevInTime": 3,
    "nextInTime": 5,
    "topicId": 3,
    "numMessagesInTopic": 4,
    "msgSnippet": "Outstanding work! I see you have the first gen of the Micro-Trak ( although we still sell them for people with TT3 SMT s) How long will a 9 volt run your GPS? ",
    "rawEmail": "Return-Path: &lt;VHSProducts@...&gt;\r\nX-Sender: VHSProducts@...\r\nX-Apparently-To: MicroTrak@yahoogroups.com\r\nReceived: (qmail 18487 invoked from network); 5 Jun 2007 03:13:19 -0000\r\nReceived: from unknown (66.218.67.36)\n  by m50.grp.scd.yahoo.com with QMQP; 5 Jun 2007 03:13:19 -0000\r\nReceived: from unknown (HELO imo-m23.mx.aol.com) (64.12.137.4)\n  by mta10.grp.scd.yahoo.com with SMTP; 5 Jun 2007 03:13:19 -0000\r\nReceived: from VHSProducts@...\n\tby imo-m23.mx.aol.com (mail_out_v38_r9.2.) id r.c7d.e5fe91c (29679)\n\t for &lt;MicroTrak@yahoogroups.com&gt;; Mon, 4 Jun 2007 23:12:11 -0400 (EDT)\r\nMessage-ID: &lt;c7d.e5fe91c.33962e8b@...&gt;\r\nDate: Mon, 4 Jun 2007 23:12:11 EDT\r\nTo: MicroTrak@yahoogroups.com\r\nMIME-Version: 1.0\r\nContent-Type: multipart/alternative; boundary=&quot;-----------------------------1181013131&quot;\r\nX-Mailer: 9.0 Security Edition for Windows sub 5365\r\n(snip)"
}

…while the latter looks like:

{
    "userId": 185744666,
    "authorName": "vhsproducts@aol.com",
    "from": "vhsproducts@...",
    "profile": "vhsproducts",
    "replyTo": "LIST",
    "senderId": "oChpSVZSELyeHvFRyDX_nG5dfpdVZTLBKFMDvOg33fSsrDk5l-zpPohl42rhz6OhM9tFfSjAxxGsRg",
    "spamInfo": {
        "isSpam": false,
        "reason": "0"
    },
    "subject": "Re: [MicroTrak] Mint-Trak300 completed",
    "postDate": "1181013131",
    "msgId": 4,
    "canDelete": false,
    "contentTrasformed": false,
    "systemMessage": false,
    "headers": {
        "messageIdInHeader": "PGM3ZC5lNWZlOTFjLjMzOTYyZThiQGFvbC5jb20+"
    },
    "prevInTopic": 3,
    "nextInTopic": 6,
    "prevInTime": 3,
    "nextInTime": 5,
    "topicId": 3,
    "numMessagesInTopic": 4,
    "msgSnippet": "Outstanding work! I see you have the first gen of the Micro-Trak ( although we still sell them for people with TT3 SMT s) How long will a 9 volt run your GPS? ",
    "messageBody": "<div id=\"ygrps-yiv-810547383\">\n<html><head>\n \n</head> \n\n<font id=\"ygrps-yiv-810547383role_document\"\n face=\"Arial\" color=\"#000000\" size=\"2\">\n<div>Outstanding work! I see you have the first gen of the Micro-Trak ( although \nwe still sell them for people with TT3 SMT&#39;s) How long will a 9 volt run your \nGPS?</div>\n(snip)",
    "specialLinks": []
}

Depending on the group, there can be tens or even hundreds of thousands of these files. Yahoo, being Yahoo, masks the email addresses from “normal” users–group owners can see them, and maybe moderators, but the rest can’t. Now to see if there’s a relatively-straightforward way to bulk-import these into a Discourse instance, or if it’d be better to use the tools I mentioned above.
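As a starting point for a bulk import, here is a minimal Ruby sketch that reads one of the “raw” JSON documents shown above and pulls out the fields an importer would most likely need. The field names come from the sample; the helper name parse_raw_message and the shape of the returned hash are my own assumptions, not part of the archiver or of any Discourse script.

```ruby
require "json"
require "cgi"

# Parse one "raw" message document from the archiver.
# parse_raw_message and the returned hash layout are hypothetical.
def parse_raw_message(json_text)
  data = JSON.parse(json_text)
  {
    id: data["msgId"],
    topic_id: data["topicId"],
    # Subjects and raw emails are stored with HTML-escaped characters.
    subject: CGI.unescapeHTML(data["subject"].to_s),
    # postDate is a Unix timestamp stored as a string.
    posted_at: Time.at(Integer(data["postDate"])).utc,
    raw_email: CGI.unescapeHTML(data["rawEmail"].to_s)
  }
end
```

Usage would be something like parse_raw_message(File.read(path)) for each raw file. Note that the addresses inside rawEmail are still masked (user@...) unless the download was done by an owner or moderator.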

Files and Photos are also downloaded by this tool, along with polls, calendars, and other stuff that I don’t really care about but no doubt others would.

One other point–a more careful reading of Yahoo’s message indicates that not only are they getting rid of files and photos, they’re also doing away with message archives. That’s really going to make them useless for any purpose.

3 Likes

Please do migrate to Discourse.

We need more independent websites and fewer people relying on the major monopoly platforms.

Just to say that I migrated my local community from a Yahoo Group to Discourse years ago and we have never looked back. The personal touches you can then add to your shared communication resource are worth it alone, but the additional features are icing on the cake.

Unfortunately I can’t offer you useful experience from migration as we simply started anew, save for the email list. Why not just leave the old Yahoo Group site up and provide a link? How many attachments do you really need to keep? Target the most important ones?

Best of luck, you’ll be fine!

3 Likes

Not my call directly, but I have inclinations that way. And for the group I’m most concerned about, I don’t expect the files/photos to be a major problem–I now have them all downloaded, and there are few enough of them that manually pulling them into topics shouldn’t be a big problem.

Yes, because we’re seeing right now one of the risks of that.

Because in six weeks, all the data is disappearing from there.

4 Likes

I can write an importer to read those JSON files, but I cannot compete with $200. I typically charge 10x that to write an importer and import a moderate-sized forum (a few hundred thousand posts).

2 Likes

So it sounds like I’d be better off using:
https://github.com/jonbartlett/yahoo-groups-export
followed by:
https://github.com/discourse/discourse/blob/master/script/import_scripts/yahoogroup.rb
…once I figure them out.

2 Likes

(major edits below–second attempt)

Working on the message import process, using the instructions at Migrating to Discourse from another Forum software. As I understand them, the process should look like:

  • Set up development environment using Beginners Guide to Install Discourse on Ubuntu for Development
  • Install MongoDB on that system
  • On that system, as the same non-privileged user that’s running Discourse, git clone the yahoo-group-export script
  • As the same user, gem install mechanize; gem install mongo. Then edit .config.yaml to give the Yahoo credentials and group name, and run ruby bin/yg-export.rb.
  • Have a cup (or two) of your beverage of choice.
  • Once yg-export finishes, in the Discourse directory, take a look at script/import_scripts/yahoogroup.rb. Edit it to point to the correct MONGODB_HOST (localhost).
  • In the discourse directory, run bundle exec ruby script/import_scripts/yahoogroup.rb
  • Verify it’s imported correctly
  • Back up and restore to a live server

Steps 2-4 are inferred. But do these look like the right steps to follow? Believing they were, I proceeded. It ran fine through step 4–yg-export.rb ran for about an hour, reporting SUCCESS for everything, and saving ~38k messages. The database syncro is present with ~85MB of data in it. At that point, I took a snapshot of the VM.

I’m having trouble with the import script, though. When I run bundle exec ruby script/import_scripts/yahoogroup.rb, I get this:

dan@ubuntu:~/discourse$ bundle exec ruby script/import_scripts/yahoogroup.rb
Traceback (most recent call last):
script/import_scripts/yahoogroup.rb: Bootsnap::LoadPathCache::FallbackScan
        7: from script/import_scripts/yahoogroup.rb:4:in `<main>'
        6: from /home/dan/.rbenv/versions/2.6.2/lib/ruby/gems/2.6.0/gems/zeitwerk-2.1.10/lib/zeitwerk/kernel.rb:23:in `require'
        5: from /home/dan/.rbenv/versions/2.6.2/lib/ruby/gems/2.6.0/gems/bootsnap-1.4.4/lib/bootsnap/load_path_cache/core_ext/kernel_require.rb:26:in `require'
        4: from /home/dan/.rbenv/versions/2.6.2/lib/ruby/gems/2.6.0/gems/bootsnap-1.4.4/lib/bootsnap/load_path_cache/core_ext/kernel_require.rb:40:in `rescue in require'
        3: from /home/dan/.rbenv/versions/2.6.2/lib/ruby/gems/2.6.0/gems/bootsnap-1.4.4/lib/bootsnap/load_path_cache/core_ext/kernel_require.rb:21:in `require_with_bootsnap_lfi'
        2: from /home/dan/.rbenv/versions/2.6.2/lib/ruby/gems/2.6.0/gems/bootsnap-1.4.4/lib/bootsnap/load_path_cache/loaded_features_index.rb:89:in `register'
        1: from /home/dan/.rbenv/versions/2.6.2/lib/ruby/gems/2.6.0/gems/bootsnap-1.4.4/lib/bootsnap/load_path_cache/core_ext/kernel_require.rb:22:in `block in require_with_bootsnap_lfi'
/home/dan/.rbenv/versions/2.6.2/lib/ruby/gems/2.6.0/gems/bootsnap-1.4.4/lib/bootsnap/load_path_cache/core_ext/kernel_require.rb:22:in `require': cannot load such file -- mongo (LoadError)

Strange, I thought I’d already installed the mongo gem. Well, I’ll do it again:

dan@ubuntu:~/discourse$ gem install mongo
Successfully installed mongo-2.10.2
Parsing documentation for mongo-2.10.2
Done installing documentation for mongo after 4 seconds
1 gem installed

Run the import script again, same result. Does it matter if I install it at the system level?

dan@ubuntu:~/discourse$ sudo apt install ruby-mongo
[sudo] password for dan: 
Reading package lists... Done
Building dependency tree       
Reading state information... Done
ruby-mongo is already the newest version (2.5.1-1).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.

Curiouser and curiouser…

You need to add the gems to Gemfile and run bundle install, otherwise running bundle exec ruby script/import_scripts/yahoogroup.rb won’t find the gems.
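For anyone following along, the change would look roughly like the fragment below. It is a sketch only: mongo is the one gem the LoadError above names, and any other gems you require in the script would need the same treatment.

```ruby
# Gemfile in the Discourse checkout (fragment, not the whole file).
# Added only for running the Yahoo import on a development machine.
gem "mongo" # required by script/import_scripts/yahoogroup.rb
```

After saving the Gemfile, run bundle install in the discourse directory and then run the import script again with bundle exec. Gems installed with a plain gem install or via apt are not visible to bundle exec unless they are listed in the Gemfile.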

6 Likes

That was the piece I was missing, now the import is running merrily along. Thanks!

Edit: OK, the import process ran 75 minutes, and the posts are now there. Excellent. It also created users, which I’d wondered about. However, I’m seeing a couple of problems with users:

  • It seems that all the Yahoo user names imported correctly (I recognize many of them from my membership in the list), but attached themselves to the wrong messages. They did this consistently–all of the messages that were posted by me now appear as having been posted by the same other user–but nonetheless it’s a significant error that would be a major pain to clean up manually.
  • All the imported users are suspended for the next 200 years.

I suspect both of these stem from the lack of valid email addresses in the data downloaded from Yahoo, which is because I’m not a group admin–that’s explicitly stated as the cause for the latter issue, but I don’t know if it would cause the former as well. Thoughts?

If that’s the issue, it gives me something to run down, but also a potential problem–I know there are two living moderators for the group, but the owner died within the past year. Hopefully someone has access to that information…

5 Likes

Hello,

There is a new plugin for that.

I have exported all the Yahoo! messages and imported them into a development instance as a category with these scripts.

After that there was the problem with duplicate or wrong users.
With this plugin: Merge Users Plugin
you can easily merge the Yahoo users with your Discourse users.

And now the only problem we have left is the attachments from Yahoo.

2 Likes

That doesn’t quite do what I’d need–the only users present on this instance, other than the admin user, are the users imported from Yahoo. The problem is that the wrong username is attached to the wrong posts–my posts are connected (consistently, AFAICT) to someone else’s username, and someone else’s posts are connected to mine.

I have now been given moderator access to this group, which is enough to download messages with actual email addresses attached. I’ll rebuild (VMs are great for that), redo the import, and see if that addresses the issue.

1 Like

Hey, okay, that’s not the same as what I found. In my community, none of the imported posts were connected to anyone in the existing community.
When I imported the category as described in the linked topic, I only had duplicate users (or some new ones) in the user list.

The importer assigned the posts to the wrong user, but it did so consistently. I mean, if the posts that I wrote myself in Yahoo! were assigned to “Hans”, then all of my posts were assigned to “Hans”.

I have user ID 1 in my community forum, but that’s not the same name as in the Discourse development forum I set up. So my account was not overwritten, but there was another account with the same name, and that account is linked to the wrong posts.

Now I use the linked plugin to merge each user separately into the right person in my forum. It doesn’t take much time, but it’s hard to figure out which post belongs to which user.

1 Like

We’re possibly in different situations–in my case, there’s no “existing community.” Rather, everything is being imported from the Y group.

OK, I have the download of the group messages including full email addresses. And after some issues with the dev environment, I’ve been able to work on the import again. And I’m still noticing a few issues:

  • The issue with the wrong usernames being connected with the wrong posts remains.
  • Possibly as the (or “a”) cause of the above, most of the users being imported are determined to have invalid email addresses. In the Mongo database generated by the yahoo-export script, the From field (from which the import script appears to be attempting to read email addresses) reads for most users like:
First Last &lt;user@domain.com&gt;

…which Discourse rejects as an invalid email address. As a result, most users are assigned an email address like 5dc3e1b4f4d821bd7de3ce456eaf26d5@email.invalid–exceptions appear to be those users who sent emails without their full names attached.

  • The imported messages contain a number of HTML entities, particularly for quotes and greater-than and less-than signs.

  • Many, but definitely not all, of the imported messages have a group name in the subject line: Re: [SpareOom] some subject. It’d be nice to remove those.
    For the last three bullets, I’m wondering if a simple find/replace throughout the database would take care of it–and if so, how to go about doing that bearing in mind that I’ve never touched MongoDB before.

  • A separate matter is importing the messages into a designated category. The comments at the top of yahoogroup.rb say that you can do export CATEGORY_ID=<CATEGORY_ID> before running the script to do this, but don’t indicate what <CATEGORY_ID> refers to. I’ve tried the regular name of the category, as well as the “category slug” (both are the same except for capitalization), but in both cases the import script fails with:

         1: from /home/dan/discourse/lib/topic_creator.rb:36:in `create'
/home/dan/discourse/lib/topic_creator.rb:115:in `setup_topic_params': category (Discourse::InvalidParameters)
1 Like

That sounds a lot like my first mbox import. It took me a couple of months.

Yes, you can probably fix some stuff with some replacements.

If you put .json at the end of a category url you can find a category id. It’s an integer.
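Since the ID is an integer, one cheap safeguard is to check the environment value before the import starts, rather than failing inside topic creation. A minimal sketch; parse_category_id is a hypothetical helper and is not part of yahoogroup.rb:

```ruby
# Convert a CATEGORY_ID value to an integer, failing early with a
# readable message if a name or slug was supplied instead.
# parse_category_id is a hypothetical helper.
def parse_category_id(value)
  Integer(value.to_s, 10)
rescue ArgumentError
  raise ArgumentError,
        "CATEGORY_ID must be a numeric category id, got #{value.inspect}"
end
```

It would be called as parse_category_id(ENV["CATEGORY_ID"]) near the top of the script. Passing the category name or slug raises immediately with a message that says what was expected.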

You’ll need to look at what the user creator is using for an identifier and what is getting used by the post function to find users. Or maybe they just don’t match.

1 Like

Looking through yahoogroup.rb, it’s definitely expecting the From field in the message to be a bare email address. Since most users configure their email clients to send a name as well (e.g., Fred Flintstone <fred@flintstone.com>), that’s problem #1. A bit of Googling suggests that this can be addressed using the Mail gem, which would change that line in the import script to read:

        email: Mail::ToField.new(user_info["ygData"]["from"]), # mandatory

…which would extract only the email address. But as noted above, the angle brackets are stored as HTML entities instead, which breaks this method. Further Googling tells me there’s an HTMLEntities gem that would take care of it, leading me to try this:

        email: Mail::ToField.new(HTMLEntities.new.decode(user_info["ygData"]["from"])), # mandatory

But that’s failing due to lack of a downcase method.

Edit: I tried to avoid that by going a different route; I saw lots of suggestions for Nokogiri. But however useful it is, the suggestions I found didn’t decode the angle bracket entities, which was (and is) my most immediate need. So back to HTMLEntities I go. I added require 'mail' and require 'htmlentities' to the top of the Yahoo import script, and changed line 75 (was line 73 before I added the requires) to read as above. I’m still getting an error, but what I’d missed previously is that it does actually properly parse and import one user before it dies:

dan@ubuntu:~/discourse$ bundle exec ruby script/import_scripts/yahoogroup.rb
Loading existing groups...
Loading existing users...
Loading existing categories...
Loading existing posts...
Loading existing topics...
(snip)
connected to db....

Importing from Mongodb....

Importing users
User created: user@host.tld
Traceback (most recent call last):
        8: from script/import_scripts/yahoogroup.rb:163:in `<main>'
        7: from /home/dan/discourse/script/import_scripts/base.rb:47:in `perform'
        6: from script/import_scripts/yahoogroup.rb:39:in `execute'
        5: from script/import_scripts/yahoogroup.rb:58:in `import_users'
        4: from /home/dan/discourse/script/import_scripts/base.rb:247:in `create_users'
        3: from /home/dan/discourse/script/import_scripts/base.rb:247:in `each'
        2: from /home/dan/discourse/script/import_scripts/base.rb:259:in `block in create_users'
        1: from /home/dan/discourse/script/import_scripts/base.rb:290:in `create_user'
/home/dan/discourse/script/import_scripts/base.rb:385:in `find_existing_user': undefined method `downcase' for #<Mail::ToField:0x00005575597e63b8> (NoMethodError)

(the email address in this output is masked, but it’s in the source database with the full name and the entities for the angle brackets–so it appears my changes to the script had exactly the desired effect). This has me a little confused, as I’d understood downcase should be available by default.
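The NoMethodError names a Mail::ToField object, so what reaches find_existing_user is the field object and not a String, and downcase is a String method. One way around that is to hand the importer a plain string. The sketch below does it with only the standard library, swapping the Mail and HTMLEntities gems for CGI.unescapeHTML and a loose regex; extract_email is a hypothetical helper, and the regex is not a full RFC 5322 parser.

```ruby
require "cgi"

# Return the bare, lowercased address from a From value such as
# "First Last &lt;user@domain.com&gt;", or nil if none is found.
# extract_email is a hypothetical helper, not part of yahoogroup.rb.
def extract_email(from_field)
  decoded = CGI.unescapeHTML(from_field.to_s)
  # Prefer an address inside angle brackets, else any bare address.
  match = decoded.match(/<\s*([^<>\s]+@[^<>\s]+)\s*>/) ||
          decoded.match(/([^<>\s"]+@[^<>\s"]+)/)
  match && match[1].downcase
end
```

In the import script the email line would then become something like email: extract_email(user_info["ygData"]["from"]). If you would rather stay with the Mail gem, Mail::Address.new(decoded_string).address returns a plain String as well.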

Edit 2: Well, it parses the user, but doesn’t actually import the user into the Discourse instance.

3 Likes

The email thing still has me stumped, but I decided to leave that aside for the time being and see if I could apply HTMLEntities to the topic title and message text. In the yahoogroup.rb script, I changed line 110 to read:

        topic_title = HTMLEntities.new.decode(topic_post["ygData"]["subject"])

…and line 116 to read:

        raw: HTMLEntities.new.decode(topic_post["ygData"]["messageBody"]),

(both line numbers are +2 compared to the original script, due to my adding the two requires lines I mentioned above). It worked perfectly. The terminal output isn’t changed (that would have been on line 105, which I didn’t notice until I’d started it running), but the topic titles and text in the imported instance are nice and clean.
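The same place in the script would be a natural spot to drop the [GroupName] tag from subjects. A minimal sketch, assuming the tag is always the group name in square brackets; strip_list_tag and its group_name parameter are hypothetical:

```ruby
# Remove every "[GroupName]" tag from a subject line, keeping any
# "Re:" prefix. strip_list_tag is a hypothetical helper.
def strip_list_tag(subject, group_name)
  tag = /\[#{Regexp.escape(group_name)}\]\s*/i
  subject.to_s.gsub(tag, "").strip
end
```

For example, strip_list_tag("Re: [SpareOom] some subject", "SpareOom") gives "Re: some subject". It would be applied to the decoded subject before assigning topic_title.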

So this method appears to be working perfectly in cleaning up topic titles and message bodies, but it isn’t working for the email addresses. Any ideas what I should be looking for on that front? I’m kind of stumped there.

2 Likes

It might be easier to use Importing mailing lists (mbox, Listserv, Google Groups, emails, ...) to import data from Yahoo Groups. Either by using mbox files or maybe by converting the JSON files you mentioned into individual MSG files containing the raw email text.

The mbox import script handles mbox files as well as emails stored in individual files and it might already have solved all the problems you are encountering right now.
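A rough sketch of that JSON-to-individual-files conversion, using only the Ruby standard library. The rawEmail field name comes from the sample earlier in the topic; the directory layout, the .eml extension, and the helper names are assumptions, and I have not checked what file naming the mbox importer expects.

```ruby
require "json"
require "cgi"
require "fileutils"

# Return the decoded raw email text from one archiver JSON document,
# or nil if the document has no rawEmail field (e.g. the HTML variant).
def raw_email_from_json(json_text)
  raw = JSON.parse(json_text)["rawEmail"]
  return nil if raw.nil? || raw.empty?
  # Undo the HTML escaping, then normalise the mixed line endings.
  CGI.unescapeHTML(raw).gsub(/\r?\n/, "\n")
end

# Write one .eml file per JSON file in src_dir that contains a raw email.
# Returns the number of files written. Names and layout are assumptions.
def convert_directory(src_dir, dest_dir)
  FileUtils.mkdir_p(dest_dir)
  written = 0
  Dir.glob(File.join(src_dir, "*.json")).sort.each do |path|
    email = raw_email_from_json(File.read(path))
    next unless email
    target = File.join(dest_dir, File.basename(path, ".json") + ".eml")
    File.write(target, email)
    written += 1
  end
  written
end
```

Keep in mind that the addresses in rawEmail are masked unless the archive was downloaded by an owner or moderator, so the user-matching problem would remain for archives downloaded by regular members.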

4 Likes