Migration from Yahoo! Groups

Please do migrate to Discourse.

We need more independent websites and fewer people relying on the major monopoly platforms.

Just to say that I migrated my local community from a Yahoo Group to Discourse years ago and we have never looked back. The personal touches you can then add to your shared communication resource are worth it alone, but the additional features are icing on the cake.

Unfortunately I can’t offer you useful experience from migration as we simply started anew, save for the email list. Why not just leave the old Yahoo Group site up and provide a link? How many attachments do you really need to keep? Target the most important ones?

Best of luck, you’ll be fine!

3 Likes

Not my call directly, but I have inclinations that way. And for the group I’m most concerned about, I don’t expect the files/photos to be a major problem–I now have them all downloaded, and there are few enough of them that manually pulling them into topics shouldn’t be a big problem.

Yes, because we’re seeing right now one of the risks of that.

Because in six weeks, all the data is disappearing from there.

4 Likes

I can write an importer to read those JSON files, but I cannot compete with $200. I typically charge 10x that to write an importer and import a moderately sized forum (a few hundred thousand posts).

2 Likes

So it sounds like I’d be better off using:
https://github.com/jonbartlett/yahoo-groups-export
followed by:
https://github.com/discourse/discourse/blob/master/script/import_scripts/yahoogroup.rb
…once I figure them out.

2 Likes

(major edits below–second attempt)

Working on the message import process, using the instructions at Migrating to Discourse from another Forum software. As I understand them, the process should look like:

  • Set up development environment using Beginners Guide to Install Discourse on Ubuntu for Development
  • Install MongoDB on that system
  • On that system, as the same non-privileged user that’s running Discourse, git clone the yahoo-group-export script
  • As the same user, gem install mechanize; gem install mongo. Then edit .config.yaml to give the Yahoo credentials and group name, and run ruby bin/yg-export.rb.
  • Have a cup (or two) of your beverage of choice.
  • Once yg-export finishes, in the Discourse directory, take a look at script/import_scripts/yahoogroup.rb. Edit it to point to the correct MONGODB_HOST (localhost).
  • In the discourse directory, run bundle exec ruby script/import_scripts/yahoogroup.rb
  • Verify it’s imported correctly
  • Backup and restore to a live server

Steps 2-4 are inferred. But do these look like the right steps to follow? Believing they were, I proceeded. It ran fine through step 4–yg-export.rb ran for about an hour, reporting SUCCESS for everything, and saving ~38k messages. The database syncro is present with ~85MB of data in it. At that point, I took a snapshot of the VM.

I’m having trouble with the import script, though. When I run bundle exec ruby script/import_scripts/yahoogroup.rb, I get this:

dan@ubuntu:~/discourse$ bundle exec ruby script/import_scripts/yahoogroup.rb
Traceback (most recent call last):
script/import_scripts/yahoogroup.rb: Bootsnap::LoadPathCache::FallbackScan
        7: from script/import_scripts/yahoogroup.rb:4:in `<main>'
        6: from /home/dan/.rbenv/versions/2.6.2/lib/ruby/gems/2.6.0/gems/zeitwerk-2.1.10/lib/zeitwerk/kernel.rb:23:in `require'
        5: from /home/dan/.rbenv/versions/2.6.2/lib/ruby/gems/2.6.0/gems/bootsnap-1.4.4/lib/bootsnap/load_path_cache/core_ext/kernel_require.rb:26:in `require'
        4: from /home/dan/.rbenv/versions/2.6.2/lib/ruby/gems/2.6.0/gems/bootsnap-1.4.4/lib/bootsnap/load_path_cache/core_ext/kernel_require.rb:40:in `rescue in require'
        3: from /home/dan/.rbenv/versions/2.6.2/lib/ruby/gems/2.6.0/gems/bootsnap-1.4.4/lib/bootsnap/load_path_cache/core_ext/kernel_require.rb:21:in `require_with_bootsnap_lfi'
        2: from /home/dan/.rbenv/versions/2.6.2/lib/ruby/gems/2.6.0/gems/bootsnap-1.4.4/lib/bootsnap/load_path_cache/loaded_features_index.rb:89:in `register'
        1: from /home/dan/.rbenv/versions/2.6.2/lib/ruby/gems/2.6.0/gems/bootsnap-1.4.4/lib/bootsnap/load_path_cache/core_ext/kernel_require.rb:22:in `block in require_with_bootsnap_lfi'
/home/dan/.rbenv/versions/2.6.2/lib/ruby/gems/2.6.0/gems/bootsnap-1.4.4/lib/bootsnap/load_path_cache/core_ext/kernel_require.rb:22:in `require': cannot load such file -- mongo (LoadError)

Strange, I thought I’d already installed the mongo gem. Well, I’ll do it again:

dan@ubuntu:~/discourse$ gem install mongo
Successfully installed mongo-2.10.2
Parsing documentation for mongo-2.10.2
Done installing documentation for mongo after 4 seconds
1 gem installed

Run the import script again, same result. Does it matter if I install it at the system level?

dan@ubuntu:~/discourse$ sudo apt install ruby-mongo
[sudo] password for dan: 
Reading package lists... Done
Building dependency tree       
Reading state information... Done
ruby-mongo is already the newest version (2.5.1-1).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.

Curiouser and curiouser…

You need to add the gems to Gemfile and run bundle install, otherwise running bundle exec ruby script/import_scripts/yahoogroup.rb won’t find the gems.
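
A minimal sketch of that change, assuming the only missing dependency is the mongo gem named in the LoadError above (any other gem the script requires would be added the same way):

```ruby
# Gemfile in the Discourse checkout.
# Assumption: added temporarily, only for running the import script.
gem "mongo"
```

After saving the Gemfile, run bundle install in the Discourse directory, then run the import script again with bundle exec.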

6 Likes

That was the piece I was missing, now the import is running merrily along. Thanks!

Edit: OK, the import process ran 75 minutes, and the posts are now there. Excellent. It also created users, which I’d wondered about. However, I’m seeing a couple of problems with users:

  • It seems that all the Yahoo user names imported correctly (I recognize many of them from my membership in the list), but attached themselves to the wrong messages. They did this consistently–all of the messages that were posted by me now appear as having been posted by the same other user–but nonetheless it’s a significant error that would be a major pain to clean up manually.
  • All the imported users are suspended for the next 200 years.

I suspect both of these stem from the lack of valid email addresses in the data downloaded from Yahoo, which is because I’m not a group admin–that’s explicitly stated as the cause for the latter issue, but I don’t know if it would cause the former as well. Thoughts?

If that’s the issue, it gives me something to run down, but also a potential problem–I know there are two living moderators for the group, but the owner died within the past year. Hopefully someone has access to that information…

5 Likes

Hello,

There is a new plugin for that.

I have exported and imported all the Yahoo! messages from a development version as a category with this script.

That left the problem of duplicate or wrong users.
With this plugin: Merge Users Plugin
you can easily merge the Yahoo users with your Discourse users.

Now the only remaining problem is the attachments from Yahoo.

2 Likes

That doesn’t quite do what I’d need–the only users present on this instance, other than the admin user, are the users imported from Yahoo. The problem is that the wrong username is attached to the wrong posts–my posts are connected (consistently, AFAICT) to someone else’s username, and someone else’s posts are connected to mine.

I have now been given moderator access to this group, which is enough to download messages with actual email addresses attached. I’ll rebuild (VMs are great for that), redo the import, and see if that addresses the issue.

1 Like

Hey, okay, that’s not the same as what I found. In my community, none of the imported posts were connected to anyone in the existing community.
When I imported the category as described in the linked topic, I only ended up with duplicate users (or new ones) in the user list.

The importer assigned the posts to the wrong user, but it did so consistently. I mean, if the posts that I wrote myself in Yahoo! were assigned to “Hans”, then all of my posts were assigned to “Hans”.

I have user ID 1 in my community forum, but under a different name than in the development Discourse forum I set up. So my account was not overwritten, but there was another account with the same name, and that account is linked to the wrong posts.

Now I use the linked plugin to merge each user individually into the right person in my forum. It doesn’t take much time, but it’s hard to figure out which post belongs to which user.

1 Like

We’re possibly in different situations–in my case, there’s no “existing community.” Rather, everything is being imported from the Y group.

OK, I have the download of the group messages including full email addresses. And after some issues with the dev environment, I’ve been able to work on the import again. And I’m still noticing a few issues:

  • The issue with the wrong usernames being connected with the wrong posts remains.
  • Possibly as the (or “a”) cause of the above, most of the users being imported are determined to have invalid email addresses. In the Mongo database generated by the yahoo-export script, the From field (from which the import script appears to be attempting to read email addresses) reads for most users like:
First Last &lt;user@domain.com&gt;

…which Discourse rejects as an invalid email address. As a result, most users are assigned an email address like 5dc3e1b4f4d821bd7de3ce456eaf26d5@email.invalid–exceptions appear to be those users who sent emails without their full names attached.

  • The imported messages contain a number of HTML entities, particularly for quotes and greater-than and less-than signs.

  • Many, but definitely not all, of the imported messages have a group name in the subject line: Re: [SpareOom] some subject. It’d be nice to remove those.
    For the last three bullets, I’m wondering if a simple find/replace throughout the database would take care of it–and if so, how to go about doing that bearing in mind that I’ve never touched MongoDB before.

  • A separate matter is importing the messages into a designated category. The comments at the top of yahoogroup.rb say that you can run export CATEGORY_ID=<CATEGORY_ID> before running the script to do this, but they don’t indicate what <CATEGORY_ID> refers to. I’ve tried the regular name of the category, as well as the “category slug” (both are the same except for capitalization), but in both cases the import script fails with:

         1: from /home/dan/discourse/lib/topic_creator.rb:36:in `create'
/home/dan/discourse/lib/topic_creator.rb:115:in `setup_topic_params': category (Discourse::InvalidParameters)
1 Like

That sounds a lot like my first mbox import. It took me a couple of months.

Yes, you can probably fix some stuff with some replacements.
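
For the group name in subject lines, one option is to do the replacement in Ruby while importing rather than in MongoDB. A minimal sketch, assuming the tag is literally the group name in square brackets; strip_group_tag is a hypothetical helper, not something in yahoogroup.rb:

```ruby
# Hypothetical helper: remove a "[GroupName]" tag from a subject line.
# The default tag "SpareOom" is taken from the example subject in this thread.
def strip_group_tag(subject, tag = "SpareOom")
  # Regexp.escape guards against special characters in the group name;
  # the /i flag also matches differently capitalized tags.
  subject.gsub(/\[#{Regexp.escape(tag)}\]\s*/i, "").strip
end

puts strip_group_tag("Re: [SpareOom] some subject")
```

Subjects without the tag pass through unchanged, so it should be safe to apply to every message.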

If you append .json to a category URL, you can find the category ID in the response. It’s an integer.
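
Since a name or slug fails only deep inside topic creation, a small guard can catch a non-numeric value early. This is only a sketch: parse_category_id is a hypothetical helper, and how yahoogroup.rb itself reads CATEGORY_ID is not shown here.

```ruby
# Hypothetical guard, not part of yahoogroup.rb: fail early with a clear
# message when CATEGORY_ID is set but is not an integer.
def parse_category_id(value)
  # Treat an unset or blank variable as "no category requested".
  return nil if value.nil? || value.strip.empty?
  Integer(value, 10)
rescue ArgumentError
  raise ArgumentError, "CATEGORY_ID must be a numeric category id, got #{value.inspect}"
end

category_id = parse_category_id(ENV["CATEGORY_ID"])
```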

You’ll need to look at what the user creator is using for an identifier and what is getting used by the post function to find users. Or maybe they just don’t match.

1 Like

Looking through yahoogroup.rb, it’s definitely expecting the From field in the message to be a bare email address. Since most users configure their email clients to send a name as well (e.g.,

Fred Flintstone <fred@flintstone.com>

), that’s problem #1. A bit of Googling suggests that this can be addressed using the Mail gem, which would change that line in the import script to read:

        email: Mail::ToField.new(user_info["ygData"]["from"]), # mandatory

…which would extract only the email address. But as noted above, the angle brackets are stored as HTML entities instead, which breaks this method. Further Googling tells me there’s a HTMLEntities gem that would take care of it, leading me to try this:

        email: Mail::ToField.new(HTMLEntities.new.decode(user_info["ygData"]["from"])), # mandatory

But that’s failing due to lack of a downcase method.

Edit: I tried to avoid that by going a different route; I saw lots of suggestions for Nokogiri. But however useful it is, the suggestions I found didn’t decode the angle bracket entities, which was (and is) my most immediate need. So back to HTMLEntities I go. I added require 'mail' and require 'htmlentities' to the top of the Yahoo import script, and changed line 75 (was line 73 before I added the requires) to read as above. I’m still getting an error, but what I’d missed previously is that it does actually properly parse and import one user before it dies:

dan@ubuntu:~/discourse$ bundle exec ruby script/import_scripts/yahoogroup.rb
Loading existing groups...
Loading existing users...
Loading existing categories...
Loading existing posts...
Loading existing topics...
(snip)
connected to db....

Importing from Mongodb....

Importing users
User created: user@host.tld
Traceback (most recent call last):
        8: from script/import_scripts/yahoogroup.rb:163:in `<main>'
        7: from /home/dan/discourse/script/import_scripts/base.rb:47:in `perform'
        6: from script/import_scripts/yahoogroup.rb:39:in `execute'
        5: from script/import_scripts/yahoogroup.rb:58:in `import_users'
        4: from /home/dan/discourse/script/import_scripts/base.rb:247:in `create_users'
        3: from /home/dan/discourse/script/import_scripts/base.rb:247:in `each'
        2: from /home/dan/discourse/script/import_scripts/base.rb:259:in `block in create_users'
        1: from /home/dan/discourse/script/import_scripts/base.rb:290:in `create_user'
/home/dan/discourse/script/import_scripts/base.rb:385:in `find_existing_user': undefined method `downcase' for #<Mail::ToField:0x00005575597e63b8> (NoMethodError)

(the email address in this output is masked, but it’s in the source database with the full name and the entities for the angle brackets–so it appears my changes to the script had exactly the desired effect). This has me a little confused, as I’d understood downcase should be available by default.
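
Going by the traceback, the NoMethodError comes from base.rb calling downcase on the value passed as email, which here is a Mail::ToField object rather than a String. A minimal sketch of getting a plain String instead, using only the standard library; extract_email is a hypothetical helper, and the regular expressions are a rough approximation rather than a full address parser:

```ruby
require "cgi"

# Hypothetical helper: return a bare, lowercased address from a From value
# such as "First Last &lt;user@domain.com&gt;", or nil if none is found.
def extract_email(from_field)
  # Turn &lt; and &gt; (and other basic entities) back into characters.
  decoded = CGI.unescapeHTML(from_field.to_s)
  # Prefer an address inside angle brackets, else fall back to a bare address.
  match = decoded.match(/<([^<>\s]+@[^<>\s]+)>/) ||
          decoded.match(/([^\s<>"]+@[^\s<>"]+)/)
  match && match[1].downcase
end

puts extract_email("Fred Flintstone &lt;fred@flintstone.com&gt;")
```

Because the result is an ordinary String (or nil), it can be handed to code that expects to call downcase on it.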

Edit 2: Well, it parses the user, but doesn’t actually import the user into the Discourse instance.

3 Likes

The email thing still has me stumped, but I decided to leave that aside for the time being and see if I could apply HTMLEntities to the topic title and message text. In the yahoogroup.rb script, I changed line 110 to read:

        topic_title = HTMLEntities.new.decode(topic_post["ygData"]["subject"])

…and line 116 to read:

        raw: HTMLEntities.new.decode(topic_post["ygData"]["messageBody"]),

(both line numbers are +2 compared to the original script, due to my adding the two requires lines I mentioned above). It worked perfectly. The terminal output isn’t changed (that would have been on line 105, which I didn’t notice until I’d started it running), but the topic titles and text in the imported instance are nice and clean.

So this method appears to be working perfectly in cleaning up topic titles and message bodies, but it isn’t working for the email addresses. Any ideas what I should be looking for on that front? I’m kind of stumped there.

2 Likes

It might be easier to use Importing mailing lists (mbox, Listserv, Google Groups, emails, ...) to import data from Yahoo Groups. Either by using mbox files or maybe by converting the JSON files you mentioned into individual MSG files containing the raw email text.

The mbox import script handles mbox files as well as emails stored in individual files and it might already have solved all the problems you are encountering right now.

4 Likes

I don’t have mbox files, and I’m not aware of any way to get them–Yahoo certainly won’t let me download them. Do you know of something that would convert JSON to mbox? Google shows a number of tools for going in the other direction, but I don’t see anything that covers this quickly.

I’d expected that, since there were existing scripts designed to migrate Yahoo groups specifically, those scripts would actually work, and that would be the most straightforward way to accomplish this task. It appears my expectation was optimistic–the scripts “work” in that they migrate the messages, and they kind of migrate the users, but missing most of the email addresses and assigning most of the messages to the wrong user is a bit of a problem.

The thing that’s frustrating me is that it seems like this should be a trivial fix for someone who actually knows a thing or two about Ruby–but unfortunately I’m not such a person (I’m trying, but there’s never enough time for everything). My group is small enough that I can probably fix it manually if I need to–but I’d rather not need to, and even more to the point, I’m trying to come up with a general method that other Yahoo groups owners can use.

Edit: I guess I should be glad that I’m managing as much as I am in a language I really don’t know anything about, but I still feel like there’s something major (that should be obvious) that I’m missing. I’ve tried using a different method with the Mail gem. The portion of import_users that I’ve edited reads as follows:

    create_users(profiles.to_a) do |u|

      user_id = user_id + 1

      # fetch last message for profile to pickup latest user info as this may have changed
      user_info = @collection.find("ygData.profile": u["_id"]["profile"]).sort("ygData.msgId": -1).limit(1).to_a[0]

      # Store user_id to profile lookup
      @user_profile_map.store(user_info["ygData"]["profile"], user_id)

      puts "User created: #{user_info["ygData"]["profile"]}"
      
      user_email = Mail::Address.new(HTMLEntities.new.decode(user_info["ygData"]["from"]))

      user =
       {
        id: user_id,  # yahoo "userId" sequence appears to have changed mid forum life so generate this
        username: user_info["ygData"]["profile"],
        name: user_info["ygData"]["authorName"],
        email: user_email.address, # mandatory
        created_at: Time.now
      }
      user
    end

And it works! Well, mostly. Of 302 distinct users counted by the script, it imports 289. They show up on the admin page with the correct usernames, full names (when provided), and email addresses. The script says it imports all 302 and reports no errors. But when it starts importing topics, I get this:

Importing discussions
Topic: 1 / 12232  (0.01%)  Subject: Newspapers
Topic: 2 / 12232  (0.02%)  Subject: Ents
Traceback (most recent call last):
	8: from script/import_scripts/yahoogroup.rb:168:in `<main>'
	7: from /home/dan/discourse/script/import_scripts/base.rb:47:in `perform'
	6: from script/import_scripts/yahoogroup.rb:40:in `execute'
	5: from script/import_scripts/yahoogroup.rb:101:in `import_discussions'
	4: from script/import_scripts/yahoogroup.rb:101:in `each_with_index'
	3: from script/import_scripts/yahoogroup.rb:101:in `each'
	2: from script/import_scripts/yahoogroup.rb:132:in `block in import_discussions'
	1: from /home/dan/discourse/script/import_scripts/base.rb:535:in `create_post'
/home/dan/.rbenv/versions/2.6.2/lib/ruby/gems/2.6.0/gems/activerecord-6.0.0/lib/active_record/core.rb:177:in `find': Couldn't find User with 'id'=298 (ActiveRecord::RecordNotFound)

…which isn’t surprising, since the highest user id is 290.

2 Likes

Would Discourse have any logs that would indicate which users hadn’t been created and why? Where would those be?

1 Like

Emphasis added on my error. Turns out Yahoo does let you download them, but it’s a bit of a process, and nowhere does it tell you that mbox files are what you’ll get. Yahoo has a “Get my data” tool. Go there, log in, submit a request, and wait until they notify you (about a week for me). They’ll send you an email with a URL, where you’ll go to download a .zip file that appears to contain most of the contents of every group of which you’re a member (the photos appear to be missing). Somewhat surprisingly, the .mbox files contain full email addresses even for groups of which you aren’t a moderator.

So, @gerhard, it appears I was premature in disregarding your suggestion–my apologies.

Edit: Yes, the .mbox process seems to work much better. Some messages are getting skipped (~100 for the apparent lack of a date, for example), but almost all the 38k messages made it, all the users made it (and a spot check indicates they’re all associated with the correct posts), all with the correct email addresses. It isn’t perfect at keeping topics together (the other script wasn’t either), but it’s doing pretty well. And, as a bonus, it makes for a simpler method to document than what I’d been trying to do. Only downside I see so far is the delay for Yahoo to make your stuff available to download.

10 Likes

Wow! That’s pretty wild. I guess they figure that if you’ve been on the list, you already have the email addresses.

This is good news - I just did the download and it looks like I have a pretty comprehensive archive of messages for my yahoogroup that I’d like to hang on to, in handy and portable mbox format. Sweet!

6 Likes