Migrate a mailing list to Discourse (mbox, Listserv, Google Groups, etc)

justin · January 8, 2021, 4:34pm

Have you had a look at the database? My gut feeling is that, for some reason, the email field is not being created correctly there and thus can’t be read.

https://github.com/discourse/discourse/blob/f6e87e1e5ebdd6b6cfafa9e23cdd0a29190d1a7c/script/import_scripts/mbox/importer.rb#L72

      batches do |offset|
        rows, last_email = @database.fetch_users(last_email)
        break if rows.empty?

        next if all_records_exist?(:users, rows.map { |row| row['email'] })

        create_users(rows, total: total_count, offset: offset) do |row|
          {
            id: row['email'],
            email: row['email'],
            name: row['name'],
            trust_level: @settings.trust_level,
            staged: @settings.staged,
            active: !@settings.staged,
            created_at: to_time(row['date_of_first_message'])
          }
        end
      end
    end # closes the enclosing method (the excerpt starts mid-method)

See 2.3 in the OP for checking the index database.

Jonathan5 · May 14, 2021, 12:23pm

The Mailman 2 list that I am considering importing into Discourse has had (for part of its existence) from_is_list set to Munge From, so that the “From:” header is

From: Listname <listname-bounces@listdomain.com> On Behalf Of [Original sender's name]

instead of

From: [Original sender's name] <username@example.com>

This made me think the importer would import each of these messages as if from the same user (with email address listname-bounces@listdomain.com)… BUT…

The initial line marking the beginning of a new email in the mbox file still begins with

From username@example.com [Date time group]

(and the Hyperkitty archives show the original sender’s email address as normal also).

So my question is – does the importer script take the sender’s address from the “From:” header or the "From " line? Thank you.

I discussed this briefly in a previous topic: Working on a mailman2 to discourse migration script - #10 by dachary

gerhard · May 14, 2021, 12:33pm

It’s using the From: header.

Jonathan5 · May 14, 2021, 12:36pm

Thanks for the quick reply! How hard would it be to change this? Not necessarily officially – though it might help others – but just for me to change the script before running it. I don’t know any Ruby (yet!) but if it’s just changing a colon to a space…

gerhard · May 14, 2021, 12:46pm

It’s not a simple change, but it should be doable. You don’t necessarily have to implement it in the import script. If you know another scripting language, I’m sure it won’t be too hard to update the From: headers in the mbox files before running the import…

But, feel free to fix it in the import script. A PR is welcome!
A good starting point for fixing the header should be the each_mail method…

https://github.com/discourse/discourse/blob/e7892df10d6892c275e8da84eccbaa8beb555dd3/script/import_scripts/mbox/support/indexer.rb#L157-L177

    def each_mail(filename)
      raw_message = +''
      first_line_number = 1
      last_line_number = 0

      each_line(filename) do |line|
        if line.scrub =~ @split_regex
          if last_line_number > 0
            yield raw_message, first_line_number, last_line_number
            raw_message = +''
            first_line_number = last_line_number + 1
          end
        else
          raw_message << line
        end

        last_line_number += 1
      end

      yield raw_message, first_line_number, last_line_number if raw_message.present?
    end
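
For the other route (updating the From: headers in the mbox files before running the import), a minimal Ruby sketch could look like the one below. It is not part of the import scripts: `rewrite_from_headers` and `ENVELOPE_RE` are hypothetical names, and it assumes that every message starts with a `From address date` line, that body lines beginning with "From " are escaped (for example as ">From "), and that the From: header is not folded across several lines.

```ruby
# Hypothetical pre-processing helper, not part of the Discourse import scripts.
# Assumed shape of the mbox separator line: "From user@example.com <date>".
ENVELOPE_RE = /\AFrom (\S+@\S+)\s/

# Takes the lines of an mbox file and returns them with the address of each
# From: header replaced by the address from the preceding "From " line.
def rewrite_from_headers(lines)
  envelope_sender = nil
  in_headers = false

  lines.map do |line|
    text = line.scrub # avoid regex errors on invalid byte sequences
    if (match = ENVELOPE_RE.match(text))
      # Start of a new message: remember the envelope address.
      envelope_sender = match[1]
      in_headers = true
      line
    elsif in_headers && text.strip.empty?
      # The first blank line ends the header section.
      in_headers = false
      line
    elsif in_headers && text.match?(/\AFrom:/i)
      if text.include?("<")
        # Keep the display name, swap only the address in angle brackets.
        text.sub(/<[^>]*>/) { "<#{envelope_sender}>" }
      else
        "From: #{envelope_sender}\n"
      end
    else
      line
    end
  end
end
```

With placeholder file names, it could be used as `File.write("fixed.mbox", rewrite_from_headers(File.foreach("list.mbox")).join)`. Lines outside the header section are passed through unchanged, so a "From:" that appears in a message body is left alone.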

Jonathan5 · May 14, 2021, 1:26pm

Cheers. Looks like this is what currently decides it, from lines 69-70 of indexer.rb:

    parsed_email = receiver.mail
    from_email, from_display_name = receiver.parse_from_field(parsed_email)

Would it be possible at that point to obtain the first line of the mbox email (i.e. the “From [email address] [date time]” line) from parsed_email and extract the email address from that?

gerhard · May 14, 2021, 1:28pm

No, that line is filtered when the mbox is split into individual messages. You need to save that value in the each_mail method in order to use it later.

Jonathan5 · May 14, 2021, 9:22pm

I had some fun trying to do this, before spotting that Mailman stores the emails in the mbox in their original, unadulterated form, so that the “From:” line contains the same (original sender’s) email address as the "From " line in all cases, even when the email has been sent “From: listname-bounces@listname.domain.com”.

I was limited by not having a development Discourse installation, or even Ruby, but was able to make some headway with https://rubular.com/ and https://replit.com/languages/ruby (and DuckDuckGo). If you would be willing to have a look at it, I’d be grateful if you’d let me know whether this would have worked (or nearly worked) had it been necessary…

    def each_mail(filename)
      raw_message = +''
      first_line_number = 1
      last_line_number = 0

      each_line(filename) do |line|
        if line.scrub =~ @split_regex
          if last_line_number > 0
            #We're at the start of the NEXT email now
            yield raw_message, first_line_number, last_line_number
            raw_message = +''
            first_line_number = last_line_number + 1
          else
            #We're at the start of THIS email now, so get the email address 
            new_email = line.match(/^From (\S+@\S+).*/).captures
          end
        else
          raw_message << line
        end

        last_line_number += 1
      end

      #Get old email ("From:" line) 
      old_email = raw_message.match(/^From: .*<(\S+@\S+)>/).captures

      #Put "From " address into "From:" line
      raw_message = raw_message.sub(old_email, new_email)

      yield raw_message, first_line_number, last_line_number if raw_message.present?
    end

gerhard · May 14, 2021, 9:31pm

Well, let’s call it nearly…
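
As far as I can tell, a few things would stop the attempt above from running as written: `new_email` is first assigned inside the block, so it is not visible after the loop; `captures` returns an Array, which `String#sub` does not accept; the `else` branch only fires for the very first "From " line; and the substitution after the loop would only touch the last message. Below is a stand-alone sketch of the same idea. `each_mail_with_envelope_sender` and `apply_envelope_sender` are hypothetical names, the lines and the separator regex are passed in (the importer gets them from `each_line(filename)` and `@split_regex`), and the messages are assumed to contain only valid byte sequences.

```ruby
# Replaces the address in the first "From:" line with the envelope address.
# Headers without angle brackets are left unchanged in this sketch.
def apply_envelope_sender(raw_message, email)
  return raw_message if email.nil?

  raw_message.sub(/^From:.*$/i) do |header|
    header.sub(/<[^>]*>/) { "<#{email}>" }
  end
end

# Hypothetical stand-alone variant of each_mail that remembers the address
# from each "From " separator line and applies it to the message it opens.
def each_mail_with_envelope_sender(lines, split_regex)
  raw_message = +''
  envelope_email = nil # declared outside the block so it survives iterations
  first_line_number = 1
  last_line_number = 0

  lines.each do |line|
    scrubbed = line.scrub
    if scrubbed =~ split_regex
      if last_line_number > 0
        # Start of the NEXT email: emit the one collected so far.
        yield apply_envelope_sender(raw_message, envelope_email),
              first_line_number, last_line_number
        raw_message = +''
        first_line_number = last_line_number + 1
      end
      # Runs for EVERY separator line; [regex, 1] returns a String or nil.
      envelope_email = scrubbed[/\AFrom (\S+@\S+)/, 1]
    else
      raw_message << line
    end

    last_line_number += 1
  end

  unless raw_message.empty?
    yield apply_envelope_sender(raw_message, envelope_email),
          first_line_number, last_line_number
  end
end
```

The line numbers are counted the same way as in the original method, so only the yielded message text differs.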

Jonathan5 · May 14, 2021, 9:39pm

Haha… “So you’re telling me there’s a chance!?”

dachary · June 10, 2021, 3:45pm

After a successful import of mail archives (mbox), the content of the messages will display email addresses that would have been obfuscated by Gmane or the mailman2 archive server. This allows bots collecting addresses to harvest them and I’m looking for a way to avoid this.

- globally removing email in the posts (a display plugin maybe?)
- some site setting that already does that
- another idea?

Thanks in advance for your help!

Jonathan5 · July 5, 2021, 7:04pm

Is it an either/or thing?

When I tried to import my MM2 mbox to MM3, about a quarter of the emails were orphaned (replies were wrongly treated as beginning new threads) because they didn’t have the right headers. Pipermail in MM2 can structure the archive using Subject (if there is no Message-ID or whatever the other header is called - I forget) but, the last time I checked, Postorius in MM3 ignores the Subject. So ideally your script would do the same as Pipermail and mostly get it right on my list.

Also – if emails get imported higgledy-piggledy, as noted above, is there any way in Discourse to fix that? Or is the only answer to try index_only and either add headers to the mbox file or rejig the index.db as suggested in the post quoted below?

Thanks.

gerhard · July 14, 2021, 8:47am

Yes, it is.

Not really. Well, you could move posts, but that is tedious even with automation.

I think that’s the best way to solve your problem unless you feel comfortable working on the import script and adding some kind of hybrid mode that groups by Message-ID and subject in case the former is missing.
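
To illustrate what such a hybrid mode might look like, here is a rough stand-alone sketch; it is not taken from the import script. The `Message` struct, `assign_topics` and the other names are hypothetical, messages are assumed to be sorted by date, and the header lookup uses an In-Reply-To value that points at the parent's Message-ID. The subject fallback is only used for subjects that start with "Re:", so that unrelated topics with the same subject are not merged.

```ruby
# Hypothetical record: the parsed headers we need for threading.
Message = Struct.new(:msg_id, :in_reply_to, :subject, keyword_init: true)

REPLY_PREFIX = /\A(\s*re\s*:\s*)+/i

def reply_subject?(subject)
  subject.to_s.match?(REPLY_PREFIX)
end

# Strips reply prefixes and case so "RE: Foo" and "foo" compare equal.
def normalize_subject(subject)
  subject.to_s.sub(REPLY_PREFIX, "").strip.downcase
end

# Returns a Hash that maps each msg_id to the msg_id of its topic's first message.
def assign_topics(messages)
  topic_of = {}         # msg_id => topic
  topic_by_subject = {} # normalized subject => first topic with that subject

  messages.each do |message|
    subject = normalize_subject(message.subject)

    topic =
      if message.in_reply_to && topic_of.key?(message.in_reply_to)
        topic_of[message.in_reply_to] # threading by header
      elsif reply_subject?(message.subject) && topic_by_subject.key?(subject)
        topic_by_subject[subject]     # fallback: threading by subject
      else
        message.msg_id                # starts a new topic
      end

    topic_of[message.msg_id] = topic
    topic_by_subject[subject] ||= topic
  end

  topic_of
end
```

A reply whose parent header is missing or unknown still ends up in the earliest topic with the same subject, which is roughly the behaviour described for Pipermail above.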

gerhard · October 6, 2021, 5:38pm

Importing from Google Groups is currently broken, because Google changed the UI and removed the AJAX crawling scheme they deprecated back in 2015.

Has anyone managed to use Google Takeout for exporting mbox files yet?

Anjana_Raghavendra_P · October 30, 2021, 3:28pm

Hi,

How can we use this to import Google Groups into SaaS (hosted) Discourse instead of an on-premise install?

pfaffman · October 30, 2021, 7:42pm

If you pay for business hosting for a year then they’ll do it for free. Otherwise, you do it on your own server and upload the backup to your instance and email support to ask them to restore it.

The Google Groups script can be finicky when it comes to getting the authentication to work just right. Last time I used it, I had to fiddle with the login endpoint to get it to work.

Anjana_Raghavendra_P · October 31, 2021, 1:00pm

Do you remember the change you made to get the login to work? I am getting the following error even though I used the same extension as mentioned in the initial steps to generate the cookies file. By the way, it is a private-domain group I am working with.

Logging in...
2021-10-31 12:54:41 WARN Selenium [DEPRECATION] [:browser_options] :options as a parameter for driver initialization is deprecated. Use :capabilities with an Array of value capabilities/options if necessary instead.
Traceback (most recent call last):
        31: from script/import_scripts/google_groups.rb:293:in `<main>'
        30: from script/import_scripts/google_groups.rb:237:in `crawl'
        29: from script/import_scripts/google_groups.rb:181:in `login'
        28: from script/import_scripts/google_groups.rb:196:in `add_cookies'
        27: from script/import_scripts/google_groups.rb:196:in `each'
        26: from script/import_scripts/google_groups.rb:200:in `block in add_cookies'
        25: from /usr/local/lib/ruby/gems/2.7.0/gems/selenium-webdriver-4.0.3/lib/selenium/webdriver/common/manager.rb:61:in `add_cookie'
        24: from /usr/local/lib/ruby/gems/2.7.0/gems/selenium-webdriver-4.0.3/lib/selenium/webdriver/remote/bridge.rb:349:in `add_cookie'
#0 0x557491640f93 <unknown>: invalid cookie domain: Cookie 'domain' mismatch (Selenium::WebDriver::Error::InvalidCookieDomainError)

gerhard · October 31, 2021, 1:37pm

I’m sorry, but fixing the login isn’t enough.

anon73664359 · October 31, 2021, 1:44pm

Did the more recent redesign fix anything?

gerhard · October 31, 2021, 1:49pm

No, unless they re-added the feature in the last 25 days. I don’t think they will, so the scraper will need a complete overhaul.
