HOWTO: Import MBOX (mailing list) files


(Gunnar Helliesen) #33

And another thing: How does Discourse/sidekiq decide which messages to send out in digests? The reason I ask is that we’ve run into a bit of a Y2K issue (16 years after the fact!)

Some of our older messages that were imported (going back to 1993) had date headers with 2-digit year fields. So “93” instead of 1993. Discourse seems to think that these messages were posted in the future, in 2093. Will those messages now get included in every digest going forward?


(Jeff Atwood) #34

I don’t recommend enabling digest emails for migrated sites with large, old user bases going back a decade or more. What I favor is marking any account that hasn’t posted in the last (x) months as unvalidated, meaning the user must log in again and verify they control that email address. This also prevents unwanted bulk summary / digest emails from being sent.

The 1993 issue we would need a PR for or you can massage your data via SQL.
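If you go the data-massage route, the core of the fix is simple: any imported date that lands in the future was almost certainly a two-digit year read into the wrong century, and can be shifted back exactly 100 years. A minimal, self-contained sketch of that rule in Ruby (the method name and `today` parameter are illustrative, not part of the importer):

```ruby
require "date"

# Any imported date in the future is assumed to be a mis-parsed two-digit
# year (e.g. "93" read as 2093) and is shifted back exactly 100 years.
def fix_two_digit_year(date, today: Date.today)
  date > today ? date << 1200 : date # `<<` subtracts months; 1200 months = 100 years
end

fix_two_digit_year(Date.new(2093, 10, 12)) # => 1993-10-12
fix_two_digit_year(Date.new(1993, 10, 12)) # => unchanged
```

The same rule translates directly into a SQL `UPDATE ... WHERE created_at > now()` against the posts table if you prefer to fix it in the database.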

(Gunnar Helliesen) #35

Agreed, but here’s what I’d like to do: Turn off digest emails for all users. Those who want it can turn it back on. How would I best go about doing that?

When we imported all of our 1.8 million emails, lots of users were automatically created from email addresses that haven’t been in use for years, or even decades. However, because of the Y2K issue mentioned above, some of these are flagged as having been active recently. For us, the only possible solution is to turn off digests for everyone, ASAP.

We’re actually getting flagged as spammers due to our very high rate of bounces, so this is quite urgent.


(Jeff Atwood) #36

No need, because if you mark accounts unvalidated they can’t receive email by definition.

(Gunnar Helliesen) #37

I can’t do that now: we’ve been in production for a couple of days, and a few thousand of the users have already used the “forgot password” mechanism, logged in, and started using the forums. If I were to mark them as unvalidated, they’d have to re-verify their email addresses, correct?

Isn’t there a way of just turning off digests for everyone in the database?



(Jeff Atwood) #38

Yes, but that process is basically identical to forgot password, so from the user’s perspective it is the same.

(Gunnar Helliesen) #39

Still, I can’t do that: too many users have already successfully started using the new forums. We squandered a lot of good will with the whole email disaster, so I don’t want to antagonize the user base further.

Any ideas on how to turn off digests for everyone in bulk?


(Jeff Atwood) #40

Yes, you can do that: only users who have not already logged in would be affected. It is trivial to write the query clause against last post date or last seen (being seen requires a login…)
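A sketch of that bulk update, runnable from the Rails console inside the container. It assumes the schema where digest preferences live in `user_options.email_digests` and treats `users.last_seen_at` as the “has logged in recently” signal; verify both against your Discourse version, and take a backup first:

```ruby
# Disable digest emails for every user not seen since the cutoff.
# Column and model names are assumptions about the Discourse schema;
# update_all bypasses callbacks and validations, which is fine here.
cutoff = 1.year.ago
stale_ids = User.where("last_seen_at IS NULL OR last_seen_at < ?", cutoff).select(:id)
UserOption.where(user_id: stale_ids).update_all(email_digests: false)
```

Users who have logged in recently keep whatever digest preference they chose.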

(David Warner) #41

If one wanted to import a mailing list into an already existing Discourse instance, would following the steps here wipe out my existing instance? And if that’s the case, would it make sense to follow the steps here to import the archives and then use the Topic and Category Export/Import to move the mailing list archives over instead?


(Jay Pfaffman) #42

The way that I’d do it is to

  1. back up Discourse,
  2. freeze Discourse,
  3. import that database on your development machine,
  4. import the mailing list on the dev machine,
  5. back up the dev machine,
  6. restore that backup on the production machine.

(David Warner) #43

Makes sense. Thanks!

(bastian meissner) #44

I am trying to import about 142,000 emails using this script, yet some emails just block everything for unknown reasons.
Is there any way to modify the script so that emails are skipped if they take too long to process?

(Jay Pfaffman) #45

There should be a way to modify it so that it processes or ignores those messages. Is there something about those messages that seems different?
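One blunt way to get the skip-if-too-slow behavior is to wrap the per-message work in Ruby’s stdlib `Timeout`. This is a generic sketch, not the script’s own code; `Timeout` can interrupt code at awkward points, so log what gets skipped and review it afterwards:

```ruby
require "timeout"

# Run the per-message work, skipping any message that exceeds the limit.
# `id` and the block stand in for the real import step.
def process_with_timeout(id, limit: 30, &work)
  Timeout.timeout(limit, &work)
rescue Timeout::Error
  warn "skipped #{id}: exceeded #{limit}s"
  nil
end

process_with_timeout("msg-1") { :imported }          # => :imported
process_with_timeout("msg-2", limit: 1) { sleep 2 }  # => nil (skipped)
```

In the mbox importer, the natural place for such a wrapper would be around the per-message body in `create_forum_topics`.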

(bastian meissner) #46

Not really. I can’t see anything different about them. It might be something with CCs, but that’s processed just fine in other emails…

(M K) #47

Is it possible to make the SQLite-to-Discourse step “verbose”? The “creating forum topics” part of the process halts reliably after a very limited number of created topics. I do not understand how to troubleshoot this part of the import.

importing users
Skipping 20 already imported users
Skipping 12 already imported users

creating forum topics
        2 / 192 (  1.0%)  [1093 items/min] 

After CTRL + C:

 ^C/home/discourse/.rbenv/versions/2.3.4/lib/ruby/gems/2.3.0/gems/email_reply_trimmer-0.1.7/lib/email_reply_trimmer.rb:182:in `gsub!': Interrupt
        from /home/discourse/.rbenv/versions/2.3.4/lib/ruby/gems/2.3.0/gems/email_reply_trimmer-0.1.7/lib/email_reply_trimmer.rb:182:in `block in preprocess!'
        from /home/discourse/.rbenv/versions/2.3.4/lib/ruby/gems/2.3.0/gems/email_reply_trimmer-0.1.7/lib/email_reply_trimmer.rb:181:in `each'
        from /home/discourse/.rbenv/versions/2.3.4/lib/ruby/gems/2.3.0/gems/email_reply_trimmer-0.1.7/lib/email_reply_trimmer.rb:181:in `preprocess!'
        from /home/discourse/.rbenv/versions/2.3.4/lib/ruby/gems/2.3.0/gems/email_reply_trimmer-0.1.7/lib/email_reply_trimmer.rb:33:in `trim'
        from /home/discourse/discourse/lib/email/receiver.rb:205:in `select_body'
        from script/import_scripts/mbox.rb:426:in `block (2 levels) in create_forum_topics'
        from /home/discourse/discourse/script/import_scripts/base.rb:432:in `block in create_posts'
        from /home/discourse/discourse/script/import_scripts/base.rb:431:in `each'
        from /home/discourse/discourse/script/import_scripts/base.rb:431:in `create_posts'
        from script/import_scripts/mbox.rb:419:in `block in create_forum_topics'
        from /home/discourse/discourse/script/import_scripts/base.rb:784:in `block in batches'
        from /home/discourse/discourse/script/import_scripts/base.rb:783:in `loop'
        from /home/discourse/discourse/script/import_scripts/base.rb:783:in `batches'
        from script/import_scripts/mbox.rb:413:in `create_forum_topics'
        from script/import_scripts/mbox.rb:57:in `execute'
        from /home/discourse/discourse/script/import_scripts/base.rb:45:in `perform'
        from script/import_scripts/mbox.rb:555:in `<main>'

(Jay Pfaffman) #48

Sure. Just add some

puts "#{somevariable}"

statements at the top of that loop.

(M K) #49

Thanks, that works well. Is there a place where the maximum body length of an imported message is defined?

Edit: Nevermind. I think I’ve narrowed the problem down to forward slashes (/) in the content to be imported.

(Jay Pfaffman) #50

There is a site setting.

If you grep the other importers you can find how to set a site setting in the importer.
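For example, the maximum post length is governed by a site setting, and import scripts can override settings for the duration of a run. The setting name below is from memory; grep the importers and the site settings definitions to confirm it for your version:

```ruby
# Inside an import script: relax the post-length limit for the run.
# `max_post_length` is believed to be the relevant setting; verify first.
SiteSetting.max_post_length = 100_000
```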

(Marcus Baw) #51

@pfaffman I’m doing an import of a GNU mbox archive. The importer does all the message indexing and creates the SQLite DB index.db, but there is no message content.

I suspect it has something to do with the errors I’m getting all through the import:

Ignoring bad email address at (Foo Bar) in

I’ve had a look at the source around that error message, which checks that the from_email matches a regex, so presumably something in those email addresses (it affects them all) is not matching.

There is a bit of email-processing code in the #extract_name method, which is clearly designed to pull real names out of parentheses and replace _at_ with @, but something isn’t working as intended.

Any pointers gratefully received.
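For debugging, the normalization that #extract_name appears to attempt can be reproduced standalone and run against sample From lines from the archive. This is a guess at the intended behavior, not the importer’s actual code; the method name and regexes are illustrative:

```ruby
# De-obfuscate "_at_" and pull the display name out of trailing
# parentheses, as old list archives often format the From field.
def parse_archived_from(from)
  name  = from[/\(([^)]*)\)\s*\z/, 1]
  email = from.sub(/\s*\([^)]*\)\s*\z/, "").strip.sub(/\s*_at_\s*/i, "@")
  [email, name]
end

parse_archived_from("gunnar _at_ example.com (Gunnar Helliesen)")
# => ["gunnar@example.com", "Gunnar Helliesen"]
```

Feeding the failing addresses through something like this should show quickly which part of the pattern they don’t satisfy.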

(Gerhard Schlager) unlisted #52