Importing mailing lists (mbox, Listserv, emails, ...)

import

(Gerhard Schlager) #1

This guide is for you if you want to migrate a mailing list to Discourse.

1. Importing using Docker container

This is the recommended way for importing content from your mailing lists into Discourse.

1.1. Installing Discourse

Install Discourse by following the official installation guide.
Afterwards it’s a good idea to go to the Admin section and configure a few settings:

  • Enable login_required if imported topics shouldn’t be visible to the public.
  • Enable hide_user_profiles_from_public if user profiles shouldn’t be visible to the public.
  • Disable download_remote_images_to_local if you don’t want Discourse to download images embedded in posts.
  • Enable disable_edit_notifications if you enabled download_remote_images_to_local and don’t want your users to get lots of notifications about posts edited by the system user.
  • Change the value of slug_generation_method if most of the topic titles use characters which shouldn’t be mapped to ASCII (e.g. Arabic). See this post for more information.

:bangbang: The following steps assume that you installed Discourse on Ubuntu and that you are connected to the machine via SSH or have direct access to the machine’s terminal.

1.2. Preparing the Docker container

Copy the container configuration file app.yml to import.yml and edit it with your favorite editor.

cd /var/discourse
cp containers/app.yml containers/import.yml
nano containers/import.yml

Add - "templates/import/mbox.template.yml" to the list of templates. Afterwards it should look something like this:

templates:
  - "templates/postgres.template.yml"
  - "templates/redis.template.yml"
  - "templates/web.template.yml"
  - "templates/web.ratelimited.template.yml"
## Uncomment these two lines if you wish to add Lets Encrypt (https)
  #- "templates/web.ssl.template.yml"
  #- "templates/web.letsencrypt.ssl.template.yml"
  - "templates/import/mbox.template.yml"

That’s it. You can save the file, close the editor and build the container.

/var/discourse/launcher stop app
/var/discourse/launcher rebuild import

Building the container creates an import directory within the container’s shared directory. It looks like this:

/var/discourse/shared/standalone/import
├── data
└── settings.yml

1.3. Configuring the importer

You can configure the importer by editing the example settings.yml file that has been copied into the import directory.

nano /var/discourse/shared/standalone/import/settings.yml

The settings file is well documented and comes with sensible defaults, but here are a few tips anyway:

  • The settings file contains multiple examples on how to split data files:

    • mbox files usually are separated by a From header. Choose a regular expression that works for your files.
    • If each of your files contains only one message, set the split_regex to an empty string.
    • There’s also an example for files from the popular Listserv mailing list software.
  • prefer_html lets you configure whether the import should use the HTML part of emails when it exists. Choose what suits you best – it depends heavily on the emails sent to your mailing list.

  • By default each user imported from the mailing list is created as a staged user. You can disable that behaviour by setting staged to false.

  • If your emails do not contain a Message-ID header (like messages stored by Listserv), you should enable the group_messages_by_subject setting.
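
One way to sanity-check a candidate split_regex before indexing is to count how many lines it matches in a data file; the count should equal the number of messages you expect. This is only a sketch: it assumes grep -E interprets the expression the same way the importer does (likely for simple patterns, not guaranteed in general), and both the sample file and the pattern are illustrative rather than copied from settings.yml.

```shell
# Build a tiny sample mbox in a scratch directory (stand-in for a real data file).
SAMPLE_DIR="$(mktemp -d)"
cat > "$SAMPLE_DIR/sample.mbox" <<'EOF'
From alice@example.com Mon Jan  1 10:00:00 2018
Subject: First message

Hello list.

From bob@example.com Mon Jan  1 11:00:00 2018
Subject: Re: First message

Hi Alice.
EOF

# Hypothetical candidate pattern: a line starting with "From " plus an address.
CANDIDATE_REGEX='^From .+@.+'

# Count the separator lines the pattern finds; expect one per message.
MESSAGE_COUNT="$(grep -c -E "$CANDIDATE_REGEX" "$SAMPLE_DIR/sample.mbox")"
echo "separator lines matched: $MESSAGE_COUNT"
```

If the count is far higher than the number of messages, the pattern is probably also matching lines inside message bodies and needs to be more specific.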

1.4. Prepare files

Each subdirectory of /var/discourse/shared/standalone/import/data gets imported as its own category and each directory should contain the data files you want to import. The file names of those do not matter.

Example: The import directory should look like this if you want to import two mailing lists with multiple mbox files:

/var/discourse/shared/standalone/import
├── data
│   ├── list 1
│   │   ├── foo
│   │   ├── bar
│   ├── list 2
│   │   ├── 2017-12.mbox
│   │   ├── 2018-01.mbox
└── settings.yml
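
A layout like the one above can be produced with a few commands. The sketch below uses a scratch directory in place of /var/discourse/shared/standalone/import, and the empty files are placeholders for your real archives.

```shell
# Scratch directory stands in for /var/discourse/shared/standalone/import.
IMPORT_DIR="$(mktemp -d)"

# One subdirectory per mailing list; each becomes its own category.
mkdir -p "$IMPORT_DIR/data/list 1" "$IMPORT_DIR/data/list 2"

# Placeholder archives; on a real system you would cp your mbox files here.
: > "$IMPORT_DIR/data/list 2/2017-12.mbox"
: > "$IMPORT_DIR/data/list 2/2018-01.mbox"

# List the data files (quotes matter because the directory names contain spaces).
find "$IMPORT_DIR/data" -type f | sort
```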

1.5. Executing the import script

:bulb: Tip: It’s a good idea to start the import inside a tmux or screen session so that you can reconnect to the session in case of SSH connection loss.

Start the import by entering the Docker container and launching the import script inside it.

/var/discourse/launcher enter import
import_mbox.sh # inside the Docker container

Depending on the size of your mailing lists it’s now time for some :coffee: or :sleeping:
The import script will show you a message like this when it’s finished: Done (00h 26min 52sec)

:bulb: Tip: You can abort the import anytime you want by pressing Ctrl+C
When you restart the import it will continue where it left off.

You can exit and stop the Docker container after the import has finished.

exit # inside the Docker container
/var/discourse/launcher stop import

1.6. Starting Discourse

Let’s start the app container and take a look at the imported data.

/var/discourse/launcher start app

Discourse will start and Sidekiq will begin post-processing all the imported posts. This can take a considerable amount of time. You can watch the progress by logging in as admin and visiting http://discourse.example.com/sidekiq

1.7. Clean up

So, you are satisfied with the result of the import and want to free some disk space? The following commands will delete the Docker container used for importing as well as all the files used during the import.

/var/discourse/launcher destroy import
rm /var/discourse/containers/import.yml
rm -R /var/discourse/shared/standalone/import

1.8. The End

Now it’s time to celebrate and enjoy your new Discourse instance! :tada:


(Matthew Needham) #12

The import script completed successfully, and Sidekiq processed 16,610 posts. Sidekiq reported two failures. Is there a way for me to find details on those?

I noticed that the topic subject contains the Mailman tag. I see that the old Howto has a LIST_NAME setting that can be set to remove those on import. How would I go about stripping it with this method?

The document is very clear, easy to follow, and has worked perfectly once I had a clean droplet. Thanks!


(Gerhard Schlager) #13

You have two options if you want to remove [Foo] from the topic title during the import:

  • rename the directory that contains the mbox files to Foo or
  • create a metadata.yml file within that directory with the following content:
    name: "Foo"
    description: "The description is optional and will be used for the 'About category' topic"
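
For the second option, the file can be written from the shell. A sketch, using a scratch directory in place of the list's directory under import/data; the directory name Archive is just a placeholder.

```shell
# Scratch directory stands in for .../import/data/<your list directory>.
LIST_DIR="$(mktemp -d)/Archive"
mkdir -p "$LIST_DIR"

# Write the two keys shown above; the quoted heredoc keeps the text literal.
cat > "$LIST_DIR/metadata.yml" <<'EOF'
name: "Foo"
description: "The description is optional and will be used for the 'About category' topic"
EOF

cat "$LIST_DIR/metadata.yml"
```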
    

(Matthew Needham) #14

If I want to run a new import with the same mbox, how do I prevent it from seeing the posts as already having been imported? Creating a fresh “launcher rebuild import” and copying the folder/mbox back doesn’t seem to do the trick.


(Gerhard Schlager) #15

You’ll need to stop the container, delete all data and start again. The following commands conveniently delete everything except the mbox files and the importer configuration.

cd /var/discourse

./launcher stop app
./launcher stop import

shopt -s extglob # bash only: enables the !(pattern) glob used in the next line
rm -r ./shared/standalone/!(import)
rm ./shared/standalone/import/data/index.db

./launcher rebuild import

./launcher enter import
import_mbox.sh # inside the Docker container

(Matthew Needham) #16

I tried renaming the directory to “Archive” so that the category would be named “Archive” AND created metadata.yml with the mailing list name. The subject tag was removed successfully, but the category ended up named after the list. Is there any way to mix the options, or otherwise name the archive something different at import?


(Gerhard Schlager) #17

You don’t need both. The name in metadata.yml always overrides the directory name.

So, you want the category to have a different name than “Archive”? No problem. After the import you can rename it within Discourse.


(Yaw Anokwa) #18

Thanks for the great work on this, @gerhard! It worked great for me and I have two feature requests that would make this even better.

  1. It’d be nice to have the users active when I import them. I’ve sent in a PR here: Add an option to set users to active on mbox import by yanokwa · Pull Request #5640 · discourse/discourse · GitHub
  2. After a mbox import, I always need to import users who were on the list but never sent a message. I do this by shoving them into the index.db. It’d be nice if we could also include a members.csv with name, email, created_at that would get imported too.

(Matthew Needham) #19

After the initial test import went well, it was decided to create profiles with all of the current subscribers so that we could preset some of their options (moderation, mailing list mode, etc). This allowed us to have Discourse running for everyone before the Mailman turndown date, and for everyone to continue receiving discussions pretty seamlessly. Now that the Mailman list has been disabled, I need to import the archive.

I had earlier tested the creation of accounts for current subscribers after the mbox import, but ran into problems with duplicate usernames due to the staged users. I blew away everything and my subscriber import worked fine.

Back to the mbox import… I started with a backup of the production instance (which included imported Mailman subscribers), and created a test instance. We’ve since decided that we’re not all that concerned about linking users that existed 10 years ago but are no longer active, so I disabled staging in import/settings.yml. It got to the point of indexing the mbox, and then failed. I tried several different things (exiting the container, editing settings.yml, and restarting the import), but I consistently get the same error. I’m running out of ideas. Can anyone else make sense of this?

root@testforum-MboxImport-AfterSubscriberMigration-import:/var/www/discourse# import_mbox.sh
The mbox import is starting...

loading existing groups...
loading existing users...
loading existing categories...
loading existing posts...
loading existing topics...

creating index
indexing files in /shared/import/data/Hdf-forum
indexing /shared/import/data/Hdf-forum/hdf-forum_lists.hdfgroup.org.mbox
/var/www/discourse/lib/email/receiver.rb:304:in `each': undefined method `remove' for nil:NilClass (NoMethodError)
from /var/www/discourse/lib/email/receiver.rb:304:in `extract_from_outlook'
from /var/www/discourse/lib/email/receiver.rb:270:in `select_body'
from /var/www/discourse/script/import_scripts/mbox/support/indexer.rb:68:in `block in index_emails'
from /var/www/discourse/script/import_scripts/mbox/support/indexer.rb:122:in `block (2 levels) in all_messages'
from /var/www/discourse/script/import_scripts/mbox/support/indexer.rb:154:in `block in each_mail'
from /var/www/discourse/script/import_scripts/mbox/support/indexer.rb:173:in `block in each_line'
from /var/www/discourse/script/import_scripts/mbox/support/indexer.rb:172:in `each_line'
from /var/www/discourse/script/import_scripts/mbox/support/indexer.rb:172:in `each_line'
from /var/www/discourse/script/import_scripts/mbox/support/indexer.rb:149:in `each_mail'
from /var/www/discourse/script/import_scripts/mbox/support/indexer.rb:115:in `block in all_messages'
from /var/www/discourse/script/import_scripts/mbox/support/indexer.rb:108:in `foreach'
from /var/www/discourse/script/import_scripts/mbox/support/indexer.rb:108:in `all_messages'
from /var/www/discourse/script/import_scripts/mbox/support/indexer.rb:64:in `index_emails'
from /var/www/discourse/script/import_scripts/mbox/support/indexer.rb:23:in `block in execute'
from /var/www/discourse/script/import_scripts/mbox/support/indexer.rb:20:in `each'
from /var/www/discourse/script/import_scripts/mbox/support/indexer.rb:20:in `execute'
from /var/www/discourse/script/import_scripts/mbox/importer.rb:40:in `index_messages'
from /var/www/discourse/script/import_scripts/mbox/importer.rb:26:in `execute'
from /var/www/discourse/script/import_scripts/base.rb:46:in `perform'
from script/import_scripts/mbox-experimental.rb:14:in `<module:Mbox>'
from script/import_scripts/mbox-experimental.rb:8:in `<module:ImportScripts>'
from script/import_scripts/mbox-experimental.rb:7:in `<main>'

Thanks!


(Gerhard Schlager) #20

We improved the detection of email signatures a few days ago. It looks like this is causing problems in some cases.

I just committed an improved import script that logs and ignores errors during the indexing phase. Could you please update your Discourse container and try again? It should print an error message like the following to the console when it encounters a problematic email.

Failed to index message in /data/some-list/1711.mbox at lines 470-830

I’d appreciate it if you could send me the email mentioned in the error message as a PM, so that we can fix the problem.
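
The line range in such a message can be used to pull the offending email out of the mbox file for inspection. A sketch with sed, using a generated stand-in file and a short range; it assumes the reported line numbers are 1-based and inclusive. For the example message above the range would be 470 to 830 and the file would be the one named in the error.

```shell
# Stand-in for the mbox file named in the error message: ten numbered lines.
WORK_DIR="$(mktemp -d)"
i=1
while [ "$i" -le 10 ]; do
  echo "line $i"
  i=$((i + 1))
done > "$WORK_DIR/sample.mbox"

# First and last line as reported by the importer (470 and 830 in the example).
FIRST_LINE=3
LAST_LINE=5

# Print only that range into a separate file.
sed -n "${FIRST_LINE},${LAST_LINE}p" "$WORK_DIR/sample.mbox" > "$WORK_DIR/problem-email.txt"
cat "$WORK_DIR/problem-email.txt"
```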


(Matthew Needham) #21

Thanks. I never saw that actual message, but I did send you two emails that were apparently problematic. Here’s the actual error text I saw:

creating topics and posts
 5703 / 10785 ( 52.9%)  [97 items/min]  Exception while creating post 20150604130305.GI1412@pi-x230.     Skipping.
Terminated during callback
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/mini_racer-0.1.15/lib/mini_racer.rb:182:in `eval_unsafe'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/mini_racer-0.1.15/lib/mini_racer.rb:182:in `block (2 levels) in eval'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/mini_racer-0.1.15/lib/mini_racer.rb:262:in `timeout'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/mini_racer-0.1.15/lib/mini_racer.rb:181:in `block in eval'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/mini_racer-0.1.15/lib/mini_racer.rb:69:in `block in with_lock'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/mini_racer-0.1.15/lib/mini_racer.rb:69:in `synchronize'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/mini_racer-0.1.15/lib/mini_racer.rb:69:in `with_lock'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/mini_racer-0.1.15/lib/mini_racer.rb:179:in `eval'
/var/www/discourse/lib/pretty_text.rb:193:in `block in markdown'
/var/www/discourse/lib/pretty_text.rb:381:in `block in protect'
/var/www/discourse/lib/pretty_text.rb:380:in `synchronize'
/var/www/discourse/lib/pretty_text.rb:380:in `protect'
/var/www/discourse/lib/pretty_text.rb:136:in `markdown'
/var/www/discourse/lib/pretty_text.rb:236:in `cook'
/var/www/discourse/app/models/post_analyzer.rb:31:in `cook'
/var/www/discourse/app/models/post.rb:257:in `cook'
/var/www/discourse/lib/post_creator.rb:254:in `before_create_tasks'
/var/www/discourse/app/models/post.rb:548:in `block in <class:Post>'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activesupport-5.1.4/lib/active_support/callbacks.rb:413:in `instance_exec'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activesupport-5.1.4/lib/active_support/callbacks.rb:413:in `block in make_lambda'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activesupport-5.1.4/lib/active_support/callbacks.rb:197:in `block (2 levels) in halting'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activesupport-5.1.4/lib/active_support/callbacks.rb:601:in `block (2 levels) in default_terminator'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activesupport-5.1.4/lib/active_support/callbacks.rb:600:in `catch'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activesupport-5.1.4/lib/active_support/callbacks.rb:600:in `block in default_terminator'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activesupport-5.1.4/lib/active_support/callbacks.rb:198:in `block in halting'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activesupport-5.1.4/lib/active_support/callbacks.rb:507:in `block in invoke_before'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activesupport-5.1.4/lib/active_support/callbacks.rb:507:in `each'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activesupport-5.1.4/lib/active_support/callbacks.rb:507:in `invoke_before'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activesupport-5.1.4/lib/active_support/callbacks.rb:130:in `run_callbacks'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activesupport-5.1.4/lib/active_support/callbacks.rb:827:in `_run_create_callbacks'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/callbacks.rb:340:in `_create_record'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/timestamp.rb:95:in `_create_record'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/persistence.rb:563:in `create_or_update'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/callbacks.rb:336:in `block in create_or_update'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activesupport-5.1.4/lib/active_support/callbacks.rb:131:in `run_callbacks'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activesupport-5.1.4/lib/active_support/callbacks.rb:827:in `_run_save_callbacks'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/callbacks.rb:336:in `create_or_update'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/persistence.rb:129:in `save'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/validations.rb:44:in `save'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/attribute_methods/dirty.rb:35:in `save'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/transactions.rb:308:in `block (2 levels) in save'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/transactions.rb:384:in `block in with_transaction_returning_status'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/connection_adapters/abstract/database_statements.rb:233:in `transaction'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/transactions.rb:210:in `transaction'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/transactions.rb:381:in `with_transaction_returning_status'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/transactions.rb:308:in `block in save'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/transactions.rb:323:in `rollback_active_record_state!'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/transactions.rb:307:in `save'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/suppressor.rb:42:in `save'
/var/www/discourse/lib/post_creator.rb:459:in `save_post'
/var/www/discourse/lib/post_creator.rb:165:in `block in create'
/var/www/discourse/lib/distributed_mutex.rb:21:in `synchronize'
/var/www/discourse/lib/distributed_mutex.rb:5:in `synchronize'
/var/www/discourse/lib/post_creator.rb:319:in `block in transaction'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/connection_adapters/abstract/database_statements.rb:235:in `block in transaction'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/connection_adapters/abstract/transaction.rb:194:in `block in within_new_transaction'
/usr/local/lib/ruby/2.4.0/monitor.rb:214:in `mon_synchronize'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/connection_adapters/abstract/transaction.rb:191:in `within_new_transaction'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/connection_adapters/abstract/database_statements.rb:235:in `transaction'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/transactions.rb:210:in `transaction'
/var/www/discourse/lib/post_creator.rb:313:in `transaction'
/var/www/discourse/lib/post_creator.rb:162:in `create'
/var/www/discourse/script/import_scripts/base.rb:531:in `create_post'
/var/www/discourse/script/import_scripts/base.rb:484:in `block in create_posts'
/var/www/discourse/script/import_scripts/base.rb:471:in `each'
/var/www/discourse/script/import_scripts/base.rb:471:in `create_posts'
/var/www/discourse/script/import_scripts/mbox/importer.rb:95:in `block in import_posts'
/var/www/discourse/script/import_scripts/base.rb:824:in `block in batches'
/var/www/discourse/script/import_scripts/base.rb:823:in `loop'
/var/www/discourse/script/import_scripts/base.rb:823:in `batches'
/var/www/discourse/script/import_scripts/mbox/importer.rb:81:in `batches'
/var/www/discourse/script/import_scripts/mbox/importer.rb:89:in `import_posts'
/var/www/discourse/script/import_scripts/mbox/importer.rb:33:in `execute'
/var/www/discourse/script/import_scripts/base.rb:46:in `perform'
script/import_scripts/mbox-experimental.rb:14:in `<module:Mbox>'
script/import_scripts/mbox-experimental.rb:8:in `<module:ImportScripts>'
script/import_scripts/mbox-experimental.rb:7:in `<main>'
     7629 / 10785 ( 70.7%)  [103 items/min]  Exception while creating post 957a0f650344476ca5485a3cc0993dbb@MI-MBX-PROD2.minet.ae. Skipping.
Terminated during callback
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/mini_racer-0.1.15/lib/mini_racer.rb:182:in `eval_unsafe'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/mini_racer-0.1.15/lib/mini_racer.rb:182:in `block (2 levels) in eval'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/mini_racer-0.1.15/lib/mini_racer.rb:262:in `timeout'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/mini_racer-0.1.15/lib/mini_racer.rb:181:in `block in eval'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/mini_racer-0.1.15/lib/mini_racer.rb:69:in `block in with_lock'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/mini_racer-0.1.15/lib/mini_racer.rb:69:in `synchronize'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/mini_racer-0.1.15/lib/mini_racer.rb:69:in `with_lock'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/mini_racer-0.1.15/lib/mini_racer.rb:179:in `eval'
/var/www/discourse/lib/pretty_text.rb:193:in `block in markdown'
/var/www/discourse/lib/pretty_text.rb:381:in `block in protect'
/var/www/discourse/lib/pretty_text.rb:380:in `synchronize'
/var/www/discourse/lib/pretty_text.rb:380:in `protect'
/var/www/discourse/lib/pretty_text.rb:136:in `markdown'
/var/www/discourse/lib/pretty_text.rb:236:in `cook'
/var/www/discourse/app/models/post_analyzer.rb:31:in `cook'
/var/www/discourse/app/models/post.rb:257:in `cook'
/var/www/discourse/lib/post_creator.rb:254:in `before_create_tasks'
/var/www/discourse/app/models/post.rb:548:in `block in <class:Post>'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activesupport-5.1.4/lib/active_support/callbacks.rb:413:in `instance_exec'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activesupport-5.1.4/lib/active_support/callbacks.rb:413:in `block in make_lambda'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activesupport-5.1.4/lib/active_support/callbacks.rb:197:in `block (2 levels) in halting'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activesupport-5.1.4/lib/active_support/callbacks.rb:601:in `block (2 levels) in default_terminator'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activesupport-5.1.4/lib/active_support/callbacks.rb:600:in `catch'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activesupport-5.1.4/lib/active_support/callbacks.rb:600:in `block in default_terminator'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activesupport-5.1.4/lib/active_support/callbacks.rb:198:in `block in halting'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activesupport-5.1.4/lib/active_support/callbacks.rb:507:in `block in invoke_before'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activesupport-5.1.4/lib/active_support/callbacks.rb:507:in `each'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activesupport-5.1.4/lib/active_support/callbacks.rb:507:in `invoke_before'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activesupport-5.1.4/lib/active_support/callbacks.rb:130:in `run_callbacks'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activesupport-5.1.4/lib/active_support/callbacks.rb:827:in `_run_create_callbacks'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/callbacks.rb:340:in `_create_record'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/timestamp.rb:95:in `_create_record'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/persistence.rb:563:in `create_or_update'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/callbacks.rb:336:in `block in create_or_update'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activesupport-5.1.4/lib/active_support/callbacks.rb:131:in `run_callbacks'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activesupport-5.1.4/lib/active_support/callbacks.rb:827:in `_run_save_callbacks'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/callbacks.rb:336:in `create_or_update'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/persistence.rb:129:in `save'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/validations.rb:44:in `save'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/attribute_methods/dirty.rb:35:in `save'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/transactions.rb:308:in `block (2 levels) in save'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/transactions.rb:384:in `block in with_transaction_returning_status'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/connection_adapters/abstract/database_statements.rb:233:in `transaction'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/transactions.rb:210:in `transaction'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/transactions.rb:381:in `with_transaction_returning_status'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/transactions.rb:308:in `block in save'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/transactions.rb:323:in `rollback_active_record_state!'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/transactions.rb:307:in `save'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/suppressor.rb:42:in `save'
/var/www/discourse/lib/post_creator.rb:459:in `save_post'
/var/www/discourse/lib/post_creator.rb:165:in `block in create'
/var/www/discourse/lib/distributed_mutex.rb:21:in `synchronize'
/var/www/discourse/lib/distributed_mutex.rb:5:in `synchronize'
/var/www/discourse/lib/post_creator.rb:319:in `block in transaction'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/connection_adapters/abstract/database_statements.rb:235:in `block in transaction'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/connection_adapters/abstract/transaction.rb:194:in `block in within_new_transaction'
/usr/local/lib/ruby/2.4.0/monitor.rb:214:in `mon_synchronize'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/connection_adapters/abstract/transaction.rb:191:in `within_new_transaction'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/connection_adapters/abstract/database_statements.rb:235:in `transaction'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/activerecord-5.1.4/lib/active_record/transactions.rb:210:in `transaction'
/var/www/discourse/lib/post_creator.rb:313:in `transaction'
/var/www/discourse/lib/post_creator.rb:162:in `create'
/var/www/discourse/script/import_scripts/base.rb:531:in `create_post'
/var/www/discourse/script/import_scripts/base.rb:484:in `block in create_posts'
/var/www/discourse/script/import_scripts/base.rb:471:in `each'
/var/www/discourse/script/import_scripts/base.rb:471:in `create_posts'
/var/www/discourse/script/import_scripts/mbox/importer.rb:95:in `block in import_posts'
/var/www/discourse/script/import_scripts/base.rb:824:in `block in batches'
/var/www/discourse/script/import_scripts/base.rb:823:in `loop'
/var/www/discourse/script/import_scripts/base.rb:823:in `batches'
/var/www/discourse/script/import_scripts/mbox/importer.rb:81:in `batches'
/var/www/discourse/script/import_scripts/mbox/importer.rb:89:in `import_posts'
/var/www/discourse/script/import_scripts/mbox/importer.rb:33:in `execute'
/var/www/discourse/script/import_scripts/base.rb:46:in `perform'
script/import_scripts/mbox-experimental.rb:14:in `<module:Mbox>'
script/import_scripts/mbox-experimental.rb:8:in `<module:ImportScripts>'
script/import_scripts/mbox-experimental.rb:7:in `<main>'
     7708 / 10785 ( 71.5%)  [103 items/min]  Parent message 20150604130305.GI1412@pi-x230 doesn't exist. Skipping BY2PR0701MB195750752681B7DB84D3B67EBDB30@BY2PR0701MB1957.namprd07.prod.outlook.com: [problem configuring parallel installatio
     7709 / 10785 ( 71.5%)  [103 items/min]  Parent message 20150604130305.GI1412@pi-x230 doesn't exist. Skipping BY2PR0701MB19574D4CB90CF97CC15A88C1BDB30@BY2PR0701MB1957.namprd07.prod.outlook.com: [problem configuring parallel installatio
     7710 / 10785 ( 71.5%)  [103 items/min]  Parent message 20150604130305.GI1412@pi-x230 doesn't exist. Skipping CB7DE6D7-3D7F-44F8-8A33-2B11C895A852@gmail.com: [problem configuring parallel installatio
     7711 / 10785 ( 71.5%)  [103 items/min]  Parent message 20150604130305.GI1412@pi-x230 doesn't exist. Skipping CANDEscrHLQdjuS7UnddJz9==qy1oXRn9BMQ0AU8XxcHN_TMc6w@mail.gmail.com: [problem configuring parallel installatio

(Gerhard Schlager) #22

So the indexing seems to have worked this time. Those are different errors and happened during the import phase. I’ve been looking at the example (quite lengthy) emails you sent me and my guess is that there was a timeout while cooking the post. There’s not much I can do about it. Maybe try importing on a machine that has a faster CPU or simply ignore those errors.


(Matthew Needham) #23

Well, that’s curious. Is there anything I can do to help investigate that error? I’m planning to schedule the real import into our production instance on Sunday, and I’d like to do what I can to avoid a mystery failure then.

I may try truncating the long messages. Is there any reason that would cause problems with the import?


(Gerhard Schlager) #24

Those long messages have roughly 200k characters – the default for max_post_length is 32k. How many characters the Markdown engine can process before it hits a timeout depends on the CPU speed.

You could try to find large messages within the index.db. It’s a SQLite 3 database. The following query lists all emails that are longer than 100k characters.

WITH long_emails AS (
    SELECT
      msg_id,
      length(body) + length(elided) AS length,
      category,
      filename,
      first_line_number,
      last_line_number
    FROM email
)
SELECT *
FROM long_emails
WHERE length > 100000

(Andreas Dorfer) #25

Importing a batch of >10k messages, I am stuck on something like this. How do I get more info about which message causes the issue?

/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/sqlite3-1.3.13/lib/sqlite3/statement.rb:108:in `step': NOT NULL constraint failed: user.date_of_first_message (SQLite3::ConstraintException)
        from /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/sqlite3-1.3.13/lib/sqlite3/statement.rb:108:in `block in each'
        from /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/sqlite3-1.3.13/lib/sqlite3/statement.rb:107:in `loop'
        from /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/sqlite3-1.3.13/lib/sqlite3/statement.rb:107:in `each'
        from /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/sqlite3-1.3.13/lib/sqlite3/database.rb:152:in `map'
        from /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/sqlite3-1.3.13/lib/sqlite3/database.rb:152:in `block in execute'
        from /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/sqlite3-1.3.13/lib/sqlite3/database.rb:95:in `prepare'
        from /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/sqlite3-1.3.13/lib/sqlite3/database.rb:137:in `execute'
        from /var/www/discourse/script/import_scripts/mbox/support/database.rb:130:in `fill_users_from_emails'
        from /var/www/discourse/script/import_scripts/mbox/support/indexer.rb:35:in `execute'
        from /var/www/discourse/script/import_scripts/mbox/importer.rb:40:in `index_messages'
        from /var/www/discourse/script/import_scripts/mbox/importer.rb:26:in `execute'
        from /var/www/discourse/script/import_scripts/base.rb:46:in `perform'
        from script/import_scripts/mbox-experimental.rb:14:in `<module:Mbox>'
        from script/import_scripts/mbox-experimental.rb:8:in `<module:ImportScripts>'
        from script/import_scripts/mbox-experimental.rb:7:in `<main>'

(Gerhard Schlager) #26

I fixed the import script. Emails without a valid date were already ignored in most queries.
FIX: mbox importer should ignore emails without date · discourse/discourse@9b651ad · GitHub

That said, you can connect to the index.db (SQLite 3 database) and execute the following query if you want to find the message that’s causing this error:

SELECT msg_id, category, filename, first_line_number, last_line_number
FROM email
WHERE email_date IS NULL
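
If the sqlite3 command-line shell is installed on the host (an assumption; it is not part of the steps above), the query can be run without an interactive session. The sketch below builds a throwaway database whose email table has only the columns named in the query, which is a simplification and not the real index.db schema. On a real system you would point DB at import/data/index.db and skip the CREATE/INSERT part.

```shell
DB="$(mktemp -d)/index.db"
QUERY='SELECT msg_id, category, filename, first_line_number, last_line_number
FROM email
WHERE email_date IS NULL'
RESULT=""

if command -v sqlite3 >/dev/null 2>&1; then
  # Throwaway stand-in table: one dated row and one row without a date.
  sqlite3 "$DB" <<'EOF'
CREATE TABLE email (
  msg_id TEXT, category TEXT, filename TEXT,
  first_line_number INTEGER, last_line_number INTEGER, email_date TEXT
);
INSERT INTO email VALUES ('a@example.com', 'list 1', 'foo', 1, 20, '2018-01-01');
INSERT INTO email VALUES ('b@example.com', 'list 1', 'foo', 21, 40, NULL);
EOF
  # In the default output mode, columns are separated by "|".
  RESULT="$(sqlite3 "$DB" "$QUERY")"
  echo "$RESULT"
else
  echo "sqlite3 not found; install it or query index.db with another SQLite client" >&2
fi
```

The file name and line numbers in each result row tell you where to look in the original mbox file.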

(Andreas Dorfer) #27

It would help to have some kind of error handling: moving files with “problems” to a different folder so they can be dealt with later, after a manual check or correction.


(Gerhard Schlager) #28

You can always enable index_only in settings.yml and take a look at the index.db before you run the actual import. Emails with an empty from_email in the email table are also ignored during the import.

:bulb: Pro tip: You can use SQL to update missing values in the database if you want. That way you don’t need to reindex any messages. The script uses only data from the index.db during the import phase. Simply disable the index_only option when you are done and rerun the importer. It will skip the indexing if none of the mbox files were changed, recalculate the content of the user and email_order tables and start the actual import process.
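
The index-then-import workflow described above could be scripted roughly like this. It is a sketch only: it assumes index_only appears in settings.yml as a top-level `index_only: <value>` line, and it edits a scratch copy rather than the real file. `sed -i` is the GNU sed form found on Ubuntu.

```shell
# Scratch copy stands in for /var/discourse/shared/standalone/import/settings.yml.
SETTINGS="$(mktemp -d)/settings.yml"
printf 'index_only: false\n' > "$SETTINGS"

# Step 1: enable index_only, then run import_mbox.sh inside the container
# and inspect or fix rows in index.db.
sed -i 's/^index_only:.*/index_only: true/' "$SETTINGS"
AFTER_ENABLE="$(cat "$SETTINGS")"
echo "$AFTER_ENABLE"

# Step 2: disable it again and rerun the importer for the actual import.
sed -i 's/^index_only:.*/index_only: false/' "$SETTINGS"
AFTER_DISABLE="$(cat "$SETTINGS")"
echo "$AFTER_DISABLE"
```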


(Derek Magill) #29

I managed to mangle my import. I forgot to name the directories properly to remove the [HEADER] piece (or to use the YAML file), so after the import I said “OK, I’m going to delete all of the posts and re-import”.

I’ve deleted all of the imported posts from the production instance and I’ve followed the steps above to re-import, but no matter what I do I cannot get it to not realize it’s already imported those messages.

I’ve even gone so far as to use “docker rmi import” to remove the docker image completely on top of removing the data in standalone/import.

What am I missing?


(Jay Pfaffman) #30

If you did what I think, the import container uses the standalone data directories.

You need to delete standalone/[pr]*
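
The [pr]* glob matches every entry in the standalone directory whose name starts with p or r, which presumably targets the PostgreSQL and Redis data directories while leaving import alone. Before deleting, it can be worth previewing what the pattern expands to. A sketch in a scratch directory; the directory names are assumptions about a typical standalone layout.

```shell
# Scratch directory stands in for /var/discourse/shared/standalone.
STANDALONE="$(mktemp -d)"
mkdir "$STANDALONE/postgres_data" "$STANDALONE/redis_data" \
      "$STANDALONE/import" "$STANDALONE/uploads"

# Preview the glob expansion first; nothing is deleted by this step.
cd "$STANDALONE" || exit 1
MATCHES="$(echo [pr]*)"
echo "would delete: $MATCHES"

# Only after checking the list:
rm -r [pr]*
ls
```

Stop both containers before running the real deletion, as in the re-import steps earlier in this topic.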