(Superseded) Import MBOX (mailing list) files

Thanks! One more question: what’s the role of sidekiq after mbox.rb has finished running? In my test run (one list only), sidekiq also took a long time to settle down after the initial import. By that I mean I was watching the status page at the /sidekiq URL, and it was working like crazy for many hours.

It does stuff like bake the posts so that URLs pointing to images and such get baked into the right HTML. I think that if you don’t wait for it to happen on your devel box, it’ll happen on production, but, well, you probably don’t want a bunch of bare URLs where images are supposed to be on your production site. And with 1.7e6 messages, it takes a little while to open them all up.

I think that this gives Sidekiq as many resources as is advisable if you’ve got lots of RAM and whatnot, but I admit that it’s all a bit mysterious to me. :slight_smile:

bundle exec sidekiq -c 100 -q critical,4 -q default,2 -q low
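
For what it’s worth, here is a rough Ruby sketch of how I understand the `-q name,weight` options in that command: a queue with weight 4 gets checked about twice as often as one with weight 2, and a queue listed without a weight counts as 1. This only models the idea; it is not Sidekiq’s actual code, and `expand_queues` is a made-up helper name.

```ruby
# Hypothetical helper: models weighted queue polling, not Sidekiq internals.
# Each queue name is repeated once per unit of weight.
def expand_queues(specs)
  specs.flat_map do |spec|
    name, weight = spec.split(",")
    [name] * (weight || 1).to_i # a missing weight is treated as 1
  end
end

slots = expand_queues(["critical,4", "default,2", "low"])

# Share of polling slots each queue would get under this simple model.
shares = slots.tally.transform_values { |n| n.fdiv(slots.size).round(2) }
puts shares
```

Under this model, `low` still gets a turn, just far less often than `critical`.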

I saw somewhere that the developer of sidekiq strongly recommends not going above 50 threads (-c 50), but yes. I’m hoping it’ll use all 50 as we have 64 cores available. :slight_smile:

I think I saw the same page and that it also said that as many as 100 was OK, but it sounds like you know at least as much about it as I do. :slight_smile:

Do you know if sidekiq can be safely restarted, once it has been started and is processing messages?

Yes. It just processes a queue, from what I can discern.

Ah, ok. We’re done with the mbox.rb import and have been running sidekiq since last night. The dev site now shows 348,000 topics in “uncategorized” and none in any of the actual categories (which map to the mailing lists on the old server). This is despite having put all the mbox files in individual subdirs of MBOX_SUBDIR and mapping them in mbox.rb as instructed.

I should mention that all the categories have been created though, with their proper names. They’re just empty.

Is this normal? Will sidekiq move them to their proper categories in the GUI as it progresses? Do you think it’s worth it to wait and see? Or should we stop the process now and go back to the drawing board before wasting a ton of CPU hours on this?

Sidekiq isn’t going to change that.

I’d start over with, say, one month of messages (or a hundred messages) in each category.

Thanks. That means we have a challenge. I think the only way forward is to do the following:

  • Import the last month (or two, or three) of each mailing list into the production site, say from September 1, 2016.
  • Go live with the production site.
  • Backfill each list, starting with August 2016 and then going backwards month by month until we’ve reached the beginning (November 1993).
  • Do this backfill on the live production site.

Can you see a way of doing this? If we can’t, it’ll take us forever to make the switch.
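
To make the backfill order concrete, here is a minimal Ruby sketch that lists the monthly batches from August 2016 back to November 1993. It only generates the schedule; `backfill_months` is a hypothetical helper, and how each month’s mbox file is actually fed to the importer is left open.

```ruby
require "date"

# Hypothetical helper: returns the first day of each month, newest first,
# from `newest` back to `oldest` (both inclusive).
def backfill_months(newest, oldest)
  months = []
  current = Date.new(newest.year, newest.month, 1)
  stop = Date.new(oldest.year, oldest.month, 1)
  while current >= stop
    months << current
    current = current.prev_month
  end
  months
end

batches = backfill_months(Date.new(2016, 8, 1), Date.new(1993, 11, 1))
puts batches.first.strftime("%Y-%m") # first batch to import
puts batches.last.strftime("%Y-%m")  # last batch to import
puts batches.size                    # number of monthly batches
```

Each entry could then be used to pick the matching per-month archive for every list.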

I just did, on GitHub. :slight_smile:

Gunnar

Nice! Thanks, @Gunnar. The top post is a wiki, so please feel free to edit whatever seems unclear.

New issue. :slight_smile:

After importing 1.8 million messages and 38K users on my dev node, and restoring the backup to my production system, I went over all the site settings just to make sure everything was OK. I unchecked the ‘disable emails’ setting and watched my server blow up.

Our archives go back to 1993, and Discourse decided to send multiple digest emails to all the 38K users. The immediate result was of course a suspension from AWS SES, and having to shut down all outgoing email.

I think it would be a good idea to mark all messages and users as “up to date” as the last part of the import process.

How can I stop Discourse from sending out all the digests that are already queued?

Gunnar

And another thing: How does Discourse/sidekiq decide which messages to send out in digests? The reason I ask is that we’ve run into a bit of a Y2K issue (16 years after the fact!)

Some of our older messages that were imported (going back to 1993) had date headers with 2-digit year fields. So “93” instead of 1993. Discourse seems to think that these messages were posted in the future, in 2093. Will those messages now get included in every digest going forward?
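
One way to avoid this on a re-import would be to widen two-digit years before the date header is parsed. Below is a minimal Ruby sketch of that idea. It is not part of mbox.rb; `expand_two_digit_year` and the pivot of 70 are assumptions chosen for illustration (anything from 70 to 99 becomes 19xx, anything below becomes 20xx).

```ruby
require "time"

# Hypothetical helper: widen a two-digit year in an RFC 2822-style Date header.
# Headers that already carry a four-digit year are returned unchanged.
def expand_two_digit_year(header, pivot: 70)
  header.sub(/\b(\d{1,2} [A-Z][a-z]{2}) (\d{2})(?= \d{1,2}:)/) do
    day_month = $1
    yy = $2.to_i
    century = yy >= pivot ? 1900 : 2000
    "#{day_month} #{century + yy}"
  end
end

fixed = expand_two_digit_year("Mon, 15 Nov 93 10:30:00 -0500")
puts fixed
puts Time.rfc2822(fixed).year
```

For data that is already imported, the same pivot rule could be applied when correcting the stored timestamps.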

Thanks,
Gunnar

I don’t recommend enabling digest emails for migrated sites with large, old user bases going back a decade or more. What I favor is marking any account that hasn’t posted in the last (x) months as unvalidated, meaning the user is required to log in again and verify they control that email address. This also prevents unwanted bulk summary / digest emails from being sent.

For the 1993 issue, we would need a PR, or you can massage your data via SQL.

Agreed, but here’s what I’d like to do: turn off digest emails for all users. Those who want them can turn them back on. How would I best go about doing that?

When we imported all of our 1.8 million emails, lots of users were automatically created from email addresses that haven’t been in use for years, or even decades. However, because of the Y2K issue mentioned above, some of these are flagged as having been active recently. For us, the only possible solution is to turn off digests for everyone, ASAP.

We’re actually getting flagged as spammers due to our very high rate of bounces, so this is quite urgent.

Thanks,
Gunnar

No need, because if you mark accounts unvalidated, they can’t receive email by definition.

I can’t do that now, as we’ve been in production for a couple of days and a few thousand of the users have already used the “forgot password” mechanism, logged in, and started using the forums. If I were to mark them as unvalidated, they’d have to re-verify their email addresses, correct?

Isn’t there a way of just turning off digests for everyone in the database?

Thanks!

Gunnar

Yes, but that process is basically identical to “forgot password”, so from the user’s perspective it is the same.

Still, I can’t do that; too many users have already successfully started using the new forums. We squandered a lot of goodwill with the whole email disaster, so I don’t want to antagonize the user base further.

Any ideas on how to turn off digests for everyone in bulk?

Thanks!

Yes, you can do that: only users who have not already logged in would be affected. It is trivial to write the query clause on last post date or last seen (requires login to be seen…)
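
The kind of clause described here, limiting a bulk change to accounts that have never been seen, can be sketched in plain Ruby as follows. This is an in-memory illustration only: `UserRow`, its field names, and `disable_digests_for_unseen` are hypothetical and are not taken from the Discourse schema, so the real change would need the equivalent condition written against the actual tables.

```ruby
# Hypothetical stand-in for user rows; field names are illustrative only.
UserRow = Struct.new(:email, :last_seen_at, :digests_enabled)

# Turn digests off only for accounts that have never logged in
# (modelled here as a nil last_seen_at). Active users are left alone.
def disable_digests_for_unseen(users)
  users.each { |u| u.digests_enabled = false if u.last_seen_at.nil? }
end

users = [
  UserRow.new("old@example.com", nil, true),
  UserRow.new("active@example.com", Time.now, true)
]

disable_digests_for_unseen(users)
users.each { |u| puts "#{u.email}: digests #{u.digests_enabled ? 'on' : 'off'}" }
```

The point of the condition is that people who already reset their password and logged in keep whatever settings they have.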