Importing mailing lists (mbox, Listserv, Google Groups, emails, ...)

(Derek Magill) #31

I deleted the entire standalone directory tree.


(Jay Pfaffman) #32

And rebuild after that?


(Derek Magill) #33

Yup, rebuilt a couple of different ways. I ran a plain ./launcher rebuild import, and before that I went so far as to remove the Docker images using docker rmi on top of deleting /var/discourse/shared/standalone… It still sees everything as already imported.

One note: I’m on AWS and my Postgres DB is on RDS. I wonder if that has anything to do with it.


(Gerhard Schlager) #34

I’m not familiar with RDS, but I guess it doesn’t store its data in the standalone directory. You’ll need to drop the database. Unfortunately I can’t help you with that.


(Derek Magill) #35

Since I copied the app.yml file into import.yml, and that’s what’s connecting to the Postgres DB, I wonder whether updating import.yml to use a local Postgres instance instead of RDS would do it. I’m going to give that a try in a second.


(Derek Magill) #36

OK, so for the record, the procedure as outlined assumes a local postgres install in the container.

I was trying to be cute and use the AWS-supported RDS Postgres DB, so when I copied over my app.yml file and added the importer template, it carried over the connection to the still-running DB. That seems to be why the importer recognized previously seen posts. After rebuilding the importer container and having it use a local Postgres instance, the importer ran as expected.
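
A quick check before rebuilding can catch this situation. The sketch below is only an illustration: it assumes the container definition lives at /var/discourse/containers/import.yml and that an external database is configured through the DISCOURSE_DB_HOST environment variable, as in the usual Discourse setup for an external PostgreSQL server. The function name uses_external_db is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) if the given container file
# sets an uncommented DISCOURSE_DB_HOST, i.e. it points at a database that
# lives outside the container.
uses_external_db() {
  grep -Eq '^[[:space:]]*DISCOURSE_DB_HOST:' "$1"
}

# Example (the path is an assumption; adjust it to your install):
# if uses_external_db /var/discourse/containers/import.yml; then
#   echo "import.yml still points at an external database"
# fi
```

If the check succeeds, the import container would talk to the same database as the live site, which matches the behavior described above.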

I wish I could report on the whole process, but due to my bungling I hit my Let’s Encrypt cert request limit and now have to wait a week to get the whole site back up. Can’t seem to get it up with normal HTTP. Oh well.

Thanks for the help!


(Yaw Anokwa) #37

For folks importing large mboxes or mboxes that might have data problems (e.g. corrupt images), splitting the mbox into individual files makes it easier to find the offending email and remove it from the import. I think the import is also faster, but I didn’t time it.

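I use formail, which comes with procmail, to split them.

apt install procmail;
# the output directory must exist before splitting
mkdir -p split;
# presetting FILENO sets the first message number and the zero-padded width
export FILENO=0000;
formail -ds sh -c 'cat > split/msg.$FILENO' < mbox;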
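
To confirm that the split produced one file per message, the message count can be compared with the number of files. This is a rough sketch, assuming the mbox is named mbox and the output directory is split as in the commands above. The helper names are illustrative, and counting lines that start with "From " is only a heuristic for message boundaries.

```shell
#!/bin/sh
# Count mbox message separators: lines beginning with "From " (heuristic).
count_mbox_messages() {
  grep -c '^From ' "$1"
}

# Count regular files directly inside the split directory.
count_split_files() {
  find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
}

# Example:
# if [ "$(count_mbox_messages mbox)" -eq "$(count_split_files split)" ]; then
#   echo "counts match"
# else
#   echo "counts differ, look for a damaged message"
# fi
```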

(Gerhard Schlager) #38

We’ve added a new script to Discourse (google_groups.rb) that lets you scrape Google Groups and integrates with the mbox import script. And the latter finally lost its “experimental” flag and replaced the old import script. :tada:

The Howto in the first post has been updated with some additional instructions for Google Groups.


(Sruly Markowitz) #39

I thought this would only delete the data in the import container, but it has completely nuked my entire installation to start from scratch.

Is there a way to do multiple imports iteratively without deleting everything first? I want to import all messages from a Google Group, and then at a later date import newer messages from the same group. I would also need to do imports from a different group, and that is not working after the initial import, which did work.


(Gerhard Schlager) #40

Sure, doing incremental imports should work as long as you don’t delete any data from the /var/discourse/shared/standalone/import directory. Only add data to that directory as described in step 1.5.

If you want to import messages from the same group, put the messages in the existing folder in the import/data directory. Create a new directory in import/data when you want to import a new group.
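
The layout rule above can be sketched as a small helper. This is only an illustration: add_to_group is a made-up name, and IMPORT_DATA is assumed to point at the import/data directory under /var/discourse/shared/standalone.

```shell
#!/bin/sh
# Assumed location of the importer's data directory; override as needed.
IMPORT_DATA="${IMPORT_DATA:-/var/discourse/shared/standalone/import/data}"

# Hypothetical helper: copy one or more message files into the folder of a group.
# It creates the folder for a new group, reuses it for an existing one, and
# never deletes or overwrites files that are already there.
add_to_group() {
  group="$1"
  shift
  mkdir -p "$IMPORT_DATA/$group" || return 1
  for f in "$@"; do
    dest="$IMPORT_DATA/$group/$(basename "$f")"
    # Skip files that already exist so data from earlier imports stays untouched.
    [ -e "$dest" ] || cp "$f" "$dest" || return 1
  done
}

# Example:
# add_to_group my-list new-messages-2018-06.mbox
```

Because files with an existing name are skipped, newer messages should be saved under a new file name before they are added.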


(Sruly Markowitz) #41

Thanks. If I am using the Google Groups importer, do I just run that script again? Will it know not to download the previously downloaded ones?


(Gerhard Schlager) #42

Yes, it will skip already downloaded topics.


(Mark A Schmucker) #43

My html tables are not importing properly from Google Groups. In Google Groups, it looks like a well-formatted html table (first few rows shown):

image

And inspecting the Google Group table element, it also looks normal:

image

But the imported table, in Discourse, is not a table at all; it’s a series of paragraph elements:

image

And inspecting the Discourse element:

image

In the importer’s settings.yml, I have prefer_html = true (which is the default).

In the site settings, I have “Incoming email prefer html” also true (which is the default).

I’ve shown one table as an example, but all my HTML tables have the same issue, not just this one.

Any ideas?


(Mark A Schmucker) #44

What is best practice for ensuring the list of users does not fragment as a result of the import?

For example, if I have an existing Discourse user JohnSmith with email johnsmith@example.com, and I import from Google Groups where his username was John_Smith and email was johnsmith@example.com, will Discourse recognize this is the same user, based on the common email address?

What if I import first, and John Smith joins Discourse later with username JohnS123? Again, will it recognize this is the same user?

I’ll be using SSO with a WordPress site, if that matters.


(Gerhard Schlager) #45

Users are matched by email address, so it will work.


(Mark A Schmucker) #46

Thanks @gerhard. And there is no particular advantage to import-first-join-later, or vice versa?


(Gerhard Schlager) #47

I don’t think so. It should work either way. Choose whatever you prefer…


#48

I am new to Discourse and my Docker and Linux knowledge is very basic. But I am willing and eager to learn!
I have set up a Discourse instance and managed to install and launch the importer.
However, I am stuck at step 1.5:
I have a test mbox for importing on my desktop. How do I tell the importer where to look for it, or rather, how do I upload the files to standalone/import/data?
Any help appreciated.


(Gerhard Schlager) #49

You can use an SCP client (I recommend WinSCP if you are using Windows) and upload the mbox to /var/discourse/shared/standalone/import/data
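
From a Linux or macOS desktop, the command-line scp tool does the same job. The helper below is a hedged sketch that only prints the command so it can be checked before running it. It assumes the import directory mentioned earlier in the thread (/var/discourse/shared/standalone/import/data) and a per-list subfolder that already exists on the server. The function name, user, host and folder name are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: print the scp command that would upload a local mbox
# into a per-list folder on the server. Nothing is transferred here.
# Values are inserted as-is, so avoid spaces in file or folder names.
upload_command() {
  file="$1"
  host="$2"
  list="$3"
  printf 'scp %s %s:/var/discourse/shared/standalone/import/data/%s/\n' \
    "$file" "$host" "$list"
}

# Example (user, host and folder name are placeholders):
# upload_command test.mbox root@forum.example.com my-list
```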


#50

Thanks Gerhard.

All our mailing list posts start with the same characters in the subject line, i.e. “xxx-Forum: ”. Is there a way to truncate the first 11 characters from the subject line on import to Discourse?
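
The thread does not say whether the importer itself can do this. One possible workaround, offered only as a sketch, is to strip the prefix from the mbox before importing. The example assumes the prefix is literally "xxx-Forum: " and that subjects are plain, unfolded, unencoded header lines; the function name is made up.

```shell
#!/bin/sh
# Hypothetical preprocessing step: remove a fixed prefix from Subject headers.
# Reads an mbox on stdin and writes the modified mbox to stdout.
# Also handles replies of the form "Subject: Re: xxx-Forum: ...".
strip_subject_prefix() {
  sed -E 's/^(Subject: (Re: )?)xxx-Forum: /\1/'
}

# Example (work on a copy and keep the original file):
# strip_subject_prefix < original.mbox > cleaned.mbox
```

A body line that happens to start with "Subject: xxx-Forum: " would be changed as well, so it is worth comparing the two files before importing.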