Importing mailing lists (mbox, Listserv, Google Groups, emails, ...)

(Gerhard Schlager) #72

It sounds like you skipped the following step and the messages downloaded from your Google Group do not contain email addresses and proper message ids.

Also, make sure that the split_regex is an empty string. Enabling group_messages_by_subject is most likely a very bad idea.

3 Likes

(alexknowshtml) #73

Hm, the account I used to download the messages is a manager for the group.

I’m doing a fresh download to a fresh install now. Is there something I can look for in the downloads to ensure that they contain the correct information before running the import?

0 Likes

(Gerhard Schlager) #74

The messages must contain the following headers:

Date: Thu, 7 Feb 2019 12:51:12 -0800 (PST)
From: <a valid email address>
To: <a valid email address>
Message-Id: <some id>

Replies need to have an In-Reply-To: or References: header.

I’ve seen that only when the email addresses in the headers were censored or the split_regex was wrong…
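The header requirements above can be checked before running the import. What follows is only a rough sketch using Python's standard `email` module, not part of the importer itself. The function name `find_problems` and the "subject starts with Re:" test for spotting replies are assumptions made for illustration.

```python
import sys
from email import message_from_bytes
from email.utils import parseaddr
from pathlib import Path

# Headers every downloaded message is expected to carry.
REQUIRED_HEADERS = ("Date", "From", "To", "Message-Id")


def find_problems(raw):
    """Return a list of header problems for one raw message (bytes)."""
    msg = message_from_bytes(raw)
    problems = []
    for name in REQUIRED_HEADERS:
        if not str(msg.get(name, "")).strip():
            problems.append(f"missing {name}")
    for name in ("From", "To"):
        value = str(msg.get(name, ""))
        if value.strip() and "@" not in parseaddr(value)[1]:
            problems.append(f"{name} has no usable email address")
    # Heuristic (assumption): a subject starting with "Re:" marks a reply.
    subject = str(msg.get("Subject", "")).strip().lower()
    has_parent = msg.get("In-Reply-To") or msg.get("References")
    if subject.startswith("re:") and not has_parent:
        problems.append("reply without In-Reply-To or References")
    return problems


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_eml.py /path/to/downloaded/messages
    for path in sorted(Path(sys.argv[1]).rglob("*.eml")):
        for problem in find_problems(path.read_bytes()):
            print(f"{path}: {problem}")
```

If the script prints nothing for the download directory, every file has the four headers and a parseable address. It does not prove the addresses are real.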

3 Likes

(alexknowshtml) #75

I think I found the problem - I was using a login with the correct manager credentials, but had 2FA on that email. That “error” was silent so I totally missed it the first time.

Disabled that and I think the new download worked correctly - all of the individual .eml files look perfect. Thanks for your help.

5 Likes

(alexknowshtml) #76

Howdy, me again with a different question!

I’ve got the google groups importer running beautifully and it imported tens of thousands of threads. I was even able to re-run the importer, and it nabbed new threads.

But today I went to run the importer again and it came back empty-handed. It didn’t detect the new threads. I don’t see any obvious errors, but was wondering if it’s possible that the google groups XML that the importer pulls from is being cached or something along those lines? Or is there somewhere else I can look to periodically “sync” the google group with the discourse while we complete the migration process?

1 Like

(alexknowshtml) #77

Friendly bump - I tried this again ~24 hours later (in case I needed google’s XML cache to refresh or something) and it still isn’t getting the latest posts.

As far as I can tell, the problem is happening in google_groups.rb: new .eml files aren’t being generated for the latest posts. I haven’t been able to figure out what might cause it by examining the script, though.

Any suggestions? Thank you!

0 Likes

(Gerhard Schlager) #78

It’s hard to tell, but I suspect it might have problems logging in again. Maybe Google wants a verification code. I’ve seen that happen from time to time even when 2FA is disabled. :man_shrugging:

1 Like

(alexknowshtml) #79

Hm - I think you might be right, but I’m not sure how to get around it.

Is there a way to see more closely what selenium (I think that’s what’s doing the actual logging in, right?) is doing in the background?

0 Likes

(alexknowshtml) #80

I’m sorry to keep coming back with this challenge, I’ve tried different accounts with manager permissions and the script still isn’t seeing the newest messages.

Any clues about files that I might try to move or remove that would force it to get the feed fresh? I’m cautious because we have done a LOT of work to categorize the topics that are already imported, so I don’t want to accidentally do something that breaks the “resumable” part of the importer any further.

0 Likes

(alexknowshtml) #81

Good news! I figured out the issue, and this may be something to account for in the scraper:

I had a pinned topic that announced the migration, and it seems like the scraper was seeing that and somehow missing the new posts since that one. After I unpinned that topic, the scraper started working again.

Bad news! I have a new problem :slight_smile: that I’m hoping you can help with.

We’ve been working to categorize posts that were imported, and that’s going well. I’ve also been merging staged users into some early registered regulars’ accounts so that their post history and join date would be accurate. But I think that is causing an issue with the import_mbox.sh command. This is the error I’m getting.

If I absolutely have to, I’ll freeze the old group to new posts and come up with a manual migration plan for anything posted since the migration announcement. But I would like to avoid this if possible.

Thanks again for the help so far!

2 Likes

(Jay Pfaffman) #82

That email address is indeed invalid. What do you want to happen? Do you want to skip the user or create it with a bogus email address? (I thought I’d seen that there was some logic that dealt with this, but maybe this is a new kind of missing email address.)

0 Likes

(alexknowshtml) #83

So here’s what happened - I believe the importer created that bogus/staged username previously, and I merged that staged user into a REAL user, and that staged username is gone, so it doesn’t know what to do. I can imagine that being a new-ish kind of missing email address, since I imagine the importer keeps track of users it created during past runs of the importer.

What I’d like to see happen when re-running the import, one of two options:

1 - create a new staged user. This would be perfectly fine IMO, since I can manually merge again later.
2 - if I’m allowed to dream, an interactive prompt that lets me: give it a real/active username in the system, create a new staged user (just like option 1), or skip. This would give me the most control, but I also recognize it’s likely a LOT more work.

Again, the former would help a lot. The latter is likely overkill for 99.9% of scenarios, probably even mine :slight_smile:

0 Likes

(Tim Sawyer) #84

I’ve started having a problem with mbox import. Below are the last few lines before the import appears to stop. I exited and stopped the importer, then rebuilt app and started the import again. I got the same crash the second time.

	 3: from /var/www/discourse/vendor/bundle/ruby/2.5.0/gems/activesupport-5.2.2/lib/active_support/callbacks.rb:198:in `block (2 levels) in halting'
	 2: from /var/www/discourse/vendor/bundle/ruby/2.5.0/gems/activesupport-5.2.2/lib/active_support/callbacks.rb:426:in `block in make_lambda'
	 1: from /var/www/discourse/app/models/user_option.rb:35:in `set_defaults'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/activemodel-5.2.2/lib/active_model/attribute_methods.rb:430:in `method_missing': undefined method `email_always=' for #<UserOption:0x0000558b89836f70> (NoMethodError)
Did you mean?  email_level_was
root@community-ord-import:/var/www/discourse#

1 Like

(Tim Sawyer) #85

Hmmm, looks like I’m having the same problem that @alexknowshtml is having.

0 Likes

(Gerhard Schlager) #86

Thanks for reporting that issue. I’ll fix it.

3 Likes

(Gerhard Schlager) #87

@tisawyer Did you rebuild the app container or upgrade the app without rebuilding the import container? Try rebuilding both containers. I can’t reproduce that error.

@alexknowshtml It looks like the Google Groups scraper fails to log in or the user doesn’t have the right permissions to see email addresses. It needs to be a Manager or Owner, otherwise you’ll get censored, invalid email addresses which look like foo...@example.com.

I’ve updated the scraper to warn about missing permissions and to try to detect whether the login failed. Can you try again?

You might need to delete the index.db file and all the topic URLs from status.yml, starting with the topics that were downloaded with censored email addresses. You’ll need to start from scratch if you aren’t sure which lines to delete. The script should detect and skip existing users and posts during the import.
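To find out which downloaded messages contain censored addresses, something like the following might help. It is a minimal sketch, not part of the scraper. It assumes that censored addresses always have a local part ending in `...` (as in foo...@example.com) and that the downloads are .eml files. The function name `censored_addresses` is made up for this example.

```python
import re
import sys
from email import message_from_bytes
from pathlib import Path

# Matches addresses whose local part ends in "...", e.g. "foo...@example.com".
CENSORED = re.compile(r'[^\s<>",;:]*\.\.\.@[^\s<>",;:]+')


def censored_addresses(raw):
    """Return censored-looking addresses found in the From/To/Cc headers."""
    msg = message_from_bytes(raw)
    found = []
    for name in ("From", "To", "Cc"):
        for value in msg.get_all(name, []):
            found.extend(CENSORED.findall(str(value)))
    return found


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_censored.py /path/to/downloaded/messages
    for path in sorted(Path(sys.argv[1]).rglob("*.eml")):
        hits = censored_addresses(path.read_bytes())
        if hits:
            print(f"{path}: {', '.join(hits)}")
```

The files it lists are the ones to download again. Which entries in status.yml they correspond to still has to be worked out by hand.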

Regarding your problems with users… The import script should find existing users by email address. So, as long as the data from Google Groups contains valid email addresses, everything should work. But honestly, I’ve never tried a lot of merging of users during imports. This is kinda uncharted territory. You might need to start hacking the import script if it doesn’t work the way you want. :wink:

3 Likes

(Tim Sawyer) #88

Rebuilding app and import fixed it. Thank you!

I did rebuild app without rebuilding import. I’ve been doing that since the initial install of import. Is it necessary to always rebuild import when rebuilding app?

2 Likes

(Gerhard Schlager) #89

Rebuilding both containers is always a good idea because diverging code can produce errors like the one you encountered.

3 Likes