Importing mailing lists (mbox, Listserv, Google Groups, emails, ...)

So here’s what happened - I believe the importer created that bogus/staged username previously, and I merged that staged user into a REAL user, and that staged username is gone, so it doesn’t know what to do. I can imagine that being a new-ish kind of missing email address, since I imagine the importer keeps track of users it created during past runs of the importer.

What I’d like to see happen when re-running the import is one of two options:

1 - create a new staged user. This would be perfectly fine IMO, since I can manually merge again later.
2 - if I’m allowed to dream, an active prompt asking me to: give it a real/active username in the system, create a new staged user (just like option 1), or skip. This would give me the most control, but I also recognize it’s likely a LOT more work.

Again, the former would help a lot. The latter is likely overkill for 99.9% of scenarios, probably even mine :slight_smile:

I’ve started having a problem with the mbox import. Below are the last few lines before the import appears to stop. I exited and stopped the importer, then rebuilt the app and started the import again. I got the same crash the second time.

	 3: from /var/www/discourse/vendor/bundle/ruby/2.5.0/gems/activesupport-5.2.2/lib/active_support/callbacks.rb:198:in `block (2 levels) in halting'
	 2: from /var/www/discourse/vendor/bundle/ruby/2.5.0/gems/activesupport-5.2.2/lib/active_support/callbacks.rb:426:in `block in make_lambda'
	 1: from /var/www/discourse/app/models/user_option.rb:35:in `set_defaults'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/activemodel-5.2.2/lib/active_model/attribute_methods.rb:430:in `method_missing': undefined method `email_always=' for #<UserOption:0x0000558b89836f70> (NoMethodError)
Did you mean?  email_level_was
root@community-ord-import:/var/www/discourse#

Hmmm, looks like I’m having the same problem that @alexknowshtml is having.

Thanks for reporting that issue. I’ll fix it.

@tisawyer Did you rebuild the app container or upgrade the app without rebuilding the import container? Try rebuilding both containers. I can’t reproduce that error.

@alexknowshtml It looks like the Google Groups scraper fails to log in or the user doesn’t have the right permissions to see email addresses. It needs to be a Manager or Owner, otherwise you’ll get censored, invalid email addresses which look like foo...@example.com.

I’ve updated the scraper to warn about missing permissions and to try to detect whether the login failed. Can you try again?
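If you want to check your already downloaded data for censored addresses, something like the following might help. It is only a minimal sketch, assuming that censored addresses always contain `...` right before the `@` (as in foo...@example.com). The helper names `censored_email?` and `partition_addresses` are made up for this example and are not part of the scraper.

```ruby
# Hypothetical helpers; not taken from the actual Google Groups scraper.
# Assumption: a censored address has three dots directly before the "@",
# e.g. "foo...@example.com".
CENSORED_EMAIL = /\.\.\.@/

def censored_email?(address)
  address.to_s.match?(CENSORED_EMAIL)
end

# Splits a list of addresses into censored and usable ones.
def partition_addresses(addresses)
  censored, usable = addresses.partition { |address| censored_email?(address) }
  { censored: censored, usable: usable }
end

result = partition_addresses(["foo...@example.com", "alice@example.com"])

unless result[:censored].empty?
  warn "Found #{result[:censored].size} censored address(es). " \
       "Check that the account is a Manager or Owner of the group."
end
```

Any topic that produced a censored address is a candidate for being downloaded again.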

You might need to delete the index.db file and all the topic URLs from status.yml, starting with the topics that were downloaded with censored email addresses. If you aren’t sure which lines to delete, you’ll need to start from scratch. The script should detect and skip existing users and posts during the import.

Regarding your problems with users… The import script should find existing users by email address. So, as long as the data from Google Groups contains valid email addresses, everything should work. But honestly, I’ve never tried a lot of merging of users during imports. This is kinda uncharted territory. You might need to start hacking the import script if it doesn’t work the way you want. :wink:

Rebuilding app and import fixed it. Thank you!

I did rebuild app without rebuilding import. I’ve been doing that since the initial install of import. Is it necessary to always rebuild import when rebuilding app?

Rebuilding both containers is always a good idea because diverging code can produce errors like the one you encountered.

@gerhard is it possible to do this after an instance of Discourse is stood up but not in full use yet?

Yes, that usually works. Create a backup and give it a try.

@gerhard how would you envision the identity matching between Google Groups and Discourse? If a user exists on both platforms, it seems like an easy match based on the user’s email address. However, if the user does not yet exist within Discourse, would it be imported as anonymous or somehow held in pending until that user signs up?

The import script creates a new user for every email address it finds in the data (mbox file, emails from Google Groups, etc.) unless there’s already an existing user with that email address. Newly created users are staged users by default, but you can change the settings to create regular users instead. And no, it doesn’t create anonymized users.
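To make the matching rule a bit more concrete, here is a rough sketch of "find by email, otherwise create a staged user". It uses an in-memory hash instead of Discourse’s models, so `User`, `UserDirectory` and `find_or_create` are hypothetical names for this example only. Comparing addresses case-insensitively is also an assumption of the sketch.

```ruby
# In-memory stand-in for the users table; the real importer works with
# Discourse's own models, which are not shown here.
User = Struct.new(:email, :staged, keyword_init: true)

class UserDirectory
  def initialize(existing = [], staged_by_default: true)
    @by_email = {}
    @staged_by_default = staged_by_default
    existing.each { |user| @by_email[normalize(user.email)] = user }
  end

  # Returns the existing user for this address, or creates a new one.
  # New users are staged unless the directory was configured otherwise.
  def find_or_create(email)
    key = normalize(email)
    @by_email[key] ||= User.new(email: key, staged: @staged_by_default)
  end

  private

  # Assumption of this sketch: addresses are compared case-insensitively.
  def normalize(email)
    email.to_s.strip.downcase
  end
end

directory = UserDirectory.new([User.new(email: "alice@example.com", staged: false)])
alice = directory.find_or_create("Alice@Example.com") # matches the existing user
bob = directory.find_or_create("bob@example.com")     # creates a new staged user
```

Running the lookup a second time for the same address returns the same user, which is why re-running an import should not create duplicates as long as the email addresses are valid.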

I’m trying to do the import from Google Groups as described in the first post, but I am running into the following exception. I’m running it with a groups user ID that has Owner permissions. Any suggestions?

Fetching gem metadata from https://rubygems.org/.........
Resolving dependencies...
Using rake 12.3.2
Using bundler 1.17.3
Using childprocess 1.0.1
Using connection_pool 2.2.2
Using mini_portile2 2.4.0
Using rubyzip 1.2.3
Using net-http-persistent 3.0.1
Using nokogiri 1.10.3
Using selenium-webdriver 3.142.3
Using webdrivers 4.1.0

Logging in...
Traceback (most recent call last):
	16: from script/import_scripts/google_groups.rb:266:in `<main>'
	15: from script/import_scripts/google_groups.rb:217:in `crawl'
	14: from script/import_scripts/google_groups.rb:137:in `login'
	13: from script/import_scripts/google_groups.rb:48:in `get'
	12: from script/import_scripts/google_groups.rb:30:in `driver'
	11: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver.rb:88:in `for'
	10: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/common/driver.rb:46:in `for'
	 9: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/common/driver.rb:46:in `new'
	 8: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/chrome/driver.rb:43:in `initialize'
	 7: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/remote/bridge.rb:56:in `handshake'
	 6: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/remote/bridge.rb:102:in `create_session'
	 5: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/remote/bridge.rb:167:in `execute'
	 4: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/remote/http/common.rb:64:in `call'
	 3: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/remote/http/default.rb:82:in `request'
	 2: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/remote/http/persistent.rb:54:in `response_for'
	 1: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/remote/http/default.rb:71:in `http'
/usr/local/lib/ruby/gems/2.6.0/gems/net-http-persistent-3.0.1/lib/net/http/persistent.rb:706:in `start': wrong number of arguments (given 0, expected 1) (ArgumentError)

Thanks for reporting the error. I fixed the import script.

Thanks, it got further this time, but now it looks like it is unable to find where to inject the 2FA code.

Logging in...

2-Step Verification is required.
Unlock on your phone and press Enter
or enter the code from your authenticator app
or enter the code you received via SMS (without the G- prefix)
Enter code: XXXXXX
2019-07-10 20:27:08 WARN Selenium [DEPRECATION] Selenium::WebDriver::Error::TimeOutError is deprecated. Use Selenium::WebDriver::Error::TimeoutError (ensure the driver supports W3C WebDriver specification) instead.
Failed to detect 'code' input on login page

Never mind. I fiddled with my Google account a bit more (disabled the Google prompt so it defaults to the authenticator app) and tried again. It’s scraping messages now.

Just a quick follow-up to report that my import succeeded, although with a few stumbling blocks along the way. First some background: Over 100k posts were imported, in about 30k topics, spanning back about 20 years. The group was originally a mailman list, and IIRC something else before that. Archives were imported into Groups back when we made the switch, so malformed messages due to that transition may be part of the issues.

I received a few exceptions like the following. Most of them were caused by spam messages and/or very old messages I didn’t mind losing, so I just deleted the posts in Groups and restarted. I’m mentioning this in case you want to handle these kinds of exceptions with a log message and skip to the next one.

2019-07-11 13:12:22 WARN Selenium [DEPRECATION] Selenium::WebDriver::Error::ElementNotVisibleError is deprecated. Use Selenium::WebDriver::Error::ElementNotInteractableError (ensure the driver supports W3C WebDriver specification) instead.
Failed to scrape topic at https://groups.google.com/forum/?_escaped_fragment_=topic/wxpython-users/HQuLjYFpkPg
Traceback (most recent call last):
	22: from script/import_scripts/google_groups.rb:263:in `<main>'
	21: from script/import_scripts/google_groups.rb:217:in `crawl'
	20: from script/import_scripts/google_groups.rb:73:in `crawl_categories'
	19: from script/import_scripts/google_groups.rb:73:in `each'
	18: from script/import_scripts/google_groups.rb:80:in `block in crawl_categories'
	17: from script/import_scripts/google_groups.rb:80:in `each'
	16: from script/import_scripts/google_groups.rb:80:in `block (2 levels) in crawl_categories'
	15: from script/import_scripts/google_groups.rb:98:in `crawl_topic'
	14: from script/import_scripts/google_groups.rb:98:in `each'
	13: from script/import_scripts/google_groups.rb:98:in `block in crawl_topic'
	12: from script/import_scripts/google_groups.rb:110:in `crawl_message'
	11: from script/import_scripts/google_groups.rb:65:in `find'
	10: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/common/search_context.rb:62:in `find_element'
	 9: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/remote/w3c/bridge.rb:547:in `find_element_by'
	 8: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/remote/w3c/bridge.rb:567:in `execute'
	 7: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/remote/bridge.rb:167:in `execute'
	 6: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/remote/http/common.rb:64:in `call'
	 5: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/remote/http/default.rb:114:in `request'
	 4: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/remote/http/common.rb:88:in `create_response'
	 3: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/remote/http/common.rb:88:in `new'
	 2: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/remote/response.rb:34:in `initialize'
	 1: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/remote/response.rb:72:in `assert_ok'
#0 0x555f0836b6e9 <unknown>: no such element: Unable to locate element: {"method":"css selector","selector":"pre"} (Selenium::WebDriver::Error::NoSuchElementError)
  (Session info: headless chrome=75.0.3770.100)
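
The "log a message and skip to the next one" idea could look roughly like this. It is just a sketch of the pattern and not how the scraper is actually written: `scrape_all` and the block passed to it are hypothetical stand-ins for whatever fetches a single topic.

```ruby
require "logger"
require "stringio"

# Hypothetical wrapper: calls the given block for every topic URL and keeps
# going when a single topic raises an error.
def scrape_all(topic_urls, logger: Logger.new($stderr))
  scraped = []
  failed = []

  topic_urls.each do |url|
    begin
      scraped << yield(url)
    rescue StandardError => e
      # Log the failure and remember the URL so it can be retried later.
      logger.warn("Failed to scrape topic at #{url}: #{e.class}: #{e.message}")
      failed << url
    end
  end

  { scraped: scraped, failed: failed }
end

log_output = StringIO.new
result = scrape_all(%w[topic/a topic/b topic/c], logger: Logger.new(log_output)) do |url|
  # Simulated failure for one topic, standing in for a missing element.
  raise "no such element" if url == "topic/b"
  "content of #{url}"
end
```

Collecting the failed URLs in a list makes it possible to inspect or retry them after the run instead of restarting from the first failure.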

I’ve also seen a few of these while the import_mbox script was running:

undefined method `hex' for nil:NilClass
/var/www/discourse/app/models/upload.rb:137:in `base62_sha1'
/var/www/discourse/app/models/upload.rb:364:in `short_url_basename'
/var/www/discourse/app/models/upload.rb:120:in `short_url'
/var/www/discourse/lib/email/receiver.rb:1088:in `attachment_markdown'
/var/www/discourse/lib/email/receiver.rb:1049:in `block in add_attachments'
/var/www/discourse/lib/email/receiver.rb:1021:in `each'
/var/www/discourse/lib/email/receiver.rb:1021:in `add_attachments'
/var/www/discourse/script/import_scripts/mbox/importer.rb:137:in `format_raw'
/var/www/discourse/script/import_scripts/mbox/importer.rb:121:in `map_post'
/var/www/discourse/script/import_scripts/mbox/importer.rb:159:in `map_reply'
/var/www/discourse/script/import_scripts/mbox/importer.rb:105:in `block (2 levels) in import_posts'
/var/www/discourse/script/import_scripts/base.rb:501:in `block in create_posts'
/var/www/discourse/script/import_scripts/base.rb:500:in `each'
/var/www/discourse/script/import_scripts/base.rb:500:in `create_posts'
/var/www/discourse/script/import_scripts/mbox/importer.rb:97:in `block in import_posts'
/var/www/discourse/script/import_scripts/base.rb:880:in `block in batches'
/var/www/discourse/script/import_scripts/base.rb:879:in `loop'
/var/www/discourse/script/import_scripts/base.rb:879:in `batches'
/var/www/discourse/script/import_scripts/mbox/importer.rb:83:in `batches'
/var/www/discourse/script/import_scripts/mbox/importer.rb:91:in `import_posts'
/var/www/discourse/script/import_scripts/mbox/importer.rb:35:in `execute'
/var/www/discourse/script/import_scripts/base.rb:49:in `perform'
script/import_scripts/mbox.rb:16:in `<module:Mbox>'
script/import_scripts/mbox.rb:10:in `<module:ImportScripts>'
script/import_scripts/mbox.rb:9:in `<main>'

All in all, I think the conversion went well, considering the huge number of posts and the questionable source of some of them. Thanks for creating and maintaining these import tools.

Robin

Excellent, detailed feedback @RobinD42! We will definitely try to fold these changes in for future importers. Thank you. :hugs:

Hm, I tried running the google_groups.rb script, but got rejected by Google. The Selenium driver’s URL takes me to this error page, which states that I’m not using an official browser, and it can’t proceed with the script.

Has anyone run into this, and found a workaround?

Hello, I tried running the google_groups.rb script, but it failed to log in even though I have manager privileges. I just get this:

Logging in...
2019-08-14 21:08:10 WARN Selenium [DEPRECATION] Selenium::WebDriver::Error::TimeOutError is deprecated. Use Selenium::WebDriver::Error::TimeoutError (ensure the driver supports W3C WebDriver specification) instead.
Failed to login

Can anyone please help me out?

Unfortunately, Google seems to have made some changes to the login flow and now prevents automated logins with a headless browser. I’m planning to fix it next week or the week after. As a workaround, you can run the script in a development environment by removing the headless parameter.
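
For anyone wondering what "removing the headless parameter" amounts to, here is a small sketch. The helper `chrome_arguments` and the `IMPORT_HEADLESS` environment variable are invented for this example and the real script may build its options differently. The only thing taken as given is that Chrome runs headless when it is started with `--headless`.

```ruby
# Hypothetical helper; the option handling in google_groups.rb may differ.
# Returns the command-line arguments that would be passed to Chrome.
def chrome_arguments(headless: true)
  args = []
  args << "--headless" if headless
  args
end

# Assumed environment variable, chosen for this sketch only.
headless = ENV.fetch("IMPORT_HEADLESS", "true") != "false"
puts "Chrome arguments: #{chrome_arguments(headless: headless).inspect}"
```

With selenium-webdriver, each argument would typically be handed to `Selenium::WebDriver::Chrome::Options#add_argument`. Without `--headless` a visible browser window opens, so this only works on a machine with a display.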
