Importing mailing lists (mbox, Listserv, Google Groups, emails, ...)

@gerhard how would you envision the identity matching between Google Groups and Discourse? If a user exists on both platforms, matching on the user’s email address seems easy. However, if the user does not yet exist in Discourse, would the account be imported as anonymous or somehow held as pending until that user signs up?

The import script creates a new user for every email address it finds in the data (mbox file, emails from Google Groups, etc.) unless a user with that email address already exists. Newly created users are staged users by default, but you can change that in the settings to create regular users instead. And no, it doesn’t create anonymized users.
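
As a rough illustration of that matching rule, here is a minimal, self-contained Ruby sketch. The `User` struct, the `find_or_create_user` helper, and the `staged:` keyword are hypothetical names used only for this example; they are not the import script's actual code or Discourse's `User` model.

```ruby
# Hypothetical in-memory stand-in for the users table (not Discourse's User model).
User = Struct.new(:email, :staged)

# Returns the existing user for an email address, or creates a new one.
# New users are staged unless the caller asks for regular users.
def find_or_create_user(users_by_email, email, staged: true)
  key = email.strip.downcase
  users_by_email[key] ||= User.new(key, staged)
end

# One regular user already exists before the import starts.
users = { "alice@example.com" => User.new("alice@example.com", false) }

existing = find_or_create_user(users, "Alice@Example.com") # matches the existing account
created  = find_or_create_user(users, "bob@example.com")   # creates a new staged user
```

Passing `staged: false` in this sketch corresponds to the option of creating regular users instead of staged ones.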

I’m trying to do the import from Google Groups as described in the first post, but I am running into the following exception. I’m running it with a Groups user ID that has Owner permissions. Any suggestions?

Fetching gem metadata from https://rubygems.org/.........
Resolving dependencies...
Using rake 12.3.2
Using bundler 1.17.3
Using childprocess 1.0.1
Using connection_pool 2.2.2
Using mini_portile2 2.4.0
Using rubyzip 1.2.3
Using net-http-persistent 3.0.1
Using nokogiri 1.10.3
Using selenium-webdriver 3.142.3
Using webdrivers 4.1.0

Logging in...
Traceback (most recent call last):
	16: from script/import_scripts/google_groups.rb:266:in `<main>'
	15: from script/import_scripts/google_groups.rb:217:in `crawl'
	14: from script/import_scripts/google_groups.rb:137:in `login'
	13: from script/import_scripts/google_groups.rb:48:in `get'
	12: from script/import_scripts/google_groups.rb:30:in `driver'
	11: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver.rb:88:in `for'
	10: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/common/driver.rb:46:in `for'
	 9: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/common/driver.rb:46:in `new'
	 8: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/chrome/driver.rb:43:in `initialize'
	 7: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/remote/bridge.rb:56:in `handshake'
	 6: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/remote/bridge.rb:102:in `create_session'
	 5: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/remote/bridge.rb:167:in `execute'
	 4: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/remote/http/common.rb:64:in `call'
	 3: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/remote/http/default.rb:82:in `request'
	 2: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/remote/http/persistent.rb:54:in `response_for'
	 1: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/remote/http/default.rb:71:in `http'
/usr/local/lib/ruby/gems/2.6.0/gems/net-http-persistent-3.0.1/lib/net/http/persistent.rb:706:in `start': wrong number of arguments (given 0, expected 1) (ArgumentError)

Thanks for reporting the error. I fixed the import script.

Thanks, it got further this time, but now it looks like it is unable to find where to inject the 2FA code.

Logging in...

2-Step Verification is required.
Unlock on your phone and press Enter
or enter the code from your authenticator app
or enter the code you received via SMS (without the G- prefix)
Enter code: XXXXXX
2019-07-10 20:27:08 WARN Selenium [DEPRECATION] Selenium::WebDriver::Error::TimeOutError is deprecated. Use Selenium::WebDriver::Error::TimeoutError (ensure the driver supports W3C WebDriver specification) instead.
Failed to detect 'code' input on login page

Never mind. I fiddled with my Google account a bit more (I disabled the Google prompt so that it defaults to the authenticator app) and tried again. It’s scraping messages now.

Just a quick follow-up to report that my import succeeded, although with a few stumbling blocks along the way. First some background: over 100k posts were imported, in about 30k topics, going back about 20 years. The group was originally a Mailman list, and IIRC something else before that. The archives were imported into Groups back when we made the switch, so messages malformed during that transition may be behind some of the issues.

I received a few exceptions like the following. Most of them were caused by spam messages or very old messages I didn’t mind losing, so I just deleted those posts in Groups and restarted. I’m mentioning this in case you want to handle these kinds of exceptions by logging a message and skipping to the next topic.

2019-07-11 13:12:22 WARN Selenium [DEPRECATION] Selenium::WebDriver::Error::ElementNotVisibleError is deprecated. Use Selenium::WebDriver::Error::ElementNotInteractableError (ensure the driver supports W3C WebDriver specification) instead.
Failed to scrape topic at https://groups.google.com/forum/?_escaped_fragment_=topic/wxpython-users/HQuLjYFpkPg
Traceback (most recent call last):
	22: from script/import_scripts/google_groups.rb:263:in `<main>'
	21: from script/import_scripts/google_groups.rb:217:in `crawl'
	20: from script/import_scripts/google_groups.rb:73:in `crawl_categories'
	19: from script/import_scripts/google_groups.rb:73:in `each'
	18: from script/import_scripts/google_groups.rb:80:in `block in crawl_categories'
	17: from script/import_scripts/google_groups.rb:80:in `each'
	16: from script/import_scripts/google_groups.rb:80:in `block (2 levels) in crawl_categories'
	15: from script/import_scripts/google_groups.rb:98:in `crawl_topic'
	14: from script/import_scripts/google_groups.rb:98:in `each'
	13: from script/import_scripts/google_groups.rb:98:in `block in crawl_topic'
	12: from script/import_scripts/google_groups.rb:110:in `crawl_message'
	11: from script/import_scripts/google_groups.rb:65:in `find'
	10: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/common/search_context.rb:62:in `find_element'
	 9: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/remote/w3c/bridge.rb:547:in `find_element_by'
	 8: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/remote/w3c/bridge.rb:567:in `execute'
	 7: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/remote/bridge.rb:167:in `execute'
	 6: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/remote/http/common.rb:64:in `call'
	 5: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/remote/http/default.rb:114:in `request'
	 4: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/remote/http/common.rb:88:in `create_response'
	 3: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/remote/http/common.rb:88:in `new'
	 2: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/remote/response.rb:34:in `initialize'
	 1: from /usr/local/lib/ruby/gems/2.6.0/gems/selenium-webdriver-3.142.3/lib/selenium/webdriver/remote/response.rb:72:in `assert_ok'
#0 0x555f0836b6e9 <unknown>: no such element: Unable to locate element: {"method":"css selector","selector":"pre"} (Selenium::WebDriver::Error::NoSuchElementError)
  (Session info: headless chrome=75.0.3770.100)
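
The log-and-skip handling suggested above could look roughly like the following sketch. It is self-contained and does not use Selenium; `scrape_topic` and `crawl_topics` are hypothetical stand-ins and are not the methods in google_groups.rb.

```ruby
require "logger"

# Hypothetical stand-in for the per-topic scraping step.
# It fails for one URL to simulate a missing element on the page.
def scrape_topic(url)
  raise "Unable to locate element" if url.include?("broken")
  "content of #{url}"
end

# Scrapes each topic, logging failures and moving on to the next topic
# instead of aborting the whole run.
def crawl_topics(urls, logger)
  scraped = []
  failed = []

  urls.each do |url|
    begin
      scraped << scrape_topic(url)
    rescue StandardError => e
      logger.warn("Failed to scrape topic at #{url}: #{e.message}")
      failed << url
    end
  end

  [scraped, failed]
end

logger = Logger.new($stderr)
scraped, failed = crawl_topics(%w[topic/a topic/broken topic/c], logger)
```

Collecting the failed URLs makes it possible to review or retry them after the run rather than deleting the posts and restarting.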

I’ve also seen a few of these while the import_mbox script was running:

undefined method `hex' for nil:NilClass
/var/www/discourse/app/models/upload.rb:137:in `base62_sha1'
/var/www/discourse/app/models/upload.rb:364:in `short_url_basename'
/var/www/discourse/app/models/upload.rb:120:in `short_url'
/var/www/discourse/lib/email/receiver.rb:1088:in `attachment_markdown'
/var/www/discourse/lib/email/receiver.rb:1049:in `block in add_attachments'
/var/www/discourse/lib/email/receiver.rb:1021:in `each'
/var/www/discourse/lib/email/receiver.rb:1021:in `add_attachments'
/var/www/discourse/script/import_scripts/mbox/importer.rb:137:in `format_raw'
/var/www/discourse/script/import_scripts/mbox/importer.rb:121:in `map_post'
/var/www/discourse/script/import_scripts/mbox/importer.rb:159:in `map_reply'
/var/www/discourse/script/import_scripts/mbox/importer.rb:105:in `block (2 levels) in import_posts'
/var/www/discourse/script/import_scripts/base.rb:501:in `block in create_posts'
/var/www/discourse/script/import_scripts/base.rb:500:in `each'
/var/www/discourse/script/import_scripts/base.rb:500:in `create_posts'
/var/www/discourse/script/import_scripts/mbox/importer.rb:97:in `block in import_posts'
/var/www/discourse/script/import_scripts/base.rb:880:in `block in batches'
/var/www/discourse/script/import_scripts/base.rb:879:in `loop'
/var/www/discourse/script/import_scripts/base.rb:879:in `batches'
/var/www/discourse/script/import_scripts/mbox/importer.rb:83:in `batches'
/var/www/discourse/script/import_scripts/mbox/importer.rb:91:in `import_posts'
/var/www/discourse/script/import_scripts/mbox/importer.rb:35:in `execute'
/var/www/discourse/script/import_scripts/base.rb:49:in `perform'
script/import_scripts/mbox.rb:16:in `<module:Mbox>'
script/import_scripts/mbox.rb:10:in `<module:ImportScripts>'
script/import_scripts/mbox.rb:9:in `<main>'

All in all, I think the conversion went well, considering the huge number of posts and the questionable source of some of them. Thanks for creating and maintaining these import tools.

Robin

Excellent, detailed feedback, @RobinD42. We will definitely try to fold these changes in for future importers! Thank you. :hugs:

Hm, I tried running the google_groups.rb script, but Google rejected the login. The URL opened by the Selenium driver leads to an error page stating that I’m not using an official browser, so the script can’t proceed.

Has anyone run into this, and found a workaround?

Hello, I tried running the google_groups.rb script, but it failed to log in even though I have manager privileges. I just get this:

Logging in...
2019-08-14 21:08:10 WARN Selenium [DEPRECATION] Selenium::WebDriver::Error::TimeOutError is deprecated. Use Selenium::WebDriver::Error::TimeoutError (ensure the driver supports W3C WebDriver specification) instead.
Failed to login

Can anyone please help me out?

Unfortunately, Google seems to have changed the login flow and now prevents automated logins with a headless browser. I’m planning to fix it next week or the week after. As a workaround, you can run the script in a development environment after removing the headless parameter.
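
For anyone unsure what removing the headless parameter means in practice, here is a minimal sketch. The helper names `chrome_arguments` and `build_driver` and the `--window-size` value are assumptions made for this example, and the actual script may configure Chrome differently. Only the `--headless` argument matters here.

```ruby
# Builds the Chrome command-line arguments; "--headless" is only added on request.
# The window size is an arbitrary value chosen for this example.
def chrome_arguments(headless:)
  args = ["--window-size=1280,1024"]
  args << "--headless" if headless
  args
end

# Creates a Chrome driver from those arguments.
# Calling this requires the selenium-webdriver gem and a local chromedriver.
def build_driver(headless:)
  require "selenium-webdriver"

  options = Selenium::WebDriver::Chrome::Options.new
  chrome_arguments(headless: headless).each { |arg| options.add_argument(arg) }
  Selenium::WebDriver.for(:chrome, options: options)
end

# With headless: false, Chrome opens a visible window on the development machine.
visible_args = chrome_arguments(headless: false)
```

Running with a visible browser needs a machine with a display, which is why this only works in a development environment.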
