Importing mailing lists (mbox, Listserv, Google Groups, emails, ...)

import

(Mark A Schmucker) #51

I’m scraping about 20,000 threads from Google Groups. I’ve done this successfully twice, just for practice, and it worked perfectly. Today I’m doing it for real, and of course it has failed twice so far. I have the tracebacks and can provide those if you like.

My immediate question is whether it’s safe to up-arrow and resume the scraping. It seems to be working: it skipped a bunch of topics and is now scraping more.


(Gerhard Schlager) #52

Maybe. :crystal_ball: It depends on the errors. Use your judgement. :wink:


(Gerhard Schlager) #53

You’ll find an answer in the 3rd post of this topic. :wink: Not the perfect solution, but it works.


(Mark A Schmucker) #54

It’s not working at all now. I don’t see what’s changed since my practice imports. I’ve checked the obvious things: Google Groups is accessible, there is plenty of disk space, etc. The errors vary somewhat, but here are a couple of representative tracebacks:

 Failed to scrape topic at https://groups.google.com/forum/?_escaped_fragment_=topic/[i'm redacting the topic name]
Traceback (most recent call last):
        11: from script/import_scripts/google_groups.rb:192:in `<main>'
        10: from script/import_scripts/google_groups.rb:147:in `crawl'
         9: from script/import_scripts/google_groups.rb:72:in `crawl_categories'
         8: from script/import_scripts/google_groups.rb:72:in `each'
         7: from script/import_scripts/google_groups.rb:72:in `step'
         6: from script/import_scripts/google_groups.rb:79:in `block in crawl_categories'
         5: from script/import_scripts/google_groups.rb:79:in `each'
         4: from script/import_scripts/google_groups.rb:79:in `block (2 levels) in crawl_categories'
         3: from script/import_scripts/google_groups.rb:97:in `crawl_topic'
         2: from script/import_scripts/google_groups.rb:97:in `each'
         1: from script/import_scripts/google_groups.rb:97:in `block in crawl_topic'
script/import_scripts/google_groups.rb:109:in `crawl_message': undefined method `[]' for nil:NilClass (NoMethodError)

(The undefined method above is open-bracket close-bracket, i.e. the [] lookup being called on nil. I couldn’t get it to format correctly using blockquote.)

and another one:

Scraping https://groups.google.com/forum/?_escaped_fragment_=topic//[i'm redacting the topic name]
Failed to scrape topic at https://groups.google.com/forum/?_escaped_fragment_=topic/[i'm redacting the topic name]
Traceback (most recent call last):
        18: from /usr/local/lib/ruby/gems/2.5.0/gems/selenium-webdriver-3.141.0/lib/selenium/webdriver/common/platform.rb:141:in `block in exit_hook'
        17: from /usr/local/lib/ruby/gems/2.5.0/gems/selenium-webdriver-3.141.0/lib/selenium/webdriver/common/service.rb:67:in `block in start'
        16: from /usr/local/lib/ruby/gems/2.5.0/gems/selenium-webdriver-3.141.0/lib/selenium/webdriver/common/service.rb:77:in `stop'
        15: from /usr/local/lib/ruby/gems/2.5.0/gems/selenium-webdriver-3.141.0/lib/selenium/webdriver/common/service.rb:128:in `stop_server'
        14: from /usr/local/lib/ruby/gems/2.5.0/gems/selenium-webdriver-3.141.0/lib/selenium/webdriver/common/service.rb:104:in `connect_to_server'
        13: from /usr/local/lib/ruby/2.5.0/net/http.rb:609:in `start'
        12: from /usr/local/lib/ruby/2.5.0/net/http.rb:910:in `start'
        11: from /usr/local/lib/ruby/gems/2.5.0/gems/selenium-webdriver-3.141.0/lib/selenium/webdriver/common/service.rb:108:in `block in connect_to_server'
        10: from /usr/local/lib/ruby/gems/2.5.0/gems/selenium-webdriver-3.141.0/lib/selenium/webdriver/common/service.rb:128:in `block in stop_server'
         9: from /usr/local/lib/ruby/2.5.0/net/http.rb:1213:in `get'
         8: from /usr/local/lib/ruby/2.5.0/net/http.rb:1464:in `request'
         7: from /usr/local/lib/ruby/2.5.0/net/http.rb:1491:in `transport_request'
         6: from /usr/local/lib/ruby/2.5.0/net/http.rb:1491:in `catch'
         5: from /usr/local/lib/ruby/2.5.0/net/http.rb:1494:in `block in transport_request'
         4: from /usr/local/lib/ruby/2.5.0/net/http/response.rb:29:in `read_new'
         3: from /usr/local/lib/ruby/2.5.0/net/http/response.rb:40:in `read_status_line'
         2: from /usr/local/lib/ruby/2.5.0/net/protocol.rb:167:in `readline'
         1: from /usr/local/lib/ruby/2.5.0/net/protocol.rb:157:in `readuntil'
/usr/local/lib/ruby/2.5.0/net/protocol.rb:189:in `rbuf_fill': end of file reached (EOFError)
Traceback (most recent call last):
        8: from script/import_scripts/google_groups.rb:192:in `<main>'
        7: from script/import_scripts/google_groups.rb:147:in `crawl'
        6: from script/import_scripts/google_groups.rb:72:in `crawl_categories'
        5: from script/import_scripts/google_groups.rb:72:in `each'
        4: from script/import_scripts/google_groups.rb:72:in `step'
        3: from script/import_scripts/google_groups.rb:79:in `block in crawl_categories'
        2: from script/import_scripts/google_groups.rb:79:in `each'
        1: from script/import_scripts/google_groups.rb:79:in `block (2 levels) in crawl_categories'
script/import_scripts/google_groups.rb:97:in `crawl_topic': undefined method `each' for nil:NilClass (NoMethodError)

So my judgment might be that resuming is not safe, but I’ve already tried starting over by deleting the directory with the eml files. That’s not working either; I get similar errors.


(Gerhard Schlager) #55

Maybe you got blocked or rate limited. :man_shrugging:
The scraper doesn’t handle that. Wait a little bit or try from a different IP or with a different Google account.
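For anyone who wants to automate the "wait a little bit" part, here is a minimal sketch of retrying a failed step with an increasing delay. It is not part of google_groups.rb: `with_retries`, `RetriesExhausted`, and the delay values are hypothetical names and numbers chosen for illustration, and whether waiting actually clears a block or rate limit on Google's side is an assumption.

```ruby
# Hypothetical helper, NOT part of script/import_scripts/google_groups.rb.
class RetriesExhausted < StandardError; end

# Runs the block, retrying on any StandardError with exponential backoff.
# The sleeper is injectable so the delays can be inspected without waiting.
def with_retries(max_attempts: 5, base_delay: 30, sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue StandardError => e
    if attempt >= max_attempts
      raise RetriesExhausted, "gave up after #{attempt} attempts: #{e.message}"
    end
    # Delay doubles each time: base_delay, 2 * base_delay, 4 * base_delay, ...
    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end

# Usage sketch: the block stands in for whatever scrapes one topic.
# with_retries(max_attempts: 4, base_delay: 60) do |attempt|
#   puts "scrape attempt #{attempt}"
#   # ... scrape one topic here ...
# end
```

Retrying blindly can make a rate limit worse, so a long base delay and a small attempt count are probably the safer defaults.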


(Mark A Schmucker) #56

You may be right. Or network issues maybe. It resolved itself and has been running ok now.

I suspect I have duplicate eml files now, because I resumed as mentioned above. I expected 17,000, based on my practice runs, but I’m up to 20,000 and counting. Is there a mechanism for de-duplicating, either in the scraping step or in the import?


(Mark A Schmucker) #57

Do I understand correctly that there is no way to import a second time?

Here is what I’ve done (over several weeks):

  1. Clean Discourse install
  2. Added new content and users
  3. Made a backup
  4. Imported the old Google Groups as a dry run
  5. Reverted to the backup. Good, the dry run is gone, or so I thought
  6. Added more content and users
  7. Now I’m ready for the final Google Group import. But that fails because it recognizes the posts as already imported, despite step 5. So I have to delete all data.

If I download my current backup, delete all data, and restore from the backup, will that then allow me to import again? Or will it remember those posts in the backup?


(Gerhard Schlager) #58

The scraper for Google Groups should skip topics that have already been downloaded. That’s what the status.yml file is for. And in theory it shouldn’t create duplicate files because it overwrites existing eml files, but I haven’t really tested that part extensively.
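If you want to check the scraped files for duplicates yourself, a rough sketch like the following groups eml files by their Message-ID header. It is a standalone script, not something the scraper or importer provides; `duplicate_message_ids` and `message_id` are hypothetical names, and it assumes each eml file has the Message-ID on a single, unfolded header line.

```ruby
# Hypothetical standalone check, not part of the Discourse import scripts.

# Returns the first Message-ID found in the header section of an eml file,
# or nil if there is none. Assumes the header is not folded across lines.
def message_id(path)
  File.foreach(path, encoding: Encoding::BINARY) do |line|
    break if line.strip.empty? # a blank line ends the headers
    if (m = line.match(/\AMessage-ID:\s*(\S+)/i))
      return m[1]
    end
  end
  nil
end

# Returns { message_id => [paths] } for every ID that appears in more than one file.
def duplicate_message_ids(dir)
  by_id = Hash.new { |hash, key| hash[key] = [] }
  Dir.glob(File.join(dir, "**", "*.eml")).sort.each do |path|
    id = message_id(path)
    by_id[id] << path if id
  end
  by_id.select { |_id, paths| paths.size > 1 }
end

# Usage sketch (the directory is whatever folder holds your eml files):
# duplicate_message_ids("/path/to/eml/files").each do |id, paths|
#   puts "#{id}: #{paths.join(', ')}"
# end
```

An empty result suggests the higher file count comes from somewhere else, for example new messages posted since the practice runs.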

Are you sure? The import consists of two parts.

  1. Indexing: This will skip already indexed emails depending on the data in metadata.yml
  2. Importing: This will skip importing users and posts when they already exist.

As long as you restored the correct Discourse backup, it should work. If you want to make sure that the indexing of emails runs again, delete the import/data/<group_name>/metadata.yml and import/data/index.db files.


(Mark A Schmucker) #59

I take that back. My experiment was flawed, and I have no reason to believe that I had duplicate emails.


(Mark A Schmucker) #60

I don’t have an import/data/<group_name>/metadata.yml, but I do have a status.yml. I deleted status.yml and index.db, then re-ran import_mbox.sh. It says ‘indexing /shared/import/data/<group_name>/<filename.eml>’ for thousands of eml files over 20 minutes, then ‘Skipping 1000 already import posts’ several times. It skipped all the posts.


(Gerhard Schlager) #61

Yeah, my bad. I mixed that up with another import. :slight_smile:

This will only happen when the posts already exist or the system at least believes they exist because the Message-IDs are found in the post_custom_fields table. Are you :100:% sure that you successfully restored a backup which was created before your dry run?
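To illustrate why every post can be skipped even when none are visible in the forum, here is a toy model of that check. It does not touch Discourse or the database: `plan_import`, the `Email` struct, and the `imported_ids` list (standing in for the Message-IDs recorded in post_custom_fields) are all hypothetical.

```ruby
require "set"

# Toy stand-in for an indexed email; the real importer's data model differs.
Email = Struct.new(:message_id, :subject)

# imported_ids stands in for the Message-IDs the system believes it has
# already imported. Anything found there is skipped, visible or not.
def plan_import(emails, imported_ids)
  known = imported_ids.to_set
  skipped, pending = emails.partition { |email| known.include?(email.message_id) }
  { skipped: skipped, pending: pending }
end

emails = [
  Email.new("<1@example.com>", "First thread"),
  Email.new("<2@example.com>", "Second thread")
]

# If the restored backup still contains the dry run's IDs, nothing is pending.
plan = plan_import(emails, ["<1@example.com>", "<2@example.com>"])
puts "skipped #{plan[:skipped].size}, pending #{plan[:pending].size}"
```

In this model the only way to get posts imported again is for the recorded IDs to be absent, which is what restoring a backup from before the dry run would achieve.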


(Mark A Schmucker) #62

Not 100% at this point. You’ve convinced me maybe I didn’t.

I don’t see the posts in my Discourse- is that a sufficient test of whether they “exist”?

Is it possible to clean up the post_custom_fields table? Is this in the postgres database?

I have the Data Explorer plugin- I’ll poke around there.

The reason I’m resisting starting over is that I’ve done my step 6 (added more content and users), so I’m invested in this install. Resistance may be futile though.


(Mark A Schmucker) #64

Well this is embarrassing. I’ve been working on the wrong server :roll_eyes: We had a development server and a production server and the DNS was set up opposite what it was supposed to be. The imports have been working perfectly on the development server (which nobody is watching anymore). Sorry for all the noise.


(Tim Sawyer) #65

This procedure worked great. The trick was to get a good Mbox of the Mailman archives. The script in this post

https://mail.python.org/pipermail/mailman-users/2012-October/074208.html

was the key.


(Tim Sawyer) #66

Is there a way to run the importer and preserve existing posts, categories, etc?


(Tim Sawyer) #67

My mistake. Looks like the importer does not erase existing data.


#68

Hey @gerhard, I was importing a big email archive and my process got killed.

This is what I see on the terminal after more than 12 hours of waiting for the import. I couldn’t find any related issue on this forum. Any idea what might be the cause and how to fix it?

Thanks!

indexing replies and users

creating categories
       63 / 63 (100.0%)  [517 items/min]
creating users
      192 / 192 (100.0%)  [1885 items/min]
creating topics and posts
Killed
discourse@discourse-app:/var/www/discourse$ WARNING: V8 isolate was forked, it can not be disposed and memory will not be reclaimed till the Ruby process exits.
WARNING: V8 isolate was forked, it can not be disposed and memory will not be reclaimed till the Ruby process exits.

(Kane York) #69

The importer is resumable for that reason; you should be able to run it again and it will pick up where it left off.


(Gerhard Schlager) #70

You can resume the import, but my guess is that your system is running low on memory. The import most likely won’t work on a system with less than 4GB of RAM. I’d recommend at least 8GB of RAM. Imports are quite resource intensive. Also, make sure you have a swap file.
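A quick pre-flight check along these lines can flag a host that is likely to hit the "Killed" problem. This is only a sketch for Linux hosts: `parse_meminfo` and `import_memory_warnings` are hypothetical helpers, the 4 GB default mirrors the minimum mentioned above, and the 10% tolerance is an assumption to allow for MemTotal reading slightly below the nominal RAM size.

```ruby
# Hypothetical pre-flight check, not part of the Discourse import scripts.

# Parses /proc/meminfo-style text into { "MemTotal" => kilobytes, ... }.
def parse_meminfo(text)
  text.each_line.with_object({}) do |line, info|
    if (m = line.match(/\A(\w+):\s+(\d+)\s*kB/))
      info[m[1]] = m[2].to_i
    end
  end
end

# Returns a list of human-readable warnings; empty means nothing was flagged.
def import_memory_warnings(info, min_ram_gb: 4)
  warnings = []
  ram_gb = info.fetch("MemTotal", 0) / 1024.0 / 1024.0
  # MemTotal is usually a bit below the nominal size, hence the 10% tolerance.
  if ram_gb < min_ram_gb * 0.9
    warnings << format("only %.1f GB of RAM, below the suggested %d GB", ram_gb, min_ram_gb)
  end
  warnings << "no swap configured" if info.fetch("SwapTotal", 0).zero?
  warnings
end

# Only runs when executed directly on a host that exposes /proc/meminfo.
if __FILE__ == $PROGRAM_NAME && File.readable?("/proc/meminfo")
  puts import_memory_warnings(parse_meminfo(File.read("/proc/meminfo")))
end
```

Passing `min_ram_gb: 8` checks against the recommended figure instead of the minimum.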


(alexknowshtml) #71

Hi - I’m running the Google Groups importer for the first time and things seem to be going smoothly, except that while spot checking the threads I’m noticing that the posts are being generated out of order. Even weirder, the timestamps on the posts/replies seem to be correct, but the “first post” is often buried in the replies instead of being the topic starting post.

I should note that I used the “match on subject” flag in the settings - my first attempt at importing put every imported post into its own topic instead of nesting replies under the original post.

I searched the forum for anything like this and couldn’t find it. Any suggestions of what I might have done wrong? Thanks for your work on this @gerhard!