I’ve been running this script daily for the past several months for a site that really needs to switch to subscribing the category to the group, but that’s not done that. It works fine except every so often I need to get a new cookies.txt file. About a month ago, something happened and it started complaining that “It looks like you do not have permissions to see email addresses. Aborting.” I did . . . something . . . and it started working again. Just over a week ago it happened again and I’ve re-downloaded the cookies with multiple browsers/cookie plugins and keep getting the no-email version of the posts. I can see the addresses when logged in with the web browser.
Has anyone else had trouble lately? Any ideas on what to do? I’ve tried playing around with what domains are in the add_cookies call in the script, but that hasn’t helped.
used to include the full email addresses, but now doesn’t. I can confirm that when I’m logged in, I can click about and see full email addresses in the google groups web interface, but if I hit the above URL that the scrape script is hitting in the very same web browser, it gets the data with redacted email addresses.
My guess is that they’ve increased privacy or something.
Here’s another clue: I can open that link in my browser and it works, but if I snag the “copy as cURL” that curl command isn’t getting the email addresses. Sigh. Well, I tried with another browser and the curl command worked. I can’t quite figure why the script isn’t getting the email addresses.
So maybe there’s some other browser-specific thing it’s doing?
Thanks. Well, a week ago it worked for a second site and then I updated my cookies file one more time and it downloaded the data for the first site. It seems to have only worked for a day or two, and now again it’s not working. For either site. I see the full email address in my browser, download the cookies for that tab, and no joy.
I’ll check out the takout. EDIT: Well, to get the mbox file it looks like you need to be a super administrator, not just an owner.
@gerhard any ideas how to adapt these instructions to using a dev install rather than the standard installation? I feel like I’ve come close to getting a listserv migration working using just a few commands, but I can’t get what I assume is the last step to work using either of:
Both fail to pull in all dependencies. See here for the full set of commands I used and the errors. Any ideas? Missing some d/bundle calls perhaps?
Next thing I will try is using an Ubuntu VM and doing a “standard install” there but this seems a bit overkill given the dev install otherwise works quite nicely.
I am a total discourse (and ruby, and mostly docker) newbie, so sorry if this is obvious or (worse) irrelevant!
I’ve never tried this with the Docker based dev install, but I guess you need to add "gem 'sqlite3'" to the Gemfile and execute apt install -y libsqlite3-dev inside the container before running d/bundle install
@gerhard thanks for the gentle nudge – I re-ran from absolute scratch (git clone onward) while adding gem 'sqlite3' to the end of /src/Gemfile, since I assumed this was the one you meant, and it worked! For the record here are the instructions I used (for mne_analysis listserv):
1. In Ubuntu host
git clone https://github.com/discourse/discourse.git
cd discourse
d/boot_dev --init
d/rails db:migrate RAILS_ENV=development
d/shell
vim /src/Gemfile # add gem 'sqlite3' to end
exit
d/bundle
@gerhard Can your mbox importer be used only when you first install Discourse or can it be used later after other users are using Discourse? If the importer is used when Discourse is being used by others will they see any side effects?
@gerhard I noticed after the import that although the messages look okay, there are no staged users at all, even though it seems like there should be (I used the default staged: true). The outputs look like:
Is there supposed to be a user counter shown as well?
I also tried running with staged: false and the same output was shown, and none of the mailing list users are in any groups. In case it helps to see what’s actually being processed, here is one of the many .mbox files that is being imported:
Have you had a look at the database? Gut feeling on this issue is for some reason the email field is not getting created correctly there and thus can’t be read.
The Mailman 2 list that I am considering importing into Discourse has had (for part of its existence) from_is_list set to Munge From, so that the “From:” header is
From: Listname <listname-bounces@listdomain.com> On Behalf Of [Original sender's name]
This made me think the importer would import each of these messages as if from the same user (with email address listname-bounces@listdomain.com)… BUT…
The initial line marking the beginning of a new email in the mbox file still begins with
From username@example.com [Date time group]
(and the Hyperkitty archives show the original sender’s email address as normal also).
So my question is – does the importer script take the sender’s address from the “From:” header or the "From " line? Thank you.
Thanks for the quick reply! How hard would it be to change this? Not necessarily officially – though it might help others – but just for me to change the script before running it. I don’t know any Ruby (yet!) but if it’s just changing a colon to a space…
It’s not a simple change, but it should be doable. You don’t necessarily have to implement it in the import script. If you know another scripting language, I’m sure it won’t be too hard to update the From: headers in the mbox files before running the import…
But, feel free to fix it in the import script. A PR is welcome!
A good starting point for fixing the header should be the each_mail method…
Would it be possible at that point to obtain the first line of the mbox email (i.e. the “From [email address] [date time]” line) from parsed_email and extract the email address from that?
No, that line is filtered when the mbox is split into individual messages. You need to save that value in the each_mail method in order to use it later.
I had some fun trying to do this, before spotting that Mailman stores the emails in the mbox in their original, unadulterated form, so that the “From:” line contains the same (original sender’s) email address as the "From " line in all cases, even when the email has been sent “From: listname-bounces@listname.domain.com”).
I was limited by not having a development Discourse installation, or even Ruby, but was able to make some headway with https://rubular.com/ and https://replit.com/languages/ruby (and DuckDuckGo). If you would be willing to have a look at it, I’d be grateful if you’d let me know whether this would have worked (or nearly worked) had it been necessary…
def each_mail(filename)
raw_message = +''
first_line_number = 1
last_line_number = 0
each_line(filename) do |line|
if line.scrub =~ @split_regex
if last_line_number > 0
#We're at the start of the NEXT email now
yield raw_message, first_line_number, last_line_number
raw_message = +''
first_line_number = last_line_number + 1
else
#We're at the start of THIS email now, so get the email address
new_email = line.match(/^From (\S+@\S+).*/).captures
end
else
raw_message << line
end
last_line_number += 1
end
#Get old email ("From:" line)
old_email = raw_message.match(/^From: .*<(\S+@\S+)>/).captures
#Put "From " address into "From:" line
raw_message = raw_message.sub(old_email, new_email)
yield raw_message, first_line_number, last_line_number if raw_message.present?
end