Obfuscating mailto:foo@bar.com when importing mail archives

dachary · June 11, 2021, 9:31pm

Hello,

After a successful import of mail archives (mbox), the messages display email addresses that Gmane or the mailman2 archive server would have obfuscated. This lets address-collecting bots harvest them from Discourse, and I’m looking for a way to avoid this.

globally removing email in the posts (a display plugin maybe?)
some site setting that already does that
another idea?

Thanks in advance for your help!

RGJ · June 11, 2021, 9:45pm

Why would email addresses be showing in the content of messages? Can you give an example?

dachary · June 11, 2021, 9:58pm

I’ll send you a link to the publicly available posts in a PM to not trigger bots more than they already are

Here is the obfuscated version of the message:

…
Le 23 décembre 2010 14:05, [redacted] l <[redacted]@gmail.com](mailto:[redacted]@gmail.com)
mailto:[redacted]@gmail.com> a écrit :

Pour info lorsqu’il y a des mises à jour voila comment je procede
…

RGJ · June 12, 2021, 5:46am

So this is a problem with the import which (if you still have the chance) should be fixed during the import phase. I have taken a look at your forum and it is full of broken content: email headers and wrong indentation where old emails were not cut off, and emails that are replies to each other being put in different topics.

Either you have enabled show_trimmed_content during import or your message format did not get recognized by the reply trimmer code. It also looks like there are a lot of other issues.

dachary · June 12, 2021, 6:30am

Good guess: I indeed set show_trimmed_content to true because the reply trimmer code frequently trims more than it should. Not just in the imported mbox, it also happened daily with replies by email. Although it should be possible to improve the trimmer it seemed like an uphill battle. People using mail will always (i) send weirdly formatted emails for whatever reason, (ii) expect them to be displayed in full.

There indeed are other issues in the import: it is far from perfect. Although I’d be happy to discuss them, they are not an immediate concern.

Since it looks like I did not miss an import option that would obfuscate emails, there are apparently two options left:

Globally replace content in the posts with something like s/{email_regexp}/obfuscated/
Find/write a plugin that obfuscates displayed content (HTML converter?) with s/{email_regexp}/obfuscated/

Or… I’m going in the wrong direction?

RGJ · June 12, 2021, 7:29am

Globally replacing the content is the way I would go.

I would use PostgreSQL regexp_replace to replace all email addresses in posts.raw and posts.cooked.

dachary · June 12, 2021, 7:48am

I’ll do that and post the HOWTO in this topic, thanks for the advice!

dachary · June 12, 2021, 9:12am

Make sure to back up Discourse before trying the following.

Here is how to replace all email addresses in posts with [email_redacted]. The regular expression is rather limited and is likely to miss some addresses, but I prefer an expression I can read and understand when modifying the content of all posts.

$ ./launcher enter app
/var/www/discourse# su - postgres -c psql
psql (13.2 (Debian 13.2-1.pgdg100+1))
Type "help" for help.

postgres=# \c discourse
You are now connected to database "discourse" as user "postgres".
discourse=# \set re '[0-9a-z._%+-]+@[a-z0-9.-]+\\.[a-z]{2,64}'
discourse=# update posts set raw = regexp_replace(raw, :'re', '[email_redacted]', 'gi') where raw ~* :'re';
UPDATE 1
discourse=# update posts set cooked = regexp_replace(cooked, :'re', '[email_redacted]', 'gi') where cooked ~* :'re';
UPDATE 1
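
For anyone who wants to see what the expression catches before touching the database, here is a minimal Python sketch. It is not part of the procedure above. It assumes Python's re module treats this simple pattern the same way PostgreSQL does, and re.IGNORECASE stands in for the 'i' flag passed to regexp_replace. The redact helper and the sample strings are made up for illustration.

```python
import re

# Same pattern as the psql variable `re` above (assumption: Python's regex
# engine and PostgreSQL's agree on a pattern this simple).
EMAIL_RE = re.compile(r"[0-9a-z._%+-]+@[a-z0-9.-]+\.[a-z]{2,64}", re.IGNORECASE)


def redact(text: str) -> str:
    """Hypothetical helper: replace every match with the placeholder used in the SQL update."""
    return EMAIL_RE.sub("[email_redacted]", text)


# Made-up samples, loosely modelled on the kind of content found in imported posts.
samples = [
    "Le 23 décembre 2010 14:05, <someone@example.com> a écrit :",
    "[someone@example.com](mailto:someone@example.com)",
    "Contact John.Doe@Example.ORG for details",
    "someone at example dot com",
    "josé@example.com",
]

for sample in samples:
    print(redact(sample))
```

The last sample is the kind of address the expression misses, at least in Python: the é right before the @ is outside the character class, so nothing matches and the address is left as it is.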
