Obfuscating mailto:foo@bar.com when importing mail archives

Hello,

After a successful import of mail archives (mbox), the messages display email addresses in clear text, whereas Gmane or the Mailman 2 archive server would have obfuscated them. This lets address-harvesting bots collect them from Discourse, and I’m looking for a way to avoid this, for example:

  1. globally removing email in the posts (a display plugin maybe?)
  2. some site setting that already does that
  3. another idea?

Thanks in advance for your help!

Why would email addresses be showing in the content of messages? Can you give an example?

I’ll send you a link to the publicly available posts in a PM to not trigger bots more than they already are :stuck_out_tongue:

Here is the obfuscated version of the message:


Le 23 décembre 2010 14:05, [redacted] l <[redacted]@gmail.com](mailto:[redacted]@gmail.com)
mailto:[redacted]@gmail.com> a écrit :

Pour info lorsqu’il y a des mises à jour voila comment je procede

So this is a problem with the import, which (if you still have the chance) should be fixed during the import phase. I have taken a look at your forum and it is full of broken content (email headers, wrong indentation) because quoted older emails are not being trimmed. Emails that are replies to each other are also being put in different topics.

Either you enabled show_trimmed_content (here) during the import, or your message format was not recognized by the reply trimmer code (here). It looks like there are a lot of other issues as well, though.


Good guess: I did set show_trimmed_content to true, because the reply trimmer code frequently trims more than it should. This happened not just in the imported mbox but also daily with replies by email. Although it should be possible to improve the trimmer, it seemed like an uphill battle: people using mail will always (i) send weirdly formatted emails for whatever reason and (ii) expect them to be displayed in full.

There indeed are other issues in the import: it is far from perfect. Although I’d be happy to discuss them, they are not an immediate concern.

Since it looks like I did not miss an import option that would obfuscate emails, there are apparently two options left:

  • Globally replace content in the posts with something like s/{email_regexp}/obfuscated/
  • Find or write a plugin that obfuscates displayed content (an HTML converter?) with s/{email_regexp}/obfuscated/

Or… I’m going in the wrong direction?

Globally replacing the content in the posts is the way I would go.

I would use PostgreSQL regexp_replace to replace all email addresses in posts.raw and posts.cooked.


I’ll do that and post the HOWTO in this topic, thanks for the advice!

:warning: make sure to back up Discourse before trying the following :warning:

Here is how to replace all email addresses in posts with [email_redacted]. The regular expression is rather limited and is likely to miss some addresses, but I prefer an expression I can read and understand when modifying the content of all posts.

$ ./launcher enter app
/var/www/discourse# su - postgres -c psql
psql (13.2 (Debian 13.2-1.pgdg100+1))
Type "help" for help.

postgres=# \c discourse
You are now connected to database "discourse" as user "postgres".
discourse=# \set re '[0-9a-z._%+-]+@[a-z0-9.-]+\\.[a-z]{2,64}'
discourse=# update posts set raw = regexp_replace(raw, :'re', '[email_redacted]', 'gi') where raw ~* :'re';
UPDATE 1
discourse=# update posts set cooked = regexp_replace(cooked, :'re', '[email_redacted]', 'gi') where cooked ~* :'re';
UPDATE 1
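
For anyone who wants to see what this expression catches before running it against the database, here is a minimal sketch that applies the same pattern to a few sample strings in Python. It is not part of the procedure above: the function name redact and the sample addresses are made up for illustration, and Python's re module is assumed to behave like the PostgreSQL regex engine for this simple pattern. re.IGNORECASE stands in for the 'i' flag passed to regexp_replace.

```python
import re

# Same pattern as the psql variable above, with the backslash unescaped.
# re.IGNORECASE mirrors the 'i' flag; sub() replaces every match, like 'g'.
EMAIL_RE = re.compile(r"[0-9a-z._%+-]+@[a-z0-9.-]+\.[a-z]{2,64}", re.IGNORECASE)


def redact(text: str) -> str:
    """Replace every match of EMAIL_RE with the [email_redacted] placeholder."""
    return EMAIL_RE.sub("[email_redacted]", text)


# Hypothetical sample lines, loosely modelled on imported mail content.
samples = [
    "someone <jane.doe@example.com> wrote:",
    "[jane@example.org](mailto:jane@example.org)",
    "Contact JANE@EXAMPLE.COM for details.",
    "josé@example.com",  # non-ASCII local part: the pattern does not match
    "no address in this line",
]

for line in samples:
    print(redact(line))
```

The line with a non-ASCII local part comes out unchanged, which is one example of the addresses this deliberately simple expression will miss.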