Improving Mailman email parsing

We’ve noticed that on a couple of forums that use Discourse to mirror a public mailing list, some posts are being attributed to the wrong user:


from: [ruby-talk:444110] exif - photo metadata - ruby-talk - Ruby Mailing List Mirror

In this case, Discourse first staged a user with the name “Austin Ziegler via ruby-talk” and an email address matching the list submission address, and that staged user is what shows up as the author of every post like this.


from: txt.att.net outage? - #4 by Mailman - Mailman List Mirror (Read Only) - NANOG

In this case, Discourse first staged a user with the name “Mailman” with an email address matching the list submission address.

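Upon investigation, our mail parsing is sometimes incorrect. The cause is that, for DMARC compliance, Mailman will sometimes rewrite the From header to the list’s own address and move the original sender’s address into another header, such as X-MailFrom (as in the Mailman 3.3.3 headers below) or Reply-To (as in the Mailman 2.1.39 headers):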

To: Ryan Davis via ruby-talk 
X-MailFrom: tom@tomsdomain.com
X-Mailman-Version: 3.3.3
Reply-To: Ruby users <ruby-talk@ml.ruby-lang.org>
From: Tom Reilly via ruby-talk <ruby-talk@ml.ruby-lang.org>
Cc: Tom Reilly <tom@tomsdomain.com>

To: Jared Mauch <jared@jaredsdomain.com>
X-BeenThere: nanog@nanog.org
X-Mailman-Version: 2.1.39
From: Owen DeLong via NANOG <nanog@nanog.org>
Reply-To: Owen DeLong <owen@owensdomain.com>
Cc: nanog <nanog@nanog.org>

but leave it when it doesn’t need to change:

To: Jon Lewis <jlewis@jonsdomain.org>
X-BeenThere: nanog@nanog.org
X-Mailman-Version: 2.1.39
From: William Herrin <bill@billsdomain.us>
Cc: nanog@nanog.org

Mailman seems to have a lot of different options for this behaviour, so we’d like to come up with an algorithm that correctly parses what Mailman sends out in every case.

There are potentially other approaches; for instance, Mailman could post the unchanged message directly to a Discourse instance. But those are more complex to set up and may not be available to everyone.

Here’s the start of such an algorithm:

  • if mailman-version < 3
    • if any of:
      • From address matches List-Id
      • From address matches List-Post
      • From address matches X-BeenThere
    • then use Reply-To as From
  • if mailman-version >= 3
    • if X-MailFrom exists
      • Use name from From header, stripping /via .*/
      • Use email from X-MailFrom
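
To make the draft rules above concrete, here is a rough Ruby sketch. It is not Discourse's actual code: the method names (`extract_email`, `extract_name`, `from_is_list?`, `draft_sender`) are made up for illustration, headers are assumed to be a plain Hash keyed by exact header name, and a message without an X-Mailman-Version header is assumed to fall through to the normal From handling, which the draft doesn't specify.

```ruby
# Bare address from values like "Name <user@example.org>" or "<mailto:list@example.org>".
def extract_email(value)
  value.to_s[/[^\s<>:,"]+@[^\s<>:,"]+/]&.downcase
end

# Display name with the address and any trailing "via <list>" removed
# (a slightly stricter take on the /via .*/ rule from the draft).
def extract_name(value)
  value.to_s.sub(/<[^>]*>/, "").sub(/\s+via\s.*\z/, "").delete('"').strip
end

# True when the From address is the list itself (List-Post, X-BeenThere or List-Id).
def from_is_list?(headers)
  from_email = extract_email(headers["From"])
  return false unless from_email

  list_emails = [headers["List-Post"], headers["X-BeenThere"]].map { |v| extract_email(v) }.compact
  # Assumption: List-Id looks like "Some list <name.domain.tld>", i.e. the address with "@" as ".".
  list_id = headers["List-Id"].to_s[/<([^>]+)>/, 1]&.downcase
  list_emails.include?(from_email) || from_email.tr("@", ".") == list_id
end

def draft_sender(headers)
  from = headers["From"]
  major = headers["X-Mailman-Version"].to_s.to_i # "2.1.39" => 2, missing header => 0

  if major >= 3 && headers["X-MailFrom"]
    # Mailman 3: name from From (minus "via ..."), email from X-MailFrom.
    { name: extract_name(from), email: extract_email(headers["X-MailFrom"]) }
  elsif major.between?(1, 2) && headers["Reply-To"] && from_is_list?(headers)
    # Mailman 2: From was rewritten to the list, so use Reply-To as From.
    { name: extract_name(headers["Reply-To"]), email: extract_email(headers["Reply-To"]) }
  else
    # From was left alone (or the version is unknown): use it as-is.
    { name: extract_name(from), email: extract_email(from) }
  end
end
```

Run against the NANOG headers quoted earlier, `draft_sender` would pick Owen DeLong's Reply-To address for the rewritten message and leave William Herrin's From untouched.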

Also, when all this is said and done, is it possible to have a rake task re-process existing posts (probably only the ones matching the erroneous user) with this new logic?

The gist of it is that I’ve come up with an algorithm that works for all the versions I’ve seen in the wild.

  1. Get the mailing list’s email address from either the List-Post or the X-BeenThere header.
  2. The sender’s email address will be in one of the following headers: From, Reply-To, X-MailFrom or X-Original-From. Iterate over those and return the first address that doesn’t match the mailing list’s email address.
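
As a minimal sketch of those two steps (not the code that was actually merged), something like the following Ruby would do it. The names `SENDER_HEADERS`, `extract_email` and `mailman_sender_email` are hypothetical, headers are assumed to be a plain Hash keyed by exact header name, and falling back to From when neither list header is present is my own assumption.

```ruby
# Headers that may carry the real sender, in the order given above.
SENDER_HEADERS = ["From", "Reply-To", "X-MailFrom", "X-Original-From"].freeze

# Bare address from values like "Name <user@example.org>" or "<mailto:list@example.org>".
def extract_email(value)
  value.to_s[/[^\s<>:,"]+@[^\s<>:,"]+/]&.downcase
end

def mailman_sender_email(headers)
  # Step 1: the list's own address, from List-Post or X-BeenThere.
  list_email = extract_email(headers["List-Post"]) || extract_email(headers["X-BeenThere"])

  # Assumption: without a list address there is nothing to compare against, so use From.
  return extract_email(headers["From"]) unless list_email

  # Step 2: first candidate header whose address is not the list's address.
  SENDER_HEADERS.each do |name|
    email = extract_email(headers[name])
    return email if email && email != list_email
  end

  nil # every candidate pointed back at the list
end
```

Because the check only compares addresses against the list address, it needs no knowledge of the Mailman version or of which header a given configuration chose to rewrite.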

This seems to work great! :+1:
I used rake emails:fix_mailman_users to fix all posts that were attributed to the wrong user on https://rubytalk.org/
