Proposed plugin to improve reply-by-email accuracy

In recent months, there have been several requests across a variety of Discourse groups for improvements to inbound email parsing. Expressed as user stories, these requests can be broadly categorized as:

  • “I would like the ability to use the same HTML features when replying by email that I can when posting on the website.”
  • “I would like the ability to view and search the messages from our mailing list.”
  • “I would like content created by email, both via reply-by-email and bulk import, to consistently be nicely formatted and accurately parsed.”

I’ll link to real examples of these requests below, but for now the important thing to understand is that each of these three “different” requests is actually asking for the same underlying thing — more accurate inbound email parsing.

A few months ago I emailed @sam about enabling Discourse to use the commercial email parsing API that our company offers. Sam suggested creating an exploratory post here to explain the benefits that integrating with our API would provide over Discourse’s existing inbound email parsing solution, and also how our API could be integrated as a plugin.

I’ll cover both of these topics in detail, starting with the state of Discourse’s current email parsing solution. And for the benefit of those who haven’t spent the last several years thinking about email parsing, I’ll also include some background context on the problem.

This post is pretty long, but feel free to skip around. Here is what will be covered:

  1. The current state of email parsing in Discourse
  2. The benefits of better email parsing
  3. Stakeholder user personas
  4. The FWD:Everyone Email Parsing API
    A. Stripping signatures and replies
    B. HTML markup normalization
    C. Language support
    D. CSS Styling
  5. Proposed Integration
  6. Testing the API

The current state of email parsing in Discourse

Discourse already has a reply-by-email feature that will turn email replies from users into new forum posts within a topic. This feature works like so:

  1. A user gets an email notification containing a new post on a forum topic they are watching.
  2. The user replies to this email.
  3. This email reply is turned into a new post in the relevant forum topic.

Conceptually this is an invaluable feature; it’s the preferred workflow for many people, and is a must-have for many mailing-list-based communities who are considering migrating to Discourse.

The catch is that when these email replies are turned into forum posts, they will often be rendered with missing or incorrect formatting, or even missing text. This is deeply problematic, for reasons I’ll explore below.

Common issues include:

  • Bullet points not rendering correctly
  • Missing line breaks between text
  • Extra line breaks between text
  • Text the user wrote being entirely deleted

And when I say these issues are common, I don’t mean that they occasionally happen when sending emails in foreign languages using obscure email clients. Rather, I mean that they commonly happen when sending basic reply-by-email messages from Gmail and Outlook in English.

Here are two real-world examples of users complaining about these issues, both from the [Python-Dev] mailing list:

https://www.prettyfwd.com/t/Wco-c1ZCR7mUwiww0j6s9w/#message-5
https://www.prettyfwd.com/t/Wco-c1ZCR7mUwiww0j6s9w/#message-36

(Prettyfwd uses the FWD:Everyone Email Parsing API.)

Although I haven’t tried importing content from existing mailing lists with Discourse, I can say from experience that whatever the error rate is for reply-by-email, the error rate when rendering entire email threads is going to be at least an order of magnitude higher. This is due to the increased complexity of stripping signatures and replies, accounting for inline replies, dealing with deeply nested markup, etc.

As a real example, this Mailman-to-Discourse migration retrospective written by Tanya Lattner (president of the LLVM foundation) alludes to these issues in the technical concerns section:

I asked, and it turns out the specific thing they’re upset about is the high percentage of emails that are missing content due to being prematurely truncated. Because the preexisting discussions and documentation from the past 19 years of the mailing list archive are invaluable, they feel like they won’t be able to sunset Mailman until this problem has been fully solved.

So, how do we know whether the current state of email parsing in Discourse is “good enough?” I would offer this as a three-part litmus test:

  1. Users need to fully trust that if they use the reply-by-email feature, their content will be accurately parsed and will look just as good as if they’d posted it via the web interface.
  2. Forum administrators need to trust that if they allow reply-by-email, it’s not going to create extra work and complaints.
  3. Discourse’s employees need to trust the feature enough to actively promote it as a first-class way to participate.

Unless we can say with full confidence that each of these conditions is being met, even if reply-by-email exists as a feature, the vast majority of the potential benefits will never accrue.

This is what is happening currently.

That is, I would characterize the existing email parsing code as an 80-20 solution, but in a context where an 80-20 solution doesn’t really make sense; the problem is that even if e.g. 80% of emails parse correctly, you’re unlikely to be getting even 10% of the potential benefits.

So even though reply-by-email (and bulk email import) already exist, users ultimately aren’t getting the experience they’re looking for, extra work is being created for moderators and staff, communities are losing out on valuable content and user growth, etc.

The benefits of better email parsing

Social software only succeeds to the extent that it fulfills human needs.

The reasons why folks post on web forums include wanting to share knowledge with others, wanting to influence their opinions, wanting to be seen as generally intelligent, as having domain expertise, as making valuable real-world contributions, etc.

And when it comes to text-based communication, the likelihood of achieving these outcomes depends not only on what is said, but also on the typography with which one says it.

This is why there are entire books about whitespace in Shakespeare. It’s partly why the NY Times is taken more seriously than the NY Post. And it’s a large part of why Facebook beat MySpace.[1]

When the text a user writes ends up badly misformatted through no fault of their own, the human needs that drive people to use social software are no longer being fulfilled. In fact, the opposite is happening; the users are made to look dumb.

Even folks not using the reply-by-email feature end up losing authority and respect if other posts in the topic (and the larger forum) end up looking like a dumpster fire.

Stakeholder User Personas

While everyone benefits when posts consistently render with aesthetically pleasing typography, user personas who may especially benefit from better inbound email parsing include:

  1. Folks who are currently members of mailing lists like [Python-Dev] and [Django-Dev], who fully understand the benefits of Discourse and are happy to see their communities moved to Discourse, but only if they are able to continue to participate in a way that’s indistinguishable from GNU Mailman, Google Groups, etc. Here is a real example of this type of request: https://www.prettyfwd.com/t/Wco-c1ZCR7mUwiww0j6s9w/#message-89
  2. Members of email-based communities who would generally be happy to migrate to Discourse, but who would be much more enthusiastic about doing so if their decades of existing content were easily searchable from within the same platform.
  3. Casual users who check forums intermittently. For example, on Growing Fruit I’m subscribed by email to all the topics about growing North American pawpaws. During summer and fall I visit that forum several times a day to read the constant stream of new posts in these topics, but outside of these months it’s mainly the email notifications on these topics that keep me engaged.
  4. Folks who use the web only intermittently. It’s often assumed that if folks don’t regularly use the web then it’s somehow related to the digital divide, but often this isn’t the case. There are lots of folks who are both highly intelligent and technical, but who are insulated from needing to use the web on a regular basis due to being at the top of their fields. A real-world example here is someone like Donald Knuth, who does not regularly use the web despite being one of the top living computer scientists. Every field has folks like this, and getting them to share their knowledge is invaluable. In my experience these folks are unlikely to become regular contributors to any forum, but if someone tells them there is a topic people are discussing that they’d be interested in, then they will often subscribe by email and contribute to those specific topics.

The big picture is that improving inbound email parsing should not only increase engagement from folks who are already regular active Discourse contributors, but it should also unblock a lot of communities who would otherwise like to migrate to the platform, and also solicit highly valuable content from folks who would otherwise not contribute.

The FWD:Everyone Email Parsing API

The FWD:Everyone email parsing API does two things:

  1. Accurately strips signatures and replies from each email message, while still allowing for inline replies to quoted text.
  2. Takes the extremely complicated HTML markup generated by email clients, and normalizes that markup into the ~12 HTML tags that are typically allowed by user-generated content sites — all while preserving the author’s intent to the greatest extent possible.

I’ll explain both in more detail, but first here is a video I made that explains the problem by showing real-world email threads: https://www.youtube.com/watch?v=nPb3NQlz6V4

Stripping signatures and replies

The FWD:Everyone email parsing API works on both plaintext and HTML emails with equal accuracy. The API preferentially uses the HTML message part when available, because:

  • The HTML formatting features (like bold, italics, blockquotes, code snippets, etc.) an author chooses to use are an essential part of the author’s message, as important as the text itself.
  • When email clients convert the HTML version of a message to the plaintext version, they often do so incorrectly. E.g. not only will email clients often not make any attempt to render HTML features like bullet lists in plain text, but often the text within HTML formatting elements will be entirely missing.

Of course some users prefer sending plaintext emails; because of this, plaintext-only emails must have their signatures and replies stripped with equal accuracy.

The FWD:Everyone Email Parsing API does this, including correctly handling inline replies in both plaintext and HTML emails.

In terms of accuracy, there are two types of errors that can occur in any email parsing library when stripping signatures and replies:

  • False positives — When text that is supposed to be included in the message is incorrectly excluded.
  • False negatives — When text that should not be included in the message is incorrectly included.
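To make these two definitions concrete, here is a minimal Python sketch of how one might count both error types for a single message, by comparing a parser's output against a hand-labelled expected body. The function name and the line-by-line comparison are my own simplifications for illustration; this is not how the API is evaluated internally.

```python
def stripping_errors(expected_lines, parsed_lines):
    """Compare a parser's output to a hand-labelled expected message body.

    false positive = text that should have been kept but was removed
    false negative = text that should have been removed but was kept
    """
    # Ignore blank lines and surrounding whitespace so that only content counts.
    expected = [line.strip() for line in expected_lines if line.strip()]
    parsed = [line.strip() for line in parsed_lines if line.strip()]

    false_positives = [line for line in expected if line not in parsed]
    false_negatives = [line for line in parsed if line not in expected]
    return false_positives, false_negatives


# Hypothetical example: the parser dropped a real sentence and kept a signature.
expected = ["Sounds good to me.", "I'll send the patch tomorrow."]
parsed = ["Sounds good to me.", "Sent from my phone"]

fp, fn = stripping_errors(expected, parsed)
print("false positives:", fp)
print("false negatives:", fn)
```

Running something like this over a labelled sample of a community's real emails is one way to get per-community error rates, which matters because the rates vary so much between communities.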

It’s difficult to give precise accuracy statistics because different discourse communities (lowercase d) use email so differently. But compared with Discourse’s current parsing solution, a realistic expectation might be:

  • 100x fewer false positives for stripping signatures and replies
  • 10x fewer false negatives for stripping replies
  • 1x to 10x fewer false negatives for stripping signatures (likely better, but not a full order of magnitude better)

For context, false positives are generally much worse than false negatives since they misrepresent what the person wrote. But false negatives are also very bad since they make the poster (and everyone else on the forum) look unprofessional at best, and outright dumb at worst.

The approach that the FWD:Everyone API takes is to eschew any tricks for stripping signatures that can lead to false positives; the increase in false negatives this might otherwise cause is then largely balanced out by having put a great deal more work into getting the algorithm working in a legitimate way, without needing to cut corners.

The big picture reason as to why the FWD:Everyone Email Parsing API will generally be much more accurate than Discourse’s current solution is that our API was designed to parse entire email threads, which is a vastly more difficult problem than parsing one-off reply-by-email posts. The end result is that our product is highly overengineered, at least relative to both Discourse’s needs and to the existing prior art.

HTML Markup Normalization

In order for replies submitted by email (and imported email threads) to look the same as any other user-generated content, they must ultimately be rendered using the same subset of HTML that is allowed when users reply via the website.

This is surprisingly complicated.

Emails composed in email clients like Gmail and Outlook are encoded using some combination of ~50 HTML tags, ~25 HTML attributes, and ~175 CSS styles. Further, this markup is often heavily obfuscated; you might expect that a paragraph of text would look something like this:

<p>Some text!</p>

But instead, even simple paragraphs are often encoded using deeply nested and completely nonsensical combinations of divs, spans, tables, lists, etc. This is the main source of complexity for both stripping replies and for normalizing markup.

Regardless, after parsing, each message gets rendered using only the following markup:

Allowed block elements: <p>, <ul>, <ol>, <li>, <blockquote>, <pre>
Allowed inline elements: <code>, <a>, <b>, <i>, <u>, <s>, <span>

Notes:

  • The only allowed attributes (except on <a> tags) are 'style' and 'dir'.
  • The only allowed inline style is 'font-weight'.
  • <a> tags can also have 'href', 'rel', 'title', and 'target' attributes.
  • <span> elements are used only in limited cases to ensure that font-weights cascade correctly. As such, they’re always used with an inline 'font-weight'.
  • In the future, the <img> tag will also be used to display inline images.
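For anyone who wants to sanity-check output against the rules above, here is a rough Python sketch of an allowlist checker built on the standard library's `html.parser`. The class and function names are hypothetical, and this only verifies the tag, attribute, and inline-style rules listed in this post; it is not part of the API.

```python
from html.parser import HTMLParser

# Tags and attributes taken from the lists in this post.
ALLOWED_TAGS = {"p", "ul", "ol", "li", "blockquote", "pre",
                "code", "a", "b", "i", "u", "s", "span"}
COMMON_ATTRS = {"style", "dir"}
LINK_ATTRS = COMMON_ATTRS | {"href", "rel", "title", "target"}


def only_font_weight(style):
    """Return True if every declaration in an inline style is font-weight."""
    declarations = [d for d in (style or "").split(";") if d.strip()]
    return all(d.split(":", 1)[0].strip().lower() == "font-weight"
               for d in declarations)


class AllowlistChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.violations = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag not in ALLOWED_TAGS:
            self.violations.append(f"tag <{tag}> is not allowed")
            return
        allowed = LINK_ATTRS if tag == "a" else COMMON_ATTRS
        for name, value in attrs:
            if name not in allowed:
                self.violations.append(
                    f"attribute '{name}' is not allowed on <{tag}>")
            elif name == "style" and not only_font_weight(value):
                self.violations.append(
                    f"style on <{tag}> uses properties other than font-weight")


def find_violations(html):
    checker = AllowlistChecker()
    checker.feed(html)
    checker.close()
    return checker.violations


print(find_violations('<p>Hi <b>there</b></p>'))
print(find_violations('<div class="x">Hi</div>'))
```

The first call should report nothing and the second should flag the `<div>`. A check like this could run in a plugin's test suite as a guard that nothing outside the allowed subset ever reaches a post.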

Rendering posts in this limited subset of HTML allows any post submitted by email to be easily rendered using the exact same typography as posts submitted through the web interface.

This is all done while preserving the intent of the author to the greatest extent possible, while also ensuring that they can’t do things like adding dozens of unnecessary line breaks between paragraphs.

See also: The ‘CSS Styling’ section below.

Language Support

EmailReplyTrimmer currently has full or partial support for 13 languages:

English, Norwegian, French, German, Portuguese, Spanish, Italian, Dutch, Swedish, Chinese, Russian, Polish, Ukrainian

In contrast, the FWD:Everyone Email Parsing API currently supports 30+ languages, including every language that Discourse currently supports:

English, Spanish, Portuguese, Catalan, Dutch, French, German, Italian, Norwegian, Danish, Swedish, Finnish, Russian, Polish, Ukrainian, Turkish, Czech, Romanian, Hungarian, Hebrew, Arabic, Persian, Chinese, Japanese, Korean, Hindi, Indonesian, Thai, Filipino, Afrikaans

The FWD:Everyone Email Parsing API fully supports RTL languages. This means that not only will text correctly flow from right-to-left in languages like Arabic, but also the appropriate attributes are applied to the HTML markup so that features like bullet points will render on the correct side of the page.
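As a rough illustration of what applying the appropriate attributes can mean in practice, here is a small Python sketch that guesses a list's base direction from its first strongly directional character (the same heuristic browsers use for `dir="auto"`) and sets `dir` accordingly. The function names are hypothetical, and I'm not claiming this is the exact logic the API uses.

```python
import unicodedata
from html import escape


def base_direction(text):
    """Guess text direction from the first strongly directional character."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi == "L":            # strong left-to-right (e.g. Latin letters)
            return "ltr"
        if bidi in ("R", "AL"):    # strong right-to-left (Hebrew, Arabic)
            return "rtl"
    return "ltr"                   # no strong characters: default to LTR


def render_list(items):
    """Render a bullet list, adding dir="rtl" when the content calls for it."""
    direction = base_direction(" ".join(items))
    attr = ' dir="rtl"' if direction == "rtl" else ""
    lis = "".join(f"<li>{escape(item)}</li>" for item in items)
    return f"<ul{attr}>{lis}</ul>"


print(render_list(["First point", "Second point"]))
print(render_list(["שלום", "עולם"]))
```

With `dir="rtl"` on the list, the browser places the bullets on the right-hand side, which is the behavior described above.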

The API will sometimes also work in additional languages depending on the email client used, but the official supported language set is at a minimum tested to work with Gmail, Outlook, and Apple Mail. Less popular email clients are explicitly tested in the languages where they have the most usage. And since the API is tested against thousands of email threads from public mailing lists, there are countless fixes for real-world erratic behavior of unknown provenance.

N.b. that supporting a wide variety of languages is important for more than just displaying text in those languages. It’s very common for people to write text in English, but have their email client configured to use e.g. Hebrew. So in cases like this, correctly parsing an English reply would require not only fully supporting Hebrew, but also supporting right-to-left languages more generally.

Supporting languages from a wide variety of language families also helps to ensure that Unicode is being processed and stored correctly, rather than in ways that may cause problems in the future as support for more non-Western languages is added.

CSS Styling

As mentioned above, a key strength of our API is its ability to normalize HTML markup in a thoughtful and logical way. This normalization process is designed to optimize text for readability and accessibility, while preserving the original author’s intent to the greatest extent possible.

As such, all text appears only within inline or block elements (no free-floating text), and all inline elements appear only within block elements. This makes it easy to style text, e.g. to ensure that different elements have the correct amount of whitespace between them.

As an example of how this is valuable, email clients will allow users to do silly things like inserting a bullet list directly before or after a line of text, with no line break in between. The (vastly simplified) code generated by an email client when doing so might look something like this:

<div>
    Some text
    <div> </div>
    <span>&nbsp;&nbsp;&nbsp;&nbsp;&#8226; A bullet point</span>
    <div> </div>
    Some more text
</div>

The FWD:Everyone Email Parsing API would then normalize the above markup to instead look like this:

<p>Some text</p>
<ul>
    <li>A bullet point</li>
</ul>
<p>Some more text</p>

This normalized markup is easy to understand and style, and visually there are now also line breaks before and after the bullet list. Affordances like these make the text better looking and easier to read, while preserving author intent. These types of user affordances ensure that great content submitted by email is consistently conferring social status, rather than undermining it.
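The structural guarantee described in this section (no free-floating text, and inline elements only inside block elements) is simple enough to check mechanically. Here is a hedged Python sketch of such a check using the standard library's `html.parser`; the names are my own, and it assumes the block and inline tag sets listed earlier in this post.

```python
from html.parser import HTMLParser

BLOCK = {"p", "ul", "ol", "li", "blockquote", "pre"}
INLINE = {"code", "a", "b", "i", "u", "s", "span"}


class StructureChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags, outermost first
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag in INLINE and not any(t in BLOCK for t in self.stack):
            self.problems.append(f"inline <{tag}> outside any block element")
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy nesting.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        inside_element = any(t in BLOCK or t in INLINE for t in self.stack)
        if data.strip() and not inside_element:
            self.problems.append(f"free-floating text: {data.strip()!r}")


def structure_problems(html):
    checker = StructureChecker()
    checker.feed(html)
    checker.close()
    return checker.problems


normalized = "<p>Some text</p><ul><li>A bullet point</li></ul><p>Some more text</p>"
print(structure_problems(normalized))
print(structure_problems("Some text<ul><li>A bullet point</li></ul>"))
```

The normalized example from above should pass cleanly, while the second input should be flagged for its free-floating "Some text".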

The simplified, normalized markup generated by our API also ensures that when thinking about how to style text, designers and developers only need to think about what output the API allows, rather than how the original email might have been formatted. And since the allowed output from the API is virtually identical to what the Discourse web client allows, this should be close to a drop-in solution.

Proposed Integration

The reply-by-email functionality would be integrated with Discourse as a plugin, which could then be enabled by default for all hosted Discourse instances.

The existing email parsing code would be used for Discourse instances that do not have this plugin enabled.

Additionally, in the event that the FWD:Everyone Email Parsing API were to become temporarily unavailable, any incoming messages would be processed using the existing email parsing code. Then once the API is back online, any messages that had not been edited via the web interface since posting could then be reprocessed by the API.
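The fallback and reprocessing flow described above could be sketched roughly as follows. This is Python pseudocode-made-runnable rather than actual plugin code (a real Discourse plugin would be written in Ruby), and every name here, including `ParserUnavailable`, `Post`, `process_incoming`, and `reprocess`, is hypothetical.

```python
from dataclasses import dataclass


class ParserUnavailable(Exception):
    """Raised by the (hypothetical) API client when the service is down."""


@dataclass
class Post:
    raw_email: str
    body: str = ""
    parsed_by: str = ""          # "api" or "legacy"
    edited_on_web: bool = False


def process_incoming(raw_email, api_parse, legacy_parse):
    """Prefer the API; fall back to the existing parser if it is unavailable."""
    try:
        return Post(raw_email, api_parse(raw_email), "api")
    except ParserUnavailable:
        return Post(raw_email, legacy_parse(raw_email), "legacy")


def reprocess(posts, api_parse):
    """Once the API is back, re-parse fallback posts that nobody has edited."""
    updated = 0
    for post in posts:
        if post.parsed_by == "legacy" and not post.edited_on_web:
            post.body = api_parse(post.raw_email)
            post.parsed_by = "api"
            updated += 1
    return updated


# Stand-in parsers for demonstration only.
def api_down(raw_email):
    raise ParserUnavailable()

def api_up(raw_email):
    return "<p>" + raw_email.strip() + "</p>"

def legacy(raw_email):
    return raw_email.strip()


post = process_incoming("Hello there\n", api_down, legacy)
print(post.parsed_by, repr(post.body))
print(reprocess([post], api_up), post.parsed_by, repr(post.body))
```

Keeping the raw email alongside each post is what makes the later reprocessing step possible, and skipping posts that were edited on the web avoids overwriting a user's manual fixes.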

The plugin could also be made available to self-hosted Discourse instances to be optionally enabled.

For groups migrating from existing mailing lists to Discourse, each email thread on the mailing list could also be parsed via the API, but this would likely be integrated into Discourse’s existing migration scripts and processes rather than done via a plugin.[2]

Testing the API

The API is fully available for anyone to test, albeit with a very low rate limit for non-authenticated users.

For those with Gmail accounts, the easiest ways to test the API are:

Key differences between these two web-based tools and the actual API are that the former:

  1. Will not process threads that contain messages styled using HTML tables
  2. Won’t strip replies on the first message in a thread. (This matters when, e.g., a thread has over 100 messages and Gmail splits it into multiple threads.)

To test the API directly via code, there are starter scripts for both Python and Ruby:

And here is the relevant documentation, including known issues and the product roadmap:

[1] Viewing American class divisions through Facebook and MySpace.

[2] When bulk importing content from an existing mailing list, it’s worth first doing a quick sanity check on a few threads to ensure they are parsing correctly. Some groups will parse with near perfect accuracy as is, but others may greatly benefit from a couple hours of preemptive work. For example, some mailing list software requires a bit of custom code for each list to strip off any text appended to the bottom of each message, whereas for other mailing list software this can be done in a predictable way that will work for any list hosted on that platform. Because of potential issues like this, the bulk import process should preferably be run as part of a supervised migration rather than done via a plugin.


Have you made any progress with this? It sounds very interesting.

As an enthusiastic Discourse user, the one thing I’d like would be to be able to convert an existing email thread into a Discourse topic - with the authors and times preserved and all the gumpff removed.

The reason for this is the human tendency to kick off a group conversation by email, and then after 10-20 reply-alls someone realises and says “shouldn’t this be on the Forum”? By then it is too late…

Doing this manually is soul destroying. But if this API could be harnessed to do that somehow - wonderful!!!

@nathank I haven’t put any work into writing an API wrapper, just due to lack of interest so far. It’s not a huge lift, but it’s enough work that it doesn’t really make sense unless there is more concrete demand. That said, the API itself keeps getting better.
