Discourse email messages are incorrectly threaded

cameron-simpson · July 27, 2022, 11:20pm

Apologies in advance for some of the tone below. I sound exasperated,
because I am a little exasperated.

By Michael Brown via Discourse Meta at 27Jul2022 14:06:

Sorry, I’m just catching up now, here are some thoughts, some of which
have already been addressed…

Cameron Simpson:

This second commit seems to address another problem we have seen: if a
post comes from an email, the outbound message-id sent to email users is
not the message-id of the source message from the author.

Cameron Simpson:

The Message-ID is a single persistent identifier for a message

The difficulty here is that what is sent out from Discourse is a different message than the inbound. It has different metadata (for this purpose, To/From/Reply-to/Unsubscribe/etc.) and a different body (it’s customised per user (I think? Does this not happen in mailing list mode?)).

What exactly is the message? Treating 5322 as gospel:

A message consists of header fields, optionally followed by a message body.

The “Message-ID:” field provides a unique message identifier that refers to a particular version of a particular message.
[emphasis mine]

It’s that “particular version” that makes me think it would be inappropriate to re-send an incoming message with a different Message-ID. Though, if you change your point of view from Discourse as “Forum Software” to Discourse being “Mailing List Software” then it kind of makes sense to do so, so I get where you’re coming from.

Well, unfortunately this depends on an overly literal reading, maybe
reading in context which isn’t there.

Every email message gets its headers modified as mail systems pass it
along. If nothing else, Received: headers get added at every step, and
several systems add various headers indicating spam filtering results
and signatures. None of those trigger a message-id modification, and
indeed doing so would make the message-id totally dysfunctional.

Regarding content, as already mentioned, almost every mailing list adds
content to the body text, usually a footer with a link to the list admin
page or an unsubscribe link. These also do not trigger a message-id
change.

In fact, almost nothing which forwards a message changes the message-id.
Because that would break threading and duplicate detection for end user
clients.
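To make the relay behaviour described above concrete, here is a minimal sketch using Python's standard `email` package. The addresses, the relay host name and the footer text are all made up for illustration, and a real MTA would prepend its `Received:` header rather than append it; the only point is that the headers and the body both change while the Message-ID does not.

```python
from email import message_from_string
from email.message import Message

# Hypothetical inbound message from an author to a list.
ORIGINAL = """\
From: author@example.org
To: list@example.org
Subject: Threading
Message-ID: <abc123@example.org>

Hello list.
"""


def relay(raw: str, relay_host: str, footer: str) -> Message:
    """Pass a message along roughly the way a list or MTA would."""
    msg = message_from_string(raw)
    # Trace information goes into an extra header (appended here for
    # simplicity; real relays put it at the top).
    msg["Received"] = f"from author.example.org by {relay_host}"
    # A list footer is added to the body text.
    msg.set_payload(msg.get_payload() + "\n-- \n" + footer + "\n")
    # The Message-ID header is deliberately left untouched.
    return msg


relayed = relay(
    ORIGINAL,
    "lists.example.org",
    "Unsubscribe: https://lists.example.org/unsub",
)
print(relayed["Message-ID"])
```

Because the identifier survives the relay, a copy that reaches a reader by another route can still be recognised as the same message.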

I see you go on to quote what I was just about to cite:

5322 also says:

There are many instances when messages are “changed”, but those changes do
not constitute a new instantiation of that message, and therefore the message
would not get a new message identifier. For example, when messages are
introduced into the transport system, they are often prepended with
additional header fields such as trace fields (described in section 3.6.7)
and resent fields (described in section 3.6.6). The addition of such header
fields does not change the identity of the message and therefore the original
“Message-ID:” field is retained. In all cases, it is the meaning that the
sender of the message wishes to convey (i.e., whether this is the same
message or a different message) that determines whether or not the
“Message-ID:” field changes, not any particular syntactic difference that
appears (or does not appear) in the message.

I suppose it comes down to, does the sender of the message change when Discourse sends it out?

I think you’ve misread things here. Let me emphasise:

In all cases, it is the meaning that the sender of the message
wishes to convey (i.e., whether this is the same message or a
different message) that determines whether or not the "Message-ID:"
field changes

The sender is the author, not an MTA such as Discourse.

If I post to Discourse via email, I want my message to reach the readers
as it is, semantically speaking. Any riders like unsub links do not change
the semantics of what I have said in my message.

It’s still the same message.

Maybe we should use Resent-Message-ID and friends?

Absolutely not. They are for a user resubmitting a message. For
example, if I forwarded a message on to someone else. They’re not for
mail relays (such as lists and Discourse).

Cameron Simpson:

It looks like References is now also present in RFC5322

It’s always been there, all the way back to 822. But as you say later, yes it’s been updated.

Ouch. I thought it was USENET only at that point. I stand corrected.

5322 also speaks directly to the way Discourse and Github use it:

The “In-Reply-To:” field may be used to identify the message (or messages) to
which the new message is a reply, while the “References:” field may be used to
identify a “thread” of conversation.

Possibly slightly improperly, likely due to the lack of a suitable “Thread Identifier” header. But this interpretation may not be what the RFC authors intended… it doesn’t address messages with a “References” but without “In-Reply-To”.

It says to me that the two fields cover the same information:

References shows a linear (usually) thread back to the OP
In-Reply-To shows the parent, and implies the same thread in
aggregate with the previous messages back to the OP
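As a rough illustration of that reading, this is one way a mail client could find the parent of a message from either field. It is a sketch only: the function names are invented here, and it assumes header values are whitespace-separated message-ids and that References is a linear chain whose last entry is the direct parent.

```python
from typing import List, Optional


def parse_ids(header_value: Optional[str]) -> List[str]:
    """Split a References or In-Reply-To value into message-ids."""
    return header_value.split() if header_value else []


def parent_of(in_reply_to: Optional[str], references: Optional[str]) -> Optional[str]:
    """Return the parent message-id, or None for a thread-starting message."""
    replies = parse_ids(in_reply_to)
    if replies:
        # In-Reply-To names the parent directly.
        return replies[0]
    chain = parse_ids(references)
    # Without In-Reply-To, fall back to the end of the References chain.
    return chain[-1] if chain else None
```

Under these assumptions a message with References but no In-Reply-To still threads, and a message with neither header is treated as the start of a thread.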

Martin Brennan:

This is interesting, I thought every recipient of the email had to have a unique Message-ID? In fact I believe this is why we went down the path of adding uniqueness to each recipient’s Message-ID, to avoid spam behaviours, looking back on our internal topic. Perhaps @supermathie , who is on our infra team and was doing a bunch of testing with email earlier in the year, could weigh in here too?

The tricky bit of this is that we aren’t sending out one email, we’re sending out N - one per recipient - so that their individual metadata (Unsubscribe, etc.) can be correct.

This isn’t tricky. The meaning of the messages is the same, the
customisations are minor and semantically irrelevant. They do not
warrant new or distinct message-ids.

And yes, I did see strong indications during testing that spam determination would be tied to a Message-ID. If it was later seen again (same user or different user) it would be much more likely to be marked spam.

Can you show some of these instances? Because message-ids allow
deduplication at the end user’s end. And bear in mind that many
“antispam” measures are misguided rubbish. The number of things I’ve had
rejected as potential spam for utterly spurious reasons… breaking
email to work around broken spam misfiltering is a poor choice.

To this day I never CC people with GMail addresses because GMail’s spam
filtering knows me and drops things on the floor. If I send only to the
list, they get it. If I CC their GMail address it (a) marks it as spam
and (b) then also marks the mailing list message as spam as well (same
message-id!) The end user doesn’t see my message. This logic is utterly
spurious and unrepairable.

Cameron Simpson:

So I’d be entirely ok with you adding your recipient-specific unsub link and preserving the original message-id. The benefits far far outweigh the loss of threading if you gave each message copy an individual message-id.

The benefits here, to be fair, are entirely around threading the emails correctly in certain mail clients at the expense of deliverability.

Sigh. To all email clients. And a major reason people over in
Pythonland are saying they will just not go to Discourse is that the
email side threading is broken. Many people do not use forums, because
each forum requires them to visit it. Email comes to them, they get to
use their preferred reader and their preferred editor, and threading
lets people see the discussion flow clearly. When it works.

The current topic/#{topic_id}/#{post_id}.s#{sender_user_id}r#{receiver_user_id} at least makes it consistent for a user in their mailbox. The assumption

My biggest concern is the deliverability - it’s hard enough to get email delivered when there is zero visibility from the major providers.

I would like to see evidence. Mailing lists do this correctly all over
the planet. Discourse definitely and objectively breaks this. I’m trying
to get it fixed.

Let me reiterate the two basic problems here:

The OP In-Reply-To and References cite a fictitious “pre-OP”
“topic” message-id, so no email user has a thread with a starting
message (the OP) - everything including the OP looks like a followup

The emails received via Discourse and the emails received directly eg
via CC have different message-ids even though they are the same
message semantically speaking; this breaks threading and deduplication
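The deduplication half of the second problem can be sketched in a few lines. This is an illustration, not how any particular mail client is implemented: messages are reduced to hypothetical `(message_id, route)` pairs and the example ids are invented.

```python
from typing import Dict, Iterable, List, Tuple

Copy = Tuple[str, str]  # (message_id, route by which the copy arrived)


def deduplicate(messages: Iterable[Copy]) -> List[Copy]:
    """Keep the first copy seen of each message-id."""
    seen: Dict[str, Copy] = {}
    for message_id, route in messages:
        seen.setdefault(message_id, (message_id, route))
    return list(seen.values())


# The same post arriving twice with one shared message-id...
same_id = [
    ("<abc@example.org>", "direct CC"),
    ("<abc@example.org>", "via forum"),
]
# ...versus the forum copy carrying its own per-recipient id.
different_ids = [
    ("<abc@example.org>", "direct CC"),
    ("<topic/1/2.s3r4@forum.example>", "via forum"),
]

print(len(deduplicate(same_id)), len(deduplicate(different_ids)))
```

With a shared id the reader ends up with one copy; with distinct ids both copies survive and any reply threads under only one of them.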

But I do see a strong argument for making Discourse behave more like mailing list software in mailing list mode. @martin I believe we don’t customise the message body in mailing list mode? Do you think it makes sense to take a more strict approach around preserving and reusing Message-IDs in mailing list mode?

There are people over in Pythonland who found “mailing list mode” too
much of a firehose. They want to get email for targeted topics but not
everything. The message-id handling should be the same for all of the
email side.

I’m a “mailing list mode” person on discuss.python.org. But I turned it
on here (discourse.org) and immediately turned it off again. I need
targeted mode over here.

cameron-simpson · July 27, 2022, 11:26pm

By Michael Brown via Discourse Meta at 27Jul2022 22:37:

Sam Saffron:

We are currently in planet (1) which is wreaking havoc.

ah! I thought we were already doing: topic/#{topic_id}/#{post_id}.s#{sender_user_id}r#{receiver_user_id}

{receiver_user_id} puts you into distinct message-ids per end user for
the same source post. That is bad as soon as end users communicate
outside discourse or get copies not via discourse.

I would be inclined to, in the interest of balancing concerns of email uniqueness & deliverability vs. those of mailing-list-mode, do (2) for mailinglist-mode disabled and (3) for mailinglist-mode enabled.

And as mentioned in my recent post, mailing list mode only covers one
flavour of Discourse email reception. All the same concerns apply
whether the email receiver is in mailing list mode or just
email-for-some-topics/tags mode.

Similarly, with the References header, I would be inclined to have it absent for post #1 in a topic

Likewise the In-Reply-To. Neither should be present, because to be
present they have to reference a fictitious per-to-OP message.

and have it referencing the topic (so topic/#{topic_id}) and the post
to which it’s replying, if any.

You can’t refer to the “topic” message-id unless there was a post with
that message-id which went out as email. If you want to go that way,
special case the message-id of the OP to be the “topic” message-id
instead of ...../1.

cameron-simpson · July 28, 2022, 2:49am

This should be “prior-to-OP”. Sorry, Cameron Simpson

supermathie · July 28, 2022, 2:57am

As you say, this is exactly the problem frustrating us:

I agree this should be changed. The OP message-id should be (in lieu of one coming over mail) (simplified) topic/1 and not reference another message.

The message ID wouldn’t change, even if it was only ever a Discourse post and never an email.

Further messages can reference that one.

Why must an email exist? Semantically, having only the post fits the criteria. The message it’s responding to exists, just not in that person’s email folder. We’ve just come to the conclusion that the message is what’s important, be that Post body or Email body. It follows that the topic/#{topic_id}/1@site is a unique message ID referencing that post, whether it’s in an email message or not.

It’s no different than receiving a reply to an email that references an email not in your inbox. It’s still a reply, so References is legitimate and correct.

Fundamentally I agree with you. The purist in me wants this correct. But the practicality of needing to get email into people’s inboxes was what led to this. For worse, a ton of people use gmail and never train its filters, use it properly, and “Unsubscribe” by reporting as spam [1].

I agree, I think we were a bit too literal in reading

A message identifier pertains to exactly one instantiation of a particular message

After mulling this over for a while I do think we ought to go back to what we had before (remove randomisation) and lock down a single message-id per post and it should be:

message_id_from_incoming_email || topic/#{topic_id}/#{post_num}@site (post_num of OP is 1)

And whenever we send out an email, I think it is correct to add References to parents all the way back to the OP and set In-Reply-To to the appropriate stable post message-ID (or the OP if replying to topic) since the Message is the post. But those fields for the OP should be blank, yes.

[1] Not that gmail reports this to us, despite us implementing feedback-loop.

martin · July 29, 2022, 12:26am

Thank you for your responses @supermathie and @cameron-simpson , I do believe we have reached consensus. Pulling out the TODOs into a single post, and I hope to be able to begin work on these quite soon:

Change the generated Message-ID format to always be <discourse/post/:post_id@:hostname>, this is unique enough, it’s basically reverting to what we used to do. Referring to the OP will just use the first post ID now instead of just the bare topic ID.
If a post has an associated IncomingEmail record, we always use that Message-ID when sending email, otherwise we generate one using the format above.
Do not use a References when sending out emails for the OP of the topic, there is nothing to Reference yet because it is the first email in the thread.
Ensure that correct In-Reply-To and References headers are generated based on PostReply records.

This has the potential to leave things in a bit of a murky state thread-wise for already sent emails, but I will try my best to allow for the format we are moving away from for a changeover period as well. Thanks for bearing with us!

sam · July 29, 2022, 12:32am

Just to clarify … this would not be the hostname of the server this originated from, but the url of the site? If it is hostname then we lose all stability when 3 different hosts serve the same site.

martin · July 29, 2022, 12:34am

Sorry yeah I mean the site domain e.g. meta.discourse.org which comes from Email::Sender.host_for(Discourse.base_url), what we already use.

supermathie · July 29, 2022, 1:12am

Good call, didn’t think about moves. Is :post_id the post’s ID (post.id) or number (within the topic)?

If it’s the post ID, we can simplify and just use <post/:post_id@:hostname> since that will never change, then we don’t need to store the Message-ID unless it’s overridden from default.

If not… we may as well use the post ID here? no reason this part has to be long, it just has to be unique.

martin · July 29, 2022, 1:20am

It’s the actual ID not the post number.

That is a good point, <post/:post_id@:hostname> will probably work just fine and avoids the extra column requirement. Maybe to make it more Discourse-specific we could add discourse to the front e.g. <discourse/post/543563@meta.discourse.org> (keeping in mind many sites will have no mention of Discourse in the hostname). It’s splitting hairs at this point though.

I will try to think of ways this can be messed up. I guess if you move a post to another topic and then someone replies to the post via email their reply will end up in the new topic instead of the original topic. Maybe that is fine? Other risk is that the post is moved into a private category but I think we already have that same risk and we handle it.

Just thinking out loud, should be fine, will cover these things when I test out the changes anyway

sam · July 29, 2022, 1:28am

The argument for including topic_id is that you can deliberately break threading if people split a post from a topic into another topic.

I am mixed on it. Can go either way. But that would be the idea.

martin · July 29, 2022, 1:37am

The argument for using only the post ID is that it is more static which is what we want, since if you move a post to another topic the post ID will be the same in the Message-ID but the topic will not be the same.

I think if we end up moving the post and sending out emails from the new topic the new thread will be created correctly anyway in the mail client, since the References and In-Reply-To header chains will be different. Anyway, I will make sure to test this scenario out and see if it does what we expect as well. Nothing will be merged into core until the various scenarios work as expected.

martin · July 29, 2022, 1:46am

Based on these further discussions @cameron-simpson I updated the TODOs to this, posting them here so you get the update since the Discourse edits will not arrive via email:

Change the generated Message-ID format to always be <discourse/post/:post_id@:hostname>, this is unique enough, it’s basically reverting to what we used to do. Referring to the OP will just use the first post ID now instead of just the bare topic ID.

If a post has an associated IncomingEmail record, we always use that Message-ID when sending email, otherwise we generate one using the format above.

Add a new outbound_message_id column to the Post records which will be filled by either a) the Message-ID of the incoming email if it is creating the post or b) the outgoing Message-ID that we generate in the case of posts created by the Discourse web UI

Do not use a References or In-Reply-To headers when sending out emails for the OP of the topic, there is nothing to Reference or reply to yet because it is the first email in the thread.

Ensure that correct In-Reply-To and References headers are generated based on PostReply records.
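A minimal sketch of the first three TODOs, in Python rather than Discourse's own Ruby and with a hypothetical `Post` dataclass standing in for the real model: the message-id is taken from the incoming email when there is one, generated otherwise, and stored on first use so later calls return the same value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Post:
    # Hypothetical stand-in for a Discourse post record.
    id: int
    incoming_message_id: Optional[str] = None  # set when the post arrived by email
    outbound_message_id: Optional[str] = None  # stored once, then reused


def message_id_for(post: Post, site_domain: str) -> str:
    """Return the stable Message-ID for a post, allocating it on first use."""
    if post.outbound_message_id is None:
        post.outbound_message_id = (
            post.incoming_message_id
            or f"<discourse/post/{post.id}@{site_domain}>"
        )
    return post.outbound_message_id


web_post = Post(id=543563)
emailed_post = Post(id=7, incoming_message_id="<abc@example.org>")
print(message_id_for(web_post, "meta.discourse.org"))
print(message_id_for(emailed_post, "meta.discourse.org"))
```

Storing the value rather than recomputing it means the id stays the same even if the inputs to the generated form change later.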

sam · July 29, 2022, 2:21am

Does this cover quotes as well (Eg: a post quoted 10 different other posts, so it references them?)

cameron-simpson · July 29, 2022, 2:49am

By Sam Saffron via Discourse Meta at 29Jul2022 02:31:

Martin Brennan:

Ensure that correct In-Reply-To and References headers are generated based on PostReply records.

Does this cover quotes as well (Eg: a post quoted 10 different other
posts, so it references them?)

In-Reply-To can only cite one antecedent, so pick one. References
can reference more than one but the RFC explicitly recommends against
this because not all client apps might expect other than a linear chain
from this post back to the OP.

I’d be ok with either for the References but would lean to the
conservative one. The easy computation is:

In-Reply-To: use the message-id of the first quoted message (or
whatever single quote you pick based on some policy)
References: the References of the same single chosen antecedent
post above plus the message-id of that same post

These would be stable, predictable and correct.
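The computation above might look like this as a sketch. The function name and the list-of-ids representation are assumptions made for the example; passing `None` as the antecedent models the OP, which gets neither header.

```python
from typing import Dict, List, Optional


def threading_headers(
    antecedent_id: Optional[str],
    antecedent_references: List[str],
) -> Dict[str, str]:
    """Build In-Reply-To and References from one chosen antecedent."""
    if antecedent_id is None:
        # The OP starts the thread, so it cites nothing.
        return {}
    # The antecedent's own chain, plus the antecedent itself.
    chain = antecedent_references + [antecedent_id]
    return {
        "In-Reply-To": antecedent_id,
        "References": " ".join(chain),
    }
```

A reply to the OP therefore carries the OP's id in both headers, and each further reply extends the chain by one id.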

Cheers,
Cameron Simpson cs@cskk.id.au

supermathie · July 29, 2022, 2:50am

References is discouraged from being used this way:

Note: Some implementations parse the “References:” field to display the “thread of the discussion”. These implementations assume that each new message is a reply to a single parent and hence that they can walk backwards through the “References:” field to find the parent of each message listed there. Therefore, trying to form a “References:” field for a reply that has multiple parents is discouraged; how to do so is not defined in this document.

cameron-simpson · July 29, 2022, 2:58am

By Martin Brennan via Discourse Meta at 29Jul2022 01:57:

Based on these further discussions @cameron-simpson I updated the TODOs
to this, posting them here so you get the update since the Discourse
edits will not arrive via email:

Change the generated Message-ID format to always be <discourse/post/:post_id@:hostname>, this is unique enough, it’s basically reverting to what we used to do. Referring to the OP will just use the first post ID now instead of just the bare topic ID.

If a post has an associated IncomingEmail record, we always use that Message-ID when sending email, otherwise we generate one using the format above.

Do not use a References when sending out emails for the OP of the topic, there is nothing to Reference yet because it is the first email in the thread.

I would also omit the In-Reply-To in the OP emails.

Ensure that correct In-Reply-To and References headers are
generated based on PostReply records.

Yes.

Personally, I would go the extra step of having a column for the
email-side message-id. That way, once you’ve allocated a message-id for
the post (from the source email if from email, or generated if from the
web interface), it remains stable regardless of whatever else might
happen in the code now or later. i.e. even if there’s no IncomingEmail
the message-id generation happens just the once, rather than being
recomputed (which could thus change).

i.e. make it stable once made by storing it.

You have an IncomingEmail relation by the look of it. Maybe you have
(or could use) an OutgoingEmail relation for the additional state for
the outbound email messages, made the first time a post is forwarded by
email.

I know the flow is basically that that happens when a post is made, but I
can imagine some later user feature being things like:

please forward me emails for this whole topic, now that I’m interested
if an edit happens to a post, consider sending the updated message out
with the same message-id

The reason the second example comes to mind is that we’ve got more
things to report. One is that Discourse seems to make some effort to
drop the quoted part of top posted replies to keep the post concise, or
something like that. I wrote a long post over in Python land a few weeks
back which got severely mistruncated. I went in and edited it on the forum
with the original text from my personal copy. But a recipient said he
had the full thing, and I was wondering if Discourse sent edit updates
out as replacement messages with the same id. Which would be quite neat
depending on the end user client handling of that.

cameron-simpson · July 29, 2022, 3:00am

By Martin Brennan via Discourse Meta at 29Jul2022 00:36:

Add a new outbound_message_id to the Post table, so we can be
sure the thread survives even if a post moves topics or anything like
that, store the Message-ID here for both of the above cases.

Yes, I think this is important, however implemented (relation or column
or whatever). I think I said that in your revised TODOs.

Do not use a References when sending out emails for the OP of the topic, there is nothing to Reference yet because it is the first email in the thread.

Ensure that correct In-Reply-To and References headers are generated based on PostReply records and the new outbound_message_id column on the Post table.

This has the potential to leave things in a bit of a murky state thread-wise for already sent emails, but I will try my best to allow for the format we are moving away from for a changeover period as well. Thanks for bearing with us!

Nothing we can do for the existing emails. Just get things well threaded
going forward!

Thanks!
Cameron Simpson cs@cskk.id.au

martin · July 29, 2022, 3:08am

We do have EmailLog but those records get cleaned up every 90 days, and I don’t think it would be a good fit to this. Will just do this:

supermathie · July 29, 2022, 3:15am

It was about avoiding storing it at all… but now I think of it the post ID will never change but hostname might. So we should store it immediately after saving in all cases.

Couldn’t hurt to have messageid be a property of every Post, forever immutable…

Wouldn’t this be a different version of the message? From the spec:

The “Message-ID:” field provides a unique message identifier that refers to a particular version of a particular message. … A message identifier pertains to exactly one version of a particular message; subsequent revisions to the message each receive new message identifiers.

So probably our generated message-id ought to be: <discourse/post/:post_id/rev/:revision_num> (possibly leaving off /rev/:revision_num for the first revision). This would allow email recipients to get the edit updates in the first place, considering that

martin · July 29, 2022, 3:19am

Yep will do that. As for these other discussions about edits and revisions, I think that is a whole other huge kettle of fish that we shouldn’t get into right now…let’s fix our threading sins first
