By Martin Brennan via Discourse Meta at 22Jul2022 06:34:
Okay this is kind of huge, please bear with me. First, thanks for
another detailed reply and the offer of debugging / review, it is
really helpful
I’ve actually been looking into this this morning
and, surprisingly, the threading in a unified view works in Thunderbird
for most cases, and I think the References header consistently
pointing to the OP helps with that (for example the topic Reference
in this chain which is always present is
<topic/53@discoursehosted.martin-brennan.com>.
I’ve just reread RFC5322 section 3.6.4 closely. It has moved on from
earlier versions (822 and 2822), and has merged the email In-Reply-To
headers, USENET References headers and modern
reply-citing-more-that-one previous messages.
The short summary:
- The
Message-ID is a single persisent identifier for a message
- The
In-Reply-To contains all the message-ids of which this message
is a direct reply, so if I reply to a pair of messages it will have
those 2 message-ids
- The
References is a reply chain of antecedant message-ids from the
OP to the preceeding message. So indeed it should always start with
the OP message-id.
So for a discussions like this, pretending that labels are message-ids:
OP
-> reply1
-> reply2 ---+
-> reply3 |
-> reply4 |
-> reply5 <+
The reply5 would have:
- message-id=reply5
- in-reply-to=“reply2 reply4”
- references=“OP reply3 reply4”
It is also leagel to include “reply1 reply2” in the references (the
other chain to reply5) but the RFC explicitly recommends against that
becaause some clients expect the references to be a single linear chain
of replies, not some flattened digraph.
So my recommendation for constructing the references is to use the
references of the “primary” antecedant message with the primary
antecedant message’s message-id appended. That way you always get a
linear chain in the correct order.
Interestingly there seems to be some threading there.
But notice: the top post has a little “is a reply” arrow. Even though it
is post 1. I expect that is because of the “topic” references entry,
which make TB think there was a earlier message (which of course there
was not).
In mutt-land we see almost no threading at all:
23Jul2022 06:24 Olha via Discus - ┌>[Py] [Users] I need an advise discuss-users 5.7K
22Jul2022 17:12 Paul Jurczak vi - ├>[Py] [Users] I need an advise discuss-users 5.5K
22Jul2022 13:21 Rob via Discuss - ├>[Py] [Users] I need an advise discuss-users 6.8K
22Jul2022 12:53 vasi-h via Disc - ├>[Py] [Users] I need an advise discuss-users 5.5K
22Jul2022 11:38 Cameron Simpson - ├>[Py] [Users] I need an advise discuss-users 14K
22Jul2022 10:27 Rob via Discuss - ├>[Py] [Users] I need an advise discuss-users 6.6K
22Jul2022 06:14 vasi-h via Disc r ┴>[Py] [Users] I need an advise discuss-users 6.5K
which is because every message’s In-Reply-To points directly at the
fictitious “topic” message-id. Mutt probably ignores the References
because it is a mail reader, and References originates in USENET news.
Maybe Thunderbird is using the references or augumenting the in-reply-to
with references information.
You only need to consult one of In=-Reply-To or References to do
threading; the former comes from email and the latter from USENET.
You’re supporting both (which is great!) so we need to make them
consistent.
(Aside: there’s also discussion about USENET mirroring, because several
python people consume the lists via a USENET interface. Again, a
separate topic.)
[…]
[quote=“Cameron Simpson, post:8, topic:233499,
username:cameron-simpson”]
This looks fine. Does it save this id with the db record so that inbound
replies can be tied to the antecedant forum message?
[/quote]
The answer is – it depends. If a post is created in Discourse from an inbound email, such as this one of yours, we use that post’s original inbound Message-ID when someone replies to it for the In-Reply-To and References headers as per:
discourse/lib/email/sender.rb at 98bacbd2c6b9fe57167cd32af5eb4839b4a5d1f6 · discourse/discourse · GitHub
Otherwise we are just using the topic OP reference and just generating a new reference, which obviously is what is causing all the issues. In all cases we generate a new Message-ID every time an outbound email is sent, which seems correct and on par with other mail clients.
Alas, not quite. If you’re the origin of the message (i.e. authored in
Discourse), generating the message-id is fine. If there’s no message-id
(illegal) generating one is standard practice (usually by MTAs). But if
you’re passing a message on (authored in email), the existing message-id
should be preserved.
To my mind you need to be doing 3 things:
- having a stable message-id and not replacing the message-id from an
inbound message
- generating correct
In-Reply-To, which is easily computed from the
immediate antecedant message(s) i.e. antecedant(s)-Message-ID
- generating correct
References, which is easily computed as
antecedant-References + antecedant-Message-ID
For point 1, looking at the code you cite, you probably want the email
message id to be (Pythonish syntax, sorry):
def message_id(post):
return post.incoming_email.message_id or discourse_message_id(post)
i.e. to be the post’s email message-id if it originated from email,
otherwise the Discourse message-id using something like the algorithm
you outline later in this message: anything (a) stable and (b)
syntacticly valid.
Then computing the In-Reply-To and References fields is simple
mechanical stuff as in points 2 and 3.
I think I see what you mean, does it go like this:
- cameron sends email to Discourse from mutt which gets
Message-ID: 74398756983476983@mail.com
- Discourse creates a post and stores the
Message-ID with against the post with an IncomingEmail record
Correct.
- johndoe is watching the topic, so they get sent an email from Discourse with a
Message-ID: topic/222/44@discourse.com and no reference to the original Message-ID: 74398756983476983@mail.com
No. You really want to pass through IncomingEmail.message_id as the
Message-ID in the email to johndoe. It’s the same message.
Does that sound correct, that we should just “pass on” that Message-ID to those watching the topic instead of generating our own since it’s already unique? What then happens in johndoe’s mail client if
cameron also CC’d him on that original outbound message? This does sound like a separate issue so it would be good to open another bug topic for it.
By passing it on, the original message (cameron->cc:johndoe) and the
Discourse forwarded message (cameron->Discourse->johndoe) have the same
message-id and the same message contents. The receiving mail system
stores both. The mail reader sees both, and either presents both or
keeps just one (this is a policy decision of the mail reader - keeping
just one is common). Because they’re the same message, in general it
does not matter which is kept.
If we ignored discourse and considered a message which was
a copy of the message via the list and also via direct email. They’re
the same message, with the same message-id.
I will set up a mutt client locally to see what you are also seeing, I have never tested this functionality in a text-based client (only Gmail and Thunderbird) so I am keen to see how it looks anyway.
Happy to help with settings. For threaded view you need to set the
sorting to threadeed. Mutt is very configurable.
My line of thinking to address these issues this morning was to dispose
with the randomly generated suffixes generated when we send
Message-ID headers in emails and instead change to a scheme where we
use the user_id of both the sending and receiving user. The benefit
of this is that there is no need to store the Message-ID anywhere
(apart from when an inbound email creates a post) and so References
and In-Reply-To headers will always be consistent.
Yes, that is much better. Noting that the inbound email message-id
should override the Discourse derived message-id for the outbound email.
(Most mail systems use random strings because there’s no surrounding
context such as the discourse topic message structure - messages are
considered alone; but the only real requirement is persistent
uniqueness.)
Let me give an example. Say we have these users:
- martin - user_id 25
- cameron - user_id 44
- sam - user_id 78
- bob - user_id 999
And then we have this topic, topic_id 233499, with posts starting from post_id 100 as the OP. The format would become topic/#{topic_id}/#{post_id}.s#{sender_user_id}r#{receiver_user_id}.
The order of operations would look like this:
- martin creates the OP
- cameron is sent an email with these headers:
Message-ID: topic/233499.s25r44@meta.discourse.org
References: topic/233499@meta.discourse.org
- sam is sent an email with these headers:
Message-ID: topic/233499.s25r78@meta.discourse.org
References: topic/233499@meta.discourse.org
-
There should not be a References header in the OP. It isn’t
needed for threading and effectively pretends there’s some “post 0”
which doesn’t exist. It meeans every OP (a) looks like a reply, which it
is not and (b) looks like the thing to which it is a reply is missing
from the reader’s mailbox.
-
This makes different message-ids for each outbound copy of the OP.
That’s bad. They need to be the same. Supposing sam CCs cameron
directly in a reply. The In-Reply-To will cite a mesage-id cameron
has never received.
You can just drop the sender_user_id and receiver_user_id from the
message-id field and get a single unique id which every receiver sees.
The uniqueness constraint is the post itself, not the individual
email-level “message” object.
Re the References, the OP should not have one. TB and everything else
will be fine. If they’re threading using References instead of
In-Reply-To, the References in the reply messages are enough.
Here’s the start of a mailing list discussion thread in Mutt:
16Jul2022 01:09 Rob Boehne - │├>[Python-Dev] Re: [SPAM] Re: Swit python-dev 9.2K
16Jul2022 01:33 Peter Wang - │├> python-dev 3.0K
16Jul2022 00:24 Skip Montanaro - ├>[Python-Dev] Re: Switching to Dis python-dev 4.2K
16Jul2022 04:49 Erlend Egeberg - ├>[Python-Dev] Re: Switching to Dis python-dev 10K
16Jul2022 04:20 Mariatta - ├>[Python-Dev] Re: Switching to Dis python-dev 10K
15Jul2022 21:18 Petr Viktorin - [Python-Dev] Switching to Discourse python-dev 4.2K
Ignore that I sort my email newest-on-top. See that there’s no arrow on
the initial post (at the bottom). That messgae has no References and
no In-Reply-To. All the others have In-Reply-To (and possibly
References, but this is an email mailing list so not necessarily; as I
mentioned before they’re complimentary.)
If I repeat my Discourse example from earlier:
23Jul2022 06:24 Olha via Discus - ┌>[Py] [Users] I need an advise discuss-users 5.7K
22Jul2022 17:12 Paul Jurczak vi - ├>[Py] [Users] I need an advise discuss-users 5.5K
22Jul2022 13:21 Rob via Discuss - ├>[Py] [Users] I need an advise discuss-users 6.8K
22Jul2022 12:53 vasi-h via Disc - ├>[Py] [Users] I need an advise discuss-users 5.5K
22Jul2022 11:38 Cameron Simpson - ├>[Py] [Users] I need an advise discuss-users 14K
22Jul2022 10:27 Rob via Discuss - ├>[Py] [Users] I need an advise discuss-users 6.6K
22Jul2022 06:14 vasi-h via Disc r ┴>[Py] [Users] I need an advise discuss-users 6.5K
See they all have a leading arrow? That is because the mail client
believes they are all replies to a common (and missing) root message,
which is because of the “topic” message-id in the References header.
Whereas post 1 is actually the bottom message displayed above.
Summary:
- your plan is good, provided you drop the sender and receiver from the
message-id - they’re unnecessary and in fact the receiver will cause
trouble (the sender is just redundant).
- drop the “topic” pseudo-message-id from the
References - it misleads
email clients (including TB, even if it isn’t visually evident)
- cameron replies via email
- discourse is sent an email with these headers from mutt:
Message-ID: 43585349859734@test.com
References: topic/233499@meta.discourse.org topic/233499.s25r44@meta.discourse.org
In-Reply-To: topic/233499.s25r44@meta.discourse.org
Yes, again with the caveat that there should not be a “topic” reference.
As expected, there is a reference to the OP message-id. Though it should
be the same message-id that sam sees for the OP.
- discourse (as cameron, from the above email) creates post 101
- sam is sent an email from discourse with these headers:
Message-ID: topic/233499/101.s44r78@meta.discourse.org
References: 43585349859734@test.com topic/233499@meta.discourse.org
In-Reply-To: 43585349859734@test.com
And here it goes wrong. The Message-ID should be
43585349859734@test.com from the .incoming_post.message_id field.
(Well, in my mind this is post.message_id(), which returns
post.incoming_post.message_id for an email generated post and your
Discourse generated one otherwise).
Consider: I compose and send my reply with message-id
43585349859734@test.com. For continuity reasons, I keep a copy of that
in my local folder, where it shows as a reply to the OP. Ideally
Discourse also sends me a copy of my own post (this is a policy setting
on many mailing lists), so I get Discourse’s version also. That should
have the same message-id, because it is the same message, just via a
different route.
Discourse’s message is not “in reply to” my message. It is my
message, just forwarded.
This effect cascades through your following examples. The actual process
should be simpler than you’ve made it.
Think of it this way. If I reply to a post from email, it effectively is
like me emailing sam (and the others) via Discourse. Discourse
forwards my message to the email-receiving subscribers, and
“incidentally” keeps a copy on the forum 
As a side note, I also looked into what GitHub do with their
notification emails, and noticed they do a similar thing where they
have an ever-present Reference
(discourse/discourse/pull/252@github.com) that is used in all the
emails related to that “topic” which in this case is a GitHub pull
request:
References: <discourse/discourse/pull/252@github.com> <discourse/discourse/pull/252/issue_event/7042100517@github.com>
In-Reply-To: <discourse/discourse/pull/252/issue_event/7042100517@github.com>
Hoo, github. What a disaster their issue emails are 
However, in their scenario, the PR is the OP. So a reference directly
to the pull is sane. You could use the “topic” message-id for post 1,
provided you didn’t also use the “topic/1” id as well. But there seems
little point - it is extra effort to special case post 1 - I’d just use
“topic/1” myself.
To add some complication. As I understand it, an admin can move a post
or topic. Doesn’t that break the “generate the message-id” scheme,
particularly if they move just a post? I’m somewhat of the opinion that
every post should have a _message_id field, filled in from the
incoming message (from email) or generated (posting via Discourse). Then
it is persistent and stable and robust against any shuffling of posts or
changes of algorithm.
Finally, there’s a small security consideration: you should ignore the
inbound email message-id (and potentially bounce the message) if it
claims the message-id of an existing post. Since as an author, I can put
anything I like in that header
I’d go with just dropping the
message-id - accept the post, but don’t let it lie about being some
other post - give your copy the Discourse-generated id and then proceed
as normal.