Email trimming improvement (no trimming in code blocks)

ZLMarshall · August 5, 2024, 12:25pm

I’ve also opened a GitHub issue about this, but I wanted to post it here as well in case more folks are watching:

github.com/discourse/email_reply_trimmer

Don't apply trimming within code blocks

opened 04:00PM - 18 Jul 24 UTC

We're using a pretty standard setup of Discourse with email integration, and it …looks like even within code blocks the trimming rules are applied. So for example in an email containing: ``` # This should not be deleted # # Or trimmed # It is code #### Code code code ``` Everything below the first '#' is trimmed. That's a bit inconvenient, as many folks use comment markers to break sections of their code, sometimes even in multi-line strings for printing. Is there any chance this is a common enough issue that someone might have a moment to see if it could be improved? I've gotten as far as the regex where the matching happens, but I'm not sure how complex adding an exception for code blocks would be. Thanks, Zach

I think it’d be great if the email trimming logic could be improved to avoid trimming within code blocks. So for example in an email containing:

```
# This should not be deleted
#
# Or trimmed
# It is code
####
Code code code
```

Everything below the first ‘#’ is trimmed. That’s a bit inconvenient, as many folks use comment markers to break sections of their code, sometimes even in multi-line strings for printing. It also has the convenient feature that if folks want to copy-paste program output into an email and that output includes such lines, the email will not be trimmed at them if the program output is encased in ticks. Is there any chance this is a common enough issue that someone might have a moment to see if it could be improved? I’ve gotten as far as the regex where the matching happens, but I’m not sure how complex adding an exception for code blocks would be.

Thanks!

ZLMarshall · November 16, 2024, 12:13am

Ok, I had to learn a bit of Ruby, but:

github.com/discourse/email_reply_trimmer

Allowing delimiters inside of code blocks

discourse:main ← zlmarshall:AllowDelimintersInCodeBlocks

opened 12:12AM - 16 Nov 24 UTC

zlmarshall

+81 -0

Addresses https://github.com/discourse/email_reply_trimmer/issues/21 and https:/…/meta.discourse.org/t/email-trimming-improvement-no-trimming-in-code-blocks/320159 Adds two new regex classes identifying opening and closing code blocks based on Discourse rules for code blocks as I understand them. Creates two new patterns for matching either of those code block patterns, and an additional pattern for matching both (since something like "```" will match both opening and closing block patterns). Allows delimiters within code blocks. Added a test case to show that it is indeed working as expected. No other test cases appear affected.

Discussion is welcome!

zogstrip · November 28, 2024, 4:02pm

Just so we’re on the same page, let me reformulate your problem so that we are talking about the same thing

You want the ability to correctly handle Fenced code blocks in email replies, especially if they include the # symbol, which is often used as a (signature) delimiter and thus trimmed.

So if you were to send the following email reply

Here's my patch

```
# This is some comment
####

answer = 42
```

Does it look good?

It should be “clever” enough to recognize that the lines between the ``` is actual code and thus should be “ignored” from the regular processing.

If that’s the case, I would recommend a different solution / approach. It might be better to hoist all the code blocks in the preprocess! function and inject them back afterwards.

Fenced code blocks are somewhat hard to parse correctly using a regex but for a good enough solution, this should work

def hoist_code_blocks(text)
  blocks = {}
  pattern = /^```\w*$\n.*?^```$/m
  
  text.gsub(pattern) do |block|
    token = SecureRandom.hex
    blocks[token] = block
    token
  end

  [text, blocks]
end

This method will replace all the code blocks with a random value and keep track of the mapping between the random value and the content of the block in the blocks hash.

You can call it like this

text = "some text\n```ruby\ndef foo\nend\n```\nmore text"
new_text, blocks = hoist_code_blocks(text)

And then you can “restore” the code blocks with the following code

blocks.each { |token, block| new_text.gsub!(token, block) }

ZLMarshall · November 28, 2024, 10:19pm

Thanks for the reply! You correctly understood the problem I’m having, yes.

I thought about doing something like that as well, and I’m happy to have that implemented if it’s working and matches the behavior of the in-browser parser as closely as possible. For example, in the browser interface I’m allowed white space before my language declaration, and I’m allowed white space after my closing:

``` (many spaces here) c++
int x=42;
``` (many spaces here)

Still rightly renders as:

int x=42;

In the PR above I tried to follow the rules that I could identify in the parser.

The other two questions I’d ask about your implementation are whether this should really be done in preprocess!, so that the blocks also have to be passed around or held by the EmailReplyTrimmer class (and which would be preferred), and whether there might be a bug in there, because text that’s returned there is the same as the original text, without any substitutions done (seemingly gsub returns an enumerator of the matches, but you aren’t actually doing the replacement here?).

In any case, happy for you to use the test that was added in my pull request above if you’d prefer to add this to code to the parser yourself, or if you let me know about the above couple of issues I can create a new pull request. You absolutely got the problem right, and it looks like your solution is close, I’m just not sure quite how you’d like it finished off.

Thanks!

zogstrip · November 28, 2024, 10:39pm

Yeah, it’s starting to get complicated… It’s doable, but there will always be edge-cases as opposed to having a real parser.

You can also have more than 3 `, 3 being the minimum

You can also do it in the trim function, right after the preprocess! call and do the “postprocess” step towards the end.

Right, that was mostly pseudo-code

You can probably use gsub! or do text = text.gsub....

ZLMarshall · November 28, 2024, 11:06pm

Ok, great — I’ve opened a new PR here:

github.com/discourse/email_reply_trimmer

Alternate code block protection

discourse:main ← zlmarshall:HoistCodeBlocks

opened 11:06PM - 28 Nov 24 UTC

zlmarshall

+47 -1

Following further discussion at https://meta.discourse.org/t/email-trimming-i…mprovement-no-trimming-in-code-blocks/320159 Pulling code blocks out of the email to be trimmed completely, and then replacing them in a post-processing step. Added a test to demonstrate that this is working properly. Thanks to zogstrip for the help!

Thanks again!

zogstrip · November 30, 2024, 3:15pm

Bumped the version in Discourse as well

github.com/discourse/discourse

DEV: bump email_reply_trimmer version to 0.2.0

discourse:main ← discourse:update-email-reply-trimmer

opened 03:14PM - 30 Nov 24 UTC

ZogStriP

+1 -1

Context - https://meta.discourse.org/t/email-trimming-improvement-no-trimming-in…-code-blocks/320159

zogstrip · December 2, 2024, 7:00am

This topic was automatically closed after 39 hours. New replies are no longer allowed.

Topic		Replies	Views
Email replies truncated at code fences with tildes Bug	1	746	March 3, 2020
Disable indent code block for emails Feature	21	1799	April 30, 2020
Content after a horizontal line in email replies is stripped? Bug	2	586	January 5, 2022
Single quote block dropped in email reply Bug	7	1445	April 15, 2020
Code blocks in emails have empty newlines Support	3	501	June 21, 2022

Email trimming improvement (no trimming in code blocks)

Related topics