Email trimming improvement (no trimming in code blocks)

I’ve also opened a GitHub issue about this, but I wanted to post it here as well in case more folks are watching:

I think it’d be great if the email trimming logic could be improved to avoid trimming within code blocks. So for example in an email containing:

```
# This should not be deleted
#
# Or trimmed
# It is code
####
Code code code
```

Everything below the first ‘#’ is trimmed. That’s a bit inconvenient, as many folks use comment markers to break sections of their code, sometimes even in multi-line strings for printing. It also has the convenient feature that if folks want to copy-paste program output into an email and that output includes such lines, the email will not be trimmed at them if the program output is encased in ticks. Is there any chance this is a common enough issue that someone might have a moment to see if it could be improved? I’ve gotten as far as the regex where the matching happens, but I’m not sure how complex adding an exception for code blocks would be.

Thanks!

Ok, I had to learn a bit of Ruby, but:

Discussion is welcome!

3 Likes

Just so we’re on the same page, let me reformulate your problem so that we are talking about the same thing :wink:

You want the ability to correctly handle Fenced code blocks in email replies, especially if they include the # symbol, which is often used as a (signature) delimiter and thus trimmed.

So if you were to send the following email reply

Here's my patch

```
# This is some comment
####

answer = 42
```

Does it look good?

It should be “clever” enough to recognize that the lines between the ``` is actual code and thus should be “ignored” from the regular processing.

If that’s the case, I would recommend a different solution / approach. It might be better to hoist all the code blocks in the preprocess! function and inject them back afterwards.

Fenced code blocks are somewhat hard to parse correctly using a regex but for a good enough solution, this should work

def hoist_code_blocks(text)
  blocks = {}
  pattern = /^```\w*$\n.*?^```$/m
  
  text.gsub(pattern) do |block|
    token = SecureRandom.hex
    blocks[token] = block
    token
  end

  [text, blocks]
end

This method will replace all the code blocks with a random value and keep track of the mapping between the random value and the content of the block in the blocks hash.

You can call it like this

text = "some text\n```ruby\ndef foo\nend\n```\nmore text"
new_text, blocks = hoist_code_blocks(text)

And then you can “restore” the code blocks with the following code

blocks.each { |token, block| new_text.gsub!(token, block) }

Thanks for the reply! You correctly understood the problem I’m having, yes.

I thought about doing something like that as well, and I’m happy to have that implemented if it’s working and matches the behavior of the in-browser parser as closely as possible. For example, in the browser interface I’m allowed white space before my language declaration, and I’m allowed white space after my closing:

``` (many spaces here) c++
int x=42;
``` (many spaces here)

Still rightly renders as:

int x=42;

In the PR above I tried to follow the rules that I could identify in the parser.

The other two questions I’d ask about your implementation are whether this should really be done in preprocess!, so that the blocks also have to be passed around or held by the EmailReplyTrimmer class (and which would be preferred), and whether there might be a bug in there, because text that’s returned there is the same as the original text, without any substitutions done (seemingly gsub returns an enumerator of the matches, but you aren’t actually doing the replacement here?).

In any case, happy for you to use the test that was added in my pull request above if you’d prefer to add this to code to the parser yourself, or if you let me know about the above couple of issues I can create a new pull request. You absolutely got the problem right, and it looks like your solution is close, I’m just not sure quite how you’d like it finished off.

Thanks!

Yeah, it’s starting to get complicated… It’s doable, but there will always be edge-cases as opposed to having a real parser.

You can also have more than 3 `, 3 being the minimum :wink:

You can also do it in the trim function, right after the preprocess! call and do the “postprocess” step towards the end.

Right, that was mostly pseudo-code :sweat_smile:

You can probably use gsub! or do text = text.gsub....

Ok, great — I’ve opened a new PR here:

Thanks again!

3 Likes

Bumped the version in Discourse as well :up:

3 Likes

This topic was automatically closed after 39 hours. New replies are no longer allowed.