Catch and educate users posting code without properly formatting it

In this post @codinghorror mentioned that StackOverflow has some logic that catches users trying to post unformatted code (“there are a lot of unusual [ and { characters per line, etc”) and then blocks them from posting until they format it.

He also mentioned that there already was some discussion on writing such a plugin, but I couldn’t find it - so here is a new topic.

Does anyone remember the former posts about this?

If not, I will try to refine the idea a bit and look at how SO does it, so we can write a spec for how this plugin could or should work. (Unfortunately, I have no idea how RoR works, so I can’t code any of it myself.)

Here are my Canned Replies to handle this manually, that could be used as starters for the text people are shown:

And some elaborate explanations of how to post (code) on SO:
https://meta.stackexchange.com/a/22189/142000
https://meta.stackexchange.com/help/formatting
https://meta.stackexchange.com/editing-help

Anyone know where to find anything about the feature @codinghorror mentioned?

@naveedahmada036 posted an example for a Slack plugin that does something related:

Step 1

So having thought about it a bit, I think these are the possible basic ways to implement it:

  1. Catch the user pasting code and somehow make it formatted (see step #2).
  2. Catch the user posting a topic or reply that contains unformatted code and handle it.
  3. Bot-reply or -PM a user after he posted something with unformatted code.

I think option 3) is out of scope here. It is more of a moderation or community management thing that is probably better handled manually by moderators or with another plugin. That leaves 1) and 2).

#1 would require monitoring paste events, analyzing their content, and then some interface for the next step. Alternatively, the code could be wrapped in the correct formatting automatically, but this adds a lot of complexity (what if someone pastes code into an already existing code block?). It would also be the most newbie-friendly way, though.

#2 would require analysis of the text to be posted. Again, some form of interface would be needed to inform users of their options, or some logic could try to add formatting to the post automatically. The latter would be very convenient for the user, but quite complex to implement (Is this 1 or 2 code samples? Code block or just code in text?).

Step 2

The “somehow make it formatted”, “handle it” and “interface” above refer to the possible variants of the second step of the process. This would probably have to include some “error” or “notice” to the user:

Seems you are trying to post code.
Please apply ``` around it to format as code.
Or select the code and hit the </> button to format it automatically.

(Better wording of course).

Then this could maybe highlight the toolbar button to format as code.

If we decided to handle this at “pasting” time of code, the pasted text could also still be selected in the textarea so clicking the button would be enough.
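
To make the "select the code and hit the button" idea concrete, here is a minimal sketch of what such a button would do to the text. It assumes a plain textarea model (a string plus the offsets that `selectionStart`/`selectionEnd` would give); `wrapInFence` is a hypothetical helper for illustration, not an existing Discourse function.

```javascript
const FENCE = "`".repeat(3); // three backticks

// Hypothetical helper: wrap text[start..end) in a fenced code block.
// Returns the new text plus the selection range covering the wrapped code.
function wrapInFence(text, start, end) {
  const before = text.slice(0, start);
  const selected = text.slice(start, end);
  const after = text.slice(end);

  // Fences must sit on their own lines, so add line breaks where missing.
  const lead = before === "" || before.endsWith("\n") ? "" : "\n";
  const trail = after === "" || after.startsWith("\n") ? "" : "\n";
  const body = selected.endsWith("\n") ? selected : selected + "\n";

  const open = lead + FENCE + "\n";
  const close = FENCE + trail;
  return {
    text: before + open + body + close + after,
    selectionStart: before.length + open.length,
    selectionEnd: before.length + open.length + body.length,
  };
}
```

In a browser you would pass `textarea.value`, `textarea.selectionStart` and `textarea.selectionEnd`, then write the result back. The sketch does not handle code that itself contains a run of three backticks.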


Anything else to consider?

I would start much easier by adding a new ComposerMessagesFinder which would check for a large number of special characters mostly used in code and sending back to the user a JIT message teaching them how to properly format code.

Trying to automatically format code is going to end in :cry:

Yeah, my conclusion as well.
Just wanted to brainstorm the options.

Is this ComposerMessagesFinder at the time of pasting or posting?
Is this “JIT message” already a thing? Is this a private message or some UI element shown to the user?

The concept already exists - try creating a brand new user on try.discourse.org, make a topic, and you’ll see it.

So you could have something that looks like this:

It’s whenever we save a draft.

That looks perfect. It doesn’t block the actual typing and editing, users can click it away and ignore it, there is actually enough space to really explain the task.
The only drawback is that it covers the preview, which hides the actual difference the code formatting makes. Any workaround?

Saving a draft also sounds like a reasonable moment to trigger this.

Ok, so the last thing to define would probably be how to recognize code.

Some suggestions:

  • more than 2 { or } in one paragraph
  • more than 2 [ or ] in one paragraph
  • more than 2 ; in one paragraph
  • more than 5 \t (tab characters) in one paragraph

More?
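
As a starting point, the thresholds suggested above could be sketched like this. The rule names, the `RULES` table and both functions are made up for illustration, and "paragraph" is assumed to mean text separated by blank lines.

```javascript
// Thresholds taken from the list above: a paragraph is suspicious when it
// has MORE than `max` matching characters.
const RULES = [
  { name: "curly braces", pattern: /[{}]/g, max: 2 },
  { name: "square brackets", pattern: /[\[\]]/g, max: 2 },
  { name: "semicolons", pattern: /;/g, max: 2 },
  { name: "tabs", pattern: /\t/g, max: 5 },
];

// Returns the names of the rules a single paragraph trips.
function trippedRules(paragraph) {
  return RULES.filter(
    ({ pattern, max }) => (paragraph.match(pattern) || []).length > max
  ).map(({ name }) => name);
}

// Splits the post on blank lines and reports the suspicious paragraphs.
function suspiciousParagraphs(post) {
  return post
    .split(/\r?\n\s*\r?\n/)
    .map((text, index) => ({ index, rules: trippedRules(text) }))
    .filter(({ rules }) => rules.length > 0);
}
```

Note that this does not yet skip code that is already formatted, and the square bracket rule will probably fire on paragraphs with several Markdown links, so the numbers would need tuning against real posts.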

Any chances you could “steal” what stackoverflow is doing? They probably put a lot of thinking into this. No point in re-inventing the wheel.

That would make sense, but my Google mojo didn’t turn anything up. It’s somehow pretty hard to google for helper functionality that formats code on StackOverflow :wink: @codinghorror, can you point me to where to look?

This looks great, but could you please (pretty please with sugar on top) share some more details on how to implement ComposerMessagesFinder for catching users pasting code?

Let us say I just want to cover C/C++, so it would only need to detect semicolons and { } curly braces. Could you please point me to some documentation on that? Should I just modify composer_messages_finder.rb? (I am not familiar with Ruby, so please excuse my ignorance: how do I catch a “paste” event in Ruby?)

Thanks a lot in advance for any help you can offer.

What was posted here actually were just ideas how this could all go together and form a plugin for this functionality.

What is missing is exactly what you are asking for:
How to find out if a user posts code?
What regex or other technique can or should be used?
How does this work in other places? (StackOverflow was mentioned, but I couldn’t find any additional information about it)

I think you need a score which resets every (x) paragraphs that don’t contain enough characters to meet the threshold. So

  1. Loop through entire body
  2. Look at the current paragraph
  3. Check our current “is this code?” score
  4. Does this paragraph have a higher than expected special character score? Also, is this paragraph shorter than we’d expect?
  5. If we are already in “is this code?” mode, add the number of lines in sequence so far to the current score, to make it greater
  6. If the last (x) paragraphs have zero code, set “is this code?” to zero and record the ending line number
  7. If this is the first paragraph to trigger code mode set “is this code?” to the value of the paragraph score, and record the starting line number

You’d need a sizable test corpus for this to work. I know @zogstrip has one because I’ve seen it :wink:

Also

we can realistically limit our check to, say, the “big ten” languages. Per the tags page that would be C#, Java, PHP, JavaScript, Objective-C, C, C++, Python, Ruby.

This is also interesting as an approach

Run your syntax highlighter on the text. If it ends up highlighting some high percentage of it, it’s probably code.

However Random Syntax Highlighting tends to love highlighting numbers in text for no reason. So you’d need to play with that, but it’s a very interesting idea.
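
A sketch of how the "high percentage highlighted" check might look, with number tokens left out of the count because of the problem just mentioned. It assumes the highlighter output has already been normalized into `{ text, kind }` pairs (with `kind` set to `null` for unstyled text). That shape, the kind names and the 0.5 threshold are all assumptions, not any real library's API.

```javascript
// Token kinds that should not count as evidence of code.
// Numbers tend to get highlighted in plain prose too.
const IGNORED_KINDS = new Set(["number"]);

// `tokens` is an array of { text, kind }; kind is null for unstyled text.
function highlightedRatio(tokens) {
  let highlighted = 0;
  let total = 0;
  for (const { text, kind } of tokens) {
    const visible = text.replace(/\s+/g, "").length; // whitespace carries no signal
    total += visible;
    if (kind !== null && !IGNORED_KINDS.has(kind)) highlighted += visible;
  }
  return total === 0 ? 0 : highlighted / total;
}

function probablyCode(tokens, threshold = 0.5) {
  return highlightedRatio(tokens) >= threshold;
}
```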

What happens when you randomly type text and have it in a syntax highlighting block? 

Let's find out. Is this code? I dunno, is it code? You tell me.

FYI @erlend_sh if you are looking for encouragement projects this one is extremely strong.

For reference, here’s a kitchen sink dump of answers from Jeff asking this same question about implementation for SO back in 2011.

@Jose_C_Gomez this isn’t actually a plugin. It’s just a theory with some proposed implementation logic.

I’m in the process of creating a plugin for this, using a naïve (but hopefully effective) pattern matching approach.

My main doubt is actually about the UI, though. It seems Discourse currently has at least 3 (maybe more?) idioms for alerting users before posting:

  • Yellow overlay on top of rendered markdown. Dismissible, doesn’t prevent posting. Example: welcome message for first couple of posts. Least disruptive.
  • Orange/red tooltip-like prompt. Prevents posting and jiggles if you try to post before fixing the problem. Example: “post too short” message.
  • Modal dialog box. Prevents posting. Example: “your post looks like gibberish” message. Most disruptive.

I don’t think the modal is the right tool, as the message will need to be at least a couple of paragraphs long. Then again, the yellow overlay seems too easy to ignore. Possibly a combination of yellow overlay and orange jiggly tooltip could work?

Ideally, it’d also be permanently dismissible (“don’t show me this again”) to avoid annoying power users who intentionally do funky things with their markup that might otherwise trigger the pattern matcher.

Anyone have any thoughts? I’m open to other UI idioms as well, as long as they’re consistent with the general Discourse look and feel.

I would try to get detection down reliably first before agonizing too much over UI.

Here’s the basis of what I have for detection so far:

Code
const codeTypes = [
  // lazy `*?` so that two separate blocks are not merged into one match
  /(^`{3,}).*\r?\n[\s\S]*?\r?\n\1/gm, // backtick-fenced block
  /(^~{3,}).*\r?\n[\s\S]*?\r?\n\1/gm, // tilde-fenced block
  /(?:^|(?:\r?\n{2,}))\s*(?:(?: {4}|\t).*(?:\r?\n|$))/g, // indented block
  // lack of `m` flag is intentional (`^` must match beginning of input, not line)

  /\[code.*?\][\s\S]*?\[\/code\]/gm, // BBCode tags

  /`.+?`/g, // inline backticks (must come last)
];

const varNameStart = '[$_a-zA-Z]';
const varNameEnd = '[$_a-zA-Z0-9]*';
const varName = `${varNameStart}${varNameEnd}`;
const xmlLikeName = '[a-zA-Z-]+';

const nonHtmlIndicators = [
  `[$_]${varName}`, // almost certain to be var name
  `${varName}(?:_${varName})+`, // snake_case
  // camelCase and spinal-case omitted due to too many false positives
  '(?:^|\\s+)(?:\\/\\/|[;])', // single-line comment
  // ignore python-style `#` single-line comments due to conflict with md headings
  `\\/\\*[\\s\\S]+\\*\\/`, // C-like multiline comment
  `('''|""")[\\s\\S]+\\1`, // python-like multiline string/comment
  ';\\s*$', // trailing semicolon
  `${varName}\\((?:${varName})?\\)`, // function call
  `${varName}\\[(?:${varName})?\\]`, // array index (index name optional)
  `${varName}\\.${varName}`, // object property
  '^\\s*[{}]\\s*$', // curly brace and nothing else on a line
  '\\{\\{.+\\}\\}', // templating languages e.g. handlebars
  '[$#]\\{.+\\}', // template string
  '&&|\\|\\||==|!=|>=|<=|=>|->|>>|<<|::'
  + '|__|!!|\\+\\+|\\+=|-=|\\*=|\\/=|\\|=|&=', // various operators
  '\\\\[\'"ntr0\\\\]', // common escape sequences
];

const htmlIndicators = [
  '<!--[\\s\\S]*?-->', // xml-like comment
  `<${xmlLikeName}.*\\/?>`, // xml-like start/empty tag
  `</${xmlLikeName}>`, // xml-like end tag
  // no trailing `$`, so entities are matched anywhere in a line
  '&([0-9a-zA-Z]+);', // html entity - human-readable
  '&#([0-9]{1,7});', // html entity - decimal
  '&#x([0-9a-fA-F]{1,6});', // html entity - hex
];

const indicators = nonHtmlIndicators.concat(INCLUDE_HTML ? htmlIndicators : [])
.map(str => new RegExp(str, 'gm'));

Strip out everything under codeTypes, then check for anything under indicators in the remaining content, and warn if the number of matches is above a preconfigured threshold (defaulting to 0).
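
Roughly, the strip-then-count step described above could look like the sketch below. It uses a cut-down subset of the patterns so it runs on its own. `sampleCodeTypes`, `sampleIndicators` and the three function names are placeholders, not the plugin's actual API.

```javascript
// Cut-down, illustrative pattern sets (assumed subset of the full lists).
const sampleCodeTypes = [
  /(^`{3,}).*\r?\n[\s\S]*?\r?\n\1/gm, // backtick-fenced block
  /`.+?`/g, // inline backticks (must come last)
];
const sampleIndicators = [
  /;\s*$/gm, // trailing semicolon
  /^\s*[{}]\s*$/gm, // curly brace and nothing else on a line
  /&&|\|\||==|!=|=>/g, // a few operators
];

// Remove everything that is already formatted as code.
function stripFormattedCode(text) {
  return sampleCodeTypes.reduce((rest, re) => rest.replace(re, ""), text);
}

// Count indicator matches in whatever is left.
function countIndicators(text) {
  const rest = stripFormattedCode(text);
  return sampleIndicators.reduce(
    (n, re) => n + (rest.match(re) || []).length,
    0
  );
}

// Warn when the number of matches is above the threshold (default 0).
function shouldWarn(text, threshold = 0) {
  return countIndicators(text) > threshold;
}
```

With the default threshold of 0, a single stray `==` in prose is enough to trigger the warning, which is why the threshold should stay configurable.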

I guess I’ll just try to get a working version up and running using native browser alert, then work from there.
