Catch and educate users posting code without properly formatting it

Any chance you could “steal” what Stack Overflow is doing? They probably put a lot of thought into this. No point in reinventing the wheel.

Would make sense, but my Google mojo didn’t turn anything up - it’s somehow pretty hard to google for helper functionality that formats code on StackOverflow :wink: @codinghorror, can you point us to where to look?

This looks great, but could you please (pretty please with sugar on top) share some more details on how to implement ComposerMessagesFinder for catching users pasting code?

Let us say that I would like to cover just C/C++, so it would only detect semicolons and { } curly braces. Could you please offer some documentation on that? Should I just modify composer_messages_finder.rb? (I am not familiar with Ruby, so please excuse my ignorance: how do I catch a “Paste” event in Ruby?)

Thanks a lot in advance for any help you can offer.

What was posted here were actually just ideas on how this could all fit together and form a plugin for this functionality.

What is missing is exactly what you are asking for:
How to find out if a user posts code?
What regex or other technique can or should be used?
How does this work in other places? (StackOverflow was mentioned, but I couldn’t find any additional information about it)

I think you need a score which resets every (x) paragraphs that don’t contain enough characters to meet the threshold. So

  1. Loop through entire body
  2. Look at the current paragraph
  3. Check our current “is this code?” score
  4. Does this paragraph have a higher than expected special character score? Also, is this paragraph shorter than we’d expect?
  5. If we are already in “is this code?” mode, add the number of lines in sequence so far to the current score, to make it greater
  6. If the last (x) paragraphs have zero code, set “is this code?” to zero and record the ending line number
  7. If this is the first paragraph to trigger code mode set “is this code?” to the value of the paragraph score, and record the starting line number
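
The steps above could be sketched roughly like this in JavaScript. This is only a minimal sketch: the character set, every threshold, and the names `splitParagraphs`, `paragraphScore` and `findCodeRuns` are assumptions made up for illustration, and the numbers would need tuning against a real corpus.

```javascript
// All thresholds below are made-up starting points, not tuned values.
const SPECIAL_CHARS = /[{}()\[\];=<>&|$+\/\\]/g;
const MIN_SPECIAL_RATIO = 0.08; // share of special characters that looks suspicious
const MAX_AVG_LINE_LENGTH = 60; // code lines tend to be shorter than prose lines
const RESET_AFTER = 2; // the "(x)" code-free paragraphs that end a run

// Group non-blank lines into paragraphs, remembering 1-based line numbers.
function splitParagraphs(body) {
  const paragraphs = [];
  let current = null;
  body.split(/\r?\n/).forEach((line, index) => {
    if (line.trim() === '') {
      current = null;
      return;
    }
    if (current === null) {
      current = { start: index + 1, lines: [] };
      paragraphs.push(current);
    }
    current.lines.push(line);
  });
  return paragraphs;
}

// Step 4: a paragraph scores only if it is dense in special characters
// and its lines are short on average.
function paragraphScore(lines) {
  const text = lines.join('\n');
  const special = (text.match(SPECIAL_CHARS) || []).length;
  const ratio = special / text.length;
  const avgLineLength = text.length / lines.length;
  return ratio >= MIN_SPECIAL_RATIO && avgLineLength <= MAX_AVG_LINE_LENGTH ? special : 0;
}

function findCodeRuns(body) {
  const runs = [];
  let run = null; // non-null while we are in "is this code?" mode
  let quiet = 0; // consecutive code-free paragraphs seen inside a run
  for (const { start, lines } of splitParagraphs(body)) {
    const score = paragraphScore(lines);
    if (score > 0) {
      if (run === null) {
        run = { start, end: start, score, lineCount: 0 }; // step 7
      } else {
        run.score += score + run.lineCount; // step 5: longer runs score higher
      }
      run.end = start + lines.length - 1;
      run.lineCount += lines.length;
      quiet = 0;
    } else if (run !== null && ++quiet >= RESET_AFTER) {
      runs.push(run); // step 6: the run is over
      run = null;
      quiet = 0;
    }
  }
  if (run !== null) runs.push(run);
  return runs;
}
```

`findCodeRuns(body)` returns one `{ start, end, score }` entry per suspected block of code, so a caller could warn only when some run's score passes a limit of its choosing.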

You’d need a sizable test corpus for this to work. I know @zogstrip has one because I’ve seen it :wink:

Also

we can realistically limit our check to, say, the “big ten” languages. Per the tags page that would be C#, Java, PHP, JavaScript, Objective-C, C, C++, Python, Ruby.

This is also interesting as an approach

Run your syntax highlighter on the text. If it ends up highlighting some high percentage of it, it’s probably code.

However Random Syntax Highlighting tends to love highlighting numbers in text for no reason. So you’d need to play with that, but it’s a very interesting idea.

What happens when you randomly type text and have it in a syntax highlighting block? 

Let's find out. Is this code? I dunno, is it code? You tell me.

FYI @erlend_sh if you are looking for encouragement projects this one is extremely strong.

For reference, here’s a kitchen sink dump of answers from Jeff asking this same question about implementation for SO back in 2011.

@Jose_C_Gomez this isn’t actually a plugin. It’s just a theory with some proposed implementation logic.

I’m in the process of creating a plugin for this, using a naïve (but hopefully effective) pattern matching approach.

My main doubt is actually about the UI, though. It seems Discourse currently has at least 3 (maybe more?) idioms for alerting users before posting:

  • Yellow overlay on top of rendered markdown. Dismissible, doesn’t prevent posting. Example: welcome message for first couple of posts. Least disruptive.
  • Orange/red tooltip-like prompt. Prevents posting and jiggles if you try to post before fixing the problem. Example: “post too short” message.
  • Modal dialog box. Prevents posting. Example: “your post looks like gibberish” message. Most disruptive.

I don’t think the modal is the right tool, as the message will need to be at least a couple of paragraphs long. Then again, the yellow overlay seems too easy to ignore. Possibly a combination of yellow overlay and orange jiggly tooltip could work?

Ideally, it’d also be permanently dismissible (“don’t show me this again”) to avoid annoying power users who intentionally do funky things with their markup that might otherwise trigger the pattern matcher.

Anyone have any thoughts? I’m open to other UI idioms as well, as long as they’re consistent with the general Discourse look and feel.

I would try to get detection down reliably first before agonizing too much over UI.

Here’s the basis of what I have for detection so far:

Code
const codeTypes = [
  /(^`{3,}).*\r?\n[\s\S]*\r?\n\1/gm, // backtick-fenced block
  /(^~{3,}).*\r?\n[\s\S]*\r?\n\1/gm, // tilde-fenced block
  /(?:^|(?:\r?\n{2,}))\s*(?:(?: {4}|\t).*(?:\r?\n|$))/g, // indented block
  // lack of `m` flag is intentional (`^` must match beginning of input, not line)

  /\[code.*\][\s\S]*\[\/code\]/gm, // BBCode tags

  /`.+`/g, // inline backticks (must come last)
];

const varNameStart = '[$_a-zA-Z]';
const varNameEnd = '[$_a-zA-Z0-9]*';
const varName = `${varNameStart}${varNameEnd}`;
const xmlLikeName = '[a-zA-Z-]+';

const nonHtmlIndicators = [
  `[$_]${varName}`, // almost certain to be var name
  `${varName}(?:_${varName})+`, // snake_case
  // camelCase and spinal-case omitted due to too many false positives
  '(?:^|\\s+)(?:\\/\\/|[;])', // single-line comment
  // ignore python-style `#` single-line comments due to conflict with md headings
  `\\/\\*[\\s\\S]+\\*\\/`, // C-like multiline comment
  `('''|""")[\\s\\S]+\\1`, // python-like multiline string/comment
  ';\\s*$', // trailing semicolon
  `${varName}\\((?:${varName})?\\)`, // function call
  `${varName}\\[(?:${varName})?\\]`, // array index (index name is optional, as in the function call pattern)
  `${varName}\\.${varName}`, // object property
  '^\\s*[{}]\\s*$', // curly brace and nothing else on a line
  '\\{\\{.+\\}\\}', // templating languages e.g. handlebars
  '[$#]\\{.+\\}', // template string
  '&&|\\|\\||==|!=|>=|<=|=>|->|>>|<<|::'
  + '|__|!!|\\+\\+|\\+=|-=|\\*=|\\/=|\\|=|&=', // various operators
  '\\\\[\'"ntr0\\\\]', // common escape sequences
];

const htmlIndicators = [
  '<!--[\\s\\S]*?-->', // xml-like comment
  `<${xmlLikeName}.*\\/?>`, // xml-like start/empty tag
  `</${xmlLikeName}>`, // xml-like end tag
  '&[0-9a-zA-Z]+;', // html entity - human-readable
  '&#[0-9]{1,7};', // html entity - decimal
  '&#x[0-9a-fA-F]{1,6};', // html entity - hex
];

const indicators = nonHtmlIndicators.concat(INCLUDE_HTML ? htmlIndicators : [])
.map(str => new RegExp(str, 'gm'));

Strip out everything under codeTypes, then check for anything under indicators in the remaining content, and warn if the number of matches is above a preconfigured threshold (defaulting to 0).
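
As a rough illustration of that strip-then-count flow, here is a minimal sketch. The two short pattern lists are trimmed-down stand-ins for `codeTypes` and `indicators` above, and the helper names `stripFormattedCode`, `countIndicatorMatches` and `shouldWarn` are made up for this example rather than taken from the plugin.

```javascript
// Trimmed-down stand-ins for the codeTypes and indicators lists above.
const sampleCodeTypes = [
  /(^`{3,}).*\r?\n[\s\S]*\r?\n\1/gm, // backtick-fenced block
  /`.+`/g, // inline backticks (must come last)
];

const sampleIndicators = [
  ';\\s*$', // trailing semicolon
  '^\\s*[{}]\\s*$', // curly brace and nothing else on a line
].map(str => new RegExp(str, 'gm'));

// Remove everything that is already formatted as code.
function stripFormattedCode(text, codeTypes) {
  return codeTypes.reduce((remaining, pattern) => remaining.replace(pattern, ''), text);
}

// Count how many times any indicator matches the remaining text.
function countIndicatorMatches(text, indicators) {
  return indicators.reduce(
    (total, pattern) => total + (text.match(pattern) || []).length,
    0
  );
}

// Warn only when the match count is above the threshold (0 by default).
function shouldWarn(text, { codeTypes, indicators, threshold = 0 }) {
  const remaining = stripFormattedCode(text, codeTypes);
  return countIndicatorMatches(remaining, indicators) > threshold;
}
```

With these sample lists, a post containing a bare `int x = 0;` line would trigger the warning, while the same line inside a backtick fence would not, because the fence is stripped before the indicators run.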

I guess I’ll just try to get a working version up and running using native browser alert, then work from there.

Yes, sounds like a plan. If you are comfortable with this being client-only, then it can be done as a theme component, which is awesome because you will be able to reach a lot more people.

Nice! Didn’t even know that feature existed. Looks like it could be perfect for this use case. The only caveat is that I can’t see any way to attach a new custom user field (has_dismissed_code_notice) without access to the back end. I guess there might be some hacky way of achieving the same effect? Alternatively, there’s always localStorage, but that doesn’t persist between devices.
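
For what it’s worth, the localStorage fallback could look something like the sketch below. The key reuses the `has_dismissed_code_notice` name from above; `createDismissalStore` is a hypothetical helper, and it takes the storage object as a parameter so that anything with `getItem`/`setItem` can stand in for `window.localStorage`.

```javascript
const DISMISS_KEY = 'has_dismissed_code_notice';

// `storage` is expected to behave like window.localStorage
// (string keys and values, getItem/setItem).
function createDismissalStore(storage) {
  return {
    isDismissed() {
      try {
        return storage.getItem(DISMISS_KEY) === 'true';
      } catch (e) {
        return false; // storage blocked or unavailable: keep showing the notice
      }
    },
    dismiss() {
      try {
        storage.setItem(DISMISS_KEY, 'true');
      } catch (e) {
        // storage blocked or full: the dismissal just won't persist
      }
    },
  };
}
```

In the browser this would be called as `createDismissalStore(window.localStorage)`. As noted, the flag would still be per-device.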

Yeah theme user settings has been on our list for a while, we need it in quite a few spots now. We will get to it.

Is that something you’d welcome pull requests for? If so, I’d be happy to give it a go. I’m assuming something like a remote version of localStorage would cut it (string-only key-value pairs). Values set and retrieved using a tuple of theme ID, user ID, and key. User ID must be current user if the field is set to private. Is there anything missing from that picture?

Sorry, but a PR for this is not going to work. I started looking at this yesterday, and getting it to work just right with minimal cost is very, very fiddly. I am rather a stickler about minimizing queries on first load, and doing so here is hard mode ++

I do really want this change, so I may plug away at it a bit more over the next few days.

Theme component topic is here.

So I just did this for my site, mostly as a joke, but it works well and it is effective (thus far).
It doesn’t automatically catch the offenders, but when I (or any of the leaders) see it we set a tag on the post of “formatcode”
Then this happens the next time that user logs in


It sticks around until they correct the problem and remove the tag. So far we’ve thrown it at a few people and the code ALWAYS gets cleaned up. It only shows up for the author of the post.

Anyways, I thought it was fun. I haven’t made a plugin out of it (I don’t have the time), but it’s fairly straightforward JavaScript. If someone wants it, let me know and I’ll post it.

OMG this is an epic thing you have done here :rofl:

Here’s the code in case someone else wants to mess with it
https://github.com/josegomez/discourse-clippy-nag
