Hand-holding needed: Using regular expressions as Watched Words


(Southpaw) #1

I’m trying to integrate regular expressions into an already-populated Watched Words list. I use only the “Censored” option. I must be doing everything wrong, and I’ll appreciate any assistance available. Here are the first two challenges I’m facing.

  1. When I make no change except to check “Watched words are regular expressions” and click the green check mark to save that configuration, the word “should” suddenly becomes censored. If I uncheck “Watched words are regular expressions” and save, “should” is allowed again. “Should” is not on my list of 93 words.

  2. When I try to add regular expressions, an error message is returned. For example, I have a regular expression that I’ve tested and verified using https://regexr.com/ that is designed to censor E-mail addresses, except for staff E-mail addresses.

(\w+)@(?!(?:republicwireless).com(\s|$))(\w+).([a-zA-Z]{2,5})(\s|$)

I check “Watched words are regular expressions”, save, and refresh. I then enter that regular expression in the Watched Words field, and am told, “Sorry, an error has occurred.” Is there a different format for regular expressions I need to be using? I’ve tried it both with and without the opening and closing / delimiters.


(Southpaw) #3

Please help me understand your reasoning. I’d like to know why the flag option would be preferable?

Understood, and seen daily. This is just one example to begin my learning.

Ok. According to http://rubular.com/ it does.

Edited to add:
I’m not sure what happened to the post to which I was replying, or the gentleman who wrote it, as both appeared to have vanished, but I wanted to thank him for challenging me on flagging vs. censoring. As it turns out, censoring an E-mail address masks the actual letters rendered but leaves the underlying mailto: link with the E-mail address intact. I’ve moved my E-mail regex to the “Require Approval” section of Watched Words.


(Simon Cossar) #4

I’ve tested your (\w+)@(?!(?:republicwireless).com(\s|$))(\w+).([a-zA-Z]{2,5})(\s|$) regular expression at http://rubular.com/ and it works for me. When I try to enter it in the watched words list on my forum, I’m getting the same error message as you are.

The error in the console is: the server responded with a status of 422 (Unprocessable Entity)

The unexpected censoring of “should” is probably to do with changing from watching words based on strings to watching words based on regular expressions. Look for any words in your watched word list that contain the dot character (.). Once the list is treated as regular expressions, an unescaped dot matches any single character rather than a literal dot.
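For anyone following along, here is a minimal sketch of the dot problem using Python's re module rather than Discourse itself. The entry "e.g" is a hypothetical watched word chosen for illustration; it is not from the actual list discussed here.

```python
import re

# Hypothetical watched-word entry that was originally meant as a literal string.
entry = "e.g"

# Treated as a regular expression, the dot matches any single character.
as_regex = re.compile(entry, re.IGNORECASE)
# Escaping the entry restores the literal meaning of the dot.
as_literal = re.compile(re.escape(entry), re.IGNORECASE)

samples = ["e.g", "egg", "beige"]
regex_hits = [s for s in samples if as_regex.search(s)]
literal_hits = [s for s in samples if as_literal.search(s)]

print(regex_hits)    # ['e.g', 'egg', 'beige']
print(literal_hits)  # ['e.g']
```

The same idea applies when reviewing an existing list by hand: any entry containing a dot that should stay literal needs the dot escaped once regular-expression mode is on.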


(Southpaw) #5

Found the culprit! Thank you!

Does this mean I need to keep tweaking the expression, or is this someone else’s problem to solve?


(Kris) #6

It’s possible there’s some length limit here?

If I enter (\w+)@(?!(?:a).a(\s|$))(\w+).([a-zA-Z]{2,5})(\s|$) it’s accepted (and it’s 50 characters). If I go one character higher it fails.

Simplifying to test the theory:

veryveryveryveryveryveryveryveryveryveryverylong works
veryveryveryveryveryveryveryveryveryveryveryverylong doesn’t

Maybe @neil knows?


(Simon Cossar) #7

We (Discourse) need to figure out what the problem is with this regular expression.


(Neil Lalonde) #8

Yes, there is a length limit of 50 chars per word, which is from when only words were supported. Now that the “Watched words are regular expressions” setting exists, 50 might be too small. Also, the reason for the error should be showing.

Your regexp can be shortened to this, which will work:

\w+@(?!(?:republicwireless).com)\w+.[a-zA-Z]{2,5}

Yours is capturing parts of the string as match groups which is unnecessary when you only need to know if strings match or not. I didn’t touch the (?!(?: part because I don’t know what’s going on there exactly. :blush:
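A rough way to check a pattern against the limit before pasting it in is sketched below. It uses Python's re module, not Discourse, and the value 50 is simply the limit reported in this thread; the sample addresses are made up for illustration.

```python
import re

MAX_LEN = 50  # per-entry limit as reported in this thread (an assumption elsewhere)

original = r"(\w+)@(?!(?:republicwireless).com(\s|$))(\w+).([a-zA-Z]{2,5})(\s|$)"
shortened = r"\w+@(?!(?:republicwireless).com)\w+.[a-zA-Z]{2,5}"


def fits(pattern, limit=MAX_LEN):
    """Return True if the pattern is short enough to be accepted."""
    return len(pattern) <= limit


print(len(original), fits(original))    # 67 False
print(len(shortened), fits(shortened))  # 49 True

# The shortened pattern still skips the staff domain (sample addresses are hypothetical).
matcher = re.compile(shortened)
matches_other = bool(matcher.search("write to someone@example.com"))
matches_staff = bool(matcher.search("write to help@republicwireless.com"))
print(matches_other, matches_staff)     # True False
```

Dropping the capture groups is what saves the characters here; the negative lookahead is untouched, so the staff-address exception behaves the same way in this test.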


(Southpaw) #9

Beautiful! I just had to make one change, but yes, the shortened version works well.

Now I have to tackle phone numbers, and I’ll try to keep the 50 character limit in mind unless/until it is addressed.

Thanks so much, y’all are amazingly patient, prompt, and helpful.


(Southpaw) #10

I’m back with another question, if it’s okay to tack it on here.

For their own sakes, I need to be able to prevent users from posting their own phone numbers. (Yes, I’m certain they run with knives, too, but I can’t be everywhere.)

I have a regex that does exactly what I want it to do on rubular.com and on regexr.com

(^|\D)(1[\W]?)?(\(?\d{3}\)?\W?){2}\d{4}($|\D)

It captures the most common formats of U.S. phone numbers: 10 digits, with optional separators after the third and sixth digits, with the first three digits optionally surrounded by parentheses, and an optional 1 with optional separator prepended.

I need 10-digit strings of numbers within a 12-digit string of numbers to not be captured.

In both regular expression testing environments, this regex works correctly. 9195551212 is captured but 919555121212 is not.

10-digits: image

12-digits: image

However, when I add the regex to Watched Words, the first 10 digits of the 12-digit number are captured.

image

I don’t think it’s a different rule that’s capturing the 10 digits; the only other rules in my Watched Words list that capture numbers are:
(\d{4}\W?){4}
Capture four strings of four digits each, optionally separated.
and
\d{4}\W?\d{6}\W?\d{5}
Capture 15-digit numbers optionally separated at the 4th and 10th digits.

My question is: What do I need to change about this regular expression so that it will perform in Discourse as it does when tested elsewhere?
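The behaviour described for the two testing sites can be reproduced with a short script. This is only a sketch using Python's re module, so it shows what a stand-alone engine does with the pattern and says nothing about how Discourse applies it.

```python
import re

# The phone-number rule exactly as posted above.
phone = re.compile(r"(^|\D)(1[\W]?)?(\(?\d{3}\)?\W?){2}\d{4}($|\D)")

ten_digits = bool(phone.search("9195551212"))
twelve_digits = bool(phone.search("919555121212"))

print(ten_digits)     # True: the 10-digit number is captured
print(twelve_digits)  # False: no 10-digit run is captured inside the 12 digits
```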


(Neil Lalonde) #11

Hmm, it works for me if I use \b (word boundary) instead of what you have at the beginning and end.

\b(1[\W]?)?(\(?\d{3}\)?\W?){2}\d{4}\b

I’m not sure why, but it’s slightly simpler and works. ¯\_(ツ)_/¯


(Southpaw) #12

Ah, I should have explained that. I need to still capture numbers if the person fails to use a word boundary. It is very common, for example, for them to type:
My number is:9195551212
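To make the difference between the two approaches concrete, here is a small comparison in Python's re module (again a sketch, not Discourse). One detail worth noting: a colon is a non-word character, so \b does find a boundary between the colon and the first digit. The gap shows up when a letter or underscore touches the number directly, as in the hypothetical "call9195551212" below.

```python
import re

BODY = r"(1[\W]?)?(\(?\d{3}\)?\W?){2}\d{4}"

with_boundary = re.compile(r"\b" + BODY + r"\b")            # the \b suggestion
with_nondigit = re.compile(r"(^|\D)" + BODY + r"($|\D)")    # the original rule

samples = ["My number is:9195551212", "call9195551212", "919555121212"]
results = {
    s: (bool(with_boundary.search(s)), bool(with_nondigit.search(s)))
    for s in samples
}

for text, (boundary_hit, nondigit_hit) in results.items():
    print(text, boundary_hit, nondigit_hit)
# My number is:9195551212 True True
# call9195551212 False True
# 919555121212 False False
```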


(Neil Lalonde) #13

OK, then keep (^|\D) at the beginning. It works for me in Discourse.


(Jeff Atwood) #14

I recommend being as strict as possible on matches. If you need a possible colon in front, add it to the possible character set to match.


(Southpaw) #15

It’s not always a colon and it’s not always at the front. That’s why I said, “for example.”
I can’t specifically list all the possibilities a user might type, because there’s a Discourse-imposed, 50-character limit on the regex.
The point is that sometimes when our users type, they do not use a word boundary before and/or after the phone number that they shouldn’t be posting in the first place.

Is there no hope of the regular expression working in Discourse as it works in the two test environments?

By the way, negative lookbehind also fails in Discourse, as the rule becomes truncated at the < sign.
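For reference, a lookaround version of the rule might look like the sketch below. The exact pattern is an assumption written for illustration, not the one that was tried, and it is shown in Python's re module only; per the report above, an entry containing < was truncated in Discourse, so it would not have been usable there as-is.

```python
import re

# Hypothetical lookaround variant: no digit immediately before or after the number.
lookaround = re.compile(r"(?<!\d)(1\W?)?(\(?\d{3}\)?\W?){2}\d{4}(?!\d)")

samples = ["9195551212", "My number is:9195551212", "call9195551212", "919555121212"]
hits = [bool(lookaround.search(s)) for s in samples]

print(hits)  # [True, True, True, False]
```

Unlike the (^|\D) form, the lookarounds do not consume the neighbouring characters, so only the number itself is part of the match.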


(Jeff Atwood) #16

The more fancy you add to a regex, the more risk of a runaway match, or unexpected results. I think fundamentally it is kind of unrealistic to expect this to work in every single possible scenario, and trying to do that can result in extreme complexity which isn’t sustainable.

I would generally advocate a “best effort” approach here where you catch 90% of them and don’t try to climb the impossible hill to 99% much less 100%.


(Southpaw) #17

Hi @neil,

Thank you for taking the time to try it and suggest alternates. Your comment that the initial boundary works led me to take another look at the final one. While ($|\D) fails, (\D|$) works. I now have the rule working as intended.
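As a cross-check, the two orderings can be compared outside Discourse. In this sketch with Python's re module they accept and reject the same sample strings, which suggests the difference seen above comes from how Discourse handled the entry rather than from the regular expression itself. The thread does not establish the cause, so treat that as a guess. The sample strings are made up for illustration.

```python
import re

CORE = r"(^|\D)(1[\W]?)?(\(?\d{3}\)?\W?){2}\d{4}"

end_first = re.compile(CORE + r"($|\D)")   # ordering that failed in Discourse
char_first = re.compile(CORE + r"(\D|$)")  # ordering that worked in Discourse

samples = [
    "9195551212",
    "919555121212",
    "My number is:9195551212",
    "call 919-555-1212 now",
    "(919) 555-1212",
]
results = [(bool(end_first.search(s)), bool(char_first.search(s))) for s in samples]

for text, pair in zip(samples, results):
    print(text, pair)
```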