How to use Discourse regexes with watched words?

rizka · April 24, 2017, 9:25pm

I’ve worked on censoring the most vulgar swear words with regular expressions today. Why regexes? Well, in Finnish and other Uralic languages like Hungarian and Estonian, words are inflected. A single swear word could have maybe thousands of inflected forms, which is why it is awesome to have the ability to use regex patterns. It is also no coincidence that it was another Finn who proposed this originally.

I need some quick advice about which regex flavor Discourse uses. I’m seeing some unexpected behavior with non-alphanumeric characters, which is especially awkward because the letter ä is so common in the Finnish alphabet. I got the regex into pretty good shape with basic regex knowledge and trial and error, but for an even better result I would need documentation or something.

Falco · April 24, 2017, 10:31pm

You can read about it in the source code.

elijah · April 24, 2017, 10:38pm

Reading that, I don’t see much about them except that they are JavaScript regular expressions. (I would have assumed Ruby without that link.) So a JavaScript reference would be in order.

Which has internal links to specifications, if you want to go deeper.

Mittineague · April 25, 2017, 12:29am

AFAIK, Ruby and Postgres support POSIX.

rizka · April 25, 2017, 8:26am

Cool, thank you for your replies all. I’ll look into them and return if I still can’t figure it out.

justin · May 29, 2019, 6:04pm

Discourse Regexes (Watched Words)

To use regular expressions (regex) in watched words you must first turn on the watched words regular expressions site setting.

Discourse by default matches all uppercase and lowercase forms of a word entered as a regular expression; that is, matching is case-insensitive. For example:

thread

This will match thread, THREAD, and thReAd.

(t|7)hr(3|e)(4|a)d

This will match all of the cases above, plus thr3ad, 7hread, and thr34d.
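
A quick way to check these claims outside Discourse is a few lines of JavaScript. This is only a sketch: it assumes the watched word is compiled as a case-insensitive JavaScript RegExp, and the `i` flag below stands in for that default behaviour. The variable names are made up for the example.

```javascript
// Assumption: the "i" flag mimics Discourse's default case-insensitive matching.
const plain = new RegExp("thread", "i");
const leet = new RegExp("(t|7)hr(3|e)(4|a)d", "i");

const samples = ["thread", "THREAD", "thReAd", "thr3ad", "7hread", "thr34d"];

// Print, for each sample, whether the plain and the leet-aware pattern match.
for (const word of samples) {
  console.log(word, "plain:", plain.test(word), "leet:", leet.test(word));
}
```

The plain pattern only catches the first three samples, while the alternation pattern catches all six.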

threads?

This will match thread and threads, since the trailing s is optional.

However, there’s a glaring error in ALL the above examples! The words threadlike and unthreading are matched (un▪️▪️▪️▪️▪️ing), even though they’re not referring to thread. How do we fix that?

We’d have to amend our regex to handle word boundaries.

\bthreads?\b

This looks for boundaries around the word so that unthreading or threadlike aren’t caught by the filter, but thread and threads still are.
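
To see the difference the boundaries make, here is a small comparison in JavaScript. It is a sketch under the assumption that the pattern behaves like a case-insensitive JavaScript RegExp; `loose` and `bounded` are just illustrative names.

```javascript
// Assumption: the "i" flag mimics Discourse's default case-insensitive matching.
const loose = /threads?/i;        // no boundaries: also hits words that contain "thread"
const bounded = /\bthreads?\b/i;  // boundaries: only the standalone word

const words = ["thread", "threads", "threadlike", "unthreading"];

// Collect one row per word so the two patterns can be compared side by side.
const rows = words.map((word) => ({
  word,
  loose: loose.test(word),
  bounded: bounded.test(word),
}));

console.table(rows);
```

The loose pattern matches all four words, while the bounded one matches only thread and threads.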

For handling Unicode characters

gr(ü|ue)(ß|ss)e

This matches all commonly spelled forms of the word grüße — including gruesse and GRÜSSE
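
The spellings can be checked with a short JavaScript snippet. As before, this is a sketch that assumes case-insensitive matching via the `i` flag; the list of spellings is only an illustration.

```javascript
// Assumption: the "i" flag mimics Discourse's default case-insensitive matching.
const greeting = /gr(ü|ue)(ß|ss)e/i;

const spellings = ["grüße", "gruesse", "GRÜSSE", "Grüsse", "grusse"];

// "grusse" lacks both ü and ue, so it is the one spelling that should not match.
for (const word of spellings) {
  console.log(word, greeting.test(word));
}
```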

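Say we want to block the word Über, but not Übersicht. Using word boundaries like \b(ü|ue)ber\b doesn’t work because JavaScript’s \b is defined in terms of \w, which covers only ASCII letters, digits, and the underscore. Ü does not count as a word character, so no boundary is found in front of it.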

Instead we have to make our own boundaries.

(?:^|\s)(ü|ue)ber\b

This will now appropriately match Über and ueber, but not Übersicht or uebersicht.
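
The two variants can be compared directly in JavaScript. This is a sketch assuming case-insensitive matching via the `i` flag; the constant names are invented for the example.

```javascript
// Assumption: the "i" flag mimics Discourse's default case-insensitive matching.
const withWordBoundary = /\b(ü|ue)ber\b/i;           // leading \b fails in front of Ü
const withCustomBoundary = /(?:^|\s)(ü|ue)ber\b/i;   // start of text or whitespace instead

const inputs = ["Über", "ueber", "Übersicht", "uebersicht", "das ist über"];

// Show which inputs each variant flags.
for (const text of inputs) {
  console.log(
    text,
    "\\b version:", withWordBoundary.test(text),
    "custom boundary:", withCustomBoundary.test(text)
  );
}
```

The \b version only catches ueber, because u is an ASCII word character and Ü is not. Note that the custom boundary consumes the preceding whitespace character as part of the match.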

A final warning

Regex is extremely powerful and thus dangerous. An incorrectly written regex statement can cause issues for your users. Test your regex statements on non-production instances before going live.

supermathie · May 30, 2019, 1:43am

If you want to get more serious (or silly) about this kind of thing, you can introduce formal test cases. For example, I’ve put @justin’s über-excellent example onto regex101.com: https://regex101.com/r/4ano0r/1/tests

If you do so, ensure you switch the regex flavour to ECMAScript:

[Screenshot: regex101 flavor selector]

hey, they spelled flavour wrong
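
The same kind of test cases can also be kept in a small script and run locally. The following is a minimal sketch, not anything Discourse or regex101 provides: `checkPattern` is a hypothetical helper, and the `i` flag is an assumption standing in for Discourse's case-insensitive matching.

```javascript
// Hypothetical helper (not part of Discourse or regex101): returns the inputs
// whose actual match result differs from the expected one.
function checkPattern(source, cases) {
  const regex = new RegExp(source, "i"); // assumption: case-insensitive matching
  const failures = [];
  for (const { input, shouldMatch } of cases) {
    if (regex.test(input) !== shouldMatch) {
      failures.push(input);
    }
  }
  return failures;
}

// Test cases for the über example from the post above.
const failures = checkPattern("(?:^|\\s)(ü|ue)ber\\b", [
  { input: "Über", shouldMatch: true },
  { input: "ueber", shouldMatch: true },
  { input: "Übersicht", shouldMatch: false },
  { input: "uebersicht", shouldMatch: false },
]);

console.log(
  failures.length === 0
    ? "all cases pass"
    : `failing inputs: ${failures.join(", ")}`
);
```

Keeping the expected result next to each input makes it easy to add a new case whenever a word slips through or gets caught by mistake.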

