Uppercase letter detection appears to be ignoring accented letters

Just one is enough to get past the ‘is this content’ filter, as the post below shows.

Not a biggie, just seems a little inconsistent, s’all.


ÁN EXAMPLE REPRODUCTION HERE TOO


The Uppercase detection is one of those features where we just handle the simplest and most basic version of the issue and leave it up to moderators to enforce it otherwise.

Why? Well, as you noticed, there are hundreds of thousands of Unicode code points that would break it. It is just not practical to get them all when it’s easier to tell a user, “hey, stop doing that!”


True, I guess, but IIRC the .NET Framework has an API for asking whether a letter is upper or lower case; does whatever Discourse runs on (Ruby on Rails?) not have an equivalent? Or would that slow things down too much?

Ruby provides an API to do this, and we use it; however, it is only effective for ASCII. So when non-ASCII characters are present, we skip the check. We were bitten by this previously with foreign languages.
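
To illustrate the behaviour described here, this is a rough sketch of an ASCII-only check that bails out as soon as a non-ASCII character shows up. The method name `shouting_ascii_only?` is made up for illustration and this is not the actual Discourse code; it only assumes Ruby's built-in `String#ascii_only?`.

```ruby
# Hypothetical helper, NOT Discourse's real implementation.
# Returns true only for pure-ASCII text that has at least one
# uppercase letter and no lowercase letters.
def shouting_ascii_only?(text)
  # Skip the check entirely when any non-ASCII character is present.
  return false unless text.ascii_only?

  has_upper = !(text =~ /[A-Z]/).nil?
  has_lower = !(text =~ /[a-z]/).nil?
  has_upper && !has_lower
end

puts shouting_ascii_only?("AN EXAMPLE REPRODUCTION")
puts shouting_ascii_only?("ÁN EXAMPLE REPRODUCTION")
```

Under these assumptions, a single accented letter is enough to make the check return false, which matches what the original post observed.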

Well, that’s a bit… huh.

Eh, may as well close this now I guess.

You can actually do this in Ruby; it just means you need to be a tad more fancy:

utf_pattern = Regexp.new("\\p{Lower}".force_encoding("UTF-8"))

a = "Go234"
a.match(utf_pattern) # => #<MatchData "o">

b = "GO234"
b.match(utf_pattern) # => nil

b = "ÜÖ234"
b.match(utf_pattern) # => nil

b = "Über234"
b.match(utf_pattern) # => #<MatchData "b">

Don’t forget Chinese/Japanese/Korean!

@neil is there a reason you didn’t use this approach? Looks like you were the one who did the ASCII change.

I have no memory of this… It should use that approach. Also, can you do ALL CAPS in Chinese/Japanese/Korean??

I was saying to make sure that the behavior was correct, as it looks like that regex checks for “any lowercase”.
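
A rough sketch of what I mean, assuming the Unicode property classes `\p{Lu}` (uppercase letter) and `\p{Ll}` (lowercase letter) in Ruby regexes. The method name `all_caps?` is hypothetical. Checking only for "no lowercase" would flag text written purely in a caseless script, so this version also requires at least one uppercase letter.

```ruby
# Hypothetical helper for illustration only.
# Text counts as ALL CAPS when it has at least one uppercase letter
# and no lowercase letters, in any script.
def all_caps?(text)
  has_upper = !(text =~ /\p{Lu}/).nil?
  has_lower = !(text =~ /\p{Ll}/).nil?
  has_upper && !has_lower
end

puts all_caps?("ÁN EXAMPLE")  # accented uppercase still counts
puts all_caps?("Über234")     # contains lowercase letters
puts all_caps?("日本語")       # caseless script: no uppercase, no lowercase
```

With this shape, Chinese/Japanese/Korean text has neither uppercase nor lowercase letters, so it is never treated as shouting.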

Huh, I found this commit, but maybe you took the approach from someone else?

https://github.com/discourse/discourse/commit/876a570e3a2e227528d135a0cc67cccf442baaf1
