Uppercase letter detection appears to be ignoring accented letters

RaceProUK · March 25, 2015, 11:35am

Just one is enough to get past the ‘is this content’ filter, as the post below shows.

Not a biggie, just seems a little inconsistent, s’all.

RaceProUK · March 25, 2015, 11:35am

ÁN EXAMPLE REPRODUCTION HERE TOO

eviltrout · March 25, 2015, 3:02pm

The Uppercase detection is one of those features where we just handle the simplest and most basic version of the issue and leave it up to moderators to enforce it otherwise.

Why? Well, as you noticed, there are hundreds of thousands of Unicode code points that would break it. It is just not practical to get them all when it’s easier to tell a user, “hey, stop doing that!”

RaceProUK · March 25, 2015, 3:15pm

True, I guess, but IIRC, the .NET Framework has an API for asking whether a letter is upper or lower case; does whatever Discourse runs on (Ruby on Rails?) not have an equivalent? Or would that slow things down too much?

eviltrout · March 25, 2015, 3:28pm

Ruby provides an API to do this, and we use it; however, it is only effective for ASCII. So when non-ASCII characters are present, we skip the check. We were bitten by this previously with foreign languages.
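For illustration, the behavior described above (check ASCII only, skip everything else) could be sketched roughly like this. The method name `ascii_all_caps?` is hypothetical and this is not Discourse's actual code, just a minimal sketch that assumes "all caps" means at least one A-Z letter and no a-z letters.

```ruby
# Hypothetical helper, not taken from the Discourse codebase.
def ascii_all_caps?(text)
  # Skip the check entirely when any non-ASCII character is present.
  return false unless text.ascii_only?

  # Assumed definition of "all caps": some A-Z letter, and no a-z letter.
  text.match?(/[A-Z]/) && !text.match?(/[a-z]/)
end

ascii_all_caps?("AN EXAMPLE")  # => true
ascii_all_caps?("An example")  # => false
ascii_all_caps?("ÁN EXAMPLE")  # => false, the single Á makes the check bail out
```

That last case is why one accented letter is enough to get a shouted post through.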

RaceProUK · March 25, 2015, 3:50pm

Well, that’s a bit… huh.

Eh, may as well close this now I guess.

sam · March 25, 2015, 10:59pm

You can actually do this in Ruby; it just means you need to be a tad more fancy:

# \p{Lower} matches any Unicode lowercase letter, not just a-z
utf_pattern = Regexp.new("\\p{Lower}".force_encoding("UTF-8"))

a = "Go234"
a.match(utf_pattern) # => #<MatchData "o">

b = "GO234"
b.match(utf_pattern) # => nil

b = "ÜÖ234"
b.match(utf_pattern) # => nil

b = "Über234"
b.match(utf_pattern) # => #<MatchData "b">

riking · March 26, 2015, 6:47am

Don’t forget Chinese/Japanese/Korean!

eviltrout · March 26, 2015, 3:28pm

@neil is there a reason you didn’t use this approach? Looks like you were the one who did the ASCII change.

neil · March 26, 2015, 3:59pm

I have no memory of this… It should use that approach. Also, can you do ALL CAPS in Chinese/Japanese/Korean??

riking · March 26, 2015, 6:06pm

I was saying to make sure that the behavior was correct, as it looks like that regex checks for “any lowercase”.
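To make the concern concrete: text in a script without letter case contains no lowercase letters, so a check that only asks "is there any lowercase?" would treat it as all caps. One possible way around that, as a sketch only, is to also require at least one uppercase letter. The name `unicode_all_caps?` and the constants are hypothetical, and this is not what Discourse ships.

```ruby
# Hypothetical helper; names are assumptions for illustration.
UPPER = /\p{Upper}/  # any Unicode uppercase letter
LOWER = /\p{Lower}/  # any Unicode lowercase letter

def unicode_all_caps?(text)
  # Require at least one uppercase letter so caseless scripts
  # (and digits-only text) are never flagged.
  text.match?(UPPER) && !text.match?(LOWER)
end

unicode_all_caps?("ÁN EXAMPLE")        # => true
unicode_all_caps?("Über234")           # => false
unicode_all_caps?("日本語のテキスト")  # => false, no cased letters at all
unicode_all_caps?("1234")              # => false
```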

eviltrout · March 26, 2015, 6:11pm

Huh, I found this commit, but maybe you took the approach from someone else?

https://github.com/discourse/discourse/commit/876a570e3a2e227528d135a0cc67cccf442baaf1
