Unicode usernames and group names

(Gerhard Schlager) #1

Lifting the restrictions on allowed characters in usernames is one of the oldest feature requests. Starting with Discourse 2.3.0.beta9 it’s finally possible to use Unicode characters within usernames and group names.

New Site Settings

There are two new site settings: unicode username character whitelist and unicode usernames.

unicode username character whitelist allows you to whitelist only certain Unicode characters (e.g. [äöüßÄÖÜẞ] or \p{Greek}). By default Discourse permits letters (Ll / Lm / Lo / Lt / Lu), marks (Mc, Me, Mn) and numbers (Nd, Nl, but not No). The whitelist can restrict those characters, but it’s not possible to allow additional characters. Also, it’s not possible to forbid ASCII letters and numbers.
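To make these rules concrete, here is a minimal Ruby sketch of a per-character check. It is an illustration only, not Discourse’s actual validator: the names DEFAULT_ALLOWED, ASCII_ALNUM and allowed_char? are hypothetical, and the sketch ignores other username rules such as punctuation and length.

```ruby
# Hypothetical sketch, not Discourse's implementation.

# Letters (Ll, Lm, Lo, Lt, Lu), marks (Mc, Me, Mn) and numbers (Nd, Nl; not No).
DEFAULT_ALLOWED = /\A[\p{Ll}\p{Lm}\p{Lo}\p{Lt}\p{Lu}\p{Mc}\p{Me}\p{Mn}\p{Nd}\p{Nl}]\z/

# ASCII letters and digits, which the whitelist cannot forbid.
ASCII_ALNUM = /\A[A-Za-z0-9]\z/

# whitelist is a Regexp such as /\p{Greek}/, or nil when the setting is empty.
def allowed_char?(char, whitelist = nil)
  return true if char.match?(ASCII_ALNUM)
  # The whitelist can only restrict the defaults, never add to them.
  return false unless char.match?(DEFAULT_ALLOWED)
  whitelist.nil? || char.match?(whitelist)
end

greek_only = /\p{Greek}/
allowed_char?("\u03B1", greek_only) # Greek alpha: passes the whitelist
allowed_char?("\u00E9", greek_only) # Latin e with acute: rejected by the whitelist
allowed_char?("a", greek_only)      # ASCII letter: always allowed
allowed_char?("\u00B2")             # superscript two (category No): rejected
```

Note that a whitelist such as /!/ would still not admit "!", because the character fails the default check first.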

You should tailor it to your community’s needs and only allow characters and scripts that are needed for the languages used by your community.

Take a look at the Ruby documentation if you want to know more about character classes and character properties in regular expressions.

unicode usernames is disabled by default and we strongly advise you to configure the whitelist setting before enabling it in order to prevent homograph username spoofing.
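To illustrate the risk: two usernames can look the same on screen while consisting of different code points. The Ruby sketch below uses a hypothetical helper, mixes_latin_and_cyrillic?, to flag one such case. It is an example of the problem, not a check that Discourse performs.

```ruby
genuine = "admin"
spoofed = "\u0430dmin" # starts with U+0430 CYRILLIC SMALL LETTER A

# Hypothetical helper: true if a name combines Latin and Cyrillic letters.
def mixes_latin_and_cyrillic?(name)
  name.match?(/\p{Latin}/) && name.match?(/\p{Cyrillic}/)
end

genuine == spoofed                 # false, although the two look alike
mixes_latin_and_cyrillic?(spoofed) # true
mixes_latin_and_cyrillic?(genuine) # false
```

A whitelist that does not include Cyrillic would reject the spoofed name outright, which is why restricting the setting to the scripts your community actually uses is worthwhile.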

Letter Avatar Service

The Letter Avatar Service has been updated and we added support for generating avatars with the most commonly used scripts. Feel free to create a pull request on GitHub to add a font from the Google Noto Fonts family if you encounter missing avatars for your language.

Enabling unicode usernames is only possible when the external system avatars enabled site setting is enabled, because the internal avatar generator doesn’t support Unicode. You can run your own instance of the Letter Avatar Service if you can’t or don’t want to rely on the external service.

We even support the brand new glyph for “令和” (Reiwa) that was added to Unicode in May.

Good to know…

Discourse counts grapheme clusters (“user-perceived characters”) instead of Unicode codepoints when it validates username length (min username length and max username length site settings). The Letter Avatar Service also uses the first grapheme cluster of a username to generate an avatar.
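The difference between code points and grapheme clusters can be seen in plain Ruby. This is a sketch with hypothetical helper names (username_length, avatar_glyph), not Discourse’s code; it assumes Ruby 2.5 or newer, where String#grapheme_clusters is available.

```ruby
# Hypothetical helpers; String#grapheme_clusters requires Ruby 2.5 or newer.

# Length as a user perceives it.
def username_length(name)
  name.grapheme_clusters.length
end

# The first user-perceived character, e.g. for a letter avatar.
def avatar_glyph(name)
  name.grapheme_clusters.first
end

name = "Zoe\u0308"          # "Zoë" written with a combining diaeresis (U+0308)
name.length                 # 4 code points
username_length(name)       # 3 grapheme clusters
avatar_glyph("e\u0301ric")  # "e" + U+0301: one grapheme made of two code points
```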

You should also take a look at the reserved usernames site setting. You might want to add additional usernames now that your forum supports Unicode in usernames.

Feedback

Did you enable Unicode usernames for your community? We’d like to hear your feedback.
Also, we want to ship sensible default values for the unicode username character whitelist for each locale supported by Discourse. Please feel free to suggest regular expressions in a reply.

29 Likes
(Marguerite Su) #2

Thanks for the new feature!

I have a Discourse instance running for Chinese users, and I would like to test it.

But we have installed another plugin, discourse-username-localization, because Unicode usernames were not officially supported before.

So I would like to know how I can disable that plugin and switch to the official solution. Will it break anything? Are there any recommended steps to follow?

If this can be done, I think every CJK instance will switch to our official solution and contribute a whitelist immediately :grinning:

3 Likes
(Gerhard Schlager) #3

It looks like the plugin also changes the behavior for linking to CJK tags and categories. This will probably break, but we should fix it in Discourse core. That should be easy to fix.

Other than that, disabling the plugin and enabling the official Unicode support should work without problems. Letter avatars will look different afterwards, because the plugin currently converts Chinese usernames into Latin characters. But I guess that’s a good thing. :slight_smile:

8 Likes
(Marguerite Su) #4

Thanks.

I’ll create a branch that keeps only the parts not yet implemented in core and then try the official solution, so the two don’t conflict.

The tags and categories use the same set of regular expressions, but in JavaScript, which doesn’t support property classes such as \p{Katakana}. I raised an issue to unify the regular expressions in that plugin, but the attempt failed. Is it possible to use the same whitelist in the official implementation, e.g. with a converter that translates the Ruby whitelist to JavaScript?

And the unicode avatar is just excellent!

5 Likes
(Marguerite Su) #5

I just switched my forum to Unicode usernames.

I updated discourse-username-localization to remove all the Ruby code. (I can’t wait for you to fix hashtags and mentions in core, so I can abandon it completely.)

I use this whitelist:

[\p{Han}\p{Katakana}\p{Hiragana}\p{Hangul}]

I also updated the letter_avatar service to v4.

Now it works.

5 Likes
(Rafael dos Santos Silva) #6

I think mentions are already supported :thinking:

3 Likes
(rizka) #7

For Finnish it should be [åäöÅÄÖ].

3 Likes
(Jeff Atwood) #8

Isn’t this your real Finnish name @rizka :wink:

6 Likes
(rizka) #9

Not quite, I have just one of those in my surname. :slight_smile:

Å/å is actually not a pure Finnish language letter. It never appears anywhere except in the Finnish alphabet, on computer keyboards and in the names of Swedish people and places. Ö/ö is somewhat rare. Ä/ä is by far the most common but, for a reason unknown to me, very uncommon in Finnish first names. It appears in many surnames, though, like mine. :slight_smile:

5 Likes
(Gerhard Schlager) #10

@marguerite You should also remove mentions.js.es6 from the plugin. There’s no need to patch anything related to usernames anymore. Only your customizations for categories and tags might still be needed, but we will fix that as well.

The +? at the end of the regex isn’t needed.
Out of curiosity: Can this whitelist be used for both zh_CN and zh_TW or is there a difference?

6 Likes
(Marguerite Su) #11

@gerhard

I removed mentions.js.es6. Do I need to remove override-username-match.js.es6 as well?

\p{Han} covers traditional and simplified Chinese. My whitelist will allow CJK usernames.

2 Likes
(Daniel Hollas) #12

For EU languages it might be easiest to just allow all extended Latin characters if possible, rather than hand-picking specific letters for every language. :slight_smile:

EDIT: After reading a bit more about homograph attacks, that might not be the best idea after all. :blush:
Here are the characters for Czech:

ěščřžýáíéóůúďťň

7 Likes
(Gerhard Schlager) #13

Yes, you can remove that as well. Looks like that part of the plugin is broken anyway. User cards were refactored about a year ago.

4 Likes
(Erick Guan) #14

Katakana and Hiragana are both Japanese. Hangul is Korean.

I love this work. I think the default settings below should work:

  • zh_CN, zh_TW: \p{Han}. This covers Chinese characters. Some communities may want additional characters, but probably not by default.
  • ko: \p{Hangul}. Korean is generally not written with Chinese characters. (I heard that some Chinese characters are still in use in Korean?)
  • ja: [\p{Han}\p{Katakana}\p{Hiragana}]. Japanese uses all of them.
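As a quick way to try the suggestions in the list above, here is a Ruby sketch that checks sample names against each pattern. The constant SUGGESTED_WHITELISTS and the helper fits_whitelist? are hypothetical, ja is assumed to be the locale code for Japanese, and these are not shipped defaults. The sketch tests the whitelist alone and ignores the rule that ASCII letters and digits are always allowed.

```ruby
# Hypothetical per-locale suggestions; not shipped defaults.
SUGGESTED_WHITELISTS = {
  "zh_CN" => /\p{Han}/,
  "zh_TW" => /\p{Han}/,
  "ko"    => /\p{Hangul}/,
  "ja"    => /[\p{Han}\p{Katakana}\p{Hiragana}]/,
}.freeze

# True if every character of the name matches the locale's whitelist.
def fits_whitelist?(name, locale)
  whitelist = SUGGESTED_WHITELISTS.fetch(locale)
  name.each_char.all? { |char| char.match?(whitelist) }
end

fits_whitelist?("\u6F22\u5B57", "zh_CN")          # two Han characters: true
fits_whitelist?("\uD55C\uAE00", "ko")             # two Hangul syllables: true
fits_whitelist?("\u3072\u3089\u304C\u306A", "ja") # four Hiragana: true
fits_whitelist?("\uD55C\uAE00", "ja")             # Hangul under ja: false
```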

And maybe it’s good to mention reserved_usernames :sweat_smile: Unicode usernames make it possible to create more names that impersonate an admin or moderator.

7 Likes
(Gerhard Schlager) #15

Thanks for the regular expressions and also for the tip regarding reserved usernames. I added a note in the first post.

4 Likes