Unicode usernames and group names

Lifting the restrictions on allowed characters in usernames is one of the oldest feature requests. Starting with Discourse 2.3.0.beta9 it’s finally possible to use Unicode characters within usernames and group names.

New Site Settings

There are two new site settings: allowed unicode username characters and unicode usernames.

allowed unicode username characters lets you restrict usernames to certain Unicode characters (e.g. [äöüßÄÖÜẞ] or \p{Greek}). By default, Discourse permits letters (Ll, Lm, Lo, Lt, Lu), marks (Mc, Me, Mn) and numbers (Nd, Nl, but not No). The setting can only narrow that default set; it’s not possible to allow additional characters. It’s also not possible to forbid ASCII letters and numbers.
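The layered rule described above can be sketched in Ruby. This is only a minimal illustration and not Discourse’s actual validator: the method name `username_chars_allowed?`, the constant names and the exact ASCII set (letters, digits, `_`, `-`, `.`) are assumptions made for the example.

```ruby
# Hypothetical sketch of the layered rule; not Discourse's real validator.

# Assumed set of ASCII characters that is always permitted.
ASCII_CHARS = /[A-Za-z0-9_.\-]/

# Default Unicode classes: letters (L*), marks (M*) and the Nd / Nl number categories.
DEFAULT_UNICODE_CHARS = /[\p{L}\p{M}\p{Nd}\p{Nl}]/

# whitelist: a Regexp built from the site setting, or nil when the setting is empty.
def username_chars_allowed?(username, whitelist = nil)
  username.each_char.all? do |char|
    next true if char.match?(ASCII_CHARS)

    # A non-ASCII character must pass the defaults AND the whitelist,
    # so the whitelist can only narrow the defaults, never widen them.
    char.match?(DEFAULT_UNICODE_CHARS) && (whitelist.nil? || char.match?(whitelist))
  end
end

german = /[äöüßÄÖÜẞ]/
puts username_chars_allowed?("müller", german)     # => true
puts username_chars_allowed?("Νικος", german)      # => false (Greek is not whitelisted)
puts username_chars_allowed?("Νικος", /\p{Greek}/) # => true
puts username_chars_allowed?("a①", /[①]/)          # => false (① is category No)
```

The last call shows why the setting cannot add characters: "①" is whitelisted, but it is still rejected because it is not part of the default set.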

You should tailor it to your community’s needs and only allow characters and scripts that are needed for the languages used by your community.

Take a look at the Ruby documentation if you want to know more about character classes and character properties in regular expressions.

unicode usernames is disabled by default and we strongly advise you to configure the allowed unicode username characters setting before enabling it in order to prevent homograph username spoofing.
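To see why an unrestricted character set invites spoofing, consider a name that mixes Latin and Cyrillic letters. The snippet below is only an illustration; `scripts_used` is a hypothetical helper, and the three scripts it checks were picked for the example.

```ruby
# Illustration only: the Cyrillic "а" (U+0430) looks like the Latin "a" (U+0061).
real  = "admin"
spoof = "\u0430dmin" # the first letter is Cyrillic

puts real == spoof # => false, although both render almost identically

# Hypothetical helper: report which of a few scripts occur in a name.
SCRIPTS = {
  latin:    /\p{Latin}/,
  cyrillic: /\p{Cyrillic}/,
  greek:    /\p{Greek}/
}.freeze

def scripts_used(name)
  SCRIPTS.select { |_script, pattern| name.match?(pattern) }.keys
end

p scripts_used(real)  # => [:latin]
p scripts_used(spoof) # => [:latin, :cyrillic]
```

With a whitelist such as [äöüßÄÖÜẞ], the Cyrillic letter is not allowed, so the look-alike name cannot be registered.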

Example allowed values:

  • zh_CN Chinese: [\p{Han}]
  • zh_TW Chinese: [\p{Han}]
  • ko Korean only: [\p{Hangul}]
  • ja Japanese: [\p{Han}\p{Katakana}\p{Hiragana}]
  • ja Japanese (カタカナ only): [\p{Katakana}]
  • fi Finnish: [åäöÅÄÖ]
  • cs Czech: [ěščřžýáíéóůúďťň]
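If you want to try such patterns before saving the setting, a few lines of Ruby are enough. The patterns below are taken from the list above; the sample names are invented, and `covered?` is a hypothetical helper that ignores ASCII characters, on the assumption that those are always allowed.

```ruby
# Patterns are taken from the list above; the sample names are invented.
SAMPLES = [
  ["Chinese (simplified)",     /[\p{Han}]/,                         "汉字"],
  ["Chinese (traditional)",    /[\p{Han}]/,                         "漢字"],
  ["Korean",                   /[\p{Hangul}]/,                      "한국"],
  ["Japanese",                 /[\p{Han}\p{Katakana}\p{Hiragana}]/, "東京さくらタロウ"],
  ["Japanese (katakana only)", /[\p{Katakana}]/,                    "タロウ"],
  ["Finnish",                  /[åäöÅÄÖ]/,                          "Päivi"],
  ["Czech",                    /[ěščřžýáíéóůúďťň]/,                 "Jiří"]
].freeze

# Hypothetical helper: true when every non-ASCII character matches the pattern.
def covered?(name, pattern)
  name.each_char.reject(&:ascii_only?).all? { |char| char.match?(pattern) }
end

SAMPLES.each do |language, pattern, name|
  puts "#{language}: #{name} -> #{covered?(name, pattern)}"
end

# A name written in a script outside the pattern is not covered.
puts covered?("한국", /[\p{Han}]/) # => false
```

Both the simplified and the traditional sample match \p{Han}, which is why the same pattern is listed for zh_CN and zh_TW.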

Letter Avatar Service

The Letter Avatar Service has been updated and we added support for generating avatars with the most commonly used scripts. Feel free to create a pull request on GitHub to add a font from the Google Noto Fonts family if you encounter missing avatars for your language.

Enabling unicode usernames is only possible when the external system avatars enabled site setting is enabled, because the internal avatar generator doesn’t support Unicode. You can run your own instance of the Letter Avatar Service if you can’t or don’t want to rely on the external service.

We even support the brand new glyph for “令和” (Reiwa) that was added to Unicode in May.

Good to know…

Discourse counts grapheme clusters (“user-perceived characters”) instead of Unicode codepoints when it validates username length (min username length and max username length site settings). The Letter Avatar Service also uses the first grapheme cluster of a username to generate an avatar.
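The difference between codepoints and grapheme clusters is easy to see in Ruby (2.5 or newer). This is an illustration only: `username_length` and `avatar_glyph` are hypothetical helpers, not the names Discourse or the Letter Avatar Service use.

```ruby
# Illustration only; both helpers are hypothetical.
def username_length(username)
  username.grapheme_clusters.length
end

def avatar_glyph(username)
  username.grapheme_clusters.first
end

decomposed = "e\u0301ric"         # "é" written as "e" + combining acute accent
thai       = "\u0E17\u0E35\u0E48" # one Thai consonant followed by two combining marks

puts decomposed.length           # => 5 (codepoints)
puts username_length(decomposed) # => 4 (user-perceived characters)
puts thai.length                 # => 3
puts username_length(thai)       # => 1

# The first grapheme cluster keeps the accent attached to its base letter.
puts avatar_glyph(decomposed) == "e\u0301" # => true
```

Counting clusters means that a name made of a few characters with combining marks is not rejected as too long just because it consists of many codepoints.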

You should also take a look at the reserved usernames site setting. You might want to add additional usernames now that your forum supports Unicode in usernames.

Feedback

Did you enable Unicode usernames for your community? We’d like to hear your feedback.
Also, we want to ship sensible default values for the unicode username character whitelist for each locale supported by Discourse. Please feel free to suggest regular expressions in a reply.

36 Likes

Thanks for the new feature!

I have a Discourse instance running for Chinese users, and I would like to test it.

However, we have installed the discourse-username-localization plugin, because Unicode usernames were not officially supported before.

How can I disable that plugin and switch to the official solution? Will it break anything? Are there any recommended steps to follow?

If this can be done, I think every CJK instance will switch to the official solution and contribute whitelists immediately :grinning:

5 Likes

It looks like the plugin also changes the behavior for linking to CJK tags and categories. This will probably break, but we should fix it in Discourse core. That should be easy to fix.

Other than that, disabling the plugin and enabling the official Unicode support should work without problems. Letter avatars will look different afterwards, because the plugin currently converts Chinese usernames into Latin characters. But I guess that’s a good thing. :slight_smile:

9 Likes

Thanks.

I’ll create a branch that keeps only the parts that are not implemented in core yet and try the official solution, so the two don’t conflict.

Tags and categories use the same set of regular expressions, but in JavaScript, which doesn’t support \p{Katakana} and similar properties. I raised an issue to unify the regexes in that plugin, but the attempt failed. Is it possible to use the same whitelist in the official implementation, e.g. with a converter that translates the Ruby whitelist to JavaScript?

And the unicode avatar is just excellent!

6 Likes

I just switched my forum to Unicode usernames.

I updated discourse-username-localization to remove all the Ruby parts. (I can’t wait to see you fix hashtags and mentions in core so I can abandon it completely.)

I use this whitelist:

[\p{Han}\p{Katakana}\p{Hiragana}\p{Hangul}]

I also updated the letter_avatar service to v4.

Now it works.

5 Likes

I think mentions are already supported :thinking:

3 Likes

For Finnish it should be [åäöÅÄÖ].

4 Likes

Isn’t this your real Finnish name @rizka :wink:

7 Likes

Not quite, I have just one of those in my surname. :slight_smile:

Å/å is actually not a pure Finnish language letter. It never appears anywhere except the Finnish alphabet, computer keyboards and names of Swedish people and places. Ö/ö is somewhat rare. Ä/ä is by far the most common, but for a reason unknown to me, very uncommon in Finnish first names. Appears in many surnames, though, like mine. :slight_smile:

7 Likes

@marguerite You should also remove mentions.js.es6 from the plugin. There’s no need to patch anything related to usernames anymore. Only your customizations for categories and tags might still be needed, but we will fix that as well.

The +? at the end of the regex isn’t needed.
Out of curiosity: Can this whitelist be used for both zh_CN and zh_TW or is there a difference?

6 Likes

@gerhard

I removed mentions.js.es6, do I need to remove override-username-match.js.es6 as well?

\p{Han} covers both traditional and simplified Chinese. My whitelist allows CJK usernames in general.

2 Likes

For EU languages it might be easiest to just allow all extended Latin if possible, rather than hand-picking specific letters for every language. :slight_smile:

EDIT: Although after reading a bit more about homograph attacks, that might not be the best idea after all. :blush:

Here are the characters for Czech:

ěščřžýáíéóůúďťň

8 Likes

Yes, you can remove that as well. Looks like that part of the plugin is broken anyway. User cards were refactored about a year ago.

4 Likes

Katakana and Hiragana are both Japanese. Hangul is Korean.

I love this work. I think the following defaults should work:

  • zh_CN, zh_TW: \p{Han}. This covers Chinese characters. Some communities may need additional characters, but those probably shouldn’t be part of the default.
  • ko: \p{Hangul}. Koreans don’t write Chinese characters at all. (I heard some Chinese characters are still in use in Korean?)
  • ja: [\p{Han}\p{Katakana}\p{Hiragana}]. Japanese uses all of them.

It may also be good to mention reserved_usernames :sweat_smile: Unicode usernames do make it possible to fake more names that look like admin/moderator.

9 Likes

Thanks for the regular expressions and also for the tip regarding reserved usernames. I added a note in the first post.

5 Likes

How does this option affect migrations? I am migrating from Kunena with a script based on the “official” kunena3.rb script.

I have a user called abd-def (for example). It gets imported as abddef.

Then I turned on this option for the unicode usernames and deleted that user, and re-ran the script. It was again imported as abddef :frowning:

How can I ensure my user names with dashes don’t get changed during import?

Thanks!

The removal of the dash has nothing to do with Unicode usernames. That’s happening because the import script manipulates the username during the import.

https://github.com/discourse/discourse/blob/7f8cdea9244760b7f27bcebb86de0121006f0ce3/script/import_scripts/kunena3.rb#L89-L93

I don’t think there’s any need for that. Try replacing those lines with the following code snippet. It should work.

@users[u['id'].to_i] = { id: u['id'].to_i, username: u['username'], email: u['email'], created_at: u['registerDate'] }

7 Likes

I’m happy to see support for Unicode usernames and group names :+1:.

With the introduction of Unicode usernames, however, there’s now a bit of an odd situation: Discourse can support something like 中国 or ไทย as a username, but not -dashing-, because it still requires the first and last character to be a letter, number or underscore (but not a dash).

I tried using the Unicode support setting to add support for the dash character but that didn’t seem to work for me, although I may have missed something.

Would it make sense to revise the first/last character rule for the dash now that Unicode is supported? Is there a reason to keep disallowing the dash in the first and last position while allowing any non-ASCII letter, as well as the underscore? The dash doesn’t seem to require special encoding in URLs, but maybe there’s another reason for this?

I know this is a bit of a tangent to the topic, so let me know if I should open a separate one.

@gerhard Can a user name be like this?

discource__
or
discource_name
?

Because I can’t seem to make it work!

TIA

See reserved_usernames site setting.

1 Like