What's the word tokenizer for different languages in Discourse?

Hi, I've been using Discourse in Chinese, and I found that many words are not tokenized well.

A sentence contains many words, and they need to be separated correctly to support keyword search and other important features.

That said, I believe Discourse's Chinese word tokenizer doesn't work well enough.

Is it an old tokenizer? Can we replace it with a newer one?

If you can read Chinese, here are my findings:


We use GitHub - erickguan/cppjieba_rb

which is based on GitHub - yanyiwu/cppjieba, the C++ version of the "Jieba" Chinese word segmenter.

@fantasticfears built the gem that gives Ruby support for it.
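For context, here is a rough sketch of the dictionary-based idea such segmenters build on: forward maximum matching, where the longest dictionary word at each position wins. This is only an illustration, not cppjieba's actual implementation (cppjieba also uses a large dictionary and HMM-based disambiguation for unknown words); the tiny dictionary below is made up for the example.

```python
def segment(text, dictionary, max_word_len=4):
    """Greedy forward maximum matching: at each position, take the
    longest substring that appears in the dictionary, falling back
    to a single character when nothing matches."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a match.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            word = text[i:i + length]
            if length == 1 or word in dictionary:
                tokens.append(word)
                i += length
                break
    return tokens

# Toy dictionary: 中文 (Chinese), 分词 (word segmentation)
dictionary = {"中文", "分词"}
print(segment("中文分词", dictionary))  # → ['中文', '分词']
```

Without a dictionary entry, each character would come out on its own, which is roughly the "not tokenized well" behavior described above when the dictionary doesn't cover a word.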

Are you noticing any specific issues you would like addressed?