What's the word tokenizer for different languages in Discourse?

Hi, I've been using Discourse in Chinese, and I found that many words are not tokenized well.

A sentence contains many words, and they need to be separated correctly to support keyword search and other important features.

That said, I believe Discourse's Chinese word tokenizer doesn't work well enough.

Is it an old tokenizer? Can we replace it with a newer one?

If you can read Chinese, here are my findings:


We use GitHub - erickguan/cppjieba_rb

which is based on GitHub - yanyiwu/cppjieba, the C++ version of the "Jieba" Chinese word segmenter.

@fantasticfears built the gem that gives Ruby support for it.
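For context, here is a rough sketch of the dictionary-based idea such segmenters build on: forward maximum matching, where the longest dictionary word at each position wins. This is only an illustration, not cppjieba's actual implementation (cppjieba also uses a large dictionary and HMM-based disambiguation for unknown words); the tiny dictionary below is made up for the example.

```python
def segment(text, dictionary, max_word_len=4):
    """Greedy forward maximum matching: at each position, take the
    longest substring that appears in the dictionary, falling back
    to a single character when nothing matches."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a match.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            word = text[i:i + length]
            if length == 1 or word in dictionary:
                tokens.append(word)
                i += length
                break
    return tokens

# Toy dictionary: 中文 (Chinese), 分词 (word segmentation)
dictionary = {"中文", "分词"}
print(segment("中文分词", dictionary))  # → ['中文', '分词']
```

Without a dictionary entry, each character would come out on its own, which is roughly the "not tokenized well" behavior described above when the dictionary doesn't cover a word.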

Are you noticing any specific issues you would like addressed?