Hi, I have been using Discourse in Chinese, and I found that many words are not tokenized well.
Chinese sentences are written without spaces between words, so the tokenizer has to segment them correctly to support keyword search and other important features.
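For context on what segmentation involves, here is a minimal sketch of forward maximum matching, the classic dictionary-based baseline that modern tokenizers such as Jieba build on. The tiny `DICT` below is a toy assumption for illustration, not Discourse's actual dictionary:

```python
# Forward maximum matching (FMM): at each position, greedily take the
# longest substring found in the dictionary, falling back to a single
# character when nothing matches.
DICT = {"清华大学", "清华", "大学", "北京", "来到", "我"}  # toy dictionary
MAX_LEN = max(len(w) for w in DICT)

def fmm_segment(text: str) -> list[str]:
    """Segment `text` by greedy longest-match against DICT."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in DICT or size == 1:
                tokens.append(candidate)
                i += size
                break
    return tokens

print(fmm_segment("我来到北京清华大学"))
# → ['我', '来到', '北京', '清华大学']
```

A greedy matcher like this already shows why dictionary quality matters: if "清华大学" were missing from `DICT`, the sentence would split into "清华" and "大学" instead, and a search for the full university name could miss the post.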
However, I have to say that Discourse's Chinese tokenizer does not seem to work well enough.
Is it an old tokenizer? Can we replace it with a newer one?
If you can read Chinese, here are my findings: