Confirmation that your site locale is Japanese or that search tokenize chinese japanese korean is enabled
Yes, I have confirmed that both settings are set correctly.
Something surprising happened: after changing ‘min search term length’ from the default value of 2 to 1, we are now able to search for katakana. I don’t know why this helps; is this setting relevant?
The term テスト is converted to テ ス ト after going through CppjiebaRb, and this trips the min_search_length protector we have.
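To make that concrete, here is a rough sketch of the failure mode as I understand it (the CppjiebaRb calls and the per-token length check are assumptions based on the behaviour described above, not the exact Discourse code):

```ruby
require "cppjieba_rb"

term = "テスト"

# CppjiebaRb is a Chinese segmenter with no Japanese dictionary, so the
# katakana word falls apart into one token per character: ["テ", "ス", "ト"].
tokens = CppjiebaRb.filter_segment(CppjiebaRb.segment(term, mode: :mix))

# With the default minimum search term length of 2, every single-character
# token is discarded and the query has nothing left to match.
puts tokens.select { |t| t.length >= 2 }.inspect # => []

# Lowering the setting to 1 lets the single-character tokens through,
# which is why katakana searches started working after that change.
puts tokens.select { |t| t.length >= 1 }.inspect # => ["テ", "ス", "ト"]
```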
@sam This is tricky to fix because we need a proper tokenizer for Japanese to resolve search issues like this for good. We can make tweaks here and there, but it is going to be a game of whack-a-mole.
I came to this topic because I found that searching for some words doesn’t work on my hosted public instance. I have:
min search term length: 1
search tokenize chinese japanese korean: enabled
default locale: Japanese
IIRC, I initialized the site with the English locale and changed the setting to Japanese later.
The words I found that fail to search are “北側”, “真上”, and “一般”. These words appear in this topic. Many words work, but these don’t, and I don’t see any pattern in which words work and which don’t.
Is there a way to check the generated search index on the hosted instance? I can read both Ruby and Japanese, so if there is a way to see how Discourse generates the search index for CJK, I might be able to help.
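For reference, on a self-hosted or development install the generated index can be inspected from the Rails console (on a hosted instance you would likely need staff help instead). A minimal sketch, assuming the PostSearchData model / post_search_data table and the Search.prepare_data helper as they appear in the Discourse source (names unverified for your version, and post id 123 is just a placeholder):

```ruby
# rails console on a self-hosted / dev instance

# What the tokenizer turns a phrase into before it is indexed.
puts Search.prepare_data("北側 真上 一般")

# The stored index row for a specific post.
row = PostSearchData.find_by(post_id: 123)
puts row.raw_data    # text that was fed into the indexer
puts row.search_data # resulting tsvector of tokens and positions
```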
MeCab is sadly not an option; it is GPL, and we prefer to take on only MIT- and BSD-licensed dependencies.
We have a PR that will add TinySegmenter (a compact Japanese word-segmentation tool implemented purely in JavaScript), which has a compatible license. Can you try out the segmenting and let us know how well it works? There is a form on the website you can use to test it.
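For anyone who would rather test from code than through the web form, there is also a Ruby port (the tiny_segmenter gem); assuming its segment API mirrors the JavaScript original (I have not verified the exact method names), a quick side-by-side on the words reported above might look like this:

```ruby
require "cppjieba_rb"
require "tiny_segmenter" # Ruby port of TinySegmenter; API assumed, not verified

%w[北側 真上 一般 テスト].each do |word|
  jieba = CppjiebaRb.filter_segment(CppjiebaRb.segment(word, mode: :mix))
  tiny  = TinySegmenter.segment(word) # assumed to return an array of tokens
  puts "#{word}: jieba=#{jieba.inspect} tiny_segmenter=#{tiny.inspect}"
end
```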