Confirmation that your site locale is Japanese or that search tokenize chinese japanese korean is enabled
Yes, I have confirmed that both settings are set correctly.
Something surprising happened: after changing ‘min search term length’ from the default value of 2 to 1, we are now able to search for katakana. I don’t know why this helps; is this setting relevant?
The term テスト is converted to テ ス ト after going through CppjiebaRb, and this trips the min_search_length protector we have.
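To make that concrete, here is a rough sketch of the failure mode as I understand it (the CppjiebaRb calls and the per-token length check are assumptions based on the behaviour described above, not the exact Discourse code):

```ruby
require "cppjieba_rb"

term = "テスト"

# CppjiebaRb is a Chinese segmenter with no Japanese dictionary, so the
# katakana word falls apart into one token per character: ["テ", "ス", "ト"].
tokens = CppjiebaRb.filter_segment(CppjiebaRb.segment(term, mode: :mix))

# With the default minimum search term length of 2, every single-character
# token is discarded and the query has nothing left to match.
puts tokens.select { |t| t.length >= 2 }.inspect # => []

# Lowering the setting to 1 lets the single-character tokens through,
# which is why katakana searches started working after that change.
puts tokens.select { |t| t.length >= 1 }.inspect # => ["テ", "ス", "ト"]
```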
@sam This is tricky to fix because we need a proper tokenizer for Japanese to resolve search issues like this for good. We can make tweaks here and there, but it is going to be a game of whack-a-mole.
I came to this topic because I found that searching for some words doesn’t work on my hosted public instance. I have:
min search term length: 1
search tokenize chinese japanese korean: enabled
default locale: Japanese
IIRC, I initialized the site with the English locale and changed the setting to Japanese later.
The words I found that fail to search are “北側”, “真上”, and “一般”. These words appear in this topic. Many words work, but these don’t, and I don’t see any pattern in which words work and which don’t.
Is there a way to check the generated search index on the hosted instance? I can read both Ruby and Japanese, so if there is a way to see how Discourse generates the search index for CJK, I might be able to help.
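For reference, on a self-hosted or development install the generated index can be inspected from the Rails console (on a hosted instance you would likely need staff help instead). A minimal sketch, assuming the PostSearchData model / post_search_data table and the Search.prepare_data helper as they appear in the Discourse source (names unverified for your version, and post id 123 is just a placeholder):

```ruby
# rails console on a self-hosted / dev instance

# What the tokenizer turns a phrase into before it is indexed.
puts Search.prepare_data("北側 真上 一般")

# The stored index row for a specific post.
row = PostSearchData.find_by(post_id: 123)
puts row.raw_data    # text that was fed into the indexer
puts row.search_data # resulting tsvector of tokens and positions
```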
MeCab is sadly not an option; it is GPL, and we prefer to take on only MIT- and BSD-licensed dependencies.
We have a PR that will add TinySegmenter (a compact Japanese word-segmentation tool implemented purely in JavaScript), which has a compatible license. Can you try out the segmenting and let us know how well it works? There is a form on the website you can use to test it.
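For anyone who would rather test from code than through the web form, there is also a Ruby port (the tiny_segmenter gem); assuming its segment API mirrors the JavaScript original (I have not verified the exact method names), a quick side-by-side on the words reported above might look like this:

```ruby
require "cppjieba_rb"
require "tiny_segmenter" # Ruby port of TinySegmenter; API assumed, not verified

%w[北側 真上 一般 テスト].each do |word|
  jieba = CppjiebaRb.filter_segment(CppjiebaRb.segment(word, mode: :mix))
  tiny  = TinySegmenter.segment(word) # assumed to return an array of tokens
  puts "#{word}: jieba=#{jieba.inspect} tiny_segmenter=#{tiny.inspect}"
end
```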