Search a term in Japanese

SSS · July 13, 2020, 5:50 AM

Thank you for your reply.

A sample paragraph here in katakana:
通報テスト9,通報テスト11,通報テスト8…etc
A sample search term that you have that is not working:
テスト
Searching for “テスト” does not work.

But searching for “通報” or “通報テスト” seems to work correctly.

Confirmation that your site locale is in Japanese or that “search tokenize chinese japanese korean” is enabled:
Yes, I have confirmed that both settings are set correctly.

SSS · July 15, 2020, 1:08 AM

An incredible thing happened. After changing the ‘min search term length’ from the default value of 2 to 1, we are now able to search for katakana. I don’t know why, but is this setting relevant?

tgxworld · August 24, 2020, 9:01 AM

I can repro this and it is mainly due to a combination of

https://github.com/discourse/discourse/blob/e8a842ab8cbbabe92fe33cfc4bbe5f839d4543e9/lib/search.rb#L66-L69

and

https://github.com/discourse/discourse/blob/e8a842ab8cbbabe92fe33cfc4bbe5f839d4543e9/lib/search.rb#L242-L243

The term テスト is split into shorter, space-separated tokens after going through CppjiebaRb, and this trips the “min search term length” protector we have.
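
A minimal sketch of the interaction described above. It assumes the segmenter splits テスト into single-character tokens (the actual CppjiebaRb output may differ) and that tokens shorter than the minimum length are dropped before the query runs. `surviving_terms` is a hypothetical helper for illustration, not Discourse's code.

```ruby
# Hypothetical helper: keep only the tokens that are at least min_length
# characters long, mimicking a minimum-term-length guard.
def surviving_terms(segmented_query, min_length)
  segmented_query.split(/\s+/).reject { |term| term.length < min_length }
end

# Assumed segmenter output for the query テスト; the real split may differ.
segmented = "テ ス ト"

p surviving_terms(segmented, 2) # every token is too short, nothing is left to search
p surviving_terms(segmented, 1) # all tokens survive, so the search can run

# A query whose tokens are long enough passes either way.
p surviving_terms("通報 テスト", 2)
```

Under these assumptions, lowering the minimum from 2 to 1 is what lets the query through, which matches the behaviour reported earlier in the thread.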

@sam This is tricky to fix because we need a proper tokenizer for Japanese to resolve search issues like this for good. We can do tweaks here and there, but it is going to be a game of whack-a-mole.

sam · September 28, 2020, 7:14 AM

I don’t think there exists a proper Japanese segmenter we can use.

I think the best thing to do here is simply tone down these defaults to 1.

https://github.com/discourse/discourse/blob/580383dff342a9a12f2270a8224b91c12f0e6ca7/config/site_settings.yml#L1837-L1844

Otherwise we are banning people from searching for “house” in Japanese (家), which seems like a reasonable search. We allow people to search for “house” in English.
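
To make the comparison concrete, here is a small sketch of how a character-count minimum treats the two words. `long_enough?` is a hypothetical helper used only for illustration; it is not part of Discourse.

```ruby
# Hypothetical guard: a term passes if it has at least min_length characters.
def long_enough?(term, min_length)
  term.length >= min_length
end

%w[家 house].each do |word|
  # String#length counts characters, not bytes, for UTF-8 strings.
  puts "#{word}: #{word.length} character(s), " \
       "passes min 2: #{long_enough?(word, 2)}, " \
       "passes min 1: #{long_enough?(word, 1)}"
end
```

With a minimum of 2 the single-character word 家 is rejected while “house” passes; with a minimum of 1 both pass.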

yashi · February 2, 2022, 10:13 AM

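I haven’t used Ruby recently and I don’t know Discourse’s requirements, but there seems to be a gem for “MeCab”.

I came to this topic because I noticed that searching for some words does not work on my hosted public instance.

min search term length: 1
search tokenize chinese japanese korean: enabled
default locale: Japanese

If I remember correctly, the site was initialized with the English locale and later changed to Japanese.

The words that fail to search are “北側”, “真上”, and “一般”. These words are in this topic. Many words work, but these do not. I cannot see a pattern in which words work and which do not.

Is there a way to inspect the search index generated on a hosted instance? I can read both Ruby and Japanese, so if there is a way to see how Discourse generates the search index for CJK, I might be able to help.

CppjiebaRb, or cppjieba, which @tgxworld mentioned, appears to be for Chinese. Is it also used for the Japanese locale?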

sam · February 2, 2022, 10:21 AM

MeCab is sadly not an option; it is GPL, and we prefer to take on only MIT and BSD licenses in dependencies.

We have a PR that will add TinySegmenter (“TinySegmenter: Javascriptだけで実装されたコンパクトな分かち書きソフトウェア”, a compact word segmenter implemented in JavaScript only), which has a compatible license. Can you try out the segmenting and let us know how well it works? There is a form on the website you can use to test.

yashi · February 2, 2022, 10:58 AM

I tried tiny_segmenter, and it does at least produce the words I listed in my previous comment.

# coding: utf-8
require 'tiny_segmenter'
require 'pp'

# Read the text to segment
s = File.read('topic27.txt')

# Segment the text into words and pretty-print the resulting array
ts = TinySegmenter.new
sg = ts.segment(s, ignore_punctuation: true)
pp(sg)

bundle exec ruby test.rb | grep -e 北側 -e 真上 -e 一般
 "北側",
 "真上",
 "一般",
 "一般",
 "一般",
 "北側",
 "一般",

A quick search on TinySegmenter suggests that the model it uses is not very good. There is a model generator.

I haven’t tried it yet, though.
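
As an alternative to piping through grep, the same check can be sketched in Ruby. The token list below is copied from the grep output above as stand-in data; in a real run it would be the array returned by the segmenter. `count_targets` is a hypothetical helper.

```ruby
# Hypothetical helper: count how often each target word appears among the tokens.
def count_targets(tokens, targets)
  counts = Hash.new(0)
  tokens.each { |token| counts[token] += 1 if targets.include?(token) }
  counts
end

# Stand-in data taken from the grep output, in the order it was printed.
tokens  = %w[北側 真上 一般 一般 一般 北側 一般]
targets = %w[北側 真上 一般]

counts  = count_targets(tokens, targets)
missing = targets - counts.keys

p counts  # occurrences of each target word
p missing # target words the segmenter never produced as a token
```

An empty `missing` list means every target word came out of the segmenter as its own token at least once.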
