Search a term in Japanese

SSS · July 13, 2020 05:50

Thank you for your reply.

A sample paragraph here in katakana
通報テスト9,通報テスト11,通報テスト8…etc
A sample search term that you have that is not working
テスト
Searching for “テスト” does not work.

But searching for “通報” or “通報テスト” seems to work correctly.

Confirmation that your site locale is in Japanese or that search tokenize chinese japanese korean is enabled
Yes, I have confirmed that both settings are set correctly.

SSS · July 15, 2020 01:08

An incredible thing happened. After changing the ‘min search term length’ from the default value of 2 to 1, we are now able to search for katakana. I don’t know why, but is this setting relevant?

tgxworld · August 24, 2020 09:01

I can repro this and it is mainly due to a combination of

github.com/discourse/discourse · lib/search.rb · e8a842ab8

          if ['zh_TW', 'zh_CN', 'ja'].include?(SiteSetting.default_locale) || SiteSetting.search_tokenize_chinese_japanese_korean
            require 'cppjieba_rb' unless defined? CppjiebaRb
            mode = (purpose == :query ? :query : :mix)
            data = CppjiebaRb.segment(search_data, mode: mode)

and

github.com/discourse/discourse · lib/search.rb · e8a842ab8

          min_length = @opts[:min_search_term_length] || SiteSetting.min_search_term_length
          terms = (@term || '').split(/\s(?=(?:[^"]|"[^"]*")*$)/).reject { |t| t.length < min_length }

The term テスト is split into single-character tokens (テ ス ト) after going through CppjiebaRb, and each of those tokens then trips the min_search_term_length protector we have.
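
A minimal sketch of that interaction, in case it helps others follow along. It reuses the split/reject line quoted above, but the segmenter output is hard-coded: that CppjiebaRb returns テスト as space-separated single characters is an assumption here, and `surviving_terms` is a hypothetical helper, not a Discourse method.

```ruby
# Same regex as the quoted lib/search.rb line:
# split on whitespace that is not inside double quotes.
SPLIT_RE = /\s(?=(?:[^"]|"[^"]*")*$)/

# Hypothetical helper mirroring the split/reject step (not Discourse's API).
def surviving_terms(term, min_length)
  (term || '').split(SPLIT_RE).reject { |t| t.length < min_length }
end

# ASSUMPTION: segmenter output for テスト, one character per token.
segmented = "テ ス ト"

p surviving_terms(segmented, 2)  # => []  every token is shorter than 2
p surviving_terms(segmented, 1)  # => ["テ", "ス", "ト"]

# A token that stays two characters long passes the default of 2.
p surviving_terms("通報", 2)      # => ["通報"]
```

With a minimum length of 2 nothing is left to search for, which would explain why lowering the setting to 1 made the katakana search start working.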

@sam This is tricky to fix because we need a proper tokenizer for Japanese to resolve search issues like this for good. We can make tweaks here and there, but it is going to be a game of whack-a-mole.

sam · September 28, 2020 07:14

I don’t think there exists a proper Japanese segmenter we can use.

I think the best thing to do here is simply tone down these defaults to 1.

github.com/discourse/discourse · config/site_settings.yml · 580383dff

          min_search_term_length:
            client: true
            default: 3
            locale_default:
              zh_CN: 2
              zh_TW: 2
              ko: 2
              ja: 2

Otherwise we are banning people from searching for “house” in Japanese (家), which seems like a reasonable thing to search for. We allow people to search for “house” in English.
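
To illustrate the point, here is a rough sketch. The hashes are a hypothetical model of the YAML entry above, not how Discourse actually loads site settings, and reading "tone down these defaults to 1" as lowering the four CJK locale defaults to 1 is an assumption.

```ruby
# Hypothetical model of the min_search_term_length entry quoted above.
CURRENT_DEFAULTS = {
  default: 3,
  locale_default: { "zh_CN" => 2, "zh_TW" => 2, "ko" => 2, "ja" => 2 }
}

# ASSUMPTION: the proposal lowers the CJK locale defaults to 1.
PROPOSED_DEFAULTS = {
  default: 3,
  locale_default: { "zh_CN" => 1, "zh_TW" => 1, "ko" => 1, "ja" => 1 }
}

# Use the locale-specific default if there is one, else the global default.
def min_length_for(settings, locale)
  settings[:locale_default].fetch(locale, settings[:default])
end

# A term is searchable when it is at least as long as the minimum.
def searchable?(term, settings, locale)
  term.length >= min_length_for(settings, locale)
end

p searchable?("家", CURRENT_DEFAULTS, "ja")     # => false
p searchable?("家", PROPOSED_DEFAULTS, "ja")    # => true
p searchable?("house", CURRENT_DEFAULTS, "en")  # => true
```

The one-character word 家 is rejected under a default of 2 while its five-letter English equivalent passes the default of 3, which is the asymmetry being pointed out.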

yashi · February 2, 2022 10:13

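I don’t use Ruby these days and I don’t know Discourse’s requirements, but there seems to be a gem for “MeCab”.

I came to this topic because I found that searching for certain words does not work on the public instance I host. I have

min search term length: 1
search tokenize chinese japanese korean: enabled
default locale: Japanese

As I recall, I initially set up the site in English and changed it to Japanese later.

The words I found that cannot be searched are “北側”, “真上”, and “一般”. These words appear in this topic. Many words can be searched, but these cannot. I can’t see any pattern in which words are searchable and which are not.

Is there a way to inspect the search index generated on a hosted instance? I can read Ruby and Japanese, so if there is a way to see how Discourse generates the search index for CJK, I might be able to help.

The CppjiebaRb, or cppjieba, that @tgxworld mentioned seems to be for Chinese. Is it also used in a Japanese environment?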

sam · February 2, 2022 10:21

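Unfortunately MeCab is not an option; it is GPL, and we prefer to take on only MIT- and BSD-licensed dependencies.

We have a PR that will add http://chasen.org/~taku/software/TinySegmenter/, which has a compatible license. Can you try out the segmentation and let us know how well it works? There is a form on the site for testing.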

yashi · February 2, 2022 10:58

I tried tiny_segmenter (from Rubygems), and at least it produces the words listed in my previous comment.

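# coding: utf-8
require 'tiny_segmenter'
require 'pp'

# Input text to segment, read from a local file
s = File.read('topic27.txt')

# Segment the text into tokens, dropping punctuation
ts = TinySegmenter.new
sg = ts.segment(s, ignore_punctuation: true)
pp(sg)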

bundle exec ruby test.rb | grep -e 北側 -e 真上 -e 一般
 "北側",
 "真上",
 "一般",
 "一般",
 "一般",
 "北側",
 "一般",

A quick search on TinySegmenter tells me that the model it uses is not very good. There is a model generator.

I haven’t tried it yet, though.
