Search a term in Japanese

Thank you for your reply.

  1. A sample paragraph here in katakana
    通報テスト9,通報テスト11,通報テスト8…etc

  2. A sample search term that you have that is not working
    テスト
    Searching for “テスト” returns nothing.

    But “通報” and “通報テスト” seem to work correctly.

  3. Confirmation that your site locale is in Japanese or that search tokenize chinese japanese korean is enabled
    Yes, I have confirmed that both settings are set correctly.



An incredible thing happened. After changing the ‘min search term length’ from the default value of 2 to 1, we are now able to search for katakana. I don’t know why, but is this setting relevant?


I can repro this and it is mainly due to a combination of

https://github.com/discourse/discourse/blob/e8a842ab8cbbabe92fe33cfc4bbe5f839d4543e9/lib/search.rb#L66-L69

and

https://github.com/discourse/discourse/blob/e8a842ab8cbbabe92fe33cfc4bbe5f839d4543e9/lib/search.rb#L242-L243

The term テスト is converted to テ ス ト after going through CppjiebaRb and this trips the min_search_length protector we have.
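To illustrate the interaction, here is a minimal sketch. It assumes the guard behaves like a per-token minimum-length filter; `searchable_tokens` is a hypothetical helper, not the actual code in lib/search.rb, and the token list is simply the segmented form quoted above.

```ruby
# Hypothetical helper, not Discourse's actual code: drop tokens that are
# shorter than the configured minimum length.
def searchable_tokens(tokens, min_length)
  tokens.select { |token| token.length >= min_length }
end

# Segmented form of テスト as quoted in the post above.
segmented = "テ ス ト".split(" ")

p searchable_tokens(segmented, 2) # nothing survives a minimum of 2
p searchable_tokens(segmented, 1) # all three tokens survive a minimum of 1
```

Under that assumption, a minimum of 2 leaves nothing to search for, which would match the behaviour reported earlier in the thread.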

@sam This is tricky to fix because we need a proper tokenizer for Japanese to resolve search issues like this for good. We can make tweaks here and there, but it is going to be a game of whack-a-mole.


I don’t think there exists a proper Japanese segmenter we can use.

I think the best thing to do here is simply tone down these defaults to 1.

https://github.com/discourse/discourse/blob/580383dff342a9a12f2270a8224b91c12f0e6ca7/config/site_settings.yml#L1837-L1844

Otherwise we are banning people from searching for “house” in Japanese (家), which seems like a reasonable thing to search for. We allow people to search for “house” in English.
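A quick sketch of the length argument, assuming the minimum is compared against the character count of the term; `long_enough?` is a hypothetical name used only for illustration.

```ruby
# Hypothetical whole-term check, for illustration only.
def long_enough?(term, min_length)
  term.length >= min_length # String#length counts characters, not bytes
end

puts long_enough?("house", 2) # the English word passes a minimum of 2
puts long_enough?("家", 2)    # the single-kanji word does not
puts long_enough?("家", 1)    # it passes once the minimum is 1
```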


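I don’t use Ruby these days and I don’t know Discourse’s requirements, but there seems to be a gem for “MeCab”.

I came to this topic because I found that searching for certain words does not work on the public instance I host. I have

  • min search term length: 1
  • search tokenize chinese japanese korean: enabled
  • default locale: Japanese

As I recall, I initially set the site up in English and later changed it to Japanese.

The words that fail to search are “北側”, “真上”, and “一般”. These words appear in this topic. Many words can be searched, but these cannot. I can’t see any pattern in which words are searchable and which are not.

Is there a way to inspect the search index generated on a hosted instance? I can read Ruby and Japanese, so if there is a way to see how Discourse generates the search index for CJK, I might be able to help.

The CppjiebaRb (cppjieba) that @tgxworld mentioned seems to be for Chinese. Is it also used in a Japanese environment?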


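Unfortunately MeCab is not an option: it is GPL, and we prefer to take on only MIT- and BSD-licensed dependencies.

We have a PR that will add http://chasen.org/~taku/software/TinySegmenter/, which has a compatible license. Can you try out the segmentation and tell us how well it works? There is a form on the site you can use for testing.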


I tried tiny_segmenter (from Rubygems), and at least it produces the words I listed in my previous comment.

# coding: utf-8
require 'tiny_segmenter'
require 'pp'

s = File.read('topic27.txt')

ts = TinySegmenter.new
sg = ts.segment(s, ignore_punctuation: true)
pp(sg)

bundle exec ruby test.rb | grep -e 北側 -e 真上 -e 一般
 "北側",
 "真上",
 "一般",
 "一般",
 "一般",
 "北側",
 "一般",
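
The same check could be done in Ruby instead of grep. This is only a sketch: `missing_terms` is a hypothetical helper, and the token list is copied from the output above rather than produced by running the segmenter.

```ruby
# Hypothetical helper: which of the wanted words are absent from the tokens?
def missing_terms(tokens, wanted)
  wanted - tokens.uniq
end

# Tokens copied from the grep output above; in practice this would be
# the array returned by the segmenter.
tokens = %w[北側 真上 一般 一般 一般 北側 一般]
wanted = %w[北側 真上 一般]

p missing_terms(tokens, wanted) # empty when every wanted word was produced
```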

A quick search on TinySegmenter tells me that the model it uses is not very good. There is a model generator.

I haven’t tried it yet, though.
