Search a term in Japanese

SSS · July 13, 2020, 5:50am

Thank you for your reply.

A sample paragraph here in katakana
通報テスト9,通報テスト11,通報テスト8…etc
A sample search term that you have that is not working
テスト
The “テスト” is not working.

But the “通報” or “通報テスト” seems to be working correctly.

Confirmation that your site locale is in Japanese or that search tokenize chinese japanese korean is enabled
Yes, I have confirmed that both settings are set correctly.

SSS · July 15, 2020, 1:08am

An incredible thing happened. After changing the ‘min search term length’ from the default value of 2 to 1, we are now able to search for katakana. I don’t know why, but is this setting relevant?

tgxworld · August 24, 2020, 9:01am

I can repro this and it is mainly due to a combination of

https://github.com/discourse/discourse/blob/e8a842ab8cbbabe92fe33cfc4bbe5f839d4543e9/lib/search.rb#L66-L69

and

https://github.com/discourse/discourse/blob/e8a842ab8cbbabe92fe33cfc4bbe5f839d4543e9/lib/search.rb#L242-L243

The term テスト is split into smaller tokens after going through CppjiebaRb, and this trips the min_search_length protector we have.
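
To illustrate why lowering the minimum from 2 to 1 makes the search work, here is a minimal sketch of a minimum-length guard applied to segmented input. It assumes the guard is applied per segment; the method name `passes_min_length?` and the example split of テスト are hypothetical and chosen only for illustration. This is not the code in lib/search.rb, and the real output of CppjiebaRb may differ.

```ruby
# Hypothetical guard: a query passes only if every segment produced by the
# tokenizer has at least `min_length` characters. Not Discourse's actual code.
def passes_min_length?(segments, min_length)
  segments.all? { |segment| segment.length >= min_length }
end

# Assumed split, for illustration only.
segments = ["テ", "スト"]

puts passes_min_length?(segments, 2) # with the default minimum of 2 => false
puts passes_min_length?(segments, 1) # with the minimum lowered to 1 => true
```

Under that assumption, a single one-character segment is enough to reject the whole query at the default setting.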

@sam This is tricky to fix because we need a proper tokenizer for Japanese to resolve search issues like this for good. We can make tweaks here and there, but it is going to be a game of whack-a-mole.

sam · September 28, 2020, 7:14am

I don’t think there exists a proper Japanese segmenter we can use.

I think the best thing to do here is simply tone down these defaults to 1.

https://github.com/discourse/discourse/blob/580383dff342a9a12f2270a8224b91c12f0e6ca7/config/site_settings.yml#L1837-L1844

Otherwise we are banning people from searching for “house” in Japanese (家), which seems like a reasonable thing to search for … we allow people to search for “house” in English.
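
The point about 家 can be made concrete: a limit counted in characters treats a one-character Japanese word differently from its five-letter English equivalent. A small sketch, where `long_enough?` is a hypothetical helper and not a Discourse method:

```ruby
# Hypothetical helper: true when the term has at least `min_length` characters.
def long_enough?(term, min_length)
  term.length >= min_length
end

["家", "house"].each do |term|
  puts "#{term}: #{term.length} character(s), " \
       "min 2 => #{long_enough?(term, 2)}, min 1 => #{long_enough?(term, 1)}"
end
```

With a minimum of 2, only the English word passes; with a minimum of 1, both do.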

yashi · February 2, 2022, 10:13am

I don’t use Ruby these days and I don’t know what Discourse’s requirements are, but there seems to be a gem for “mecab”.

I came to this topic because I found that searching for some words does not work on my public hosted instance. I have

min search term length: 1
search tokenize chinese japanese korean: enabled
default locale: Japanese

If I remember correctly, I initialized the site with the English locale and later changed the setting to Japanese.

The words I found that fail to search are “北側”, “真上”, and “一般”. These words are in this topic. Many words work, but these do not. I can’t see any pattern in whether a word works or not.

Is there a way to inspect the search index generated on the hosted instance? I can read Ruby and Japanese, so if there is a way to see how Discourse generates the search index for CJK, I may be able to help.

CppjiebaRb, or cppjieba, mentioned by @tgxworld seems to be for Chinese. Is it used for the Japanese locale?

sam · February 2, 2022, 10:21am

Mecab is not an option, unfortunately: it is GPL, and we prefer to adopt only MIT- and BSD-licensed dependencies.

We have a PR that will add TinySegmenter: Javascriptだけで実装されたコンパクトな分かち書きソフトウェア, which has a compatible license. Can you test the segmentation and let us know how it works? There is a form on the site that you can use to test.

yashi · February 2, 2022, 10:58am

I tried tiny_segmenter from Rubygems and, at least, it outputs the words I listed in the previous comment.

# coding: utf-8
require 'tiny_segmenter'
require 'pp'

# Text to segment, read from a local file
s = File.read('topic27.txt')

# Split the text into segments, dropping punctuation
ts = TinySegmenter.new
sg = ts.segment(s, ignore_punctuation: true)
pp(sg)

bundle exec ruby test.rb | grep -e 北側 -e 真上 -e 一般
 "北側",
 "真上",
 "一般",
 "一般",
 "一般",
 "北側",
 "一般",

A quick search about TinySegmenter told me that the model it uses is not that good. There is a model generator for it.

I haven’t tried it yet.
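
The grep check above can also be done in Ruby, which makes it easy to list the words that are missing from the segmenter output rather than the ones that are present. A minimal sketch: `missing_words` is a hypothetical helper, and the `segments` and `expected` arrays are made-up sample data standing in for the `sg` array returned by `ts.segment` in the script.

```ruby
require 'set'

# Hypothetical helper: returns the expected words that do not appear as
# standalone segments in the tokenizer output.
def missing_words(segments, expected)
  seen = segments.to_set
  expected.reject { |word| seen.include?(word) }
end

# Sample data for illustration; in the script above, pass `sg` instead.
segments = %w[北側 の 真上 一般]
expected = %w[北側 真上 一般 東側]

puts missing_words(segments, expected).inspect
```

An empty result would mean every expected word came out as its own segment, which is the property a search index built from those segments would need.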
