Search a term in Japanese

SSS · 13 July 2020, 5:50am

Thank you for your reply.

A sample paragraph here in katakana
通報テスト9,通報テスト11,通報テスト8…etc
A sample search term that you have that is not working
テスト
Searching for “テスト” does not work.


But searching for “通報” or “通報テスト” seems to work correctly.


Confirmation that your site locale is in Japanese or that search tokenize chinese japanese korean is enabled
Yes, I have confirmed that both settings are set correctly.



SSS · 15 July 2020, 1:08am

An incredible thing happened. After changing the ‘min search term length’ from the default value of 2 to 1, we are now able to search for katakana. I don’t know why, but is this setting relevant?

tgxworld · 24 August 2020, 9:01am

I can repro this and it is mainly due to a combination of

https://github.com/discourse/discourse/blob/e8a842ab8cbbabe92fe33cfc4bbe5f839d4543e9/lib/search.rb#L66-L69

and

https://github.com/discourse/discourse/blob/e8a842ab8cbbabe92fe33cfc4bbe5f839d4543e9/lib/search.rb#L242-L243

The term テスト is converted to テ ス ト after going through CppjiebaRb, and the resulting single-character tokens trip the min_search_length protector we have.
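
A minimal sketch of how a minimum-length guard can swallow a whole query, assuming the tokenizer emits テスト as three one-character tokens. The helpers `naive_tokenize` and `filter_short_tokens` are hypothetical stand-ins for illustration, not the actual code in lib/search.rb.

```ruby
# Hypothetical stand-in for the tokenizer step: split a term into
# single characters (an assumption about how テスト gets segmented).
def naive_tokenize(term)
  term.chars
end

# Hypothetical stand-in for the minimum-length guard: keep only tokens
# that have at least min_length characters.
def filter_short_tokens(tokens, min_length)
  tokens.select { |token| token.length >= min_length }
end

tokens = naive_tokenize("テスト")
kept_with_default = filter_short_tokens(tokens, 2) # default minimum of 2
kept_with_one = filter_short_tokens(tokens, 1)     # minimum lowered to 1

puts "tokens:       #{tokens.inspect}"
puts "min length 2: #{kept_with_default.inspect}"
puts "min length 1: #{kept_with_one.inspect}"
```

Under that assumption, a minimum of 2 leaves nothing to search for, while a minimum of 1 keeps every token, which would match the behaviour reported above after the setting was changed.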

@sam This is tricky to fix because we need a proper tokenizer for Japanese to resolve search issues like this for good. We can make tweaks here and there, but it is going to be a game of whack-a-mole.

sam · 28 September 2020, 7:14am

I don’t think there exists a proper Japanese segmenter we can use.

I think the best thing to do here is simply tone down these defaults to 1.

https://github.com/discourse/discourse/blob/580383dff342a9a12f2270a8224b91c12f0e6ca7/config/site_settings.yml#L1837-L1844

Otherwise we are banning people from searching for “house” in Japanese (家), which seems like a reasonable thing to search for … we allow people to search for “house” in English.

yashi · 2 February 2022, 10:13am

I don’t use Ruby these days, nor do I know Discourse’s requirements, but there seems to be a gem for “MeCab”.

I came to this topic because I found that searching for some words doesn’t work on my hosted public instance. I have

min search term length: 1
search tokenize chinese japanese korean: enabled
default locale: Japanese

If I remember correctly, I initialized the site with the English locale and later changed the setting to Japanese.

The words I found to be unsearchable are “北側”, “真上”, “一般”. These words are in this topic. Many words work, but these don’t. I don’t see a pattern in whether a word works or not.

Is there a way to inspect the generated search index on the hosted instance? I can read both Ruby and Japanese, so if there is a way to see how Discourse generates the search index for CJK, I might be able to help.

CppjiebaRb, or cppjieba, mentioned by @tgxworld seems to be for Chinese. Is it used for the Japanese locale?

sam · 2 February 2022, 10:21am

MeCab isn’t an option, unfortunately: it is GPL, and we prefer to accept only MIT and BSD licenses in dependencies.

We have a PR that will add TinySegmenter: Javascriptだけで実装されたコンパクトな分かち書きソフトウェア, which has a compatible license. Could you try the segmentation and let us know how it works? There is a form on the website that you can use to test.

yashi · 2 February 2022, 10:58am

I tried tiny_segmenter from Rubygems, and at least it produces the words I listed in my previous comment.

# coding: utf-8
require 'tiny_segmenter'
require 'pp'

# Read the text to segment
s = File.read('topic27.txt')

# Segment the text into tokens, ignoring punctuation
ts = TinySegmenter.new
sg = ts.segment(s, ignore_punctuation: true)
pp(sg)

bundle exec ruby test.rb | grep -e 北側 -e 真上 -e 一般
 "北側",
 "真上",
 "一般",
 "一般",
 "一般",
 "北側",
 "一般",
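
The same check can be done without grep. This is a rough sketch in plain Ruby that counts how often each target word appears as a standalone token. The helper `count_target_tokens` and the sample token list are hypothetical; in practice the tokens would come from the segmenter output.

```ruby
# Count how often each target word appears as a standalone token.
# Hypothetical helper for illustration; not part of tiny_segmenter.
def count_target_tokens(tokens, targets)
  targets.to_h { |word| [word, tokens.count(word)] }
end

# Hand-written sample tokens (an assumption, not real segmenter output).
sample_tokens = ["北側", "の", "窓", "真上", "一般", "的", "一般"]

counts = count_target_tokens(sample_tokens, ["北側", "真上", "一般", "通報"])
counts.each { |word, n| puts "#{word}: #{n}" }
```

A count of zero for a word means the segmenter never emitted it as its own token, which is one plausible reason for a word not being searchable.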

A quick search on TinySegmenter told me that the model it uses is not that good. There is a model generator for it.

I haven’t tried it yet.
