Search a term in Japanese

SSS · 13 July 2020, 5:50 AM

Thank you for your reply.

A sample paragraph here in katakana
通報テスト9,通報テスト11,通報テスト8…etc
A sample search term that you have that is not working
テスト
The “テスト” is not working.

But the “通報” or “通報テスト” seems to be working correctly.

Confirmation that your site locale is in Japanese or that search tokenize chinese japanese korean is enabled
Yes, I have confirmed that both settings are set correctly.

SSS · 15 July 2020, 1:08 AM

An incredible thing happened. After changing the ‘min search term length’ from the default value of 2 to 1, we are now able to search for katakana. I don’t know why, but is this setting relevant?

tgxworld · 24 August 2020, 9:01 AM

I can repro this and it is mainly due to a combination of

https://github.com/discourse/discourse/blob/e8a842ab8cbbabe92fe33cfc4bbe5f839d4543e9/lib/search.rb#L66-L69

and

https://github.com/discourse/discourse/blob/e8a842ab8cbbabe92fe33cfc4bbe5f839d4543e9/lib/search.rb#L242-L243

The term テスト is converted to テスト after going through CppjiebaRb and this trips the min_search_length protector we have.
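The interaction described here can be sketched in plain Ruby. This is a toy model, not Discourse's code: `segment` is a hypothetical stand-in for the tokenizer, the split it returns for テスト is invented purely for illustration (it is not real CppjiebaRb output), and applying the length guard to each token after segmentation is an assumption.

```ruby
# Toy model of a length guard that runs after segmentation.
# HYPOTHETICAL_SPLITS is illustrative only; it is not real CppjiebaRb output.
HYPOTHETICAL_SPLITS = {
  "テスト" => ["テ", "ス", "ト"]
}.freeze

# Stand-in tokenizer: returns the illustrative split, or the term unchanged.
def segment(term)
  HYPOTHETICAL_SPLITS.fetch(term, [term])
end

# Keep only the tokens that are at least min_length characters long.
def searchable_tokens(term, min_length)
  segment(term).select { |token| token.length >= min_length }
end

p searchable_tokens("テスト", 2) # every token is too short, so nothing is left to search
p searchable_tokens("テスト", 1) # all tokens survive the guard
```

Under these assumptions, lowering the minimum length to 1 is what lets the over-segmented term through, which would be consistent with the workaround reported earlier in the thread.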

@sam This is tricky to fix because we need a proper tokenizer for Japanese to resolve search issues like this for good. We can make tweaks here and there, but it is going to be a game of whack-a-mole.

sam · 28 September 2020, 7:14 AM

I don’t think there exists a proper Japanese segmenter we can use.

I think the best thing to do here is simply tone down these defaults to 1.

https://github.com/discourse/discourse/blob/580383dff342a9a12f2270a8224b91c12f0e6ca7/config/site_settings.yml#L1837-L1844

Otherwise we are banning people from searching for "house" in Japanese (家), which seems like a reasonable thing to search for. We allow people to search for "house" in English.
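To make the comparison concrete, here is a minimal Ruby sketch of a length check on the raw term. `long_enough?` is a hypothetical helper, not a Discourse method; it only shows that a minimum of 2 rejects a one-character word such as 家 while accepting "house".

```ruby
# Hypothetical helper: true when the term has at least min_length characters.
# String#length counts characters, not bytes, so 家 counts as 1.
def long_enough?(term, min_length)
  term.strip.length >= min_length
end

%w[家 house].each do |term|
  puts "#{term}: min 2 => #{long_enough?(term, 2)}, min 1 => #{long_enough?(term, 1)}"
end
```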

yashi · 2 February 2022, 10:13 AM

I don't use Ruby these days and I don't know Discourse's requirements, but there seems to be a gem for "MeCab".

I landed on this topic because I found that searching for some words does not work on my hosted public instance. I have:

min search term length: 1
search tokenize chinese japanese korean: enabled
default locale: Japanese

If I remember correctly, I set up the site in English and changed the locale to Japanese later.

The words I found failing in search are "北側", "真上", and "一般". These words do exist in this topic. Many words work, but these don't, and I can't see any pattern in which words work and which don't.

Is there a way to inspect the search index generated on the hosted instance? I can read both Ruby and Japanese, so if there is a way to see how Discourse builds the search index for CJK, I might be of some help.

CppjiebaRb (cppjieba), which @tgxworld mentioned, appears to be designed for Chinese. Is it being used for Japanese as well?

sam · 2 February 2022, 10:21 AM

MeCab is not an option, unfortunately. It is GPL, and we prefer to take on only MIT- and BSD-licensed dependencies.

We have a PR that will add TinySegmenter: Javascriptだけで実装されたコンパクトな分かち書きソフトウェア, which has a compatible license. Can you try out the segmentation and let us know how well it works? There is a form on the site that you can use for testing.

yashi · 2 February 2022, 10:58 AM

I tried tiny_segmenter from Rubygems, and at least it produces the words I listed in my previous comment.

# coding: utf-8
require 'tiny_segmenter'
require 'pp'

s = File.read('topic27.txt')

ts = TinySegmenter.new
sg = ts.segment(s, ignore_punctuation: true)
pp(sg)

bundle exec ruby test.rb | grep -e 北側 -e 真上 -e 一般
 "北側",
 "真上",
 "一般",
 "一般",
 "一般",
 "北側",
 "一般",
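The same check can be done in Ruby instead of grep. The sketch below assumes `tokens` is the array returned by a segmenter (for example `sg` in the script above); `missing_terms` is a hypothetical helper, and the sample token list is made up so the snippet runs without the gem.

```ruby
# Hypothetical helper: returns the wanted words that never appear as a
# standalone token in the segmenter output.
def missing_terms(tokens, wanted)
  wanted - tokens
end

# Made-up token list standing in for real segmenter output.
tokens = ["北側", "の", "真上", "に", "ある"]
wanted = ["北側", "真上", "一般"]

p missing_terms(tokens, wanted)
```

An empty result would mean every wanted word came out as its own token, which is the property that matters for term-based search.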

A quick search about TinySegmenter told me that the model it uses is not very good. There is a model generator for it.

I haven't tried it yet.
