Korean words cannot be searched

I don’t think it’s about spaces.
No matter what I search for—공, 공을, or 공을 치다—the search never works for Korean.

Can you try changing min search term length in admin/site_settings to 2 or even 1?

1 Like

Discourse does have reasonable Chinese support. Can you first try with a Chinese post to see whether you can search for Chinese?

Possible result #1: Cannot search for Chinese – there is something wrong with the settings, because Chinese should work.

Possible result #2: Can search for Chinese – it may be the issue of not having a tokenizer for Korean.

4 Likes

WTF it works with min search term length 1 LOL
rofl thanks Sam LOL

1 Like

Tested with Chinese too.
Chinese even worked before I set min search term length to 1.

I do not understand Korean. Can you clarify: do we need word splitting in Korean or not?

1 Like

I think I know why it didn’t work even with min search term length 2:

image

I never put spaces between syllables when I search for Korean words,
but the search terms shown in the dashboard have a space after every syllable…

So what I’m saying is:

What I searched for is 말랑, but the search query that appeared in the dashboard is 말 랑
What I searched for is 중국, but the search query that appeared in the dashboard is 중 국
What I searched for is 차트, but the search query that appeared in the dashboard is 차 트

I think this is why Discourse couldn’t show any search results.
How can I fix this lovely bug?
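
To make the suspected failure concrete, here is a minimal Python sketch. It is not Discourse's actual code: `split_per_syllable`, `split_on_whitespace`, and `apply_min_length` are hypothetical helpers, and the idea that terms shorter than the minimum are simply dropped is an assumption. If that assumption holds, it would explain why nothing came back at min search term length 2 but things started working at 1.

```python
import unicodedata


def split_per_syllable(query: str) -> list[str]:
    # Hypothetical helper: treat every Hangul syllable as its own term,
    # which is what the dashboard output suggests is happening.
    normalized = unicodedata.normalize("NFC", query)
    return [ch for ch in normalized if not ch.isspace()]


def split_on_whitespace(query: str) -> list[str]:
    # Hypothetical helper: keep each space-separated word intact.
    return unicodedata.normalize("NFC", query).split()


def apply_min_length(terms: list[str], min_length: int) -> list[str]:
    # Assumed behavior: terms shorter than the minimum are dropped.
    return [t for t in terms if len(t) >= min_length]


for word in ["말랑", "중국", "차트"]:
    per_syllable = split_per_syllable(word)
    print(
        word, "->", " ".join(per_syllable),
        "| kept at min length 2:", apply_min_length(per_syllable, 2),
        "| kept at min length 1:", apply_min_length(per_syllable, 1),
    )
```

With per-syllable splitting every term is one character long, so a minimum of 2 leaves nothing to search for. Splitting on whitespace instead keeps 말랑 as a single two-character term that survives the same filter.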

I hope my reply above helps a bit to sort out the issue.

One more thing I want to say is:
I think someone said earlier that the Korean language has no spaces, but we do.
We use spaces just like English.

I am a handsome boy.
나는 잘생긴 소년입니다.

나는 = I
잘생긴 = handsome
소년 = boy
입니다 = am (sort of)

We still use spaces between words.

3 Likes

So we should remove all the splitting and just rely on spaces? Just confirming.

2 Likes

I don’t 100% understand what you mean, but if you meant what I mean, then yes.
WTF am I saying LOL

1 Like

It works almost perfectly when I set min search term length to 1.
The only reason it’s almost perfect and not perfect is this:

It literally catches everything. I mean,
if I search for 안녕하세요
then it brings up not only 안녕하세요 but also 안, which is totally different.

It’s like when you search for ‘interesting’ and
the search results include not only ‘interesting’ but also ‘i’, which is not relevant at all.

But the OK part is that search results are shown in order of relevance, so the bad results are far down :stuck_out_tongue:
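
One plausible explanation, sketched in Python. This is an assumption, not something confirmed from Discourse's code: if the query is still split into single syllables and each syllable counts as a term, then any post containing just one of them matches, only with a lower score. The posts and the `score` function below are made up for illustration.

```python
def score(post: str, terms: list[str]) -> int:
    # Toy relevance: count how many query terms appear in the post.
    return sum(1 for term in terms if term in post)


# Made-up example posts.
posts = ["안 돼요", "안녕하세요 여러분"]

# Assumed: the query is split into one term per syllable.
terms = list("안녕하세요")

# Keep every post that matches at least one term, best match first.
matches = [p for p in posts if score(p, terms) > 0]
ranked = sorted(matches, key=lambda p: score(p, terms), reverse=True)

for post in ranked:
    print(score(post, terms), post)
```

Both posts match, but the one containing the whole word ranks first, which fits the "bad results are far down" behavior described above.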

In this case, it seems that you can just split by punctuation or white space in Korean (unless the Korean texts contain Chinese characters, which is rare in contemporary writing).

Japanese, unfortunately, uses no spaces at all though…

3 Likes

If word spacing is used, it is much easier to build the index since the algorithm can treat it like English. For better understanding, does Korean use some really common words to compose sentences even if they do not provide significant meaning? These are what we call stop words. We exclude them because they are extremely common and it is not useful to index them at all.
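
For anyone unfamiliar with stop words, here is a small Python sketch of the idea. The stop list is a tiny made-up English one, purely for illustration; it is not the list Discourse or its database uses, and `index_terms` is a hypothetical helper.

```python
# Made-up stop list for illustration only.
STOP_WORDS = {"i", "a", "am", "the", "is", "of"}


def index_terms(text: str) -> list[str]:
    # Split on whitespace, strip basic punctuation, lowercase,
    # then drop anything on the stop list.
    words = (w.strip(".,!?").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


print(index_terms("I am a handsome boy."))
```

Only the words that carry meaning end up in the index. Whether Korean has words that behave like this is exactly the question being asked above.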

3 Likes

@k11 try this commit out:

(upgrade once the tests have passed)

It removes word segmentation for Korean and changes min search length to 2 for Korean and Japanese.
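
As a rough picture of what that change amounts to, here is a Python sketch. It is not the commit itself: the locale codes, the set of locales that still get segmented, the default minimum of 3, and all the names below are assumptions made for illustration.

```python
import unicodedata

# Assumed: locales that still go through a word segmenter. Korean is not one.
SEGMENTED_LOCALES = {"zh_CN", "zh_TW", "ja"}

# From the description above: minimum length 2 for Korean and Japanese.
MIN_TERM_LENGTH_OVERRIDES = {"ko": 2, "ja": 2}

# Assumed default for every other locale.
DEFAULT_MIN_TERM_LENGTH = 3


def min_term_length(locale: str) -> int:
    return MIN_TERM_LENGTH_OVERRIDES.get(locale, DEFAULT_MIN_TERM_LENGTH)


def prepare_terms(query: str, locale: str) -> list[str]:
    if locale in SEGMENTED_LOCALES:
        # A real word segmenter would go here; it is out of scope for this sketch.
        raise NotImplementedError("segmentation is not sketched here")
    # Korean now takes this path: split on whitespace only.
    words = unicodedata.normalize("NFC", query).split()
    return [w for w in words if len(w) >= min_term_length(locale)]


print(prepare_terms("공을 치다", "ko"))
print(prepare_terms("공", "ko"))
```

Under these assumptions 공을 and 치다 stay intact as search terms, while a single syllable such as 공 falls below the minimum of 2 and is dropped.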

6 Likes

Thank you very much, Sam. I will go try it now :slight_smile: