Korean words can't be searched

I don’t think it’s about spaces.
Whether I search for 공, 공을, or 공을 치다, search never works for Korean.

Can you try changing min search term length in admin/site_settings to 2 or even 1?


Discourse does have reasonable Chinese support. Can you first try with a Chinese post to see whether you can search for Chinese?

Possible result #1: Cannot search for Chinese – there is something wrong with the settings, because Chinese should work.

Possible result #2: Can search for Chinese – it may be the issue of not having a tokenizer for Korean.


WTF it works with min search term length 1 LOL
rofl thanks Sam LOL


I tested with Chinese too.
Chinese worked even before I set min search term length to 1.

I do not understand Korean. Can you clarify: do we need word splitting in Korean or not?


I think I know why it didn’t work even with min search term length 2

(screenshot: search terms as shown in the dashboard)

I never put a space between syllables when I search for Korean words,
but the search terms shown in the dashboard have a space after every syllable…

so what I’m saying is

I searched for 말랑, but the search query that appeared in the dashboard is 말 랑
I searched for 중국, but the search query that appeared in the dashboard is 중 국
I searched for 차트, but the search query that appeared in the dashboard is 차 트

I think this is why Discourse couldn’t show any search results.
How can I fix this lovely bug?
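
To illustrate what I think is happening, here is a minimal Python sketch. It is not Discourse's actual code: `split_syllables` and `filter_terms` are hypothetical names, and I'm assuming the search pipeline puts a space after every syllable and then drops terms shorter than the minimum length.

```python
import unicodedata


def split_syllables(query: str) -> str:
    """Hypothetical stand-in for the per-syllable splitting seen in the dashboard."""
    # Normalize to NFC so each Hangul syllable is a single character.
    normalized = unicodedata.normalize("NFC", query)
    return " ".join(ch for ch in normalized if not ch.isspace())


def filter_terms(query: str, min_length: int) -> list[str]:
    """Keep only whitespace-separated terms with at least min_length characters."""
    return [term for term in query.split() if len(term) >= min_length]


for word in ["말랑", "중국", "차트"]:
    split = split_syllables(word)
    print(word, "->", split,
          "| min length 2:", filter_terms(split, 2),
          "| min length 1:", filter_terms(split, 1))
```

Under these assumptions every term is one syllable long after splitting, so nothing survives a minimum length of 2, while a minimum length of 1 keeps both syllables as separate terms.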

I hope my reply above helps a bit to sort out the issue.

One more thing I want to say:
I think someone said earlier that Korean has no spaces, but it does.
We use spaces just like English.

I am a handsome boy.
나는 잘생긴 소년입니다.

나는 = I
잘생긴 = handsome
소년 = boy
입니다 = am (sort of)

We still use spaces between words.


So we should remove all the splitting and just rely on spaces? Just confirming.


I don’t 100% understand what you mean but if you meant what I mean then yes.
WTF am I saying LOL


It works almost perfectly when I set min search term length to 1.
The only reason it’s almost perfect rather than perfect is this:

It literally catches everything. I mean,
if I search for 안녕하세요
then it brings up not only 안녕하세요 but also 안, which is totally different.

It’s like when you search for ‘interesting’
and the search results include not only ‘interesting’ but also ‘i’, which is not relevant at all.

The good part is that search results are shown in order of relevance, so the bad results are far down :stuck_out_tongue:

In this case, it seems that you can just split by punctuation or white space in Korean (unless the Korean texts contain Chinese characters, which is rare in contemporary writing).

Japanese, unfortunately, is purely no-space though…
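
A minimal sketch of what splitting by punctuation or white space could look like, in Python. This is only an illustration, not how Discourse tokenizes; `tokenize` is a hypothetical name, and it relies on Python's `\W` treating Hangul syllables as word characters.

```python
import re

# Runs of non-word characters (whitespace, punctuation) or underscores act as separators.
SEPARATORS = re.compile(r"[\W_]+")


def tokenize(text: str) -> list[str]:
    """Split text on whitespace and punctuation, dropping empty pieces."""
    return [token for token in SEPARATORS.split(text) if token]


print(tokenize("나는 잘생긴 소년입니다."))
# ['나는', '잘생긴', '소년입니다']
```

Note that 소년입니다 stays a single token here, because there is no space inside it.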


If word spacing is used, it’s much easier to build the index since the algorithm can treat it like English. To understand better: does Korean use some really common words to compose a sentence even though they don’t carry significant meaning? These are what we call stop words. We leave them out because they are extremely common and it’s not useful to index them at all.
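
As a rough illustration of stop-word filtering, here is a small Python sketch using the English sentence from earlier in the thread. The `STOP_WORDS` list and `index_terms` are made up for this example; a real list would be language-specific and much longer.

```python
# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "am", "is", "are", "i"}


def index_terms(text: str) -> list[str]:
    """Lowercase the text, split on spaces, and drop stop words."""
    words = text.lower().replace(".", " ").split()
    return [word for word in words if word not in STOP_WORDS]


print(index_terms("I am a handsome boy."))
# ['handsome', 'boy']
```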


@k11 try this commit out:

https://github.com/discourse/discourse/commit/c677877e4fe5381f613279901f36ae255c909573

(upgrade once the tests have passed)

It removes word segmentation from Korean and changes min search length to 2 for Korean and Japanese.


thank you very much Sam I will go try now :slight_smile: