Do not strip diacritics for search in Vietnamese

Search results do not display Vietnamese diacritics.

My example post has the following title and content:

title

“không hiển thị dấu Tiếng Việt trong kết quả tìm kiếm” (in English: “Vietnamese diacritics are not displayed in search results”)

content

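Khi tôi đăng câu hỏi lên diễn đàn với ngôn ngữ Tiếng Việt lúc tôi tìm kiếm thì không hiển thị đầy đủ ngôn ngữ (in English: “When I post a question on the forum in Vietnamese, the language is not fully displayed when I search”)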

When I search for the keywords “không hiển thị dấu Tiếng Việt trong kết quả tìm kiếm”

the result is:

Khi toi dang cau hoi len dien dan voi ngon ngu Tieng Viet luc toi tim kiem thi khong hien thi đay đu ngon ngu

How can I get the search results to display the correctly accented text?

Thanks.


Yes, this is pretty tricky. I can see how it is a problem for Vietnamese communities; the excerpts must look very confusing.

In Vietnamese, is there ever a reason to type without diacritics, e.g. type khong and mean không? I imagine it is a super duper hard no because the language is tone based, so it is the equivalent of me typing “dog” and meaning “milk”.

I think the best way forward here is to make diacritic stripping optional and turn this off in Vietnamese communities.
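To illustrate what "optional diacritic stripping" could look like, here is a minimal Python sketch. It is not Discourse's actual code; the function names and the `strip` flag are hypothetical. It assumes stripping is done by decomposing text to NFD and dropping combining marks, which is one common approach.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


def normalize_for_search(text: str, strip: bool = True) -> str:
    """Lowercase the text and strip diacritics only when `strip` is set.

    `strip` stands in for a hypothetical per-site option that a
    Vietnamese community would switch off.
    """
    lowered = unicodedata.normalize("NFC", text.lower())
    return strip_diacritics(lowered) if strip else lowered
```

With `strip=True`, không and khong collapse to the same search term; with `strip=False` they stay distinct. Note that "đ" has no canonical decomposition, so this particular approach leaves it untouched.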

A bigger and more complex change is to amend it so excerpts are generated from the cooked text and not from the “normalized” cooked text.
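As a rough sketch of that idea (again not Discourse's implementation; `fold` and `excerpt` are hypothetical names), matching can run against a diacritic-free copy of each word while the excerpt itself is assembled from the original words. This assumes simple whitespace tokenization and ignores punctuation.

```python
import unicodedata


def fold(word: str) -> str:
    """Lowercase a word and drop combining marks, for matching only."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def excerpt(original: str, query: str, radius: int = 3) -> str:
    """Return a window of ORIGINAL words around the first matching word."""
    words = original.split()
    folded = [fold(w) for w in words]          # used for matching only
    terms = {fold(t) for t in query.split()}
    for i, candidate in enumerate(folded):
        if candidate in terms:
            start = max(0, i - radius)
            return " ".join(words[start:i + radius + 1])
    # No match: fall back to the beginning of the original text.
    return " ".join(words[: 2 * radius + 1])
```

For example, `excerpt("Khi tôi đăng câu hỏi lên diễn đàn", "cau", radius=1)` matches on the folded word but returns the accented text "đăng câu hỏi".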

FYI English speakers:

dấu: A sign
dàu: Head
dãu: Pudding

So yeah this is a pretty giant issue for Vietnamese.


I see a page that displays accent marks in its search results.
Example: this page (see details).

Where is this configuration? Thanks.

Not yet, I will add an option today, hold tight.

The reason it looks “good” on old search results is that they have not been indexed with the new algorithm yet. If you edit any of the posts there, the bug will pop up.


This is fixed per:

https://github.com/discourse/discourse/commit/9b7cab589ac15a034d1c0e700230c1b3f63f8ba0

After you rebuild with this commit, make sure you run:

./launcher enter app
rake search:reindex

You can try rebuilding in 10-20 minutes, once the commit reaches tests-passed.

