Do not strip diacritics for search in Vietnamese


(Phạm Quyết Nghị) #1

Search results do not display Vietnamese diacritics.

My example post has the following title and content:

title

“không hiển thị dấu Tiếng Việt trong kết quả tìm kiếm” (in English: “Vietnamese diacritics are not displayed in search results”)

content

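Khi tôi đăng câu hỏi lên diễn đàn với ngôn ngữ Tiếng Việt lúc tôi tìm kiếm thì không hiển thị đầy đủ ngôn ngữ (in English: “When I post a question on the forum in Vietnamese, the text is not displayed in full when I search”)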

When I search for the keywords “không hiển thị dấu Tiếng Việt trong kết quả tìm kiếm”,

the result is:

Khi toi dang cau hoi len dien dan voi ngon ngu Tieng Viet luc toi tim kiem thi khong hien thi đay đu ngon ngu

How can I get the search results to display the correctly accented text?

Thanks.


(Sam Saffron) #2

Yes, this is pretty tricky. I can see how it is a problem for Vietnamese communities; the excerpts must look very confusing.

In Vietnamese, is there ever a reason to type without diacritics, e.g. type khong and mean không? I imagine it is a super duper hard no, because the language is tone based, so that would be the equivalent of me typing “dog” and meaning “milk”.

I think the best way forward here is to make diacritic stripping optional and turn this off in Vietnamese communities.

A bigger and more complex change is to amend it so that excerpts are generated from the cooked text and not from the “normalized” cooked text.
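The two options above can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not Discourse's actual implementation: the names `strip_diacritics` and `search`, and the `strip` flag standing in for a per-site setting, are all hypothetical. It only shows the idea of matching against a normalized copy while returning the original, accented text as the excerpt.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def search(posts, query, strip=True):
    """Return the original text of every post that contains the query.

    `strip` is a hypothetical per-site option. When it is False, matching
    is diacritic-sensitive. In both cases the returned excerpt is the
    original text, never the normalized copy used for matching.
    """
    def normalize(s):
        s = unicodedata.normalize("NFC", s.casefold())
        return strip_diacritics(s) if strip else s

    needle = normalize(query)
    return [post for post in posts if needle in normalize(post)]


posts = ["không hiển thị dấu Tiếng Việt trong kết quả tìm kiếm"]

# Diacritic-insensitive matching, but the excerpt keeps its accents.
print(search(posts, "tieng viet"))
# Diacritic-sensitive matching: the unaccented query no longer matches.
print(search(posts, "tieng viet", strip=False))
```

With stripping turned off, a Vietnamese community would only get matches for correctly accented queries, which is the behaviour proposed above.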

FYI English speakers:

dấu: A sign
dàu: Head
dãu: Pudding
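
To make the collision concrete, here is a small Python sketch using the standard `unicodedata` module. It is an illustration only, not Discourse's code, and the helper name `strip_diacritics` is hypothetical. It shows that the three spellings above become indistinguishable once combining marks are removed.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose to NFD, then drop every combining mark (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


words = ["dấu", "dàu", "dãu"]
stripped = {strip_diacritics(w) for w in words}

# All three words collapse to the same unaccented string.
print(stripped)
```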

So yeah this is a pretty giant issue for Vietnamese.


Discourse should ignore if a character is accented when doing a search
(Phạm Quyết Nghị) #4

I see a site that does display diacritics in its search results.
For example, this page: See details

Where is this configured? Thanks


(Sam Saffron) #5

Not yet, I will add an option today, hold tight.

The reason it looks “good” on old search results is that those posts have not yet been indexed using the new algorithm. If you edit any of the posts there, the bug will pop up.


(Sam Saffron) #6

This is fixed per:

After you rebuild with this commit, make sure you run:

./launcher enter app
rake search:reindex

You can try rebuilding in 10-20 minutes, once the commit reaches tests-passed.

