Chinese search excerpts appear broken

When I search for Chinese text in my forum, the search results show broken sentences: punctuation is missing, there is unexpected whitespace between words, and some words are missing entirely.

For example, I searched for 管理员. The original sentence is:

管理人员可见的分类。只有管理员和版主才能阅览主题 (roughly: “A category visible to staff. Only administrators and moderators can view topics.”)

But what I saw in the search result looked like the following.

As you can see, 可见的 is missing, and the full stop is also missing, which breaks the sentence. 只有 and 才能 are also missing, and there is unexpected whitespace in between.

Can someone help me with this issue? Thanks


It seems like those missing characters are considered stop words in the Chinese language:

(byebug) data = CppjiebaRb.segment(search_data, mode: mode)
["管理人员", "可见", "的", "分类", "。", "只有", "管理员", "和", "版主", "才能", "阅览", "主题"]
(byebug) CppjiebaRb.filter_stop_word(data)
["管理人员", "分类", "管理员", "版主", "阅览", "主题"]

Wait, so the bug here is that the “summary” in the result looks strange? Not that there is an actual functional problem with search?

Yup, search is still working; it is just that the excerpt being shown is not ideal. For Chinese, search is handled a bit differently: instead of ignoring stop words during the search query itself, we exclude them completely from the search data.
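For anyone following along, here is a minimal sketch of that flow, assuming the cppjieba_rb gem shown in the byebug session above (the mode: :mix argument is my assumption, not necessarily what Discourse passes):

require "cppjieba_rb"

text = "管理人员可见的分类。只有管理员和版主才能阅览主题"

# Segment the sentence into tokens.
segments = CppjiebaRb.segment(text, mode: :mix)
# e.g. ["管理人员", "可见", "的", "分类", "。", "只有", "管理员", "和", "版主", "才能", "阅览", "主题"]

# Filtering stop words drops 可见, 的, 只有, 和, 才能 and the punctuation,
# which is why they never make it into the indexed search data.
filtered = CppjiebaRb.filter_stop_word(segments)
# e.g. ["管理人员", "分类", "管理员", "版主", "阅览", "主题"]

puts filtered.join(" ")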


Thanks for looking into this.

This is not a stop word in Chinese; it is an adjective meaning “visible”.

Is it possible to fix this issue (including everything in the search results)? Or is there any workaround for this?

Thanks.


Stop words are words that are very common and hurt search performance.

“And”, for example, is a stop word in English.
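A toy illustration of the idea, with a made-up stop word list (real dictionaries, such as the ones PostgreSQL ships per language, are much larger):

# Toy example only; the stop word list here is made up for illustration.
STOP_WORDS = %w[and the a of]

tokens = "only admins and moderators".split
tokens.reject { |word| STOP_WORDS.include?(word) }
# => ["only", "admins", "moderators"]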

It is possible to fix this, but it will take a few months for us to get to it. In the meantime, if you need to rush a fix, there is the marketplace.

Marking as pr-welcome


The stop words are determined using https://github.com/yanyiwu/cppjieba (the C++ version of the “Jieba” Chinese word segmentation library). With the recent changes to how search excerpts are displayed, we should just remove the following line, since it messes with the actual search data.

Either way, our search support for Chinese is not great, but there are PG extensions that we may want to consider so that we can properly support languages that do not have native support. Perhaps https://pgroonga.github.io/?


Thanks! Let me try this and see how it goes.

@tgxworld I'm not sure I understand it correctly; it looks to me like pgroonga doesn't support Chinese and Japanese. From https://pgroonga.github.io/:

PostgreSQL supports full text search against languages that use only alphabet and digit. It means that PostgreSQL doesn’t support full text search against Japanese, Chinese and so on. You can use super fast full text search feature against all languages by installing PGroonga into your PostgreSQL!

@tgxworld I created a PR per your suggestion https://github.com/discourse/discourse/pull/11530

The sense is the exact opposite of that. Normal PostgreSQL does not support Chinese and Japanese. PGroonga adds support for those languages.


BTW @riking, just to confirm: Discourse currently implements full-text search using PostgreSQL's built-in functions, as in https://github.com/discourse/discourse/blob/1cf057fb1c4e168ce441ddde918636725abeb668/lib/search.rb#L911

Is that correct?
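For reference, “PostgreSQL built-in functions” here means the tsvector/tsquery full-text machinery. A much-simplified sketch of that kind of query (hypothetical, not the actual code at the linked line):

# Simplified illustration of querying a tsvector column with PostgreSQL's
# built-in full-text operators; not Discourse's actual query.
term = "管理员"

posts = Post
  .joins("JOIN post_search_data ON post_search_data.post_id = posts.id")
  .where("post_search_data.search_data @@ plainto_tsquery('simple', ?)", term)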


Sorry for not being more specific earlier. If we really want to fix this for now, we need to ensure that we don't remove stop words for Chinese from the search data, while still ensuring that stop words are removed when they appear in the search query.


@tgxworld I don’t understand the difference between the search data and the search query. Can you please provide more details? Thanks

If we add stop words to the index, it bloats the index and hurts search performance.

https://github.com/discourse/discourse/blob/d2a04621862aa7f7fc283112d542648e9f3fcab8/app/models/post_search_data.rb#L12-L13

There are two columns that we store in the PostSearchData table: #search_data is used when querying against search terms, and #raw_data is what we use when displaying the search excerpt. The fix here should be that Chinese stop words are not removed from #raw_data while still being removed from #search_data.
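In other words, a hypothetical sketch of the proposed split (not the actual indexing code):

require "cppjieba_rb"

text = "管理人员可见的分类。只有管理员和版主才能阅览主题"
segments = CppjiebaRb.segment(text, mode: :mix)

# raw_data keeps every segment, stop words and punctuation included,
# so the search excerpt can show the original sentence intact.
raw_data = segments.join(" ")

# search_data still has stop words filtered out, keeping the index small
# and queries fast.
search_data = CppjiebaRb.filter_stop_word(segments).join(" ")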

Any progress on this bug?

I thought I made some changes here:

Is your locale set to zh_TW, zh_CN, or ja? If not, is search_tokenize_chinese_japanese_korean set to true?
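(Roughly, the check being described is something like the following; assumed logic, not the actual implementation.)

# Assumed logic, not the actual Discourse code: CJK tokenization applies
# when the default locale is a CJK locale or the site setting forces it on.
def cjk_tokenization_enabled?
  SiteSetting.search_tokenize_chinese_japanese_korean ||
    %w[zh_TW zh_CN ja].include?(SiteSetting.default_locale)
end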

We have a bypass here:
