When I search for Chinese text in my forum, the excerpts in the search results come back as broken sentences: punctuation is missing, there is unexpected whitespace between words, and some words are dropped entirely.
For example, I searched for 管理员. The original sentence is:
管理人员可见的分类。只有管理员和版主才能阅览主题
But what I saw in the search result looks like below.
As you can see, 可见的 is missing, and the full stop 。 is also missing, which breaks the sentence. 只有, 和, and 才能 are missing as well, and there is unexpected whitespace in between.
Yup, search is still working; it's just that the excerpt being shown is not ideal. For Chinese, search is handled a bit differently. Instead of ignoring stop words only in the search query, we exclude them completely from the search data.
The stop words are determined using cppjieba (GitHub: yanyiwu/cppjieba, the C++ version of the "结巴" (Jieba) Chinese word segmenter). With the recent changes to how search excerpts are displayed, we should just remove the following line since it messes with the actual search data.
Either way, our search support for Chinese is not great, but there are PG extensions we may want to consider so that we can properly support languages that PostgreSQL does not handle natively. Perhaps https://pgroonga.github.io/?
PostgreSQL supports full text search against languages that use only alphabet and digit. It means that PostgreSQL doesn’t support full text search against Japanese, Chinese and so on. You can use super fast full text search feature against all languages by installing PGroonga into your PostgreSQL!
Sorry for not being more specific earlier. If we really want to fix this for now, we need to ensure that we don't remove stop words for Chinese from the data used for excerpts, while still removing stop words from the search query.
There are two columns that we store in the PostSearchData table. #search_data is what we query search terms against. #raw_data is what we use when displaying the search excerpt. The fix here is that Chinese stop words should not be removed from #raw_data, but should still be removed from #search_data.
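To make the proposed split concrete, here is a minimal sketch in Python (not the actual Discourse code, which is Ruby). Everything in it is an assumption for illustration: the `STOP_WORDS` set is made up from the words reported missing above rather than taken from cppjieba, the sentence is segmented by hand instead of by a real tokenizer, and `PostSearchRow`, `build_row`, and `matches` are hypothetical names. It only shows the idea that stop words are stripped from what gets matched, while the text used for the excerpt is stored untouched.

```python
from dataclasses import dataclass

# Hypothetical stop-word list, for illustration only.
# The real list would come from cppjieba's dictionaries.
STOP_WORDS = frozenset({"可见", "的", "。", "只有", "和", "才能"})


@dataclass
class PostSearchRow:
    """Stand-in for one row of the PostSearchData table (assumed shape)."""
    search_data: str  # matched against search terms; stop words removed
    raw_data: str     # used for the excerpt; stored unmodified


def remove_stop_words(tokens):
    """Drop every token that appears in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def build_row(raw_text, tokens):
    """Index the segmented tokens without stop words, but keep the raw text intact."""
    return PostSearchRow(
        search_data=" ".join(remove_stop_words(tokens)),
        raw_data=raw_text,
    )


def matches(row, query_tokens):
    """Stop words are also removed from the query before matching."""
    terms = remove_stop_words(query_tokens)
    indexed = set(row.search_data.split())
    return bool(terms) and all(t in indexed for t in terms)


# The sentence from the report, segmented by hand (a real setup would use a tokenizer).
text = "管理人员可见的分类。只有管理员和版主才能阅览主题"
tokens = ["管理人员", "可见", "的", "分类", "。", "只有",
          "管理员", "和", "版主", "才能", "阅览", "主题"]

row = build_row(text, tokens)
```

With this split, a search for 管理员 (or for 只有 管理员, where 只有 is dropped from the query) still matches the row, and the excerpt can be cut from `row.raw_data`, which keeps 可见的, the full stop, and the other stop words in place.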