Chinese search excerpts appear broken

When I search for Chinese text in my forum, the search results show broken sentences: punctuation is missing, there are unexpected spaces between words, and some words are missing entirely.

For example, I searched for 管理员. The original sentence is:

管理人员可见的分类。只有管理员和版主才能阅览主题 (roughly: "A category visible to staff. Only administrators and moderators can read the topics.")

But the excerpt shown in the search result was broken.

可见的 is missing, and the full stop is also gone, which breaks the sentence. 只有, 和, and 才能 are missing as well, and there is unexpected whitespace between the remaining words.

Can someone help me with this issue? Thanks

2 Likes

It seems those missing words are being treated as stop words for Chinese:

(byebug) data = CppjiebaRb.segment(search_data, mode: mode)
["管理人员", "可见", "的", "分类", "。", "只有", "管理员", "和", "版主", "才能", "阅览", "主题"]
(byebug) CppjiebaRb.filter_stop_word(data)
["管理人员", "分类", "管理员", "版主", "阅览", "主题"]
3 Likes

Wait, so the bug here is that the “summary” in the result looks strange, not that there is an actual functional problem with search?

Yup, search itself still works; it's just that the excerpt being shown is not ideal. Chinese search is handled a bit differently: instead of ignoring stop words in the search query itself, we exclude them completely from the search data.
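
A minimal sketch of that preparation step, using the same cppjieba_rb calls shown in the byebug session above (the helper name and the :mix mode are assumptions, not the actual Discourse code):

require "cppjieba_rb"

# Hypothetical helper showing how CJK text becomes search data:
# segment the text, drop stop words, join the remaining tokens with spaces.
def prepare_cjk_search_data(text)
  words = CppjiebaRb.segment(text, mode: :mix)  # tokenize the Chinese text
  CppjiebaRb.filter_stop_word(words).join(" ")  # stop words never reach the stored data
end

prepare_cjk_search_data("管理人员可见的分类。只有管理员和版主才能阅览主题")
# => "管理人员 分类 管理员 版主 阅览 主题"  (this is what the excerpt ends up showing)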

3 Likes

Thanks for looking into this.

That is not a stop word in Chinese; it is an adjective that means "visible".

Is it possible to fix this issue (i.e. include everything in the search result excerpt)? Or is there a workaround?

Thanks.

1 Like

Stop words are words that are so common that they hurt search performance.

“And”, for example, is a stop word in English.
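
For illustration, this is how PostgreSQL's built-in English text search drops a stop word at index time (a standalone example using the pg gem, not Discourse code; the connection details are placeholders):

require "pg"

conn = PG.connect(dbname: "discourse")  # placeholder connection
# "and" is dropped by the 'english' configuration; only stemmed content words remain.
puts conn.exec("SELECT to_tsvector('english', 'cats and dogs')").getvalue(0, 0)
# => 'cat':1 'dog':3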

It is possible to fix this, but it will take a few months for us to get to it. In the meantime, if you need to rush a fix, there is #marketplace

Marking as #pr-welcome

2 Likes

The stop words are determined using https://github.com/yanyiwu/cppjieba. With the recent changes to how search excerpts are displayed, we should just remove the following line since it messes with the actual search data.

https://github.com/discourse/discourse/blob/1cf057fb1c4e168ce441ddde918636725abeb668/lib/search.rb#L75
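
For context, a paraphrase of the CJK branch around the linked line (not the verbatim code; variable names and the surrounding lines are approximations), with the line in question marked:

# Inside Search.prepare_data, approximately:
mode = purpose == :query ? :query : :mix
data = CppjiebaRb.segment(search_data, mode: mode)
data = CppjiebaRb.filter_stop_word(data)  # <- the line proposed for removal
data = data.join(" ")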

Either way, our search support for Chinese is not great, but there are PostgreSQL extensions we may want to consider so that we can properly support languages that PostgreSQL's full-text search does not handle natively. Perhaps https://pgroonga.github.io/?

5 Likes

Thanks! Let me try this and see how it goes.

@tgxworld Not sure I understand this correctly, but it sounds like PGroonga doesn't support Chinese and Japanese. From https://pgroonga.github.io/:

PostgreSQL supports full text search against languages that use only alphabet and digit. It means that PostgreSQL doesn’t support full text search against Japanese, Chinese and so on. You can use super fast full text search feature against all languages by installing PGroonga into your PostgreSQL!

@tgxworld I created a PR per your suggestion https://github.com/discourse/discourse/pull/11530

The sense is the exact opposite of that. Normal PostgreSQL does not support Chinese and Japanese. PGroonga adds support for those languages.

3 Likes

BTW @riking, just to confirm: Discourse currently implements full-text search using PostgreSQL's built-in functions, as in https://github.com/discourse/discourse/blob/1cf057fb1c4e168ce441ddde918636725abeb668/lib/search.rb#L911

Is that correct?

1 Like

Sorry for not being more specific earlier. If we really want to fix this now, we need to ensure that we don't remove Chinese stop words from the search data, while still removing them when they appear in the search query.

1 Like

@tgxworld I don't understand the difference between the search data and the search query. Can you please provide more details? Thanks

If we add stop words to the index, it bloats the index and degrades search performance.

https://github.com/discourse/discourse/blob/d2a04621862aa7f7fc283112d542648e9f3fcab8/app/models/post_search_data.rb#L12-L13

We store two columns in the PostSearchData table: #search_data is used when querying against search terms, and #raw_data is what we use when displaying the search excerpt. The fix here should be that Chinese stop words are no longer removed from #raw_data while still being removed from #search_data.
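
A minimal sketch of that split (the method name here is hypothetical, not the actual SearchIndexer code):

require "cppjieba_rb"

# Hypothetical indexing step: raw_data keeps the original text so excerpts stay readable,
# while search_data has stop words filtered out to keep the index small.
def build_cjk_post_search_data(text)
  words = CppjiebaRb.segment(text, mode: :mix)
  {
    raw_data: text,                                             # kept intact for the excerpt
    search_data: CppjiebaRb.filter_stop_word(words).join(" ")   # stop words dropped for querying
  }
end

build_cjk_post_search_data("管理人员可见的分类。只有管理员和版主才能阅览主题")
# => { raw_data: "管理人员可见的分类。只有管理员和版主才能阅览主题",
#      search_data: "管理人员 分类 管理员 版主 阅览 主题" }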

Any progress on this bug?

I thought I made some changes here:

Is your locale set to zh_TW, zh_CN, or ja? If not, is search_tokenize_chinese_japanese_korean set to true?

We have a bypass here:
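
Roughly, that bypass amounts to a check like this (a sketch based on the conditions above; the method name and exact form are assumptions):

# Hypothetical check deciding whether CJK segmentation is used for search data:
# either the default locale is Chinese/Japanese or the site setting is enabled.
def segment_cjk?
  %w[zh_TW zh_CN ja].include?(SiteSetting.default_locale) ||
    SiteSetting.search_tokenize_chinese_japanese_korean
end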

2 Likes