The word_count column of Post and Topic seems to be calculated directly from the number of spaces, which is completely inappropriate for languages like Chinese, Japanese, and Korean that do not use spaces.
This is usually not a big problem because word_count is rarely used, but I ran into trouble with the AI summary backfill minimum word count: long Chinese posts are filtered out, while short posts mixing Chinese and English (and therefore containing many spaces) get summarized.
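To illustrate the problem (this is a sketch, not the actual Discourse implementation), a naive whitespace-based word count undercounts CJK text and overcounts mixed-language text:

```ruby
# Naive space-based word count: splits on whitespace only.
def naive_word_count(text)
  text.split.size
end

naive_word_count("The quick brown fox")          # => 4
naive_word_count("这是一个很长的中文句子")          # => 1, regardless of length
naive_word_count("中文 mixed with English 文本")  # => 5, inflated by spaces
```

So a long pure-Chinese post scores 1, while a short mixed post easily clears any minimum-word-count threshold.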
I think we should either use a word segmenter that supports multiple languages, or simply use the character count in places like the AI summary backfill minimum word count.
Hmm, if we are smart about our pipeline we could use cppjieba.
It would require that update_index! take care of this:
Char count is probably the simplest thing, though, given that reading the word bla is far faster than reading supercalifragilisticexpialidocious.
I wonder if you can make a PR that changes this so we lean on char count? Then we could divide the char count by, say, 4 for English and 2 for Chinese (via some setting).
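A minimal sketch of that heuristic (the names and setting values here are hypothetical, not Discourse's actual API): approximate the word count from the character count, divided by a per-locale factor that a site setting could expose.

```ruby
# Hypothetical per-locale "average chars per word" factors,
# meant to stand in for a configurable site setting.
CHARS_PER_WORD = { "en" => 4, "zh" => 2 }
CHARS_PER_WORD.default = 4

# Approximate word count as char count / locale factor, rounded up.
def approximate_word_count(text, locale)
  (text.length.to_f / CHARS_PER_WORD[locale]).ceil
end

approximate_word_count("supercalifragilisticexpialidocious", "en") # => 9
approximate_word_count("这是一个中文句子", "zh")                     # => 4
```

The divisor only needs to be roughly right, since it feeds a threshold rather than anything user-facing.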
For Chinese, character count is the most common way to measure text length. In terms of implementation, we could use a regular expression to match Chinese characters and count them directly. This approach is efficient enough and matches Chinese usage conventions. Although keeping the name word_count while counting characters might be a bit confusing, we could clarify that point in the description of the relevant settings.
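A sketch of that regex approach (a hypothetical helper, not existing Discourse code): count CJK ideographs as one "word" each, and count the remaining non-CJK text by whitespace as before, so mixed-language posts are measured sensibly.

```ruby
# Count CJK ideographs individually, plus space-separated words
# in whatever is left after removing them.
def mixed_word_count(text)
  cjk_chars = text.scan(/\p{Han}/).size               # each Han char counts as a word
  non_cjk_words = text.gsub(/\p{Han}/, " ").split.size
  cjk_chars + non_cjk_words
end

mixed_word_count("这是中文 mixed with English")  # => 4 + 3 = 7
```

Ruby's `\p{Han}` Unicode property class covers Chinese characters; Japanese kana and Korean hangul would need `\p{Hiragana}`, `\p{Katakana}`, and `\p{Hangul}` added if we want the same treatment there.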