Improve word_count calculation for CJK posts, or use char count

The word_count column of Post and Topic appears to be calculated directly from the number of spaces, which is completely inappropriate for languages like Chinese, Japanese, and Korean that do not use spaces between words.

This is not a big problem in most places, since word_count is rarely used, but I ran into trouble with the AI summary backfill minimum word count: long Chinese posts are filtered out, while short posts mixing Chinese and English (with many spaces) get summarized.

I think we should either use a word segmenter that supports multiple languages, or simply use the character count in places like the AI summary backfill minimum word count.
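To illustrate the underlying issue, here is a minimal sketch (not Discourse's actual code) comparing a naive space-based word count against a plain character count; the helper name is made up for illustration:

```ruby
# Naive space-based word count: splits on whitespace runs.
def space_word_count(text)
  text.split(/\s+/).reject(&:empty?).size
end

chinese = "这是一个没有空格的中文句子"
english = "This is an English sentence with spaces"

space_word_count(chinese)  # => 1  (the whole sentence counts as one "word")
space_word_count(english)  # => 7
chinese.length             # => 13 (char count still tracks text length)
```

A space-based counter collapses any unsegmented CJK sentence to a single word, while the character count remains proportional to the actual amount of text.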


举个例子如果在数据资源管理器里检查这个帖子的单词数量会发现仅仅只有一个


(translation: for example, if you check the word count of this post in the Data Explorer, you will find that there is only one)

This is clearly wrong and may have been affecting the user’s reading time calculation, since read_time_word_count depends on word count.
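Assuming read time is derived roughly as word count divided by a words-per-minute setting (which is what the name read_time_word_count suggests; the exact formula here is a guess), an undercounted word_count makes CJK topics look far quicker to read than they are:

```ruby
# Hedged sketch: read time estimated as word_count / words_per_minute,
# rounded up. The default of 500 is an assumption for illustration.
def estimated_read_time_minutes(word_count, words_per_minute: 500)
  (word_count.to_f / words_per_minute).ceil
end

estimated_read_time_minutes(1)     # => 1  (a long Chinese post counted as one word)
estimated_read_time_minutes(2600)  # => 6  (the same post counted properly)
```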

Hmm, if we are smart about our pipeline we could use cppjieba.

It would require that update_index! take care of this.


Char count is probably the simplest thing though, given that reading the word "bla" is far faster than reading "supercalifragilisticexpialidocious".

I wonder if you can make a PR that changes things so we lean on char count; then we can divide the char count by, say, 4 for English and 2 for Chinese (via some setting)?
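A hypothetical sketch of that "divide char count via a setting" idea. The divisors (4 and 2), the constant, and the method name are all made up for illustration; a real patch would read the divisors from site settings:

```ruby
# Characters from the major CJK scripts, matched via Unicode
# script properties.
CJK_CHARS = /[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]/

def estimated_word_count(text, latin_chars_per_word: 4, cjk_chars_per_word: 2)
  cjk_count   = text.scan(CJK_CHARS).size
  # Everything that is neither CJK nor whitespace.
  latin_count = text.gsub(CJK_CHARS, "").gsub(/\s+/, "").size
  (latin_count / latin_chars_per_word.to_f).ceil +
    (cjk_count / cjk_chars_per_word.to_f).ceil
end

estimated_word_count("hello world")  # => 3  (10 letters / 4, rounded up)
estimated_word_count("举个例子")      # => 2  (4 characters / 2)
```

This keeps mixed-language posts roughly comparable, since each script contributes according to its own characters-per-word divisor.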

@lindsey this is an interesting topic for you.


For Chinese, character count is the most commonly used method for measuring text length. In terms of implementation, we could use a regular expression to filter Chinese characters and then directly count them. This approach is efficient enough and aligns with Chinese usage habits. Although naming it word_count instead of char_count might seem a bit confusing, perhaps we could clarify this point in the description of the relevant settings.
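A minimal sketch of that suggestion, assuming we count Han characters individually and fall back to space-splitting for the rest (the method name is hypothetical):

```ruby
# Count each Chinese character as one word via the \p{Han} Unicode
# property, then space-split whatever remains.
def cjk_aware_word_count(text)
  han_count  = text.scan(/\p{Han}/).size
  rest_count = text.gsub(/\p{Han}/, "").split(/\s+/).reject(&:empty?).size
  han_count + rest_count
end

cjk_aware_word_count("中文 and English 混合")  # => 6 (4 Han chars + 2 English words)
```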
