Improve word_count calculation for CJK posts, or use char count

The word_count column of Post and Topic appears to be calculated directly from the number of spaces, which is completely inappropriate for languages like Chinese, Japanese, and Korean that do not use spaces between words.

This is not a big problem in most places, since word_count is rarely used, but I ran into trouble with the AI summary backfill minimum word count: long Chinese posts are filtered out, while short posts mixing Chinese and English (with many spaces) get summarized.

I think we should either use a word segmenter that supports multiple languages, or simply use the character count in places like the AI summary backfill minimum word count.
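To illustrate the underlying issue, here is a minimal sketch (not Discourse's actual code) comparing a naive space-based word count against a plain character count; the helper name is made up for illustration:

```ruby
# Naive space-based word count: splits on whitespace runs.
def space_word_count(text)
  text.split(/\s+/).reject(&:empty?).size
end

chinese = "这是一个没有空格的中文句子"
english = "This is an English sentence with spaces"

space_word_count(chinese)  # => 1  (the whole sentence counts as one "word")
space_word_count(english)  # => 7
chinese.length             # => 13 (char count still tracks text length)
```

A space-based counter collapses any unsegmented CJK sentence to a single word, while the character count remains proportional to the actual amount of text.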


举个例子如果在数据资源管理器里检查这个帖子的单词数量会发现仅仅只有一个


(translation: for example, if you check the word count of this post in the Data Explorer, you will find that there is only one)

This is clearly wrong and may have been affecting the user’s reading time calculation, since read_time_word_count depends on word count.
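Assuming read time is derived roughly as word count divided by a words-per-minute setting (which is what the name read_time_word_count suggests; the exact formula here is a guess), an undercounted word_count makes CJK topics look far quicker to read than they are:

```ruby
# Hedged sketch: read time estimated as word_count / words_per_minute,
# rounded up. The default of 500 is an assumption for illustration.
def estimated_read_time_minutes(word_count, words_per_minute: 500)
  (word_count.to_f / words_per_minute).ceil
end

estimated_read_time_minutes(1)     # => 1  (a long Chinese post counted as one word)
estimated_read_time_minutes(2600)  # => 6  (the same post counted properly)
```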

Hmm, if we are smart about our pipeline we could use cppjieba.

It would require that update_index! take care of this.


Char count is probably the simplest thing though, given that reading the word "bla" is far faster than reading "supercalifragilisticexpialidocious".

I wonder if you can make a PR that changes things so we lean on char count; then we can divide the char count by, say, 4 for English and 2 for Chinese (via some setting)?
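A hypothetical sketch of that "divide char count via a setting" idea. The divisors (4 and 2), the constant, and the method name are all made up for illustration; a real patch would read the divisors from site settings:

```ruby
# Characters from the major CJK scripts, matched via Unicode
# script properties.
CJK_CHARS = /[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]/

def estimated_word_count(text, latin_chars_per_word: 4, cjk_chars_per_word: 2)
  cjk_count   = text.scan(CJK_CHARS).size
  # Everything that is neither CJK nor whitespace.
  latin_count = text.gsub(CJK_CHARS, "").gsub(/\s+/, "").size
  (latin_count / latin_chars_per_word.to_f).ceil +
    (cjk_count / cjk_chars_per_word.to_f).ceil
end

estimated_word_count("hello world")  # => 3  (10 letters / 4, rounded up)
estimated_word_count("举个例子")      # => 2  (4 characters / 2)
```

This keeps mixed-language posts roughly comparable, since each script contributes according to its own characters-per-word divisor.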

@lindsey this is an interesting topic for you.


For Chinese, character count is the most commonly used method for measuring text length. In terms of implementation, we could use a regular expression to filter Chinese characters and then directly count them. This approach is efficient enough and aligns with Chinese usage habits. Although naming it word_count instead of char_count might seem a bit confusing, perhaps we could clarify this point in the description of the relevant settings.
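A minimal sketch of that suggestion, assuming we count Han characters individually and fall back to space-splitting for the rest (the method name is hypothetical):

```ruby
# Count each Chinese character as one word via the \p{Han} Unicode
# property, then space-split whatever remains.
def cjk_aware_word_count(text)
  han_count  = text.scan(/\p{Han}/).size
  rest_count = text.gsub(/\p{Han}/, "").split(/\s+/).reject(&:empty?).size
  han_count + rest_count
end

cjk_aware_word_count("中文 and English 混合")  # => 6 (4 Han chars + 2 English words)
```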
