Improve the word count calculation for CJK posts, or use character count

Lhc_fl · August 24, 2025, 3:10pm

The word_count column of Post and Topic is calculated directly from the number of spaces, which does not work at all for languages such as Chinese, Japanese and Korean that do not use spaces.

This is not a big problem, since word_count is rarely used, but I ran into an issue with AI summary backfill minimum word count. Long Chinese posts are filtered out, while short posts that mix Chinese and English (with many spaces) get summarized.

I think we should use a word segmenter that supports multiple languages, or simply use character count for something like AI summary backfill minimum word count.

Lhc_fl · August 24, 2025, 3:10pm

For example, if you check the word count of this post in the Data Explorer, you will find that it is only one word

Lhc_fl · August 24, 2025, 3:14pm

举个例子如果在数据资源管理器里检查这个帖子的单词数量会发现仅仅只有一个

(translation: for example, if you check the word count of this post in the Data Explorer, you will find that there is only one)

This is clearly wrong and may have been affecting the user’s reading time calculation, since read_time_word_count depends on word count.
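
To illustrate the behaviour described above, here is a minimal sketch that assumes a purely whitespace-based count. `whitespace_word_count` is a hypothetical helper written for this example, not the actual Discourse code.

```ruby
# Hypothetical helper: counts whitespace-delimited tokens,
# which is the behaviour the post above describes.
def whitespace_word_count(text)
  text.split.size
end

chinese = "举个例子如果在数据资源管理器里检查这个帖子的单词数量会发现仅仅只有一个"
english = "for example, if you check the word count of this post"

puts whitespace_word_count(chinese) # one token, because the sentence has no spaces
puts whitespace_word_count(english) # one token per space-separated word
puts chinese.length                 # the character count is much larger than 1
```

The Chinese sentence and its English translation carry roughly the same information, yet a space-based count treats the Chinese one as a single word.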

sam · August 24, 2025, 11:37pm

Hmm, if we are smart about our pipeline we could use cppjieba.

github.com/discourse/discourse

lib/search.rb

a8ed5b19f


      
          segments = CppjiebaRb.segment(match_data.to_s, mode: :mix)
          
          segments = CppjiebaRb.filter_stop_word(segments) if ts_config != "english"
          
          segments = segments.filter { |s| s.present? }
          segmented_data << segments.join(" ")

It would require that update_index! would take care of this:

github.com/discourse/discourse

app/services/search_indexer.rb

a8ed5b19f


      
          def self.update_index(table:, id:, a_weight: nil, b_weight: nil, c_weight: nil, d_weight: nil)

char count is probably the simplest thing though, given that reading the word bla is far faster than reading supercalifragilisticexpialidocious

I wonder if you could make a PR that changes this so we lean on char count; we could then divide the char count by, say, 4 for English and 2 for Chinese (via some setting)?
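
A rough sketch of that idea, under stated assumptions: the divisors 4 and 2 come from the suggestion above, while `CHARS_PER_WORD`, `DEFAULT_CHARS_PER_WORD`, `estimated_word_count` and the locale keys are hypothetical names for illustration, not existing Discourse settings or methods.

```ruby
# Hypothetical per-locale divisors, following the "4 for English,
# 2 for Chinese" suggestion. In practice these would come from a setting.
CHARS_PER_WORD = { "en" => 4, "zh_CN" => 2 }.freeze
DEFAULT_CHARS_PER_WORD = 4

# Estimates a word count from the number of non-whitespace characters.
def estimated_word_count(text, locale)
  divisor = CHARS_PER_WORD.fetch(locale, DEFAULT_CHARS_PER_WORD)
  visible_chars = text.gsub(/\s+/, "").length
  (visible_chars.to_f / divisor).ceil
end

puts estimated_word_count("bla", "en")
puts estimated_word_count("supercalifragilisticexpialidocious", "en")
puts estimated_word_count("你好世界", "zh_CN")
```

Whitespace is stripped before counting so that spacing style does not inflate the estimate, and rounding up keeps very short posts from counting as zero words. Both are design choices of this sketch, not something the thread specifies.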

@lindsey this is an interesting topic for you.

pangbo · August 25, 2025, 11:24am

For Chinese, character count is the most common way to measure text length. In terms of implementation, we could use a regular expression to pick out Chinese characters and then count them directly. This approach is efficient enough and matches how Chinese users think about length. Although calling it word_count rather than char_count may look slightly confusing, perhaps we can clarify this in the description of the relevant settings.
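
A minimal sketch of the regex approach, assuming that each Han character counts as one unit and that the remaining text is counted as alphanumeric runs. `HAN` and `mixed_length_count` are hypothetical names for this example; Ruby's `\p{Han}` script property is used to match Chinese characters.

```ruby
# Matches a single Han (Chinese) character via Ruby's Unicode script property.
HAN = /\p{Han}/

# Hypothetical helper: every Han character counts as one unit, and whatever
# remains is counted as runs of alphanumeric characters.
def mixed_length_count(text)
  han_chars = text.scan(HAN).size
  # Replace Han characters with spaces so they act as separators
  # for any Latin words around them.
  other_words = text.gsub(HAN, " ").scan(/[[:alnum:]]+/).size
  han_chars + other_words
end

puts mixed_length_count("hello world")
puts mixed_length_count("你好，世界")
puts mixed_length_count("Discourse 支持中文")
```

This sketch only handles Han characters. Japanese kana and Korean Hangul would need their own handling (for example `\p{Hiragana}`, `\p{Katakana}` and `\p{Hangul}`), and words with apostrophes such as "don't" are split into two runs here.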
