Improve word_count calculation for CJK posts, or use char_count

Lhc_fl · 24 August 2025 at 15:10

The word_count column of Post and Topic appears to be calculated directly from the number of spaces, which is completely inappropriate for languages such as Chinese, Japanese, and Korean that do not use spaces.

This is not a big problem because word_count is rarely used, but I ran into trouble with AI summary backfill minimum word count. Long Chinese posts are filtered out, while short posts that mix Chinese and English (with many spaces) get summarized.

I think we should either use a word segmenter that supports multiple languages, or simply use the character count in places like AI summary backfill minimum word count.

Lhc_fl · 24 August 2025 at 15:10

To give an example: if you checked the word count of this post in the Data Explorer, you would find that there is only one.

Lhc_fl · 24 August 2025 at 15:14

举个例子如果在数据资源管理器里检查这个帖子的单词数量会发现仅仅只有一个

(translation: for example, if you check the word count of this post in the Data Explorer, you will find that there is only one)

This is clearly wrong and may have been affecting the user’s reading time calculation, since read_time_word_count depends on word count.
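The single-word result can be reproduced with a plain whitespace split. This is only a sketch: the thread says word_count "seems" to be based on spaces, so `whitespace_word_count` is a hypothetical stand-in, not Discourse's actual implementation.

```ruby
# Hypothetical helper: counts whitespace-separated tokens, which is how the
# thread suspects word_count behaves. Not taken from the Discourse codebase.
def whitespace_word_count(text)
  text.split.length
end

chinese = "举个例子如果在数据资源管理器里检查这个帖子的单词数量会发现仅仅只有一个"
english = "for example, if you check the word count of this post"

# The Chinese post contains no spaces, so it is treated as a single token.
puts whitespace_word_count(chinese) # prints 1
puts whitespace_word_count(english)
```

Any minimum-word-count filter built on such a number would discard the Chinese post regardless of its real length.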

sam · 24 August 2025 at 23:37

Hmm, if we are smart about our pipeline we could use cppjieba.

github.com/discourse/discourse

lib/search.rb

a8ed5b19f


      
          segments = CppjiebaRb.segment(match_data.to_s, mode: :mix)
          segments = CppjiebaRb.filter_stop_word(segments) if ts_config != "english"
          segments = segments.filter { |s| s.present? }
          segmented_data << segments.join(" ")

It would require that update_index! would take care of this:

github.com/discourse/discourse

app/services/search_indexer.rb

a8ed5b19f


      
          def self.update_index(table:, id:, a_weight: nil, b_weight: nil, c_weight: nil, d_weight: nil)

Char count is probably the simplest thing though, given that reading the word "bla" is far faster than reading "supercalifragilisticexpialidocious".

I wonder if you could make a PR that changes this so we lean on char count? Then we could divide the char count by, say, 4 for English and 2 for Chinese (via some setting).
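A rough sketch of that idea, under stated assumptions: the divisors 4 and 2 are the example values from this post, `CHARS_PER_WORD` and `estimated_word_count` are hypothetical names, and ignoring whitespace and rounding up are illustrative choices. No such setting exists yet.

```ruby
# Hypothetical divisors from the suggestion above; in practice they would
# come from a site setting rather than a constant.
CHARS_PER_WORD = { english: 4.0, chinese: 2.0 }.freeze

# Estimates a word count from the character count.
# Assumption: whitespace is ignored so spacing does not inflate the estimate,
# and the result is rounded up.
def estimated_word_count(text, language)
  divisor = CHARS_PER_WORD.fetch(language)
  chars = text.gsub(/\s+/, "").length
  (chars / divisor).ceil
end

puts estimated_word_count("hello world", :english) # 10 chars / 4 -> 3
puts estimated_word_count("你好世界", :chinese)      # 4 chars / 2 -> 2
```

With this approach a long Chinese post is no longer treated as a single word, and a short mixed-language post no longer gains words simply by containing spaces.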

@lindsey this is an interesting topic for you.

pangbo · 25 August 2025 at 11:24

For Chinese, character count is the most commonly used way to measure text length. As for the implementation, we could use a regular expression to filter out the Chinese characters and then count them directly. This approach is efficient enough and matches Chinese usage habits. Although it may seem a little confusing to call it word_count rather than char_count, we could clarify this point in the description of the relevant settings.
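A minimal sketch of the regular-expression approach, assuming Ruby's `\p{Han}` Unicode script property is an acceptable definition of "Chinese characters". `han_char_count` is a hypothetical helper name.

```ruby
# Hypothetical helper: counts characters in the Han script.
# Note: \p{Han} does not match Japanese kana or Korean hangul, so full CJK
# coverage would need additional script properties.
def han_char_count(text)
  text.scan(/\p{Han}/).length
end

puts han_char_count("你好世界")           # prints 4
puts han_char_count("Hello 世界 world")   # prints 2
puts han_char_count("no han here")       # prints 0
```

Punctuation, Latin letters, and spaces are skipped, so mixed Chinese and English posts would still need a separate count for the non-Han part.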
