Don't allow super long words if there is a word length maximum

(Helperhaps) #1

I got an invalid title alert when I created a topic. Digging into the code, I found that it is raised by the TextSentinel class. I have already read this. There is a related method in the class:

  def seems_unpretentious?
    # Don't allow super long words if there is a word length maximum
    @opts[:max_word_length].blank? || @text.split(/\s|\/|-|\./).map(&:size).max <= @opts[:max_word_length]
  end

I understand what it means. But in other languages such as Chinese or Japanese, words are not separated by whitespace or by any of the other characters in that pattern.
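To make the problem concrete, here is a minimal plain-Ruby sketch that applies the same split pattern to an English and a Chinese title. The helper name `longest_word_length` and the sample titles are made up for illustration; they are not part of TextSentinel.

```ruby
# Same separator pattern as in seems_unpretentious?:
# whitespace, "/", "-" and ".".
SEPARATORS = /\s|\/|-|\./

# Hypothetical helper: length in characters of the longest "word"
# that the pattern produces (0 for an empty string).
def longest_word_length(text)
  text.split(SEPARATORS).map(&:size).max.to_i
end

english = "How do I reset my password"
chinese = "如何重新设置我的登录密码" # no separators, so it stays one piece

puts longest_word_length(english) # longest piece is "password"
puts longest_word_length(chinese) # the whole title counts as one "word"
```

The Chinese title is never split, so its full length is compared against the word length limit.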

Besides, SiteSetting.title_max_word_length’s value is 0 when I leave it blank in the settings panel, so the expression @opts[:max_word_length].blank? is never true (0 is not blank). The blank check is meaningless.
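A small sketch of what seems to happen with a stored value of 0. As far as I know, ActiveSupport's `blank?` returns false for every number, including 0, so the check is not skipped. To keep this runnable without Rails, `blankish?` below is a hypothetical stand-in that covers only nil, strings and numbers, and `word_length_ok?` is a simplified re-implementation of the method quoted above, not the real code.

```ruby
# Stand-in for ActiveSupport's blank?, limited to the cases relevant here:
# nil and whitespace-only strings are blank, numbers (including 0) are not.
def blankish?(value)
  value.nil? || (value.is_a?(String) && value.strip.empty?)
end

# Simplified, hypothetical version of the check from the post above.
def word_length_ok?(text, max_word_length)
  return true if blankish?(max_word_length)
  longest = text.split(/\s|\/|-|\./).map(&:size).max.to_i
  longest <= max_word_length
end

puts word_length_ok?("hello world", nil) # limit missing: check is skipped
puts word_length_ok?("hello world", 0)   # limit 0: every non-empty title fails
puts word_length_ok?("hello world", 255) # generous limit: title passes
```

Under these assumptions, a limit of 0 rejects every non-empty title instead of turning the check off.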

Most of my users write in Chinese, so I have to set SiteSetting.title_max_word_length to the same value as SiteSetting.title_max_topic_title_length to make it work.

Is there some other way to solve it?

(Sam Saffron) #2

@fantasticfears any idea what to do here?

(Erick Guan) #3

For CJK, it is meaningless. A word segmentation algorithm would happily chop a sentence into characters and words wherever it can. Without understanding the sentences, I’m afraid it’s not easy to tell good words from bad ones.

3 ways to disable it:

  1. Add a locale check.
  2. Add a comment explaining how to disable it by setting the value to the same as the topic title length.
  3. Add another setting to disable it.

The latter two are much better, although locale-based site settings should be introduced at some point for convenience.
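A rough sketch of how options 1 and 3 could look as a guard in front of the word length check. Everything here is an assumption for illustration: the option keys `:check_word_length` and `:locale`, the method name, and the list of locales are hypothetical and do not exist in TextSentinel.

```ruby
# Locales assumed, for illustration only, to write without
# space-delimited words.
UNSEGMENTED_LOCALES = %w[zh_CN zh_TW ja].freeze

# Decides whether the word length check should run at all.
def enforce_word_length?(opts)
  # Option 3: an explicit switch to disable the check.
  return false if opts[:check_word_length] == false
  # Option 1: skip the check for locales without word separators.
  return false if UNSEGMENTED_LOCALES.include?(opts[:locale])
  # Otherwise run it only when a positive limit is configured.
  max = opts[:max_word_length]
  !max.nil? && max > 0
end

puts enforce_word_length?({ max_word_length: 30, locale: "en" })
puts enforce_word_length?({ max_word_length: 30, locale: "zh_CN" })
puts enforce_word_length?({ max_word_length: 30, check_word_length: false })
```

Treating a limit of 0 as "no limit" is an extra design choice in this sketch, not something the current code does.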

BTW, if my mentor @tgxworld agrees, I could ask for some pointers on how this can be done sometime in June.

(Helperhaps) #4

How about enabling it only when all the characters in the title are ASCII?
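Something like the following sketch, which relies on Ruby's built-in `String#ascii_only?`. The method name `title_word_length_ok?` and the sample titles are hypothetical, and this is a standalone illustration rather than a patch to TextSentinel.

```ruby
# Same separator pattern as in seems_unpretentious?.
WORD_SEPARATORS = /\s|\/|-|\./

# Applies the word length limit only to titles made entirely of ASCII
# characters. Any title containing non-ASCII characters skips the check.
def title_word_length_ok?(title, max_word_length)
  return true unless title.ascii_only?
  title.split(WORD_SEPARATORS).map(&:size).max.to_i <= max_word_length
end

puts title_word_length_ok?("Short words only", 10)
puts title_word_length_ok?("Pneumonoultramicroscopic topic", 10)
puts title_word_length_ok?("如何重新设置我的登录密码", 5)
puts title_word_length_ok?("Discourse 安装问题", 5)
```

Note that a mixed title such as the last one skips the check entirely, ASCII parts included.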

(Erick Guan) #5

I think this would have to wait until we introduce a Unicode filtering library instead of regexes. I could take this on for 1.6, along with Unicode usernames.

It doesn’t make much sense to check the ASCII parts within Chinese sentences. The only plausible use case is a multilingual forum, which might need such checks based on the character sequences in the title (even then, it shouldn’t look into ASCII sequences inside Chinese sentences, for example).