Hmm, if we are smart about our pipeline we could use cppjieba.
It would require update_index! to take care of this:
Char count is probably the simplest thing though, given that reading the word "bla" is far faster than reading "supercalifragilisticexpialidocious".
I wonder if you could make a PR that changes this so we lean on char count instead? Then we can divide the char count by, say, 4 for English and 2 for Chinese (via some setting).
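
Roughly what I have in mind, as a quick sketch in Python just to show the arithmetic (not the actual implementation). The names `CHARS_PER_WORD`, `DEFAULT_CHARS_PER_WORD` and `estimated_words` are made up for illustration, the locale keys are placeholders, and the divisors are only the ballpark values suggested above. In practice they would come from a setting.

```python
# Hypothetical per-locale divisors: roughly how many characters make one "word".
# These are the ballpark values from the discussion, not measured constants.
CHARS_PER_WORD = {"en": 4, "zh_CN": 2}
DEFAULT_CHARS_PER_WORD = 4


def estimated_words(text: str, locale: str = "en") -> float:
    """Estimate a word count from the character count alone."""
    # Skip whitespace so spacing between words does not inflate the count.
    char_count = sum(1 for ch in text if not ch.isspace())
    divisor = CHARS_PER_WORD.get(locale, DEFAULT_CHARS_PER_WORD)
    return char_count / divisor


print(estimated_words("bla"))
print(estimated_words("supercalifragilisticexpialidocious"))
print(estimated_words("你好世界", locale="zh_CN"))
```

With this, a long word weighs more than a short one, and no word segmentation is needed for Chinese.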
@lindsey this is an interesting topic for you.