When searching for Chinese keywords, terms in the middle of a sentence do not turn up in search results. Searching on the first few words of a sentence, however, works.
This suggests that we need to turn on a site-wide Chinese locale to enable such searches. Is this still the case?
We're currently on the English locale but need to support multiple languages.
Are there any plugins that can include rare translated terms like that automatically?
There are many such terms specific to our field of photonics, but we don't keep track of them because there are so many.
Or we could try to find such a glossary on the internet and send it to you in some format, but that might be a very time-consuming exercise for us.
I see. Here's the text for the example:
Topic: 能否设计量子阱材料 (Can quantum well materials be designed?)
Search terms: 量子阱 (quantum well), 设计量子阱 (design a quantum well), 量子阱材料 (quantum well material), 材料 (material), 设计 (design)
Is it possible to enable it for Korean and Japanese as well?
thanks!
We will have it deployed on the business tier next week. The site setting search_tokenize_chinese_japanese_korean will enable the CJK tokenizer for search regardless of locale.
For the change to take effect, you will have to enable the site setting and edit the topic in question (to refresh the search index).
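If it helps to verify, here's a rough way to check whether a post's search data was actually refreshed after the edit. This is only a sketch: it assumes the standard Discourse schema (the post_search_data table) and uses a placeholder post id.

```sql
-- Inspect the indexed tokens for a post (12345 is a placeholder id).
SELECT search_data
FROM post_search_data
WHERE post_id = 12345;
-- With the CJK tokenizer enabled and the post re-edited, search_data
-- should contain segmented terms such as 量子阱 rather than one long,
-- unsegmented run of characters.
```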
Hello, that has been done, but some unusual problems seem to be occurring.
Searches for many Chinese terms which appear on the website are not returning any results. One thing I am noticing is that problems seem more likely to occur in searches for terms that include traditional characters (as used in Taiwan) for which there are simplified equivalents (as used in China). In some cases I don't get results if the term includes one such character and fewer than three characters for which there is no such equivalent (i.e., characters that are the same in Taiwan and China).
Some things I'm seeing: 台北 (Taipei) gets plenty of hits, but 台灣 (Taiwan) doesn't get any; the character 灣 is one with a simplified equivalent (湾). (I changed our minimum search length setting to 2 to accommodate such Chinese place names/terms.)
顆老鼠屎壞了一鍋湯 (from the proverb "one mouse dropping ruins the whole pot of soup") gets results, but some smaller subsets that include a character with a simplified equivalent don't. A lot of common terms, like the name of President Lee I mentioned, 李登輝 (Lee Teng-hui), aren't getting any results at all, not even the test post I made a few days ago (which was being found previously). Even a "search this category" search in that category or thread does not find the post. The same is true of 李登, which would seem to contradict my traditional/simplified theory. Perhaps I'm barking up the wrong tree with that; I don't know.
Something else odd seems to be happening in English searches that I did not notice previously. For example, a search on "anyone who I know" (no quotes) is returning a lot of results for "I", even matching words in which that letter merely appears. I'm not sure whether this is expected or not.
I hope this makes some sense. Thanks for the assistance.
My first guess is that the "dictionary" being used for Postgres searches is omitting what would be Chinese "stop words" (words so common they are considered of little to no value for searches).
But I do not know where or how Discourse configures which dictionary to use based on the locale, so I cannot support or discredit the guess.
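For what it's worth, the stop-word behaviour is easy to see directly in Postgres. This is only an illustration with the built-in english configuration; it is not necessarily the exact configuration your Discourse database uses.

```sql
-- The built-in 'english' configuration drops stop words such as "who" and "I":
SELECT to_tsvector('english', 'anyone who I know');
-- => 'anyon':1 'know':4

-- ts_debug shows how each token is classified and which dictionary handles it:
SELECT token, dictionaries, lexemes
FROM ts_debug('english', 'anyone who I know');

-- The server's default configuration can be checked with:
SHOW default_text_search_config;
```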
I don’t understand Chinese but maybe you can see something here?
I would guess that concept couldn't be applied to Chinese searches. In a nutshell, there aren't characters common enough, in the way that "the", for example, is common in English, for searching on them to be unprofitable, and treating any character as a stop word would make it impossible to search for any term that includes it.
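As a point of comparison (again just an illustration, not necessarily what Discourse does), Postgres's built-in simple configuration does no stop-word filtering at all, which is closer to what you'd want for a language where every character can carry meaning:

```sql
-- The 'simple' configuration keeps every token, including "who" and "I":
SELECT to_tsvector('simple', 'anyone who I know');
-- => 'anyone':1 'i':3 'know':4 'who':2
```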
I don't fully understand it, but it seems to say that PostgreSQL tokenized searches require the installation of Bamboo, and it gives some instructions for doing so.