Searching Chinese terms in middle of sentence


#1

When searching for Chinese keywords, terms in the middle of a sentence do not turn up in search results. Searching on the first few words of a sentence, however, works.

Here’s a topic:

Searching for terms in the middle of a sentence doesn’t work:

Including the first few words from the beginning of the sentence works:

This phenomenon is mentioned in the first half of this thread:
https://meta.discourse.org/t/can-t-search-with-chinese-keywords/4500/10?source_topic_id=35440

It suggests that we need to turn on the site-wide Chinese locale in order to enable such searches. Is this still the case?
We’re currently on the English locale but need to support multiple languages.

Thanks.


#2

Actually, I think this might be because “量子阱” (quantum well) is a rare term and not recognized as a keyword?
Anyway, is there any viable way we can improve this?


(Sam Saffron) #3

Can you post the exact terms/sentence being used? I can enable a mixed search tokenizer behind a site setting.


#4

Are there any plugins that can include rare translated terms like that automatically?
There are many such terms specific to our field in photonics, but we don’t keep track of them because there are so many.
Alternatively, we could try to find a glossary list on the internet and send it to you in some format, but that might be a very time-consuming exercise for us.


(Sam Saffron) #5

We have a tokenizer that is enabled for the Chinese locale; I can add a site setting to enable it unconditionally.

I just need some examples pasted here of sentences and expected word breaks so I can confirm it works well. I can’t use the pics; I need it in text.


#6

I see. Here’s the text for the example:
Topic: 能否设计量子阱材料
search terms: 量子阱, 设计量子阱, 量子阱材料, 材料, 设计
Is it possible to enable it for Korean and Japanese as well?
Thanks!
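To see why the expected word breaks above matter, here is a minimal sketch of forward maximum matching, a common baseline for Chinese word segmentation (the dictionary below is a hypothetical toy example; a real tokenizer ships a much larger lexicon):

```python
# Toy dictionary for illustration only; a real segmenter uses a large lexicon.
DICTIONARY = {"能否", "设计", "量子阱", "材料", "量子"}
MAX_WORD_LEN = 3  # length of the longest dictionary entry

def segment(text):
    """Greedily match the longest dictionary word at each position;
    fall back to a single character when nothing matches."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in DICTIONARY:
                tokens.append(candidate)
                i += length
                break
    return tokens

print(segment("能否设计量子阱材料"))  # ['能否', '设计', '量子阱', '材料']
```

Note that if a rare term like 量子阱 is missing from the dictionary, this kind of segmenter splits it into smaller pieces (e.g. 量子 / 阱), which is consistent with the "rare term not recognized as a keyword" hypothesis in post #2.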


(Sam Saffron) #7

This is now completed per:

We will have it deployed on the business tier next week. The site setting search_tokenize_chinese_japanese_korean will enable the CJK tokenizer for search regardless of locale.

To take effect you will have to enable the site setting and edit the topic in question (to refresh the search index)
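For readers wondering why mid-sentence terms were unsearchable in the first place, here is a rough model of the behaviour (a sketch for illustration, not Discourse’s actual code): without a CJK tokenizer, a full run of CJK characters is indexed as a single lexeme, so prefix-style matching can only succeed from the start of the sentence.

```python
# Rough model of the indexing difference; not Discourse's actual code.
def index_without_cjk(text):
    return {text}  # the whole CJK run becomes one lexeme

def index_with_cjk(text, words):
    # Stand-in for the tokenizer enabled by
    # search_tokenize_chinese_japanese_korean.
    return set(words)

def prefix_match(lexemes, term):
    """Prefix matching, analogous to a `term:*` style query."""
    return any(lex.startswith(term) for lex in lexemes)

topic = "能否设计量子阱材料"
naive = index_without_cjk(topic)
tokenized = index_with_cjk(topic, ["能否", "设计", "量子阱", "材料"])

print(prefix_match(naive, "能否"))       # True  - first words match
print(prefix_match(naive, "量子阱"))     # False - mid-sentence term misses
print(prefix_match(tokenized, "量子阱")) # True  - found after tokenization
```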


#8

Hi, am I reading this right? Does this mean the only way to ensure that old posts turn up in such searches is by editing them individually? Thanks.


(Sam Saffron) #9

Once enabled, we can kick off another reindex for you.


#10

Hello, that has been done, but some unusual problems seem to be occurring.

Searches for many Chinese terms which appear on the website are not returning any results. One thing I am noticing is that problems seem more likely to occur in searches for terms that include traditional characters (as used in Taiwan) which have simplified equivalents (as used in China). In some cases I don’t get results if the term includes one such character and fewer than three characters that have no such equivalent (i.e., characters that are the same in Taiwan and China).

Some things I’m seeing: 台北 (Taipei) gets plenty of hits, but 台 (Taiwan) doesn’t get any. (I changed our minimum search length setting to 2 to accommodate such Chinese place names/terms; the characters with simplified equivalents were bolded in my original post.)

老鼠屎了一鍋湯 gets results, but some smaller subsets including a bolded character don’t. A lot of common terms, like the name of President Lee I mentioned, 李登, aren’t getting any results at all, not even the test post I made a few days ago (which was being found previously). Even a “search this category” search in that category or thread does not find the post. The same is true of 李登, which would seem to contradict my traditional/simplified theory. Perhaps I’m barking up the wrong tree with that; I don’t know.

Something else odd seems to be happening in English searches that I did not notice previously. For example, a search on “anyone who I know” (no quotes) is returning a lot of results for “I”, even where the letter only appears inside other words. I’m not sure if this would be expected or not.

I hope this makes some sense. Thanks for the assistance.


(Mittineague) #11

My first guess is that the “dictionary” being used for Postgres searches is omitting what would be Chinese “stop words” (words that are so common they are considered to be of little to no value for searches).

But I do not know where or how Discourse configures which dictionary to use based on the locale, so I can’t support or discredit the guess.

I don’t understand Chinese but maybe you can see something here?

https://code.google.com/archive/p/nlpbamboo/wikis/TSearch2.wiki

I find the documentation fairly bewildering even when it’s in English, which I can read, but hopefully you find it easier going.


#12

I would guess that concept couldn’t be applied to Chinese searches. In a nutshell, there aren’t characters that are common enough (in the way that “the”, for example, is common) that searching for them would be unprofitable, and treating them as stop words would make it impossible to search for any term that includes such a character.

I don’t :slight_smile: It seems to say that PostgreSQL tokenized searches require the installation of Bamboo, and it gives some instructions for doing so.
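The stop-word guess, and the objection to it, can both be illustrated with a toy sketch (the stop lists here are hypothetical examples, not Postgres’s actual dictionaries):

```python
# Toy stop lists for illustration; not Postgres's real dictionaries.
EN_STOP_WORDS = {"the", "a", "an", "of", "is"}

def to_lexemes(tokens, stop_words):
    """Drop stop words, the way a text-search dictionary would."""
    return [t for t in tokens if t.lower() not in stop_words]

# English: dropping "the" is harmless for most queries...
print(to_lexemes(["the", "cat"], EN_STOP_WORDS))  # ['cat']
# ...but a query made only of stop words becomes empty: zero hits.
print(to_lexemes(["the"], EN_STOP_WORDS))         # []

# Chinese: if a common character were stop-listed, every term
# containing it would become unsearchable, as post #12 points out.
ZH_STOP_WORDS = {"台"}  # hypothetical, to show the failure mode
print(to_lexemes(["台"], ZH_STOP_WORDS))          # [] - no results at all
```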


(Sam Saffron) #13

Can I get a couple of sentences in context so I can test the word splitter?


#14

Sure, here are some with the characters for Taiwan, let me know if you need other examples:

台灣籍

但我爸爸是台灣公民

台灣沙發衝浪客交流

台灣是中國的一部份

紐約時報曾形容台灣對亞洲同志來說

但我的台灣護照已過期一年多了


(Sam Saffron) #15

Can you make it a bit easier for me: 3 strings of, say, 20 chars or so that exhibit the problem?


#16

Sure. I’m starting work now, but I will test for a few such strings and post back later.


#17

This string doesn’t find the first post above (Getting Married):

國外健檢證明須經中華民國駐外館處驗證

I’ve found one so far. It seems hard to find such long strings that fit the bill; most work. I’m still looking.

This string 亦應視同受承辦檢察官所選 任或囑託而執行鑑定業務 doesn’t find this post (most other strings did):

http://tw.forumosa.com/t/zain-dean-conviction-fatal-hit-run-case-part-ii/64402


#18

Here’s another:

收元大證券馬志玲兩億元賄賂,協助元大購併復華金

doesn’t find this post:


(Sam Saffron) #19

I think our lines are crossed a bit, so let’s start with an example like this:

Sentence: iliketaiwanalot
word: taiwan

I am trying to test whether our tokenizer, which turns “iliketaiwanalot” into “i like taiwan a lot”, is working correctly.
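As background for readers: one common locale-independent approach to CJK search is overlapping bigram indexing. Here is a sketch of that general technique (not necessarily what Discourse’s tokenizer does):

```python
# Sketch of CJK bigram indexing; a general technique, not necessarily
# what Discourse's tokenizer actually does.
def bigrams(text):
    """All overlapping two-character windows, e.g. 台灣籍 -> 台灣, 灣籍."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

def matches(document, query):
    if len(query) < 2:
        return query in document  # single characters need a direct scan
    doc_grams = set(bigrams(document))
    return all(g in doc_grams for g in bigrams(query))

doc = "我非常喜歡台灣"
print(matches(doc, "台灣"))      # True  - found mid-sentence
print(matches(doc, "喜歡台灣"))  # True  - longer spans match too
print(matches(doc, "台北"))      # False - no such bigram in the document
```

Note the single-character case: 台 on its own produces no bigrams at all and needs the fallback scan, which is related to the minimum-search-length issue raised earlier in the thread.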


#20

I think so :slight_smile:

I hope I have this right now: 3 sentences, 20 characters or so, including the term Taiwan?

我非常喜歡台灣 is “I like Taiwan a lot.”

我們的總統目前不敢說台灣是個獨立的國家

Our president currently does not dare to say that Taiwan is an independent country.

Punctuation OK?

台灣主要的城市包括台北、台中、台南、高雄、台東和花蓮

Taiwan’s main cities include Taipei, Taichung, Tainan, Kaohsiung, Taidong and Hualian

在我認識的人中,大部分不知道台灣在哪裡或以為是泰國

Of the people I know, most don’t know where Taiwan is or think it is Thailand