Searching Chinese terms in middle of sentence


#1

When searching for Chinese keywords, terms in the middle of a sentence do not turn up in search results. Searching on the first few words of a sentence, however, works.

Here’s a topic:

Searching for terms in the middle of a sentence doesn’t work:

Including the first few words from the beginning of the sentence works:

This phenomenon is mentioned in the first half of this thread:
https://meta.discourse.org/t/can-t-search-with-chinese-keywords/4500/10?source_topic_id=35440

It suggests that we need to turn on the site-wide Chinese locale in order to enable such searches. Is this still the case?
We’re currently on the English locale but need to support multiple languages.

Thanks.


#2

Actually, I think this might be because “量子阱” (quantum well) is a rare term and not recognized as a keyword?
Anyway, is there any viable way we can improve this?


(Sam Saffron) #3

Can you post the exact terms/sentence being used? I can enable a mixed search tokenizer behind a site setting.


#4

Are there any plugins that can include rare translated terms like that automatically?
There are many such terms specific to our field in photonics, but we don’t keep track of them because there are so many.
Alternatively, we could try to find a glossary list on the internet and send it to you in some format, but that might be a very time-consuming exercise for us.


(Sam Saffron) #5

We have a tokenizer that is enabled for the Chinese locale; I can add a site setting to enable it unconditionally.

I just need some examples pasted here of sentences and expected word breaks so I can confirm it works well. I can’t use the pics; I need it in text.


#6

I see. Here’s the text for the example:
Topic: 能否设计量子阱材料
search terms: 量子阱, 设计量子阱, 量子阱材料, 材料, 设计
Is it possible to enable it for Korean and Japanese as well?
Thanks!
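To see why the expected word breaks above matter, here is a minimal sketch of forward maximum matching, a common baseline for Chinese word segmentation (the dictionary below is a hypothetical toy example; a real tokenizer ships a much larger lexicon):

```python
# Toy dictionary for illustration only; a real segmenter uses a large lexicon.
DICTIONARY = {"能否", "设计", "量子阱", "材料", "量子"}
MAX_WORD_LEN = 3  # length of the longest dictionary entry

def segment(text):
    """Greedily match the longest dictionary word at each position;
    fall back to a single character when nothing matches."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in DICTIONARY:
                tokens.append(candidate)
                i += length
                break
    return tokens

print(segment("能否设计量子阱材料"))  # ['能否', '设计', '量子阱', '材料']
```

Note that if a rare term like 量子阱 is missing from the dictionary, this kind of segmenter splits it into smaller pieces (e.g. 量子 / 阱), which is consistent with the "rare term not recognized as a keyword" hypothesis in post #2.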


(Sam Saffron) #7

This is now completed per:

We will have it deployed on the business tier next week. The site setting search_tokenize_chinese_japanese_korean will enable the CJK tokenizer for search regardless of locale.

To take effect you will have to enable the site setting and edit the topic in question (to refresh the search index)
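For readers wondering why mid-sentence terms were unsearchable in the first place, here is a rough model of the behaviour (a sketch for illustration, not Discourse’s actual code): without a CJK tokenizer, a full run of CJK characters is indexed as a single lexeme, so prefix-style matching can only succeed from the start of the sentence.

```python
# Rough model of the indexing difference; not Discourse's actual code.
def index_without_cjk(text):
    return {text}  # the whole CJK run becomes one lexeme

def index_with_cjk(text, words):
    # Stand-in for the tokenizer enabled by
    # search_tokenize_chinese_japanese_korean.
    return set(words)

def prefix_match(lexemes, term):
    """Prefix matching, analogous to a `term:*` style query."""
    return any(lex.startswith(term) for lex in lexemes)

topic = "能否设计量子阱材料"
naive = index_without_cjk(topic)
tokenized = index_with_cjk(topic, ["能否", "设计", "量子阱", "材料"])

print(prefix_match(naive, "能否"))       # True  - first words match
print(prefix_match(naive, "量子阱"))     # False - mid-sentence term misses
print(prefix_match(tokenized, "量子阱")) # True  - found after tokenization
```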


#8

Hi, am I reading this right? Does this mean the only way to ensure that old posts turn up in such searches is by editing them individually? Thanks.


(Sam Saffron) #9

Once enabled, we can kick off another reindex for you.


#10

Hello, that has been done, but some unusual problems seem to be occurring.

Searches for many Chinese terms which appear on the website are not returning any results. One thing I am noticing is that problems seem more likely to occur in searches for terms that include traditional characters (as used in Taiwan) which have simplified equivalents (as used in China). In some cases I don’t get results if the term includes one such character and fewer than three characters that have no such equivalent (i.e., characters that are the same in Taiwan and China).

Some things I’m seeing: 台北 (Taipei) gets plenty of hits, but 台 (Taiwan) doesn’t get any. (I changed our minimum search length setting to 2 to accommodate such Chinese place names/terms; the characters with simplified equivalents were bolded in my original post.)

老鼠屎了一鍋湯 gets results, but some smaller subsets including a bolded character don’t. A lot of common terms, like the name of President Lee I mentioned, 李登, aren’t getting any results at all, not even the test post I made a few days ago (which was being found previously). Even a “search this category” search in that category or thread does not find the post. The same is true of 李登, which would seem to contradict my traditional/simplified theory. Perhaps I’m barking up the wrong tree with that; I don’t know.

Something else odd seems to be happening in English searches that I did not notice previously. For example, a search on “anyone who I know” (no quotes) is returning a lot of results for “I”, even where the letter only appears inside other words. I’m not sure if this would be expected or not.

I hope this makes some sense. Thanks for the assistance.


(Mittineague) #11

My first guess is that the “dictionary” being used for Postgres searches is omitting what would be Chinese “stop words” (words that are so common they are considered to be of little to no value for searches).

But I do not know where or how Discourse configures which dictionary to use based on the locale, so I can’t support or discredit the guess.

I don’t understand Chinese but maybe you can see something here?

https://code.google.com/archive/p/nlpbamboo/wikis/TSearch2.wiki

I find the documentation fairly bewildering even when it’s in English, which I can read, but hopefully you find it easier going.


#12

I would guess that concept couldn’t be applied to Chinese searches. In a nutshell, there aren’t characters that are common enough (in the way that “the”, for example, is common) that searching for them would be unprofitable, and treating them as stop words would make it impossible to search for any term that includes such a character.

I don’t :slight_smile: It seems to say that PostgreSQL tokenized searches require the installation of Bamboo, and it gives some instructions for doing so.
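The stop-word guess, and the objection to it, can both be illustrated with a toy sketch (the stop lists here are hypothetical examples, not Postgres’s actual dictionaries):

```python
# Toy stop lists for illustration; not Postgres's real dictionaries.
EN_STOP_WORDS = {"the", "a", "an", "of", "is"}

def to_lexemes(tokens, stop_words):
    """Drop stop words, the way a text-search dictionary would."""
    return [t for t in tokens if t.lower() not in stop_words]

# English: dropping "the" is harmless for most queries...
print(to_lexemes(["the", "cat"], EN_STOP_WORDS))  # ['cat']
# ...but a query made only of stop words becomes empty: zero hits.
print(to_lexemes(["the"], EN_STOP_WORDS))         # []

# Chinese: if a common character were stop-listed, every term
# containing it would become unsearchable, as post #12 points out.
ZH_STOP_WORDS = {"台"}  # hypothetical, to show the failure mode
print(to_lexemes(["台"], ZH_STOP_WORDS))          # [] - no results at all
```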


(Sam Saffron) #13

Can I get a couple of sentences in context so I can test the word splitter?


#14

Sure, here are some with the characters for Taiwan, let me know if you need other examples:

台灣籍

但我爸爸是台灣公民

台灣沙發衝浪客交流

台灣是中國的一部份

紐約時報曾形容台灣對亞洲同志來說

但我的台灣護照已過期一年多了


(Sam Saffron) #15

Can you make it a bit easier for me: 3 strings of, say, 20 chars or so that exhibit the problem?


#16

Sure. I’m starting work now, but I will test for a few such strings and post back later.


#17

This string doesn’t find the first post above (Getting Married):

國外健檢證明須經中華民國駐外館處驗證

I’ve found one so far. It seems hard to find such long strings that fit the bill; most work. I’m still looking.

This string 亦應視同受承辦檢察官所選 任或囑託而執行鑑定業務 doesn’t find this post (most other strings did):

http://tw.forumosa.com/t/zain-dean-conviction-fatal-hit-run-case-part-ii/64402


#18

Here’s another:

收元大證券馬志玲兩億元賄賂,協助元大購併復華金

doesn’t find this post:


(Sam Saffron) #19

I think our lines are crossed a bit, so let’s start with an example like this:

Sentence: iliketaiwanalot
word: taiwan

I am trying to test whether our tokenizer, which turns “iliketaiwanalot” into “i like taiwan a lot”, is working correctly.
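As background for readers: one common locale-independent approach to CJK search is overlapping bigram indexing. Here is a sketch of that general technique (not necessarily what Discourse’s tokenizer does):

```python
# Sketch of CJK bigram indexing; a general technique, not necessarily
# what Discourse's tokenizer actually does.
def bigrams(text):
    """All overlapping two-character windows, e.g. 台灣籍 -> 台灣, 灣籍."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

def matches(document, query):
    if len(query) < 2:
        return query in document  # single characters need a direct scan
    doc_grams = set(bigrams(document))
    return all(g in doc_grams for g in bigrams(query))

doc = "我非常喜歡台灣"
print(matches(doc, "台灣"))      # True  - found mid-sentence
print(matches(doc, "喜歡台灣"))  # True  - longer spans match too
print(matches(doc, "台北"))      # False - no such bigram in the document
```

Note the single-character case: 台 on its own produces no bigrams at all and needs the fallback scan, which is related to the minimum-search-length issue raised earlier in the thread.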


#20

I think so :slight_smile:

I hope I have this right now: 3 sentences, 20 characters or so, including the term Taiwan?

我非常喜歡台灣 is “I like Taiwan a lot.”

我們的總統目前不敢說台灣是個獨立的國家

Our president currently does not dare to say that Taiwan is an independent country.

Punctuation OK?

台灣主要的城市包括台北、台中、台南、高雄、台東和花蓮

Taiwan’s main cities include Taipei, Taichung, Tainan, Kaohsiung, Taidong and Hualian

在我認識的人中,大部分不知道台灣在哪裡或以為是泰國

Of the people I know, most don’t know where Taiwan is or think it is Thailand