No search results are shown for stopwords

Hi friends, I discovered very strange problems on my website.

My website is an online dictionary that contains many words, so every topic is in the format of “WORD meaning-Pronunciation-Example”.

However, when I search for some words on my website, many of them show no results. For example, “off”, “few”, “able”, “add”, “age”, “but”, “gain”, “she”, “here”, “him”, “top”, “very”, “why” etc.

And for some other words, when I search the whole word, no result there; but when I search part of the word, the result appears. For example:

when I search “city”, no result; but when I search “cit”, the result appears;
when I search “policy”, no result; but when I search “poli”, the result appears;
when I search “industry”, no result; but when I search “indu”, the result appears;
when I search “should”, no result; but when I search “shou”, the result appears;
when I search “story”, no result; but when I search “sto”, the result appears;
when I search “memory”, no result; but when I search “memo”, the result appears;
when I search “pretty”, no result; but when I search “prett”, the result appears;
when I search “happy”, no result; but when I search “happ”, the result appears;

In total, about 200 words have this issue, and I believe that if I don’t do something, I will see more and more of them in the future.

So please, my friends, I NEED YOUR HELP :scream_cat:

Are you using our official Docker install?

@sam Hi Sam, I mean when I search these words in my OWN website, they have the problem.

For example:

But these topics DO exist. So I’m wondering why :worried:

Those are “stop” words

Stop words are words that are very common, appear in almost every document, and have no discrimination value. Therefore, they can be ignored in the context of full text searching. For example, every English text contains words like a and the, so it is useless to store them in an index. However, stop words do affect the positions in tsvector, which in turn affect ranking:
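To make the point about positions concrete, here is a minimal sketch in plain Python. It is a toy model, not PostgreSQL's implementation: `STOP_WORDS` is a small hand-picked subset chosen for illustration, `index_positions` is a hypothetical helper, and no stemming is applied.

```python
# Toy model of stop-word removal that preserves word positions.
# Assumption: STOP_WORDS is an illustrative subset, not the real english.stop file.
STOP_WORDS = {"a", "in", "of", "the"}


def index_positions(text):
    """Map each non-stop word to the 1-based positions where it occurs."""
    positions = {}
    for pos, word in enumerate(text.lower().split(), start=1):
        if word in STOP_WORDS:
            continue  # the word is dropped, but the position counter still advances
        positions.setdefault(word, []).append(pos)
    return positions


result = index_positions("in the list of stop words")
print(result)  # {'list': [3], 'stop': [5], 'words': [6]}
```

The stop words leave no entries of their own, yet the surviving words keep positions 3, 5 and 6 rather than 1, 2 and 3. A query made up only of stop words has nothing left to match against.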

@Mittineague Thank you for your kind reply :slight_smile:

Is it possible to solve the problem? Since my website is an online dictionary, it would be a serious problem if my users cannot search for the English words they want to study.

No, it is impossible. There is no setting to disable stemming; if we added one, it would cause severe performance issues for search.

@sam :sob: can’t be worse if so… but anyway thank you for your help :+1:

Stemming and stop words are unrelated, though. @Jiaqi, I’m pretty sure you can do this, if you’re willing to rummage around enough, but it’ll involve learning a lot more about full-text search and PostgreSQL internals than you ever cared to know. It’s certainly not something that would ever be a “standard” feature, because it’s so incredibly niche. You’d be better off creating some sort of custom index plugin that kept a mapping of all the words and their associated topic IDs, and offered a custom search box to find the relevant topic.
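A rough sketch of what such a word-to-topic mapping might look like, in plain Python. Everything here is hypothetical (`ExactWordIndex`, its methods, the topic IDs); it assumes the headword is the first token of a title in the “WORD meaning-Pronunciation-Example” format described above, and it ignores how a real plugin would store the data or hook into the search box.

```python
from collections import defaultdict


class ExactWordIndex:
    """Hypothetical exact-match index: headword -> set of topic IDs."""

    def __init__(self):
        self._topics_by_word = defaultdict(set)

    def add_topic(self, topic_id, title):
        # Assumption: the headword is the first whitespace-separated token of the title.
        parts = title.split()
        if not parts:
            return  # nothing to index for an empty title
        self._topics_by_word[parts[0].lower()].add(topic_id)

    def lookup(self, query):
        # No stemming and no stop-word filtering: only the exact word matches.
        return sorted(self._topics_by_word.get(query.strip().lower(), set()))


index = ExactWordIndex()
index.add_topic(101, "off meaning-Pronunciation-Example")
index.add_topic(102, "city meaning-Pronunciation-Example")

print(index.lookup("off"))   # [101]
print(index.lookup("City"))  # [102]
print(index.lookup("cit"))   # []
```

Because the lookup bypasses full-text search entirely, words like “off” or “she” behave the same as any other headword.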

There sure is a lot more to it than I ever imagined.

The custom index could be doable. AFAIK there are 127 stops. A bit, but not overwhelmingly so.

https://apt-browse.org/browse/ubuntu/trusty/main/i386/postgresql-9.3/9.3.4-1/file/usr/share/postgresql/9.3/tsearch_data/english.stop

The trick, if you want to use a different stop word list (or not have any at all) is to define a custom dictionary and language profile (or whatever PostgreSQL calls it) that doesn’t have the stop words in it, and then reconfigure your full-text indexing to use that instead of the standard English one. I’ve done it once, a long long time ago, and I have no interest in repeating the experience.
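For anyone curious what that might involve, here is a hedged sketch that only prints the SQL statements. The names `english_nostop` and `english_stem_nostop` are made up for illustration, and the sketch assumes the stock `english` configuration and the `snowball` dictionary template. It does not cover how to make your forum software use the new configuration, and existing search data would need to be reindexed afterwards.

```python
def nostop_config_sql(config_name="english_nostop", dict_name="english_stem_nostop"):
    """Return SQL for an English text search configuration without a stop-word list.

    Both names are hypothetical; pick whatever suits your database.
    """
    # Token types that the stock english configuration routes to its stemmer.
    token_types = "asciiword, asciihword, hword_asciipart, word, hword, hword_part"
    return [
        # Leaving out the StopWords parameter means the dictionary discards nothing.
        f"CREATE TEXT SEARCH DICTIONARY {dict_name} "
        f"(TEMPLATE = snowball, Language = english);",
        # Start from a copy of the built-in configuration.
        f"CREATE TEXT SEARCH CONFIGURATION {config_name} (COPY = pg_catalog.english);",
        # Route word tokens through the stop-word-free dictionary instead.
        f"ALTER TEXT SEARCH CONFIGURATION {config_name} "
        f"ALTER MAPPING FOR {token_types} WITH {dict_name};",
    ]


for statement in nostop_config_sql():
    print(statement)
```

Note that this still stems words; it only removes the stop-word filtering, so it addresses the first half of the original problem at most.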

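Is there a way to force the index to include useful phrases that are made up of stop words (for example, “to do”)? In this case I can write it as ToDo, but that is a bit awkward. (My site is currently private, so Google is not an option.) If the index only covers individual words, I imagine this would be difficult. (In that case, I assume the front end does the final phrase parsing.)

Technically it is possible, but we have no plans to do it.

It would complicate indexing, add edge cases, and make search harder to explain.

I see what you are asking: where is the current search behavior fully explained?

The best places to look are the source code, especially our test suite, and the online PostgreSQL full-text search documentation.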