No search results are shown for stopwords

Hi friends, I've discovered a very strange problem on my website.

My website is an online dictionary that contains many words, so every topic is in the format of “WORD meaning-Pronunciation-Example”.

However, when I search for some words on my website, many of them show no results. For example, “off”, “few”, “able”, “add”, “age”, “but”, “gain”, “she”, “here”, “him”, “top”, “very”, “why” etc.

And for some other words, searching for the whole word gives no results, but searching for part of the word does. For example:

when I search “city”, no result; but when I search “cit”, the result appears;
when I search “policy”, no result; but when I search “poli”, the result appears;
when I search “industry”, no result; but when I search “indu”, the result appears;
when I search “should”, no result; but when I search “shou”, the result appears;
when I search “story”, no result; but when I search “sto”, the result appears;
when I search “memory”, no result; but when I search “memo”, the result appears;
when I search “pretty”, no result; but when I search “prett”, the result appears;
when I search “happy”, no result; but when I search “happ”, the result appears;

In total, about 200 words have this issue, and I believe that if I don’t do something, I will see more and more in the future.

So please, my friends, I NEED YOUR HELP :scream_cat:

Are you using our official Docker install?

@sam Hi Sam, I mean that when I search for these words on my OWN website, they have the problem.

For example:

But these topics DO exist. So I’m wondering why :worried:

Those are “stop” words

https://www.postgresql.org/docs/9.1/static/textsearch-dictionaries.html#TEXTSEARCH-STOPWORDS

Stop words are words that are very common, appear in almost every document, and have no discrimination value. Therefore, they can be ignored in the context of full text searching. For example, every English text contains words like a and the, so it is useless to store them in an index. However, stop words do affect the positions in tsvector, which in turn affect ranking:
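As a rough illustration of what the quoted passage describes (this is not the example from the PostgreSQL docs, and it does no stemming), here is a minimal Python sketch. The `STOP_WORDS` set is a small hand-picked subset and `index_terms` is a hypothetical helper, not anything PostgreSQL or Discourse exposes. It shows stop words being dropped while the surviving words keep their original positions, and why a query made of a single stop word ends up empty.

```python
import re

# Illustrative subset only (an assumption); the real english.stop file is longer.
STOP_WORDS = {"a", "the", "in", "of", "off", "few", "but", "she", "him", "very", "why"}


def index_terms(text, stop_words=STOP_WORDS):
    """Return (word, position) pairs for words that survive stop-word filtering.

    Positions are 1-based and counted over ALL tokens, so a dropped stop word
    still leaves a gap in the numbering.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return [
        (token, position)
        for position, token in enumerate(tokens, start=1)
        if token not in stop_words
    ]


print(index_terms("in the list of stop words"))  # [('list', 3), ('stop', 5), ('words', 6)]
print(index_terms("off"))                        # [] -> nothing left to search for
```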

@Mittineague Thank you for your kind reply :slight_smile:

Is it possible to solve the problem? Since my website is an online dictionary, it would be super complicated if my users cannot search for the English words they want to study.

No, it is impossible. There is no setting to disable stemming; if we added one, it would cause severe performance issues for search.

@sam :sob: can’t be worse if so… but anyway thank you for your help :+1:

Stemming and stop words are unrelated, though. @Jiaqi, I’m pretty sure you can do this, if you’re willing to rummage around enough, but it’ll involve learning a lot more about full-text search and PostgreSQL internals than you ever cared to know. It’s certainly not something that would ever be a “standard” feature, because it’s so incredibly niche. You’d be better off creating some sort of custom index plugin that kept a mapping of all the words and their associated topic IDs, and offered a custom search box to find the relevant topic.
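To give an idea of what such a word-to-topic mapping could look like, here is a minimal Python sketch. It is only the lookup logic, not a Discourse plugin: `build_word_index`, `lookup`, and the topic IDs are all made up for illustration, and it assumes every title starts with the headword, as in the “WORD meaning-Pronunciation-Example” format described above.

```python
from collections import defaultdict


def build_word_index(topics):
    """Map each lowercased headword to the IDs of the topics that define it.

    `topics` is an iterable of (topic_id, title) pairs; the headword is
    assumed to be the first whitespace-separated token of the title.
    """
    index = defaultdict(list)
    for topic_id, title in topics:
        parts = title.split()
        if not parts:
            continue  # skip empty titles
        index[parts[0].lower()].append(topic_id)
    return index


def lookup(index, query):
    """Exact, case-insensitive headword lookup: no stemming, no stop words."""
    return index.get(query.strip().lower(), [])


# Hypothetical topics; the IDs are invented for this example.
topics = [
    (101, "off meaning-Pronunciation-Example"),
    (102, "CITY meaning-Pronunciation-Example"),
    (103, "she meaning-Pronunciation-Example"),
]
index = build_word_index(topics)

print(lookup(index, "off"))    # [101]
print(lookup(index, "City "))  # [102]
print(lookup(index, "zebra"))  # []
```

Because the lookup is an exact match on the headword, words like “off” or “she” are treated the same as any other entry.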

There sure is a lot more to it than I ever imagined.

The custom index could be doable. AFAIK there are 127 stop words. Quite a few, but not overwhelmingly so.

https://apt-browse.org/browse/ubuntu/trusty/main/i386/postgresql-9.3/9.3.4-1/file/usr/share/postgresql/9.3/tsearch_data/english.stop

The trick, if you want to use a different stop word list (or not have any at all), is to define a custom dictionary and language profile (or whatever PostgreSQL calls it) that doesn’t have the stop words in it, and then reconfigure your full-text indexing to use that instead of the standard English one. I’ve done it once, a long long time ago, and I have no interest in repeating the experience.
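For anyone curious, the PostgreSQL half of that can be sketched roughly as below. This small Python helper only builds and prints the SQL statements; it does not connect to a database. The names `english_nostop` and `english_stem_nostop` are invented for the example, and getting Discourse's indexing to actually use the new configuration is a separate step that this sketch does not cover.

```python
def nostop_config_sql(config="english_nostop", dictionary="english_stem_nostop"):
    """Build SQL for an English text search configuration without stop words.

    Both names are hypothetical defaults chosen for this example.
    """
    # Token types that the built-in english configuration sends to its stemmer.
    token_types = "asciiword, asciihword, hword_asciipart, word, hword, hword_part"
    return [
        # Snowball stemmer created WITHOUT a stop word list, so it discards nothing.
        f"CREATE TEXT SEARCH DICTIONARY {dictionary} "
        "(TEMPLATE = snowball, Language = english);",
        # Start from a copy of the stock english configuration.
        f"CREATE TEXT SEARCH CONFIGURATION {config} (COPY = pg_catalog.english);",
        # Route word tokens through the stop-word-free dictionary instead.
        f"ALTER TEXT SEARCH CONFIGURATION {config} "
        f"ALTER MAPPING FOR {token_types} WITH {dictionary};",
    ]


for statement in nostop_config_sql():
    print(statement)
```

Existing search data would also need to be rebuilt with the new configuration before queries for words like “off” could match anything.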

Is there any way to force the index to index useful phrases like "to do" that are made of stop words, but in useful combinations? We can use ToDo in this case, but it is a little annoying. (Google is not an option, as we are keeping the site hidden for now.) If the index only indexes words, then I see that this would be challenging. (I assume in this case that the front-end does the final phrase parsing.)

Technically possible but not something we plan to do.

It would complicate our indexing, add edge cases, and make it harder to explain search.

I understand. Where is the current search behaviour fully explained?

The best place to look would be the source code, specifically our test suite, and the PostgreSQL full-text search documentation online.
