Custom dictionary for Postgres fulltext search

So here goes a tale about my quest to improve search on a Czech forum instance.

Unfortunately, the current PostgreSQL release (12) does not ship with a built-in Czech dictionary. Czech has 7 grammatical cases (i.e. a noun can take up to 7 different forms depending on context), so as you can imagine, search works quite poorly for us.

I figured out how to add the dictionary and make it work inside the Docker container (details below), but search still didn't use it, even after a reindex. It turns out that Discourse selects the dictionary based on the default_locale setting, and only does so for the languages that ship with Postgres; for all the others it uses the simple dictionary.

Pretty please, can we have an extra site setting to specify a custom search dictionary?
I verified that everything starts working once I edit lib/search.rb manually.
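
To make the request concrete, here is a rough sketch of the lookup order I have in mind: a custom override first, then the built-in configuration for the locale, then simple. This is not the actual code from lib/search.rb; the mapping table is an illustrative subset, and the `custom_ts_config` parameter (standing in for the proposed site setting) and the method name are made up.

```ruby
# Hypothetical sketch; not the actual code from lib/search.rb.
# Locales for which Postgres ships a text search configuration
# (illustrative subset only).
BUILTIN_TS_CONFIGS = {
  "da" => "danish",
  "de" => "german",
  "en" => "english",
  "es" => "spanish",
  "fr" => "french",
}.freeze

# Returns the name of the text search configuration to use.
# custom_ts_config stands in for the proposed site setting (assumed name).
def ts_config_for(locale, custom_ts_config: nil)
  custom = custom_ts_config.to_s.strip
  return custom unless custom.empty?

  # Locales without a built-in configuration fall back to "simple".
  BUILTIN_TS_CONFIGS.fetch(locale.to_s, "simple")
end

puts ts_config_for("en")                             # built-in configuration
puts ts_config_for("cs")                             # nothing built in, falls back
puts ts_config_for("cs", custom_ts_config: "czech")  # override wins
```

With an empty override the behaviour stays exactly as it is today, so existing sites would not be affected.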

Here’s what I did to add the dictionary, essentially replicating the steps from here:
https://postgres.cz/wiki/Instalace_PostgreSQL#Instalace_Fulltextu

and here:

sudo ./launcher enter app
# The following should also be added to containers/app.yml so it's executed on every container rebuild
curl -L https://github.com/freaz/docker-postgres-czech-unaccent/raw/master/czech_unaccent.tar.gz | tar -xzC /tmp/ && mv /tmp/fulltext_dicts/czech* /usr/share/postgresql/1?/tsearch_data/
sudo -u postgres psql discourse
CREATE TEXT SEARCH DICTIONARY czech
   (template=ispell, dictfile = czech_unaccent, afffile=czech_unaccent, stopwords=czech_unaccent);
CREATE TEXT SEARCH CONFIGURATION czech (copy=english);
ALTER TEXT SEARCH CONFIGURATION czech
  ALTER MAPPING FOR word, asciiword WITH czech, simple;
-- Verify that the czech configuration is listed
\dF
-- Test (the configuration created above is named czech)
select * from ts_debug('czech','Prilis zlutoucky kun se napil zlute vody');
Ctrl-D
rake search:reindex
Ctrl-D
# In containers/app.yml set 
db_default_text_search_config: "public.czech"
# rebuild

There is one more gotcha with diacritics. The approach above downloads a dictionary without diacritics, so it only works together with the search ignore accents setting turned on. If you want to search with diacritics, you should download the dictionary from https://postgres.cz/data/czech.tar.gz instead.

I think that this gotcha applies to other languages with default Postgres support as well. :frowning: If you strip accents, you’re essentially turning off the stemmer for your language for words containing accented characters. So it’s not at all clear whether this feature should be turned on for these languages.
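
A toy illustration of the problem, assuming a dictionary that contains only accented forms. The two-entry dictionary and the helper names are made up for the example; the real ispell and snowball dictionaries in Postgres are far more involved, but the failure mode is the same: once the accents are gone, the word no longer matches anything the dictionary knows.

```ruby
# Toy illustration only; not how Postgres dictionaries are implemented.
# Tiny accented dictionary mapping inflected forms to a base form.
LEMMAS = {
  "žluté" => "žlutý",
  "vody"  => "voda",
}.freeze

# Remove diacritics: decompose to NFD, then drop the combining marks.
def strip_accents(word)
  word.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

# Look the word up; unknown words pass through unchanged.
def lemma(word)
  LEMMAS.fetch(word, word)
end

puts lemma("žluté")                 # => žlutý (normalized)
puts lemma(strip_accents("žluté"))  # => zlute (no match, left as typed)
puts lemma(strip_accents("vody"))   # => voda  (no accents, still matches)
```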

@sam would you be open to a PR for the above?

Alternatively, we could modify the docker base image to include more Postgres search dictionaries by default, but that would be much harder.