So here goes a tale about my quest to improve search on a Czech forum instance.
Unfortunately, the current PostgreSQL release (12) does not ship with a built-in Czech dictionary. Czech has 7 grammatical cases (i.e. a noun can take a different form for each case, depending on context), so as you can imagine, search works quite poorly for us.
I figured out how to add the dictionary and make it work inside the Docker container (details below), but search still didn't improve, even after a reindex. It turns out that Discourse selects the dictionary based on the default_locale setting, and only does so for languages that ship with Postgres; it uses the simple dictionary for all the others.
Pretty please, can we have an extra site setting to specify a custom search dictionary?
I verified that everything starts working once I edit lib/search.rb manually.
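To make the request concrete, here is a minimal sketch of the selection behaviour described above, plus the kind of override I'm asking for. This is not the actual code from lib/search.rb: the method name `ts_config_for`, the `override` parameter (standing in for the proposed site setting) and the `BUILTIN_TS_CONFIGS` table are all hypothetical, and the table lists only a handful of the configurations Postgres ships with.

```ruby
# Illustrative subset of locale prefixes mapped to text search
# configurations that ship with PostgreSQL. Hypothetical table,
# not the one used by Discourse.
BUILTIN_TS_CONFIGS = {
  "da" => "danish",
  "de" => "german",
  "en" => "english",
  "es" => "spanish",
  "fr" => "french",
  "ru" => "russian",
}.freeze

# Picks a text search configuration for a locale such as "en_GB" or "cs".
# `override` stands in for the requested (hypothetical) site setting;
# when it is nil or blank, the locale-based lookup is used.
def ts_config_for(locale, override: nil)
  return override unless override.nil? || override.strip.empty?

  prefix = locale.to_s.split("_").first
  # Locales without a shipped configuration fall back to "simple".
  BUILTIN_TS_CONFIGS.fetch(prefix, "simple")
end
```

With this sketch, `ts_config_for("cs")` falls back to `"simple"`, while `ts_config_for("cs", override: "czech")` would pick up the custom configuration created in the steps below.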
Here’s what I did to add the dictionary, essentially replicating the steps from here:
https://postgres.cz/wiki/Instalace_PostgreSQL#Instalace_Fulltextu
and here:
https://github.com/tjelen/postgres-tsearch-czech
sudo ./launcher enter app
# The following should also be added to containers/app.yml so it's executed on every container rebuild
curl -L https://github.com/freaz/docker-postgres-czech-unaccent/raw/master/czech_unaccent.tar.gz | tar -xzC /tmp/ && mv /tmp/fulltext_dicts/czech* /usr/share/postgresql/1?/tsearch_data/
sudo -u postgres psql discourse
CREATE TEXT SEARCH DICTIONARY czech
  (template=ispell, dictfile=czech_unaccent, afffile=czech_unaccent, stopwords=czech_unaccent);
CREATE TEXT SEARCH CONFIGURATION czech (copy=english);
ALTER TEXT SEARCH CONFIGURATION czech
  ALTER MAPPING FOR word, asciiword WITH czech, simple;
-- Verify
\dF
-- Test
select * from ts_debug('czech','Prilis zlutoucky kun se napil zlute vody');
Ctrl-D
rake search:reindex
Ctrl-D
# In containers/app.yml set
db_default_text_search_config: "public.czech"
# rebuild
There is one more gotcha with diacritics. The approach above downloads a dictionary without diacritics, so it works together with the search ignore accents setting turned on. If you want to search with diacritics, download the dictionary from https://postgres.cz/data/czech.tar.gz instead.
I think that this gotcha applies to other languages with default Postgres support as well. If you strip accents, you’re essentially turning off the stemmer for your language for words containing accented characters. So it’s not at all clear whether this feature should be turned on for these languages.
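To illustrate the gotcha, here is a toy sketch of what happens when accents are stripped before the dictionary lookup. It does not use Postgres at all: `ACCENTED_DICT` is a made-up three-entry stand-in for an ispell dictionary, and `strip_accents` and `lemma` are hypothetical helpers that only mimic the idea of accent stripping followed by a dictionary lookup that passes unknown words through unchanged.

```ruby
# Toy stand-in for a dictionary built from accented word forms:
# each inflected form maps to its base form (lemma).
ACCENTED_DICT = {
  "kůň"   => "kůň",
  "koně"  => "kůň",
  "žluté" => "žlutý",
}.freeze

# Removes diacritics by decomposing characters (NFD) and dropping
# the combining marks.
def strip_accents(text)
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

# Returns the lemma when the word is in the dictionary; otherwise the
# word passes through unchanged, much like falling through to `simple`.
def lemma(word, dict)
  dict.fetch(word, word)
end

# The same dictionary with accents stripped from both forms and lemmas,
# analogous to using the czech_unaccent files.
UNACCENTED_DICT = ACCENTED_DICT.to_h do |form, base|
  [strip_accents(form), strip_accents(base)]
end.freeze
```

Here `lemma(strip_accents("koně"), ACCENTED_DICT)` just returns `"kone"`, because the stripped form is no longer in the accented dictionary, whereas `lemma(strip_accents("koně"), UNACCENTED_DICT)` returns `"kun"`. That is why accent stripping needs a matching unaccented dictionary to keep stemming working.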