So here goes a tale about my quest to improve search on a Czech forum instance.
Unfortunately, the current PostgreSQL (12) does not ship with a built-in Czech dictionary. Czech has 7 grammatical cases (i.e. a noun can take 7 different forms depending on context), so as you can imagine, search works quite poorly for us.
I figured out how to add the dictionary and make it work inside the Docker instance (details below), but search still didn't improve even after a reindex. It turns out that Discourse selects the dictionary based on the default_locale setting, and only does so for the languages which ship with Postgres; it uses the simple dictionary for all the others.
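To illustrate the behaviour described above, here is a rough sketch in shell. It is not Discourse's actual code (Discourse is written in Ruby); the function name ts_config_for_locale and the abbreviated language list are made up for the example.

```shell
#!/bin/sh
# Hypothetical sketch: map a default_locale value to a Postgres text
# search configuration, falling back to "simple" for anything unknown.
# The language list is abbreviated and purely illustrative.
ts_config_for_locale() {
  # Keep only the language part of the locale: "en_GB" -> "en"
  lang="${1%%_*}"
  case "$lang" in
    de) echo "german" ;;
    en) echo "english" ;;
    es) echo "spanish" ;;
    fr) echo "french" ;;
    ru) echo "russian" ;;
    *)  echo "simple" ;;  # Czech ("cs") ends up here
  esac
}

ts_config_for_locale cs      # prints: simple
ts_config_for_locale en_GB   # prints: english
```

With a mapping like this, a custom configuration added to the database is never picked up, no matter how it is named, because the locale has no entry of its own.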
Pretty please, can we have an extra site setting to specify a custom search dictionary?
I verified that everything starts working when I edited
Here’s what I did to add the dictionary, essentially replicating the steps from here:
    sudo ./launcher enter app

    # The following should also be added to containers/app.yml so it's executed on every container rebuild
    curl -L https://github.com/freaz/docker-postgres-czech-unaccent/raw/master/czech_unaccent.tar.gz | tar -xzC /tmp/ && mv /tmp/fulltext_dicts/czech* /usr/share/postgresql/1?/tsearch_data/

    sudo -u postgres psql discourse
    CREATE TEXT SEARCH DICTIONARY czech (template=ispell, dictfile=czech_unaccent, afffile=czech_unaccent, stopwords=czech_unaccent);
    CREATE TEXT SEARCH CONFIGURATION czech (copy=english);
    ALTER TEXT SEARCH CONFIGURATION czech ALTER MAPPING FOR word, asciiword WITH czech, simple;
    -- Verify
    \dF
    -- Test
    select * from ts_debug('czech', 'Prilis zlutoucky kun se napil zlute vody');
    Ctrl-D

    rake search:reindex
    Ctrl-D

    # In containers/app.yml set
    db_default_text_search_config: "public.czech"
    # rebuild
There is one more gotcha with diacritics. The approach above downloads a dictionary without diacritics, so it works together with the search ignore accents setting turned on. If you want to search with diacritics, you should download the dictionary from https://postgres.cz/data/czech.tar.gz instead.
I think that this gotcha applies to other languages with default Postgres support as well. If you strip accents, you’re essentially turning off the stemmer for your language for words containing accented characters. So it’s not at all clear whether this feature should be turned on for these languages.
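One way to check this for a particular language is to compare what the stemmer produces for an accented word and for the same word with accents stripped. The helper below is only a sketch: the name compare_sql is made up, it assumes the words contain no single quotes, and it only prints the SQL, so the result depends on your Postgres version and dictionaries.

```shell
#!/bin/sh
# Hypothetical helper: print a query that shows the lexemes Postgres
# produces for a word with accents and for its accent-stripped form.
# Arguments: <text search configuration> <accented word> <stripped word>
# Assumes none of the arguments contain a single quote.
compare_sql() {
  printf "SELECT to_tsvector('%s', '%s') AS accented, to_tsvector('%s', '%s') AS stripped;\n" \
    "$1" "$2" "$1" "$3"
}

# Print the query for a French example
compare_sql french 'étudiées' 'etudiees'
```

Inside the container, the output can be piped to the database, e.g. compare_sql french 'étudiées' 'etudiees' | sudo -u postgres psql discourse. If the two columns differ, stripping accents changes what the stemmer does for that word.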