Diacritics and search

icaria36 · November 9, 2017, 6:57am

Is it possible that Discourse’s search engine doesn’t know about fuzzy search with diacritics?

Diacritics are glyphs added to letters in many languages (notably not in English, except in borrowed foreign words, and often not even then). Speakers of languages with diacritics expect search engines to basically ignore them (e.g. searching for “Mexico” or “México” should return the same results).

In Discourse, diacritics seem to be critical:

https://confederac.io/search?q=poesia (no diacritic) provides zero results.
https://confederac.io/search?q=poesía (with diacritic) provides one result.

Languages with diacritics use them everywhere, but their speakers often omit them out of laziness or get them wrong. That makes a huge difference between finding the content you are looking for and not finding it.
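To make the expectation concrete, here is a minimal Python sketch of accent-insensitive matching. It is not how Discourse or PostgreSQL implement search; the function names `fold_diacritics` and `matches` are made up for illustration, and the approach (Unicode decomposition, then dropping combining marks) is just one common way to fold accents.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Return text with combining marks removed (hypothetical helper)."""
    # NFD splits a precomposed letter such as 'í' into 'i' + U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining mark, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def matches(query: str, document: str) -> bool:
    """Case-insensitive, accent-insensitive substring match (hypothetical helper)."""
    return fold_diacritics(query).casefold() in fold_diacritics(document).casefold()


print(fold_diacritics("poesía"))               # poesia
print(matches("poesia", "Una poesía nueva"))   # True
print(matches("Mexico", "Viaje a México"))     # True
```

Letters that have no Unicode decomposition, such as the Polish “ł”, are left unchanged by this approach, so a real solution would need per-language rules on top of it.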

Is Discourse using its own search engine or embedding an existing one, e.g. Elasticsearch? In any case, hopefully there is a library or something that Discourse developers could use to integrate diacritics fuzziness in search?

schungx · November 9, 2017, 10:09am

I remember that the full-text search index that gets built works best with English text… There have been a lot of discussions on meta regarding Chinese.

For other languages, you may want to search meta further. However, from what I remember reading here, it may involve changing the default locale of PostgreSQL or something similar.

codinghorror · November 9, 2017, 3:27pm

If you are talking about non-English sites, you need to set the database encoding to match the language the site is in.

icaria36 · November 9, 2017, 8:57pm

Right, that might explain it. Currently the locale is English. The plan is to change to Catalan after resolving the Catalan translation.

When the default locale of the site is changed, is the corresponding database encoding automatically changed?

codinghorror · November 10, 2017, 12:18am

No, you must set that at install time. How do we set database encoding during setup @techapj? I can’t recall.

MakaryGo · November 10, 2017, 12:22am

I believe it has something to do with

  db_default_text_search_config: "pg_catalog.english"

In the .yml file

icaria36 · November 10, 2017, 7:10am

Please bear with me, I don’t know much about databases. If someone has a Discourse instance set to English as the default language that already has real users and real content… What do they need to do to change the default locale effectively, in addition to changing the default locale setting in the admin interface?

A step-by-step process would be very helpful. The Discourse installation documentation is so good that it guided someone like me successfully through the entire process. The little detail about the implications of defining a locale in a fresh installation could be documented too, in order to avoid surprises like this one.

schungx · November 10, 2017, 11:19am

An already-existing database is always a problem, regardless of which database engine you use… I doubt it is very simple to change database locales, especially when there is already data in it.

I’d suggest you search via Google to see how to change PostgreSQL database default locale when the database is not empty.

icaria36 · November 11, 2017, 9:20am

This might be a silly question. Would having the database encoded to UTF-8 by default in Discourse solve these kinds of problems? Isn’t UTF-8 as good for English as for many other languages with diacritics and more?

schungx · November 11, 2017, 11:46am

UTF8 is an encoding. It does not know of languages and locales and how to split and compare words.
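A small Python sketch of that point, offered only as an illustration: UTF-8 stores every one of these strings faithfully, but it says nothing about whether they should count as the same word. The variable names are arbitrary.

```python
import unicodedata

composed = "poes\u00eda"      # 'í' as a single code point (U+00ED)
decomposed = "poesi\u0301a"   # 'i' followed by a combining acute accent (U+0301)
plain = "poesia"              # no accent at all

# All three are valid UTF-8, and all three are different byte sequences.
print(composed.encode("utf-8"))    # b'poes\xc3\xada'
print(decomposed.encode("utf-8"))  # b'poesi\xcc\x81a'
print(plain.encode("utf-8"))       # b'poesia'

# The encoding alone does not make any of them compare equal.
print(composed == plain)           # False
print(composed == decomposed)      # False

# Treating them as equal is a separate normalization step, outside the encoding.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Deciding that “poesía” and “poesia” should match is therefore a job for the text search configuration or a normalization layer, not for the character encoding.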

icaria36 · November 12, 2017, 2:37pm

I think I will live with this problem for now. Anyway my site is multilingual and therefore no single locale will fit perfectly.

Maybe at some point in the future Discourse’s search will allow for more fuzziness in general (users have to be accurate in English searches as well). That would probably be good enough to solve the problem of diacritics regardless of language.

MakaryGo · November 12, 2017, 5:24pm

Well, actually I found out that it would require enabling an extension called ‘Unaccent’ in the database for the given language, but I’m afraid this is somewhat above my expertise with pgsql, sorry.

sam · May 1, 2018, 6:26am

I think this is a feature request if anything, but looking at the doco by @MakaryGo, I think running something like this would work:

./launcher enter app
sudo -u postgres psql discourse
discourse=# CREATE EXTENSION unaccent;
discourse=# CREATE TEXT SEARCH CONFIGURATION en ( COPY = english);
discourse=# ALTER TEXT SEARCH CONFIGURATION en
        ALTER MAPPING FOR hword, hword_part, word
        WITH unaccent, english_stem;
discourse=# select to_tsvector('en','Hôtels de la Mer');
           to_tsvector           
---------------------------------
 'de':2 'hotel':1 'la':3 'mer':4
(1 row)

I do wonder @codinghorror if we should unaccent out-of-the-box or not.

tophee · May 2, 2018, 5:44am

If you decide to unaccent out of the box, it would be good if this could be overridden by using quotation marks:

poesía => poesia

but

“poesía” => poesía

sam · May 2, 2018, 5:46am

Sadly it is not that simple. We only carry one index around, and this would impact said index. The only easy change is default unaccent; anything else is major engineering work.
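A toy Python illustration of the single-index constraint, under loud assumptions: this is a hand-rolled inverted index, not Discourse’s actual PostgreSQL search data, and `fold`, `build_index`, and the sample documents are invented for the example.

```python
import unicodedata


def fold(token: str) -> str:
    """Strip combining marks from a token (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_index(docs: dict, normalize) -> dict:
    """Map each normalized token to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index.setdefault(normalize(token), set()).add(doc_id)
    return index


# Invented sample data: one post with the accent, one without.
docs = {1: "poesía moderna", 2: "poesia antigua"}

folded_index = build_index(docs, fold)          # accents removed at index time
exact_index = build_index(docs, lambda t: t)    # accents preserved

print(sorted(folded_index["poesia"]))   # [1, 2]
print("poesía" in folded_index)         # False
print(sorted(exact_index["poesía"]))    # [1]
```

Once accents are removed at index time, the folded index can no longer tell the two posts apart, so honouring a quoted “poesía” would need a second, unfolded index alongside it.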

MakaryGo · May 6, 2018, 4:47pm

I’ve managed to add Polish full-text search, thanks to PostgreSQL’s ability to use hunspell dictionaries for languages that don’t have a Snowball stemmer algorithm, and this GitHub script:

It does unaccent too, and should be easy to adapt to other languages.
I’d submit it as a PR, but it’s not really production quality, more of a hack. Perhaps some dev will be able to make it work, what do you think @sam?
It’s a pups template that is then included in the main app.yml.
Edit:

Unfortunately I’ve only managed to get it working on a new install; adding it to an existing instance does not seem to work, even after running rake search:reindex

params:
  LANGFULL: polish
  IDENTIFIER: pl_PL
  
run:
  - exec:
      cmd:
        - apt-get update
        - apt-get install -y postgresql-server-dev-10
        - wget https://raw.githubusercontent.com/lemonskyjwt/plpstgrssearch/master/pg_hunspell_install
        - chmod +x pg_hunspell_install
        - /bin/sh pg_hunspell_install pl PL $LANGFULL
hooks:
  after_postgres:
     - exec: su postgres -c 'psql template1 -c "create extension if not exists unaccent;"'
     - exec: su postgres -c 'psql $db_name -c "create extension if not exists unaccent;"'
     - exec:
         stdin: |
           CREATE TEXT SEARCH DICTIONARY $LANGFULL_hunspell (TEMPLATE  = ispell, DictFile  = $IDENTIFIER, AffFile   = $IDENTIFIER, StopWords = $LANGFULL);
           COMMENT ON TEXT SEARCH DICTIONARY $LANGFULL_hunspell
           IS '[USER ADDED] Hunspell dictionary for $LANGFULL';
           CREATE TEXT SEARCH CONFIGURATION public.$LANGFULL (COPY = pg_catalog.english);
           ALTER TEXT SEARCH CONFIGURATION $LANGFULL
           ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,  word, hword, hword_part
           WITH $LANGFULL_hunspell, unaccent, simple;
           COMMENT ON TEXT SEARCH CONFIGURATION $LANGFULL
           IS '[USER ADDED] configuration for $LANGFULL';
         cmd: su - postgres -c 'psql discourse'

sam · August 31, 2018, 1:25am

OK I know this is a newer topic but all the discussion work happened here:

Closing this one in favor of
