Diacritics and search

Is it possible that Discourse’s search engine doesn’t know about fuzzy search with diacritics?

Diacritics are glyphs added to letters in many languages (notably not in English, except when borrowing foreign words and even then). Speakers of languages with diacritics expect search engines to basically ignore them (i.e. searching for “Mexico” or “México” provides the same results).
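To make that expectation concrete, here is a minimal Python sketch of accent-insensitive matching. It is only an illustration of the idea, not how Discourse or PostgreSQL implement search; the helper names `fold_diacritics` and `matches` are made up for this example.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Strip combining marks so that 'México' and 'Mexico' compare equal."""
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep everything else.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def matches(query: str, document: str) -> bool:
    """Case- and accent-insensitive substring match (hypothetical helper)."""
    return fold_diacritics(query).casefold() in fold_diacritics(document).casefold()


if __name__ == "__main__":
    print(fold_diacritics("México"))
    print(matches("Mexico", "Viaje a México"))
```

Note that this only handles letters that decompose into a base letter plus a combining mark; letters such as "ł" or "ø" have no such decomposition and would need an explicit mapping table.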

In Discourse, diacritics seem to be critical: :wink:

Languages with diacritics use them everywhere, but their speakers are often lazy about them or get them wrong. The difference between finding the content you are looking for and not finding it is huge.

Is Discourse using its own search engine or embedding an existing one, e.g. Elasticsearch? In any case, hopefully there is a library or something that Discourse developers could use to integrate diacritics fuzziness in search?


As I remember, the full-text search index that gets built works best with English text… There have been a lot of discussions on meta regarding Chinese.

For other languages, you may want to search meta further. However, it may involve changing the default locale of PostgreSQL or something similar, as I remember reading here.

If you are talking about non-English sites, you need to set the database encoding to match the language the site is in.


Right, that might explain it. Currently the locale is English. The plan is to change to Catalan once the Catalan translation is sorted out.

When the default locale of the site is changed, is the corresponding database encoding automatically changed?

No, you must set that at install time. How do we set database encoding during setup @techapj? I can’t recall.

I believe it has something to do with

  db_default_text_search_config: "pg_catalog.english"

in the .yml file.


Please bear with me, I don’t know much about databases. If someone has a Discourse instance set to English as the default language that already has real users and real content… What do they need to do in order to change the default locale effectively, in addition to changing the default locale setting in the admin interface?

A step-by-step process would be very helpful. Discourse’s installation documentation is so good that it guided someone like me successfully through the entire process. :wink: The little detail about the implications of defining a locale in a fresh installation could be documented too, in order to avoid surprises like this one.

An already-existing database is always a problem regardless of what database engine you use… I doubt it is very simple to change database locales, especially when there is already data in it.

I’d suggest you search via Google to see how to change PostgreSQL database default locale when the database is not empty.

This might be a silly question. Would having the database encoded to UTF-8 by default in Discourse solve these kinds of problems? Isn’t UTF-8 as good for English as for many other languages with diacritics and more?

UTF-8 is an encoding. It knows nothing about languages and locales, or about how to split and compare words.
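A small Python sketch of that point: the encoding only decides which bytes represent which code points, so "poesía" and "poesia" remain different strings, and even the same visible word can be stored as two different byte sequences. Deciding that they should match is a job for a normalization or comparison layer on top. The variable names are just for illustration.

```python
import unicodedata

composed = "poes\u00eda"     # "poesía" with í as a single code point
decomposed = "poesi\u0301a"  # "poesía" as i + combining acute accent
plain = "poesia"             # no accent at all

# Same visible word, different UTF-8 bytes.
composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")
same_bytes = composed_bytes == decomposed_bytes

# Unicode normalization (not the encoding) reconciles the two spellings.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

# Treating the accented and unaccented words as equal needs yet another rule.
equal_to_plain = composed == plain

if __name__ == "__main__":
    print(same_bytes, same_after_nfc, equal_to_plain)
```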


I think I will live with this problem for now. Anyway my site is multilingual and therefore no single locale will fit perfectly.

Maybe at some point in the future Discourse’s search will allow for more fuzziness in general (users have to be accurate in English searches as well). That would probably be good enough to solve the problem of diacritics regardless of language.

Well, actually I found out it would require enabling an extension called ‘unaccent’ in the database for the given language, but I’m afraid this is somewhat above my expertise with pgsql, sorry.


I think this is a feature request if anything, but looking at the docs from @MakaryGo I think running something like this should work:

./launcher enter app
sudo -u postgres psql discourse
discourse=# CREATE EXTENSION unaccent;
discourse=# CREATE TEXT SEARCH CONFIGURATION en ( COPY = english);
discourse=# ALTER TEXT SEARCH CONFIGURATION en
        ALTER MAPPING FOR hword, hword_part, word
        WITH unaccent, english_stem;
discourse=# select to_tsvector('en','Hôtels de la Mer');
           to_tsvector           
---------------------------------
 'de':2 'hotel':1 'la':3 'mer':4
(1 row)

I do wonder @codinghorror if we should unaccent out-of-the-box or not.


If you decide to unaccent out of the box, it would be good if this could be overridden by using quotation marks:

poesía => poesia

but

“poesía” => poesía

Sadly it is not that simple: we only carry one index around, and this would impact said index. The only easy change is to unaccent by default; anything else is major engineering work.
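A toy Python sketch of why a single index makes the quoted-search idea hard. This is an in-memory illustration only, not how Discourse or PostgreSQL store their index, and `fold` and `build_index` are hypothetical names. Once tokens are folded before indexing, the accented form is no longer in the index, so an exact “poesía” lookup would need a second, unfolded index.

```python
import unicodedata
from collections import defaultdict


def fold(token: str) -> str:
    """Remove combining marks from a token (illustrative stand-in for unaccent)."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_index(docs, normalize):
    """Map each normalized token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[normalize(token)].add(doc_id)
    return index


docs = {1: "poesía moderna", 2: "poesia sin acento"}

folded_index = build_index(docs, fold)           # what "unaccent by default" gives
exact_index = build_index(docs, lambda t: t)     # what quoted search would need

if __name__ == "__main__":
    print(folded_index["poesia"])        # both documents
    print("poesía" in folded_index)      # the accented key is gone
    print(exact_index["poesía"])         # only the accented document
```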


I’ve managed to add Polish full-text search, thanks to PostgreSQL’s ability to use hunspell dictionaries for languages that don’t have a Snowball stemmer algorithm, and this GitHub script:

It also unaccents and should be easy to adapt to other languages.
I’d submit it as a PR, but it’s not really production quality, more of a hack. But perhaps some :discourse: dev will be able to make it work; what do you think, @sam?
It’s a pups template that’s later included in the main app.yml.
Edit:

Unfortunately I’ve only managed to get it working on a new install; adding it to an existing instance does not seem to work, even after running rake search:reindex.

params:
  LANGFULL: polish
  IDENTIFIER: pl_PL
  
run:
  - exec:
      cmd:
        - apt-get update
        - apt-get install -y postgresql-server-dev-10
        - wget https://raw.githubusercontent.com/lemonskyjwt/plpstgrssearch/master/pg_hunspell_install
        - chmod +x pg_hunspell_install
        - /bin/sh pg_hunspell_install pl PL $LANGFULL
hooks:
  after_postgres:
     - exec: su postgres -c 'psql template1 -c "create extension if not exists unaccent;"'
     - exec: su postgres -c 'psql $db_name -c "create extension if not exists unaccent;"'
     - exec:
         stdin: |
           CREATE TEXT SEARCH DICTIONARY $LANGFULL_hunspell (TEMPLATE  = ispell, DictFile  = $IDENTIFIER, AffFile   = $IDENTIFIER, StopWords = $LANGFULL);
           COMMENT ON TEXT SEARCH DICTIONARY $LANGFULL_hunspell
           IS '[USER ADDED] Hunspell dictionary for $LANGFULL';
           CREATE TEXT SEARCH CONFIGURATION public.$LANGFULL (COPY = pg_catalog.english);
           ALTER TEXT SEARCH CONFIGURATION $LANGFULL
           ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,  word, hword, hword_part
           WITH $LANGFULL_hunspell, unaccent, simple;
           COMMENT ON TEXT SEARCH CONFIGURATION $LANGFULL
           IS '[USER ADDED] configuration for $LANGFULL';
         cmd: su - postgres -c 'psql discourse'

OK I know this is a newer topic but all the discussion work happened here:

Closing this one in favor of :arrow_double_up:
