Suggestion: a slight improvement to French search processing ("œ" and "æ" special characters)

Some French words use the following characters:

  • Œ, œ, as in œuf (egg), cœur (heart), œuvre (which has several meanings that don’t matter here), etc.
  • Æ, æ, as in nævus (the scientific term for a mole: the spot on the skin, not the animal), among others.

æ is rarely used (I think it only appears in scientific or medical terms borrowed from Latin?), whereas œ is quite common in French.

Sadly, these special characters aren’t available on the French keyboard layout, so a lot of people simply write “oe” or “ae” instead.
But some users have custom layouts, and smartphone autocorrect or other writing aids often replace those spellings with the proper ligatures, e.g. “oeuvre” → “œuvre”.

Currently, search treats “oe” and “œ” as different strings, so the two spellings return different results when they should return the same ones.

Example:
https://forum.monocycle.info/search?q=coeur
https://forum.monocycle.info/search?q=cœur

My suggestion is that “oe” and “œ”, and “OE” and “Œ”, be treated as identical strings. The same goes for “ae” and “æ”, and “AE” and “Æ”.
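
To make the idea concrete, here is a minimal Python sketch of that equivalence, applied as a normalization step to both the indexed text and the query. It is only an illustration, not the forum software's actual code; `LIGATURES` and `fold_ligatures` are hypothetical names.

```python
# Hypothetical sketch: fold French ligatures so both spellings compare equal.
LIGATURES = str.maketrans({
    "œ": "oe",
    "Œ": "OE",
    "æ": "ae",
    "Æ": "AE",
})


def fold_ligatures(text: str) -> str:
    """Replace œ/Œ/æ/Æ with their two-letter spellings."""
    return text.translate(LIGATURES)


# Both spellings of the query normalize to the same string.
assert fold_ligatures("cœur") == fold_ligatures("coeur") == "coeur"
```

If the same function is applied when indexing and when searching, the two example URLs above would hit the same terms.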

9 Likes

Very interesting problem. @zogstrip / @joffreyjaffeux, what do you think? We could add a normalizer behind a site setting.

3 Likes

SELECT to_tsvector('french', E'Cette oeuvre d\'art n\'est pas une œuvre.');
            to_tsvector             
------------------------------------
 'art':4 'cet':1 'oeuvr':2 'œuvr':9
(1 row)

For some reason, I thought handling diacritics, ligatures and the like was a solved problem in search… I guess not? :man_shrugging:

As a :fr:, I definitely support that. It looks like we could use PostgreSQL’s unaccent, which removes accents and also handles common ligatures.

SELECT to_tsvector('french', unaccent('œuvre poêle œuf Noël électroencéphalogramme æ Æ'));
                               to_tsvector                               
-------------------------------------------------------------------------
 'ae':6,7 'electroencephalogramm':5 'noel':4 'oeuf':3 'oeuvr':1 'poel':2
(1 row)
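
For readers without a PostgreSQL instance at hand, the same kind of folding can be roughly approximated in Python. This is a sketch under stated assumptions, not how unaccent is implemented: unaccent works from a rules file and covers many more characters, and `rough_unaccent` is a hypothetical name. Unlike to_tsvector, it does no lowercasing or stemming.

```python
import unicodedata

# Unicode normalization does not decompose these ligatures, so map them by hand.
LIGATURES = str.maketrans({"œ": "oe", "Œ": "OE", "æ": "ae", "Æ": "AE"})


def rough_unaccent(text: str) -> str:
    """Strip combining accents, then expand the œ/æ ligatures."""
    # NFD splits "é" into "e" plus a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.translate(LIGATURES)


print(rough_unaccent("œuvre poêle œuf Noël électroencéphalogramme æ Æ"))
```

Note that the ligature table is needed on top of the normalization step: stripping combining marks alone leaves “œ” and “æ” untouched.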
4 Likes

And as a French person, do you also hate the needless complexity of this (admittedly interesting) language as much as I do? :smile:
Sorry for the slightly off-topic humor.

4 Likes

Do you feel we should simply amend the implementation of search_ignore_accents to use unaccent, or would we need a whole new setting?

I kind of like simply changing the implementation of search_ignore_accents, because that gives us parity with what PG does anyway.

3 Likes

That’s a good question :thinking:

It would definitely work for :fr:, but there may be other locales where it doesn’t work as expected?

After looking at the /usr/share/postgresql/13/tsearch_data/unaccent.rules file, it looks pretty safe.

I definitely support switching our search_ignore_accents setting to use PostgreSQL’s unaccent :+1:

@nbianca can you add this to your list?

3 Likes

I replaced our old Ruby implementation with PostgreSQL’s unaccent in this PR:

5 Likes
