Suggestion of a slight improvement regarding French search processing ("œ" and "æ" special characters)

Canapin · February 21, 2022, 5:05pm

Some French words use the following characters:

Œ, œ, as in œuf (egg), cœur (heart), œuvre (several translations and meanings that I won’t detail here, as they don’t matter), etc.
Æ, æ, as in nævus (the scientific term for a mole: the dot on the skin, not the animal) and a few others.

æ is rarely used (I think it only appears in scientific/medical terms borrowed from Latin?), but œ is quite common in French.

Sadly, these special characters aren’t present on the French keyboard layout, so a lot of people simply write “oe” or “ae” instead.
But some users have custom layouts, and smartphone autocorrection or other writing aids often replace those spellings with the proper “merged characters”, like “oeuvre” → “œuvre”.

Currently, the search treats “oe” and “œ” as different strings, so the two spellings of the same word return different results when they should return the same ones.

Example:
https://forum.monocycle.info/search?q=coeur
https://forum.monocycle.info/search?q=cœur

My suggestion is that “oe” and “œ”, “OE” and “Œ” should be processed as identical strings. And also the same for “ae” and “æ”, “AE” and “Æ”.
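
A minimal Python sketch of the suggested behavior, assuming a plain character-level fold applied to both the indexed text and the search query. The `LIGATURES` table and the `fold_ligatures` helper are hypothetical names used for illustration only; they are not part of Discourse.

```python
# Hypothetical mapping: each ligature is expanded to its two-letter spelling.
LIGATURES = str.maketrans({"œ": "oe", "Œ": "OE", "æ": "ae", "Æ": "AE"})


def fold_ligatures(text: str) -> str:
    """Expand œ/Œ/æ/Æ so that both spellings produce the same string."""
    return text.translate(LIGATURES)


# After folding, both spellings of the same word compare equal.
print(fold_ligatures("cœur") == fold_ligatures("coeur"))
print(fold_ligatures("nævus"))
```

Folding has to be applied on both sides (stored text and query), otherwise the two spellings would still end up as different terms.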

sam · February 23, 2022, 11:57pm

Very interesting problem. @zogstrip / @j.jaffeux, what do you think? We could add a normalizer behind a site setting.

zogstrip · February 24, 2022, 1:30pm

SELECT to_tsvector('french', E'Cette oeuvre d\'art n\'est pas une œuvre.');
            to_tsvector             
------------------------------------
 'art':4 'cet':1 'oeuvr':2 'œuvr':9
(1 row)

For some reason, I thought handling diacritics, ligatures and such was a solved problem when dealing with search… I guess not?

As a French speaker, I definitely support that. It looks like we could use PostgreSQL’s unaccent, which removes accents and also deals with common ligatures.

SELECT to_tsvector('french', unaccent('œuvre poêle œuf Noël électroencéphalogramme æ Æ'));
                               to_tsvector                               
-------------------------------------------------------------------------
 'ae':6,7 'electroencephalogramm':5 'noel':4 'oeuf':3 'oeuvr':1 'poel':2
(1 row)
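
For readers without a PostgreSQL instance at hand, here is a rough Python approximation of the accent and ligature removal shown above. It is only a sketch under stated assumptions: the real unaccent works from a rules file, whereas this version relies on Unicode decomposition plus a small hand-written ligature table. `strip_accents` and `LIGATURE_MAP` are hypothetical names.

```python
import unicodedata

# œ and æ have no Unicode decomposition, so they need an explicit table.
LIGATURE_MAP = str.maketrans({"œ": "oe", "Œ": "OE", "æ": "ae", "Æ": "AE"})


def strip_accents(text: str) -> str:
    """Rough stand-in for unaccent: drop combining marks, expand ligatures."""
    # NFD splits "é" into "e" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.translate(LIGATURE_MAP)


print(strip_accents("œuvre poêle œuf Noël électroencéphalogramme æ Æ"))
```

Unicode normalization alone would leave “œ” and “æ” untouched, which is why the explicit table is needed in this sketch.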

Canapin · February 24, 2022, 1:33pm

And as a French person, do you also hate the useless complexity of this (though interesting) language as much as I do?
Sorry for the slight off-topic humor

sam · February 25, 2022, 2:09am

Do you feel we should simply amend the implementation of search_ignore_accents to use unaccent or would we need a whole new setting?

I kind of like simply changing the implementation of ignore accents cause there is parity with what PG does anyway.

zogstrip · February 25, 2022, 2:44pm

That’s a good question

It would definitely work for French, but there might be other locales where it doesn’t work as expected?

After looking at the /usr/share/postgresql/13/tsearch_data/unaccent.rules file, it looks like it’s pretty safe.

I definitely support switching our search_ignore_accents setting to use PostgreSQL’s unaccent.

@nbianca can you add this to your list?

nbianca · March 8, 2022, 5:52pm

I replaced our old Ruby implementation with PostgreSQL’s unaccent in this PR:

nbianca · March 12, 2022, 6:00am

This topic was automatically closed after 3 days. New replies are no longer allowed.
