Canapin
(Coin-coin le Canapin)
February 21, 2022, 5:05pm
1
Some French words use the following characters:
Œ, œ, as in œuf (egg), cœur (heart), or œuvre (it has several translations and meanings, which don’t matter here), etc.
Æ, æ, as in nævus (the scientific term for a mole: the spot on the skin, not the animal) and a few others.
æ is rarely used (I think it’s always in scientific/medical terms from Latin?), but œ is quite common in French.
Sadly, these special characters aren’t present on the French keyboard layout and a lot of people simply write “oe” or “ae” instead.
But some users have custom layouts, and smartphone autocorrection or other writing aids often replace these automatically with the proper “merged characters”, like “oeuvre” → “œuvre”.
Currently, the search treats “oe” and “œ” as different strings, so the two spellings return different results when they should return the same ones.
Example:
https://forum.monocycle.info/search?q=coeur
https://forum.monocycle.info/search?q=cœur
My suggestion is that “oe” and “œ”, and “OE” and “Œ”, be treated as identical strings. The same goes for “ae” and “æ”, and “AE” and “Æ”.
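A minimal sketch of the proposed equivalence, written in Python rather than in the forum software's own code. The names LIGATURES, fold_ligatures and same_search_term are hypothetical, and the table covers only the four characters listed above:

```python
# Hypothetical helper: fold the ligatures into their two-letter forms so that
# both spellings compare equal. This is not the forum software's actual code.
LIGATURES = str.maketrans({
    "œ": "oe",
    "Œ": "OE",
    "æ": "ae",
    "Æ": "AE",
})


def fold_ligatures(text: str) -> str:
    """Return text with œ/Œ/æ/Æ replaced by oe/OE/ae/AE."""
    return text.translate(LIGATURES)


def same_search_term(a: str, b: str) -> bool:
    """Compare two search terms after ligature folding, ignoring case."""
    return fold_ligatures(a).casefold() == fold_ligatures(b).casefold()
```

With this, same_search_term("cœur", "coeur") is true, so both example searches above would be looking for the same term.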
9 Likes
sam
(Sam Saffron)
February 23, 2022, 11:57pm
2
Very interesting problem. @zogstrip / @joffreyjaffeux, what do you think? We could add a normalizer behind a site setting.
3 Likes
zogstrip
(Régis Hanol)
February 24, 2022, 1:30pm
3
SELECT to_tsvector('french', E'Cette oeuvre d\'art n\'est pas une œuvre.');
to_tsvector
------------------------------------
'art':4 'cet':1 'oeuvr':2 'œuvr':9
(1 row)
For some reason, I thought handling diacritics, ligatures and such was a solved problem when dealing with search… I guess not?
As a French speaker, I definitely support that. Looks like we could use PostgreSQL’s unaccent, which removes accents and also deals with common ligatures.
SELECT to_tsvector('french', unaccent('œuvre poêle œuf Noël électroencéphalogramme æ Æ'));
to_tsvector
-------------------------------------------------------------------------
'ae':6,7 'electroencephalogramm':5 'noel':4 'oeuf':3 'oeuvr':1 'poel':2
(1 row)
4 Likes
Canapin
(Coin-coin le Canapin)
February 24, 2022, 1:33pm
4
zogstrip:
As a
And as a French person, do you also hate the useless complexity of this (though interesting) language as much as I do?
Sorry for the slightly off-topic humor.
4 Likes
sam
(Sam Saffron)
February 25, 2022, 2:09am
5
Do you feel we should simply amend the implementation of search_ignore_accents to use unaccent, or would we need a whole new setting?
I kind of like simply changing the implementation of ignore accents, because there is parity with what PG does anyway.
3 Likes
zogstrip
(Régis Hanol)
February 25, 2022, 2:44pm
6
That’s a good question.
It would definitely work for French, but there might be other locales where it does not work as expected?
After looking at the /usr/share/postgresql/13/tsearch_data/unaccent.rules file, it looks like it’s pretty safe.
I definitely support switching our search_ignore_accents setting to use PostgreSQL’s unaccent.
@nbianca can you add this to your list?
3 Likes
nbianca
(Bianca)
March 8, 2022, 5:52pm
7
I replaced our old Ruby implementation with PostgreSQL’s unaccent in this PR:
discourse:main
← discourse:feature_unaccent
opened 05:13PM - 04 Mar 22 UTC
The search_ignore_accents site setting can be used to make the search
indexer remove the accents before indexing the content. The unaccent
function from PostgreSQL is better than Ruby's unicode_normalize(:nfkd).
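To illustrate why a decomposition-based approach misses the ligatures discussed in this topic, here is a rough sketch. It uses Python's unicodedata module as a stand-in for Ruby's unicode_normalize(:nfkd); that substitution is an assumption, the old Ruby code is not shown here, and strip_accents_nfkd is a hypothetical name. NFKD splits é into e plus a combining accent that can then be dropped, but œ and æ have no Unicode decomposition, so they pass through unchanged:

```python
import unicodedata


def strip_accents_nfkd(text: str) -> str:
    """Decompose with NFKD, then drop combining marks (the accents)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Accented letters are handled: ë -> e, ê -> e.
print(strip_accents_nfkd("Noël poêle"))  # Noel poele

# The ligature survives, so "œuvre" still does not match "oeuvre".
print(strip_accents_nfkd("œuvre") == "oeuvre")  # False
```

A rules-based mapping such as the one unaccent ships with can cover these characters explicitly, which is consistent with the 'oeuvr' and 'ae' lexemes in the output earlier in this topic.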
5 Likes
nbianca
(Bianca)
Closed
March 12, 2022, 6:00am
8
This topic was automatically closed after 3 days. New replies are no longer allowed.