Canapin
(Coin-coin le Canapin)
February 21, 2022 at 17:05
1
Some French words use the following characters:
Œ, œ, as in œuf (egg), cœur (heart), œuvre (which has several meanings that don’t matter here), etc.
Æ, æ, as in nævus (the scientific term for a mole: the spot on the skin, not the animal) and others.
æ is rarely used (I think it’s always in scientific or medical terms from Latin?), but œ is quite common in French.
Sadly, these special characters aren’t present on the French keyboard layout, so a lot of people simply write “oe” or “ae” instead.
But some users have custom layouts, and smartphone autocorrect or other writing aids often replace these with the proper ligatures, like “oeuvre” → “œuvre”.
Currently, the search treats “oe” and “œ” as different strings, so the two spellings return different results when they should return the same ones.
Example:
https://forum.monocycle.info/search?q=coeur
https://forum.monocycle.info/search?q=cœur
My suggestion is that “oe” and “œ”, and “OE” and “Œ”, be processed as identical strings, and likewise “ae” and “æ”, and “AE” and “Æ”.
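To illustrate the suggestion, here is a minimal Python sketch (not Discourse code) that folds the four ligatures before comparing search terms. The `LIGATURES` table and both function names are hypothetical, chosen only for this example.

```python
# Hypothetical table covering only the four characters named in this post.
LIGATURES = str.maketrans({"œ": "oe", "Œ": "OE", "æ": "ae", "Æ": "AE"})


def fold_ligatures(text: str) -> str:
    """Replace œ/Œ/æ/Æ with their two-letter spellings."""
    return text.translate(LIGATURES)


def same_query(a: str, b: str) -> bool:
    """Compare two search terms after folding ligatures in both."""
    return fold_ligatures(a) == fold_ligatures(b)


print(same_query("coeur", "cœur"))    # True
print(fold_ligatures("Œuvre nævus"))  # OEuvre naevus
```

The same folding would have to be applied both when content is indexed and when a query is parsed, otherwise the two sides still would not match.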
9 Likes
sam
(Sam Saffron)
February 23, 2022 at 23:57
2
Very interesting problem. @zogstrip / @j.jaffeux, what do you think? We could add a normalizer behind a site setting.
3 Likes
zogstrip
(Régis Hanol)
February 24, 2022 at 13:30
3
SELECT to_tsvector('french', E'Cette oeuvre d\'art n\'est pas une œuvre.');
to_tsvector
------------------------------------
'art':4 'cet':1 'oeuvr':2 'œuvr':9
(1 row)
For some reason, I thought handling diacritics, ligatures and such was a solved problem when dealing with search… I guess not?
As a French person, I definitely support that. Looks like we could use PostgreSQL’s unaccent, which removes accents and also deals with common ligatures.
SELECT to_tsvector('french', unaccent('œuvre poêle œuf Noël électroencéphalogramme æ Æ'));
to_tsvector
-------------------------------------------------------------------------
'ae':6,7 'electroencephalogramm':5 'noel':4 'oeuf':3 'oeuvr':1 'poel':2
(1 row)
4 Likes
Canapin
(Coin-coin le Canapin)
February 24, 2022 at 13:33
4
zogstrip:
As a French person
And as a French person, do you also hate the useless complexity of this (though interesting) language as much as I do?
Sorry for the slightly off-topic humor.
4 Likes
sam
(Sam Saffron)
February 25, 2022 at 02:09
5
Do you feel we should simply amend the implementation of search_ignore_accents to use unaccent, or would we need a whole new setting?
I kind of like simply changing the implementation of ignore accents, because that gives us parity with what PG does anyway.
3 Likes
zogstrip
(Régis Hanol)
February 25, 2022 at 14:44
6
That’s a good question.
It would definitely work for French, but there might be other locales where it does not work as expected?
After looking at the /usr/share/postgresql/13/tsearch_data/unaccent.rules file, it looks pretty safe.
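For reference, each line of an unaccent rules file pairs a source character with its replacement, separated by whitespace. The Python sketch below mimics that lookup on a handful of sample rules. The rules are illustrative rather than copied from the shipped file, `parse_rules` and `apply_rules` are hypothetical names, and the real extension's matching is more involved than this per-character lookup.

```python
# Illustrative rules in "source<whitespace>replacement" form.
# These are sample entries for the sketch, not the contents of the real file.
SAMPLE_RULES = """\
é\te
ë\te
ê\te
œ\toe
Œ\tOE
æ\tae
Æ\tAE
"""


def parse_rules(text: str) -> dict[str, str]:
    """Build a source -> replacement map, one rule per non-empty line."""
    rules = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        # A line with a single field means "delete this character".
        rules[parts[0]] = parts[1] if len(parts) > 1 else ""
    return rules


def apply_rules(text: str, rules: dict[str, str]) -> str:
    """Replace each character that has a rule; leave the rest untouched."""
    return "".join(rules.get(ch, ch) for ch in text)


rules = parse_rules(SAMPLE_RULES)
print(apply_rules("œuvre poêle Noël æ Æ", rules))  # oeuvre poele Noel ae AE
```

Characters without a rule pass through unchanged, which is why checking the rules file for surprising entries is a reasonable way to judge the impact on other locales.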
I definitely support switching our search_ignore_accents setting to use PostgreSQL’s unaccent.
@nbianca can you add this to your list?
3 Likes
nbianca
(Bianca)
March 8, 2022 at 17:52
7
I replaced our old Ruby implementation with PostgreSQL’s unaccent in this PR:
discourse:main
← discourse:feature_unaccent
opened 05:13PM - 04 Mar 22 UTC
The search_ignore_accents site setting can be used to make the search
indexer remove the accents before indexing the content. The unaccent
function from PostgreSQL is better than Ruby's unicode_normalize(:nfkd).
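A rough Python analogue of the NFKD approach shows why it falls short for French ligatures: NFKD splits é into e plus a combining accent, which can then be dropped, but œ and æ have no Unicode decomposition and pass through unchanged. This is only a sketch; `strip_accents_nfkd` is a hypothetical name and this is not the removed Ruby code.

```python
import unicodedata


def strip_accents_nfkd(text: str) -> str:
    """Decompose with NFKD, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents_nfkd("Noël poêle"))  # Noel poele
print(strip_accents_nfkd("œuvre"))       # œuvre (ligature survives)
```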
5 Likes
nbianca
(Bianca)
Closed,
March 12, 2022 at 06:00
8
This topic was automatically closed after 3 days. New replies are no longer allowed.