Hi.
In French, we have a lot of accented characters, and also a lot of people who don’t bother typing them when writing.
Because Discourse seems to be strict about accents, a user who enters a string that can contain accents gets only partial results, both from the search feature and from the category filter when creating a topic:
OMG, thanks in advance! This is a huge issue for the Czech alphabet as well.
I am slightly confused though because I am pretty sure I saw a related topic here on META which appeared to claim that this should be handled by a proper setting of PostgreSQL DB.
But I feel like the simplest and most robust thing to do here is just to never send these accents up to the index in the first place and do the work in Ruby to remove them. Then we do not rely on PG text search being configured just so.
That said if you are feeling brave I recommend you try some of those options above to correct this.
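For anyone wondering what "do the work in Ruby" could look like, here is a minimal sketch. It is not the code Discourse ships; `strip_diacritics` is a hypothetical helper name. It assumes UTF-8 input, decomposes each character with `String#unicode_normalize` and drops the combining marks before the text goes to the index.

```ruby
# Hypothetical helper, not Discourse's actual implementation.
# Assumes the input string is UTF-8.
def strip_diacritics(text)
  text
    .unicode_normalize(:nfd) # split "é" into "e" + combining acute accent
    .gsub(/\p{Mn}/, "")      # drop the combining (non-spacing) marks
    .unicode_normalize(:nfc) # recompose whatever is left
end

puts strip_diacritics("Příliš žluťoučký kůň") # => Prilis zlutoucky kun
puts strip_diacritics("catégorie réservée")   # => categorie reservee
```

The same function would have to run on the user's query as well as on the indexed text, otherwise the two sides never match. Letters that have no canonical decomposition, such as "đ", pass through unchanged with this approach.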
Hm, just so I understand, this was fixed only partially, right? The full search is still sensitive to accents as far as I can tell. I am only confused because this was marked as done in
OK this is absolutely not PR welcome territory cause it is a very tricky problem to solve.
In particular there are 2 very important things to keep in mind:
1. Diacritic stripping for search should be optional and default off on some languages like Vietnamese. In Vietnamese you never want to strip diacritics cause you end up getting nonsensical results.
2. The excerpts we show for search results should always show the diacritics, cause otherwise it just looks like there are a bunch of silly spelling mistakes even in French. This is a hard problem cause we lean on PG to create the excerpts.
Given (1) I am reverting this feature for now, and given (2) this is something we will tackle for 2.2 release.
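To illustrate point (1), here is a rough sketch of stripping behind an opt-in switch. Everything named here is an assumption made for the example: `SearchSettings`, the `search_ignore_accents` flag and `prepare_for_search` are hypothetical, not existing Discourse settings or methods.

```ruby
# Hypothetical settings object; the flag name is an assumption.
SearchSettings = Struct.new(:search_ignore_accents)

# Same idea as stripping in Ruby before indexing: decompose, drop marks.
def strip_diacritics(text)
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, "").unicode_normalize(:nfc)
end

# Apply stripping only when the site has opted in.
def prepare_for_search(text, settings)
  return text unless settings.search_ignore_accents
  strip_diacritics(text)
end

czech_site      = SearchSettings.new(true)
vietnamese_site = SearchSettings.new(false) # default off

puts prepare_for_search("kůň", czech_site)     # => kun
puts prepare_for_search("má", vietnamese_site) # => má

# Why the default matters: with stripping on, these distinct
# Vietnamese words all collapse into the same search term.
puts %w[ma má mà mã].map { |w| strip_diacritics(w) }.uniq.inspect # => ["ma"]
```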
Do I understand correctly that there is currently no option to keep this feature?
In which case I might want to keep an older version on our Czech site for a while…
Yeah, hold tight, expect us to redo the feature in the next 3-4 weeks. But if there are security fixes I do recommend just giving up and upgrading. (there was one yesterday)
I am curious, in Czech don’t the excerpts look a bit “kindergarten” level of spelling when we strip the diacritics out of the excerpts, or is it something people do not notice?
There are many people who just do not bother with diacritics because they are harder to type (we basically have a QWERTY keyboard with accented characters instead of numbers on the top row). So while it does not look great, incomplete search is a MUCH bigger problem for us, especially because people expect this to just work (i.e. habits from internet search engines).
I expect the same is true for most Slavic languages.
Thanks for working on this! Good to see that Discourse is spreading and i18n issues are becoming more important.
EDIT: You can multiply the pain of typing accented characters x10 on mobile.
We can add the feature back behind a site setting very very shortly post our upcoming release in 1 - 1.5 weeks, I just want to avoid adding strings now which we end up needing to translate.
Then we can spend the extra time needed on fixing excerpts after we get the optional feature in.
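On the excerpt side, one possible direction (only a sketch, and done in plain Ruby rather than in PG, which is an assumption on my part) is to match against the stripped text but cut the excerpt out of the original. The helper names `strip_with_offsets` and `excerpt_with_diacritics` are hypothetical, and case folding and word boundaries are ignored to keep it short.

```ruby
# Returns the stripped text plus, for every stripped character,
# the index of the original character it came from.
def strip_with_offsets(text)
  stripped = +""
  offsets = []
  text.each_char.with_index do |ch, i|
    base = ch.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
    base.each_char do |b|
      stripped << b
      offsets << i
    end
  end
  [stripped, offsets]
end

# Find the query in the stripped text, then slice the ORIGINAL text
# so the excerpt keeps its diacritics.
def excerpt_with_diacritics(original, query, radius: 20)
  stripped, offsets = strip_with_offsets(original)
  needle, _ = strip_with_offsets(query)
  return nil if needle.empty?

  hit = stripped.index(needle)
  return nil if hit.nil?

  first = offsets[hit]
  last  = offsets[hit + needle.length - 1]
  from  = [first - radius, 0].max
  to    = [last + radius, original.length - 1].min
  original[from..to]
end

text = "Vous avez une catégorie réservée aux membres"
# The user typed the word without its accent; the excerpt still shows it.
puts excerpt_with_diacritics(text, "categorie", radius: 5)
```

The offset table is there because stripping can change the length of the string (a standalone combining mark disappears entirely), so positions in the stripped text cannot be reused directly on the original.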