Discourse should ignore if a character is accented when doing a search

sam · August 31, 2018, 7:39am

If you really want to push this forward prior to release, I am open to a site setting here. I guess we can live with one less translation if people do not pick it up in time, but it must be enabled specifically in languages that we know need this, not globally.

@Falco what is your call for Portuguese?

Falco · August 31, 2018, 1:40pm

Same thing as French. People never type accents when doing a quick search.

danekhollas · August 31, 2018, 3:12pm

We are talking about a string in admin section, right?

Mittineague · August 31, 2018, 9:09pm

My take is that what Sam means by “one less translation” has to do with the setting’s token-value pair, e.g.
ignore_accents: 'ignore accents in search'

So, say for French, that might (a total Google Translate guess here) be more like
ignore_accents: 'ignorer les accents dans la recherche'

The blog post you linked to is more about why it wouldn’t be
ignorer_les_accents: 'ignorer les accents dans la recherche'

To get to the point, if a site used the French locale but had not pulled in the new token-value pair with the translation in its locale yml file, the fallback would be to display the English text. It would look out of place, but it would be only “one translation”.
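
A minimal sketch of that fallback behaviour, assuming a plain dictionary lookup. The `TRANSLATIONS` table and the `translate` helper are hypothetical names for illustration, not Discourse's actual i18n code:

```python
# Hypothetical translation tables keyed by locale (not real Discourse data).
TRANSLATIONS = {
    "en": {"ignore_accents": "ignore accents in search"},
    "fr": {},  # the French file has not picked up the new key yet
}


def translate(key: str, locale: str, fallback: str = "en") -> str:
    """Return the string for `key` in `locale`, falling back to English."""
    table = TRANSLATIONS.get(locale, {})
    if key in table:
        return table[key]
    return TRANSLATIONS[fallback][key]


# The French site shows the English text until the translation arrives.
print(translate("ignore_accents", "fr"))  # ignore accents in search
```

Once the French table gains the key, the same call returns the translated string with no other change.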

danekhollas · August 31, 2018, 9:53pm

Yes, I understand that, but my point was different.

I was partly joking, but merely pointing out that the untranslated string would be visible only to forum admins, who supposedly should have at least rudimentary English knowledge anyway.

If we are weighing alternatives for a major release, then working search seems like a much bigger deal than one untranslated string.

sam · August 31, 2018, 9:55pm

We also need to walk through every locale we have (there are quite a lot) and make a reasonably informed decision about whether we enable the feature or not.

A list here would help

danekhollas · August 31, 2018, 9:58pm

Could you point us to the list of currently available Discourse locales? I can help with the Slavic ones.

EDIT: Sorry, perhaps I misunderstood and you are asking for the same list…

sam · August 31, 2018, 10:10pm

Our master list is at https://github.com/discourse/discourse/blob/master/config/locales/names.yml

Reading the Wikipedia article on diacritics shows how complex the situation is. Stripping diacritics is also not great for Scandinavian languages; this is a hairy issue.

In Hebrew for example you probably would not strip since zero people write with diacritics online, so a bunch of domain knowledge is required
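
To make "stripping" concrete, here is a rough sketch using Unicode decomposition from Python's standard library. It is one common technique, not necessarily what Discourse does, and `strip_diacritics` is a name made up for this example:

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose to NFD, drop nonspacing marks (category Mn), recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", stripped)


print(strip_diacritics("café"))   # cafe
print(strip_diacritics("Malmö"))  # Malmo
# "ø" has no canonical decomposition, so this approach leaves it untouched.
print(strip_diacritics("øl"))     # øl
```

The last two lines hint at why this is hairy: in Scandinavian languages "ö" and "å" are distinct letters, so folding them to "o" and "a" can merge different words, while "ø" is not folded at all by this method.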

Mittineague · August 31, 2018, 10:12pm

I wasn’t sure. If an Admin Setting, then yes. If a Search page option, then no.

danekhollas · August 31, 2018, 10:39pm

Thanks, will have a look tomorrow. Agree it is not simple, but to be on the safe side, we can leave the default off for most langs and leave the decision to admins.

I think German is another one, do we have somebody to confirm? @tobiaseigen sounds like a German name perhaps?

Hungarian and Turkish are also strong candidates, I will ask around.

danekhollas · August 31, 2018, 10:42pm

I wonder, is there a way to find out what Google does for each language? Seems like a good benchmark.

sam · September 2, 2018, 7:16am

I suspect that Google has an automatic per-word heuristic that is language neutral. I doubt we have any chance of matching what they do.

@Osama / @Pad_Pors what is the correct thing to do in Arabic? I can see plenty of diacritics in macdiscussions.udacity.com but I am not sure which ought to be stripped for search and which should be kept intact.

Osama · September 2, 2018, 10:00am

In Arabic we almost never type diacritics in day-to-day communications, because an Arabic diacritic is a separate character that you need to type in addition to the base character you want to add the diacritic to. You can imagine how painful that is. So I’d expect search engines to always find results with and without diacritics whether I type diacritics or not.
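
As an illustration of that expectation, here is a small sketch that drops the separate mark characters from both the query and the text before comparing. The helper names are invented for this example, and a real search index would normalize at indexing time rather than per comparison:

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Drop nonspacing marks (category Mn), which includes Arabic harakat."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def matches(query: str, document: str) -> bool:
    """Compare query and document after stripping marks from both sides."""
    return strip_marks(query) in strip_marks(document)


# The same word typed with and without diacritics (damma, fatha, shadda).
with_marks = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F"
without_marks = "\u0645\u062D\u0645\u062F"

print(len(with_marks), len(without_marks))  # 8 4
print(matches(without_marks, with_marks))   # True
print(matches(with_marks, without_marks))   # True
```

The length difference shows the extra characters a user would have to type, and the two `True` results show the search matching in both directions.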

Pad_Pors · September 2, 2018, 10:25am

the same as @Osama has described goes for Persian locale (fa_IR), people rarely type diacritics; neither in search terms nor in their daily dialogues.

so I’d say one can forget about them at least for this locale.

danekhollas · September 3, 2018, 10:05am

Okay, from the discussion above it seems stripping diacritics for search should be enabled by default at least for:

French, Portuguese (and by extension Spanish as well I guess), Arabic, Farsi, Czech.

I have it on good authority that Turkish should be stripped as well. Hungarian most probably as well, but would be good to have additional confirmation, maybe @asrob could confirm?

Of the other langs on the list, I can definitely say that Slovak should be included (very very similar to Czech) and most probably Polish and Slovene as well.
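
A sketch of how a per-locale default with an admin override could look. The locale codes, the `STRIP_BY_DEFAULT` set and the `should_strip` function are all assumptions for illustration, not the actual site setting:

```python
from typing import Optional

# Illustrative locale codes for the languages agreed on so far (assumed).
STRIP_BY_DEFAULT = {"fr", "pt", "es", "ar", "fa_IR", "cs"}


def should_strip(locale: str, admin_override: Optional[bool] = None) -> bool:
    """An explicit admin choice wins; otherwise use the locale default."""
    if admin_override is not None:
        return admin_override
    return locale in STRIP_BY_DEFAULT


print(should_strip("fr"))                        # True
print(should_strip("en"))                        # False
print(should_strip("en", admin_override=True))   # True
```

Languages confirmed later only need to be added to the set, and every other locale stays off unless an admin turns it on.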

chrispanag · September 5, 2018, 10:12pm

@danekhollas Also, the Greek language needs to be included in that list.

TheBestPessimist · September 6, 2018, 5:08am

In Romanian you can strip all diacritics: we can write and read with the English letters w/o any problems.

icaria36 · September 10, 2018, 10:24am

Same with Catalan

(20 chars)

zogstrip · September 13, 2018, 5:29pm

The PR is up

https://github.com/discourse/discourse/pull/6397/files

Bretty · September 14, 2018, 2:22am

While I’m not involved in this i18n, it has been a most interesting discussion to learn about how other languages are actually written. Thanks to everyone who shared their written language perspective. A virtual like to all of you!
