Discourse should ignore if a character is accented when doing a search

If you really want to push this forward prior to release, I am open to a site setting here. I guess we can live with one less translation if translators do not pick it up in time, but it must be enabled specifically in languages that we know need it, rather than globally.

@Falco what is your call for Portuguese?

7 Likes

Same thing as French. People never type accents when doing a quick search.

8 Likes

We are talking about a string in admin section, right? :slight_smile:

https://blog.codinghorror.com/the-ugly-american-programmer/

My take is that what Sam means by “one less translation” has to do with the setting’s token-value pair, e.g.
ignore_accents: 'ignore accents in search'

So, say for French, that might (a total Google Translate guess here) be more like
ignore_accents: 'ignorer les accents dans la recherche'

The blog post you linked to is more like why it wouldn’t be like
ignorer_les_accents: 'ignorer les accents dans la recherche'

To get to the point, if a site used the French locale but had not pulled in the new token-value pair with its translation in the locale yml file, the fallback would be to display the English text. It would look out of place, but it would be only “one translation”.
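To illustrate the fallback described above, here is a minimal sketch. It is not how Discourse actually loads its locale files; the `TRANSLATIONS` table and the `translate` helper are made up for illustration.

```python
# Hypothetical in-memory locale tables (illustrative only).
TRANSLATIONS = {
    "en": {"ignore_accents": "ignore accents in search"},
    "fr": {},  # the French file has not pulled in the new key yet
}


def translate(key, locale, default_locale="en"):
    """Return the string for `key` in `locale`, falling back to English."""
    table = TRANSLATIONS.get(locale, {})
    if key in table:
        return table[key]
    # Key missing in the requested locale: show the English text instead.
    return TRANSLATIONS[default_locale][key]
```

Once the French table gains `ignore_accents: 'ignorer les accents dans la recherche'`, the same call would return the French string, with the key itself unchanged.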

2 Likes

Yes, I understand that, but my point was different.

I was partly joking, but merely pointing out that the untranslated string would be visible only to forum admins, who supposedly should have at least rudimentary English knowledge anyway.

If we are weighing alternatives for a major release, then working search seems like a much bigger deal than one untranslated string.

We also need to walk through every locale we have (there are quite a lot) and make a reasonably informed decision about whether we enable the feature or not.

A list here would help.

Could you point us to the list of currently available Discourse locales? I can help with the Slavic ones.

EDIT: Sorry, perhaps I misunderstood and you are asking for the same list…

Our master list is at https://github.com/discourse/discourse/blob/master/config/locales/names.yml

Reading the Wikipedia article on diacritics shows how complex the situation is. Stripping diacritics is also not great for Scandinavian languages, so this is a hairy issue.

In Hebrew for example you probably would not strip since zero people write with diacritics online, so a bunch of domain knowledge is required

3 Likes

I wasn’t sure. If an Admin Setting, then yes. If a Search page option, then no.

Thanks, will have a look tomorrow. Agree it is not simple, but to be on the safe side, we can leave the default off for most langs and leave the decision to admins.

I think German is another one, do we have somebody to confirm? :slight_smile: @tobiaseigen sounds like a German name perhaps? :face_with_monocle: :slight_smile:

Hungarian and Turkish are also strong candidates, I will ask around.

2 Likes

I wonder, is there a way to find out what Google does for each language? Seems like a good benchmark.

I suspect that Google has an automatic per-word heuristic that is language neutral. I doubt we have any chance of matching what they do.

@Osama / @Pad_Pors what is the correct thing to do in Arabic? I can see plenty of diacritics in macdiscussions.udacity.com, but I am not sure which ought to be stripped for search and which should be kept intact.

3 Likes

In Arabic we almost never type diacritics in day-to-day communications, because an Arabic diacritic is a separate character that you need to type in addition to the base character you want to add the diacritic to. You can imagine how painful that is. So I’d expect search engines to always find results with and without diacritics, whether I type diacritics or not.

9 Likes
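A rough sketch of what “find results with and without diacritics” could look like: decompose the text with Unicode NFD and drop the combining marks before comparing. This is only an illustration using Python's standard `unicodedata` module, not the approach taken in Discourse itself; `strip_diacritics` and `matches` are hypothetical names.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose to NFD, then drop nonspacing combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def matches(query, text):
    """Accent-insensitive, case-insensitive substring match."""
    return strip_diacritics(query).casefold() in strip_diacritics(text).casefold()


# French: a query typed without accents still finds the accented word.
print(matches("cafe", "Un café, s'il vous plaît"))

# Arabic: kaf + kasra + ta + fatha + alef + ba, versus the same word
# typed without the two vowel marks.
with_marks = "\u0643\u0650\u062A\u064E\u0627\u0628"
without_marks = "\u0643\u062A\u0627\u0628"
print(strip_diacritics(with_marks) == without_marks)
```

Because both the query and the indexed text go through the same folding, it does not matter which side carries the marks.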

The same as @Osama described goes for the Persian locale (fa_IR): people rarely type diacritics, neither in search terms nor in their daily dialogues.

So I’d say one can forget about them, at least for this locale.

7 Likes

Okay, from the discussion above it seems stripping diacritics for search should be enabled by default at least for:

French, Portuguese (and by extension Spanish as well, I guess), Arabic, Farsi, Czech.

I have it on good authority that Turkish should be stripped as well. Hungarian most probably as well, but would be good to have additional confirmation, maybe @asrob could confirm?

Of the other languages on the list, I can definitely say that Slovak should be included (very, very similar to Czech), and most probably Polish and Slovene as well.

4 Likes
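The idea of per-locale defaults plus an admin switch could be sketched roughly like this. The locale codes and the function name are assumptions for illustration and are not taken from the actual PR; the set only covers the languages listed as "at least" above.

```python
# Hypothetical locale codes for the languages named so far in this topic.
DEFAULT_ON_LOCALES = {"fr", "pt", "es", "ar", "fa_IR", "cs"}


def ignore_accents_enabled(locale, admin_override=None):
    """An explicit admin choice wins; otherwise use the per-locale default."""
    if admin_override is not None:
        return admin_override
    return locale in DEFAULT_ON_LOCALES


print(ignore_accents_enabled("fr"))                        # default on
print(ignore_accents_enabled("he"))                        # default off
print(ignore_accents_enabled("he", admin_override=True))   # admin turned it on
```

Keeping the default off for unlisted locales matches the "safe side" suggestion earlier in the topic: languages nobody has vouched for stay untouched unless an admin opts in.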

@danekhollas Also, the Greek language needs to be included in that list.

3 Likes

In Romanian you can strip all diacritics: we can write and read with the English letters w/o any problems.

3 Likes

Same with Catalan

(20 chars)

3 Likes

The PR is up :+1:

https://github.com/discourse/discourse/pull/6397/files

9 Likes

While I’m not involved in this i18n, it has been a most interesting discussion to learn about how other languages are actually written. Thanks to everyone who shared their written language perspective. A virtual like to all of you! :slight_smile:

6 Likes