If you really want to push this forward prior to release, I am open to a site setting here. I guess we can live with one less translation if people do not pick it up in time, but it must be enabled specifically in languages that we know need this, rather than globally.
My take is that what Sam means by “one less translation” has to do with the setting’s token-value pair, e.g. ignore_accents: 'ignore accents in search'
So, say for French, that might (a total Google Translate guess here) be more like ignore_accents: 'ignorer les accents dans la recherche'
The blog post you linked to is more about why the key wouldn’t be translated as well, as in ignorer_les_accents: 'ignorer les accents dans la recherche'
To get to the point, if a site used the French locale but had not pulled in the new token-value pair with the translation in its locale yml file, the fallback would be to display the English text. It would look out of place, but it would be only “one translation”.
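A rough sketch of that fallback, in Python rather than anything Discourse actually runs. The `TRANSLATIONS` table and the `translate` helper are made up for illustration; only the `ignore_accents` key and its English text come from the example above.

```python
# Hypothetical in-memory stand-in for the per-locale yml files.
TRANSLATIONS = {
    "en": {"ignore_accents": "ignore accents in search"},
    "fr": {},  # the French file has not pulled in the new key yet
}


def translate(key, locale, default_locale="en"):
    """Return the string for `key` in `locale`, else the default-locale string."""
    table = TRANSLATIONS.get(locale, {})
    if key in table:
        return table[key]
    # Missing translation: fall back to the English text.
    return TRANSLATIONS[default_locale][key]


# A French site sees the English string until the translation lands.
print(translate("ignore_accents", "fr"))
```

Once the French file gains the key, the same call returns the French string and nothing else has to change.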
Yes, I understand that, but my point was different.
I was partly joking, but merely pointing out that the untranslated string would be visible only to forum admins, who supposedly should have at least rudimentary English knowledge anyway.
If we are weighing alternatives for a major release, then working search seems like a much bigger deal than one untranslated string.
We also need to walk through every locale we have (there are quite a lot) and make a reasonably informed decision about whether to enable the feature or not.
Reading Diacritic - Wikipedia shows how complex the situation is. Stripping diacritics is also not great for Scandinavian languages. This is a hairy issue.
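To make the Scandinavian concern concrete, here is a small Python sketch of the naive approach (Unicode NFD decomposition, then dropping combining marks). The `strip_marks` helper is hypothetical and is not what Discourse search does; the sample words are only illustrations.

```python
import unicodedata


def strip_marks(text):
    """Naive stripping: decompose to NFD, then drop nonspacing marks (Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# "å" decomposes into "a" plus a combining ring, so the ring is dropped and
# two distinct Swedish words collapse into one.
print(strip_marks("f\u00e5r"))  # far

# "ø" has no canonical decomposition, so it passes through unchanged.
print(strip_marks("\u00f8l"))  # øl
```

So the same rule both merges letters that these languages treat as distinct and leaves other letters alone, which is part of why a global default looks risky.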
In Hebrew, for example, you probably would not strip, since zero people write with diacritics online, so a bunch of domain knowledge is required.
Thanks, will have a look tomorrow. Agree it is not simple, but to be on the safe side, we can leave the default off for most langs and leave the decision to admins.
I think German is another one, do we have somebody to confirm? @tobiaseigen sounds like a German name perhaps?
Hungarian and Turkish are also strong candidates, I will ask around.
I suspect that Google has an automatic per-word heuristic that is language-neutral. I doubt we have any chance of matching what they do.
@Osama / @Pad_Pors what is the correct thing to do in Arabic? I can see plenty of diacritics in macdiscussions.udacity.com but I am not sure which ought to be stripped for search and which should be kept intact.
In Arabic we almost never type diacritics in day-to-day communications, because an Arabic diacritic is a separate character that you need to type in addition to the base character you want to add the diacritic to. You can imagine how painful that is. So I’d expect search engines to always find results with and without diacritics whether I type diacritics or not.
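A minimal Python sketch of that expectation: strip the marks from both the query and the indexed text, then compare. The `strip_diacritics` and `matches` helpers are hypothetical, and dropping every nonspacing mark is a simplifying assumption, not a claim about which Arabic marks ought to be kept.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose to NFD and drop nonspacing combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def matches(query, document):
    """Hypothetical search check: both sides are stripped before comparing."""
    return strip_diacritics(query).lower() in strip_diacritics(document).lower()


# The same Arabic word, written with and without vowel marks (escaped so the
# separate combining characters are visible in the source).
with_marks = "\u0645\u064f\u062d\u064e\u0645\u0651\u064e\u062f"
without_marks = "\u0645\u062d\u0645\u062f"

print(matches(without_marks, with_marks))  # True
print(matches(with_marks, without_marks))  # True
print(matches("recherche", "Recherch\u00e9"))  # True
```

Because the normalization is applied on both sides, it does not matter whether the searcher or the author typed the marks.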
Okay, from the discussion above it seems stripping diacritics for search should be enabled by default at least for:
French, Portuguese (and by extension Spanish as well I guess), Arabic, Farsi, Czech.
I have it on good authority that Turkish should be stripped as well. Hungarian most probably as well, but it would be good to have additional confirmation; maybe @asrob could confirm?
Of the other langs on the list, I can definitely say that Slovak should be included (very very similar to Czech), and most probably Polish and Slovene as well.
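For what it is worth, the "default per locale, admin decides" idea could look roughly like this in Python. The names `DEFAULT_ON`, `NEEDS_CONFIRMATION` and `ignore_accents_enabled` are invented for this sketch; the language lists just mirror what was said in this topic and are not a final decision.

```python
# Locales named in this topic as good candidates for stripping by default.
DEFAULT_ON = {"fr", "pt", "es", "ar", "fa", "cs", "tr", "sk"}

# Locales mentioned as "most probably", still waiting for confirmation.
NEEDS_CONFIRMATION = {"hu", "pl", "sl"}


def ignore_accents_enabled(locale, admin_override=None):
    """Return whether search should strip diacritics for this site.

    An explicit admin choice always wins; otherwise use the locale default.
    """
    if admin_override is not None:
        return admin_override
    # Reduce "pt_BR" or "pt-BR" to the base language code "pt".
    language = locale.replace("-", "_").split("_")[0].lower()
    return language in DEFAULT_ON


print(ignore_accents_enabled("fr"))         # True
print(ignore_accents_enabled("pt_BR"))      # True
print(ignore_accents_enabled("hu"))         # False until confirmed
print(ignore_accents_enabled("fr", False))  # False, admin turned it off
```

Keeping the unconfirmed locales in their own set makes it easy to move them over once a native speaker weighs in.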
While I’m not involved in this i18n, it has been a most interesting discussion to learn about how other languages are actually written. Thanks to everyone who shared their written language perspective. A virtual like to all of you!