Discourse should ignore if a character is accented when doing a search

Canapin · June 19, 2018, 10:52am

Hi.
In French, we have a lot of accented characters, and also a lot of people that don’t bother typing them when writing.

Because Discourse seems to be strict about accents, a user who enters a string that could contain accents gets only partial results, both in the search feature and in the category filter when creating a topic:

And

And so on…

In the topic creation:

But:

I think this is a major flaw regarding the search efficiency in some languages.

sam · June 20, 2018, 1:03am

I support a change that strips off all accent chars prior to tokenizing; it feels reasonably safe.

Going to add this to our 2.1 roadmap.

danekhollas · June 20, 2018, 1:45am

OMG, thanks in advance! This is a huge issue for the Czech alphabet as well.

I am slightly confused though because I am pretty sure I saw a related topic here on META which appeared to claim that this should be handled by a proper setting of PostgreSQL DB.

sam · June 20, 2018, 1:50am

In theory it should once we configure unaccent right per: https://www.postgresql.org/docs/9.6/static/unaccent.html

But I feel like the simplest and most robust thing to do here is just to never send these accents up to the index in the first place and do the work in Ruby to remove them. Then we do not rely on PG text search being configured just so.

That said if you are feeling brave I recommend you try some of those options above to correct this.
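
For illustration, here is a minimal sketch of what "do the work in ruby" could look like. It is not the code Discourse ended up shipping. The helper name `strip_diacritics` is hypothetical, and the approach (Unicode NFD decomposition followed by removal of combining marks) is an assumption about one reasonable way to do it:

```ruby
# Hypothetical helper, not taken from the Discourse codebase.
# Assumes the input is a UTF-8 (or other Unicode-encoded) string.
def strip_diacritics(text)
  # NFD splits a precomposed letter such as "é" into the base letter "e"
  # followed by a combining mark (U+0301). \p{Mn} matches those
  # nonspacing marks, so removing them leaves only the base letters.
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

puts strip_diacritics("éléphant français")
```

Running the same helper over both the text being indexed and the search term would make "elephant" and "éléphant" compare equal. Letters that have no Unicode decomposition, such as "ø" or "ł", pass through unchanged, so this only covers accents that are encoded as combining marks.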

Falco · June 26, 2018, 12:04am

@j.jaffeux what is your opinion on this regarding select-kit ?

We would need something like this:

j.jaffeux · June 26, 2018, 7:04am

Should be easy to support and a good idea. normalize is not supported by IE but maybe we can make it work with https://github.com/walling/unorm/blob/master/src/unorm.js

j.jaffeux · June 26, 2018, 10:21am

I just merged a commit supporting this:

https://github.com/discourse/discourse/commit/eb9b99e5193ab82e551351d07b973ad9714ff378

For now it won’t work with browsers that don’t have native support for normalize, as shown here: String.prototype.normalize() - JavaScript | MDN

@Canapin if you have time to test this it should solve the issue you had in topic creation.

rriemann · August 18, 2018, 4:38pm

This bug report from 2017 is related:

danekhollas · August 23, 2018, 2:40am

Hm, just so I understand, this was fixed only partially, right? The full search is still sensitive to accents as far as I can tell. I am only confused because this was marked as done in

sam · August 23, 2018, 2:46am

Yes this is not complete yet, we still need to get to stripping accents from search here:

https://github.com/discourse/discourse/blob/5a6d1ee25788f308985b8f5801d32fce0e4505bc/app/services/search_indexer.rb#L168-L216

zogstrip · August 23, 2018, 3:17pm

Ít̊’̊s̘ n̐ǫw͋ d̻oͭn͑e̐

https://github.com/discourse/discourse/commit/2fcf2b899e6193701b5c56414980789f0d0f2cbd

zogstrip · August 23, 2018, 4:02pm

This topic was automatically closed after 41 minutes. New replies are no longer allowed.

sam · August 23, 2018, 9:12pm

Note that search within a topic is still going to have the issue; sadly, it is much harder to sort out.

sam · August 31, 2018, 1:30am

OK this is absolutely not PR welcome territory cause it is a very tricky problem to solve.

In particular there are 2 very important things to keep in mind:

1. Diacritic stripping for search should be optional and default off on some languages like Vietnamese. In Vietnamese you never want to strip diacritics cause you end up getting nonsensical results.
2. The excerpts we show for search results should always show the diacritics, cause otherwise it just looks like there are a bunch of silly spelling mistakes, even in French. This is a hard problem cause we lean on PG to create the excerpts.

Given (1) I am reverting this feature for now, and given (2) this is something we will tackle for the 2.2 release.

Reverted per:

https://github.com/discourse/discourse/commit/9b7cab589ac15a034d1c0e700230c1b3f63f8ba0
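
To make point (1) concrete, here is a rough sketch of locale-aware, optional stripping. Everything named here is hypothetical: `LOCALES_KEEPING_DIACRITICS`, `normalize_for_search`, and the `strip_setting` override are illustrative stand-ins, not existing Discourse settings or methods.

```ruby
# All names below are hypothetical, for illustration only.
# Locales where stripping is off by default because it changes meaning.
LOCALES_KEEPING_DIACRITICS = %w[vi].freeze

# strip_setting: nil means "use the per-locale default";
# true or false is an explicit admin override.
def normalize_for_search(text, locale:, strip_setting: nil)
  strip =
    if strip_setting.nil?
      !LOCALES_KEEPING_DIACRITICS.include?(locale)
    else
      strip_setting
    end
  return text unless strip

  # Decompose accented letters, then drop the combining marks.
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

puts normalize_for_search("café", locale: "fr")
puts normalize_for_search("tiếng Việt", locale: "vi")
```

The sketch only returns the form used for matching. The original text would still have to be kept alongside it for display, which is the part that point (2) makes difficult when PG generates the excerpts.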

danekhollas · August 31, 2018, 6:52am

Do I understand correctly that there is currently no option to keep this feature?
In that case I might want to stay on an older version on our Czech site for a while…

sam · August 31, 2018, 7:11am

Yeah, hold tight, expect us to redo the feature in the next 3-4 weeks. But if there are security fixes I do recommend just giving up and upgrading. (there was one yesterday)

I am curious: in Czech, don’t the excerpts look a bit “kindergarten” level in their spelling when we strip the diacritics out of them, or is it something people do not notice?

danekhollas · August 31, 2018, 7:21am

There are many people who just do not bother with diacritics because they are harder to type (we basically have a QWERTY keyboard with accented characters instead of numbers on the top row). So while it does not look great, incomplete search is a MUCH bigger problem for us, especially because people expect this to just work (i.e. habits from internet search engines).

I expect the same is true for most Slavic languages.

Thanks for working on this! Good to see that Discourse is spreading and i18n issues are becoming more important.

EDIT: You can multiply the pain of typing accented characters x10 on mobile.

sam · August 31, 2018, 7:28am

We can add the feature back behind a site setting very very shortly post our upcoming release in 1 - 1.5 weeks, I just want to avoid adding strings now which we end up needing to translate.

Then we can spend the extra time needed on fixing excerpts after we get the optional feature in.

zogstrip · August 31, 2018, 7:34am

I also agree with @danekhollas. Having the search work is a lot better than having accents in search results. Most people would not even notice…
