Discourse should ignore whether a character is accented when doing a search

Hi.
In French, we have a lot of accented characters, and also a lot of people who don’t bother typing them when writing.

Because Discourse seems to be strict about accents, a user who types a string without its accents gets only partial results, both in the search feature and in the category filter when creating a topic:

And

And so on…

In the topic creation:

But:

I think this is a major flaw for search efficiency in some languages.

21 Likes

I support a change that strips all accent characters prior to tokenizing; it feels reasonably safe.

Going to add this to our 2.1 roadmap.

16 Likes

OMG, thanks in advance! This is a huge issue for the Czech alphabet as well.

I am slightly confused though because I am pretty sure I saw a related topic here on META which appeared to claim that this should be handled by a proper setting of PostgreSQL DB.

In theory it should, once we configure unaccent correctly per: https://www.postgresql.org/docs/9.6/static/unaccent.html

But I feel like the simplest and most robust thing to do here is to never send these accents to the index in the first place and do the work in Ruby to remove them. Then we do not rely on PG text search being configured just so.

That said, if you are feeling brave, I recommend you try some of those options above to correct this.
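To illustrate the idea of stripping accents before anything reaches the index, here is a minimal sketch. It is written in Python rather than Ruby, it is not the Discourse implementation, and the names `strip_accents` and `tokenize` are hypothetical. It assumes that decomposing to NFD and dropping combining marks (Unicode category `Mn`) is an acceptable definition of "removing accents".

```python
import unicodedata
from typing import List


def strip_accents(text: str) -> str:
    """Return text with combining diacritical marks removed."""
    # NFD splits "é" into "e" followed by U+0301 (combining acute accent).
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (category Mn) and keep the base letters.
    kept = [ch for ch in decomposed if unicodedata.category(ch) != "Mn"]
    return unicodedata.normalize("NFC", "".join(kept))


def tokenize(text: str) -> List[str]:
    """Hypothetical tokenizer: strip accents, lowercase, split on whitespace."""
    return strip_accents(text).lower().split()


# The same folding is applied to the indexed text and to the query,
# so an unaccented query still matches accented content.
indexed_tokens = tokenize("Créer une catégorie")
query_tokens = tokenize("creer categorie")
print(all(token in indexed_tokens for token in query_tokens))  # True
```

Because the folding happens in application code, this approach would not depend on how the database's text search is configured. Letters that are not base letter plus combining mark (for example "ø" or "œ") are left unchanged by this sketch.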

4 Likes

@joffreyjaffeux what is your opinion on this regarding select-kit?

We would need something like this:

2 Likes

Should be easy to support, and it is a good idea. normalize is not supported by IE, but maybe we can make it work with https://github.com/walling/unorm/blob/master/src/unorm.js
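As a rough illustration of accent-insensitive filtering in a dropdown, here is a sketch in Python using `unicodedata.normalize`, which plays the role that `String.prototype.normalize` would play in the browser. This is not the select-kit code; `fold`, `filter_options`, and the category names are made up for the example.

```python
import unicodedata
from typing import List


def fold(text: str) -> str:
    """Strip combining marks and case so that "Événements" becomes "evenements"."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    ).casefold()


def filter_options(options: List[str], typed: str) -> List[str]:
    """Keep the options whose folded name contains the folded input."""
    needle = fold(typed)
    # Compare folded strings, but return the original, accented names.
    return [name for name in options if needle in fold(name)]


categories = ["Général", "Événements", "Support"]
print(filter_options(categories, "eve"))  # ['Événements']
```

The filter only folds for comparison, so the list shown to the user keeps its accents.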

2 Likes

I just merged a commit supporting this:

https://github.com/discourse/discourse/commit/eb9b99e5193ab82e551351d07b973ad9714ff378

For now it won’t work with browsers that don’t have native support for normalize, as listed here: String.prototype.normalize() - JavaScript | MDN

@Canapin if you have time to test this it should solve the issue you had in topic creation.

8 Likes

This bug report from 2017 is related:

3 Likes

Hm, just so I understand, this was fixed only partially, right? The full search is still sensitive to accents as far as I can tell. I am only confused because this was marked as done in

1 Like

Yes, this is not complete yet; we still need to get to stripping accents from search here:

https://github.com/discourse/discourse/blob/5a6d1ee25788f308985b8f5801d32fce0e4505bc/app/services/search_indexer.rb#L168-L216

5 Likes

Ít̊’̊s̘ n̐ǫw͋ d̻oͭn͑e̐ :ok_hand:

https://github.com/discourse/discourse/commit/2fcf2b899e6193701b5c56414980789f0d0f2cbd

11 Likes

This topic was automatically closed after 41 minutes. New replies are no longer allowed.

Note that search within a topic is still going to have the issue; sadly, it is much harder to sort out.

2 Likes

OK, this is absolutely not PR welcome territory, because it is a very tricky problem to solve.

In particular, there are 2 very important things to keep in mind:

  1. Diacritic stripping for search should be optional and default to off for some languages, like Vietnamese. In Vietnamese you never want to strip diacritics, because you end up getting nonsensical results.

  2. The excerpts we show for search results should always show the diacritics, because otherwise it just looks like there are a bunch of silly spelling mistakes, even in French. This is a hard problem because we lean on PG to create the excerpts.

Given (1) I am reverting this feature for now, and given (2) this is something we will tackle for the 2.2 release.
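The two constraints above can be sketched together: match against a folded copy of the text, but cut the excerpt out of the original so the diacritics survive, and make the folding switchable per locale. This is only an illustration in Python, not how Discourse or PostgreSQL build excerpts. `STRIP_DIACRITICS_BY_LOCALE`, `fold_with_map`, and `find_excerpt` are hypothetical names, and treating every locale other than Vietnamese as "on" is an assumption made for the example.

```python
import unicodedata
from typing import List, Optional, Tuple

# Hypothetical per-locale setting; only Vietnamese is named in the thread.
STRIP_DIACRITICS_BY_LOCALE = {"vi": False}


def fold_with_map(text: str, strip: bool) -> Tuple[str, List[int]]:
    """Fold text and record, for each folded char, its index in the original."""
    folded: List[str] = []
    index_map: List[int] = []
    for i, ch in enumerate(text):
        pieces = unicodedata.normalize("NFD", ch) if strip else ch
        for piece in pieces:
            if strip and unicodedata.category(piece) == "Mn":
                continue  # drop the combining mark, keep the base letter
            for out in piece.lower():
                folded.append(out)
                index_map.append(i)
    return "".join(folded), index_map


def find_excerpt(text: str, query: str, locale: str) -> Optional[str]:
    """Return the first match as it appears in the original text, or None."""
    strip = STRIP_DIACRITICS_BY_LOCALE.get(locale, True)  # assumed default
    text = unicodedata.normalize("NFC", text)
    haystack, index_map = fold_with_map(text, strip)
    needle, _ = fold_with_map(unicodedata.normalize("NFC", query), strip)
    if not needle:
        return None
    pos = haystack.find(needle)
    if pos < 0:
        return None
    start = index_map[pos]
    end = index_map[pos + len(needle) - 1] + 1
    # Include any combining marks that trail the last matched character.
    while end < len(text) and unicodedata.combining(text[end]):
        end += 1
    return text[start:end]


print(find_excerpt("La catégorie est créée", "categorie", "fr"))  # catégorie
```

With the `"vi"` locale the same function compares accented text as-is, so a query for "ma" would not match "má".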

Reverted per:

https://github.com/discourse/discourse/commit/9b7cab589ac15a034d1c0e700230c1b3f63f8ba0

8 Likes

Do I understand correctly that there is currently no option to keep this feature?
In which case I might want to keep an older version on our Czech site for a while…

Yeah, hold tight, expect us to redo the feature in the next 3-4 weeks. But if there are security fixes, I do recommend just giving up and upgrading. (There was one yesterday.)

I am curious: in Czech, don’t the excerpts look a bit “kindergarten” level of spelling when we strip the diacritics out of them, or is it something people do not notice?

3 Likes

There are many people who just do not bother with diacritics because they are harder to type (we basically have a QWERTY keyboard with accented characters instead of numbers on the top row). So while it does not look great, incomplete search is a MUCH bigger problem for us, especially because people expect this to just work (i.e. habits from internet search engines).

I expect the same is true for most Slavic languages.

Thanks for working on this! Good to see that Discourse is spreading and i18n issues are becoming more important. :stuck_out_tongue:

EDIT: You can multiply the pain of typing accented characters x10 on mobile. :slight_smile:

5 Likes

We can add the feature back behind a site setting very shortly after our upcoming release in 1 to 1.5 weeks; I just want to avoid adding strings now that we would end up needing to translate.

Then we can spend the extra time needed on fixing excerpts after we get the optional feature in.

2 Likes

I also agree with @danekhollas. Having the search work is a lot better than having accents in :fr: search results. Most people would not even notice…

7 Likes