Discourse should ignore if a character is accented when doing a search

planned

(Coin-coin le Canapin) #1

Hi.
In French, we have a lot of accented characters, and also a lot of people who don’t bother typing them when writing.

Because Discourse seems to be strict about accents, a user who enters a string that may contain accents gets only partial results, both from the search feature and from the category filter when creating a topic:

[Screenshots not preserved: search results with and without accents, then the category filter in topic creation with and without accents.]

I think this is a major flaw regarding the search efficiency in some languages.


(Sam Saffron) #2

I support a change that strips off all accent chars prior to tokenizing; it feels reasonably safe.

Going to add this to our 2.1 roadmap.


(Daniel Hollas) #3

OMG, thanks in advance! This is a huge issue for the Czech alphabet as well.

I am slightly confused though because I am pretty sure I saw a related topic here on META which appeared to claim that this should be handled by a proper setting of PostgreSQL DB.


(Sam Saffron) #4

In theory it should, once we configure unaccent correctly, per the PostgreSQL 9.6 documentation on unaccent.

But I feel like the simplest and most robust thing to do here is to never send these accents up to the index in the first place, and to do the work in Ruby to remove them. Then we do not rely on PG text search being configured just so.
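
A minimal sketch of the "do the work in Ruby" idea, assuming that NFD decomposition followed by removal of combining marks is an acceptable way to strip accents. The method name `strip_diacritics` is hypothetical and is not taken from the Discourse codebase.

```ruby
# Hypothetical helper, not Discourse's actual implementation.
# NFD splits each accented character into a base letter plus combining
# marks; the gsub then drops the combining marks (Unicode category Mn).
def strip_diacritics(text)
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

# The indexed text and the search term would both go through the same
# helper, so an accented word and its unaccented spelling produce the
# same token.
indexed = strip_diacritics("élève")
query   = strip_diacritics("eleve")
puts indexed == query
```

Letters that have no canonical decomposition, such as "ø", pass through this sketch unchanged, so a real implementation might need an extra mapping for them.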

That said if you are feeling brave I recommend you try some of those options above to correct this.


(Rafael dos Santos Silva) #5

@joffreyjaffeux what is your opinion on this regarding select-kit?

We would need something like this:


(Joffrey Jaffeux) #6

Should be easy to support and a good idea. normalize is not supported by IE, but maybe we can make it work with unorm (walling/unorm on GitHub).


(Joffrey Jaffeux) #7

I just merged a commit supporting this:

For now it won’t work in browsers that don’t have native support for normalize, as shown here: String.prototype.normalize() | MDN

@canapin if you have time to test this it should solve the issue you had in topic creation.


(Robert) #8

This bug report from 2017 is related:


(Daniel Hollas) #9

Hm, just so I understand, this was fixed only partially, right? The full search is still sensitive to accents as far as I can tell. I am only confused because this was marked as done in


(Sam Saffron) #10

Yes, this is not complete yet; we still need to get to stripping accents from search here:


(Régis Hanol) #12

Ít̊’̊s̘ n̐ǫw͋ d̻oͭn͑e̐ :ok_hand:


(Régis Hanol) closed #13

This topic was automatically closed after 41 minutes. New replies are no longer allowed.


(Sam Saffron) #14

Note that search within a topic is still going to have the issue; sadly, it is much harder to sort out.


(Sam Saffron) opened #15

(Sam Saffron) #16

OK, this is absolutely not PR-welcome territory, because it is a very tricky problem to solve.

In particular there are 2 very important things to keep in mind:

  1. Diacritic stripping for search should be optional and default to off for some languages, like Vietnamese. In Vietnamese you never want to strip diacritics, because you end up getting nonsensical results.

  2. The excerpts we show for search results should always show the diacritics, because otherwise it just looks like there are a bunch of silly spelling mistakes, even in French. This is a hard problem because we lean on PG to create the excerpts.

Given (1), I am reverting this feature for now, and given (2), this is something we will tackle for the 2.2 release.
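
The two constraints above can be sketched as follows, under stated assumptions: a boolean flag (the hypothetical `ignore_accents`) decides whether diacritics are stripped, and only the searchable copy of the text is normalized while the original is kept for display. None of these names come from Discourse, and the real excerpts are generated by PostgreSQL, which this sketch does not model.

```ruby
# Hypothetical sketch; every name here is illustrative, not Discourse's API.
SearchEntry = Struct.new(:original, :searchable)

# Normalizes text for matching. With ignore_accents off (e.g. for
# Vietnamese), diacritics are preserved and only case is folded.
def normalize_for_search(text, ignore_accents:)
  return text.unicode_normalize(:nfc).downcase unless ignore_accents
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, "").downcase
end

# Keeps the untouched text for excerpts next to the normalized copy
# that is used only for matching.
def build_entry(text, ignore_accents:)
  SearchEntry.new(text, normalize_for_search(text, ignore_accents: ignore_accents))
end

def matches?(entry, term, ignore_accents:)
  entry.searchable.include?(normalize_for_search(term, ignore_accents: ignore_accents))
end

entry = build_entry("Économie française", ignore_accents: true)
puts matches?(entry, "economie francaise", ignore_accents: true)
puts entry.original # the excerpt source still carries its diacritics
```

Matching against one copy and displaying the other is the simplest way to satisfy both points in a sketch; doing the same when the database builds the excerpt is the hard part described above.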

Reverted per:


(Daniel Hollas) #17

Do I understand correctly that there is currently no option to keep this feature?
In that case I might want to keep an older version on our Czech site for a while…


(Sam Saffron) #18

Yeah, hold tight, expect us to redo the feature in the next 3-4 weeks. But if there are security fixes I do recommend just giving up and upgrading. (there was one yesterday)

I am curious: in Czech, don’t the excerpts look a bit “kindergarten” level of spelling when we strip the diacritics out of them, or is it something people do not notice?


(Daniel Hollas) #19

There are many people who just do not bother with diacritics because they are harder to type (we basically have a QWERTY keyboard with accented characters instead of numbers on the top row). So while it does not look great, incomplete search is a MUCH bigger problem for us, especially because people expect this to just work (i.e. habits from internet search engines).

I expect the same is true for most Slavic languages.

Thanks for working on this! Good to see that Discourse is spreading and i18n issues are becoming more important. :stuck_out_tongue:

EDIT: You can multiply the pain of typing accented characters x10 on mobile. :slight_smile:


(Sam Saffron) #20

We can add the feature back behind a site setting very shortly after our upcoming release in 1 to 1.5 weeks; I just want to avoid adding strings now which we end up needing to translate.

Then we can spend the extra time needed on fixing excerpts after we get the optional feature in.


(Régis Hanol) #21

I also agree with @danekhollas. Having the search work is a lot better than having accents in :fr: search results. Most people would not even notice…