Discourse should ignore if a character is accented when doing a search

zogstrip · September 17, 2018, 9:37am

This is now in with a new name

https://github.com/discourse/discourse/commit/4481836de2feb4813b6282a6ec4ae4fdde509627

danekhollas · October 2, 2018, 9:17pm

Hmm, unfortunately it seems we’re not quite there yet.

I see two big issues after a bit of testing (upgraded today to the latest Discourse version):

1. Diacritics are not stripped from the query string itself, i.e. if I search with a word that includes diacritics, I do not find anything.
2. Diacritics are not stripped from post titles.

sam · October 17, 2018, 11:34pm

@zogstrip this feels like something we have got to get sorted. @danekhollas, how brave are you feeling? Do you want to try a PR?

danekhollas · October 18, 2018, 1:45am

Thank you @sam for asking, I am humbled. I can do some exploration during the weekend, but I want to be honest with you: in any case this would most likely require a lot of hand-holding, and I do not want you to spend more time with me than you would need to fix this yourselves. It’s also possible that I will hit a wall in the process and you’d have to finish up (by which I mean I’d need to go too deep into Rails/Ruby/etc. to solve the issues).

If you’re okay with that, some initial pointers on where to look would be appreciated. I’ll be looking at the PRs from @zogstrip in this topic, but I don’t think that they hit all the components that might need to be modified.

Otherwise, don’t let me stop you…

zogstrip · October 18, 2018, 9:56am

You’ll probably need to extract the strip_diacritics method so it can also be used in the Search.prepare_data method when SiteSetting.search_ignore_accents is enabled.
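A minimal sketch of that idea, not the actual Discourse code: the module name `TextFolding`, the `prepare_search_text` helper, and the `IGNORE_ACCENTS` constant are hypothetical stand-ins for wherever strip_diacritics ends up, for Search.prepare_data, and for SiteSetting.search_ignore_accents. The stripping itself is assumed to be NFD normalization followed by removal of combining marks. The point is that the same helper runs on both the indexed text and the query.

```ruby
# Hypothetical sketch; names are illustrative, not Discourse's real code.
module TextFolding
  # Decompose to NFD so accents become separate combining marks,
  # then drop those marks (Unicode category Mn).
  def self.strip_diacritics(str)
    str.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
  end
end

# Stand-in for SiteSetting.search_ignore_accents.
IGNORE_ACCENTS = true

# Stand-in for a prepare step shared by indexing and querying.
def prepare_search_text(text, ignore_accents: IGNORE_ACCENTS)
  ignore_accents ? TextFolding.strip_diacritics(text) : text
end

indexed = prepare_search_text("Příliš žluťoučký kůň")
query   = prepare_search_text("žluťoučký")
puts indexed.include?(query) # the accented query matches the stripped index text
```

If only the indexed side is stripped, an accented query can never match, which is the first issue reported above.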

asrob · October 19, 2018, 4:25pm

There might be problems, some examples:

“álom” means “dream”.
“alom” means “litter”.
“rag” means “suffix” / “inflection”.
“rág” means “chew”.
“kar” means “arm”.
“kár” means “damage”.

But I think end users can search more precisely, so +1 from me.
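To make the collision concrete, here is a small sketch that groups the words above by their accent-stripped form. The `fold` helper is my own assumption of how stripping might work (NFD plus removal of combining marks), not necessarily what Discourse does.

```ruby
# Assumed stripping approach: NFD normalization, then drop combining marks.
def fold(word)
  word.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

words = %w[álom alom rag rág kar kár]

# Group the original words by what they look like after stripping.
collisions = words.group_by { |w| fold(w) }

collisions.each do |folded, originals|
  puts "#{folded}: #{originals.join(', ')}"
end
```

Each pair ends up under a single key, so a search for either word returns posts containing both.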

danekhollas · October 19, 2018, 8:50pm

Thanks for chiming in! I guess it depends on how many of these examples there are. If there are not that many, then I’d say it is always better to get more search results, albeit sometimes irrelevant, than to get no results at all.

Hector · October 19, 2018, 11:20pm

This can be better handled at the database level with an appropriate collation, instead of the blunt approach of stripping accents. Many databases offer accent insensitive collations for different languages. Surprisingly, compared to other databases Postgres is lacking in this respect.

Since version 10 Postgres has started incorporating support for ICU (International Components for Unicode)[1]. This library provides proper handling of accents for sorting and searching. Unfortunately not all functionality has been integrated yet. But maybe it’s worth keeping an eye on this area of Postgres development.

[1] https://blog.2ndquadrant.com/icu-support-postgresql-10/

danekhollas · October 22, 2018, 4:22pm

@zogstrip Thank you for the pointers, they seem to do the trick! I’ve made a PR:

https://github.com/discourse/discourse/pull/6518

I’ve tried adding some tests, but was generally super confused about how rspec works. They do seem to work though.

One open question: the search log still includes accents if the strip_diacritics function is called from Search.prepare_data. I am not sure that is the desired behaviour, since there will be separate log entries for queries that are identical from the search perspective.
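A toy illustration of the logging concern, using plain hashes as a hypothetical stand-in for the search log (this is not how Discourse stores it): logging the raw term splits counts across spellings, while logging the stripped term merges them.

```ruby
# Assumed stripping approach: NFD normalization, then drop combining marks.
def fold(term)
  term.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

queries = ["kár", "kar", "kár"]

raw_log    = Hash.new(0) # keyed by the term as typed
folded_log = Hash.new(0) # keyed by the term as actually searched

queries.each do |q|
  raw_log[q] += 1
  folded_log[fold(q)] += 1
end

puts raw_log.size    # distinct entries when logging raw terms
puts folded_log.size # distinct entries when logging stripped terms
```

Which of the two is preferable depends on whether the log is meant to show what users typed or what the search engine saw.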

danekhollas · October 23, 2018, 5:41pm

Thanks Sam for merging! Just upgraded our forum and it works well.

I am quite confused about the second issue. I could not reproduce it in my dev environment, but it was definitely a problem on our production forum, even after the upgrade (and it was affecting new post titles as well). I eventually fixed it by rebuilding the Postgres search index, i.e.:

cd /var/discourse
./launcher enter app
rake search:reindex

@sam we’ve already discussed triggering the search index rebuild for everyone, but I am not sure that it happened. Perhaps now is a good time after my fix was merged.

FYI for others: outstanding issues related to search w/o diacritics:

1. Does not work for in-topic search.
2. Diacritics should not be stripped from excerpts on the search page.
   - Need to make sure that “word-boldening” on the search page works correctly after 2. is fixed.
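For the excerpt and boldening items, one possible approach (purely a sketch, with a hypothetical `highlight` helper and `<b>` markup chosen for illustration) is to keep the excerpt unstripped and compare stripped forms word by word when deciding what to bold:

```ruby
# Assumed stripping approach: NFD, drop combining marks, lowercase.
def fold(text)
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, "").downcase
end

# Wrap each word of the ORIGINAL excerpt whose stripped form equals
# the stripped search term; all other text is left untouched.
def highlight(excerpt, term)
  target = fold(term)
  excerpt.gsub(/[\p{L}\p{M}]+/) do |word|
    fold(word) == target ? "<b>#{word}</b>" : word
  end
end

puts highlight("A kár nagy volt", "kar")
```

This prints the excerpt with the accented word intact and wrapped in bold tags. It only handles whole-word matches; stemming or partial matches would need more work.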

sam · October 23, 2018, 8:56pm

I want to wait a tiny bit more. A rebuild of the index is quite expensive, so I want to make sure I don’t dish out the cost too early.

danekhollas · October 24, 2018, 3:25am

Okay, if it is tiny. For others: if you update to tests-passed right now and have a locale for which this site setting is on by default, your search will be badly broken, so you need to either turn off the setting or rebuild the index manually as described above.

Actually, when somebody changes this setting, it needs to trigger the rebuild, otherwise it’s just not gonna work. Is that even possible to do? @sam @zogstrip

sam · October 24, 2018, 3:29am

Technically yes, we could do something like that by running a query to reset the version of the index. The trouble here is that re-indexes really are pretty expensive. Keeping this in mind though.
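A rough in-memory model of the "reset the version" idea, assuming each indexed row remembers the index version it was built with. The `Row` struct, `INDEX_VERSION`, `reset_versions`, and `reindex_stale` are all hypothetical names for illustration and do not reflect the real Discourse schema or jobs.

```ruby
# Hypothetical model: rows are re-indexed only when their stored
# version differs from the current index version.
INDEX_VERSION = 2

Row = Struct.new(:id, :raw, :indexed, :version)

# Assumed stripping approach: NFD normalization, then drop combining marks.
def fold(text)
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

# Rebuild only stale rows; returns how many rows were rebuilt.
def reindex_stale(rows, ignore_accents:)
  stale = rows.reject { |r| r.version == INDEX_VERSION }
  stale.each do |r|
    r.indexed = ignore_accents ? fold(r.raw) : r.raw
    r.version = INDEX_VERSION
  end
  stale.size
end

# What a setting-change hook might do: mark every row as stale.
def reset_versions(rows)
  rows.each { |r| r.version = 0 }
end

rows = [Row.new(1, "kár", "kár", INDEX_VERSION)]
reset_versions(rows)                            # setting was toggled
puts reindex_stale(rows, ignore_accents: true)  # number of rows rebuilt
puts rows.first.indexed
```

The cost concern is visible even here: resetting the version makes every row stale at once, so the next pass has to touch the whole index.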
