Discourse should ignore if a character is accented when doing a search

This is now in with a new name :tada:

https://github.com/discourse/discourse/commit/4481836de2feb4813b6282a6ec4ae4fdde509627

8 Likes

Hmm, unfortunately it seems we’re not quite there yet. :frowning:

I see two big issues after bit of testing (upgraded today to the latest Discourse version):

  1. diacritics is not elided from the query string itself, i.e. if I search with a word including diacritics, I will not find anything
  2. diacritics is not stripped from post titles
2 Likes

@zogstrip this feels like something we got to get sorted, @danekhollas how brave are you feeling? Do you want to try a PR?

1 Like

Thank you @sam for asking, I am humbled. I can do some explorations during weekend, but I want to be honest with you that in any case this would most likely require a lot of hand holding and I do not want you to spend more time with me than you would need to fix this yourselves :slight_smile:. It’s also possible that I will hit a wall in the process and you’d have to finish up. (by which I mean I’d need to go too deep into Rails/Ruby/etc to solve the issues).

If you’re okay with that, some initial pointers where to look would be appreciated. I’ll be looking at the PRs from @zogstrip from this topic, but I don’t think that they hit all the components that might need to be modified.

Otherwise, don’t let me stop you… :slight_smile:

3 Likes

You’ll probably need to extract the strip_diacritics method so it can also be used in the Search.prepare_data method as well when SiteSetting.search_ignore_accents is enabled.

4 Likes

There might be problems, some examples:

“álom” means “dream”.
“alom” means “litter”.
“rag” means “suffix” / “inflection”.
“rág” means “chew”.
“kar” means “arm”.
“kár” means “damage”.

But I think end users can search more precisely, so +1 from me.

1 Like

Thanks for chiming in! I guess it depends on how many of these examples are there. If there are not that many, then I’d say it is always better to get more search results, albeit sometimes irrelevant, rather than getting no results at all. :slight_smile:

1 Like

This can be better handled at the database level with an appropriate collation, instead of the blunt approach of stripping accents. Many databases offer accent insensitive collations for different languages. Surprisingly, compared to other databases Postgres is lacking in this respect.

Since version 10 Postgres has started incorporating support for ICU (International Components for Unicode)[1]. This library provides proper handling of accents for sorting and searching. Unfortunately not all functionality has been integrated yet. But maybe it’s worth keeping an eye on this area of Postgres development.

[1] https://blog.2ndquadrant.com/icu-support-postgresql-10/

5 Likes

@zogstrip Thank you for pointers, they seem to do the trick! I’ve made a PR :tada:

https://github.com/discourse/discourse/pull/6518

I’ve tried adding some tests, but was generally super confused about how rspec works. They do seem to work though. :crossed_fingers:

One open questions: The search log still includes accents if the strip_diacritics function is called from Search.prepare_data. I am not sure that is the desired behaviour since there will be separate logs for queries that are identical from the search perspective.

4 Likes

Thanks Sam for merging! Just upgraded our forum and it works well.

I am quite confused about the second issue. I could not reproduce it in my dev environment, but it was definitely a problem on our production forum, even after upgrade. (and it was affecting new post titles as well) I eventually fixed it by rebuilding the Postgre index. i.e.

cd /var/discourse
./launcher enter app
rake search:reindex

@sam we’ve already discussed triggering the search index rebuild for everyone, but I am not sure that it happened. Perhaps now is a good time after my fix was merged.

FYI for others: outstanding issues related to search w/o diacritics:

  1. does not work for in-topic search
  2. diacritics should not be stripped from excerpts on the search page
    • need to make sure that “word-boldening” on search page works correctly after 2. is fixed

I want to wait a tiny bit more, rebuild of index is quite expensive so I want to make sure I don’t dish out the cost to early.

3 Likes

Okay, if it is tiny. :slight_smile: for others: if you update to tests passed right now and have a locale for which this site setting is default on, your search will be badly broken, so you need to either turn of the settings or rebuild index manually as described above.

Actually, when somebody changes this setting, it needs to trigger the rebuild, otherwise it’s just not gonna work. Is that even possible to do? @sam @zogstrip

Technically yes, we could do something like that by running a query to reset the version of the index, the trouble here is that re-indexes really are pretty expensive. Keeping this in mind though.

4 Likes