This is now in with a new name
https://github.com/discourse/discourse/commit/4481836de2feb4813b6282a6ec4ae4fdde509627
This is now in with a new name
https://github.com/discourse/discourse/commit/4481836de2feb4813b6282a6ec4ae4fdde509627
Hmm, unfortunately it seems we’re not quite there yet.
I see two big issues after bit of testing (upgraded today to the latest Discourse version):
@zogstrip this feels like something we got to get sorted, @danekhollas how brave are you feeling? Do you want to try a PR?
Thank you @sam for asking, I am humbled. I can do some explorations during weekend, but I want to be honest with you that in any case this would most likely require a lot of hand holding and I do not want you to spend more time with me than you would need to fix this yourselves . It’s also possible that I will hit a wall in the process and you’d have to finish up. (by which I mean I’d need to go too deep into Rails/Ruby/etc to solve the issues).
If you’re okay with that, some initial pointers where to look would be appreciated. I’ll be looking at the PRs from @zogstrip from this topic, but I don’t think that they hit all the components that might need to be modified.
Otherwise, don’t let me stop you…
You’ll probably need to extract the strip_diacritics
method so it can also be used in the Search.prepare_data
method as well when SiteSetting.search_ignore_accents
is enabled.
There might be problems, some examples:
“álom” means “dream”.
“alom” means “litter”.
“rag” means “suffix” / “inflection”.
“rág” means “chew”.
“kar” means “arm”.
“kár” means “damage”.
But I think end users can search more precisely, so +1 from me.
Thanks for chiming in! I guess it depends on how many of these examples are there. If there are not that many, then I’d say it is always better to get more search results, albeit sometimes irrelevant, rather than getting no results at all.
This can be better handled at the database level with an appropriate collation, instead of the blunt approach of stripping accents. Many databases offer accent insensitive collations for different languages. Surprisingly, compared to other databases Postgres is lacking in this respect.
Since version 10 Postgres has started incorporating support for ICU (International Components for Unicode)[1]. This library provides proper handling of accents for sorting and searching. Unfortunately not all functionality has been integrated yet. But maybe it’s worth keeping an eye on this area of Postgres development.
@zogstrip Thank you for pointers, they seem to do the trick! I’ve made a PR
https://github.com/discourse/discourse/pull/6518
I’ve tried adding some tests, but was generally super confused about how rspec works. They do seem to work though.
One open questions: The search log still includes accents if the strip_diacritics
function is called from Search.prepare_data
. I am not sure that is the desired behaviour since there will be separate logs for queries that are identical from the search perspective.
Thanks Sam for merging! Just upgraded our forum and it works well.
I am quite confused about the second issue. I could not reproduce it in my dev environment, but it was definitely a problem on our production forum, even after upgrade. (and it was affecting new post titles as well) I eventually fixed it by rebuilding the Postgre index. i.e.
cd /var/discourse
./launcher enter app
rake search:reindex
@sam we’ve already discussed triggering the search index rebuild for everyone, but I am not sure that it happened. Perhaps now is a good time after my fix was merged.
FYI for others: outstanding issues related to search w/o diacritics:
I want to wait a tiny bit more, rebuild of index is quite expensive so I want to make sure I don’t dish out the cost to early.
Okay, if it is tiny. for others: if you update to tests passed right now and have a locale for which this site setting is default on, your search will be badly broken, so you need to either turn of the settings or rebuild index manually as described above.
Actually, when somebody changes this setting, it needs to trigger the rebuild, otherwise it’s just not gonna work. Is that even possible to do? @sam @zogstrip
Technically yes, we could do something like that by running a query to reset the version of the index, the trouble here is that re-indexes really are pretty expensive. Keeping this in mind though.