The search should match fancy characters with their "regular" equivalent

I’ve copy-pasted a topic title (as it is displayed, with fancy entities) in the search
New Lowe’s commercial with UniGeezer

No result:

I replace the fancy apostrophe with the “regular” one in the search field:
New Lowe's commercial with UniGeezer

Now the topic appears.

My suggestion is that the search should match every fancy character with the original one.

Good point, how should we handle this @sam?

What about diacritics?

We have some normalization for diacritics already so maybe we can also correct this in a similar path.

@tgxworld can have a think about it.

@Canapin Are you still able to reproduce this? I tried to reproduce this locally but couldn’t. The apostrophe is stripped from the search data so it should not have any effect on search.

discourse_development=# SELECT TO_TSVECTOR('english', 'New Lowe’s commercial with UniGeezer') @@ PLAINTO_TSQUERY('english', 'New Lowe’s commercial with UniGeezer');
 ?column? 
----------
 t
(1 row)

Are you able to point me to the site where you’re facing this problem so that I can get a repro? Thank you!

I still have the issue, and it’s when I search for the exact string (wrapped by "):

https://unicyclist.com/search?q=%22New%20Lowe%E2%80%99s%20commercial%20with%20UniGeezer%22

vs

https://unicyclist.com/search?q=%22New%20Lowe%27s%20commercial%20with%20UniGeezer%22
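
To make the difference between the two URLs explicit, here is a small sketch that decodes both query strings and prints the code points where they differ. The variable names are only for illustration; the query strings are copied from the two links above.

```ruby
require "uri"

# Query strings copied from the two search URLs above.
fancy_q = "%22New%20Lowe%E2%80%99s%20commercial%20with%20UniGeezer%22"
plain_q = "%22New%20Lowe%27s%20commercial%20with%20UniGeezer%22"

fancy = URI.decode_www_form_component(fancy_q)
plain = URI.decode_www_form_component(plain_q)

# Keep only the character pairs that differ between the two decoded strings.
diff = fancy.chars.zip(plain.chars).reject { |a, b| a == b }

diff.each do |a, b|
  puts format("U+%04X vs U+%04X", a.ord, b.ord)
end
# prints "U+2019 vs U+0027"
```

So the only difference is the right single quotation mark (U+2019) versus the ASCII apostrophe (U+0027).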

Thank you for the repro. This basically affects search for exact terms when the search terms are wrapped in ". The problem here is that the real title of the topic is actually New Lowe's commercial with UniGeezer but the fancy title is New Lowe’s commercial with UniGeezer. When we do a search for exact terms, we’re only matching the given terms to the topic’s title and not the fancy title.

The difficulty here is that we can’t just replace ’ with ' unconditionally because a topic title with ’ in it will end up not matching. I’m kind of unsure what we can do here because we’re displaying different characters on the client side when displaying the topic title.
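
A rough sketch of the trade-off, not the actual Discourse code: `exact_match?` and `force_ascii_apostrophe` are hypothetical helpers, the substring check only stands in for the real database comparison, and the second title is made up for illustration.

```ruby
# Hypothetical stand-in for the exact-phrase check: a case-insensitive
# substring test against the stored (raw) title only.
def exact_match?(stored_title, phrase)
  stored_title.downcase.include?(phrase.downcase)
end

# Hypothetical rewrite rule: always turn U+2019 into U+0027 in the query.
def force_ascii_apostrophe(phrase)
  phrase.tr("\u2019", "'")
end

stored_with_ascii = "New Lowe's commercial with UniGeezer" # typed with '
stored_with_curly = "Lowe\u2019s opening hours"            # made-up title typed with ’

# Query copied from the rendered (fancy) title.
query = "New Lowe\u2019s commercial with UniGeezer"

# The raw title holds ', so the copied phrase does not match.
puts exact_match?(stored_with_ascii, query)                         # false
# Rewriting the query fixes that case...
puts exact_match?(stored_with_ascii, force_ascii_apostrophe(query)) # true
# ...but breaks a title whose raw text really contains ’.
puts exact_match?(stored_with_curly, force_ascii_apostrophe("Lowe\u2019s opening")) # false
```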

@gerhard @sam It seems like you have tackled this issue around quoting before, any ideas what we can do here? To be honest though, it is an edge case that will affect a very small portion of search queries. I’m inclined to just pun on this.

This is no laughing matter! :stuck_out_tongue_winking_eye:

I guess we could normalize to ' in the index and search term. But I am honestly not sure it is worth a giant effort fixing this.
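
Roughly what I mean, as a sketch only: `QUOTE_MAP`, `normalize_quotes` and `normalized_exact_match?` are made-up names, and the substring check is a stand-in for wherever the comparison really happens. The point is that the same normalization is applied to both sides.

```ruby
# Hypothetical mapping of typographic quotes to their ASCII equivalents.
QUOTE_MAP = {
  "\u2018" => "'", "\u2019" => "'",
  "\u201C" => '"', "\u201D" => '"'
}.freeze

def normalize_quotes(text)
  # gsub with a Hash replaces each match with the value stored under it.
  text.gsub(/[\u2018\u2019\u201C\u201D]/, QUOTE_MAP)
end

# Compare normalized forms of BOTH the stored text and the search phrase,
# so it no longer matters which apostrophe either side was typed with.
def normalized_exact_match?(stored_text, phrase)
  normalize_quotes(stored_text).downcase.include?(normalize_quotes(phrase).downcase)
end

puts normalized_exact_match?("New Lowe's commercial with UniGeezer",
                             "New Lowe\u2019s commercial with UniGeezer") # true
puts normalized_exact_match?("Lowe\u2019s opening hours", "Lowe's opening") # true
```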

This is not related to the search index. For exact matches, we match it against Post#raw and Topic#title:

https://github.com/discourse/discourse/blob/755627caa512058ebee332f40f9743024bb262f3/lib/search.rb#L942

I see, yeah … no easy solution here at all, I think this is just a nit we have to live with.
