Certain numerical substrings are ignored


(Eric Toombs) #1

If a topic title contains the string CHEM894 and somebody searches for 894, that topic appears, as you would expect. But if the title contains CHEM123 and somebody searches for 123, the topic isn’t found. It’s weird. Furthermore, if somebody searches for chem 1, the topic does appear. CHEM321 shows similar behaviour: it isn’t found either. It looks like the built-in search engine won’t search for certain numbers within words. There are other unexpected inconsistencies on the same theme: for example, “chem 123” sometimes finds a title containing “chem123” and sometimes doesn’t. Could somebody try to reproduce this behaviour?

edit: adding ‘chem132’ to test this site’s search engine.

(Sam Saffron) #2

I have seen a similar report from a customer complaining about search ranking.

I wonder if we should give a special relevance bump for stuff that happened in the last 10 days or so.

Also… upgrade to latest, I fixed stuff here in the last release.

(Eric Toombs) #3

I’m running this version. Is this before or after your changes? I upgraded just a few days ago. This has been happening with posts submitted and searched for after the upgrade.

Interesting screen cap. My post may have only been found because it had the string ‘123’ as well as ‘chem123’. If it only had ‘chem123’, it might not have been found at all.

(Sam Saffron) #4

It’s being stemmed out; I will have a look at why.

(Eric Toombs) #5

I added the string ‘chem132’ to this topic to further test this. I just confirmed that neither ‘A132A’ without the ‘A’s nor ‘chemA132’ with the ‘A’ replaced by a space finds this topic. Sorry about the ‘A’s, but I wanted to make sure those strings stayed out of the topic so that others could test the same search strings under the same conditions.

In other words, I was right and the only reason 123 found this topic was because 123 was in it by itself. 123 would not have found chem123.

Thanks for your investigation, @sam!

(Eric Toombs) #6

Has any progress been made on this? Our Discourse website is otherwise ready, but we would really like to fix this before we release. I have some time to try to help fix this bug. I am actually learning RoR specifically for this reason. Can I work with whoever is addressing this bug? Or else, could anybody just point me in the right direction?

(James Kiesel) #7

You’ll be interested in the post_search_data table, which stores tsvector information about each post (the tsvectors are weighted collections of lexemes / word roots, which postgres can compare to a search term much faster than a raw post body).

The class you’re looking for is lib/search.rb which is… well, a bit tangled, but the line which executes the search is here:

If I had to take a wild guess, I would say that converting the word ‘chem123’ to a lexeme results in {'chem123':1}, which doesn’t match the tsquery 123.

(The ‘f’ is for false, aka, the document ‘chem123’ isn’t a match for the search term ‘123’)
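To make the guess above concrete, here is a toy model in Ruby of whole-lexeme matching. It is only a sketch: it is not postgres’ parser or Discourse’s code, and `toy_lexemes` and `toy_match?` are made-up names. It just assumes that a document is split into whole tokens and that a query term must equal one of those tokens exactly.

```ruby
# Toy model only: this is NOT postgres' text search parser, just an
# illustration of whole-token matching. Both method names are hypothetical.

# Split text into lowercase alphanumeric tokens, keeping letters and digits
# together (so "CHEM123" stays one token, "chem123").
def toy_lexemes(text)
  text.downcase.scan(/[a-z0-9]+/)
end

# A document matches only if every query token equals some document token.
def toy_match?(document, query)
  doc_lexemes = toy_lexemes(document)
  toy_lexemes(query).all? { |term| doc_lexemes.include?(term) }
end

puts toy_match?("CHEM123 midterm", "123")      # false: "123" is not a token by itself
puts toy_match?("CHEM123 midterm", "chem123")  # true
puts toy_match?("CHEM 123 midterm", "123")     # true: "123" stands alone here
```

Under this model, 123 only finds a post where 123 appears by itself, which is consistent with what Eric observed in #5.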

I definitely don’t have my head around a good fix for that, but I wish you luck! Maybe you could write a clause in there specifically for if the query is a number.
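As a very rough sketch of what such a clause might look like (an illustration only; none of this is taken from lib/search.rb): the helper names `numeric_term?` and `search_clause_for` are hypothetical, and the column names `posts.raw` and `post_search_data.search_data` are assumptions about the schema. The idea is to fall back to a plain substring match when the term is all digits and to use the normal tsquery match otherwise.

```ruby
# Hypothetical helpers; nothing here is taken from Discourse's lib/search.rb.

# True when the search term consists only of digits, e.g. "123".
def numeric_term?(term)
  term.match?(/\A\d+\z/)
end

# Returns [sql_fragment, bind_value] for a single search term.
# Column names are assumed for illustration.
def search_clause_for(term)
  if numeric_term?(term)
    # Substring match, so "123" can find "chem123". The term is digits only,
    # so it contains no LIKE wildcards that would need escaping.
    ["posts.raw ILIKE ?", "%#{term}%"]
  else
    # Ordinary full-text match against the stored tsvector.
    ["post_search_data.search_data @@ plainto_tsquery(?)", term]
  end
end

p search_clause_for("123")
p search_clause_for("chem123")
```

One caveat: a substring match like this can’t use the tsvector data, so it would probably be much slower on a large forum and would need some thought before going anywhere near production.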

PS I found this post to be a really great intro to how postgres does its fulltext search.