Search within topic is omitting results

This issue was first brought to my attention some days ago by a forum member, and I’ve since noticed it on a few occasions, but I’ve been unable to establish the cause. (By its very nature, it’s hard to spot when it happens.)

We have an ongoing game thread where we use search to check that an entry has not been posted previously. A search for “legs” in that thread shows it in the wiki post #1, and also at post #2503 by SuperTed, but it omits the earlier post by FemaleAdda at #2430.

There is no formatting or anything in the omitted post to explain this, as far as I can see.

I thought it might just be a glitch caused by the length of that topic, but I’ve since encountered it elsewhere. For example here:

“Trending” doesn’t appear in the search results, although it’s clearly in the post. Likewise “hashtags”:

But a general search (rather than topic search) does show that post:

The term “trending hashtag” also appears in an earlier post in that topic, but does not appear in a topic search for either of those terms. In both these cases, the term is in a list, which I thought might be causing an issue, but this post in the same topic has no lists and doesn’t appear in topic search for any of the terms I tried.

Bizarrely, if I search for “quality”, it doesn’t show post #40 in the results, but does show my reply - including the instance where I quoted that post.

The term “quality” also appears in posts #14, #16 and #37 in that topic - none of which appears in the search results.

I use this kind of search extensively, where I think I’ve seen a post - or something very similar - before, and I want to check whether it’s been copied from earlier in the topic. At the moment, it’s not reliable. Unfortunately, I cannot find any common factor in the omitted results.

Another oddity: it is highlighting the word in the topic, but isn’t showing that post in the search results.

I even tried searching for that user specifically using user:FemaleAdda and that didn’t find it either.

Another intriguing thing: searching the whole forum can return that post as a result.

Doing similar searches for ‘trending’ and ‘hashtags’ with user:SmithKelvin11 also returns the posts that “Search this Topic” fails to return.

I have seen it doing that, too, but as you can see from my screen-shots above, it doesn’t always.

Ah, I know how to quantify that a bit. It seems the highlighting is case-sensitive (to a degree). Searching for “Hashtag” highlights the word in post #3, but fails to highlight the ones further down. Likewise, “hashtag” highlights the ones further down, but fails to highlight “Hashtag” in post #3.

The highlighting seems very temperamental to me though. Just searching for the two words and scrolling up and down produces very inconsistent results. Sometimes zero words are highlighted, other times 2 of the 3 references (especially when searching “hashtag”), and sometimes all 3 references (especially when searching “Hashtag”, but not always).

Maybe this is related to the topic discussing the “over-riding browser search” fix?

-      showSearch = $('.cooked').length < this.container.lookup('controller:topic').get('model.postStream.stream.length');
+      showSearch = $('.topic-post .cooked, .small-action:not(.time-gap)').length < this.container.lookup('controller:topic').get('model.postStream.stream.length');

With the thought being… that the in-topic search is having issues around deleted/closed/reopened messages and maybe skipping posts? Or what exactly?

I eventually want to try and look into what the in-topic search is doing, as I don’t think the search options are being applied to in-topic searches either. I think instead they are being treated as part of the search query.

Same here now.

I thought it might be a “not in the DOM yet” thing, but testing with my longest localhost topic (109 posts), an in-topic search found the last post without a problem.

I thought it might be because of deleted or system action posts interfering with post count vs. topic length, but again in-topic search found the last post OK in a couple of ~40-post topics that had non-posts.

I know there can be at-first-unexpected results where quotes and emoji names can be returned. And there are some stop words such as “the”, “for”, etc., but “trending” and “hashtags” wouldn’t be in those, except maybe quotes.

So I’ve been digging into the code and I don’t see any rhyme or reason as to why this is happening, and it does look like the advanced filters should be taken into account.

This is bizarre.

Good research effort. Scanning what you’ve found so far, it does look like a bug of some kind. Is it specific to extra-long topics? Can you cause it to happen on a topic with 20 or fewer replies?

I wouldn’t consider 33 posts (the second topic in my example) “extra-long”. Even taking into account deleted posts, it’s still only 45.

But I’ll try to reproduce it in a shorter topic. (Not easy, as we have no idea what’s causing it. :upside_down:)

OK - this topic has 15 posts and 0 deleted posts.

Search terms are not appearing for the first post in that topic. For example, “browser”:

As far as I can tell from my testing, if a post does not appear in topic search for one term, it will not appear for any term.

Search for “flicker” in that topic, and it shows just one result, post #13.

But the term also occurs twice in post #1, and again in post #11. (It is, in fact, a quote from post #11 which is shown in the results from post #13.)

Using Firefox’s native Ctrl+F search finds all the instances, as expected.

Hope that helps.

I “feel” the “word game” problem has more to do with the extreme number of posts it has.

I noticed yesterday that some of the problem might be because the words were in numbered lists instead of paragraphs.

The generated output is

<ol>\n<li>this</li>\n<li>that</li>\n<li>another</li>\n</ol>\n

My thought is maybe the extra characters are affecting the blurb_length (though I don’t think blurb_length is what I’m after here) or maybe the escapes were the problem.

But now with these recent examples I’m thinking it might be more to do with where the lexemes are created during the raw to cooked process. (but I doubt if that is it either)

Thankfully, because of my recent experimentation with PHP Transliteration, metaphone, et al., Discourse’s use of lexemes and ispell isn’t completely foreign territory.

I am coincidentally experimenting with creating a numbered lists topic on my localhost so I can look at the post_search_data and topic_search_data tables for any clues.

Unfortunately, up to this point I have been unable to consistently reproduce the problem or determine where in the code flow the problem might be.

For the life of me I can’t pin down any “it always occurs when”s or “it only occurs when”s.

Frustrating very it is.

I will have a look at this next week. FYI @cpradio, how I debug this stuff is by looking at the queries MiniProfiler is running on problem cases. The query 99% explains why stuff is pear-shaped.

Unfortunately, I can’t see those queries on our production instance (I did try). I felt that if I could get the exact query, I could work backwards to see how it came about :frowning:

That was my goal. I too have had a very hard time reproducing it on my own installs and on try.discourse.org. I have no idea if it is related to the number of posts, the type of posts, or additional factors that may be in play when creating the search data it uses and whether that is playing into any of it.

GAH !

I was fairly certain @TechnoBear was on to something with words being in lists, but try as I might, no repro.

I thought it might involve re-categorizing a topic somehow breaking the foreign key. But some topics have been, others haven’t.

I thought maybe it was the time gap post or other non-post posts somehow breaking the post numbering. Again, some topics have them, others don’t.

I thought it might be a non-English Accept-Language header that didn’t have a matching locale.
I disabled allow user locale and enabled set locale from accept language header then set my browser language to Hindi (hi) in honor of our most frequent visitors, created a new account and made a post.
The post_search_data locale field was set to the default en.
I set the default to Spanish (es) and had SpanishGuy1 make a post.
Switched the locale back to English (en).
The post_search_data locale field is es.
But words in that post still come up in the in-topic search.

I looked at the profiler queries.
It seems a bit wasteful to hit the database multiple times as the word is being typed out, but such is the cost of a “live” search.

I don’t know how to check it but maybe the multiple queries are crashing something. Or if they are aborting, maybe earlier finds are discarded. eg.

t - r - e - query finds a match
t - r - e - n - new query finds a match
t - r - e - n - d - new query finds a match
t - r - e - n - d - i - last query finds matching words

Not so easy to go simply by post sequence as different matches have different weight.

And then, if you copy - paste which I’m assuming equates to a single “keydown” the results are no different than if the letters were typed singly.

Ah, finally a repro of sorts.
HindiGuy1 posted a Hindi Ipsum (though Google Translate said it was Nepali, I sure wouldn’t know) followed by the English translation. The English words in that post are not found.

But I’m not seeing any suspicious character encoding in the example topics at SitePoint. Maybe an “invisible” BOM?

उपेक्ष वेबजाल व्यवहार पहेला भारत बिना प्रमान अंतर्गत विकेन्द्रित कराना वर्ष विवरण ब्रौशर निर्माता व्याख्यान बेंगलूर उपयोगकर्ता विवरन अपने ७०है समाजो वर्णित अनुवादक गोपनीयता आवश्यकत प्रौध्योगिकी सीमित सुनत आवश्यक सम्पर्क कार्यकर्ता २४भि ध्येय मजबुत किएलोग तकनीकी काम जाने खयालात प्रतिबध्दता विवरण

Under the decentralized India Prove without ignoring yellow vebajala behavior to provide details brausara year manufacturer lecture Bangalore user description is your 70 societies described Translator Privacy necessarily need to contact the Worker sunata limited praudhyogiki 24 deals with technical work aims to strengthen kieloga fashioned pratibadhdata Details

EDIT
Never mind, it is consistent on my localhost, yet isn’t here.

Well. I’ve been chasing rabbits and bull-dogging this one for a while and I think I’ve finally found the “what”

For most users the name field is either an empty string or a non-empty string.
I noticed the user I could not get results for has a users.name of NULL.

The query that was choking has this

AND (posts.raw  || ' ' || u.username || ' ' || u.name ilike '%Sorr%') 

(I was searching for the word “Sorry” in this case)

A couple of anonymized users also had a NULL name, and searches for words in their posts also failed.

From Postgres docs

The key word ILIKE can be used instead of LIKE to make the match case-insensitive according to the active locale. This is not in the SQL standard but is a PostgreSQL extension.

Maybe changing the ILIKE to SIMILAR TO would fix things?
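
A small model of why a NULL users.name would make the clause above fail. In PostgreSQL, `||` yields NULL as soon as one operand is NULL, and a pattern match against a NULL value yields NULL rather than true, so the row is filtered out. That holds for `LIKE` and `SIMILAR TO` as well as `ILIKE`, which suggests swapping the operator alone would not help. The Ruby below only imitates that behaviour, with `nil` standing in for NULL; `sql_concat`, `sql_ilike`, and `row_matches?` are hypothetical helper names, not Discourse or PostgreSQL APIs, and wildcard characters in the term are ignored.

```ruby
# Model of SQL NULL handling, with nil standing in for NULL.

# Imitates the || operator: one NULL operand makes the whole result NULL.
def sql_concat(*parts)
  return nil if parts.any?(&:nil?)

  parts.join
end

# Imitates `value ILIKE '%term%'`: a NULL value gives NULL, never true.
def sql_ilike(value, term)
  return nil if value.nil?

  value.downcase.include?(term.downcase)
end

# A WHERE clause keeps a row only when the condition is exactly true.
def row_matches?(raw, username, name, term)
  haystack = sql_concat(raw, ' ', username, ' ', name)
  sql_ilike(haystack, term) == true
end

puts row_matches?('Sorry about that', 'alice', 'Alice A', 'Sorr') # name set
puts row_matches?('Sorry about that', 'bob', '', 'Sorr')          # empty name
puts row_matches?('Sorry about that', 'carol', nil, 'Sorr')       # NULL name
```

In this model only the third row is dropped, which lines up with the symptom that a post missing for one term is missing for every term.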

I’ve been thinking of various ways to solve the problem.

1: change lib/search.rb line 484

posts = posts.where("posts.raw  || ' ' || u.username || ' ' || u.name ilike ?", "%#{@term}%")

to simply not test against users.name.
This works, but, for example, a search for “Atwood” would no longer match posts by “codinghorror”.

2: add conditional
if null do this else if not null do this
Feels like bloat to me.

3: change lib/search.rb line 484 to

posts = posts.where("posts.raw  || ' ' || u.username || ' ' || quote_nullable(u.name) ilike ?", "%#{@term}%")

This essentially changes a NULL into the text string “NULL”.
One possible drawback is that if one were to search for the string “NULL”, it would find posts by such members.
But IMHO this would be an extreme edge case, and I feel adding quote_nullable would suffice and eliminate any need to make code changes elsewhere.

http://www.postgresql.org/docs/9.1/static/plpgsql-statements.html#PLPGSQL-QUOTE-LITERAL-EXAMPLE

Tested on localhost and no side effects that I could see.
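
To make the trade-off in option 3 concrete, here is a rough Ruby model of what quote_nullable does to the searched text. Per the PostgreSQL docs, it returns the string NULL for a NULL argument and otherwise returns the value quoted as a literal, i.e. wrapped in single quotes with embedded single quotes doubled. `quote_nullable_model`, `haystack`, and `found?` are made-up names for illustration, backslash handling is left out, and `nil` stands in for NULL.

```ruby
# Simplified model of PostgreSQL's quote_nullable (backslash escaping omitted).
def quote_nullable_model(value)
  return 'NULL' if value.nil?

  "'" + value.gsub("'", "''") + "'"
end

# Text that option 3 would hand to ILIKE for one post.
def haystack(raw, username, name)
  [raw, username, quote_nullable_model(name)].join(' ')
end

# Case-insensitive substring check, standing in for ILIKE '%term%'.
def found?(text, term)
  text.downcase.include?(term.downcase)
end

null_user = haystack('Sorry about that', 'carol', nil)
named_user = haystack('Hello there', 'dave', "Dave O'Brien")

puts found?(null_user, 'Sorr')     # the post becomes searchable again
puts found?(null_user, 'null')     # but "null" now matches this user's posts
puts found?(named_user, "O'Brien") # the doubled quote changes the stored name
puts named_user
```

So besides the “NULL” edge case, the quoting also alters names that contain an apostrophe, which may or may not matter in practice.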

Try as I might, getting an RSpec test to work in my localhost Windows - Vagrant set-up was proving to be as much of a pain as tracking down this bug, if not more.
(I haven’t given up yet on getting it to work, just that it looks like it’s going to be a while.)

So, though no test :sadpanda: the PR has been made

https://github.com/discourse/discourse/pull/4217

Why would we want that when we could use COALESCE(u.name, '')?
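
A sketch of the COALESCE idea, again modelled in Ruby with `nil` for NULL. COALESCE returns its first non-NULL argument, so a NULL name becomes an empty string before concatenation and nothing else is added to the text. `coalesce`, `haystack`, `found?`, and `COALESCE_CLAUSE` are illustrative names; the clause string is just the line from lib/search.rb quoted earlier with u.name wrapped, not necessarily what was merged.

```ruby
# The clause from lib/search.rb with u.name wrapped in COALESCE (a sketch only).
COALESCE_CLAUSE =
  "posts.raw || ' ' || u.username || ' ' || COALESCE(u.name, '') ilike ?"

# Imitates SQL COALESCE: the first argument that is not NULL (nil here).
def coalesce(*values)
  values.find { |v| !v.nil? }
end

# Text that the clause would hand to ILIKE for one post.
def haystack(raw, username, name)
  [raw, username, coalesce(name, '')].join(' ')
end

# Case-insensitive substring check, standing in for ILIKE '%term%'.
def found?(text, term)
  text.downcase.include?(term.downcase)
end

null_user = haystack('Sorry about that', 'carol', nil)

puts found?(null_user, 'Sorr') # post by a NULL-name user is found
puts found?(null_user, 'null') # and no stray "NULL" text is introduced
puts COALESCE_CLAUSE
```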
