Lovely idea, but it seems to be wildly inaccurate, which makes it a mild popup annoyance. Like Clippy.
Going to need specific examples of how this fails for you to be able to call it a bug. Screenshots please.
Here’s a specific example.
I brought this up via PM on our forum, and Simon recommended that I post it here as an example. I would have expected any of the multiple existing “calibration” posts (ones where the actual words “Calibration” or “Calibrate” appear) to show up. Instead, I get a list of totally unrelated recommendations.
It looks like what is happening is that the search prioritizes matches against topic titles, and the “similar to” matches against post content.
In your example, “calibrate” is only one word. There is also “recommend” and “discuss” and others that could be getting matched against.
So basically it’s worthless. It might as well “search” on the word “the” and “and”.
I think worthless is a bit strong, but yes I agree it may make sense to split suggested into 2 queries.
One for exact title match (say first 2 results) and then the rest from body+title.
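For what it's worth, the two-query split could be merged roughly like this. It is only a sketch in Python, not Discourse code: `merge_suggestions`, `max_title` and `limit` are made-up names, and the two input lists stand in for whatever the title-only query and the body+title query would return.

```python
from typing import List


def merge_suggestions(
    title_matches: List[int],
    body_matches: List[int],
    max_title: int = 2,
    limit: int = 5,
) -> List[int]:
    """Put a few exact-title matches first, then fill up from body+title matches.

    Both inputs are lists of topic ids, already ordered by relevance.
    All names and defaults here are hypothetical.
    """
    # Reserve the top slots for exact title matches.
    merged = list(title_matches[: min(max_title, limit)])
    # Fill the remaining slots from the broader query, skipping duplicates.
    for topic_id in body_matches:
        if len(merged) >= limit:
            break
        if topic_id not in merged:
            merged.append(topic_id)
    return merged


if __name__ == "__main__":
    print(merge_suggestions([10, 11, 12], [11, 20, 21, 22, 23]))
```

Capping the title slots keeps a strong title from being punished, while a weak title still leaves most of the list to the body+title matches.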
The big problem is that people are really terrible at writing titles (though if you are good at titles we should not punish you).
I think the missing part is that we should refine the matches.
@zogstrip recently changed it so it will match only on title if that’s the only thing that is entered, however, there is a bit of a “cry wolf” problem here when we match immediately on title. So I think any recommendation should be delayed a fair bit so we have time to gather input.
Agree on the delay, but I do think it also makes sense to sneak in 1-3 exact matches on title at the top, in case you have “very high value title” and “low value body” which can often happen.
Yes, in my post I think we can all agree that the “key word” is “Calibration” and/or “Calibrate”. It would help if the algorithm were smarter at telling important words from unimportant ones (like “recommendation”, since that can be a recommendation for anything). Now, if my title and/or body was “I’m looking for Calibration recommendations, or recommendations on how to calibrate correctly”, then OK, I get it hitting on the word “recommendation” AS LONG as it’s in the context of calibration and not some other extraneous topic (like travel bags).
What you are describing here is machine learning and neural nets as recommendation engines. Though technically possible, it would be many, many months of R&D to build, with a high risk of failure.
We have to focus on simple refinements for now.
But I did a test where I typed “Calibration” as the Subject and only a single word in the body “Calibration” and the above previously created threads were the ones that appeared.
I mean, honestly, we’re OK, because in my situation, IF people really want to find specific Calibration threads, they can do so using the search function on the main page (and that works great). I understand why this is difficult and why it is hard to build into Discourse, but the above example is just to demonstrate how the system often gets it totally wrong given the context of the actual post.
Sounds like Discourse needs a Watson.
Also try looking at the lowest-frequency words with matches.
E.g. if “correctly” is found in 60 other posts but “aubergine” is found in only 5 others, perhaps we should prioritize matches for “aubergine” instead of “correctly”.
(This is for matching on suggested, not for search.)
Not sure if we have this information at our fingertips, though, short of scanning the entire full-text index.
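To make the lowest-frequency idea concrete, here is a rough Python sketch. It assumes we can count how many documents each word appears in. The stop-word list, the sample posts and every function name are invented for illustration, and unlike Postgres it does no stemming.

```python
import re
from collections import Counter
from typing import List

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "and", "to", "i", "a", "is", "in", "of"}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into words, dropping stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def document_frequency(docs: List[str]) -> Counter:
    """Count, for each word, the number of documents that contain it."""
    df: Counter = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))  # set(): count each word once per document
    return df


def rarest_query_words(query: str, df: Counter, keep: int = 3) -> List[str]:
    """Return the query words with the fewest matching documents, rarest first.

    Words that match no document at all are dropped, since they cannot
    produce a suggestion.
    """
    known = [w for w in sorted(set(tokenize(query))) if df[w] > 0]
    known.sort(key=lambda w: df[w])
    return known[:keep]


if __name__ == "__main__":
    posts = [
        "roast the aubergine correctly",
        "install the plugin correctly",
        "configure email correctly",
    ]
    df = document_frequency(posts)
    print(rarest_query_words("cook aubergine correctly", df, keep=1))
```

With these sample posts, “aubergine” appears in one document and “correctly” in three, so “aubergine” is the word that gets picked for matching.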
Oh yeah, that would require word population statistics.
Looking at pg’s relevance scoring, that feels like something they should have considered as a potential relevance indicator! (It’s mostly counting how much of a query term is in the document, and using that as word relevance. So a document that mentions “relevance” a lot has a high relevance for the query “relevance”.)
edit: how well does ts_stat() perform?
I did some experimenting.
select count(*) from topics;
=> 17
select count(*) from posts;
=> 52

CREATE TEMPORARY TABLE topic_search_stats AS
SELECT ndoc, plainto_tsquery(word) as word
FROM ts_stat('select search_data from topic_search_data') ;
=> SELECT 216

CREATE TEMPORARY TABLE post_search_stats AS
SELECT ndoc, plainto_tsquery(word) as word
FROM ts_stat('select search_data from post_search_data') ;
-- NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
-- NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
=> SELECT 1977

select ndoc, word from topic_search_stats
where word @@ to_tsvector('Calibration I want to learn to calibrate correctly. What should I do? Let''s see what "recommendations" discourse gives us when discussing a topic that has been discussed before.')
ORDER BY ndoc ASC;

 ndoc | word
------+-----------
    1 | 'see'
    1 | 'learn'
    2 | 'want'
    3 | 'topic'
    3 | 'discuss'
(5 rows)

select ndoc, word from post_search_stats
where word @@ to_tsvector('Calibration I want to learn to calibrate correctly. What should I do? Let''s see what "recommendations" discourse gives us when discussing a topic that has been discussed before.')
ORDER BY ndoc ASC;

 ndoc | word
------+-------------
    2 | 'recommend'
    2 | 'give'
    4 | 'learn'
    4 | 'us'
    4 | 'correct'
    6 | 'discuss'
    8 | 'let'
   13 | 'want'
   14 | 'see'
   17 | 'topic'
(10 rows)

explain analyze ... post_search_stats ...
Execution time: ~50 ms
explain analyze ... topic_search_stats ...
Execution time: 5.662 ms

\set joined_query '(plainto_tsquery(''recommend'') || plainto_tsquery(''give'') || plainto_tsquery(''learn'') || plainto_tsquery(''us'') || plainto_tsquery(''correct''))'

select post_id, ts_rank(search_data, :joined_query), left(p.raw, 100)
from post_search_data psd
join posts p on psd.post_id = p.id
join topics t on p.topic_id = t.id
where t.visible = 't'
and t.archetype <> 'private_message'
and t.category_id IN (select id from categories where NOT read_restricted)
and search_data @@ :joined_query
order by ts_rank(search_data, :joined_query) desc;

 post_id | ts_rank   | left
---------+-----------+------------------------------------------------------------------------------------------------------
      52 | 0.0308799 | Hi there, +
         |           | +
         |           | first of all, I want to say thank you for this awesome, brilliant, modern, reliable, thou
      50 | 0.0151982 | As of today, in the beginning of 2017, I've been using Discourse for ~2 years. +
         |           | +
         |           | The wide range of Di
      24 | 0.0121585 | That is very key to know @lll -- that the top section of the drop-down hamburger menu only shows to
      27 | 0.0121585 | Since you're already logged in on your device, you can go to the admin page directly by using the ur
      20 | 0.0121585 | [quote="McBlu, post:5, topic:76468"] +
         |           | The menu on the after header still disappears when you scroll t
      35 | 0.0121585 | I've been on the hunt for a community platform for a while and was alerted to Discourse. I've come t
      38 | 0.0121585 | Correct, I’ll have that ready to go before the end of this month.
      29 | 0.0121585 | [quote="McBlu, post:14, topic:12"] +
         |           | Thanks, III. :slight_smile: +
         |           | [/quote] +
         |           | +
         |           | :grin: you're welcome +
         |           | +
         |           | [quo
      16 | 0.0121585 | The main issue is that parent `div`s have lower widths than the full viewport so `width: 100%` for t
(9 rows)
Results and timing are a bit off due to my limited dataset (I just went through and copied a few topics off Meta). But performance is not looking too great without a denormalized table.
And I’m still using ts_rank for the final result. Bleh.
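FYI, weighting words by how rare they are across documents is called inverse document frequency; combined with term frequency, it gives TF-IDF.
It is actually used by default in nearly all proper search engines, such as Elasticsearch, Solr etc. (but not in the basic Postgres search that Discourse uses).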
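As a rough illustration of how that kind of weighting changes the ranking, here is a toy TF-IDF scorer in Python. It is a sketch only: the smoothed `log(1 + N/df)` weight is one of several common variants, the sample documents are invented, and it is not a claim about the exact formula any particular engine uses.

```python
import math
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into plain words (no stemming)."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf_scores(query: str, docs: List[str]) -> List[float]:
    """Score each document against the query with a simple TF-IDF sum.

    For each query term: tf = occurrences in the document,
    idf = log(1 + N / df), where df is the number of documents
    containing the term. Rare terms therefore count for more.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        score = 0.0
        for term in query_terms:
            tf = tokens.count(term)
            df = sum(1 for other in tokenized if term in other)
            if tf and df:
                score += tf * math.log(1 + n_docs / df)
        scores.append(score)
    return scores


if __name__ == "__main__":
    docs = [
        "calibration guide for monitors",
        "recommend a travel bag",
        "recommend a laptop",
        "recommend a calibration tool",
    ]
    for doc, score in zip(docs, tf_idf_scores("calibration recommend", docs)):
        print(f"{score:.3f}  {doc}")
```

In this toy data “recommend” appears in three documents and “calibration” in two, so a post that only mentions calibration outranks one that only mentions a recommendation, which is the behaviour the calibration example in this thread was hoping for.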