“Topic is similar to…” is never remotely similar


(Andy at Focallocal) #1

Lovely idea, but it seems to be wildly inaccurate, which makes it a mild popup annoyance, like Clippy.


(Sam Saffron) #2

Going to need specific examples of how this fails for you to be able to call it a bug. Screenshots please.


(Harold Martin) #3

Here’s a specific example.

I brought this up via PM on our forum, and Simon recommended that I post it here as an example. I would have expected any of the multiple existing “calibration” posts (where the actual words “Calibration” or “Calibrate” appear) to show up. Instead, I was shown a list of totally unrelated recommendations.

Thanks.


(Mittineague) #4

It looks like what is happening is that the search prioritizes matches against topic titles, and the “similar to” matches against post content.

In your example, “calibrate” is only one word. There is also “recommend” and “discuss” and others that could be getting matched against.


(Harold Martin) #5

So basically it’s worthless. It might as well “search” on the words “the” and “and”.


(Sam Saffron) #6

I think “worthless” is a bit strong, but yes, I agree it may make sense to split the suggestions into 2 queries.

One for exact title match (say first 2 results) and then the rest from body+title.
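A minimal sketch of how the two result lists could be merged, assuming each query returns topic ids already ordered by rank. The function name `merge_suggestions` and the parameters `max_title` and `limit` are hypothetical and not part of Discourse:

```python
from typing import List


def merge_suggestions(title_matches: List[int], body_matches: List[int],
                      max_title: int = 2, limit: int = 5) -> List[int]:
    """Put up to `max_title` exact-title matches first, then fill the
    remaining slots from the body+title matches, skipping duplicates."""
    merged: List[int] = []
    seen = set()

    # Slots reserved for exact title matches (hypothetical default: 2).
    for topic_id in title_matches:
        if len(merged) >= min(max_title, limit):
            break
        if topic_id not in seen:
            merged.append(topic_id)
            seen.add(topic_id)

    # Fill the rest from the broader body+title query.
    for topic_id in body_matches:
        if len(merged) >= limit:
            break
        if topic_id not in seen:
            merged.append(topic_id)
            seen.add(topic_id)

    return merged


suggestions = merge_suggestions([10, 11, 12], [11, 20, 10, 21, 22])
```

Here topics 10 and 11 take the two title slots, and the remaining three slots are filled from the body+title list without repeating them.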

The big problem is that people are really terrible at writing titles (though if you are good at titles, we should not punish you).

@codinghorror thoughts?


(Jeff Atwood) #7

I think the missing part is that we should refine the matches.

@zogstrip recently changed it so it will match only on title if that’s the only thing that is entered. However, there is a bit of a “cry wolf” problem when we match immediately on title, so I think any recommendation should be delayed a fair bit to give us time to gather input.
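One way to read “delayed” is to hold back suggestions until the draft contains enough text. The sketch below assumes simple character thresholds; the function name and both threshold values are hypothetical, not what Discourse does:

```python
def ready_for_suggestions(title: str, body: str,
                          min_title_chars: int = 10,
                          min_body_chars: int = 40) -> bool:
    """Return True only once both the title and the body carry enough
    text to make a suggestion query worthwhile (thresholds are made up)."""
    return (len(title.strip()) >= min_title_chars
            and len(body.strip()) >= min_body_chars)


# Title only: stay quiet instead of "crying wolf".
early = ready_for_suggestions("Calibration", "")

# Title plus a couple of sentences of body: now it is worth querying.
later = ready_for_suggestions(
    "Calibration",
    "I want to learn to calibrate correctly. What should I do?",
)
```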


(Sam Saffron) #8

Agree on the delay, but I do think it also makes sense to sneak in 1-3 exact matches on title at the top, in case you have “very high value title” and “low value body” which can often happen.


(Harold Martin) #9

Yes, in my post, I think we can all agree that the “key word” is “Calibration” and/or “Calibrate”. It would help if the algorithm could be smarter at telling important words from unimportant ones (like “recommendation”, since that can be a recommendation for anything). Now, if my title and/or body was “I’m looking for Calibration recommendations, or recommendations on how to calibrate correctly”, then OK, I get it hitting on the word “recommendation” AS LONG as it’s in the context of calibration and not some other extraneous topic (like travel bags).


(Sam Saffron) #10

What you are describing here is machine learning and neural nets as recommendation engines. Though technically possible, it would take many, many months of R&D to build, with a high risk of failure.

We’ve got to focus on simple refinements for now.


(Harold Martin) #11

But I did a test where I typed “Calibration” as the Subject and only a single word in the body “Calibration” and the above previously created threads were the ones that appeared.

I mean, honestly, we’re OK, because in my situation, IF people really want to find specific Calibration threads, they can do so using the search function on the main page (and that works great). I understand why it’s difficult, so I get how it can be hard to incorporate into Discourse, but the above example is just to demonstrate how the system often gets it totally wrong given the context of the actual post.


(Harold Martin) #12

Sounds like Discourse needs a Watson. :stuck_out_tongue:


(Kane York) #13

Also try looking at the lowest-frequency words with matches.

E.g. if “correctly” is found in 60 other posts but “aubergine” is found in only 5 others, perhaps we should prioritize matches for “aubergine” over “correctly”.

(This is for matching on suggested, not for search.)
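A rough sketch of that idea, assuming per-word document counts are available from somewhere (for example a precomputed stats table). The names `rarest_terms` and `idf_weights` and the `total_docs` figure are hypothetical:

```python
import math
from typing import Dict, List


def rarest_terms(query_terms: List[str], doc_freq: Dict[str, int],
                 keep: int = 3) -> List[str]:
    """Keep the `keep` query terms that occur in the fewest documents.
    Terms with no matches at all are dropped: nothing to suggest for them."""
    known = [t for t in query_terms if doc_freq.get(t, 0) > 0]
    # sorted() is stable, so ties keep their original query order.
    return sorted(known, key=lambda t: doc_freq[t])[:keep]


def idf_weights(doc_freq: Dict[str, int], total_docs: int) -> Dict[str, float]:
    """Inverse document frequency: the rarer the word, the higher its weight."""
    return {w: math.log(total_docs / n) for w, n in doc_freq.items() if n > 0}


# Counts from the example above; total_docs is an assumed corpus size.
doc_freq = {"correctly": 60, "aubergine": 5}
best = rarest_terms(["correctly", "aubergine"], doc_freq, keep=1)
weights = idf_weights(doc_freq, total_docs=1000)
```

With these counts, “aubergine” is the term kept, and it gets the larger weight.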


(Sam Saffron) #14

Not sure if we have this information at our fingertips though short of scanning the entire full text index.


(Kane York) #15

Oh yeah, that would require word population statistics.

Looking at pg’s relevance scoring, that feels like something they should have considered as a potential relevance indicator! (It mostly counts how often a query term appears in the document and uses that as the relevance. So a document that mentions “relevance” a lot has a high relevance for the query “relevance”.)

edit: how well does ts_stat() perform?

search for 'relevance' immediately after posting, position 13


(Kane York) #16

I did some experimenting.

select count(*) from topics;
=> 17
select count(*) from posts;
=> 52
CREATE TEMPORARY TABLE topic_search_stats AS
  SELECT ndoc, plainto_tsquery(word) as word
  FROM ts_stat('select search_data from topic_search_data') ;
=> SELECT 216
CREATE TEMPORARY TABLE post_search_stats AS
  SELECT ndoc, plainto_tsquery(word) as word
  FROM ts_stat('select search_data from post_search_data') ;
-- NOTICE:  text-search query contains only stop words or doesn't contain lexemes, ignored
-- NOTICE:  text-search query contains only stop words or doesn't contain lexemes, ignored
=> SELECT 1977
select ndoc, word from topic_search_stats
  where word @@ to_tsvector('Calibration I want to learn to calibrate correctly. What should I do? Let''s see what "recommendations" discourse gives us when discussing a topic that has been discussed before.')
  ORDER BY ndoc ASC;
 ndoc |   word    
------+-----------
    1 | 'see'
    1 | 'learn'
    2 | 'want'
    3 | 'topic'
    3 | 'discuss'
(5 rows)
select ndoc, word from post_search_stats
  where word @@ to_tsvector('Calibration I want to learn to calibrate correctly. What should I do? Let''s see what "recommendations" discourse gives us when discussing a topic that has been discussed before.')
  ORDER BY ndoc ASC;
 ndoc |    word     
------+-------------
    2 | 'recommend'
    2 | 'give'
    4 | 'learn'
    4 | 'us'
    4 | 'correct'
    6 | 'discuss'
    8 | 'let'
   13 | 'want'
   14 | 'see'
   17 | 'topic'
(10 rows)

explain analyze ... post_search_stats ...
Execution time: ~50 ms
explain analyze ... topic_search_stats ...
Execution time: 5.662 ms

\set joined_query '(plainto_tsquery(''recommend'') || plainto_tsquery(''give'') || plainto_tsquery(''learn'') || plainto_tsquery(''us'') || plainto_tsquery(''correct''))'

select post_id, ts_rank(search_data, :joined_query), left(p.raw, 100) from post_search_data psd join posts p on psd.post_id = p.id join topics t on p.topic_id = t.id where t.visible = 't' and t.archetype <> 'private_message' and t.category_id IN (select id from categories where NOT read_restricted) and search_data @@ :joined_query order by ts_rank(search_data, :joined_query) desc;
 post_id |  ts_rank  |                                                 left                                                 
---------+-----------+------------------------------------------------------------------------------------------------------
      52 | 0.0308799 | Hi there,                                                                                           +
         |           |                                                                                                     +
         |           | first of all, I want to say thank you for this awesome, brilliant, modern, reliable, thou
      50 | 0.0151982 | As of today, in the beginning of 2017, I've been using Discourse for ~2 years.                      +
         |           |                                                                                                     +
         |           | The wide range of Di
      24 | 0.0121585 | That is very key to know @lll -- that the top section of the drop-down hamburger menu only shows to 
      27 | 0.0121585 | Since you're already logged in on your device, you can go to the admin page directly by using the ur
      20 | 0.0121585 | [quote="McBlu, post:5, topic:76468"]                                                                +
         |           | The menu on the after header still disappears when you scroll t
      35 | 0.0121585 | I've been on the hunt for a community platform for a while and was alerted to Discourse. I've come t
      38 | 0.0121585 | Correct, I’ll have that ready to go before the end of this month.
      29 | 0.0121585 | [quote="McBlu, post:14, topic:12"]                                                                  +
         |           | Thanks, III. :slight_smile:                                                                         +
         |           | [/quote]                                                                                            +
         |           |                                                                                                     +
         |           | :grin: you're welcome                                                                               +
         |           |                                                                                                     +
         |           | [quo
      16 | 0.0121585 | The main issue is that parent `div`s have lower widths than the full viewport so `width: 100%` for t
(9 rows)

Results and timing are a bit off due to my limited dataset (I just went through and copied a few topics off Meta). But performance is not looking too great without a denormalized table.

And I’m still using ts_rank for the final result. Bleh.


(Bas van Leeuwen) #17

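fyi, favoring rare words like that is called inverse document frequency, the IDF half of TF-IDF :slight_smile:
Some form of it is used by default in nearly all proper search engines, such as Elasticsearch and Solr (but not in the basic Postgres search that Discourse uses, where ts_rank only looks at term frequency within the document).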