Refinements to search being tested on meta

Recently, due to internal feedback, we decided to prioritize a round of improvements to our search algorithm.

These changes have now been rolled out to all sites as part of Discourse 3.1.0.beta3. After updating, your site will automatically begin to reindex all your content for search.

There are two new site settings as part of this, but these have been set to values we have found work well in our testing here on meta, so we do not expect most sites will have any reason to change them.

Prioritizing complete term match in title over partial match

Discourse performs a stem + prefix match when searching. This can sometimes lead to very surprising results.

For example: redis stems to redi, so a search for redis can match every word that starts with redi, such as redirect.
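
To make the surprise concrete, here is a minimal sketch of stem + prefix matching. It is a toy model, not Discourse's actual query: the `STEMS` table only holds the stem quoted above, and `stem` and `prefix_matches` are hypothetical helper names.

```python
# Toy model of stem + prefix matching.
# Assumption: the stem table is hard-coded from the example in the post;
# a real stemmer covers the whole language.
STEMS = {"redis": "redi"}


def stem(word):
    """Return the known stem for a word, else the lowercased word itself."""
    word = word.lower()
    return STEMS.get(word, word)


def prefix_matches(query, indexed_words):
    """Return the indexed words whose stem starts with the stemmed query."""
    stemmed_query = stem(query)
    return [w for w in indexed_words if stem(w).startswith(stemmed_query)]


words = ["redis", "redirect", "redistribute", "postgres"]
print(prefix_matches("redis", words))
# -> ['redis', 'redirect', 'redistribute']
```

The search for redis drags in redirect and redistribute because only the stem redi is compared as a prefix.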

A new hidden site setting was added: prioritize_exact_search_title_match, which is now enabled by default.

Before:

After:

This means that if you remember the title and type it in, you are far more likely to hit the title.

Reduced maximum index duplication

Our ranking algorithm ranks posts that have multiple hits for a term higher than posts that only contain the term once. This means that you can “cheat” in search by simply repeating a word a ton of times. The more often you type the word, the higher the post floats to the top of search.

A new hidden site setting was added: SiteSetting.max_duplicate_search_index_terms, which defaults to 6.

Once this is applied, it means that whether you type sam 6 times or 60 times in a post, it will still be ranked the same. It puts a ceiling on the bonus that repetition can give a result.
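
The effect of the cap can be sketched in a few lines. This is an illustration only: `capped_term_counts` is a hypothetical helper, and the whitespace tokenizer stands in for whatever the real indexer does.

```python
from collections import Counter

# Mirrors the default of max_duplicate_search_index_terms quoted above.
MAX_DUPLICATE_TERMS = 6


def capped_term_counts(text, cap=MAX_DUPLICATE_TERMS):
    """Count each term in the text, never counting a term more than `cap` times."""
    counts = Counter(text.lower().split())
    return {term: min(n, cap) for term, n in counts.items()}


six = capped_term_counts("sam " * 6)
sixty = capped_term_counts("sam " * 60)
print(six == sixty)
# -> True
```

Because both posts end up with the same capped count, repetition past the cap cannot move a post up the ranking, and fewer repeated entries need to be stored.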

This change also has a positive performance impact, given the search index becomes a bit smaller.

Miscellaneous bug fixes

Part of the work was looking at pathological search cases.

  • Previously we bumped down the priority of closed topics, but forgot about archived topics. This is now fixed.

  • Previously we relied too heavily on prefix matches for “domain” searches. This meant that the word happy would not find https://happy.com, since happy stems to happi and the prefix match fails. This was fixed.
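
The domain case can be reproduced with a plain prefix check. This is a sketch under assumptions: it supposes the host was indexed as the single term happy.com, and `prefix_match` is a hypothetical helper. It shows why the old behaviour failed, not how the fix was implemented.

```python
def prefix_match(query_term, indexed_term):
    """Return True when the indexed term starts with the query term."""
    return indexed_term.startswith(query_term)


# "happy" stems to "happi" (from the post).
# Assumption: the URL host was indexed as the single term "happy.com".
print(prefix_match("happi", "happy.com"))  # -> False, the stemmed query misses
print(prefix_match("happy", "happy.com"))  # -> True, the unstemmed word is a prefix
```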

Future work

  • We plan to experiment with “fuzzy” search for mention autocomplete (allowing you to skip a letter, for example).

  • We plan to investigate de-prioritizing duplicate terms in titles. Currently the closed topic hello goodbye hello is ranked higher than the open topic hello world.

  • PageRank… we currently do not take into account the number of incoming internal links when ranking results. This means that sometimes incredibly well linked topics can rank lower than a rare topic that is linked from nowhere. It would be nice to account for this in our ranking algorithm.

  • We have an open initiative looking at AI integrations, so we may be able to draw some inspiration from GPT-like tools.

What can you do to help?

Are you noticing any bad results on meta? If so, please include the term you searched for and explain why the results are sub-par.

How are the changes feeling to you (neutral/better/worse?)

47 Likes

Just to be sure… If I update/upgrade my setup, will I find those two settings? I know how to find hidden ones, that’s not an issue, but are those Meta-only at this time? For me it’s easier to test it in my own circles than here :wink:

7 Likes

Yes, but you need to run rake search:reindex as well

7 Likes

Have you thought about improving the search using meilisearch? This requires very few resources and can be included in the docker build.

3 Likes

7 posts were split to a new topic: Prioritizing closed or solved topics in search

We have started experiments in this area by

First experiments are limited to user / group search, but if all goes well it can be expanded further.

8 Likes

We have considered various integrations, including sphinx, meilisearch, elastic and solr/lucene, but they come at a cost. Hosting another process to run indexing, the risk of out-of-date indexes, the added complexity and so on: none of this is free.

I would like to see how much mileage we get out of PG prior to exploring any other options and keep them as a last resort.

Very interesting problem, yes, they are (and always have been) de-prioritized. I think at a minimum we can look at adding a site setting to discourse-solved to allow admins to decide what to do in these cases (prioritize/deprioritize/neutral etc.)

16 Likes

Unfortunately, postgres is not well suited to being a search engine. Meilisearch has fantastically low memory consumption and limitless search possibilities. The overhead for the server, compared to ruby, will simply be invisible.

3 Likes

This is not a trivial problem. Our search has an enormous number of dimensions and lots of params, and it joins directly into postgres tables.

With an external search provider we need to worry about “synchronization”.

  • A topic is closed on Discourse → notify engine
  • A post is deleted → notify engine
  • A like is made → notify engine
  • A topic is split or merged → notify engine

The list goes on, and it includes building multiple indexes (users/posts/topics/categories).
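
A rough sketch of what that synchronization burden looks like in practice. Everything here is hypothetical: `ExternalIndex`, the event names and the payload fields are made up for illustration and are not part of Discourse or of any search engine's API.

```python
class ExternalIndex:
    """Hypothetical stand-in for a search engine running outside the database."""

    def __init__(self):
        self.docs = {}

    def apply(self, event, post_id, payload=None):
        """Apply one change notification to the external copy of a post."""
        if event == "post_deleted":
            self.docs.pop(post_id, None)
        else:
            # Merge the changed fields into whatever the engine already holds.
            current = self.docs.get(post_id, {})
            self.docs[post_id] = {**current, **(payload or {})}


index = ExternalIndex()
index.apply("post_created", 1, {"text": "hello", "closed": False, "likes": 0})
index.apply("topic_closed", 1, {"closed": True})
index.apply("like_created", 1, {"likes": 1})
index.apply("post_deleted", 1)
print(index.docs)
# -> {}
```

Every one of those calls is a hook that has to exist somewhere in the application. If any of them is missed, the external copy silently drifts from what is in postgres.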

That said, given the right investment this is not necessarily insurmountable, but it is an enormous task and there is no proof of concept out there showing how much better it would be. It is nice that meilisearch has a typo ranker and many other features, no argument there. But integrating it is not free at all.

As a rough estimate, I would think there is about 3 months of work in building a tight and robust integration with meilisearch. Maybe even 6 months if we were to design Discourse in such a way that the search engine is “pluggable”.

Note that we do support algolia integration here: https://discourse.algolia.com/. It is not quite rock solid, and you can see that the entire advanced search is omitted from the implementation.

8 Likes

I’m willing to bet that with a community as large as Discourse’s, it can be done much faster, in no more than three months.

2 Likes

After some time, I asked my most active users what they thought about searching. I never told them it had gotten some steroids.

Everyone said exactly the same thing: they hadn’t thought about it, but because I asked, they realized they now find relevant hits much more easily, in most cases right away.

One part of Discourse is acting as the commenting system for WordPress. No, I don’t get more comments (nothing is as overrated as blog comments), but it has shown the existence of the forum. Nowadays I have a handful of users who use Discourse as a search engine. They don’t comment, but they search for what they are looking for from WordPress via Discourse topics and go back to the blog. Sure, the tag system helps a lot too. And WordPress is missing both: effective searching and working tagging.

I don’t know if I should post this in praise instead, but I just wanted to say that I’m quite pleased with how this new and improved search works.

11 Likes

Wow thanks, this certainly makes me feel really good! We have a PR in the oven now and we should be rolling out the changes globally really soon.

11 Likes

Sorry if I’m being obtuse — should this be active on hosted sites (with latest deploy)? The release announcement points here, but this talks about a hidden setting — is that hidden setting on?

6 Likes

You do not need to do anything:

I’ll update the original post with a note.

9 Likes

Thanks for the fantastic update. For us being able to define search synonyms would be a huge improvement :pray: Thank you.

6 Likes

9 posts were split to a new topic: Can I exclude usernames from search

Not sure if this was a problem before but I noticed many system-created posts showing up in search results. Maybe an edge case more noticeable here on meta, but I wouldn’t expect system-messages to be relevant to search.

Example result when searching for terms like “automatically closed”:

4 Likes

I can’t reproduce that here.

3 Likes

I can reproduce that; if you sort them by latest post instead of relevance, there are a lot of system messages in the results.

4 Likes

Ah, yeah, I see that then. It isn’t every result, but it is more than is reasonable. It does seem like these messages should be excluded from search.

3 Likes