Recently, due to internal feedback, we decided to prioritize a round of improvements to our search algorithm.
These changes have now been rolled out to all sites as part of Discourse 3.1.0.beta3. After updating, your site will automatically begin to reindex all your content for search.
There are two new site settings as part of this change, but they have been set to values that we found work well in our testing here on meta, so we do not expect most sites will have any reason to change them.
## Prioritizing complete term match in title over partial match
Discourse performs a prefix match when searching, which can sometimes lead to very surprising results: `redis` stems to `redi`, so a search for `redis` can find all the words that start with `redi`, such as `redirect`.
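As a rough illustration (not Discourse's actual implementation), stemming combined with prefix matching behaves something like this sketch, which uses a deliberately crude stemmer:

```ruby
# Toy sketch of stemming plus prefix matching; the stemmer and index
# here are illustrative, not Discourse's real search code.

# Deliberately crude stemmer: strip a trailing "s" ("redis" -> "redi").
def stem(word)
  word.sub(/s\z/, "")
end

# The search index stores the stemmed forms of indexed words.
INDEX = %w[redis redirect redirected rediscover apple].map { |w| stem(w) }

# A query is stemmed, then matched as a prefix against the index.
def prefix_search(term)
  query = stem(term)
  INDEX.select { |w| w.start_with?(query) }
end

# "redis" stems to "redi", so the prefix match also picks up
# "redirect", "redirected" and "rediscover".
prefix_search("redis") # => ["redi", "redirect", "redirected", "rediscover"]
```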
A new hidden site setting, `prioritize_exact_search_title_match`, was added and is enabled by default.
This means that if you remember a topic's title and type it in, you are far more likely to find that topic at the top of the results.
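The idea behind the setting can be sketched as a simple ranking boost; the boost value and function names below are illustrative assumptions, not the actual implementation:

```ruby
# Minimal sketch of prioritizing a complete term match in the title:
# a result whose title contains the exact search term outranks one
# that only matches via a stem or prefix. The boost value is made up.
EXACT_TITLE_BOOST = 10.0

def score(title, query, base_rank)
  exact_match = title.downcase.split.include?(query.downcase)
  exact_match ? base_rank + EXACT_TITLE_BOOST : base_rank
end

score("Redis connection pooling", "redis", 1.0)     # => 11.0 (exact term in title)
score("Redirect loops after upgrade", "redis", 1.0) # => 1.0 (prefix-only match)
```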
## Reduced maximum index duplication
Our ranking algorithm ranks posts that contain multiple occurrences of a term higher than posts that contain the term only once. This means that you can “cheat” in search by simply repeating a word a ton of times: the more often you type the word, the higher the post floats to the top of the results.
A new hidden site setting, `SiteSetting.max_duplicate_search_index_terms`, was added, defaulting to 6. Once applied, it means that whether you type `sam` 6 times or 60 times in a post, the post is ranked the same; it puts a ceiling on the bonus repetition can earn.
This change also has a positive performance impact, since the search index becomes a bit smaller.
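The capping idea can be sketched like this; the helper below is an illustration of the concept, not the actual indexing code:

```ruby
# Sketch of capping duplicate terms before they reach the index,
# in the spirit of max_duplicate_search_index_terms (default 6).
MAX_DUPLICATE_TERMS = 6

def cap_duplicates(terms, max = MAX_DUPLICATE_TERMS)
  counts = Hash.new(0)
  # Keep each term only until its occurrence count hits the cap.
  terms.select { |term| (counts[term] += 1) <= max }
end

# Whether "sam" appears 6 or 60 times, at most 6 copies are indexed.
cap_duplicates(["sam"] * 60).length # => 6
```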
## Miscellaneous bug fixes
Part of the work was looking at pathological search cases.
Previously we lowered the priority of closed topics, but forgot about archived topics. This is now fixed.
Previously we relied too heavily on prefix matches for “domain” searches. This meant that the word `happy` would not find `happi` (its stemmed form), because the prefix match fails. This was fixed.
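The bug can be demonstrated with the same kind of toy stemmer as before; everything here is illustrative, not the real implementation:

```ruby
# Illustration of the fixed bug: the index stores the stemmed form
# ("happy" stems to "happi"), so a raw prefix match on the query
# "happy" misses it. Stemming the query first restores the match.

def stem(word)
  word.sub(/y\z/, "i") # crude: "happy" -> "happi"
end

index = [stem("happy")] # the index holds "happi"

raw_hit     = index.any? { |w| w.start_with?("happy") }       # => false
stemmed_hit = index.any? { |w| w.start_with?(stem("happy")) } # => true
```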
We plan to experiment with “fuzzy” search for mention autocomplete (allowing you to skip a letter, for example).
We plan to investigate de-prioritizing duplicate terms in titles. Currently the closed topic `hello goodbye hello` is ranked higher than the open topic `hello goodbye`.
PageRank… we currently do not take into account the number of incoming internal links when ranking results. This means that an incredibly well-linked topic can sometimes rank lower than an obscure topic that is linked from nowhere. It would be nice to account for this in our ranking algorithm.
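Purely as a thought experiment on that idea, a link-aware rank could look something like this sketch; the formula and names are speculative assumptions, not anything Discourse currently does:

```ruby
# Speculative sketch: fold incoming internal link counts into ranking.
# A logarithmic boost rewards well-linked topics without letting them
# completely dominate the results.
def link_boosted_rank(base_rank, incoming_links)
  base_rank * (1.0 + Math.log(1 + incoming_links))
end

link_boosted_rank(1.0, 0)   # => 1.0 (no links: rank unchanged)
link_boosted_rank(1.0, 100) # => about 5.6
```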
We have an open initiative looking at AI integrations; we may be able to draw some inspiration from GPT-like tools.
## What can you do to help?
Are you noticing any bad results on meta? If so, please share the term you searched for and explain why the results are sub-par.
How do the changes feel to you (better/neutral/worse)?