Recently, due to internal feedback, we decided to prioritize a round of improvements to our search algorithm.
These changes have now been rolled out to all sites as part of Discourse 3.1.0.beta3. After updating, your site will automatically begin to reindex all your content for search.
There are two new site settings as part of this, but these have been set to values we have found work well in our testing here on meta, so we do not expect most sites will have any reason to change them.
Prioritizing complete term match in title over partial match
Discourse performs a stem + prefix match when searching. This can sometimes lead to very surprising results.
For example: redis stems to redi, so a search for redis can find all the words that start with redi, such as redirect, and more.
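To make the surprise concrete, here is a minimal sketch of stem + prefix matching. It is not Discourse's implementation: `toy_stem` is a hypothetical stand-in that only strips a trailing "s", and the word list is invented for illustration.

```python
def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: drop one trailing 's'."""
    word = word.lower()
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def prefix_matches(query: str, vocabulary: list[str]) -> list[str]:
    """Return every indexed word that starts with the stemmed query."""
    stem = toy_stem(query)
    return [w for w in vocabulary if w.lower().startswith(stem)]


# Invented sample vocabulary, not real index data.
words = ["redis", "redirect", "redistribute", "postgres"]

# "redis" is stemmed to "redi", so unrelated words sharing that prefix match too.
print(prefix_matches("redis", words))
```

With this toy stemmer, the search for redis also returns redirect and redistribute, which is the kind of result the example above describes.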
A new hidden site setting was added: prioritize_exact_search_title_match, which is now enabled by default.
This means that if you remember a topic's title and type it in, you are far more likely to find that topic.
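One way to picture this prioritization is a two-tier score: titles containing every search term as a complete word rank above titles that only match through a stemmed prefix. The sketch below assumes that model; `toy_stem`, `title_rank`, and the sample titles are hypothetical and are not taken from the Discourse codebase.

```python
import re


def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: drop one trailing "s".
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def tokenize(text):
    # Lowercase the text and split it into alphanumeric words.
    return re.findall(r"[a-z0-9]+", text.lower())


def title_rank(query, title):
    """2 = every term is a complete word in the title,
    1 = every term matches only via stem + prefix,
    0 = no match."""
    terms = tokenize(query)
    words = tokenize(title)
    if not terms:
        return 0
    if all(term in words for term in terms):
        return 2
    if all(any(w.startswith(toy_stem(term)) for w in words) for term in terms):
        return 1
    return 0


# Invented sample titles.
titles = [
    "How to redirect old URLs",
    "Redis memory usage is growing",
    "Backup schedule",
]

# Complete-term title matches sort ahead of partial (prefix) matches.
ranked = sorted(titles, key=lambda t: title_rank("redis", t), reverse=True)
print(ranked)
```

In this sketch the title that actually contains redis sorts ahead of the one that only matches through the redi prefix.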
Reduced maximum index duplication
Our ranking algorithm ranks posts that have multiple hits for a term higher than posts that contain the term only once. This means that you can “cheat” in search by simply repeating a word a ton of times. The more often you type the word, the higher the post floats in search results.
A new hidden site setting was added: SiteSetting.max_duplicate_search_index_terms, which defaults to 6.
Once this is applied, a post is ranked the same whether you type sam 6 times or 60 times in it. It puts a ceiling on the bonus that repetition can give a result.
This change also has a positive performance impact, given the search index becomes a bit smaller.
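The effect of the cap can be sketched as clamping each term's count before it feeds into ranking. This is an illustration only: `capped_term_counts` is a hypothetical helper, whitespace tokenization is a simplification, and only the default of 6 comes from the setting described above.

```python
from collections import Counter

# Mirrors the default of max_duplicate_search_index_terms described above.
MAX_DUPLICATE_TERMS = 6


def capped_term_counts(text, cap=MAX_DUPLICATE_TERMS):
    """Count each term in the text, but never report more than `cap` occurrences."""
    counts = Counter(text.lower().split())
    return {term: min(n, cap) for term, n in counts.items()}


modest = " ".join(["sam"] * 6)
spam = " ".join(["sam"] * 60)

# Both posts contribute the same capped count for "sam".
print(capped_term_counts(modest)["sam"], capped_term_counts(spam)["sam"])
```

Because repeats beyond the cap are dropped, a capped index would also store fewer entries per post, which fits the smaller index size mentioned above.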
Miscellaneous bug fixes
Part of the work was looking at pathological search cases.
- Previously we bumped down the priority of closed topics, but forgot about archived topics. This is now fixed.
- Previously we relied too heavily on prefix matches for “domain” searches, meaning that the word happy would not find https://happy.com, since happy stems to happi and the prefix match fails. This was fixed.
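The domain failure can be reproduced in a few lines. The sketch below is an assumption-laden illustration: `toy_stem` is a hypothetical stemmer that only mimics the happy to happi step, `domain_tokens` is an invented way of indexing a URL's host, and falling back to the unstemmed term is just one possible remedy, not necessarily the fix that shipped.

```python
from urllib.parse import urlparse


def toy_stem(word):
    # Hypothetical: mimic the "happy" -> "happi" stemming mentioned above.
    return word[:-1] + "i" if word.endswith("y") else word


def domain_tokens(url):
    # Invented indexing scheme: the full host plus each dot-separated label.
    host = urlparse(url).hostname or ""
    return [host] + host.split(".")


def prefix_hit(term, indexed):
    # True when any indexed token starts with the term.
    return any(token.startswith(term) for token in indexed)


indexed = domain_tokens("https://happy.com")

# The stemmed term "happi" is not a prefix of "happy.com" or "happy".
stemmed_only = prefix_hit(toy_stem("happy"), indexed)

# One possible remedy (an assumption): also try the raw, unstemmed term.
with_raw_term = stemmed_only or prefix_hit("happy", indexed)

print(stemmed_only, with_raw_term)
```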
Future work
- We plan to experiment with “fuzzy” search for mention autocomplete (allowing you to skip a letter, for example).
- We plan to investigate de-prioritizing duplicate terms in titles. Currently the closed topic hello goodbye hello is ranked higher than the open topic hello world.
- PageRank: we currently do not take into account the number of incoming internal links when ranking results. This means that sometimes incredibly well-linked topics can rank lower than a rare topic that is linked from nowhere. It would be nice to account for this in our ranking algorithm.
- We have an open initiative looking at AI integrations; we may be able to draw some inspiration from GPT-like tools.
What can you do to help?
Are you noticing any bad results on meta? If so, please include the term you searched for and explain why the results are sub-par.
How do the changes feel to you (neutral, better, or worse)?