We use our site as a knowledge-base and discussion forum for a university department. So for example, I can imagine people wanting to ask things like:
when will MSc grades be available?
what is the pass mark for MPsych students?
how many weeks leave can I book in one go?
what happens if my tutee fails a stage 1 module?
what does the university require that I do if my tutee is self harming?
how much do we pay research participants?
how do I get promoted?
what sources of PhD funding are available, and when do the school PhD studentships get released?
where in the programme do students learn about repeated measures ANOVA?
In each of these cases we have quite good info, but the traditional search doesn’t find the correct results to summarise. Sometimes it finds nothing, but other times it finds old discussions which are not the “correct” answer.
I know many here are not programmers, so the difference between keyword search and semantic search may seem confusing, or they may simply want more insight into how it works. While the following is aimed at programmers, it is basic enough that you can pick up the key concepts behind the two search methods without being one.
DeepLearning.AI recently (08/14/2023) added this free basic course on
@EricGT thanks for the link. That paper is pretty dense for those who don’t already understand quite a lot about ML.
I think the gist is that, as applied here, HyDE would use an LLM to first create a “made up” answer based on the question. This answer will have the form of a real forum post (for example) but may contain hallucinations and be factually wrong because the content is coming from the LLM not a canonical document set. This document is never shown to the user, but the neat trick is that this document will be semantically similar to real documents/topics in your site. The search returns real documents that are most similar to the “made up” document, and empirically this seems to work better than just matching the raw search term to semantically similar documents in the embeddings database.
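To make that concrete, here is a toy sketch of the HyDE flow in Python. The `fake_llm_answer` stub and the bag-of-words `embed` are placeholders I've made up for illustration; a real implementation would call an LLM and an embedding model instead.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy embedding: bag-of-words counts. A real system would call an
    # embedding model; this stand-in only captures vocabulary overlap.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse bag-of-words vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def fake_llm_answer(question):
    # Stand-in for the LLM call that writes a plausible, possibly wrong,
    # forum-style answer. Its facts don't matter; its wording does.
    return ("MSc grades are normally released in July after the exam board "
            "meets; check the assessment pages for the exact date.")

def hyde_search(question, documents):
    # 1. Generate a hypothetical answer; 2. embed it; 3. rank the real
    # documents by similarity to that hypothetical answer, not the query.
    hypo = embed(fake_llm_answer(question))
    return sorted(documents, key=lambda d: cosine(hypo, embed(d)), reverse=True)

docs = [
    "Exam board meets in July; MSc grades are released on the assessment pages.",
    "Staff can book up to two weeks of leave in one go.",
]
results = hyde_search("when will MSc grades be available?", docs)
# The post about grades ranks first because it shares wording with the
# hypothetical answer, even though the query itself is short.
```

The point of the trick is visible even in this toy: the short question shares almost no words with the right document, but the made-up answer does.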
@sam HyDE-based search sounds cool and I'm excited to try it. Are you envisaging tweakable knobs for some of these AI features? For example, it might be nice to edit the prompts used both to generate the hypothetical document and to control the summary/answer. The current chatbot is pretty verbose when it does find answers, so it would be nice to be able to add “concisely” or “briefly” as a prefix to the prompt (as I often do when using ChatGPT itself).
I know many will skim past that statement, but if you are paying real money to run prompts, it is one of the most valuable things to understand.
See:
Prompts
40-90%: Amount saved by appending “Be Concise” to your prompt
It’s important to remember that you pay by the token for responses. This means that asking an LLM to be concise can save you a lot of money [1]. This can be broadened beyond simply appending “be concise” to your prompt: if you are using GPT-4 to come up with 10 alternatives, maybe ask it for 5 and keep the other half of the money.
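A quick back-of-the-envelope calculation makes the point. The per-token price and token counts here are illustrative assumptions, not any particular model's actual rates:

```python
def completion_cost(output_tokens, price_per_1k_tokens=0.06):
    # Illustrative price; real per-token rates vary by model and change often.
    return output_tokens * price_per_1k_tokens / 1000

verbose = completion_cost(800)  # a long, unprompted answer
concise = completion_cost(200)  # the same answer with "be concise"
saving = 1 - concise / verbose  # 0.75, i.e. a 75% cut in output-token spend
```

Since input and output tokens are billed separately, trimming the response is often the cheapest optimization available: it costs one or two extra prompt words.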
It’s great! Especially when I enter a search phrase that returns “No results found” for the exact match search.
I’m getting quite a few semantically correct matches for closed marketplace topics. Possibly it’s useful to return those, but maybe they should appear near the bottom of the list.
Maybe some searches could be narrowed to specific categories or tags. For example:
Searching for “How do I prevent activation emails from being sent when users log in from WordPress?”, the best results are going to be found in Documentation or wordpress.
Searching for “How to write a Data Explorer query that returns the most liked topics?” the best results are going to be found in the data & reporting and Documentation categories.
If it was possible, the initial search could return results from the most likely categories and a suggestion could be given to try expanding the search to other categories.
Thinking about semantic search as the first stop for using Discourse as a customer support forum, it would be nice to be able to prioritize specific categories or tags. For example, on Meta the initial search could prioritize searching the Documentation category.
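One simple way to implement that kind of prioritization would be to multiply each result's similarity score by a per-category weight before ranking. This is just a sketch of the idea, not how Discourse actually does (or would do) it:

```python
def rerank(results, boosts):
    # results: list of (title, category, similarity score) tuples.
    # boosts: multiplier per category; unlisted categories keep weight 1.0.
    return sorted(results,
                  key=lambda r: r[2] * boosts.get(r[1], 1.0),
                  reverse=True)

hits = [("old marketplace topic", "marketplace", 0.80),
        ("setup guide", "documentation", 0.75)]

# Prioritize documentation; demote (but don't hide) closed marketplace topics.
ranked = rerank(hits, {"documentation": 1.5, "marketplace": 0.5})
```

This also covers the closed-marketplace-topics case above: a boost below 1.0 pushes those matches toward the bottom of the list without removing them.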
That’s exactly one of the problems I wanted to address with this new feature. The semantic search will always find something.
At the moment, the semantic search is pretty barebones. It consists of just a few lines of code in the backend and returns whatever is closest semantically. It lacks any of the search features we added to the standard search over the last decade, like Search Improvements in 2.3 and many others. Because of this, it’s currently being offered as a complementary results set.
If the feature is well received and we can perfect the UI in the product, then we'll work on incorporating the Discourse-specific parts into the semantic search results.