Discourse AI - Embeddings

:bookmark: This topic covers the configuration of the Embeddings module of the Discourse AI plugin. It explains what embeddings are, how they’re used, and how to set them up.

:person_raising_hand: Required user level: Administrator

Embeddings are a crucial component of the Discourse AI plugin, enabling features like related topics and semantic search. This guide will walk you through the setup and use of embeddings in your Discourse instance.

Summary

  • Embeddings are pre-configured for hosted customers
  • They power semantic features like related topics and semantic search
  • Two provider options: Open Source (recommended) or OpenAI
  • Various settings to customize embeddings behavior

What are embeddings?

Embeddings are numerical representations of text that capture semantic meaning. In Discourse, they’re used to:

  1. Generate “Related Topics” at the bottom of topic pages
  2. Enable semantic search functionality
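To make the idea concrete, here is a minimal sketch (illustrative only, not the plugin’s actual code) of comparing texts once they are embedded as vectors. The toy three-dimensional vectors stand in for the output of a real embedding model, which produces hundreds of dimensions:

    import math

    def cosine_similarity(a: list[float], b: list[float]) -> float:
        # Relatedness between two embeddings is the cosine of the angle
        # between the vectors: values near 1.0 mean very similar meaning.
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b)

    query = [0.9, 0.1, 0.3]      # e.g. "How do I reset my password?"
    related = [0.8, 0.2, 0.4]    # e.g. a topic about account recovery
    unrelated = [0.1, 0.9, 0.2]  # e.g. a topic about CSS themes

    print(cosine_similarity(query, related))    # high score -> shown as related
    print(cosine_similarity(query, unrelated))  # low score  -> filtered out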

Setting up embeddings

For hosted customers

If you’re a hosted customer, embeddings are pre-configured. You can simply enable the AI features that depend on them.

For self-hosted instances

If you’re self-hosting, refer to the Discourse AI self-hosted guide for detailed setup instructions.

Configuring embeddings

Navigate to your site settings to configure the following options:

  1. ai embeddings enabled: Turn the embeddings module on or off
  2. ai embeddings discourse service api endpoint: URL for the API (auto-configured for hosted customers)
  3. ai embeddings discourse service api key: API key (auto-configured for hosted customers)
  4. ai embeddings model: Select which model to use for generating embeddings
  5. ai embeddings semantic suggested model: Choose the model for semantic suggested topics
  6. ai embeddings generate for pms: Decide whether to generate embeddings for private messages
  7. ai embeddings semantic related topics enabled: Enable or disable the “Related Topics” feature
  8. ai embeddings pg connection string: Database connection string (auto-configured for hosted customers)
  9. ai openai api key: Your OpenAI API key (if using OpenAI as a provider)
  10. ai embeddings semantic search hyde model: The model used for keyword expansion during AI search
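For example, a self-hosted setup pointing embeddings at the local database might use a connection string of this shape (the user, password, and host are placeholders, not real values):

    postgresql://discourse:your_password@localhost:5432/discourse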

Providers

Discourse AI supports two embedding providers:

  1. Open Source (recommended and default): Uses a collection of open-source models from SBERT
  2. OpenAI: Requires an OpenAI API key

Features

Related Topics

When enabled, a “Related Topics” section appears at the bottom of topic pages, linking to semantically similar discussions.

:information_source: You can read more about using Related Topics in this guide.

Semantic Search

Embeddings power the semantic search option on the full-page search interface.

Semantic search relies on HyDE (Hypothetical Document Embedding): the search term is first expanded by a large language model you supply, then the expanded text is converted to a vector and used to look for similar topics. This technique adds some latency to search but improves results. When selecting a model in ai embeddings semantic search hyde model, be sure to choose a fast one like Gemini Flash, Claude Haiku, or GPT-4o Mini.
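Conceptually, the HyDE flow looks like the sketch below. Every function here is a placeholder stub rather than the plugin’s actual API; it only illustrates the expand-then-embed order of operations:

    def llm_complete(prompt: str) -> str:
        # Stand-in for the fast LLM chosen in the hyde model setting.
        return "You can enable dark mode from your user preferences..."

    def embed(text: str) -> list[float]:
        # Stand-in for the embedding model (SBERT or OpenAI); returns a toy vector.
        return [0.1] * 384

    def hyde_search(query: str) -> list[float]:
        # 1. Expand the short search term into a hypothetical answering document.
        hypothetical_doc = llm_complete(
            f"Write a short forum post that answers: {query}"
        )
        # 2. Embed the expanded text rather than the raw query; this vector is
        #    then matched against the stored topic embeddings.
        return embed(hypothetical_doc)

    print(len(hyde_search("how do I enable dark mode?")))  # -> 384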

Generating embeddings

Embeddings are generated automatically for new posts. To generate embeddings for existing content:

  1. Embeddings are created when a page is viewed if they’re missing
  2. Self-hosters can use the rake task ai:embeddings:backfill to generate embeddings for all topics

:warning: The rake task should only be used by experienced operators who can install required gems manually.
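For those who do proceed, the sequence on a standard Docker install might look like this sketch (the parallel and ruby-progressbar gems, discussed later in this topic, must be installed by hand; treat this as illustrative rather than an official recipe):

    cd /var/discourse
    ./launcher enter app                    # enter the running app container
    gem install parallel ruby-progressbar   # gems the task needs but core omits
    cd /var/www/discourse
    rake ai:embeddings:backfill             # eagerly backfill all topics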

FAQs

Q: How are related topics determined?
A: Related topics are based solely on embeddings, which are built from the topic title, category, tags, and post content.

Q: Can I exclude certain topics from related topics?
A: Yes, there’s a site setting to remove closed topics from the results.

Q: Do embeddings work for historical posts?
A: Yes, the system will automatically backfill embeddings for all your content.

Additional resources

Last edited by @sam 2024-09-01T23:57:59Z

Last checked by @hugh 2024-08-06T04:16:01Z


Great work, thanks first of all! However, I can’t see related topics under topics. My settings are like this, and I added an OpenAI key. Semantic search works, but how can I show related articles under topics?

If you want to use OpenAI for embeddings, you must set ai embeddings model to text-embedding-ada-002.


How are the jobs to generate embeddings scheduled? From the code it seems like embeddings are only generated when the page is viewed and embeddings are missing. Is there a way to generate embeddings for the whole site when turning the feature on?


You can also run rake ai:embeddings:backfill to generate embeddings for all topics eagerly.


Suggestion

Sometimes, when reading a topic, one knows most of the background, but there are also some mentions that are unfamiliar. While there is summarization for summarizing an entire topic up to a given point, what would also help is an AI option that inserts a glossary for the topic as a post near the top and updates it whenever a user selects a word or phrase they want the AI to include in the glossary.


Today, while reading this topic, there was one reference I did not recognize, so I looked it up and added a reply with a reference for it. While I know the remaining references, I am sure there are others, especially those new to LLMs and such, who would have no idea about many of the references mentioned; if the AI could help them, they would visit the site much more often.

While I know what RAG means in this starting post, how many really know that?

What is RAG?

How do domain-specific chatbots work? An Overview of Retrieval Augmented Generation (RAG)


Note: I did not know which topic to post this in, but since it needs embeddings to work, I posted it here. Please move this if it makes more sense elsewhere or as the Discourse AI plugin changes.

Are embeddings the only variable when determining “Related Topics”? Or are there any other factors that are considered (e.g. author, topic score, topic age, category, etc)?


Only the embeddings, but those contain the title, category, tags and posts. There is a site setting to remove closed topics from the results too.


7 posts were split to a new topic: Is full page semantic search only in English?

2 posts were split to a new topic: Differences in search latency between AI semantic and keyword search

I wish I had found this a few months ago. I already created embeddings using bge-small-en-v1.5 and hosted them in an external database.

I will see if it can be shoehorned into this ‘standard’ set-up!

I found a little bug in the recent version that causes rake ai:embeddings:backfill to fail:

root@nbg-webxj:/var/www/discourse# rake ai:embeddings:backfill
rake aborted!
NameError: uninitialized constant Parallel (NameError)

  Parallel.each(topics.all, in_processes: args[:concurrency].to_i, progress: "Topics") do |t|
  ^^^^^^^^
/var/www/discourse/plugins/discourse-ai/lib/tasks/modules/embeddings/database.rake:27:in `block in <main>'
/usr/local/bin/bundle:25:in `load'
/usr/local/bin/bundle:25:in `<main>'
Tasks: TOP => ai:embeddings:backfill
(See full trace by running task with --trace)

I suspect the culprit is that the parallel gem is installed neither in this plugin nor in Discourse core (the only occurrence is in the if ENV["IMPORT"] == "1" block: gem "parallel", require: false).

I found that the ruby-progressbar gem is also required to run rake ai:embeddings:backfill.

I made a simple PR on GitHub:


Note to others: this rake task seems to have been demoted/semi-deprecated, per Falco on GitHub:

Thanks for the PR @fokx, but I’ve left those out intentionally as the rake task fell out of favor and should only be used on rare occasions by experienced operators who can easily install those out of band.

Is the semantic search option no longer shown in that dropdown, and instead enabled through the AI toggle?


Can you confirm whether embeddings will only work on posts made after installing, or will it also allow us to semantically search all historical posts? I’m hoping the latter! Thanks.


It’s the latter, as it will automatically backfill embeddings for all your content.


I’m trying to set up AI Embeddings using Gemini Flash but I can’t get it to work. I can’t find good descriptions/examples of all the settings fields though, so I might have missed one or two that are important. I don’t know if the ‘ai_embeddings_model’ setting is required, but if I set it to ‘gemini’ I get the following error…

I’ve not been able to find the ai_gemini_api_key setting. I do have Gemini Flash set up as an LLM with an API key, and that’s working elsewhere, e.g. summarization, but I’m assuming this wants the API key entered somewhere else?

I suppose this would work with OpenAI too, wouldn’t it?

It would be great if it could support their Batch API (50% discount)

Yes, but nowadays we backfill automatically in the background, so this isn’t mandatory.

For price-conscious peeps, we support great open-weights models that you can run on your own hardware.


Thanks. Do I understand it correctly that backfill is when the vectorization happens? When switching between models, do the vectors need to be recalculated (Are they “proprietary”)? I assume yes.

It’d be useful to know how the costs of using the OpenAI API stack up against investing in a GPU-powered server running an open-source solution. Is there a formula or any way to estimate the number of tokens used? We’re only using the API to vectorize posts, not to calculate vector distances, right? So the number of tokens used depends on how much content we have, correct?

I assume that for both related topics and AI-powered search, all posts need to be vectorized only once, so I can calculate the total number of words in the posts table and derive the number of tokens needed. The same process would apply to the daily addition of posts. I’m neglecting the search phrases for now.
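For illustration, a back-of-envelope version of that calculation might look like this (the word count, tokens-per-word ratio, and price are all assumed placeholder figures, not quoted ones):

    # All numbers below are assumptions for illustration only.
    total_words = 5_000_000           # total words across the posts table
    tokens_per_word = 1.3             # rough heuristic for English text
    usd_per_million_tokens = 0.10     # placeholder price; check current rates

    total_tokens = total_words * tokens_per_word
    cost = total_tokens / 1_000_000 * usd_per_million_tokens
    print(f"{total_tokens:,.0f} tokens -> ${cost:.2f} for a one-time backfill")
    # 6,500,000 tokens -> $0.65 for a one-time backfill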