You can now include large bodies of text in your AI personas!
This offers multiple benefits:
- You can introduce large bodies of text into your custom AI bots that are absent from the model's training data (for example: internal training docs, internal reports).
- You can better ground a persona with concrete data (even if it exists in the model's training set), which helps the model properly cite specific information and increases the quality of results.
To add uploads:
- Create a new persona using the /admin/plugins/discourse-ai/ai-personas/ interface.
- Upload the text files you wish to include in your persona. Before uploading files, add the relevant extensions (.md and .txt) via the site setting `authorized extensions` so that they can be used by the persona (see the console sketch after this list).
- Tune indexing options as you see fit.
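If you manage settings from the Rails console rather than the admin UI, the change could look roughly like this (a sketch only; `authorized extensions` is pipe-delimited and the default list varies per site):

```ruby
# Sketch: allow .md and .txt uploads for persona indexing.
# The same change can be made in the admin UI via the `authorized extensions` setting.
SiteSetting.authorized_extensions += "|md|txt"
```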
Pre-requisites
For the feature to operate you will need to have `ai_embeddings_enabled` and an `ai_embeddings_model` configured.
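For self-hosters who prefer the console, the equivalent could look roughly like this (the model name below is only an example; use whichever embedding model you have set up):

```ruby
# Sketch: enable embeddings and pick an embedding model from the Rails console.
SiteSetting.ai_embeddings_enabled = true
SiteSetting.ai_embeddings_model = "bge-large-en" # example value, use your configured model
```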
Discourse AI supports a wide variety of embedding models.
Our hosted customers get free access to the state-of-the-art bge-large-en model.
Self-hosters, or those wanting more choice, can self-host an embedding model or use models from OpenAI, Google (Gemini), and more.
Is this a RAG?
The implementation of our upload support is indeed Retrieval-Augmented Generation.
At a high level, each time we are about to ask an LLM to answer a user’s question, we look for highly relevant information in the text you uploaded and inject it into the system prompt.
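In pseudo-Ruby, the flow looks roughly like the sketch below; `embed` and `nearest_chunks` are hypothetical stand-ins for the configured embedding model and the vector index built from your uploads, not the plugin's actual API:

```ruby
# Illustrative sketch of retrieval-augmented generation, not the plugin's real code.
def augmented_system_prompt(persona_prompt, user_question)
  query_vector = embed(user_question)                 # embed the user's question
  fragments = nearest_chunks(query_vector, limit: 10) # most relevant upload chunks

  <<~PROMPT
    #{persona_prompt}

    You may use the following context when answering:
    #{fragments.join("\n\n")}
  PROMPT
end
```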
Explaining the various indexing options
What is a token? Tokens are the primitives large language models use to split up text. A great visual explanation is at: https://platform.openai.com/tokenizer
The Discourse AI upload implementation comes with the following toggles:
- Upload Chunk Tokens: after files are uploaded, we split them into pieces. This setting controls how big the pieces are. If a piece is too big for your embedding model, the embedding will be truncated (only part of the tokens will be handled).
- Upload Chunk Overlap Tokens: the number of tokens from the previous chunk that are repeated in the current one. The larger this number, the more duplicate information will be stored in your index.
- Search Conversation Chunks: how many chunks of tokens will be unconditionally included in the completion prompt, based on relevance. The larger the number, the more context the LLM is provided with (and the more expensive the calls get). For example, if this is set to 10 and Upload Chunk Tokens is set to 200, every completion will carry an extra overhead of 2,000 tokens (see the sketch after this list).
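To make the arithmetic concrete, here is a minimal sketch of token-based chunking with overlap and the resulting per-completion overhead (the token list and setting values are made up for illustration):

```ruby
# Minimal illustration of chunking with overlap. Real tokenization is done by the
# embedding model's tokenizer; here each array element stands in for one token.
tokens = ("a".."zzz").to_a      # stand-in for a tokenized document
chunk_tokens = 200              # Upload Chunk Tokens
overlap_tokens = 10             # Upload Chunk Overlap Tokens

chunks = []
start = 0
while start < tokens.length
  chunks << tokens[start, chunk_tokens]
  start += chunk_tokens - overlap_tokens
end

search_conversation_chunks = 10 # chunks injected into each completion
puts "Extra prompt overhead: ~#{search_conversation_chunks * chunk_tokens} tokens"
# => Extra prompt overhead: ~2000 tokens
```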
How does Discourse AI split up bodies of text?
Discourse uses a Recursive Character Text Splitter, which attempts to keep paragraphs, then lines, and finally words together when splitting.
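In spirit, the splitter works something like the sketch below: try paragraph breaks first, fall back to line breaks, then spaces, and only split mid-word as a last resort. This is a simplified illustration measured in characters rather than tokens, not the plugin's actual implementation:

```ruby
# Simplified sketch of recursive character splitting (characters, not tokens).
SEPARATORS = ["\n\n", "\n", " ", ""].freeze

def split_recursively(text, max_size, separators = SEPARATORS)
  return [text] if text.length <= max_size

  sep, *rest = separators
  parts = sep.empty? ? text.chars : text.split(sep)

  chunks = []
  buffer = +""

  parts.each do |part|
    if part.length > max_size
      # flush what we have, then split the oversized part with a finer separator
      chunks << buffer unless buffer.empty?
      buffer = +""
      chunks.concat(split_recursively(part, max_size, rest))
    elsif buffer.empty?
      buffer = part.dup
    elsif buffer.length + sep.length + part.length <= max_size
      buffer << sep << part
    else
      chunks << buffer
      buffer = part.dup
    end
  end

  chunks << buffer unless buffer.empty?
  chunks
end
```

Calling `split_recursively(document, 1000)` would return chunks of at most roughly 1,000 characters, each aligned to paragraph or line boundaries wherever possible.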
Additionally, Discourse gives you extra control over how your text will be split up.
The `[[metadata YOUR METADATA HERE]]` separator can be used to split up large bodies of text and properly highlight what each section covers.
For example:

```text
[[metadata about cats]]
a long story about cats
[[metadata about dogs]]
a long story about dogs
```
This allows a single text document to cover a large variety of content and protects you from “chunk contamination”. You are guaranteed that only data about cats will be included in the cat chunks and only data about dogs in the dog chunks.
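As a sketch of what that buys you, a document like the one above could be carved into per-topic sections before chunking (hypothetical helper, not the plugin's parser):

```ruby
# Hypothetical illustration: split a document on [[metadata ...]] markers so
# each topic is chunked separately and never mixed with the others.
def split_on_metadata(text)
  sections = {}
  current = nil
  text.each_line do |line|
    if (match = line.match(/^\[\[metadata (.+)\]\]\s*$/))
      current = match[1]
      sections[current] = +""
    elsif current
      sections[current] << line
    end
  end
  sections
end

split_on_metadata(<<~DOC).keys # => ["about cats", "about dogs"]
  [[metadata about cats]]
  a long story about cats
  [[metadata about dogs]]
  a long story about dogs
DOC
```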
Sounds complicated, how do I debug this?
Discourse AI ships with the site setting `ai bot debugging enabled groups`; users in these groups have access to AI debugging:
The AI debugging screens can help you get a window into the information we send the AI.
Garbage in, garbage out: if you provide useless or vague information to an LLM, it cannot magically convert it into useful information.
This screen can help you better decide how big your chunks should be or if you are including too many or too few chunks.
Does this even work?
A real-world example is splitting up the HAProxy documentation and feeding it into a persona:
System Prompt:

```text
You are a bot specializing in answering questions about HAProxy.
You live on a Discourse forum and render Discourse markdown.
When providing answers always try to include links back to HAProxy documentation.
For example this is how you would link to section 10.1.1. Keep in mind that you can link to a section or an option within.
[fcgi-app](https://www.haproxy.com/documentation/haproxy-configuration-manual/latest/#10.1.1-fcgi-app)
Be liberal with links, they are very helpful.
```
Upload contents:
processed-haproxy-2.txt (1.2 MB)
Which was generated using the following script:
```ruby
# Split the HAProxy configuration manual into numbered sections and emit each
# one prefixed with a [[metadata ...]] marker so sections are indexed separately.
file_content = File.read("configuration.txt")

title = nil
body = nil
last_line = nil

sections = []

file_content.each_line do |line|
  if line.strip.match?(/^[-]+$/)
    # A line of dashes underlines a section heading; the heading itself is the
    # previous line. Store the section collected so far and start a new one.
    section_number, title = title.to_s.split(" ", 2)
    sections << {
      section_number: section_number,
      title: title,
      body: body.to_s.strip
    }
    title = last_line
    body = nil
    last_line = nil
  else
    body = body.to_s + last_line.to_s
    last_line = line
  end
end

# Store the final section.
section_number, title = title.to_s.split(" ", 2)
sections << { section_number: section_number, title: title, body: body }

# Map section numbers to titles so nested sections can show their full path.
section_names =
  sections.map { |section| [section[:section_number], section[:title]] }.to_h

# Skip the first four sections and print the rest with metadata markers.
sections[4..-1].each do |section|
  title = []
  current = +""
  section_number = section[:section_number]

  section_number
    .split(".")
    .each do |number|
      current << number
      current << "."
      title << section_names[current].to_s.strip
    end

  title = title.join(" - ")
  body = section[:body]

  next if body.strip.empty?

  puts "[[metadata section=\"#{section_number}\" title=\"#{title.strip}\"]]"
  puts body
end
```
Both Claude Opus and GPT-4 can fail quite miserably at complex questions. This is understandable: they are trained on all the tokens on the internet, so 50 different versions of the HAProxy documentation, plus all the discussion in the world about it, go into the model's brain, which can make it very confused:
Examples of confused GPT-4 and Claude 3 Opus
Both are objectively not nearly as good as the finely tuned answer the Discourse RAG provides:
Examples of less confused GPT-4 and Claude Opus
The future
We are looking forward to your feedback. Some ideas for the future could be:
- PDF/DOCX/XLS etc. support so you don’t need to convert to text
- Smarter chunking for source code / html
- Smart transformations of incoming data prior to indexing
Let us know what you think!
Big thanks to @Roman_Rizzi for landing this feature!