You can now include large bodies of text in your AI personas!
This offers multiple benefits:
- You can introduce large bodies of text into your custom AI bots that are absent from the model's training data (for example: internal training docs, internal reports).
- You can better ground a persona with concrete data (even if it exists in the model's training set), which helps the model properly cite specific information and increases the quality of results.
To add uploads:
- Create a new persona using the /admin/plugins/discourse-ai/ai-personas/ interface.
- Upload the text files you wish to include in your persona. Before uploading files, add the relevant extensions (.md and .txt) via the site setting `authorized extensions` so that they can be used by the persona (see the console sketch after this list).
- Tune indexing options as you see fit.
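If you manage settings from the Rails console rather than the admin UI, the change could look roughly like this (a sketch only; `authorized extensions` is pipe-delimited and the default list varies per site):

```ruby
# Sketch: allow .md and .txt uploads for persona indexing.
# The same change can be made in the admin UI via the `authorized extensions` setting.
SiteSetting.authorized_extensions += "|md|txt"
```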
Pre-requisites
For the feature to operate you will need to have `ai_embeddings_enabled` and an `ai_embeddings_model` configured.
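For self-hosters who prefer the console, the equivalent could look roughly like this (the model name below is only an example; use whichever embedding model you have set up):

```ruby
# Sketch: enable embeddings and pick an embedding model from the Rails console.
SiteSetting.ai_embeddings_enabled = true
SiteSetting.ai_embeddings_model = "bge-large-en" # example value, use your configured model
```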
Discourse AI supports a wide variety of embedding models.
Our hosted customers get free access to the state-of-the-art bge-large-en model.
Self-hosters, or those wanting more choice, can self-host an embedding model or use models from OpenAI, Google (Gemini), and more.
Is this a RAG?
The implementation of our upload support is indeed Retrieval-Augmented Generation.
At a high level, each time we are about to ask an LLM to answer a user’s question, we look for highly relevant information in the text you uploaded and inject it into the system prompt.
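In pseudo-Ruby, the flow looks roughly like the sketch below; `embed` and `nearest_chunks` are hypothetical stand-ins for the configured embedding model and the vector index built from your uploads, not the plugin's actual API:

```ruby
# Illustrative sketch of retrieval-augmented generation, not the plugin's real code.
def augmented_system_prompt(persona_prompt, user_question)
  query_vector = embed(user_question)                 # embed the user's question
  fragments = nearest_chunks(query_vector, limit: 10) # most relevant upload chunks

  <<~PROMPT
    #{persona_prompt}

    You may use the following context when answering:
    #{fragments.join("\n\n")}
  PROMPT
end
```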
Explaining the various indexing options
What is a token? Tokens are the primitives large language models use to split up text. A great visual explanation is at: https://platform.openai.com/tokenizer
The Discourse AI upload implementation comes with the following toggles:
- Upload Chunk Tokens: after files are uploaded, we split them into pieces. This setting controls how big the pieces are. If a piece is too big for your embedding model, the embedding will be truncated (only part of the tokens will be handled).
- Upload Chunk Overlap Tokens: the number of tokens from the previous chunk that are repeated in the current one. The larger this number, the more duplicate information will be stored in your index.
- Search Conversation Chunks: how many chunks of tokens will be unconditionally included in the completion prompt, based on relevance. The larger the number, the more context the LLM is provided with (and the more expensive the calls get). For example, if this is set to 10 and Upload Chunk Tokens is set to 200, every completion will carry an extra overhead of 2,000 tokens (see the sketch after this list).
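To make the arithmetic concrete, here is a minimal sketch of token-based chunking with overlap and the resulting per-completion overhead (the token list and setting values are made up for illustration):

```ruby
# Minimal illustration of chunking with overlap. Real tokenization is done by the
# embedding model's tokenizer; here each array element stands in for one token.
tokens = ("a".."zzz").to_a      # stand-in for a tokenized document
chunk_tokens = 200              # Upload Chunk Tokens
overlap_tokens = 10             # Upload Chunk Overlap Tokens

chunks = []
start = 0
while start < tokens.length
  chunks << tokens[start, chunk_tokens]
  start += chunk_tokens - overlap_tokens
end

search_conversation_chunks = 10 # chunks injected into each completion
puts "Extra prompt overhead: ~#{search_conversation_chunks * chunk_tokens} tokens"
# => Extra prompt overhead: ~2000 tokens
```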
How does Discourse AI split up bodies of text?
Discourse uses a Recursive Character Text Splitter, which attempts to keep paragraphs, then lines, and finally words together when splitting.
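In spirit, the splitter works something like the sketch below: try paragraph breaks first, fall back to line breaks, then spaces, and only split mid-word as a last resort. This is a simplified illustration measured in characters rather than tokens, not the plugin's actual implementation:

```ruby
# Simplified sketch of recursive character splitting (characters, not tokens).
SEPARATORS = ["\n\n", "\n", " ", ""].freeze

def split_recursively(text, max_size, separators = SEPARATORS)
  return [text] if text.length <= max_size

  sep, *rest = separators
  parts = sep.empty? ? text.chars : text.split(sep)

  chunks = []
  buffer = +""

  parts.each do |part|
    if part.length > max_size
      # flush what we have, then split the oversized part with a finer separator
      chunks << buffer unless buffer.empty?
      buffer = +""
      chunks.concat(split_recursively(part, max_size, rest))
    elsif buffer.empty?
      buffer = part.dup
    elsif buffer.length + sep.length + part.length <= max_size
      buffer << sep << part
    else
      chunks << buffer
      buffer = part.dup
    end
  end

  chunks << buffer unless buffer.empty?
  chunks
end
```

Calling `split_recursively(document, 1000)` would return chunks of at most roughly 1,000 characters, each aligned to paragraph or line boundaries wherever possible.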
Additionally, Discourse gives you extra control over how your text will be split up.
The `[[metadata YOUR METADATA HERE]]` separator can be used to split up large bodies of text and properly highlight what each section covers.
For example:

```text
[[metadata about cats]]
a long story about cats
[[metadata about dogs]]
a long story about dogs
```
This allows a single text document to cover a large variety of content and protects you from “chunk contamination”. You are guaranteed that only data about cats will be included in the cat chunks and only data about dogs in the dog chunks.
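As a sketch of what that buys you, a document like the one above could be carved into per-topic sections before chunking (hypothetical helper, not the plugin's parser):

```ruby
# Hypothetical illustration: split a document on [[metadata ...]] markers so
# each topic is chunked separately and never mixed with the others.
def split_on_metadata(text)
  sections = {}
  current = nil
  text.each_line do |line|
    if (match = line.match(/^\[\[metadata (.+)\]\]\s*$/))
      current = match[1]
      sections[current] = +""
    elsif current
      sections[current] << line
    end
  end
  sections
end

split_on_metadata(<<~DOC).keys # => ["about cats", "about dogs"]
  [[metadata about cats]]
  a long story about cats
  [[metadata about dogs]]
  a long story about dogs
DOC
```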
Sounds complicated, how do I debug this?
Discourse AI ships with the site setting `ai bot debugging enabled groups`; users in these groups have access to AI debugging:
The AI debugging screens can help you get a window into the information we send the AI.
Garbage in, garbage out: if you provide useless or vague information to an LLM, it cannot magically convert it into useful information.
This screen can help you better decide how big your chunks should be or if you are including too many or too few chunks.
Does this even work?
A real-world example is splitting up the HAProxy documentation and feeding it into a persona:
System Prompt:

```text
You are a bot specializing in answering questions about HAProxy.
You live on a Discourse forum and render Discourse markdown.
When providing answers always try to include links back to HAProxy documentation.
For example this is how you would link to section 10.1.1. Keep in mind that you can link to a section or an option within.
[fcgi-app](https://www.haproxy.com/documentation/haproxy-configuration-manual/latest/#10.1.1-fcgi-app)
Be liberal with links, they are very helpful.
```
Upload contents:
processed-haproxy-2.txt (1.2 MB)
Which was generated using the following script:
```ruby
# Split the HAProxy configuration manual into numbered sections and emit each
# one prefixed with a [[metadata ...]] marker so sections are indexed separately.
file_content = File.read("configuration.txt")

title = nil
body = nil
last_line = nil

sections = []

file_content.each_line do |line|
  if line.strip.match?(/^[-]+$/)
    # A line of dashes underlines a section heading; the heading itself is the
    # previous line. Store the section collected so far and start a new one.
    section_number, title = title.to_s.split(" ", 2)
    sections << {
      section_number: section_number,
      title: title,
      body: body.to_s.strip
    }
    title = last_line
    body = nil
    last_line = nil
  else
    body = body.to_s + last_line.to_s
    last_line = line
  end
end

# Store the final section.
section_number, title = title.to_s.split(" ", 2)
sections << { section_number: section_number, title: title, body: body }

# Map section numbers to titles so nested sections can show their full path.
section_names =
  sections.map { |section| [section[:section_number], section[:title]] }.to_h

# Skip the first four sections and print the rest with metadata markers.
sections[4..-1].each do |section|
  title = []
  current = +""
  section_number = section[:section_number]

  section_number
    .split(".")
    .each do |number|
      current << number
      current << "."
      title << section_names[current].to_s.strip
    end

  title = title.join(" - ")
  body = section[:body]

  next if body.strip.empty?

  puts "[[metadata section=\"#{section_number}\" title=\"#{title.strip}\"]]"
  puts body
end
```
Both Claude Opus and GPT-4 can fail quite miserably at complex questions. This is understandable: they are trained on all the tokens on the internet, so 50 different versions of the HAProxy documentation, plus all the discussion in the world about it, go into the model's brain, which can make it very confused:
Examples of confused GPT-4 and Claude 3 Opus
Both are objectively not nearly as good as the finely tuned answer the Discourse RAG provides:
Examples of less confused GPT-4 and Claude Opus
The future
We are looking forward to your feedback. Some ideas for the future could be:
- PDF/DOCX/XLS etc. support so you don’t need to convert to text
- Smarter chunking for source code / html
- Smart transformations of incoming data prior to indexing
Let us know what you think!
Big thanks to @Roman_Rizzi for landing this feature!