Discourse AI Persona, upload support

You can now include large bodies of text in your AI personas!

This offers multiple benefits:

  1. You can introduce large bodies of text into your custom AI bots that are absent from the model's training data (for example: internal training docs, internal reports).

  2. You can better ground a persona with concrete data (even if it exists in the model's training set), which helps the model properly cite specific information and increases the quality of results.

To add uploads:

  1. Create a new persona using the /admin/plugins/discourse-ai/ai-personas/ interface.

  2. Upload the text files you wish to include in your persona.

:information_source: Before uploading files, please add the relevant extensions (.md and .txt) via the site setting authorized extensions so that they can be used by the persona.

  3. Tune indexing options as you see fit.

Prerequisites

For uploads to work you will need ai_embeddings_enabled and an ai_embeddings_model configured.

Discourse AI supports a large number of embedding models.

Our hosted customers get free access to the state-of-the-art bge-large-en model.

Self-hosters, or those wanting more choice, can self-host an embedding model or use models from OpenAI, Google (Gemini), and more.
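For those comfortable with the Rails console, the configuration can be sketched there as well. This is a minimal sketch: the setting names are the ones referenced above, but the exact model value available depends on your install, so treat the value below as an assumption.

```ruby
# Run from a Rails console on your Discourse host (./launcher enter app,
# then `rails c`). Setting names are the ones referenced above; the model
# value shown is an assumption, check what your install offers.
SiteSetting.ai_embeddings_enabled = true
SiteSetting.ai_embeddings_model = "bge-large-en"

# Allow .md and .txt uploads; authorized_extensions is a pipe-delimited list.
SiteSetting.authorized_extensions += "|md|txt"
```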

Is this a RAG?

Yes: the upload support is an implementation of Retrieval-Augmented Generation (RAG).

At a high level, each time we are about to ask an LLM to answer a user's question, we look for highly relevant information in the text you uploaded and inject it into the system prompt.
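For intuition, here is a minimal sketch of that flow in Ruby. This is illustrative only, not the plugin's actual code: `Persona`, `Chunk`, `embed`, and `nearest_chunks` are hypothetical stand-ins.

```ruby
# Illustrative sketch of the RAG flow, not the plugin's actual code.
Persona = Struct.new(:system_prompt, :search_chunks, keyword_init: true)
Chunk   = Struct.new(:text)

# Stand-in: a real embedding model returns a high-dimensional vector.
def embed(text)
  [0.0]
end

# Stand-in: a real implementation runs a similarity search over the
# persona's indexed upload chunks.
def nearest_chunks(query_vector, limit:)
  [Chunk.new("relevant excerpt from an uploaded document")]
end

def build_system_prompt(persona, question)
  query_vector = embed(question) # same model as used at index time
  chunks = nearest_chunks(query_vector, limit: persona.search_chunks)

  <<~PROMPT
    #{persona.system_prompt}

    Relevant excerpts from uploaded documents:
    #{chunks.map(&:text).join("\n---\n")}
  PROMPT
end

persona = Persona.new(system_prompt: "You are a helpful bot.", search_chunks: 10)
puts build_system_prompt(persona, "How do cats purr?")
```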

Explaining the various indexing options

What is a token? Tokens are the primitives large language models use to split up text. A great visual explanation is at: https://platform.openai.com/tokenizer

The Discourse AI upload implementation comes with the following toggles:

Upload Chunk Tokens: after files are uploaded, we split them into pieces (chunks); this controls how big those pieces are. If a piece is too big for your embedding model, the embedding will be truncated (only part of the tokens will be handled).

Upload Chunk Overlap Tokens: the number of tokens from the end of the previous chunk that are repeated at the start of the current one. The larger this number, the more duplicate information will be stored in your index.

Search Conversation Chunks: how many chunks will always be included in the completion prompt, selected by relevance. The larger the number, the more context the LLM is provided with (and the more expensive the calls get). For example: if this is set to 10 and Upload Chunk Tokens is set to 200, then every completion will carry an extra overhead of 2,000 tokens.
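A quick worked example of that overhead calculation:

```ruby
# Rough upper bound on the extra prompt tokens added per completion.
search_conversation_chunks = 10
upload_chunk_tokens = 200

extra_tokens = search_conversation_chunks * upload_chunk_tokens
puts extra_tokens # => 2000 extra prompt tokens on every completion
```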

How does Discourse AI split up bodies of text?

Discourse uses a Recursive Character Text Splitter, which attempts to keep paragraphs, then lines, and finally words together when splitting.
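To illustrate the idea, here is a simplified sketch of a recursive splitter. It is not the plugin's actual implementation, it ignores chunk overlap, and `token_count` is a naive stand-in for a real tokenizer:

```ruby
# Simplified sketch of a recursive character text splitter.
SEPARATORS = ["\n\n", "\n", " "].freeze # paragraphs, then lines, then words

# Naive stand-in for a real tokenizer: roughly one token per word.
def token_count(text)
  text.split.length
end

def split_text(text, chunk_tokens, separators = SEPARATORS)
  return [text] if token_count(text) <= chunk_tokens || separators.empty?

  sep, *rest = separators
  chunks = []
  current = +""

  text.split(sep).each do |piece|
    candidate = current.empty? ? piece : current + sep + piece
    if token_count(candidate) <= chunk_tokens
      current = candidate # keep growing the current chunk
    else
      chunks << current unless current.empty?
      if token_count(piece) <= chunk_tokens
        current = piece
      else
        # A single piece is still too big: split it with finer separators.
        chunks.concat(split_text(piece, chunk_tokens, rest))
        current = +""
      end
    end
  end

  chunks << current unless current.empty?
  chunks
end

puts split_text("one two three four five six", 3).inspect
# => ["one two three", "four five six"]
```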

Additionally, Discourse gives you extra control over how your text will be split up.

The [[metadata YOUR METADATA HERE]] separator can be used to split up large bodies of text and clearly mark what each section covers.

For example:

```text
[[metadata about cats]]
a long story about cats
[[metadata about dogs]]
a long story about dogs
```

This allows a single text document to cover a large variety of content and protects you from "chunk contamination": you are guaranteed that only data about cats will be included in cat chunks, and only data about dogs in dog chunks.
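To make that concrete, here is a hypothetical sketch of how such markers could be parsed into independent sections before chunking; the plugin's actual parser may differ.

```ruby
# Hypothetical sketch: parse [[metadata ...]] markers into independent
# sections so each can be chunked on its own.
text = <<~DOC
  [[metadata about cats]]
  a long story about cats
  [[metadata about dogs]]
  a long story about dogs
DOC

sections =
  text
    .scan(/\[\[metadata (.*?)\]\]\n(.*?)(?=\[\[metadata |\z)/m)
    .map { |meta, body| { metadata: meta, body: body.strip } }

# Each section is chunked separately, so no chunk ever mixes cats and dogs.
sections.each { |s| puts "#{s[:metadata]}: #{s[:body]}" }
```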

Sounds complicated, how do I debug this?

Discourse AI ships with the site setting ai bot debugging enabled groups; users in these groups have access to AI debugging:

The AI debugging screens can help you get a window into the information we send the AI.

:warning: Garbage in, garbage out: if you provide useless or vague information to an LLM, it cannot magically convert it into useful information.

This screen can help you decide how big your chunks should be and whether you are including too many or too few chunks.

Does this even work?

A real-world example is splitting up the HAProxy documentation and feeding it into a persona:

System Prompt:

You are a bot specializing in answering questions about HAProxy.

You live on a Discourse forum and render Discourse markdown.

When providing answers always try to include links back to HAProxy documentation.

For example, this is how you would link to section 10.1.1; keep in mind that you can link to a section or to an option within it:

[fcgi-app](https://www.haproxy.com/documentation/haproxy-configuration-manual/latest/#10.1.1-fcgi-app)

Be liberal with links, they are very helpful.

Upload contents:
processed-haproxy-2.txt (1.2 MB)

Which was generated using the following script:

```ruby
# Split the raw HAProxy configuration manual into sections. In the manual,
# a heading line is followed by a row of dashes, e.g.:
#
#   10.1.1. fcgi-app
#   ----------------
file_content = File.read("configuration.txt")

title = nil
body = nil
last_line = nil

sections = []

file_content.each_line do |line|
  if line.strip.match?(/^[-]+$/)
    # A dashed underline: the line just above it (last_line) is the heading
    # of a new section, so flush the section accumulated so far.
    section_number, title = title.to_s.split(" ", 2)
    sections << {
      section_number: section_number,
      title: title,
      body: body.to_s.strip
    }

    title = last_line
    body = nil
    last_line = nil
  else
    # Accumulate the body one line behind, so heading lines never
    # leak into the body.
    body = body.to_s + last_line.to_s
    last_line = line
  end
end

# Flush the final section.
section_number, title = title.to_s.split(" ", 2)
sections << { section_number: section_number, title: title, body: body }

# Map section numbers ("10.", "10.1.", ...) to titles so nested sections
# can be labeled with their full breadcrumb.
section_names =
  sections.map { |section| [section[:section_number], section[:title]] }.to_h

# Skip the first few front-matter entries, then emit each section behind a
# [[metadata ...]] marker.
sections[4..-1].each do |section|
  title = []
  current = +""
  section_number = section[:section_number]
  section_number
    .split(".")
    .each do |number|
      current << number
      current << "."
      title << section_names[current].to_s.strip
    end
  title = title.join(" - ")

  body = section[:body]

  next if body.strip.empty?
  puts "[[metadata section=\"#{section_number}\" title=\"#{title.strip}\"]]"
  puts body
end
```
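Assuming the raw manual is saved as configuration.txt alongside the script (matching the File.read call above), the output can be redirected straight into the upload file, for example `ruby split_sections.rb > processed-haproxy-2.txt` (the script filename here is hypothetical).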

Both Claude Opus and GPT-4 can fail quite miserably at complex questions. This is understandable: they are trained on all the tokens on the internet, so 50 different versions of the HAProxy documentation, plus all the discussion in the world about it, go into the model, which can leave it very confused:

Examples of confused GPT-4 and Claude 3 Opus

Both are objectively not nearly as good as the grounded answer the Discourse RAG provides:

Examples of less confused GPT-4 and Claude Opus

The future

We are looking forward to feedback. Some ideas for the future could be:

  • PDF/DOCX/XLS etc. support so you don’t need to convert to text
  • Smarter chunking for source code / html
  • Smart transformations of incoming data prior to indexing

Let us know what you think!

Big thanks to @Roman_Rizzi for landing this feature :hugs:


Would it be possible to, in addition to manually uploaded text, include forum posts which match selected criteria?

Like:

  • in a given category
  • has a certain tag (or, doesn’t have)
  • is part of a topic marked solved (alternately, is specifically a solution post)
  • is the topic OP, not a reply
  • is posted by a user in a given group
  • is before or after a certain date

Or maybe instead of checkboxes with these things, simply “is one of top N topics for a given forum search”?


All of this is doable today with a custom search command:

  • Given category can be selected in filter
  • tag
  • solved
  • op only (I think it is doable)
  • given group
  • before and after date

:hugs:


Hmmm, maybe I am misunderstanding. Does making that available to the persona do the same…

I’ve tried, and mostly I’m just getting Mistral to hallucinate topic titles and link to totally random post numbers. :slight_smile:


Is Mistral actually good enough for these tasks? I think that might be causing the hallucinations. Sam is right: by changing the base query you can do all the things you stated in the OP.


Annnd, I posted before I finished my thoughts. The question was: does providing the search command and parameters do effectively the same thing as providing uploaded files?

But yeah, Mistral may not be good enough.


Just to expand here a bit:

https://chat.lmsys.org/?leaderboard

Mistral comes in many flavors: there is Mistral 7b, Mixtral 8x7b (the one you have), the brand-new mistralai/Mixtral-8x22B-Instruct-v0.1 · Hugging Face, and another 5 or 6 models they released, including some closed-source ones.

Got to be careful with a blanket "Mistral is not good enough"; always clarify which model you mean.

I would say Mixtral 8x7b is simply not a great fit for tool support; it strays off too much.

I would say it is:

  1. Pretty good for “upload” support
  2. Very good at custom persona support
  3. Weak at tool support

We are trying to see if we can upgrade to 8x22b (it ships with good tool support); the trouble is that memory requirements are quite high, and we would need to quantize the model to fit it nicely on our servers.

But really, if you have a data privacy deal with Amazon, I would strongly recommend Bedrock, which would give you access to Claude 3 Opus and Haiku.

I do get the tension between open-source and closed-source models. It's tough; the closed-source ones are just quite a bit ahead at the moment.


You are right, I should've phrased that better. I was indeed hinting at closed-source models being better in general.


Uploading multiple .txt files at once is bugging out: they quickly appear, but then only one is shown; after that, the add-file button does not respond anymore.

Also I think .md file support would be a great addition.


Oh yikes, nice catch; @Roman_Rizzi will have a look.

This should work fine; it is already supported, you just need to enable the extension.


I pushed a fix for the multi-file bug:



Hey Sam, I wonder how this works exactly. It will tell the AI that this is data about cats or dogs, but how does it affect chunks if they are already set to a fixed number of tokens (say 2000)? Will it cut off a chunk when it sees a line like [[metadata about dogs]] and start a new one?


Yes, it will cut off early.


Oh yikes, I was using the <meta>content</meta> format, which works with most LLM models. Is there a reason you chose the [[brackets]] way? Do <tags> still work, or is it better to use the bracket method in Discourse?


This is not consumed by the LLM at all (we parse and consume the metadata ourselves); we wanted a separator that was very unlikely to show up in indexed data.


Added this bit to the copy.


Are the embeddings created for AI personas sitting in the same vector database? And actually, are all of the embeddings generated for Discourse stored in the same vector database?


All in Postgres, using the same DB.
