Is it feasible to inject medium-sized documents (e.g., up to 100KB) into the context of a Discourse AI persona bot session via the system prompt?
USE CASE
A custom AI Persona linked to a private LLM like Llama3-8b on an AWS instance where the cost is by the hour, not by the token… i.e., the number of request/response tokens doesn’t matter, and the server has considerable CUDA power so performance is fine… i.e., RAG could be optional?
( alt. use case: a Gemini 1.5 LLM, where there’s no charge for API calls )
GOAL
Reduce moving parts in the pipeline and improve accuracy by avoiding similarity retrieval.
EXPERIMENT
Informal AI Persona test with Gemini 1.5 Pro where a text document of ~20k tokens was inserted into the system prompt.
I asked several questions that I knew had answers only in the paper… it answered all of them correctly. So I’m assuming it read the 20k tokens out of the prompt and parsed them for each question?
In cases where sessions and context content aren’t too huge, any downsides to this approach?
When I deleted the prompt-injection content mid-session and continued to ask questions, Gemini continued to answer correctly, but after several questions it could not find the context and failed. As somewhat expected, Gemini 1.5 can persist context across multiple conversational turns in a session, but not indefinitely.
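For anyone wanting to script the stuffing step outside the Discourse UI, here’s a minimal sketch: prepend the whole document to the persona’s system instructions and sanity-check the rough size against the model’s window. All names here and the ~4-chars-per-token heuristic are my own illustration, not Discourse internals:

```python
def estimate_tokens(text: str) -> int:
    """Crude heuristic: ~4 characters per token for English prose."""
    return len(text) // 4

def build_system_prompt(instructions: str, document: str,
                        context_window_tokens: int = 1_000_000,
                        reserve_for_dialogue: int = 50_000) -> str:
    """Prepend the full document to the persona instructions, refusing to
    build a prompt that would leave no room for the conversation itself."""
    prompt = (f"{instructions}\n\n--- BEGIN DOCUMENT ---\n"
              f"{document}\n--- END DOCUMENT ---")
    budget = context_window_tokens - reserve_for_dialogue
    if estimate_tokens(prompt) > budget:
        raise ValueError(f"~{estimate_tokens(prompt)} tokens exceeds budget of {budget}")
    return prompt
```

A ~20k-token paper fits comfortably under a Gemini-1.5-sized window with this check; for smaller-window models the same function would refuse to build the prompt.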
Yeah, we have truncation logic that depends on the number of tokens the LLM allows; we set the threshold quite high for Gemini 1.5 models (at 800k).
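As a rough picture of what such truncation might look like (purely illustrative; this is not the actual Discourse implementation), clipping stuffed content to a per-model token threshold could be as simple as:

```python
def truncate_to_threshold(text: str, max_tokens: int = 800_000,
                          chars_per_token: int = 4) -> str:
    """Clip text so its rough token estimate stays under the model threshold.
    The 800k default matches the threshold mentioned for Gemini 1.5 models;
    chars_per_token is a crude English-prose heuristic."""
    max_chars = max_tokens * chars_per_token
    return text if len(text) <= max_chars else text[:max_chars]
```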
It should work, but every interaction can be very expensive.
Overall I have found that limiting context helps models stay more focused, but long term (2–5 years out)… RAG may be pointless and we will just have so many tokens and so much focus that it does not matter.
In spite of my questions about prompt stuffing… I actually love RAG.
IMO the work you guys are doing with big-league embedding engines is powerful and useful right now… but I also agree that RAG… may be doomed.
As Sam Altman said in a recent interview… beware business models and project plans that get in the way of the LLM !! We will steamroller you! …or words to that effect…
So ultimately… maybe we’re going to want to just give our stuff to the LLM without a lot of pre-processing pipelines that go low-dimensional (input), then high-dimensional (embedding), then low-dimensional (prompting), then high-dimensional (transformer inference), then low-dimensional (output)… Bob’s your uncle!
Here’s some background on RAG versus long context that I just stumbled on… haven’t listened to it all yet, but it seems relevant, maybe… (not affiliated with anyone in this video :-)
I did get to watch that vid on Gradient’s long-context Llama3… it reminds us that context includes everything in play:
User input
LLM output
Control tokens
System instructions
… as the window slides, stuff gets left out… but they mentioned there can be ‘protection’ of the system prompt in sessions where the context window fills up…
There are also the issues of ‘max input size’ and the original ‘sequence length’ the model was trained on, which may come into the equation.
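One way to picture that ‘protected system prompt’ idea: pin the system instructions, then fill the remaining window with the newest turns, evicting the oldest first. This is a hypothetical sketch (real models and frameworks handle this server-side); the per-turn overhead is a stand-in for control tokens:

```python
def assemble_context(system_prompt, turns, window_tokens,
                     est=lambda s: len(s) // 4, overhead_per_turn=4):
    """Return (system_prompt, surviving_turns): the system prompt is always
    kept, and older turns are dropped first as the window fills."""
    budget = window_tokens - est(system_prompt)
    if budget < 0:
        raise ValueError("system prompt alone exceeds the window")
    kept = []
    for turn in reversed(turns):              # walk newest -> oldest
        cost = est(turn) + overhead_per_turn  # overhead ~ control tokens
        if cost > budget:
            break
        kept.append(turn)
        budget -= cost
    return system_prompt, list(reversed(kept))
```

With a stuffed document inside the system prompt, this policy is what keeps it from sliding out of the window while old user/LLM turns are sacrificed instead.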
Below is an example of long-context prompt stuffing in action…
In general it seems feasible to create a team of Discourse AI personas that each have a big chunk of specialized content or codebase for querying (keeping in mind the caveat about high expense when paying by the token!)
RAG works well for some apps, but for other fairly deterministic target cases (e.g., customer Q/A, payments, medical, etc.), I and others have had problems getting good accuracy with RAG vector search over the past year. I.e., the bot will either miss things or make things up (poor recall, poor precision in IR terms), which is well documented by Stanford, Google, et al.
So the question arises… why throw a bunch of chunks at the LLM if you can give it the whole corpus? At least with context injection, when the LLM isn’t accurate you have fewer things to tune…
OK, it’s not going to work for vast document/code libraries… but for small and moderately sized content bases it seems to work great so far… am working on a project that is formally testing this! More soon… thanks
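For reference, the recall/precision terms above in plain IR form, as a toy computation over retrieved-chunk IDs (illustrative only, not tied to any particular RAG stack):

```python
def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Precision: fraction of retrieved chunks that are relevant.
    Recall: fraction of relevant chunks that were retrieved."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Retriever returned chunks {1, 2, 3}, but the answer lives in {2, 4}:
p, r = precision_recall({1, 2, 3}, {2, 4})  # p = 1/3, r = 1/2
```

“Missing things” is the recall failure; “making things up” from near-miss chunks shows up as a precision failure.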
PS → to make things even more interesting… I’ve had good luck with context injection + fine-tuning… and there are emerging approaches that combine RAG and context injection! Etc., etc.
Here’s a Q/A test with a white paper (~20k tokens) put in context via prompt injection vs. RAG… (content and settings were kept the same as much as possible; LLM = Gemini-1.5-Pro)…
ANALYSIS:
RAG is inconsistent… sometimes finding the answer, sometimes missing it.
I did get RAG to answer questions from the file upload at the beginning of the document, and with coaxing it may look at the middle and end… so it’s not a total fail… but it is inconsistent… consistently, or more difficult to work with, IMO : )
Here’s the test file in case anyone wants to play with it:
The file contains these BME (beginning/middle/end) haystack ‘needles’ that are guaranteed to be unique, i.e., not present in external copies of the paper across the internet.
Beginning:
Proofreader: Felonius Monko
Middle:
Editor’s note: The Goldich stability series was discovered by Maurice Goldrch. While Goldichs original order of mineral weathering potential was qualitative, later work by Michal Kowalski and J. Donald Rimstidt placed in the series in quantitative terms. Thanks to Dr. Gomez Pyle at NMU for this clarification.
End:
Dang, S. et al. Cryo-EM structures of the TMEM16A calcium-activated chloride channel. Nature 552, 426–429 (2017).
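Assembling such a test file can be automated; here’s a hedged sketch (my own helper, not part of the test above) that plants unique marker strings at the three BME positions so later Q/A can probe whether the model actually reads the whole document:

```python
def plant_needles(document: str, needles: dict) -> str:
    """Insert unique marker strings at the beginning, middle, and end of a
    document. `needles` maps 'beginning'/'middle'/'end' to marker sentences
    that must not occur anywhere else on the internet."""
    mid = len(document) // 2
    return "\n".join([
        needles["beginning"],
        document[:mid],
        needles["middle"],
        document[mid:],
        needles["end"],
    ])
```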
I was able to rerun the above test with GPT-4o (128k context), making sure to use large token/chunk settings… but it’s still very flaky for my white-paper Q/A use case (lost in the middle, lost at the end, etc.)… here are my settings if anyone wants to duplicate and refine. Would love it if we can find the right settings for this case:
CUSTOM AI PERSONA
Enabled?: Yes
Priority: Yes
Allow Chat: Yes
Allow Mentions: Yes
Vision Enabled: No
Name: Rag Testing Bot 3
Description: Test RAG vs Long Context prompt injection
Default Language Model: GPT-4o-custom
User: Rag_Testing_Bot_bot
Enabled Commands: Categories, Read, Summary
Allowed Groups: trust_level_4
System Prompt: Answer as comprehensively as possible from the provided context on Equatic Carbon Removal Research in the attached file. Do not invent content. Do not use content external to this session. Focus on content provided and create answers from it as accurately and completely as possible.