Prompt injection for long-context LLMs as an alternative to RAG?

Is it feasible to inject medium-sized documents (e.g., up to 100KB) into the context of a Discourse AI persona bot session via the system prompt?

USE CASE

A custom AI Persona linked to a private LLM like Llama3-8b on an AWS instance where the cost is by the hour, not by the token… i.e., the number of request/response tokens doesn’t matter, and the server has considerable CUDA power so performance is fine… so RAG could be optional?

(Alt. use case: the Gemini 1.5 LLM, where there’s no charge for API calls.)
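
To make the setup concrete, here’s a minimal sketch of the idea, assuming an OpenAI-compatible endpoint in front of the self-hosted model. The URL, model name, and file path are placeholders, not anything Discourse AI sets up for you:

```python
# Minimal sketch: stuff an entire document into the system prompt of an
# OpenAI-compatible chat endpoint (e.g. a self-hosted Llama3-8b behind vLLM).
# ENDPOINT, MODEL, and the file path are placeholders.
import requests

ENDPOINT = "http://my-aws-instance:8000/v1/chat/completions"  # hypothetical
MODEL = "llama3-8b-instruct"                                  # hypothetical

with open("my_document.txt") as f:
    document = f.read()  # e.g. up to ~100 KB of text

payload = {
    "model": MODEL,
    "messages": [
        {
            "role": "system",
            "content": "Answer only from the document below. Do not invent content.\n\n"
                       "=== DOCUMENT START ===\n" + document + "\n=== DOCUMENT END ===",
        },
        {"role": "user", "content": "Summarize the key findings of the document."},
    ],
    "temperature": 0.1,
}

resp = requests.post(ENDPOINT, json=payload, timeout=120)
print(resp.json()["choices"][0]["message"]["content"])
```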

GOAL

Reduce moving parts in the pipeline and improve accuracy by avoiding similarity retrieval.

EXPERIMENT

Informal AI Persona test with Gemini 1.5 Pro where a text document of ~20k tokens was inserted into the system prompt.

I asked several questions that I knew had answers only in the paper… It answered all the questions correctly. So I’m assuming it read the 20k tokens out of the prompt and parsed them for each question?
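
For anyone repeating this, here’s a quick way to ballpark how many tokens a document will consume before stuffing it into the prompt. tiktoken is only an approximation here, since Gemini and Llama use their own tokenizers:

```python
# Rough token estimate before prompt stuffing. tiktoken's cl100k_base is an
# approximation; Gemini/Llama tokenizers will differ somewhat.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
with open("my_document.txt") as f:   # placeholder path
    text = f.read()

print(f"~{len(enc.encode(text)):,} tokens")
```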

In cases where sessions and context content aren’t too huge, any downsides to this approach?

Thanks much…



FOOTNOTE - Removing context from the prompt mid-session

When I deleted the prompt-injection content mid-session and continued to ask questions, Gemini continued to answer them correctly, but after several questions it could no longer find the context and failed. As was somewhat expected, Gemini 1.5 can persist context across multiple conversational turns in a session, but not indefinitely.

All thoughts, comments and guidance appreciated!

1 Like

Yeah, we have truncation logic that depends on the number of tokens the LLM allows; we set the threshold quite high for Gemini 1.5 models (800k).
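
Conceptually the truncation works something like this (a simplified sketch of the idea, not the actual Discourse AI code):

```python
# Simplified sketch of token-threshold truncation: always keep the system
# prompt, then fit as many of the most recent posts as the budget allows.
# Not the actual Discourse AI implementation.
def truncate_context(system_prompt, posts, count_tokens, max_tokens=800_000):
    budget = max_tokens - count_tokens(system_prompt)
    kept = []
    for post in reversed(posts):      # newest first
        cost = count_tokens(post)
        if cost > budget:
            break                     # older posts get dropped
        kept.append(post)
        budget -= cost
    return [system_prompt] + list(reversed(kept))
```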

It should work, but every interaction can be very expensive.

Overall I have found that limiting context helps models stay more focused, but long term (2-5 years out)… RAG may be pointless and we will just have so many tokens and so much focus that it does not matter.

3 Likes

In spite of my questions about prompt stuffing… I actually love RAG.

IMO the work you guys are doing with big-league embedding engines is powerful and useful right now… but I also agree that RAG… may be doomed.

As Sam Altman said in a recent interview… beware of business models and project plans that get in the way of the LLM!! “We will steamroller you!”… or words to that effect…

So ultimately… maybe we’re going to want to just give our stuff to the LLM without a lot of pre-processing pipelines that go low-dimensional (input), then high-dimensional (embedding), then low-dimensional (prompting), then high-dimensional (transformer inference), then low-dimensional (output)… and Bob’s your uncle!

Here’s some background on RAG versus long context that I just stumbled on… I haven’t listened to it all yet, but it seems relevant… (not affiliated with anyone in this video :-)

ADDENDUM

I did get to watch that video on Gradient’s long-context Llama3… it reminds us that context includes everything in play…

  • User input
  • LLM output
  • Control tokens
  • System instructions

… as the window slides, stuff gets left out… but they mentioned there can be ‘protection’ of the system prompt in sessions where the context window fills up…

There are also the issues of ‘max input size’ and the original ‘sequence length’ the model was trained on, which may come into the equation.

below is an example of long context prompt stuffing in action…

In general it seems feasible to create a team of Discourse AI personas that each have a big chunk of specialized content or a codebase for querying (keeping in mind the caveat about high expense when paying by the token!)

But isn’t this just a (really inefficient and) “static” version of RAG?

All RAG does differently from this approach is select and include relevant chunks of content instead of including all content.

2 Likes

Fair point, for sure… no simple answer IMO

I guess this depends on use case.

RAG works well for some apps, but for other fairly deterministic target cases (e.g., customer Q/A, payments, medical, etc.), I and others have had problems getting good accuracy with RAG vector search over the past year. That is, the bot will either miss things or make things up (poor recall and poor precision in IR terms), which is well documented by Stanford, Google, et al.

So the question arises… why throw a bunch of chunks at the LLM if you can give it the whole corpus? At least with context injection, when the LLM isn’t accurate you have fewer things to tune…
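
To make that contrast concrete, here’s a toy sketch of the only step that really differs between the two approaches. The embedding model, chunk size, and top-k are arbitrary illustrative choices, not what Discourse AI uses:

```python
# Toy contrast between long-context stuffing and RAG-style retrieval.
# Embedding model, chunk size, and top_k are arbitrary illustrative choices.
from sentence_transformers import SentenceTransformer, util

with open("Equatics-paper1-with-unique-haystack-needles-v116.txt") as f:
    corpus = f.read()

question = "Who discovered the Goldich stability series?"

# Approach 1: long-context stuffing. The "retrieval" step is trivial:
# everything goes into the prompt.
stuffed_context = corpus

# Approach 2: RAG. Embed fixed-size chunks and keep only the top-k most
# similar to the question; everything else never reaches the LLM.
chunks = [corpus[i:i + 4000] for i in range(0, len(corpus), 4000)]
model = SentenceTransformer("all-MiniLM-L6-v2")
chunk_vecs = model.encode(chunks, convert_to_tensor=True)
query_vec = model.encode(question, convert_to_tensor=True)
top_hits = util.semantic_search(query_vec, chunk_vecs, top_k=5)[0]
rag_context = "\n\n".join(chunks[hit["corpus_id"]] for hit in top_hits)

print(len(stuffed_context), len(rag_context))  # RAG sends far less text
```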

OK, it’s not going to work for vast document/code libraries… but for small and moderately sized content bases it seems to work great so far… I’m working on a project that is formally testing this! More soon… thanks.

PS. → To make things even more interesting… I’ve had good luck with context injection + fine-tuning… and there are emerging approaches that combine RAG and context injection! Etc., etc.

also see:

https://www.google.com/search?q=papers+on+the+problems+with+llm+rag

ADDENDUM 2

Here’s a Q/A test with a white paper (~20k tokens) put in context via prompt injection vs. RAG… (content and settings were kept the same as much as possible; LLM = Gemini-1.5-Pro)…

ANALYSIS:

RAG is inconsistent… sometimes finding the answer, sometimes missing it.


:github_check: Prompt injection success:


:x: RAG fail:


RAG Request trace:

I did get RAG to answer questions from the file upload at the beginning of the document, and with coaxing it may look at the middle and end… so it’s not a total fail… but it is inconsistent… consistently, or at least more difficult to work with, IMO :)

Here’s the test file in case anyone wants to play with it:

The file contains these beginning/middle/end (BME) haystack ‘needles’, which are guaranteed to be unique, i.e., not present in external copies of the paper across the internet.

Beginning:

Proofreader: Felonius Monko

Middle:

Editor’s note: The Goldich stability series was discovered by Maurice Goldrch. While Goldichs original order of mineral weathering potential was qualitative, later work by Michal Kowalski and J. Donald Rimstidt placed in the series in quantitative terms. Thanks to Dr. Gomez Pyle at NMU for this clarification.

End:

Dang, S. et al. Cryo-EM structures of the TMEM16A calcium-activated chloride channel. Nature 552, 426429 (2017).

Equatics-paper1-with-unique-haystack-needles-v116.txt (71.8 KB)
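
For anyone who wants to automate the comparison, here’s a rough sketch of the check I’m doing by hand; `ask()` is a stand-in for however you query the persona (chat UI, raw API, etc.), and the question wording is illustrative:

```python
# Rough sketch of an automated BME needle check. ask() is a stand-in for
# whatever function sends a question to the persona and returns its answer.
NEEDLES = {
    "Who proofread the paper?": "Felonius Monko",                          # beginning
    "Who at NMU is thanked for the Goldich clarification?": "Gomez Pyle",  # middle
    "Which journal published the TMEM16A cryo-EM structures?": "Nature",   # end
}

def needle_recall(ask):
    hits = 0
    for question, expected in NEEDLES.items():
        answer = ask(question)
        found = expected.lower() in answer.lower()
        print(f"{'PASS' if found else 'FAIL'}: {question}")
        hits += found
    return hits / len(NEEDLES)
```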

All comments, critiques and guidance appreciated!! Going forward, I will conduct the test with more LLMs and embedding models, etc.

FOOTNOTE:

I was able to rerun the above test with GPT-4o (128k context), making sure to use large token/chunk settings… but it’s still very flaky for my white-paper Q/A use case (lost in the middle, lost at the end, etc.)… here are my settings if anyone wants to duplicate and refine. Would love it if we can find the right settings for this case:

CUSTOM AI PERSONA
Enabled: Yes
Priority: Yes
Allow Chat: Yes
Allow Mentions: Yes
Vision Enabled: No
Name: Rag Testing Bot 3
Description: Test RAG vs Long Context prompt injection
Default Language Model: GPT-4o-custom
User: Rag_Testing_Bot_bot
Enabled Commands: Categories, Read, Summary
Allowed Groups: trust_level_4
System Prompt: Answer as comprehensively as possible from the provided context on Equatic Carbon Removal Research in the attached file. Do not invent content. Do not use content external to this session. Focus on content provided and create answers from it as accurately and completely as possible.
Max Context Posts: 50
Temperature: 0.1
Top P: 1
Uploads: Equatics-paper1-with-unique-haystack-needles-v116.txt
Upload Chunk Tokens: 1024
Upload Chunk Overlap Tokens: 10
Search Conversation Chunks: 10
Language Model for Question Consolidator: GPT-4o-custom

CUSTOM BOT
Name to display: GPT-4o-custom
Model name: gpt-4o
Service hosting the model: OpenAI
URL of the service hosting the model: https://api.openai.com/v1/chat/completions
API Key of the service hosting the model: D20230943sdf_fake_Qqxo2exWa91
Tokenizer: OpenAITokenizer
Number of tokens for the prompt: 30000
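
For reference, here’s roughly what “Upload Chunk Tokens: 1024” with “Upload Chunk Overlap Tokens: 10” means in practice. This is a generic sketch of token-based chunking with overlap, not the actual Discourse AI splitter, and tiktoken stands in for the configured tokenizer:

```python
# Generic illustration of token-based chunking with overlap, matching the
# settings above (1024-token chunks, 10-token overlap). Not the actual
# Discourse AI splitter; tiktoken stands in for the configured tokenizer.
import tiktoken

def chunk_tokens(text, chunk_size=1024, overlap=10):
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = chunk_size - overlap
    return [enc.decode(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]

with open("Equatics-paper1-with-unique-haystack-needles-v116.txt") as f:
    chunks = chunk_tokens(f.read())

print(f"{len(chunks)} chunks")  # ~20k tokens → roughly 20 chunks of ~1024 tokens
```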