Prompt injection for long-context LLMs as an alternative to RAG?

StevePlex · May 22, 2024, 10:47pm

Is it feasible to inject medium-sized documents (e.g., up to 100KB) into the context of a Discourse AI persona bot session via the system prompt?

USE CASE

A custom AI Persona linked to a private LLM like Llama3-8b in an AWS instance where the cost is by the hour, not the token… i.e., the number of request/response tokens doesn’t matter, and the server has considerable CUDA power so performance is fine… i.e., RAG could be optional?

( alt. use case: Gemini 1.5 LLM, where there’s no charge for API calls )

GOAL

Reduce moving parts in the pipeline and improve accuracy by avoiding similarity retrieval.

EXPERIMENT

Informal AI Persona test with Gemini 1.5 Pro where a text document of ~20k tokens was inserted into the system prompt.

I asked several questions that I knew had answers only in the paper… It answered all the questions correctly. So I’m assuming it read the 20k tokens out of the prompt and parsed them for each question?

In cases where sessions and context content aren’t too huge, any downsides to this approach?

Thanks much…

FOOTNOTE - Remove context from prompt mid session

When I deleted the prompt injection content mid session and continued to ask questions, Gemini continued to answer questions correctly, but after several questions it could not find the context and failed. As was somewhat expected, Gemini 1.5 can persist context across multiple conversational turns in a session but not indefinitely.

All thoughts, comments and guidance appreciated !

sam · May 23, 2024, 2:58am

Yeah, we have truncation logic that depends on the number of tokens the LLM allows; we set the threshold quite high for Gemini 1.5 models (at 800k).

It should work, but every interaction can be very expensive.
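To make the cost point concrete, here is a rough sketch of how input tokens add up when the stuffed document is re-sent with every turn. It assumes the whole system prompt plus the running chat history goes out on each request; the function names, the question and answer sizes, and the price are all hypothetical placeholders, not real pricing for any provider.

```python
def cumulative_input_tokens(doc_tokens, question_tokens, answer_tokens, turns):
    """Total input tokens billed over a session when the document
    rides along in the system prompt on every turn."""
    total = 0
    history = 0  # tokens of earlier questions and answers
    for _ in range(turns):
        total += doc_tokens + history + question_tokens
        history += question_tokens + answer_tokens
    return total


def session_cost(total_tokens, price_per_million_tokens):
    # price_per_million_tokens is a placeholder; plug in your own rate
    return total_tokens / 1_000_000 * price_per_million_tokens


total = cumulative_input_tokens(
    doc_tokens=20_000, question_tokens=50, answer_tokens=150, turns=10
)
print(total)                     # 209500
print(session_cost(total, 1.0))  # cost at a made-up rate of 1.0 per million
```

With these made-up sizes, ten short questions against a ~20k token document send roughly 200k input tokens, almost all of it the document itself repeated. On an hourly-billed private server that repetition only costs compute time.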

Overall I have found that limiting context helps models stay more focused, but long term (2-5 years out)… RAG may be pointless and we will just have so many tokens and focus that it does not matter.

StevePlex · May 23, 2024, 6:29am

In spite of my questions about prompt stuffing… I actually love RAG.

IMO the work you guys are doing with big league embedding engines is powerful and useful right now… but also agree that RAG…may be doomed.

As Sam Altman said in a recent interview… beware business models and project plans that get in the way of the LLM !! We will steamroller you! …or words to that effect…

So ultimately … maybe we’re going to want to just give our stuff to the llm without a lot of pre processing pipelines that are low dimensional (input) then high dimensional (embedding) then low dimensional (prompting) then high dimensional (transformer inference) then low dimensional (output)… Bob’s your uncle!

Here’s some background on rag versus long context that I just stumbled on… haven’t listened to it all yet but seems relevant maybe … (not affiliated with anyone in this video :-)>

StevePlex · May 23, 2024, 5:17pm

ADDENDUM

I did get to watch that vid on Gradient long-context Llama3… it reminds us that context includes everything in play…

User input
LLM output
Control tokens
System instructions

… as the window slides, stuff gets left out… but they mentioned there can be ‘protection’ of the system prompt in sessions where the context window is filled up …
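A minimal sketch of what "protecting" the system prompt could look like when the window fills up: the system prompt is always kept, and the oldest conversation turns are dropped first. This is an illustration only, not the actual logic used by Discourse AI or by any model provider; `count_tokens` is a crude whitespace proxy standing in for a real tokenizer, and `trim_history` is a hypothetical name.

```python
def count_tokens(text):
    # Crude proxy: one token per whitespace-separated word.
    return len(text.split())


def trim_history(system_prompt, turns, max_tokens):
    """Keep the system prompt, then as many of the most recent
    turns as still fit in the remaining token budget."""
    budget = max_tokens - count_tokens(system_prompt)
    if budget < 0:
        raise ValueError("system prompt alone exceeds the window")
    kept = []
    for turn in reversed(turns):  # newest first
        cost = count_tokens(turn)
        if cost > budget:
            break  # everything older than this is dropped too
        kept.append(turn)
        budget -= cost
    kept.reverse()  # restore chronological order
    return [system_prompt] + kept


window = trim_history("a b c", ["one two", "three four five", "six"], max_tokens=7)
print(window)  # ['a b c', 'three four five', 'six']
```

Under this scheme a document stuffed into the system prompt survives the slide, while older questions and answers fall out first.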

There are also the issues of ‘max input size’ and the original ‘sequence length’ the model was trained on, which may come into the equation.

below is an example of long context prompt stuffing in action…

In general it seems feasible to create a team of Discourse AI personas that each have a big chunk of specialized content or codebase for querying ( keeping in mind the caveat about high expense when paying by token !)

RGJ · May 23, 2024, 9:19pm

But isn’t this just a (really inefficient and) “static” version of RAG?

All RAG does differently from this approach is select and include relevant chunks of content instead of including all content.
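The difference can be sketched in a few lines. In this toy version, plain word overlap is swapped in for embedding similarity (a real RAG pipeline would use a vector search), and the chunks are made-up examples; `rag_context` and `full_context` are hypothetical names.

```python
import re


def tokenize(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(question, chunk):
    # Stand-in for embedding similarity: count of shared words.
    return len(tokenize(question) & tokenize(chunk))


def rag_context(question, chunks, k):
    """RAG-style: keep only the k best-scoring chunks."""
    ranked = sorted(chunks, key=lambda c: score(question, c), reverse=True)
    return ranked[:k]


def full_context(chunks):
    """Prompt stuffing: no selection step, everything goes in."""
    return list(chunks)


chunks = [
    "The proofreader was Felonius Monko.",
    "This made-up chunk talks about something unrelated.",
    "The editor added a note on mineral weathering.",
]
question = "Who was the proofreader?"
print(rag_context(question, chunks, k=1))
print(len(full_context(chunks)))
```

The selection step is where a RAG answer can go missing: if the scoring ranks the right chunk below the cutoff, the model never sees it. Stuffing everything removes that failure mode at the price of a much larger prompt.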

StevePlex · May 24, 2024, 1:25am

Fair point, for sure… no simple answer IMO

I guess this depends on use case.

RAG works well for some apps, but for other fairly deterministic target cases (e.g., customer Q/A, payments, medical, etc.), others and I have had problems getting good accuracy with RAG vector search over the past year. I.e., the bot will either miss things or make things up (poor recall, poor precision in IR terms), which is well documented by Stanford, Google, et al.

So the question arises… why throw a bunch of chunks at the LLM if you can give it the whole corpus? At least with context injection, when the LLM isn’t accurate you have fewer things to tune…
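For anyone not used to the IR terms, a small sketch of how precision and recall are computed for a single retrieval, using the standard definitions. The chunk IDs are made up.

```python
def precision_recall(retrieved, relevant):
    """precision = share of retrieved chunks that are relevant,
    recall = share of relevant chunks that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four chunks retrieved, but only one of the two relevant ones is among them.
print(precision_recall(["c1", "c2", "c3", "c4"], ["c2", "c9"]))  # (0.25, 0.5)
```

Low recall here means the answer-bearing chunk never reached the prompt; low precision means the prompt was padded with chunks that did not help.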

Ok it’s not going to work for vast document / code libraries … but for small and moderate sized content bases it seems to work great so far… am working on a project that is formally testing this… ! more soon … thanks

PS. → to make things even more interesting… I’ve had good luck with context injection + fine tuning… and there are emerging approaches that combine RAG and context injection! … Etc etc.

also see:

https://www.google.com/search?q=papers+on+the+problems+with+llm+rag

StevePlex · May 24, 2024, 3:59pm

ADDENDUM 2

Here’s a Q/A test with a white paper (~20k tokens) put in context via prompt injection vs RAG… (content and settings were the same as much as possible. LLM = Gemini-1.5-Pro)…

ANALYSIS:

RAG is inconsistent… sometimes finding the answer, sometimes missing it.

Prompt inject success:

RAG fail:

RAG Request trace:

I did get RAG to answer questions from the file upload at the beginning of the document, and with coaxing it may look at the middle and end… so it’s not a total fail… but it is inconsistent… consistently, or more difficult to work with IMO : )

StevePlex · May 24, 2024, 4:31pm

Here’s the test file in case anyone wants to play with it:

File contains these BME haystack ‘needles’ that are guaranteed to be unique, i.e., not present in external copies of the paper across the internet.

Beginning:

Proofreader: Felonius Monko

Middle:

Editor’s note: The Goldich stability series was discovered by Maurice Goldrch. While Goldichs original order of mineral weathering potential was qualitative, later work by Michal Kowalski and J. Donald Rimstidt placed in the series in quantitative terms. Thanks to Dr. Gomez Pyle at NMU for this clarification.

End:

Dang, S. et al. Cryo-EM structures of the TMEM16A calcium-activated chloride channel. Nature 552, 426429 (2017).

Equatics-paper1-with-unique-haystack-needles-v116.txt (71.8 KB)

All comments, critiques and guidance appreciated !! Going forward, I will conduct the test with more LLMs and embedding models, etc.
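A possible way to automate the beginning / middle / end check across LLMs and settings. The needle strings come from the file above; the questions and the `ask` callable are hypothetical (`ask` stands for whatever function sends a question to the persona under test and returns its answer as text).

```python
NEEDLES = [
    # (position, question (my wording), string expected in the answer)
    ("beginning", "Who is the proofreader?", "Felonius Monko"),
    ("middle", "Who at NMU is thanked in the editor's note?", "Gomez Pyle"),
    ("end", "Which channel do the cryo-EM structures by Dang et al. describe?", "TMEM16A"),
]


def run_needle_test(ask, needles=NEEDLES):
    """Ask one question per needle and record whether the expected
    string shows up in the answer (case-insensitive)."""
    results = {}
    for position, question, expected in needles:
        answer = ask(question)
        results[position] = expected.lower() in answer.lower()
    return results


# Stub standing in for a real bot that only "sees" the start of the document.
def fake_bot(question):
    if "proofreader" in question:
        return "The proofreader is Felonius Monko."
    return "I could not find that in the provided context."


print(run_needle_test(fake_bot))
```

A substring match is a blunt check, so a bot that paraphrases or misspells a name would be scored as a miss; it is only meant as a quick first pass before reading the answers.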

StevePlex · May 24, 2024, 9:02pm

FOOTNOTE:

I was able to rerun the above test with GPT-4o (128k context), making sure to use large token / chunk settings… but it’s still very flaky for my white paper Q/A use case… (lost in the middle, lost at the end, etc.) …here are my settings if anyone wants to duplicate and refine. Would love it if we can find the right settings for this case:

CUSTOM AI PERSONA

Enabled?	Yes
Priority	Yes
Allow Chat	Yes
Allow Mentions	Yes
Vision Enabled	No

Name	Rag Testing Bot 3
Description	Test RAG vs Long Context prompt injection
Default Language Model	GPT-4o-custom
User	Rag_Testing_Bot_bot
Enabled Commands	Categories, Read, Summary
Allowed Groups	trust_level_4

System Prompt	Answer as comprehensively as possible from the provided context on Equatic Carbon Removal Research in the attached file. Do not invent content. Do not use content external to this session. Focus on content provided and create answers from it as accurately and completely as possible.

Max Context Posts	50
Temperature	0.1
Top P	1


Uploads	Equatics-paper1-with-unique-haystack-needles-v116.txt

Upload Chunk Tokens	1024
Upload Chunk Overlap Tokens	10
Search Conversation Chunks	10
Language Model for Question Consolidator	GPT-4o-custom

CUSTOM BOT

Name to display	GPT-4o-custom

Model name	gpt-4o

Service hosting the model	OpenAI
URL of the service hosting the model	https://api.openai.com/v1/chat/completions
API Key of the service hosting the model	D20230943sdf_fake_Qqxo2exWa91

Tokenizer	OpenAITokenizer
Number of tokens for the prompt	30000
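
To get a feel for what the chunk settings above imply, here is a rough sketch of fixed-size chunking with overlap. It is not the plugin's actual chunker; `chunk_tokens` is a hypothetical helper, and a list of integers stands in for real tokens.

```python
def chunk_tokens(tokens, chunk_size=1024, overlap=10):
    """Split a token list into windows of chunk_size that share
    `overlap` tokens with the previous window."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end
    return chunks


tokens = list(range(20_000))  # stand-in for a ~20k token document
chunks = chunk_tokens(tokens, chunk_size=1024, overlap=10)
print(len(chunks))                          # 20
print(chunks[0][-10:] == chunks[1][:10])    # True: neighbours share 10 tokens
```

So a ~20k token paper splits into about 20 chunks at these settings. If the value of 10 for Search Conversation Chunks caps how many chunks reach the prompt (an assumption about what that setting does), roughly half of the document is out of view for any single question.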
