Will RAG Support PDF Files in the Future?

silvacarl · September 30, 2024, 5:35pm

first, your AI stuff rocks!

second, if we post PDF, Word, or PowerPoint files to our forum, will it also read those and chunk them up into vectors for RAG?

sam · October 1, 2024, 5:38am

Sadly, we do not have PDF support yet; it is something we are thinking about. We do support TXT files in our Persona and Tool RAG implementation, so as long as you are able to convert the source material to text files, you can consume it in a persona.

silvacarl · October 7, 2024, 8:39pm

yes, that is what we did, we converted attachments to text and associated those with each topic.

Saif · October 8, 2024, 2:54pm

We have seen this feedback a few times and are considering expanding extension support in the future through our AI bot persona and Tool RAG implementation

silvacarl · October 8, 2024, 6:43pm

as a workaround for now, we just convert the PowerPoint, Word, or PDF file to text and attach it to the same topic it belongs with.
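
For readers trying the same workaround, here is a minimal sketch of the Word case using only the Python standard library. It relies on the fact that a .docx file is a zip archive whose body text lives in `word/document.xml`. The function name `docx_to_text` is hypothetical, and the sketch ignores headers, footnotes, and embedded objects; PDF and PowerPoint files need other tooling.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(docx_bytes: bytes) -> str:
    """Return the body text of a .docx file, one paragraph per line.

    Hypothetical helper: only reads word/document.xml and skips
    headers, footnotes, and embedded objects.
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text.strip():
            paragraphs.append(text)
    return "\n".join(paragraphs)


if __name__ == "__main__":
    import sys

    # Usage: python docx_to_text.py input.docx > output.txt
    with open(sys.argv[1], "rb") as handle:
        print(docx_to_text(handle.read()))
```

The resulting .txt file can then be attached to the topic or uploaded to a persona like any other text file.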

MachineScholar · November 12, 2024, 4:04pm

PDF support would absolutely be a game changer for many communities! Given that it seems to be a universal standard for documents, we often find ourselves having to reformat stuff into .txt for RAG which is indeed time-intensive

Saif · November 12, 2024, 7:26pm

We are finishing some work on Embeddings and as soon as that is complete next up will be adding PDF support

satonotdead · November 12, 2024, 10:27pm

Wow, that’s super nice. Kudos to the team that always takes into mind what the community needs!

What about JSON files? I had a ton of Discord chats exported that we need to query within AI so we don’t lose this info

I was thinking about fine-tuning models, but I think adding the files to Discourse should be better and simpler for everyone with a similar use case.

sam · November 13, 2024, 12:11am

JSON is just text so we already support it.

It is an inefficient representation for LLMs given the large amount of duplication within the format, so it would waste a few tokens, but overall it will work. I would recommend running a script on it and reformatting to improve RAG performance.

It is very hard to do this automatically because JSON can be deeply nested, and picking a perfect domain-specific text representation depends heavily on the domain.
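
As an illustration of the kind of reformatting script suggested above, here is a minimal sketch for a chat export. The input shape (a top-level `messages` list whose items carry `timestamp`, `author.name`, and `content`) is an assumption for this example, not a documented Discord export format, so adjust the field names to match the actual files.

```python
import json


def flatten_chat_export(raw_json: str) -> str:
    """Turn a chat export into compact '[timestamp] author: message' lines.

    Assumed (hypothetical) input shape:
    {"messages": [{"timestamp": ..., "author": {"name": ...}, "content": ...}]}
    """
    data = json.loads(raw_json)
    lines = []
    for msg in data.get("messages", []):
        content = (msg.get("content") or "").strip()
        if not content:
            continue  # skip messages with no text, e.g. attachment-only posts
        author = (msg.get("author") or {}).get("name", "unknown")
        timestamp = msg.get("timestamp", "")
        lines.append(f"[{timestamp}] {author}: {content}")
    return "\n".join(lines)


if __name__ == "__main__":
    import sys

    # Usage: python flatten_chat.py export.json > export.txt
    with open(sys.argv[1], encoding="utf-8") as handle:
        print(flatten_chat_export(handle.read()))
```

Each message becomes one short line, which drops the repeated keys and punctuation that make raw JSON token-heavy. The output can be uploaded as a .txt file.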

satonotdead · November 15, 2024, 9:45pm

Thanks Sam, can I ask what you would suggest to balance performance and price when adding ~150 MB of JSON (on PDF)?

This is my first time doing RAG on our data, and I will start learning the process soon.

I appreciate any insight from the community as well.

MachineScholar · February 14, 2025, 10:19am

I must say, this commit is looking quite beautiful

https://github.com/discourse/discourse-ai/commit/5e80f93e4c0767199e4d5fb0caff8b07ee1498db

Is there maybe possibly perhaps any timeline for full release of this feature? I see that it is a hidden site feature for now

Saif · February 14, 2025, 11:22am

One of the challenges with the work behind this is supporting PDFs of all types. As you can imagine, some PDFs are straight text and easy to parse. However, others have custom fonts, images, graphics, non-linear formatting, etc.

We are trying to find a way to make all types of PDFs work, and as such it may take a bit of time.

Overgrow · February 14, 2025, 12:43pm

Very well said. I think that DeepSeek is now changing that landscape a bit. Running smaller DeepSeek models locally with ollama can now provide quality inference and offer a solution to these concerns.

Sorry to bug you, @Saif, may I get your help with a related topic here: How to properly debug AI Personas? Thank you!

Yenwod · February 14, 2025, 2:07pm

Thank you for such an incredible enhancement to an already amazing plugin.

The PR points out that:

RAG Digestion Job: The DigestRagUpload job now handles PDF and image uploads. It uses PdfToImages and ImageToText to extract text and create document fragments.

When will this job actually run? Is this something I need to kickoff?

I just uploaded some txt files and a PDF. The txt files are indexed immediately but the PDF still says “ready to be indexed”.

Thank you.

Yenwod · February 14, 2025, 5:35pm

The job is running but experiencing a bug:

Jobs::HandledExceptionWrapper: Wrapped NameError: undefined local variable or method `temp_dir' for an instance of DiscourseAi::Utils::PdfToImages

I self-host. Perhaps this is something I can dig deeper into?

Saif · February 14, 2025, 5:41pm

I would hold off on using this feature since it is not technically live just yet. You are going to run into issues here

Yenwod · February 14, 2025, 5:41pm

I think I found the problem in PdfToImages:

sam · February 14, 2025, 11:52pm

Confirmed, give me a few days here, I want to also try direct text extraction which is something we can enable by default.

Then “rich” LLM based extraction can be behind flags.

The trouble with many PDFs is that they are huge and can be very taxing on server resources. Additionally, tools like Tesseract can improve the quality but can be a bit tricky to install.

Yenwod · February 15, 2025, 12:44am

@sam, I self-host and am wrestling with tesseract now. It installed with no problem, but it's throwing errors that don’t seem to be serious enough to fail the job:

Error during OCR processing: /var/www/discourse/lib/discourse.rb:139:in `exec': Failed to OCR image with Tesseract
Estimating resolution as 337

Even with that error, the PDF shows in the Persona as being indexed.

I’m not sure what this means in terms of the impact on RAG. I’ll dig deeper over the weekend.

Thank you for responding so quickly.

sam · February 15, 2025, 3:16am

We have an eval (and I want to add more), but basically image-to-text quality varies a lot depending on the model if it is not grounded.

The good news though is that with PDF we can do text extraction in a lossless way and then only lean on the LLM to improve it if you want to gold plate. Should have something next week.
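
The approach described above (direct text extraction by default, with LLM-based extraction as an opt-in improvement) could be sketched roughly as follows. This is not the plugin's implementation: `extract_pages`, the two extractor callables, and the `min_chars` threshold are all hypothetical names chosen for illustration, and falling back only on sparse pages is one possible policy among several.

```python
from typing import Callable, List, Optional


def extract_pages(
    pages: List[bytes],
    direct_extract: Callable[[bytes], str],
    rich_extract: Optional[Callable[[bytes], str]] = None,
    min_chars: int = 20,
) -> List[str]:
    """Extract text page by page, preferring cheap direct extraction.

    direct_extract: reads the text layer of a page (fast, lossless).
    rich_extract:   optional OCR- or LLM-based extractor; acts as the
                    opt-in "flag" and is only called for sparse pages.
    min_chars:      hypothetical threshold below which a page is
                    treated as having no usable text layer.
    """
    results = []
    for page in pages:
        text = direct_extract(page).strip()
        if len(text) < min_chars and rich_extract is not None:
            # Likely a scanned or image-only page: use the costly path.
            text = rich_extract(page).strip()
        results.append(text)
    return results
```

Keeping the expensive extractor optional means text-based PDFs never touch OCR or an LLM, which limits the server load mentioned earlier in the thread.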
