Will RAG Support PDF Files in the Future?

First, your AI stuff rocks!

Second, if we post PDF, Word, or PowerPoint files to our forum, will it also read those and chunk them up into vectors for RAG?

2 Likes

Sadly, we do not have PDF support yet, but it is something we are thinking about. We do support TXT files in our Persona and Tool RAG implementation, so as long as you can convert the source material to text files, you can consume it in a persona.

3 Likes

Yes, that is what we did: we converted attachments to text and associated those with each topic.

1 Like

We have seen this feedback a few times and are considering expanding file extension support in the future through our AI bot persona and Tool RAG implementation.

3 Likes

As a workaround for now, we just convert the PowerPoint, Word, or PDF file to text and attach it to the topic it belongs to.
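For anyone who wants to script that conversion, here is a minimal sketch for the Word and PowerPoint cases using only the Python standard library. It relies on the fact that .docx and .pptx files are ZIP archives of XML; the function names are made up for illustration, and it ignores tabs, line breaks inside paragraphs, speaker notes, and images. PDF is not covered because it needs a dedicated parser.

```python
import re
import zipfile
from xml.etree import ElementTree

# XML namespaces used by Word (w:) and by PowerPoint text bodies (a:).
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
SLIDE_NAME = re.compile(r"ppt/slides/slide(\d+)\.xml")


def _paragraphs(xml_bytes, ns):
    """Collect the non-empty paragraphs (<p> elements) of one XML part."""
    root = ElementTree.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(ns + "p"):
        # A paragraph is split into runs; join the text of every <t> node.
        text = "".join(node.text or "" for node in para.iter(ns + "t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs


def docx_to_text(path):
    """Return the body text of a .docx file, one paragraph per line."""
    with zipfile.ZipFile(path) as archive:
        return "\n".join(_paragraphs(archive.read("word/document.xml"), W_NS))


def pptx_to_text(path):
    """Return the text of a .pptx file, slides in order, separated by blank lines."""
    with zipfile.ZipFile(path) as archive:
        slides = []
        for name in archive.namelist():
            match = SLIDE_NAME.fullmatch(name)
            if match:
                slides.append((int(match.group(1)), name))
        slides.sort()  # numeric order, so slide10 comes after slide2
        blocks = ["\n".join(_paragraphs(archive.read(name), A_NS)) for _, name in slides]
    return "\n\n".join(blocks)
```

The returned string can be written to a .txt file and attached to the topic, or uploaded to a persona, in place of the original document.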

1 Like

PDF support would absolutely be a game changer for many communities! Given that PDF seems to be a universal standard for documents, we often find ourselves having to reformat documents into .txt for RAG, which is time-intensive :face_with_spiral_eyes:

5 Likes

We are finishing some work on Embeddings, and as soon as that is complete, adding PDF support is next up.

5 Likes

Wow, that’s super nice. Kudos to the team for always keeping in mind what the community needs!

What about JSON files? I have a ton of exported Discord chats that we need to query with AI so we don’t lose this info :slight_smile:

I was thinking about fine-tuning models, but I think adding the files to Discourse should be better and simpler for everyone with a similar use case.

JSON is just text so we already support it.

It is an inefficient representation for LLMs because of the large amount of duplication within the format (every record repeats the same keys and punctuation), so it would waste some tokens, but overall it will work. I would recommend running a script on it and reformatting it to improve RAG performance.

It is very hard to do this automatically because JSON can be deeply nested, and picking a good domain-specific text representation depends heavily on the domain.
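As an illustration of that kind of script, here is a minimal Python sketch that flattens a chat export into one compact line per message. The input shape is an assumption: a top-level object with a `channel` string and a `messages` list whose items carry `author`, `timestamp`, and `content`. A real Discord export will differ, so the field names need adjusting to the actual file.

```python
import json


def flatten_chat_export(raw_json):
    """Turn a chat export into compact text, one line per message.

    Assumed (hypothetical) input shape:
    {"channel": "...", "messages": [{"author": ..., "timestamp": ..., "content": ...}]}
    """
    export = json.loads(raw_json)
    lines = [f"# Channel: {export.get('channel', 'unknown')}"]
    for message in export.get("messages", []):
        # Collapse newlines and repeated spaces so each message stays on one line.
        content = " ".join(str(message.get("content", "")).split())
        if not content:
            continue  # skip messages with no text, e.g. attachment-only posts
        author = message.get("author", "unknown")
        timestamp = message.get("timestamp", "")
        lines.append(f"[{timestamp}] {author}: {content}")
    return "\n".join(lines)
```

Each message keeps its author and time next to its text, so a retrieved chunk stays readable on its own, while the repeated keys, quotes, and braces are dropped.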

3 Likes

Thanks Sam, may I ask what you would suggest for balancing performance and price when adding ~150 MB of JSON (or PDF)?

This is my first time doing RAG on our data, and I will start learning the process soon.

I appreciate any insight from the community as well.
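For a rough sense of scale, here is a back-of-the-envelope sketch in Python. Both defaults are assumptions rather than measured values: about 4 characters per token is a common rule of thumb for English text (JSON usually tokenizes less efficiently), one byte is treated as one character, and the price per million tokens is a placeholder to replace with the actual embedding provider's rate.

```python
def estimate_embedding_cost(size_bytes, chars_per_token=4.0, price_per_million_tokens=0.10):
    """Estimate token count and one-off embedding cost for a body of text.

    chars_per_token and price_per_million_tokens are illustrative assumptions,
    not figures taken from any provider's price list.
    """
    tokens = size_bytes / chars_per_token  # treats 1 byte as 1 character
    cost = tokens / 1_000_000 * price_per_million_tokens
    return round(tokens), round(cost, 2)


tokens, cost = estimate_embedding_cost(150_000_000)  # ~150 MB
print(tokens, cost)
```

Under those assumptions, 150 MB comes to roughly 37.5 million tokens. Stripping JSON overhead before upload reduces the number of bytes that have to be embedded at all.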

I must say, this commit is looking quite beautiful :heart_eyes:

Is there maybe possibly perhaps any timeline for the full release of this feature? I see that it is a hidden site feature for now.

5 Likes

One of the challenges with the work behind this is supporting PDFs of all types. As you can imagine, some PDFs are straight text and easy to parse. However, others have custom fonts, images, graphics, non-linear layouts, and so on.

We are trying to find a way to make all types of PDFs work, and as such it may take a bit of time.

2 Likes

Very well said. I think DeepSeek is now changing that landscape a bit. Running smaller DeepSeek models locally with ollama can now provide quality inference and offer a solution to these concerns.

Sorry to bug you, @Saif, may I get your help with a related topic here: How to properly debug AI Personas? Thank you!

Thank you for such an incredible enhancement to an already amazing plugin.

The PR points out that:

  • RAG Digestion Job: The DigestRagUpload job now handles PDF and image uploads. It uses PdfToImages and ImageToText to extract text and create document fragments.

When will this job actually run? Is this something I need to kick off?

I just uploaded some txt files and a PDF. The txt files are indexed immediately but the PDF still says “ready to be indexed”.

Thank you. :pray:

1 Like

The job is running but experiencing a bug:

Jobs::HandledExceptionWrapper: Wrapped NameError: undefined local variable or method `temp_dir' for an instance of DiscourseAi::Utils::PdfToImages

I self-host. Perhaps this is something I can dig deeper into?

I would hold off on using this feature since it is not technically live just yet. You are going to run into issues here.

2 Likes

I think I found the problem in PdfToImages:

3 Likes

Confirmed. Give me a few days here; I also want to try direct text extraction, which is something we can enable by default.

Then “rich” LLM-based extraction can be behind flags.

The trouble with many PDFs is that they are huge and can be very taxing on server resources. Additionally, tools like tesseract can be a bit tricky to install, although they can improve the quality.
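The default-plus-flag split mentioned above can be pictured with a small Python sketch. This is not the plugin's code: `ExtractionSettings`, `use_llm_extraction`, and both extractor callables are hypothetical names, and the extractors are passed in as plain functions so the control flow is visible without any PDF or LLM library.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ExtractionSettings:
    # Hypothetical flag: the expensive LLM pass is off unless an admin opts in.
    use_llm_extraction: bool = False


def extract_document_text(
    pdf_bytes: bytes,
    direct_extractor: Callable[[bytes], str],
    llm_extractor: Optional[Callable[[bytes, str], str]] = None,
    settings: Optional[ExtractionSettings] = None,
) -> str:
    """Run cheap direct extraction always; refine with an LLM only when enabled."""
    settings = settings or ExtractionSettings()
    text = direct_extractor(pdf_bytes)  # default path, no model calls
    if settings.use_llm_extraction and llm_extractor is not None:
        # The direct text is handed to the LLM step so it can refine it
        # instead of starting from the page images alone.
        return llm_extractor(pdf_bytes, text)
    return text
```

Keeping the model call behind an explicit setting means large PDFs only cost server or API resources when someone has chosen to pay for the higher-quality pass.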

5 Likes

@sam, I self-host and am wrestling with tesseract now. It installed with no problem, but it’s throwing errors that don’t seem to be serious enough to fail the job:

Error during OCR processing: /var/www/discourse/lib/discourse.rb:139:in `exec': Failed to OCR image with Tesseract
Estimating resolution as 337

Even with that error, the PDF shows in the Persona as being indexed.

I’m not sure what this means in terms of the impact on RAG. I’ll dig deeper over the weekend.
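To make the possible impact concrete, here is a generic Python sketch of how a digestion job could log an OCR failure for one page and still finish. Whether the plugin behaves this way is not confirmed in this thread; `ocr_pages` and the `ocr_page` callable are hypothetical names used only for illustration.

```python
import logging

logger = logging.getLogger("rag_digest_sketch")


def ocr_pages(pages, ocr_page):
    """OCR each page, logging failures instead of aborting the whole document.

    Returns the combined text plus the 1-based numbers of pages that failed,
    so the caller can tell how much of the document is missing from the index.
    """
    texts, failed = [], []
    for number, page in enumerate(pages, start=1):
        try:
            texts.append(ocr_page(page))
        except RuntimeError as error:
            # The job keeps going; this page simply contributes no text.
            logger.warning("OCR failed on page %d: %s", number, error)
            failed.append(number)
    return "\n".join(texts), failed
```

In a design like this, a document can show as indexed while some pages contributed nothing, so comparing the failed page count with the total page count is one way to gauge how much the errors matter for retrieval.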

Thank you for responding so quickly.

2 Likes

We have an eval (and I want to add more), but basically, depending on the model, image-to-text quality varies a lot if it is not grounded.

The good news, though, is that with PDFs we can do text extraction in a lossless way and then lean on the LLM only if you want to gold-plate the result. We should have something next week.

6 Likes