Will RAG Support PDF Files in the Future?

silvacarl · September 30, 2024, 5:35pm

First, your AI stuff rocks!

Second, if we post PDF, Word, or PowerPoint files to our forum, will it also read those and chunk them up into vectors for RAG?

sam · October 1, 2024, 5:38am

Sadly we do not have PDF support yet, it is something we are thinking about. We do support TXT files in our Persona and Tool RAG implementation. So as long as you are able to convert the source material to text files you can consume it in a persona.

silvacarl · October 7, 2024, 8:39pm

Yes, that is what we did: we converted attachments to text and associated them with each topic.

Saif · October 8, 2024, 2:54pm

We have seen this feedback a few times and are considering expanding file extension support in the future through our AI bot persona and Tool RAG implementation.

silvacarl · October 8, 2024, 6:43pm

As a workaround for now, we just convert the PowerPoint, Word, or PDF file to text and attach it to the topic it belongs with.
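
For anyone wanting to script the Word half of this workaround, here is a minimal sketch using only the Python standard library. It relies on the fact that a .docx file is a ZIP archive whose main text lives in `word/document.xml`. The function name `docx_to_text` is made up for illustration and is not part of Discourse or the discourse-ai plugin. It ignores headers, footers, and footnotes, and it does not handle PDF or PowerPoint files.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(source) -> str:
    """Return the body text of a .docx file, one paragraph per line.

    `source` can be a file path or a binary file-like object.
    Hypothetical helper for illustration only.
    """
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph is split into runs; join the text of every run.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text.strip():
            paragraphs.append(text)
    return "\n".join(paragraphs)
```

The resulting string can be saved as a .txt file and uploaded wherever TXT files are accepted.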

MachineScholar · November 12, 2024, 4:04pm

PDF support would absolutely be a game changer for many communities! Given that it seems to be a universal standard for documents, we often find ourselves having to reformat stuff into .txt for RAG which is indeed time-intensive

Saif · November 12, 2024, 7:26pm

We are finishing some work on Embeddings and as soon as that is complete next up will be adding PDF support

satonotdead · November 12, 2024, 10:27pm

Wow, that’s super nice. Kudos to the team that always takes into mind what the community needs!

What about JSON files? I had a ton of Discord chats exported that we need to query within AI so we don’t lose this info

I was thinking about fine-tuning models, but I think adding the files to Discourse should be better and simpler for everyone with a similar use case.

sam · November 13, 2024, 12:11am

JSON is just text so we already support it.

It is an inefficient representation for LLMs given the large amount of duplication within the format, so it would waste a few tokens, but overall it will work. I would recommend running a script on it and reformatting to improve RAG performance.

It is very hard to do this automatically because JSON can be very nested, and picking a perfect domain-specific text representation depends heavily on the domain.
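
As a rough illustration of the kind of reformatting script meant here, the sketch below flattens a chat export into one plain-text line per message. The JSON shape (a `channel` object with a `name`, and a `messages` list whose items have `author.name`, `timestamp`, and `content`) is an assumption made for the example; a real Discord export may be structured differently, so the field names would need adjusting.

```python
import json


def flatten_messages(export: dict) -> str:
    """Turn a chat export into compact text, one line per message.

    The field names used here are assumed for illustration and must be
    adapted to the actual export format.
    """
    channel = export.get("channel", {}).get("name", "unknown-channel")
    lines = [f"# {channel}"]
    for msg in export.get("messages", []):
        content = (msg.get("content") or "").strip()
        if not content:
            continue  # skip empty messages, they only add noise
        author = msg.get("author", {}).get("name", "unknown")
        # Keep "YYYY-MM-DD HH:MM" from an ISO 8601 timestamp
        timestamp = msg.get("timestamp", "")[:16].replace("T", " ")
        lines.append(f"[{timestamp}] {author}: {content}")
    return "\n".join(lines)


def convert_file(json_path: str, txt_path: str) -> None:
    """Read one JSON export and write the flattened text next to it."""
    with open(json_path, encoding="utf-8") as src:
        export = json.load(src)
    with open(txt_path, "w", encoding="utf-8") as dst:
        dst.write(flatten_messages(export))
```

Dropping the repeated keys, braces, and quotes is where the token savings come from; which fields are worth keeping remains a per-domain decision.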

satonotdead · November 15, 2024, 9:45pm

Thanks Sam, can I ask for your suggestion on balancing performance and price when adding ~150 MB of JSON (on PDF)?

This is my first time doing RAG on our data, and I will start learning the process soon.

I appreciate any insight from the community as well.

MachineScholar · February 14, 2025, 10:19am

I must say, this commit is looking quite beautiful

github.com/discourse/discourse-ai

FEATURE: PDF support for rag pipeline (#1118)

committed 01:15AM - 14 Feb 25 UTC

SamSaffron

+1329 -141

This PR introduces several enhancements and refactorings to the AI Persona and RAG (Retrieval-Augmented Generation) functionalities within the discourse-ai plugin. Here's a breakdown of the changes:

**1. LLM Model Association for RAG and Personas:**

- **New Database Columns:** Adds `rag_llm_model_id` to both `ai_personas` and `ai_tools` tables. This allows specifying a dedicated LLM for RAG indexing, separate from the persona's primary LLM. Adds `default_llm_id` and `question_consolidator_llm_id` to `ai_personas`.
- **Migration:** Includes a migration (`20250210032345_migrate_persona_to_llm_model_id.rb`) to populate the new `default_llm_id` and `question_consolidator_llm_id` columns in `ai_personas` based on the existing `default_llm` and `question_consolidator_llm` string columns, and a post migration to remove the latter.
- **Model Changes:** The `AiPersona` and `AiTool` models now `belong_to` an `LlmModel` via `rag_llm_model_id`. The `LlmModel.proxy` method now accepts an `LlmModel` instance instead of just an identifier. `AiPersona` now has `default_llm_id` and `question_consolidator_llm_id` attributes.
- **UI Updates:** The AI Persona and AI Tool editors in the admin panel now allow selecting an LLM for RAG indexing (if PDF/image support is enabled). The RAG options component displays an LLM selector.
- **Serialization:** The serializers (`AiCustomToolSerializer`, `AiCustomToolListSerializer`, `LocalizedAiPersonaSerializer`) have been updated to include the new `rag_llm_model_id`, `default_llm_id` and `question_consolidator_llm_id` attributes.

**2. PDF and Image Support for RAG:**

- **Site Setting:** Introduces a new hidden site setting, `ai_rag_pdf_images_enabled`, to control whether PDF and image files can be indexed for RAG. This defaults to `false`.
- **File Upload Validation:** The `RagDocumentFragmentsController` now checks the `ai_rag_pdf_images_enabled` setting and allows PDF, PNG, JPG, and JPEG files if enabled. Error handling is included for cases where PDF/image indexing is attempted with the setting disabled.
- **PDF Processing:** Adds a new utility class, `DiscourseAi::Utils::PdfToImages`, which uses ImageMagick (`magick`) to convert PDF pages into individual PNG images. A maximum PDF size and conversion timeout are enforced.
- **Image Processing:** A new utility class, `DiscourseAi::Utils::ImageToText`, is included to handle OCR for the images and PDFs.
- **RAG Digestion Job:** The `DigestRagUpload` job now handles PDF and image uploads. It uses `PdfToImages` and `ImageToText` to extract text and create document fragments.
- **UI Updates:** The RAG uploader component now accepts PDF and image file types if `ai_rag_pdf_images_enabled` is true. The UI text is adjusted to indicate supported file types.

**3. Refactoring and Improvements:**

- **LLM Enumeration:** The `DiscourseAi::Configuration::LlmEnumerator` now provides a `values_for_serialization` method, which returns a simplified array of LLM data (id, name, vision_enabled) suitable for use in serializers. This avoids exposing unnecessary details to the frontend.
- **AI Helper:** The `AiHelper::Assistant` now takes optional `helper_llm` and `image_caption_llm` parameters in its constructor, allowing for greater flexibility.
- **Bot and Persona Updates:** Several updates were made across the codebase, changing the string based association to a LLM to the new model based.
- **Audit Logs:** The `DiscourseAi::Completions::Endpoints::Base` now formats raw request payloads as pretty JSON for easier auditing.
- **Eval Script:** An evaluation script is included.

**4. Testing:**

- The PR introduces a new eval system for LLMs, this allows us to test how functionality works across various LLM providers. This lives in `/evals`
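
To make the "PDF Processing" step in the description above more concrete, here is a hedged Python sketch of the general idea: reject oversized PDFs, then shell out to ImageMagick with a timeout to render one PNG per page. This is not the plugin's code (the plugin is written in Ruby, and its actual limits and command-line options are not shown in this thread). The function names, the 150 DPI density, the 100 MB limit, and the 120 second timeout are all assumptions chosen for illustration. Running it requires ImageMagick 7 (the `magick` binary) with a PDF delegate such as Ghostscript.

```python
import subprocess
from pathlib import Path

# Illustrative limits only; the real plugin's values are not stated here.
MAX_PDF_BYTES = 100 * 1024 * 1024
TIMEOUT_SECONDS = 120


def build_magick_command(pdf_path, out_dir, density=150):
    """Build an ImageMagick command that writes one PNG per PDF page."""
    pattern = Path(out_dir) / "page-%04d.png"
    return ["magick", "-density", str(density), str(pdf_path), str(pattern)]


def pdf_to_images(pdf_path, out_dir, max_bytes=MAX_PDF_BYTES,
                  timeout=TIMEOUT_SECONDS):
    """Convert a PDF into page images, enforcing a size cap and a timeout."""
    pdf_path = Path(pdf_path)
    out_dir = Path(out_dir)
    if pdf_path.stat().st_size > max_bytes:
        raise ValueError(f"{pdf_path.name} exceeds the {max_bytes}-byte limit")

    out_dir.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError on a non-zero exit status;
    # timeout raises TimeoutExpired if the conversion runs too long.
    subprocess.run(build_magick_command(pdf_path, out_dir),
                   check=True, timeout=timeout)
    return sorted(out_dir.glob("page-*.png"))
```

Each resulting image would then go to an OCR or vision step to produce text for indexing.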

Is there maybe possibly perhaps any timeline for full release of this feature? I see that it is a hidden site feature for now

Saif · February 14, 2025, 11:22am

One of the challenges with the work behind this is supporting PDFs of all types. As you can imagine some PDFs are straight text and easy to parse. However, there are some with custom fonts, images, graphics, non-linearly formatted etc…

We are trying to find a way to make all types of PDFs work, and as such it may take a bit of time.

Overgrow · February 14, 2025, 12:43pm

Very well said. I think that DeepSeek is now changing that landscape a bit. Running smaller DeepSeek models locally with ollama can now provide quality inference and a solution to these concerns.

Sorry to bug you, @Saif, may I get your help with a related topic here: How to properly debug AI Personas? Thank you!

Yenwod · February 14, 2025, 2:07pm

Thank you for such an incredible enhancement to an already amazing plugin.

The PR points out that:

RAG Digestion Job: The DigestRagUpload job now handles PDF and image uploads. It uses PdfToImages and ImageToText to extract text and create document fragments.

When will this job actually run? Is this something I need to kick off?

I just uploaded some txt files and a PDF. The txt files are indexed immediately but the PDF still says “ready to be indexed”.

Thank you.

Yenwod · February 14, 2025, 5:35pm

The job is running but experiencing a bug:

Jobs::HandledExceptionWrapper: Wrapped NameError: undefined local variable or method `temp_dir' for an instance of DiscourseAi::Utils::PdfToImages

I self-host. Perhaps this is something I can dig deeper into?

Saif · February 14, 2025, 5:41pm

I would hold off on using this feature since it is not technically live just yet. You are going to run into issues here.

Yenwod · February 14, 2025, 5:41pm

I think I found the problem in PdfToImages:

sam · February 14, 2025, 11:52pm

Confirmed, give me a few days here, I want to also try direct text extraction which is something we can enable by default.

Then “rich” LLM based extraction can be behind flags.

The trouble with many PDFs is that they are huge and can be very taxing on server resources. Additionally, tools like Tesseract can be a bit tricky to install, though they can improve the quality.

Yenwod · February 15, 2025, 12:44am

@sam, I self-host and am wrestling with Tesseract now. It installed with no problem, but it's throwing errors that don't seem to be serious enough to fail the job:

Error during OCR processing: /var/www/discourse/lib/discourse.rb:139:in `exec': Failed to OCR image with Tesseract
Estimating resolution as 337

Even with that error, the PDF shows in the Persona as being indexed.

I’m not sure what this means in terms of the impact on RAG. I’ll dig deeper over the weekend.

Thank you for responding so quickly.

sam · February 15, 2025, 3:16am

We have an eval (and I want to add more), but basically, depending on the model, image-to-text quality varies a lot if it is not grounded.

The good news though is that with PDF we can do text extraction in a lossless way and then only lean on the LLM to improve it if you want to gold plate. Should have something next week.
