Will RAG support PDF files in the future?

silvacarl · September 30, 2024, 17:35

First of all, your AI is great!

Second, if we post PDF, Word, or PowerPoint files to the forum, will it also be able to read those files and chunk them into vectors for RAG?

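sam · October 1, 2024, 05:38

Sorry, we don't support PDF files yet, but we're considering it. Our Persona and Tool RAG implementations support TXT files, so as long as you can convert your source material to a text file, you can use it in a Persona.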

silvacarl · October 7, 2024, 20:39

Yes, that's exactly what we do: we convert the attachments to text and associate them with each topic.

Saif · October 8, 2024, 14:54

We've received this feedback a few times and are looking at extending support in our AI Persona and Tool RAG implementations in the future.

silvacarl · October 8, 2024, 18:43

As a temporary workaround, we just convert the PowerPoint, Word, or PDF file to text and then attach it to the same topic it belongs to.

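MachineScholar · November 12, 2024, 16:04

PDF support would be revolutionary for many communities! Since PDF seems to be the universal standard for documents, we often find ourselves having to reformat content into .txt for RAG, which is really time-consuming.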

Saif · November 12, 2024, 19:26

We're finishing up some work on Embeddings; once that's done, the next step will be adding PDF support.

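satonotdead · November 12, 2024, 22:27

Awesome, that's great news. Kudos to the team for always keeping the community's needs in mind!

What about JSON files? I have a lot of exported Discord chat logs that we need to be able to query through the AI so that this information isn't lost.

I had considered fine-tuning a model, but I think adding the files to Discourse would be better and simpler for everyone with a similar use case.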

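sam · November 13, 2024, 00:11

JSON is just text, so we already support it.

It is an inefficient representation for large language models, because the format contains a lot of repetition that wastes tokens, but overall it works. I'd recommend running a script to process and reformat it to improve RAG performance.

This is hard to automate, because JSON can be deeply nested and the ideal domain-specific text representation depends heavily on the domain.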
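As an illustration of the kind of preprocessing script suggested above, here is a minimal Python sketch. It assumes a hypothetical export layout in which each message is a dict with `author.name`, `timestamp`, and `content` keys under a top-level `messages` list; real Discord export tools may use different field names, so the keys would need adjusting.

```python
import json


def flatten_messages(messages):
    """Turn a list of chat-message dicts into compact '[time] author: text' lines.

    The field names used here (author.name, timestamp, content) are assumptions
    about the export format, not a documented schema.
    """
    lines = []
    for msg in messages:
        content = (msg.get("content") or "").strip()
        if not content:
            continue  # skip empty messages, e.g. attachment-only posts
        author = (msg.get("author") or {}).get("name", "unknown")
        timestamp = (msg.get("timestamp") or "")[:16]  # keep minute precision only
        content = " ".join(content.split())  # collapse newlines and repeated spaces
        lines.append(f"[{timestamp}] {author}: {content}")
    return "\n".join(lines)


def convert_file(json_path, txt_path):
    """Read an exported JSON file and write the flattened text next to it."""
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write(flatten_messages(data["messages"]))
```

Dropping the repeated keys and punctuation this way leaves one short line per message, which is the token saving described above; the resulting .txt file can then be uploaded like any other text document.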

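satonotdead · November 15, 2024, 21:45

Thanks, Sam. May I ask what you would recommend for balancing performance and cost when adding around 150 MB of JSON (on PDF)?

This is my first time using RAG on our data, and I'll be starting to learn the process soon.

I'd also welcome any insights from the community.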

MachineScholar · February 14, 2025, 10:19

I must say, this commit looks pretty good:

github.com/discourse/discourse-ai

FEATURE: PDF support for rag pipeline (#1118)

committed 01:15AM - 14 Feb 25 UTC

SamSaffron

+1329 -141

This PR introduces several enhancements and refactorings to the AI Persona and RAG (Retrieval-Augmented Generation) functionalities within the discourse-ai plugin. Here's a breakdown of the changes:

**1. LLM Model Association for RAG and Personas:**

- **New Database Columns:** Adds `rag_llm_model_id` to both `ai_personas` and `ai_tools` tables. This allows specifying a dedicated LLM for RAG indexing, separate from the persona's primary LLM. Adds `default_llm_id` and `question_consolidator_llm_id` to `ai_personas`.
- **Migration:** Includes a migration (`20250210032345_migrate_persona_to_llm_model_id.rb`) to populate the new `default_llm_id` and `question_consolidator_llm_id` columns in `ai_personas` based on the existing `default_llm` and `question_consolidator_llm` string columns, and a post migration to remove the latter.
- **Model Changes:** The `AiPersona` and `AiTool` models now `belong_to` an `LlmModel` via `rag_llm_model_id`. The `LlmModel.proxy` method now accepts an `LlmModel` instance instead of just an identifier. `AiPersona` now has `default_llm_id` and `question_consolidator_llm_id` attributes.
- **UI Updates:** The AI Persona and AI Tool editors in the admin panel now allow selecting an LLM for RAG indexing (if PDF/image support is enabled). The RAG options component displays an LLM selector.
- **Serialization:** The serializers (`AiCustomToolSerializer`, `AiCustomToolListSerializer`, `LocalizedAiPersonaSerializer`) have been updated to include the new `rag_llm_model_id`, `default_llm_id` and `question_consolidator_llm_id` attributes.

**2. PDF and Image Support for RAG:**

- **Site Setting:** Introduces a new hidden site setting, `ai_rag_pdf_images_enabled`, to control whether PDF and image files can be indexed for RAG. This defaults to `false`.
- **File Upload Validation:** The `RagDocumentFragmentsController` now checks the `ai_rag_pdf_images_enabled` setting and allows PDF, PNG, JPG, and JPEG files if enabled. Error handling is included for cases where PDF/image indexing is attempted with the setting disabled.
- **PDF Processing:** Adds a new utility class, `DiscourseAi::Utils::PdfToImages`, which uses ImageMagick (`magick`) to convert PDF pages into individual PNG images. A maximum PDF size and conversion timeout are enforced.
- **Image Processing:** A new utility class, `DiscourseAi::Utils::ImageToText`, is included to handle OCR for the images and PDFs.
- **RAG Digestion Job:** The `DigestRagUpload` job now handles PDF and image uploads. It uses `PdfToImages` and `ImageToText` to extract text and create document fragments.
- **UI Updates:** The RAG uploader component now accepts PDF and image file types if `ai_rag_pdf_images_enabled` is true. The UI text is adjusted to indicate supported file types.

**3. Refactoring and Improvements:**

- **LLM Enumeration:** The `DiscourseAi::Configuration::LlmEnumerator` now provides a `values_for_serialization` method, which returns a simplified array of LLM data (id, name, vision_enabled) suitable for use in serializers. This avoids exposing unnecessary details to the frontend.
- **AI Helper:** The `AiHelper::Assistant` now takes optional `helper_llm` and `image_caption_llm` parameters in its constructor, allowing for greater flexibility.
- **Bot and Persona Updates:** Several updates were made across the codebase, changing the string based association to a LLM to the new model based.
- **Audit Logs:** The `DiscourseAi::Completions::Endpoints::Base` now formats raw request payloads as pretty JSON for easier auditing.
- **Eval Script:** An evaluation script is included.

**4. Testing:**

- The PR introduces a new eval system for LLMs, this allows us to test how functionality works across various LLM providers. This lives in `/evals`
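To make the "PDF Processing" step more concrete, here is a rough Python sketch of converting PDF pages to PNGs with the ImageMagick `magick` command while enforcing a size limit and a timeout. It is not the plugin's Ruby implementation: the function name, the limits, the density value, and the output naming pattern are all assumptions chosen for illustration.

```python
import os
import subprocess

MAX_PDF_BYTES = 100 * 1024 * 1024  # assumed size cap, not the plugin's actual value
TIMEOUT_SECONDS = 120              # assumed conversion timeout


def pdf_to_pngs(pdf_path, out_dir, max_bytes=MAX_PDF_BYTES, timeout=TIMEOUT_SECONDS):
    """Convert each page of a PDF into a PNG file and return the sorted paths."""
    # Reject oversized files before spending any CPU on conversion.
    if os.path.getsize(pdf_path) > max_bytes:
        raise ValueError(f"PDF is larger than {max_bytes} bytes")
    os.makedirs(out_dir, exist_ok=True)
    pattern = os.path.join(out_dir, "page-%04d.png")
    # subprocess.run kills the child and raises TimeoutExpired if the
    # conversion runs longer than `timeout` seconds.
    subprocess.run(
        ["magick", "-density", "150", pdf_path, pattern],
        check=True,
        timeout=timeout,
    )
    return sorted(
        os.path.join(out_dir, name)
        for name in os.listdir(out_dir)
        if name.endswith(".png")
    )
```

Checking the size first and bounding the run time are the two guards the PR description mentions; they keep a single huge upload from tying up a background worker.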

Is there any chance of, or a timeline for, this feature being fully released? I see it is currently behind a hidden site setting.

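Saif · February 14, 2025, 11:22

One of the challenges behind this work is supporting all types of PDFs. As you can imagine, some PDFs are plain text and easy to parse. Others, however, contain custom fonts, images, graphics, non-linear layouts, and so on.

We're working on a way to make all types of PDFs work properly, so this may take some time.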

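Overgrow · February 14, 2025, 12:43

Very well said. I think DeepSeek is now changing this landscape a bit. Running the smaller DeepSeek models locally with ollama can now provide high-quality inference and address these concerns.

Sorry to bother you, @Saif, but could I get your help in this related topic: How to properly debug AI Personas? Thanks!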

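Yenwod · February 14, 2025, 14:07

Thank you for such an amazing enhancement to an already great plugin.

The PR states:

RAG Digestion Job: The DigestRagUpload job now handles PDF and image uploads. It uses PdfToImages and ImageToText to extract text and create document fragments.

When does this job actually run? Do I need to start it myself?

I just uploaded a few txt files and one PDF. The txt files were indexed immediately, but the PDF still shows "ready to be indexed".

Thanks.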

Yenwod · February 14, 2025, 17:35

The job is running but hits an error:

Jobs::HandledExceptionWrapper: Wrapped NameError: undefined local variable or method `temp_dir' for an instance of DiscourseAi::Utils::PdfToImages

I self-host. Maybe this is something I can dig into?

Saif · February 14, 2025, 17:41

I'd recommend not using this feature for now, since it hasn't been officially released yet. You may run into problems if you use it.

Yenwod · February 14, 2025, 17:41

I found the problem in PdfToImages:

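sam · February 14, 2025, 23:52

Confirmed. Give me a few days; I also want to try direct text extraction, which is something we could enable by default.

Then the "rich" LLM-based extraction can sit behind a flag.

The trouble with many PDFs is that they are very large and can be very demanding on server resources. Also, something like tesseract can be a bit hard to install, though it can improve quality.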
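The split described above (cheap direct extraction by default, "rich" extraction only behind a flag) could be sketched roughly as follows in Python. This is not the plugin's code: the function name, the `rich_enabled` flag, and the `min_chars` threshold are hypothetical, and the two extractors are passed in as plain callables so the sketch does not depend on any particular PDF or OCR library.

```python
from typing import Callable


def extract_pdf_text(
    pdf_path: str,
    direct_extract: Callable[[str], str],
    rich_extract: Callable[[str], str],
    rich_enabled: bool = False,  # hypothetical flag gating the expensive path
    min_chars: int = 200,        # hypothetical "good enough" threshold
) -> str:
    """Prefer cheap direct extraction; fall back to rich extraction if allowed."""
    text = direct_extract(pdf_path).strip()
    # Direct extraction is good enough, or the rich path is switched off.
    if len(text) >= min_chars or not rich_enabled:
        return text
    # Little or no embedded text (e.g. a scanned PDF): try the rich extractor,
    # but keep the direct result if the rich one returns nothing.
    rich_text = rich_extract(pdf_path).strip()
    return rich_text or text
```

With this shape, a text-based PDF never touches OCR or an LLM, so the server cost mentioned above is only paid for documents that actually need it.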

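Yenwod · February 15, 2025, 00:44

@sam, I self-host and am wrestling with tesseract right now. It installed fine, but it throws an error that apparently isn't enough to make the job fail:

Error during OCR processing: /var/www/discourse/lib/discourse.rb:139:in `exec': Failed to OCR image with Tesseract
Estimating resolution as 337

Even with this error, the PDF still shows as indexed in the Persona.

I'm not sure how this affects RAG. I'll dig into it over the weekend.

Thanks for replying so quickly.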

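sam · February 15, 2025, 03:16

[quote="Chris, post:19, topic:335804, username:Yenwod"]
I'm not sure how this affects RAG.
[/quote]

We have an eval (and I want to add more), but basically, models vary a lot in image-to-text quality, especially without grounding.

The good news, though, is that with PDFs we can do lossless text extraction and then rely on the LLM to improve it only when we need to gold plate. Something should be out next week.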
