Will RAG support PDF files in the future?

silvacarl · September 30, 2024, 5:35 PM

First of all, your AI is awesome!

Second, if we post PDF, Word, or PowerPoint files to the forum, will it also read them and chunk them into vectors for RAG?

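sam · October 1, 2024, 5:38 AM

Unfortunately, there is no PDF support yet. It is a feature we are considering. The Persona and Tool RAG implementations support TXT files, so if you can convert your source material to text files, you can use it with a Persona.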

silvacarl · October 7, 2024, 8:39 PM

Yes, that's what we did. We converted the attachments to text and associated them with each topic.

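Saif · October 8, 2024, 2:54 PM

We have received this feedback a few times, and we are looking at expanding support for more file extensions in the future across the AI bot Persona and Tool RAG implementations.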

silvacarl · October 8, 2024, 6:43 PM

As a workaround for now, we convert PowerPoint, Word, or PDF files to text and attach the result to the topic it belongs to.

MachineScholar · November 12, 2024, 4:04 PM

PDF support would be a real game changer for many communities! Since PDF is the universal standard for documents, we often have to reformat files to .txt for RAG, which is certainly time-consuming 😵‍💫

Saif · November 12, 2024, 7:26 PM

We are currently finishing up work on Embeddings, and once that is done, we plan to add PDF support next.

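satonotdead · November 12, 2024, 10:27 PM

Wow, that's great. Kudos to the team for always taking the community's needs into account!

What about JSON files? I have a lot of Discord chat exports that I need to query from within the AI, and I don't want to lose this information.

I was thinking about fine-tuning a model, but I think adding the files to Discourse would be better and simpler for everyone with a similar use case.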

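sam · November 13, 2024, 12:11 AM

JSON is just text, so we already support it.

It is an inefficient representation for LLMs and wastes tokens because the format contains a lot of duplication, but overall it works. To improve RAG performance, I recommend running a script to reformat it.

JSON can be very deeply nested, and the ideal text representation depends heavily on the domain, so doing this automatically is very hard.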
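
As a rough illustration of the kind of reformatting script suggested above, here is a minimal Python sketch that turns a chat export into one compact line per message. The schema is an assumption made for the example (a top-level `messages` list whose items carry `timestamp`, `author.name`, and `content`); a real Discord export may be shaped differently, and the function name `flatten_chat_export` is hypothetical.

```python
import json


def flatten_chat_export(export: dict) -> str:
    """Render a chat export as one compact text line per message.

    Assumes a hypothetical schema: {"messages": [{"timestamp", "author": {"name"}, "content"}]}.
    """
    lines = []
    for msg in export.get("messages", []):
        content = (msg.get("content") or "").strip()
        if not content:
            continue  # skip messages with no text (e.g. attachment-only posts)
        author = (msg.get("author") or {}).get("name", "unknown")
        timestamp = msg.get("timestamp", "")
        # Collapse internal whitespace so each message stays on a single line.
        content = " ".join(content.split())
        lines.append(f"[{timestamp}] {author}: {content}")
    return "\n".join(lines)


# Small made-up sample in the assumed schema.
sample = {
    "messages": [
        {"id": "1", "timestamp": "2024-11-12T22:27:00Z",
         "author": {"id": "42", "name": "alice"},
         "content": "Does RAG support JSON?"},
        {"id": "2", "timestamp": "2024-11-12T22:30:00Z",
         "author": {"id": "43", "name": "bob"},
         "content": "   "},
        {"id": "3", "timestamp": "2024-11-13T00:11:00Z",
         "author": {"id": "43", "name": "bob"},
         "content": "Yes,\nit is just text."},
    ]
}

if __name__ == "__main__":
    text = flatten_chat_export(sample)
    print(text)
    # Compare sizes to see how much structural overhead was dropped.
    print(len(json.dumps(sample)), "->", len(text), "characters")
```

The resulting text can be saved as a .txt file and uploaded like any other RAG source. Which fields to keep (channel names, reply threading, reactions) is exactly the domain-specific choice that makes a generic converter hard.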

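satonotdead · November 15, 2024, 9:45 PM

Thanks, Sam. May I ask for your suggestions on balancing performance and cost when adding around 150 MB of JSON to PDF?

This is our first attempt at RAG (Retrieval-Augmented Generation) on our own data, and we will start learning the process soon.

Any insights from the community would also be appreciated.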

MachineScholar · February 14, 2025, 10:19 AM

This commit looks really great:

github.com/discourse/discourse-ai

FEATURE: PDF support for rag pipeline (#1118)

committed 01:15AM - 14 Feb 25 UTC

SamSaffron

+1329 -141

This PR introduces several enhancements and refactorings to the AI Persona and RAG (Retrieval-Augmented Generation) functionalities within the discourse-ai plugin. Here's a breakdown of the changes:

**1. LLM Model Association for RAG and Personas:**

- **New Database Columns:** Adds `rag_llm_model_id` to both `ai_personas` and `ai_tools` tables. This allows specifying a dedicated LLM for RAG indexing, separate from the persona's primary LLM. Adds `default_llm_id` and `question_consolidator_llm_id` to `ai_personas`.
- **Migration:** Includes a migration (`20250210032345_migrate_persona_to_llm_model_id.rb`) to populate the new `default_llm_id` and `question_consolidator_llm_id` columns in `ai_personas` based on the existing `default_llm` and `question_consolidator_llm` string columns, and a post migration to remove the latter.
- **Model Changes:** The `AiPersona` and `AiTool` models now `belong_to` an `LlmModel` via `rag_llm_model_id`. The `LlmModel.proxy` method now accepts an `LlmModel` instance instead of just an identifier. `AiPersona` now has `default_llm_id` and `question_consolidator_llm_id` attributes.
- **UI Updates:** The AI Persona and AI Tool editors in the admin panel now allow selecting an LLM for RAG indexing (if PDF/image support is enabled). The RAG options component displays an LLM selector.
- **Serialization:** The serializers (`AiCustomToolSerializer`, `AiCustomToolListSerializer`, `LocalizedAiPersonaSerializer`) have been updated to include the new `rag_llm_model_id`, `default_llm_id` and `question_consolidator_llm_id` attributes.

**2. PDF and Image Support for RAG:**

- **Site Setting:** Introduces a new hidden site setting, `ai_rag_pdf_images_enabled`, to control whether PDF and image files can be indexed for RAG. This defaults to `false`.
- **File Upload Validation:** The `RagDocumentFragmentsController` now checks the `ai_rag_pdf_images_enabled` setting and allows PDF, PNG, JPG, and JPEG files if enabled. Error handling is included for cases where PDF/image indexing is attempted with the setting disabled.
- **PDF Processing:** Adds a new utility class, `DiscourseAi::Utils::PdfToImages`, which uses ImageMagick (`magick`) to convert PDF pages into individual PNG images. A maximum PDF size and conversion timeout are enforced.
- **Image Processing:** A new utility class, `DiscourseAi::Utils::ImageToText`, is included to handle OCR for the images and PDFs.
- **RAG Digestion Job:** The `DigestRagUpload` job now handles PDF and image uploads. It uses `PdfToImages` and `ImageToText` to extract text and create document fragments.
- **UI Updates:** The RAG uploader component now accepts PDF and image file types if `ai_rag_pdf_images_enabled` is true. The UI text is adjusted to indicate supported file types.

**3. Refactoring and Improvements:**

- **LLM Enumeration:** The `DiscourseAi::Configuration::LlmEnumerator` now provides a `values_for_serialization` method, which returns a simplified array of LLM data (id, name, vision_enabled) suitable for use in serializers. This avoids exposing unnecessary details to the frontend.
- **AI Helper:** The `AiHelper::Assistant` now takes optional `helper_llm` and `image_caption_llm` parameters in its constructor, allowing for greater flexibility.
- **Bot and Persona Updates:** Several updates were made across the codebase, changing the string based association to a LLM to the new model based.
- **Audit Logs:** The `DiscourseAi::Completions::Endpoints::Base` now formats raw request payloads as pretty JSON for easier auditing.
- **Eval Script:** An evaluation script is included.

**4. Testing:**

- The PR introduces a new eval system for LLMs, this allows us to test how functionality works across various LLM providers. This lives in `/evals`
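
To make the PDF processing step described in the commit more concrete, here is a minimal Python sketch of the same idea: check the file size, then shell out to ImageMagick with a timeout to render each page as a PNG. This is not the plugin's Ruby implementation. The function names, the 150 DPI density, and the size and timeout defaults are assumptions chosen for illustration, and rendering PDFs with `magick` typically also requires Ghostscript to be installed.

```python
import os
import subprocess


def build_magick_command(pdf_path, out_dir, density=150):
    """Build an ImageMagick command that writes one PNG per PDF page."""
    # %04d is expanded by ImageMagick to the zero-padded page index.
    pattern = os.path.join(out_dir, "page-%04d.png")
    return ["magick", "-density", str(density), pdf_path, pattern]


def pdf_to_images(pdf_path, out_dir, max_bytes=100 * 1024 * 1024, timeout=120):
    """Convert a PDF into page images, enforcing a size limit and a timeout.

    max_bytes and timeout are hypothetical defaults, not the plugin's values.
    """
    if os.path.getsize(pdf_path) > max_bytes:
        raise ValueError("PDF exceeds the maximum allowed size")
    os.makedirs(out_dir, exist_ok=True)
    # check=True raises CalledProcessError on a non-zero exit status;
    # timeout raises subprocess.TimeoutExpired if conversion runs too long.
    subprocess.run(build_magick_command(pdf_path, out_dir), check=True, timeout=timeout)
    return sorted(
        os.path.join(out_dir, name)
        for name in os.listdir(out_dir)
        if name.endswith(".png")
    )
```

Each returned image path could then be passed to an OCR or vision-LLM step to produce the text that gets chunked into document fragments.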

Is there any timeline for a full release of this feature? It looks like it is currently a hidden site setting.

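Saif · February 14, 2025, 11:22 AM

One of the challenges of the work behind this feature is supporting every kind of PDF. As you can imagine, some PDFs are simple text and easy to parse. Others, however, have custom fonts, images, graphics, non-linear formatting, and so on…

We are trying to find an approach that works for all kinds of PDFs, and that may take time.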

Overgrow · February 14, 2025, 12:43 PM

Very good point. I think DeepSeek is changing that picture a bit. Running a small DeepSeek model locally with ollama can now provide quality reasoning and offer a solution to these concerns.

Sorry to bother you, @Saif, but could you help out with a related topic here: How to properly debug AI Personas? Thank you!

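Yenwod · February 14, 2025, 2:07 PM

Thank you for an even better enhancement to an already great plugin.

The PR notes the following:

RAG Digestion Job: The `DigestRagUpload` job now handles PDF and image uploads. It uses `PdfToImages` and `ImageToText` to extract text and create document fragments.

When does this job actually run? Is it something I need to start myself?

I just uploaded a txt file and a PDF. The txt file was indexed right away, but the PDF still shows as "ready to be indexed".

Thank you.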

Yenwod · February 14, 2025, 5:35 PM

The job is running, but it hits a bug:

Jobs::HandledExceptionWrapper: Wrapped NameError: undefined local variable or method `temp_dir' for an instance of DiscourseAi::Utils::PdfToImages

I am self-hosting. Is this something I should dig into further?

Saif · February 14, 2025, 5:41 PM

This feature has not been officially released yet, so I recommend holding off on using it. You may run into problems.

Yenwod · February 14, 2025, 5:41 PM

I think I found the problem in PdfToImages.

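sam · February 14, 2025, 11:52 PM

Got it, give me a few days. I also want to try direct text extraction, which we could enable by default.

After that, the "rich" LLM-based extraction can sit behind a flag.

The difficulty with many PDFs is that they are huge and can put a heavy load on server resources. On top of that, tools like Tesseract, which can improve quality, can be a bit hard to install.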

Yenwod · February 15, 2025, 12:44 AM

@sam, I am self-hosting and currently wrestling with Tesseract. It installed without trouble, but I am getting an error that does not seem serious enough to fail the job.

Error during OCR processing: /var/www/discourse/lib/discourse.rb:139:in `exec': Failed to OCR image using Tesseract
Estimating resolution as 337

Even with that error, the PDF shows as indexed in the Persona.

I don't know how this affects RAG. I will dig deeper over the weekend.

Thanks for the quick response.

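sam · February 15, 2025, 3:16 AM

We have evals (I would like to add more), and basically the quality of image-to-text conversion varies a lot between models depending on how grounded they are.

The good news is that with PDFs we can do lossless text extraction and then lean on an LLM to improve the result where needed. We should have something next week.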
