PDF Support in Discourse AI (RAG)

:bookmark: This guide explains how to set up and use PDF processing in the discourse-ai plugin, covering both basic text extraction and enhanced LLM-based processing.

:person_raising_hand: Required user level: Administrator

Summary

The discourse-ai plugin supports PDF processing for RAG (Retrieval-Augmented Generation) in two distinct modes:

  1. Basic text extraction
  2. Enhanced processing with LLM analysis

Basic text extraction

This mode provides the core PDF processing capabilities:

  • Extracts text content using the pdf-reader gem
  • Supports files up to 100 MB
  • Works out of the box once the plugin is installed
  • Processes text content only (visual elements are ignored)
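
As a rough illustration of what text-only extraction looks like, here is a minimal sketch. The helper name extract_pdf_text is hypothetical and this is not the plugin's actual code; it only assumes an object that exposes pages with a text method, which PDF::Reader from the pdf-reader gem does.

```ruby
# Sketch only: join the text of every page of a PDF-like reader object.
# `reader` is anything that responds to #pages, where each page responds to #text.
# PDF::Reader from the pdf-reader gem satisfies this interface.
# extract_pdf_text is a hypothetical helper name, not part of discourse-ai.
def extract_pdf_text(reader)
  reader.pages
        .map { |page| page.text.to_s.strip } # per-page text; nil becomes ""
        .reject(&:empty?)                    # drop pages with no extractable text
        .join("\n\n")                        # separate pages with a blank line
end

# Usage with the real gem (requires `gem install pdf-reader`):
#   require "pdf/reader"
#   text = extract_pdf_text(PDF::Reader.new("document.pdf"))
```

Pages that contain only images come back empty, which is why this mode cannot see charts or scanned pages.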

Enhanced processing with LLM improvements

:information_source: This mode requires special configuration and provides more advanced capabilities.

Requirements:

  • An Enterprise plan subscription or a self-hosted Discourse
  • ImageMagick with Ghostscript support installed in the container
  • The ai_rag_images_enabled site setting enabled (hidden; it must be set via the Rails console)
  • A RAG LLM model configured on the AI agent or tool
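
A minimal sketch of enabling the hidden setting from the Rails console. On a standard Docker install the console is usually reached with ./launcher enter app followed by rails c; that path is an assumption about your setup. The wrapper method enable_rag_images is hypothetical and exists only so the snippet is harmless when run outside Discourse; in the console itself the single assignment is enough.

```ruby
# Sketch: enable the hidden ai_rag_images_enabled site setting.
# SiteSetting only exists inside a Discourse Rails process, so the guard
# makes this snippet a no-op anywhere else.
# enable_rag_images is a hypothetical helper name used for illustration.
def enable_rag_images
  return :not_in_discourse unless defined?(SiteSetting)

  SiteSetting.ai_rag_images_enabled = true
  SiteSetting.ai_rag_images_enabled ? :enabled : :failed
end

puts enable_rag_images
```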

Capabilities:

  • Interprets images, charts, and diagrams
  • Provides context from visual elements
  • Processes PDFs page by page
  • Keeps the 100 MB file size limit
  • Allows image files (png, jpg, jpeg) to be uploaded for RAG indexing via LLM-based text extraction

Implementation details

Processing specifications

  • Page rendering resolution: 300 DPI
  • Image conversion timeout per page: 30 seconds
  • Automatic cleanup of temporary files
  • Full integration with RAG document vector embeddings
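
To make the 300 DPI and 30-second figures concrete, the sketch below renders a single PDF page with ImageMagick under a time limit. It is an illustration under assumptions, not the plugin's implementation: the helper names are hypothetical, the magick binary assumes ImageMagick 7 (on ImageMagick 6 the binary is convert), and the exact command the plugin runs is not shown in this guide.

```ruby
require "timeout"

PAGE_DPI = 300             # rendering resolution listed above
PAGE_TIMEOUT_SECONDS = 30  # per-page conversion timeout listed above

# Build an ImageMagick command that renders one PDF page (0-based index).
# "file.pdf[N]" selects page N and -density sets the DPI; ImageMagick relies
# on Ghostscript to rasterise PDFs. Hypothetical helper, for illustration.
def page_render_command(pdf_path, page_index, output_path, dpi: PAGE_DPI)
  ["magick", "-density", dpi.to_s, "#{pdf_path}[#{page_index}]", output_path]
end

# Run the command with a hard time limit.
# Returns true on success, false on failure, timeout, or missing binary.
def render_page(pdf_path, page_index, output_path)
  command = page_render_command(pdf_path, page_index, output_path)
  pid = Process.spawn(*command, out: File::NULL, err: File::NULL)
  Timeout.timeout(PAGE_TIMEOUT_SECONDS) { Process.wait(pid) }
  $?.success?
rescue Timeout::Error
  Process.kill("KILL", pid) # stop the stuck conversion
  Process.wait(pid)         # reap the child process
  false
rescue Errno::ENOENT
  false # the ImageMagick binary is not installed
end
```

Calling render_page("document.pdf", 0, "page-0.png") would render the first page; a caller would loop over page indexes and delete the generated images afterwards.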

Processing workflow

  1. PDF upload and validation
  2. Content extraction (basic or enhanced mode)
  3. Splitting the text into chunks with configurable overlap
  4. Embedding the chunks and storing the vectors
  5. Progress tracking via MessageBus
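
Step 3 can be sketched as follows. This is a simplified illustration, not the plugin's chunker: it counts whitespace-separated words as a stand-in for tokens, and the method name and default sizes are assumptions chosen for the example.

```ruby
# Split text into word-based chunks where each chunk repeats the last
# `overlap` words of the previous one. Words stand in for tokens here;
# chunk_with_overlap and its defaults are hypothetical.
def chunk_with_overlap(text, chunk_size: 200, overlap: 20)
  raise ArgumentError, "overlap must not be negative" if overlap.negative?
  raise ArgumentError, "overlap must be smaller than chunk_size" unless overlap < chunk_size

  words = text.to_s.split # nil or empty input yields no words
  return [] if words.empty?

  step = chunk_size - overlap # how far the window advances each time
  chunks = []
  start = 0
  loop do
    chunks << words[start, chunk_size].join(" ")
    break if start + chunk_size >= words.length # last window reached the end
    start += step
  end
  chunks
end
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.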

Limitations

:warning: Keep the following limitations in mind when setting up PDF processing:

  • File size limits:
    • 100 MB for existing PDF processing
    • 20 MB for new uploads via the admin interface
  • Enhanced mode requires additional system resources
  • Complex PDF layouts may not be interpreted perfectly
  • Enhanced processing significantly increases processing time
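
The two size limits can be expressed as a small pre-upload check. This is only a sketch: the constant and method names are hypothetical, and treating 1 MB as 1024 * 1024 bytes is an assumption, since the guide does not say how the limits are measured.

```ruby
MEGABYTE = 1024 * 1024 # assumption: binary megabytes

# Limits from the list above, in MB. The keys are hypothetical labels.
SIZE_LIMITS_MB = { existing: 100, admin_upload: 20 }.freeze

# Returns true when a file of size_bytes fits the limit for the given source.
def within_size_limit?(size_bytes, source:)
  limit_mb = SIZE_LIMITS_MB.fetch(source) # raises KeyError on unknown source
  size_bytes <= limit_mb * MEGABYTE
end
```

For example, within_size_limit?(File.size("document.pdf"), source: :admin_upload) tells you in advance whether a file is likely to be rejected by the 20 MB admin upload limit.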


This is really amazing news. Thanks team! Can’t wait for the enhanced processing to be finished. That’s gonna be critical for feeding LLMs research papers.

Also, is there any plan to allow doing RAG “chat-with-your-PDFs” by uploading PDFs in an AI bot PM or in a topic/post and mentioning the bot?


@sam Can you provide a simple video explaining this great option? What you described is not clear enough to implement.


Where can I find this setting?

It’s a hidden setting, so you need to use the console, but you also need to configure the container. I recommend waiting a few more weeks.


Thank you, I appreciate your fantastic work

On my website (Arabic Forum), I conducted a test in Arabic by adding legislation in the first post (“topic”) and then asked questions using AI. However, the answers were inaccurate, and I believe this is because it is not Context Ragging.

Sorry, but this is not how it works. You need to define a persona or tool and then add the upload there.

There has been some discussion around supporting “upload and ask” here: Upload and discuss pdfs in composer but it is not supported yet.


First of all, thank you very much for your great work. I really like it.

After playing around with the settings and changing the AI Model to Gemini-Flash-2.0, it worked great for me. Here’s the situation I have:

We are a community of auditors, accountants, and tax consultants, and we needed a tool to share relevant laws and start discussions about them. These discussions should be very useful for visitors, as we are professionals in our field. We want the AI model to check and analyze legislation and answer our questions. The experiment led to the conclusion that we can really discuss the context added in the first post, and if the AI model is smart enough, it will answer our questions with very high-quality output.
Thank you again. I am looking forward to PDF support, as it will make Discourse the best forum software.


The latest Discourse image supports the advanced mode, if anyone feels like testing it.


Does it have to be enabled via console? Don’t see any advanced mode options via the UI.

Furthermore, I am getting an error when trying to upload this pdf. It is 34 MB but I have my max attachment size set to 100 MB (in both admin settings and app.yml). What’s strange is that I have a compressed version which is 16 MB and it uploads just fine. But perhaps the larger PDF is simply too complex for now? There are lots of images, equations, etc.

Yes, you need to run SiteSetting.ai_rag_images_enabled = true in the Rails console to enable it.


My guess is that some nginx configuration in the container needs to change as well, so that it does not reject the upload.


Hi @sam
I’m currently having trouble uploading and indexing PDFs. I get this error: Job exception: undefined method `length' for nil.

I was wondering whether the error is related to the settings discussed above.
The interface gets stuck on indexing at 0% and does not move.
The exception details are below:

/var/www/discourse/plugins/discourse-ai/app/jobs/regular/digest_rag_upload.rb:81:in `chunk_document'
/var/www/discourse/plugins/discourse-ai/app/jobs/regular/digest_rag_upload.rb:40:in `block in execute'
activerecord-7.2.2.1/lib/active_record/connection_adapters/abstract/transaction.rb:616:in `block in within_new_transaction'
activesupport-7.2.2.1/lib/active_support/concurrency/null_lock.rb:9:in `synchronize'
activerecord-7.2.2.1/lib/active_record/connection_adapters/abstract/transaction.rb:613:in `within_new_transaction'
activerecord-7.2.2.1/lib/active_record/connection_adapters/abstract/database_statements.rb:361:in `transaction'
activerecord-7.2.2.1/lib/active_record/transactions.rb:234:in `block in transaction'
activerecord-7.2.2.1/lib/active_record/connection_adapters/abstract/connection_pool.rb:415:in `with_connection'
activerecord-7.2.2.1/lib/active_record/connection_handling.rb:296:in `with_connection'
activerecord-7.2.2.1/lib/active_record/transactions.rb:233:in `transaction'
/var/www/discourse/plugins/discourse-ai/app/jobs/regular/digest_rag_upload.rb:39:in `execute'
/var/www/discourse/app/jobs/base.rb:316:in `block (2 levels) in perform'
rails_multisite-6.1.0/lib/rails_multisite/connection_management/null_instance.rb:49:in `with_connection'
rails_multisite-6.1.0/lib/rails_multisite/connection_management.rb:21:in `with_connection'
/var/www/discourse/app/jobs/base.rb:303:in `block in perform'
/var/www/discourse/app/jobs/base.rb:299:in `each'
/var/www/discourse/app/jobs/base.rb:299:in `perform'
sidekiq-7.3.9/lib/sidekiq/processor.rb:220:in `execute_job'
sidekiq-7.3.9/lib/sidekiq/processor.rb:185:in `block (4 levels) in process'
sidekiq-7.3.9/lib/sidekiq/middleware/chain.rb:180:in `traverse'
sidekiq-7.3.9/lib/sidekiq/middleware/chain.rb:183:in `block in traverse'
/var/www/discourse/lib/sidekiq/pausable.rb:132:in `call'
sidekiq-7.3.9/lib/sidekiq/middleware/chain.rb:182:in `traverse'
sidekiq-7.3.9/lib/sidekiq/middleware/chain.rb:183:in `block in traverse'
sidekiq-7.3.9/lib/sidekiq/job/interrupt_handler.rb:9:in `call'
sidekiq-7.3.9/lib/sidekiq/middleware/chain.rb:182:in `traverse'
sidekiq-7.3.9/lib/sidekiq/middleware/chain.rb:183:in `block in traverse'
sidekiq-7.3.9/lib/sidekiq/metrics/tracking.rb:26:in `track'
sidekiq-7.3.9/lib/sidekiq/metrics/tracking.rb:134:in `call'
sidekiq-7.3.9/lib/sidekiq/middleware/chain.rb:182:in `traverse'
sidekiq-7.3.9/lib/sidekiq/middleware/chain.rb:173:in `invoke'
sidekiq-7.3.9/lib/sidekiq/processor.rb:184:in `block (3 levels) in process'
sidekiq-7.3.9/lib/sidekiq/processor.rb:145:in `block (6 levels) in dispatch'
sidekiq-7.3.9/lib/sidekiq/job_retry.rb:118:in `local'
sidekiq-7.3.9/lib/sidekiq/processor.rb:144:in `block (5 levels) in dispatch'
sidekiq-7.3.9/lib/sidekiq/config.rb:39:in `block in <class:Config>'
sidekiq-7.3.9/lib/sidekiq/processor.rb:139:in `block (4 levels) in dispatch'
sidekiq-7.3.9/lib/sidekiq/processor.rb:281:in `stats'
sidekiq-7.3.9/lib/sidekiq/processor.rb:134:in `block (3 levels) in dispatch'
sidekiq-7.3.9/lib/sidekiq/job_logger.rb:15:in `call'
sidekiq-7.3.9/lib/sidekiq/processor.rb:133:in `block (2 levels) in dispatch'
sidekiq-7.3.9/lib/sidekiq/job_retry.rb:85:in `global'
sidekiq-7.3.9/lib/sidekiq/processor.rb:132:in `block in dispatch'
sidekiq-7.3.9/lib/sidekiq/job_logger.rb:40:in `prepare'
sidekiq-7.3.9/lib/sidekiq/processor.rb:131:in `dispatch'
sidekiq-7.3.9/lib/sidekiq/processor.rb:183:in `block (2 levels) in process'
sidekiq-7.3.9/lib/sidekiq/processor.rb:182:in `handle_interrupt'
sidekiq-7.3.9/lib/sidekiq/processor.rb:182:in `block in process'
sidekiq-7.3.9/lib/sidekiq/processor.rb:181:in `handle_interrupt'
sidekiq-7.3.9/lib/sidekiq/processor.rb:181:in `process'
sidekiq-7.3.9/lib/sidekiq/processor.rb:86:in `process_one'
sidekiq-7.3.9/lib/sidekiq/processor.rb:76:in `run'
sidekiq-7.3.9/lib/sidekiq/component.rb:10:in `watchdog'
sidekiq-7.3.9/lib/sidekiq/component.rb:19:in `block in safe_thread'

Thanks for this amazing update.
I have just one question: is the 100 MB limit per AI persona, or shared across all personas?

I’m new to Discourse AI but an old hand on Discourses generally.

Really keen to try this out for a specific use case in demo form at this stage.

I’ve enabled the hidden site setting.

Nothing in Sidekiq that I can see. How can I see if it is working at all?

I’m aware this is a pre release feature and not ready for prime time yet, however it would be great to be able to experience and try out.

Really keen for any hints, tips, screenshots, or recipes from people that are trying this out.

I get this error when asking the bot to summarize the contents of some PDFs on my site. I’ve not enabled enhanced processing, and am using GPT 4.1. Any ideas what I am doing wrong?

Sorry, it looks like our system encountered an unexpected issue while trying to reply.

Error details

{
  "error": {
    "message": "An assistant message with 'tool_calls' must be followed by tool messages responding to each 'tool_call_id'. The following tool_call_ids did not have response messages: call_nrDCba5mt83oavbXfPq2BtEV",
    "type": "invalid_request_error",
    "param": "messages.[2].role",
    "code": null
  }
}

May I inquire into the current status of PDF support? :face_with_peeking_eye:


When you configure upload sizes in app.yml it is site-wide, so it applies to each persona.


Are there any updates on this matter? I’m attaching a PDF when initiating a conversation with the AI, but it still doesn’t seem to recognize it. I am currently utilizing GPT. Should I perhaps consider employing a different model specifically designed for PDF processing?
