This guide explains how to set up and use the PDF processing capabilities of the discourse-ai plugin, covering both basic text extraction and enhanced LLM-based processing.
Required user level: Administrator
Summary
The discourse-ai plugin supports PDF processing for RAG (Retrieval-Augmented Generation) in two distinct modes:
Basic text extraction
Enhanced processing with LLM analysis
Basic text extraction
This mode provides the core PDF processing capabilities:
Extracts text content using the pdf-reader gem
Supports files up to 100 MB
Works out of the box once the plugin is installed
Processes text content only (visual elements are ignored)
Enhanced processing with LLM
This mode requires specific configuration and provides more advanced capabilities.
Requirements:
An Enterprise plan subscription or a self-hosted Discourse
ImageMagick with Ghostscript support installed in the container
The ai_rag_images_enabled site setting turned on (hidden, must be set via the Rails console)
A RAG LLM model configured on the AI agent or tool
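As a rough sketch of the third requirement: hidden site settings are normally changed from the Rails console inside the container. The shell commands in the comments assume a standard Docker-based install in /var/discourse with a container named app; those paths and names are assumptions and may differ on your setup.

```ruby
# From the host (assumed standard Docker install; adjust paths and names as needed):
#   cd /var/discourse
#   ./launcher enter app
#   rails c

# Inside the Rails console: turn on the hidden setting named in the requirements.
SiteSetting.ai_rag_images_enabled = true

# Read the value back to confirm the change was stored.
SiteSetting.ai_rag_images_enabled
```

Hidden settings do not appear in the admin UI, so the console is the place to check the current value as well.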
Capabilities:
Interprets images, charts, and diagrams
Provides context from visual elements
Processes PDFs page by page
Keeps the 100 MB file size limit
Allows image files (png, jpg, jpeg) to be uploaded for RAG indexing via LLM-based text extraction
Implementation details
Processing specifications
Page rendering resolution: 300 DPI
Image conversion timeout per page: 30 seconds
Automatic cleanup of temporary files
Full integration with RAG document embeddings
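To make the two numeric specifications concrete, here is a minimal sketch of rendering a single PDF page at 300 DPI under a 30-second limit. It is not the plugin's actual code: the helper names page_convert_command and with_page_timeout are hypothetical, and it assumes the ImageMagick 7 binary is called magick (ImageMagick 6 installs use convert instead).

```ruby
require "timeout"

PAGE_DPI = 300             # rendering resolution from the specifications
PAGE_TIMEOUT_SECONDS = 30  # per-page conversion timeout from the specifications

# Hypothetical helper: builds an ImageMagick command that renders one PDF page
# to an image. ImageMagick selects a page with the zero-based "file.pdf[N]" syntax.
def page_convert_command(pdf_path, page_index, out_path, dpi: PAGE_DPI)
  ["magick", "-density", dpi.to_s, "#{pdf_path}[#{page_index}]", out_path]
end

# Hypothetical helper: runs the block and returns :timeout if it takes too long.
def with_page_timeout(seconds = PAGE_TIMEOUT_SECONDS)
  Timeout.timeout(seconds) { yield }
rescue Timeout::Error
  :timeout
end

# Example usage (requires ImageMagick and Ghostscript to be installed):
#   result = with_page_timeout { system(*page_convert_command("doc.pdf", 0, "page-0.png")) }
#   warn "page 0 timed out" if result == :timeout
```

Note that Timeout only stops waiting in Ruby; a real implementation would also need to terminate the child process and remove any partial output file.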
Processing workflow
PDF upload and validation
Content extraction (basic or enhanced mode)
Text chunking with configurable overlap
Embedding generation and storage for the chunks
Progress tracking via MessageBus
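The chunking step can be illustrated with a small sketch. This is an assumption-laden example rather than the plugin's implementation: chunk_with_overlap is a hypothetical name, and it counts characters instead of tokens to keep the example short.

```ruby
# Hypothetical helper: splits text into chunks of at most chunk_size characters.
# Each chunk starts with the last `overlap` characters of the previous chunk.
def chunk_with_overlap(text, chunk_size:, overlap:)
  raise ArgumentError, "chunk_size must be positive" unless chunk_size.positive?
  raise ArgumentError, "overlap must be in 0...chunk_size" unless (0...chunk_size).cover?(overlap)
  return [] if text.nil? || text.empty? # nothing was extracted, so nothing to chunk

  step = chunk_size - overlap # how far the window advances each iteration
  chunks = []
  start = 0
  while start < text.length
    chunks << text[start, chunk_size]
    break if start + chunk_size >= text.length # the last window reached the end
    start += step
  end
  chunks
end

# chunk_with_overlap("abcdefghij", chunk_size: 4, overlap: 1)
# => ["abcd", "defg", "ghij"]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing and embedding some text twice.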
Limitations
Keep the following limitations in mind when setting up PDF processing:
File size limits:
100 MB for existing PDF processing
20 MB for new uploads through the admin interface
Enhanced mode requires additional system resources
Complex PDF layouts may not be interpreted perfectly
Enhanced processing significantly increases processing time
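The two size limits can be expressed as a simple pre-check. This is only a sketch of the limits as listed above: the constant and method names are hypothetical, and it assumes 1 MB means 1024 * 1024 bytes.

```ruby
MEGABYTE = 1024 * 1024 # assumption: binary megabytes

EXISTING_PDF_LIMIT_BYTES = 100 * MEGABYTE # limit for existing PDF processing
ADMIN_UPLOAD_LIMIT_BYTES = 20 * MEGABYTE  # limit for new uploads via the admin interface

# Hypothetical helper: true if a file of size_bytes fits the applicable limit.
def within_size_limit?(size_bytes, new_admin_upload:)
  limit = new_admin_upload ? ADMIN_UPLOAD_LIMIT_BYTES : EXISTING_PDF_LIMIT_BYTES
  size_bytes <= limit
end

# within_size_limit?(34 * MEGABYTE, new_admin_upload: true)  # => false
# within_size_limit?(16 * MEGABYTE, new_admin_upload: true)  # => true
```

Under these assumptions, a file between 20 MB and 100 MB passes the first limit but is rejected as a new upload through the admin interface.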
This is really amazing news. Thanks team! Can’t wait for the enhanced processing to be finished. That’s gonna be critical for feeding LLMs research papers.
Also, is there any plan to allow doing RAG “chat-with-your-PDFs” by uploading PDFs in an AI bot PM or in a topic/post and mentioning the bot?
On my website (Arabic Forum), I conducted a test in Arabic by adding legislation in the first post (“topic”) and then asked questions using AI. However, the answers were inaccurate, and I believe this is because it is not Context Ragging.
First of all, thank you very much for your great work. I really like it.
After playing around with the settings and changing the AI Model to Gemini-Flash-2.0, it worked great for me. Here’s the situation I have:
We are a community of auditors, accountants, and tax consultants, and we needed a tool to share relevant laws and start discussions about them. These discussions should be very useful for visitors, as we are professionals in our field. We want the AI model to check and analyze legislation and answer our questions. The experiment showed that we really can discuss the context added in the first post, and if the AI model is smart enough, it answers our questions with very high-quality output.
Thank you again. I'm looking forward to PDF support, as it will make Discourse the best forum software.
Does it have to be enabled via console? Don’t see any advanced mode options via the UI.
Furthermore, I am getting an error when trying to upload this pdf. It is 34 MB but I have my max attachment size set to 100 MB (in both admin settings and app.yml). What’s strange is that I have a compressed version which is 16 MB and it uploads just fine. But perhaps the larger PDF is simply too complex for now? There are lots of images, equations, etc.
Hi @sam
I'm currently having trouble uploading and indexing PDFs because of this error: Job exception: undefined method `length' for nil.
I was wondering if the error is related to the settings we discussed above.
The interface gets stuck on indexing at 0% and does not move.
The exception details are below:
/var/www/discourse/plugins/discourse-ai/app/jobs/regular/digest_rag_upload.rb:81:in `chunk_document'
/var/www/discourse/plugins/discourse-ai/app/jobs/regular/digest_rag_upload.rb:40:in `block in execute'
activerecord-7.2.2.1/lib/active_record/connection_adapters/abstract/transaction.rb:616:in `block in within_new_transaction'
activesupport-7.2.2.1/lib/active_support/concurrency/null_lock.rb:9:in `synchronize'
activerecord-7.2.2.1/lib/active_record/connection_adapters/abstract/transaction.rb:613:in `within_new_transaction'
activerecord-7.2.2.1/lib/active_record/connection_adapters/abstract/database_statements.rb:361:in `transaction'
activerecord-7.2.2.1/lib/active_record/transactions.rb:234:in `block in transaction'
activerecord-7.2.2.1/lib/active_record/connection_adapters/abstract/connection_pool.rb:415:in `with_connection'
activerecord-7.2.2.1/lib/active_record/connection_handling.rb:296:in `with_connection'
activerecord-7.2.2.1/lib/active_record/transactions.rb:233:in `transaction'
/var/www/discourse/plugins/discourse-ai/app/jobs/regular/digest_rag_upload.rb:39:in `execute'
/var/www/discourse/app/jobs/base.rb:316:in `block (2 levels) in perform'
rails_multisite-6.1.0/lib/rails_multisite/connection_management/null_instance.rb:49:in `with_connection'
rails_multisite-6.1.0/lib/rails_multisite/connection_management.rb:21:in `with_connection'
/var/www/discourse/app/jobs/base.rb:303:in `block in perform'
/var/www/discourse/app/jobs/base.rb:299:in `each'
/var/www/discourse/app/jobs/base.rb:299:in `perform'
sidekiq-7.3.9/lib/sidekiq/processor.rb:220:in `execute_job'
sidekiq-7.3.9/lib/sidekiq/processor.rb:185:in `block (4 levels) in process'
sidekiq-7.3.9/lib/sidekiq/middleware/chain.rb:180:in `traverse'
sidekiq-7.3.9/lib/sidekiq/middleware/chain.rb:183:in `block in traverse'
/var/www/discourse/lib/sidekiq/pausable.rb:132:in `call'
sidekiq-7.3.9/lib/sidekiq/middleware/chain.rb:182:in `traverse'
sidekiq-7.3.9/lib/sidekiq/middleware/chain.rb:183:in `block in traverse'
sidekiq-7.3.9/lib/sidekiq/job/interrupt_handler.rb:9:in `call'
sidekiq-7.3.9/lib/sidekiq/middleware/chain.rb:182:in `traverse'
sidekiq-7.3.9/lib/sidekiq/middleware/chain.rb:183:in `block in traverse'
sidekiq-7.3.9/lib/sidekiq/metrics/tracking.rb:26:in `track'
sidekiq-7.3.9/lib/sidekiq/metrics/tracking.rb:134:in `call'
sidekiq-7.3.9/lib/sidekiq/middleware/chain.rb:182:in `traverse'
sidekiq-7.3.9/lib/sidekiq/middleware/chain.rb:173:in `invoke'
sidekiq-7.3.9/lib/sidekiq/processor.rb:184:in `block (3 levels) in process'
sidekiq-7.3.9/lib/sidekiq/processor.rb:145:in `block (6 levels) in dispatch'
sidekiq-7.3.9/lib/sidekiq/job_retry.rb:118:in `local'
sidekiq-7.3.9/lib/sidekiq/processor.rb:144:in `block (5 levels) in dispatch'
sidekiq-7.3.9/lib/sidekiq/config.rb:39:in `block in <class:Config>'
sidekiq-7.3.9/lib/sidekiq/processor.rb:139:in `block (4 levels) in dispatch'
sidekiq-7.3.9/lib/sidekiq/processor.rb:281:in `stats'
sidekiq-7.3.9/lib/sidekiq/processor.rb:134:in `block (3 levels) in dispatch'
sidekiq-7.3.9/lib/sidekiq/job_logger.rb:15:in `call'
sidekiq-7.3.9/lib/sidekiq/processor.rb:133:in `block (2 levels) in dispatch'
sidekiq-7.3.9/lib/sidekiq/job_retry.rb:85:in `global'
sidekiq-7.3.9/lib/sidekiq/processor.rb:132:in `block in dispatch'
sidekiq-7.3.9/lib/sidekiq/job_logger.rb:40:in `prepare'
sidekiq-7.3.9/lib/sidekiq/processor.rb:131:in `dispatch'
sidekiq-7.3.9/lib/sidekiq/processor.rb:183:in `block (2 levels) in process'
sidekiq-7.3.9/lib/sidekiq/processor.rb:182:in `handle_interrupt'
sidekiq-7.3.9/lib/sidekiq/processor.rb:182:in `block in process'
sidekiq-7.3.9/lib/sidekiq/processor.rb:181:in `handle_interrupt'
sidekiq-7.3.9/lib/sidekiq/processor.rb:181:in `process'
sidekiq-7.3.9/lib/sidekiq/processor.rb:86:in `process_one'
sidekiq-7.3.9/lib/sidekiq/processor.rb:76:in `run'
sidekiq-7.3.9/lib/sidekiq/component.rb:10:in `watchdog'
sidekiq-7.3.9/lib/sidekiq/component.rb:19:in `block in safe_thread'
I get this error when asking the bot to summarize the contents of some PDFs on my site. I've not enabled enhanced processing, and am using GPT 4.1. Any ideas what I am doing wrong?
Sorry, it looks like our system encountered an unexpected issue while trying to reply.
Error details
{
  "error": {
    "message": "An assistant message with 'tool_calls' must be followed by tool messages responding to each 'tool_call_id'. The following tool_call_ids did not have response messages: call_nrDCba5mt83oavbXfPq2BtEV",
    "type": "invalid_request_error",
    "param": "messages.[2].role",
    "code": null
  }
}
Are there any updates on this matter? I’m attaching a PDF when initiating a conversation with the AI, but it still doesn’t seem to recognize it. I am currently utilizing GPT. Should I perhaps consider employing a different model specifically designed for PDF processing?