PDF support in Discourse AI

sam · February 18, 2025, 4:32am

This guide explains how to implement and use PDF processing capabilities within discourse-ai, including both basic text extraction and enhanced processing with LLM assistance.

Required user level: Administrator

Summary

The discourse-ai plugin supports PDF processing for RAG (Retrieval-Augmented Generation) in two distinct modes:

Basic text extraction
Enhanced processing with LLM analysis

Basic text extraction

This mode provides fundamental PDF processing capabilities:

Extracts text content using the pdf-reader gem
Supports files up to 100MB
Works immediately after plugin installation
Processes text-only content (ignores visual elements)
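
As a rough illustration of what text-only extraction looks like, the sketch below joins the per-page text of a pdf-reader document. It is not the plugin's actual code: the helper name extract_pdf_text is hypothetical, and only PDF::Reader.new, #pages and Page#text come from the pdf-reader gem.

```ruby
# Hypothetical helper (not part of discourse-ai): joins the text of every
# page of a PDF::Reader-like object, skipping pages with no extractable text.
def extract_pdf_text(reader)
  reader.pages
        .map { |page| page.text.to_s.strip }
        .reject(&:empty?)
        .join("\n\n")
end

# Usage with the real gem (assumes `gem install pdf-reader`):
#   require "pdf-reader"
#   puts extract_pdf_text(PDF::Reader.new("document.pdf"))
```

Because images and charts carry no extractable text, pages that consist only of visual elements come back empty here, which is the gap the enhanced mode is meant to cover.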

Enhanced processing with LLM analysis

This mode requires specific configuration and provides more advanced capabilities.

Requirements:

Enterprise plan subscription or self-hosted Discourse
ImageMagick with Ghostscript support installed in container
ai_rag_images_enabled site setting enabled

Capabilities:

Interprets images, charts, and diagrams
Provides context from visual elements
Processes PDFs page by page
Maintains the 100MB file size limit

Implementation details

Processing specifications

Page processing resolution: 300 DPI
Maximum processing time: 600 seconds (10 minutes)
Automatic cleanup of temporary files
Full integration with RAG document embeddings
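
To make these numbers concrete, here is a minimal sketch of how a page could be rendered at 300 DPI inside a temporary directory with a 600 second cap. Everything in it is an assumption for illustration: the helper names are hypothetical, the ImageMagick binary is assumed to be available as convert, and the plugin's real implementation may differ.

```ruby
require "tmpdir"
require "timeout"

PAGE_DPI = 300     # page processing resolution from the list above
MAX_SECONDS = 600  # maximum processing time from the list above

# Hypothetical helper: builds the ImageMagick argv that would rasterize one
# page (0-based index) of a PDF. Assumes the binary is named `convert`.
def page_render_command(pdf_path, page_index, out_path)
  ["convert", "-density", PAGE_DPI.to_s, "#{pdf_path}[#{page_index}]", out_path]
end

# Hypothetical helper: yields a scratch directory, enforces the time limit,
# and lets Dir.mktmpdir delete the directory when the block exits.
def with_page_workspace(limit: MAX_SECONDS)
  Dir.mktmpdir("pdf-pages") do |dir|
    Timeout.timeout(limit) { yield dir }
  end
end

# Usage (would shell out to ImageMagick, so it is left commented out):
#   with_page_workspace do |dir|
#     system(*page_render_command("document.pdf", 0, File.join(dir, "page-0.png")))
#   end
```

Using the block form of Dir.mktmpdir means the temporary files are removed even when the timeout fires.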

Processing workflow

PDF upload and validation
Content extraction (basic or enhanced mode)
Text chunking with configurable overlap
Chunk embedding and storage
Progress tracking via MessageBus
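
The chunking step can be sketched as a sliding window. This is an illustration only: it counts characters rather than tokens, and the function name and the chunk_size and overlap parameters are hypothetical, not the plugin's settings.

```ruby
# Hypothetical sketch: splits text into chunks of at most chunk_size
# characters, where each chunk repeats the last `overlap` characters of the
# previous one so context is not lost at chunk boundaries.
def chunk_text(text, chunk_size: 100, overlap: 20)
  raise ArgumentError, "chunk_size must be positive" unless chunk_size.positive?
  raise ArgumentError, "overlap must be in 0...chunk_size" unless (0...chunk_size).cover?(overlap)

  chunks = []
  step = chunk_size - overlap
  start = 0
  while start < text.length
    chunks << text[start, chunk_size]
    # Stop once the current window has reached the end of the text.
    break if start + chunk_size >= text.length
    start += step
  end
  chunks
end

p chunk_text("abcdefghij", chunk_size: 4, overlap: 2)
# prints ["abcd", "cdef", "efgh", "ghij"]
```

A larger overlap keeps more shared context between neighbouring chunks at the cost of producing more chunks to embed and store.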

Limitations

Be aware of these constraints when implementing PDF processing:

File size restrictions:
- 100MB for existing PDF processing
- 20MB for new admin interface uploads
Enhanced mode requires additional system resources
Complex PDF layouts may not be perfectly interpreted
Enhanced processing increases processing time significantly

MachineScholar · February 18, 2025, 1:17pm

This is really amazing news. Thanks team! Can’t wait for the enhanced processing to be finished. That’s gonna be critical for feeding LLMs research papers.

Also, is there any plan to allow doing RAG “chat-with-your-PDFs” by uploading PDFs in an AI bot PM or in a topic/post and mentioning the bot?

hameedacpa · February 24, 2025, 4:30am

@sam Could you provide a simple video explaining this great option? What you described is not clear enough for me to implement.

hameedacpa · February 24, 2025, 8:42am

Where can I find this setting?

sam · February 24, 2025, 10:18am

It’s a hidden setting, so you need to use the console. You also need to configure the container, so I recommend you wait a few more weeks.

hameedacpa · February 24, 2025, 2:47pm

Thank you, I appreciate your fantastic work

hameedacpa · February 24, 2025, 10:49pm

On my website (an Arabic forum) I ran a test in Arabic: I added legislation in the first post of a topic and then asked questions using AI. The answers were not accurate, and I think this is because it is not doing RAG over that context.

محاسبة دوت نت – 24 Feb 25

Ministerial Decision No. (120) of 2023 regarding the adjustments under the transitional provisions...

Accounting, Taxes and Legislation - UAE Corporate Tax - UAE

In the name of God, the Most Gracious, the Most Merciful. Greetings. I am honored to present a detailed analysis of the text you provided, linking it to the International Financial Reporting Standards (IFRS) and the International Standards on Auditing (ISA), as well as the latest research, professional practices, and accounting rules...

sam · February 25, 2025, 1:00am

Sorry, but this is not how it works. You need to define a persona or tool and then add the upload there.

There has been some discussion around supporting “upload and ask” here: Upload and discuss pdfs in composer, but it is not supported yet.

hameedacpa · February 25, 2025, 6:45am

First of all, really thank you for your great work. I really like it.

After playing around with the settings and changing the AI Model to Gemini-Flash-2.0, it worked great for me. Here’s the situation I have:

We are a community of auditors, accountants, and tax consultants, and we needed a tool to share relevant laws and trigger discussions about them. These discussions should be very useful for visitors, as we are professionals in our field. We want the AI model to check and analyze legislation and answer our questions. The experiment showed that we really can discuss the context added in the first post, and if the AI model is smart enough, it answers our questions with very high-quality output.
Thank you again. I’m looking forward to PDF support, as it will make Discourse the best forum software.

sam · February 28, 2025, 12:04am

The latest Discourse image supports the advanced mode, if anyone feels like testing.

MachineScholar · February 28, 2025, 12:01pm

Does it have to be enabled via console? Don’t see any advanced mode options via the UI.

Furthermore, I am getting an error when trying to upload this pdf. It is 34 MB but I have my max attachment size set to 100 MB (in both admin settings and app.yml). What’s strange is that I have a compressed version which is 16 MB and it uploads just fine. But perhaps the larger PDF is simply too complex for now? There are lots of images, equations, etc.

Falco · February 28, 2025, 3:17pm

Yes, you need to run SiteSetting.ai_rag_images_enabled = true in the Rails console to enable it.
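
For anyone unsure what that looks like in practice, here is a minimal sketch of the console step. It assumes a standard Docker-based self-hosted install, where running ./launcher enter app from /var/discourse and then rails c opens the Rails console. The defined? guard is only there so the snippet does nothing if it is pasted outside Discourse.

```ruby
# Paste into the Discourse Rails console (assumed to be opened with
# `./launcher enter app` followed by `rails c` on a standard install).
if defined?(SiteSetting)
  SiteSetting.ai_rag_images_enabled = true
  # Read the value back to confirm the hidden setting was stored.
  puts SiteSetting.ai_rag_images_enabled
else
  puts "SiteSetting is not defined; run this inside the Discourse Rails console."
end
```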

sam · February 28, 2025, 11:48pm

My guess is that some nginx configuration in the container needs to change as well, so that it stops rejecting the upload.

Michael_Liu · April 17, 2025, 12:17am

Hi @sam
I’m currently having trouble uploading and indexing PDFs because of this error: Job exception: undefined method `length' for nil.

I was wondering whether the error is related to the settings discussed above.
The interface gets stuck on indexing at 0% and does not move.
The exception details are below:

/var/www/discourse/plugins/discourse-ai/app/jobs/regular/digest_rag_upload.rb:81:in `chunk_document'
/var/www/discourse/plugins/discourse-ai/app/jobs/regular/digest_rag_upload.rb:40:in `block in execute'
activerecord-7.2.2.1/lib/active_record/connection_adapters/abstract/transaction.rb:616:in `block in within_new_transaction'
activesupport-7.2.2.1/lib/active_support/concurrency/null_lock.rb:9:in `synchronize'
activerecord-7.2.2.1/lib/active_record/connection_adapters/abstract/transaction.rb:613:in `within_new_transaction'
activerecord-7.2.2.1/lib/active_record/connection_adapters/abstract/database_statements.rb:361:in `transaction'
activerecord-7.2.2.1/lib/active_record/transactions.rb:234:in `block in transaction'
activerecord-7.2.2.1/lib/active_record/connection_adapters/abstract/connection_pool.rb:415:in `with_connection'
activerecord-7.2.2.1/lib/active_record/connection_handling.rb:296:in `with_connection'
activerecord-7.2.2.1/lib/active_record/transactions.rb:233:in `transaction'
/var/www/discourse/plugins/discourse-ai/app/jobs/regular/digest_rag_upload.rb:39:in `execute'
/var/www/discourse/app/jobs/base.rb:316:in `block (2 levels) in perform'
rails_multisite-6.1.0/lib/rails_multisite/connection_management/null_instance.rb:49:in `with_connection'
rails_multisite-6.1.0/lib/rails_multisite/connection_management.rb:21:in `with_connection'
/var/www/discourse/app/jobs/base.rb:303:in `block in perform'
/var/www/discourse/app/jobs/base.rb:299:in `each'
/var/www/discourse/app/jobs/base.rb:299:in `perform'
sidekiq-7.3.9/lib/sidekiq/processor.rb:220:in `execute_job'
sidekiq-7.3.9/lib/sidekiq/processor.rb:185:in `block (4 levels) in process'
sidekiq-7.3.9/lib/sidekiq/middleware/chain.rb:180:in `traverse'
sidekiq-7.3.9/lib/sidekiq/middleware/chain.rb:183:in `block in traverse'
/var/www/discourse/lib/sidekiq/pausable.rb:132:in `call'
sidekiq-7.3.9/lib/sidekiq/middleware/chain.rb:182:in `traverse'
sidekiq-7.3.9/lib/sidekiq/middleware/chain.rb:183:in `block in traverse'
sidekiq-7.3.9/lib/sidekiq/job/interrupt_handler.rb:9:in `call'
sidekiq-7.3.9/lib/sidekiq/middleware/chain.rb:182:in `traverse'
sidekiq-7.3.9/lib/sidekiq/middleware/chain.rb:183:in `block in traverse'
sidekiq-7.3.9/lib/sidekiq/metrics/tracking.rb:26:in `track'
sidekiq-7.3.9/lib/sidekiq/metrics/tracking.rb:134:in `call'
sidekiq-7.3.9/lib/sidekiq/middleware/chain.rb:182:in `traverse'
sidekiq-7.3.9/lib/sidekiq/middleware/chain.rb:173:in `invoke'
sidekiq-7.3.9/lib/sidekiq/processor.rb:184:in `block (3 levels) in process'
sidekiq-7.3.9/lib/sidekiq/processor.rb:145:in `block (6 levels) in dispatch'
sidekiq-7.3.9/lib/sidekiq/job_retry.rb:118:in `local'
sidekiq-7.3.9/lib/sidekiq/processor.rb:144:in `block (5 levels) in dispatch'
sidekiq-7.3.9/lib/sidekiq/config.rb:39:in `block in <class:Config>'
sidekiq-7.3.9/lib/sidekiq/processor.rb:139:in `block (4 levels) in dispatch'
sidekiq-7.3.9/lib/sidekiq/processor.rb:281:in `stats'
sidekiq-7.3.9/lib/sidekiq/processor.rb:134:in `block (3 levels) in dispatch'
sidekiq-7.3.9/lib/sidekiq/job_logger.rb:15:in `call'
sidekiq-7.3.9/lib/sidekiq/processor.rb:133:in `block (2 levels) in dispatch'
sidekiq-7.3.9/lib/sidekiq/job_retry.rb:85:in `global'
sidekiq-7.3.9/lib/sidekiq/processor.rb:132:in `block in dispatch'
sidekiq-7.3.9/lib/sidekiq/job_logger.rb:40:in `prepare'
sidekiq-7.3.9/lib/sidekiq/processor.rb:131:in `dispatch'
sidekiq-7.3.9/lib/sidekiq/processor.rb:183:in `block (2 levels) in process'
sidekiq-7.3.9/lib/sidekiq/processor.rb:182:in `handle_interrupt'
sidekiq-7.3.9/lib/sidekiq/processor.rb:182:in `block in process'
sidekiq-7.3.9/lib/sidekiq/processor.rb:181:in `handle_interrupt'
sidekiq-7.3.9/lib/sidekiq/processor.rb:181:in `process'
sidekiq-7.3.9/lib/sidekiq/processor.rb:86:in `process_one'
sidekiq-7.3.9/lib/sidekiq/processor.rb:76:in `run'
sidekiq-7.3.9/lib/sidekiq/component.rb:10:in `watchdog'
sidekiq-7.3.9/lib/sidekiq/component.rb:19:in `block in safe_thread'

Michael_Liu · April 17, 2025, 1:45am

Thanks for this amazing update.
I have just one concern: is the 100MB limit per AI persona or shared across all personas?

pacharanero · April 30, 2025, 9:52pm

I’m new to Discourse AI but an old hand with Discourse generally.

Really keen to try this out for a specific use case in demo form at this stage.

I’ve enabled the hidden site setting.

Nothing in Sidekiq that I can see. How can I tell whether it is working at all?

I’m aware this is a pre-release feature and not ready for prime time yet; however, it would be great to be able to try it out.

Really keen for any hints, tips, screenshots, or recipes from people that are trying this out.
