PDF support in Discourse AI

:bookmark: This guide explains how to implement and use PDF processing capabilities within discourse-ai, including both basic text extraction and enhanced processing with LLM assistance.

:person_raising_hand: Required user level: Administrator

Summary

The discourse-ai plugin supports PDF processing for RAG (Retrieval-Augmented Generation) in two distinct modes:

  1. Basic text extraction
  2. Enhanced processing with LLM analysis

Basic text extraction

This mode provides fundamental PDF processing capabilities:

  • Extracts text content using the pdf-reader gem
  • Supports files up to 100MB
  • Works immediately after plugin installation
  • Processes text-only content (ignores visual elements)
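
As a rough illustration of what text-only extraction involves, the sketch below joins the text of each page of a document. `extract_text` is a hypothetical helper, not the plugin's own code; it relies only on `pages` and `text`, which the pdf-reader gem provides on `PDF::Reader` and its page objects.

```ruby
# Hypothetical helper: collects the text of every page of a PDF.
# `reader` is expected to behave like a PDF::Reader instance, i.e. respond
# to #pages, where each page responds to #text.
def extract_text(reader)
  reader.pages
        .map { |page| page.text.to_s.strip } # text only, images are ignored
        .reject(&:empty?)                    # skip pages without any text
        .join("\n\n")
end

# Usage with the real gem (requires `gem install pdf-reader`):
#   require "pdf-reader"
#   text = extract_text(PDF::Reader.new("document.pdf"))
```

Pages that contain only scanned images yield no text here, which is the gap the enhanced mode is meant to cover.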

Enhanced processing with LLM improvements

:information_source: This mode requires specific configuration and provides more advanced capabilities.

Requirements:

  • Enterprise plan subscription or self-hosted Discourse
  • ImageMagick with Ghostscript support installed in container
  • ai_rag_images_enabled site setting enabled

Capabilities:

  • Interprets images, charts, and diagrams
  • Provides context from visual elements
  • Processes PDFs page by page
  • Maintains the 100MB file size limit

Implementation details

Processing specifications

  • Page processing resolution: 300 DPI
  • Maximum processing time: 600 seconds (10 minutes)
  • Automatic cleanup of temporary files
  • Full integration with RAG document embeddings
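
The specifications above could be wired together roughly as follows. This is a sketch under assumptions, not the plugin's implementation: it assumes the ImageMagick 7 `magick` binary with Ghostscript is on the PATH, and `render_command` and `with_rendered_pages` are hypothetical names.

```ruby
require "tmpdir"
require "timeout"

DPI = 300         # page rendering resolution from the list above
MAX_SECONDS = 600 # processing time budget from the list above

# Builds an ImageMagick command that renders each PDF page to a PNG.
# Assumption: the `magick` binary (ImageMagick 7) with Ghostscript support.
def render_command(pdf_path, out_dir)
  ["magick", "-density", DPI.to_s, pdf_path, File.join(out_dir, "page-%04d.png")]
end

# Renders pages into a temporary directory, yields the sorted image paths,
# and lets Dir.mktmpdir delete the directory when the block returns.
def with_rendered_pages(pdf_path)
  Dir.mktmpdir("pdf-pages") do |dir|
    # Note: on timeout this raises Timeout::Error but does not kill the
    # child process; a production version would need to handle that.
    Timeout.timeout(MAX_SECONDS) do
      system(*render_command(pdf_path, dir), exception: true)
    end
    yield Dir.glob(File.join(dir, "page-*.png")).sort
  end
end
```

Passing `-density` before the input file sets the resolution at which the PDF is rasterized, and the block form of `Dir.mktmpdir` gives the automatic cleanup for free.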

Processing workflow

  1. PDF upload and validation
  2. Content extraction (basic or enhanced mode)
  3. Text chunking with configurable overlap
  4. Chunk embedding and storage
  5. Progress tracking via MessageBus
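
Step 3 is the easiest part of the workflow to show in isolation. The sketch below splits text into overlapping chunks by character count for simplicity; the unit the plugin actually uses and the names `chunk_text`, `chunk_size`, and `overlap` are assumptions made for illustration.

```ruby
# Splits text into chunks of at most `chunk_size` characters, where each
# chunk starts with the last `overlap` characters of the previous one.
def chunk_text(text, chunk_size:, overlap:)
  raise ArgumentError, "overlap must not be negative" if overlap.negative?
  raise ArgumentError, "overlap must be smaller than chunk_size" unless overlap < chunk_size

  step = chunk_size - overlap # how far the window advances each time
  chunks = []
  start = 0
  while start < text.length
    chunks << text[start, chunk_size]
    break if start + chunk_size >= text.length # last window reached the end
    start += step
  end
  chunks
end

chunk_text("abcdefghij", chunk_size: 4, overlap: 1)
# => ["abcd", "defg", "ghij"]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.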

Limitations

:warning: Be aware of these constraints when implementing PDF processing:

  • File size restrictions:
    • 100MB for existing PDF processing
    • 20MB for new admin interface uploads
  • Enhanced mode requires additional system resources
  • Complex PDF layouts may not be perfectly interpreted
  • Enhanced processing increases processing time significantly
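
A quick pre-check against the two size limits listed above might look like this. It is only a sketch: the context labels are hypothetical, and it assumes "MB" means 1024 × 1024 bytes.

```ruby
MEGABYTE = 1024 * 1024 # assumption: binary megabytes

# Limits quoted in the list above; the keys are hypothetical labels.
SIZE_LIMITS = {
  existing_processing: 100 * MEGABYTE,
  admin_upload: 20 * MEGABYTE,
}.freeze

# Returns true when a file of `bytes` size fits the limit for `context`.
def within_limit?(bytes, context)
  bytes <= SIZE_LIMITS.fetch(context)
end

within_limit?(16 * MEGABYTE, :admin_upload) # => true
within_limit?(34 * MEGABYTE, :admin_upload) # => false
```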
11 Likes

This is really exciting news. Thanks to the team! I can't wait for the enhanced processing to be finished. It will be critical for feeding research papers to LLMs.

[quote="Sam Saffron, post:22, topic:335804, username:sam"]
Those waiting for enhanced processing, give us a few weeks
[/quote]

Also, is there a plan to allow "chat with your files" RAG by uploading PDFs in a private AI chat, or in a forum topic and mentioning the bot?

1 Like

@sam could you provide a simple video explaining this excellent option? What you described is not clear enough to implement.

Where can I find the ai_rag_images_enabled setting in the site settings?

It’s a hidden setting, so you need to use the console. You also need to configure the container, so I recommend you wait a few more weeks.

3 Likes

Thank you, I appreciate your fantastic work

On my website (Arabic Forum), I conducted a test in Arabic by adding legislation in the first post (“topic”) and then asked questions using AI. However, the answers were inaccurate, and I believe this is because it is not Context Ragging.

Sorry but this is not how it works, you need to define a persona or tool and then add the upload there.

There has been some discussion around supporting “upload and ask” here: Upload and discuss pdfs in composer but it is not supported yet.

1 Like

First of all, really thank you for your great work. I really like it.

After playing around with the settings and changing the AI Model to Gemini-Flash-2.0, it worked great for me. Here’s the situation I have:

We are an Auditors, Accountants, and Tax Consultants community, and we needed a tool to share related laws and trigger discussions about them. This discussion should be very useful for visitors, as we are professionals in our field. We are targeting the AI Model to check and analyze legislation and answer our questions. The great experiment led to the conclusion that we can really discuss the context added in the first post, and if the AI model is smart enough, it will answer our questions with very high-quality output.
Thank you again. I'm looking forward to PDF support, as it will make Discourse the best forum software.

3 Likes

The latest Discourse image supports the advanced mode, if anyone feels like testing it.

2 Likes

Does it have to be enabled via console? Don’t see any advanced mode options via the UI.

Furthermore, I am getting an error when trying to upload this pdf. It is 34 MB but I have my max attachment size set to 100 MB (in both admin settings and app.yml). What’s strange is that I have a compressed version which is 16 MB and it uploads just fine. But perhaps the larger PDF is simply too complex for now? There are lots of images, equations, etc.

Yes, you need to run SiteSetting.ai_rag_images_enabled = true in the Rails console to enable it.

1 Like

My guess is that some nginx configuration in the container needs to change as well, so that it does not reject the upload.

1 Like

Hi @sam
I'm currently having trouble uploading and indexing PDFs. The job fails with this error: Job exception: undefined method `length' for nil.

I was wondering whether the error is related to the settings discussed above.
The interface gets stuck at "indexing 0%" and does not move.
The exception details are below:

/var/www/discourse/plugins/discourse-ai/app/jobs/regular/digest_rag_upload.rb:81:in `chunk_document'
/var/www/discourse/plugins/discourse-ai/app/jobs/regular/digest_rag_upload.rb:40:in `block in execute'
activerecord-7.2.2.1/lib/active_record/connection_adapters/abstract/transaction.rb:616:in `block in within_new_transaction'
activesupport-7.2.2.1/lib/active_support/concurrency/null_lock.rb:9:in `synchronize'
activerecord-7.2.2.1/lib/active_record/connection_adapters/abstract/transaction.rb:613:in `within_new_transaction'
activerecord-7.2.2.1/lib/active_record/connection_adapters/abstract/database_statements.rb:361:in `transaction'
activerecord-7.2.2.1/lib/active_record/transactions.rb:234:in `block in transaction'
activerecord-7.2.2.1/lib/active_record/connection_adapters/abstract/connection_pool.rb:415:in `with_connection'
activerecord-7.2.2.1/lib/active_record/connection_handling.rb:296:in `with_connection'
activerecord-7.2.2.1/lib/active_record/transactions.rb:233:in `transaction'
/var/www/discourse/plugins/discourse-ai/app/jobs/regular/digest_rag_upload.rb:39:in `execute'
/var/www/discourse/app/jobs/base.rb:316:in `block (2 levels) in perform'
rails_multisite-6.1.0/lib/rails_multisite/connection_management/null_instance.rb:49:in `with_connection'
rails_multisite-6.1.0/lib/rails_multisite/connection_management.rb:21:in `with_connection'
/var/www/discourse/app/jobs/base.rb:303:in `block in perform'
/var/www/discourse/app/jobs/base.rb:299:in `each'
/var/www/discourse/app/jobs/base.rb:299:in `perform'
sidekiq-7.3.9/lib/sidekiq/processor.rb:220:in `execute_job'
sidekiq-7.3.9/lib/sidekiq/processor.rb:185:in `block (4 levels) in process'
sidekiq-7.3.9/lib/sidekiq/middleware/chain.rb:180:in `traverse'
sidekiq-7.3.9/lib/sidekiq/middleware/chain.rb:183:in `block in traverse'
/var/www/discourse/lib/sidekiq/pausable.rb:132:in `call'
sidekiq-7.3.9/lib/sidekiq/middleware/chain.rb:182:in `traverse'
sidekiq-7.3.9/lib/sidekiq/middleware/chain.rb:183:in `block in traverse'
sidekiq-7.3.9/lib/sidekiq/job/interrupt_handler.rb:9:in `call'
sidekiq-7.3.9/lib/sidekiq/middleware/chain.rb:182:in `traverse'
sidekiq-7.3.9/lib/sidekiq/middleware/chain.rb:183:in `block in traverse'
sidekiq-7.3.9/lib/sidekiq/metrics/tracking.rb:26:in `track'
sidekiq-7.3.9/lib/sidekiq/metrics/tracking.rb:134:in `call'
sidekiq-7.3.9/lib/sidekiq/middleware/chain.rb:182:in `traverse'
sidekiq-7.3.9/lib/sidekiq/middleware/chain.rb:173:in `invoke'
sidekiq-7.3.9/lib/sidekiq/processor.rb:184:in `block (3 levels) in process'
sidekiq-7.3.9/lib/sidekiq/processor.rb:145:in `block (6 levels) in dispatch'
sidekiq-7.3.9/lib/sidekiq/job_retry.rb:118:in `local'
sidekiq-7.3.9/lib/sidekiq/processor.rb:144:in `block (5 levels) in dispatch'
sidekiq-7.3.9/lib/sidekiq/config.rb:39:in `block in <class:Config>'
sidekiq-7.3.9/lib/sidekiq/processor.rb:139:in `block (4 levels) in dispatch'
sidekiq-7.3.9/lib/sidekiq/processor.rb:281:in `stats'
sidekiq-7.3.9/lib/sidekiq/processor.rb:134:in `block (3 levels) in dispatch'
sidekiq-7.3.9/lib/sidekiq/job_logger.rb:15:in `call'
sidekiq-7.3.9/lib/sidekiq/processor.rb:133:in `block (2 levels) in dispatch'
sidekiq-7.3.9/lib/sidekiq/job_retry.rb:85:in `global'
sidekiq-7.3.9/lib/sidekiq/processor.rb:132:in `block in dispatch'
sidekiq-7.3.9/lib/sidekiq/job_logger.rb:40:in `prepare'
sidekiq-7.3.9/lib/sidekiq/processor.rb:131:in `dispatch'
sidekiq-7.3.9/lib/sidekiq/processor.rb:183:in `block (2 levels) in process'
sidekiq-7.3.9/lib/sidekiq/processor.rb:182:in `handle_interrupt'
sidekiq-7.3.9/lib/sidekiq/processor.rb:182:in `block in process'
sidekiq-7.3.9/lib/sidekiq/processor.rb:181:in `handle_interrupt'
sidekiq-7.3.9/lib/sidekiq/processor.rb:181:in `process'
sidekiq-7.3.9/lib/sidekiq/processor.rb:86:in `process_one'
sidekiq-7.3.9/lib/sidekiq/processor.rb:76:in `run'
sidekiq-7.3.9/lib/sidekiq/component.rb:10:in `watchdog'
sidekiq-7.3.9/lib/sidekiq/component.rb:19:in `block in safe_thread'
1 Like

Thanks for this amazing update. I have just one question: is the 100MB limit per AI persona bot, or shared across all personas?

I’m new to Discourse AI but an old hand on Discourses generally.

Really keen to try this out for a specific use case in demo form at this stage.

I’ve enabled the hidden site setting.

Nothing in Sidekiq that I can see. How can I tell whether it is working at all?

I’m aware this is a pre release feature and not ready for prime time yet, however it would be great to be able to experience and try out.

Really keen for any hints, tips, screenshots, or recipes from people that are trying this out.

I get this error when I ask the bot to summarize the content of some PDF files on my site. I haven't enabled enhanced processing, and I'm using GPT 4.1. Any ideas what I'm doing wrong?

Sorry, it looks like our system encountered an unexpected issue while trying to reply.

Error details

{
  "error": {
    "message": "An assistant message with 'tool_calls' must be followed by tool messages responding to each 'tool_call_id'. The following tool_call_ids did not have response messages: call_nrDCba5mt83oavbXfPq2BtEV",
    "type": "invalid_request_error",
    "param": "messages.[2].role",
    "code": null
  }
}
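
The error above reflects a rule of the OpenAI chat format: every id listed in an assistant message's tool_calls needs a matching message with role "tool". A minimal sketch of that check, useful when inspecting a logged request, could look like the following. `unanswered_tool_call_ids` is a hypothetical helper, not part of discourse-ai, and it assumes messages are string-keyed hashes.

```ruby
# Returns the tool_call ids in `messages` that no "tool" message answers.
# `messages` follows the OpenAI chat format as an array of string-keyed hashes.
def unanswered_tool_call_ids(messages)
  requested = messages
    .select { |m| m["role"] == "assistant" }
    .flat_map { |m| (m["tool_calls"] || []).map { |call| call["id"] } }
  answered = messages
    .select { |m| m["role"] == "tool" }
    .map { |m| m["tool_call_id"] }
  requested - answered
end

messages = [
  { "role" => "user", "content" => "Summarize the PDFs" },
  { "role" => "assistant", "tool_calls" => [{ "id" => "call_1" }, { "id" => "call_2" }] },
  { "role" => "tool", "tool_call_id" => "call_1", "content" => "..." },
]
unanswered_tool_call_ids(messages) # => ["call_2"]
```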


May I inquire into the current status of PDF support? :face_with_peeking_eye:

1 Like

When you configure upload sizes in app.yml it is site-wide, so it applies to each persona.

1 Like

Are there any updates on this matter? I’m attaching a PDF when initiating a conversation with the AI, but it still doesn’t seem to recognize it. I am currently utilizing GPT. Should I perhaps consider employing a different model specifically designed for PDF processing?

1 Like