This guide explains how to implement and use PDF processing capabilities within discourse-ai, including both basic text extraction and enhanced processing with LLM assistance.
Required user level: Administrator
Summary
The discourse-ai plugin supports PDF processing for RAG (Retrieval-Augmented Generation) in two distinct modes:
- Basic text extraction
- Enhanced processing with LLM analysis
Basic text extraction
This mode provides fundamental PDF processing capabilities:
- Extracts text content using the pdf-reader gem
- Supports files up to 100MB
- Works immediately after plugin installation
- Processes text-only content (ignores visual elements)
Enhanced processing with LLM improvements
This mode requires specific configuration and provides more advanced capabilities.
Requirements:
- Enterprise plan subscription or self-hosted Discourse
- ImageMagick with Ghostscript support installed in container
- ai_rag_images_enabled site setting enabled
Capabilities:
- Interprets images, charts, and diagrams
- Provides context from visual elements
- Processes PDFs page by page
- Maintains the 100MB file size limit
Implementation details
Processing specifications
- Page processing resolution: 300 DPI
- Maximum processing time: 600 seconds (10 minutes)
- Automatic cleanup of temporary files
- Full integration with RAG document embeddings
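To give a sense of what 300 DPI means per page, the following sketch converts a PDF page size in points to rendered pixel dimensions and a rough uncompressed image size. It relies only on the standard PDF convention of 72 points per inch; the function names and the 3 bytes-per-pixel (RGB) figure are assumptions for illustration, not taken from the plugin.

```python
POINTS_PER_INCH = 72  # default PDF user-space unit: 1 point = 1/72 inch


def rendered_size(width_pt: float, height_pt: float, dpi: int = 300) -> tuple[int, int]:
    """Return (width, height) in pixels for a page rendered at the given DPI."""
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


def uncompressed_bytes(width_px: int, height_px: int, bytes_per_pixel: int = 3) -> int:
    """Rough size of the raw bitmap, assuming 8-bit RGB (an assumption)."""
    return width_px * height_px * bytes_per_pixel


# US Letter is 612 x 792 points (8.5 x 11 inches).
width_px, height_px = rendered_size(612, 792)
print(width_px, height_px)                       # 2550 3300
print(uncompressed_bytes(width_px, height_px))   # 25245000
```

A single Letter-size page is therefore about 25 MB as a raw bitmap before compression, which helps explain why enhanced mode needs more system resources and time than basic text extraction.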
Processing workflow
- PDF upload and validation
- Content extraction (basic or enhanced mode)
- Text chunking with configurable overlap
- Chunk embedding and storage
- Progress tracking via MessageBus
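The chunking step can be illustrated with a minimal sketch. This is not the plugin's implementation: it splits by characters for simplicity, and the function name, parameter names, and default values are hypothetical. It only shows the general idea of fixed-size chunks where each chunk repeats the tail of the previous one.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

A larger overlap keeps more shared context between neighbouring chunks, at the cost of producing more chunks to embed and store.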
Limitations
Be aware of these constraints when implementing PDF processing:
- File size restrictions:
  - 100MB for existing PDF processing
  - 20MB for new admin interface uploads
- Enhanced mode requires additional system resources
- Complex PDF layouts may not be perfectly interpreted
- Enhanced processing increases processing time significantly
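If you want to check files against the two size limits before uploading, a small pre-check along these lines may help. It is a sketch under stated assumptions: the dictionary keys and function name are hypothetical, and treating 1MB as 1024 * 1024 bytes is an assumption, since this guide does not say how the limit is measured.

```python
MB = 1024 * 1024  # assumption: the guide does not specify binary or decimal megabytes

# Limits from the list above, keyed by hypothetical labels.
SIZE_LIMITS_MB = {
    "existing_processing": 100,
    "admin_upload": 20,
}


def within_size_limit(size_bytes: int, source: str) -> bool:
    """Return True if a file of size_bytes fits the limit for the given source."""
    return size_bytes <= SIZE_LIMITS_MB[source] * MB


print(within_size_limit(25 * MB, "admin_upload"))         # False
print(within_size_limit(25 * MB, "existing_processing"))  # True
```

For a file on disk, os.path.getsize(path) returns the size in bytes to pass as size_bytes.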