## Overview

The Discourse AI plugin ships a Ruby CLI under `plugins/discourse-ai/evals` that exercises AI features against YAML definitions and records results. Use it to benchmark prompts, compare model outputs, and regression-test AI behaviors without touching the app database.
## Core concepts (what users need to know)

- Eval case: A YAML definition under `evals/cases/<group>/<id>.yml` that pairs inputs (`args`) with an expected outcome. Evals can check exact strings, regexes, or expected tool calls.
- Feature: The Discourse AI behavior under test, identified as `module:feature_name` (for example, `summarization:topic_summaries`). `--list-features` shows the valid keys.
- Persona: The system prompt wrapped around the LLM call. Runs default to the built-in prompt unless you pass `--persona-keys` to load alternate prompts from `evals/personas/*.yml`. Add multiple keys to compare prompts in one run.
- Judge: A rubric embedded in some evals that requires a second LLM to grade outputs. Think of it as an automated reviewer: it reads the model output and scores it against the criteria. If an eval defines `judge`, you must supply or accept the default judge model (`--judge`, default `gpt-4o`). Without a judge, outputs are matched directly against the expected value.
- Comparison modes: `--compare personas` (one model, many personas) or `--compare llms` (one persona, many models). The judge picks a winner and reports ratings; non-comparison runs just report pass/fail.
- Datasets: Instead of YAML cases, pass `--dataset path.csv --feature module:feature` to build cases from CSV rows (`content` and `expected_output` columns required).
- Logs: Every run writes plain text and structured traces to `plugins/discourse-ai/evals/log/` with timestamps and persona keys. Use them to inspect failures, skipped models, and judge decisions.
## Prerequisites

- Have a working Discourse development environment with the Discourse AI plugin present. The runner loads `config/environment` (defaulting to the repository root, or `DISCOURSE_PATH` if set).
- LLMs are defined in `plugins/discourse-ai/config/eval-llms.yml`; copy it to `eval-llms.local.yml` to override entries locally. Each entry expects an `api_key_env` (or inline `api_key`), so export the matching environment variables before running, for example `OPENAI_API_KEY=...`, `ANTHROPIC_API_KEY=...`, or `GEMINI_API_KEY=...`.
- From the repository root, change into `plugins/discourse-ai/evals` and run `./run --help` to confirm the CLI is wired up. If `evals/cases` is missing, it will be cloned automatically from `discourse/discourse-ai-evals`.
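As a quick sanity check after setup, something like the following should work (the provider key shown is just an example; export whichever keys your configured LLMs expect):

```sh
cd plugins/discourse-ai/evals

# Confirm the CLI loads in your development environment.
./run --help

# Export a provider key so its models can hydrate, then check what is available.
export OPENAI_API_KEY=...
./run --list-models
```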
## Discover available inputs

- `./run --list` lists all eval ids from `evals/cases/*/*.yml`.
- `./run --list-features` prints feature keys grouped by module (format: `module:feature`).
- `./run --list-models` shows LLM configs that can be hydrated from `eval-llms.yml` / `eval-llms.local.yml`.
- `./run --list-personas` lists persona keys defined under `evals/personas/*.yml` plus the built-in `default`.
## Run evals

- Run a single eval against specific models:

  ```sh
  OPENAI_API_KEY=... ./run --eval simple_summarization --models gpt-4o-mini
  ```

- Run every eval for a feature (or the whole suite) against multiple models:

  ```sh
  ./run --feature summarization:topic_summaries --models gpt-4o-mini,claude-3-5-sonnet-latest
  ```

  Omitting `--models` hydrates every configured LLM. Models that cannot hydrate (missing API keys, etc.) are skipped with a log message.

- Some evals define a `judge` block. When any selected eval requires judging, the runner defaults to `--judge gpt-4o` unless you pass `--judge <name>`. Invalid or missing judge configs cause the CLI to exit before running.
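Putting those flags together, a judged run over a whole feature might look like this sketch (model and judge names are taken from the examples above; substitute whatever `./run --list-models` reports for you):

```sh
# Run every spam eval on two models; gpt-4o grades any eval that defines a judge block.
OPENAI_API_KEY=... ANTHROPIC_API_KEY=... ./run \
  --feature spam:inspect_posts \
  --models gpt-4o-mini,claude-3-5-sonnet-latest \
  --judge gpt-4o
```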
## Personas and comparison modes

- Supply custom prompts with `--persona-keys key1,key2`. Keys resolve to YAML files in `evals/personas`; each needs `key` (optional, defaults to the filename), `system_prompt`, and an optional `description`.
- Minimal persona example (`evals/personas/topic_summary_eval.yml`):

  ```yaml
  key: topic_summary_eval
  description: Variant tuned for eval comparisons
  system_prompt: |
    Summarize the topic in 2–4 sentences.
    Keep the original language and avoid new facts.
  ```

- `--compare personas` runs one model against multiple personas. The built-in `default` persona is automatically prepended so you can compare YAML prompts against stock behavior, and at least two personas are required.
- `--compare llms` runs one persona (default unless overridden) across multiple models and asks the judge to score them side by side. Both modes are sketched after this list.
- Non-comparison runs accept a single persona; pass one `--persona-keys` value or rely on the default prompt.
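A sketch of both comparison modes, reusing the eval id, model names, and persona key from the examples above (adjust them to whatever the `--list-*` commands report):

```sh
# One model, several prompts: the built-in default persona is prepended automatically.
./run --eval simple_summarization --models gpt-4o-mini \
  --persona-keys topic_summary_eval --compare personas

# One persona, several models: the judge ranks the outputs side by side.
./run --eval simple_summarization \
  --models gpt-4o-mini,claude-3-5-sonnet-latest \
  --judge gpt-4o --compare llms
```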
## Dataset-driven runs

- Generate eval cases from a CSV instead of YAML by passing `--dataset path/to/file.csv --feature module:feature`. The CSV must include `content` and `expected_output` columns; each row becomes its own eval id (`dataset-<filename>-<row>`).
- Minimal CSV example:

  ```csv
  content,expected_output
  "This is spam!!! Buy now!",true
  "Genuine question about hosting",false
  ```

- Example:

  ```sh
  ./run --dataset evals/cases/spam/spam_eval_dataset.csv --feature spam:inspect_posts --models gpt-4o-mini
  ```
## Writing eval cases

- Store cases under `evals/cases/<group>/<name>.yml`. Each file must declare `id`, `name`, `description`, and `feature` (the `module:feature` key registered with the plugin).
- Provide inputs under `args`. Keys ending in `_path` (or `path`) are expanded relative to the YAML directory so you can reference fixture files. For multi-case files, `args` can contain arrays (for example, `cases:`) that runners iterate over.
- Expected results can be declared with one of:
  - `expected_output`: exact string match
  - `expected_output_regex`: treated as a multiline regular expression
  - `expected_tool_call`: expected tool invocation payload
- Set `vision: true` for evals that require a vision-capable model. Include a `judge` section (`pass_rating`, `criteria`, and optionally `label`) to have outputs scored by a judge LLM. A minimal case file is sketched after this list.
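To make the structure concrete, here is a minimal sketch of a case file. The field names are the documented ones; the eval id, the `args` key, the fixture path, and the judge values are hypothetical and will differ per feature.

```yaml
# Hypothetical file: evals/cases/spam/obvious_spam.yml
id: obvious_spam
name: Obvious spam post
description: Checks that a blatant promotional post is classified as spam.
feature: spam:inspect_posts
args:
  post_path: fixtures/obvious_spam_post.txt  # keys ending in _path resolve relative to this file
expected_output: "true"
# A judge block can also be included so a second LLM scores the output against a rubric:
# judge:
#   pass_rating: 8
#   criteria: |
#     Rate highly only if the output clearly identifies the post as spam.
```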
## Results and logs

- CLI output shows pass/fail per model and prints expected vs. actual details on failures. Comparison runs also stream the judge's winner and ratings.
- Example pass/fail snippet:

  ```
  gpt-4o-mini: Passed 🟢
  claude-3-5-sonnet-latest: Failed 🔴
  ---- Expected ----
  true
  ---- Actual ----
  false
  ```

- Comparison winner snippet:

  ```
  Comparing personas for topic-summary
  Winner: topic_summary_eval
  Reason: Captured key details and stayed concise.
  - default: 7/10 — missed concrete use case
  - topic_summary_eval: 9/10 — mentioned service dogs and tone was neutral
  ```

- Each run writes plain logs and structured traces to `plugins/discourse-ai/evals/log/` (timestamped `.log` and `.json` files). The JSON files are formatted for ui.perfetto.dev to inspect the structured steps.
- On completion the runner echoes the log paths; use them to audit skipped models, judge decisions, and raw outputs when iterating on prompts or features.
## Common features (what to try first)

- `summarization:topic_summaries`: Summarize a conversation.
- `spam:inspect_posts`: Spam/ham classification.
- `translation:topic_title_translator`: Translate topic titles while preserving tone/formatting.
- `ai_helper:rewrite`: Prompt the AI helper for rewrites.
- `tool_calls:tool_calls_with_no_tool` and `tool_calls:tool_call_chains`: Validate structured tool call behavior.
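A quick way to sample a few of these against a single model (the model name is just the one used in earlier examples):

```sh
# Smoke-test a handful of starter features, one feature per run.
for feature in summarization:topic_summaries spam:inspect_posts ai_helper:rewrite; do
  ./run --feature "$feature" --models gpt-4o-mini
done
```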
This document is version controlled - suggest changes on GitHub.