LLM prompt evals - Nice to know

EricGT · January 6, 2025, 8:32pm

For those of us who create prompts for LLMs, it is important to understand how effective those prompts are.

In LLM lingo, measuring this is known as evals, short for evaluations. For those of us used to unit testing in programming, the similarities are so close that I often just mentally equate the two.

During the 12 days of OpenAI, this question was asked:

What are we as developers not doing as much as you think we should? What do you wish we did differently, or more or less of?

Michelle Pokrass of OpenAI replied:

One big one is evals! I see tons of developers not using evals at all and relying on vibes for rolling out changes to prod. Would highly recommend creating some simple evals using our evals product (or open source offerings) so you can update with confidence when we release new models.

On Twitter, Amanda Askell @AnthropicAI notes:

The boring yet crucial secret behind good system prompts is test-driven development. You don’t write down a system prompt and find ways to test it. You write down tests and find a system prompt that passes them.
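To make the tests-first idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `fake_model` is a hypothetical stand-in for a real LLM call, and the eval cases and candidate prompts are made up. The point is only the shape: write the checks first, then score each candidate system prompt against them.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a real LLM call. Replace with your provider's client.
def fake_model(system_prompt: str, user_input: str) -> str:
    if "one word" in system_prompt.lower():
        return "positive" if "love" in user_input.lower() else "negative"
    return "Well, I think the sentiment here is probably positive overall."

# The tests come first: each case is (user input, check on the model output).
EvalCase = Tuple[str, Callable[[str], bool]]
EVAL_CASES: List[EvalCase] = [
    ("I love this product", lambda out: out.strip().lower() == "positive"),
    ("This is terrible", lambda out: out.strip().lower() == "negative"),
]

def run_evals(model: Callable[[str, str], str],
              system_prompt: str,
              cases: List[EvalCase]) -> float:
    """Return the fraction of eval cases that the prompt passes."""
    passed = sum(1 for user_input, check in cases
                 if check(model(system_prompt, user_input)))
    return passed / len(cases)

# Then look for a system prompt that passes the tests.
candidates = [
    "Classify the sentiment.",
    "Classify the sentiment. Answer with one word: positive or negative.",
]
scores = {prompt: run_evals(fake_model, prompt, EVAL_CASES) for prompt in candidates}
best = max(scores, key=scores.get)
```

With a real model the outputs vary from run to run, so in practice you would likely run each case several times and track the pass rate rather than expect a clean pass or fail.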

What many do not know is that the LLM creators are now starting to offer tools that help end users evaluate their prompts.

OpenAI playground:
https://platform.openai.com/docs/guides/evals
Note: This is new and lives in the OpenAI playground; it is not the evals project we have seen for years on OpenAI's GitHub (evals).

Anthropic console:

Microsoft .NET framework on Azure:

Disclosure: I have not used any of these automated evaluations, but I have done many simpler evaluations manually by trying different prompts. These tools should just make that easier.

For more details on the method of asking another (ideally larger or more powerful) model to analyze a review, rather than comparing the model output to human-created output, I recommend this lesson from Colin Jarvis.
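As a rough illustration of that model-graded approach (not taken from the lesson itself), the sketch below builds a grading prompt and parses a PASS/FAIL verdict. The template wording, the `grade` helper, and `fake_grader` are all hypothetical; `fake_grader` is a toy stand-in for a call to a stronger model.

```python
from typing import Callable

# Hypothetical grading prompt. Tune the wording for your own use case.
GRADER_TEMPLATE = (
    "You are grading an answer.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Criteria: {criteria}\n"
    "Reply with PASS or FAIL on the first line, then a one-sentence reason."
)

# Toy stand-in for the grader model: passes if the answer line mentions "Paris".
def fake_grader(prompt: str) -> str:
    answer_line = next(line for line in prompt.splitlines()
                       if line.startswith("Answer:"))
    if "Paris" in answer_line:
        return "PASS\nMentions Paris."
    return "FAIL\nDoes not mention Paris."

def grade(grader: Callable[[str], str],
          question: str, answer: str, criteria: str) -> bool:
    """Ask the grader model for a verdict and return True only on PASS."""
    reply = grader(GRADER_TEMPLATE.format(
        question=question, answer=answer, criteria=criteria))
    lines = reply.strip().splitlines()
    # Treat an empty or malformed reply as a failure rather than guessing.
    return bool(lines) and lines[0].strip().upper() == "PASS"

ok = grade(fake_grader, "What is the capital of France?",
           "The capital is Paris.", "Names the correct city")
bad = grade(fake_grader, "What is the capital of France?",
            "It is Lyon.", "Names the correct city")
```

Asking for the verdict on its own first line keeps the parsing simple, and the reason that follows is handy when you want to read why a case failed.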

Lesson 6: Metaprompting with o1
part of the DeepLearning.AI course: Reasoning with o1 - DeepLearning.AI

FYI

Wanted to add a tag evals, but I lack permission to create it.
