A lot of this depends on “we need evals”:
We have:
What we need to do here is:
- Add a big bunch of spam/ham posts to the eval suite (say 20-30 of each)
- Run the eval
- Fix the prompt
- Run the eval again
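
To make the loop above concrete, here is a minimal sketch of what "run the eval" could look like. It is not the actual eval suite: `LabeledPost`, `run_eval`, `keyword_classifier` and the sample posts are all hypothetical names and data made up for illustration, and `keyword_classifier` is only a stand-in for whatever prompt + model call is being tuned.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class LabeledPost:
    """A post with a hand-assigned label (hypothetical structure)."""
    text: str
    is_spam: bool  # ground truth: True = spam, False = ham


def run_eval(posts: Iterable[LabeledPost],
             classify: Callable[[str], bool]) -> dict:
    """Score a spam classifier against labeled posts."""
    tp = fp = tn = fn = 0
    for post in posts:
        predicted_spam = classify(post.text)
        if predicted_spam and post.is_spam:
            tp += 1  # spam correctly flagged
        elif predicted_spam and not post.is_spam:
            fp += 1  # ham wrongly flagged as spam
        elif not predicted_spam and post.is_spam:
            fn += 1  # spam that slipped through
        else:
            tn += 1  # ham correctly left alone
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "false_positives": fp,
        "false_negatives": fn,
    }


def keyword_classifier(text: str) -> bool:
    """Placeholder for the real prompt + model call (assumption)."""
    return "buy now" in text.lower()


# Made-up examples; a real suite would hold 20-30 of each label.
SAMPLE_POSTS = [
    LabeledPost("Buy now!!! Cheap watches at example.com", True),
    LabeledPost("Best casino bonus, click here", True),
    LabeledPost("How do I configure SMTP for my forum?", False),
    LabeledPost("Should I buy now or wait for the next release?", False),
]

if __name__ == "__main__":
    print(run_eval(SAMPLE_POSTS, keyword_classifier))
```

The idea is to keep the labeled posts fixed, swap only the classifier (i.e. the prompt), and compare the numbers before and after each change. The false positive and false negative counts are worth watching separately, since flagging legitimate posts and missing spam are different kinds of failure.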
Otherwise we tend to be poking in the dark. cc @Falco