I started working with language models five years ago when I led the team that created CodeSearchNet, a precursor to GitHub Copilot. Since then, I’ve seen many successful and unsuccessful approaches to building LLM products. I’ve found that unsuccessful products almost always share a common root cause: a failure to create robust evaluation systems.
If Discourse AI is to power business-critical LLM tasks, I think supporting monitoring tools like LangSmith should be prioritized.
Using LangSmith is as simple as running `yarn add langchain langsmith` and adding a few environment variables.
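For reference, here's a minimal sketch of what that setup can look like with the JS SDK. The env var names are the standard LangSmith ones, and the `proofread` function is a hypothetical stand-in for whatever actually calls the LLM; nothing here is Discourse-specific:

```ts
// Standard LangSmith environment variables (set in the shell or a .env file):
//   LANGCHAIN_TRACING_V2=true
//   LANGCHAIN_API_KEY=<your LangSmith API key>
//   LANGCHAIN_PROJECT=discourse-ai   // optional project name, assumed here
import { traceable } from "langsmith/traceable";

// `traceable` wraps any async function so its inputs, outputs, latency, and
// errors are recorded as a run in LangSmith. LangChain calls are traced
// automatically once the env vars are set; this wrapper is for code that
// calls an LLM provider directly.
const proofread = traceable(
  async (text: string): Promise<string> => {
    // ...call the LLM provider here (hypothetical placeholder)...
    return text;
  },
  { name: "proofread", run_type: "llm" }
);

await proofread("Some markdown to check");
```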
Has the Discourse team thought about how we could configure LLM tracing? Also, any thoughts on how we could implement this before discourse-ai officially supports it?
We log every single request and response to LLMs in a table, and allow admins to query those at any time via Data Explorer. Have you tried this already?
Here is an example of a logged raw request:

{
  "max_tokens": 2000,
  "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
  "temperature": 0,
  "stop": [
    "\n</output>"
  ],
  "messages": [
    {
      "role": "system",
      "content": "You are a markdown proofreader. You correct egregious typos and phrasing issues but keep the user's original voice.\nYou do not touch code blocks. I will provide you with text to proofread. If nothing needs fixing, then you will echo the text back.\nYou will find the text between <input></input> XML tags.\nYou will ALWAYS return the corrected text between <output></output> XML tags.\n\n"
    },
    {
      "role": "user",
      "content": "<input>We log every single request and response to LLMs in a table, and allow admins to query those at any time via Data Explorer. Have you tried already?</input>"
    }
  ]
}
And the matching raw response:

{
  "id": "chat-45cd241b6e0f4a58840fcc9f49dfa56a",
  "object": "chat.completion",
  "created": 1722528517,
  "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "<output>We log every single request and response to LLMs in a table, and allow admins to query those at any time via Data Explorer. Have you tried this already?</output>",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 135,
    "total_tokens": 174,
    "completion_tokens": 39
  }
}
Creating evals for our features is certainly on our roadmap for 3.4, especially for tweaking our Related Topics and Summarization features.
I didn’t say that was all there was to it. But I guess it doesn’t matter, since I think the LLM calls are made from Ruby.
I haven’t yet, but this is brilliant - thank you! Theoretically, I could export these and programmatically create traces in LangSmith for evals and experiments.
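Something along these lines is what I have in mind: a rough sketch assuming the export contains the raw request/response payloads shown above (the row field names and the project name are my guesses, and the LangSmith calls are the SDK's generic RunTree pattern, not anything discourse-ai provides):

```ts
// Assumes LANGCHAIN_API_KEY is set in the environment.
import { RunTree } from "langsmith";

// Shape of one exported row; these field names are assumptions about what a
// Data Explorer export of the log table would contain.
interface LoggedCall {
  raw_request_payload: { model: string; messages: unknown[] };
  raw_response_payload: { choices: unknown[] };
}

// Re-create each logged request/response pair as an "llm" run in LangSmith,
// so the runs can later be pulled into datasets for evals and experiments.
async function replayToLangSmith(rows: LoggedCall[]): Promise<void> {
  for (const row of rows) {
    const run = new RunTree({
      name: row.raw_request_payload.model,
      run_type: "llm",
      inputs: { messages: row.raw_request_payload.messages },
      project_name: "discourse-ai-replay", // hypothetical project name
    });
    await run.postRun();                                          // create the run
    await run.end({ choices: row.raw_response_payload.choices }); // record outputs
    await run.patchRun();                                         // flush outputs/end time
  }
}
```

From there, the runs could be added to a LangSmith dataset and used as the baseline for experiments.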