Observability for Discourse AI

Monitoring and evaluating LLMs is critical:

I started working with language models five years ago when I led the team that created CodeSearchNet, a precursor to GitHub Copilot. Since then, I’ve seen many successful and unsuccessful approaches to building LLM products. I’ve found that unsuccessful products almost always share a common root cause: a failure to create robust evaluation systems.

If Discourse AI is to power business-critical LLM tasks, I think supporting monitoring tools like LangSmith should be prioritized.

Using LangSmith is as simple as running `yarn add langchain langsmith` and adding a few environment variables.
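For reference, here is roughly what that setup looks like. I'm showing Python for illustration (the JS SDK reads the same environment variables); the API key, project name, and function below are placeholders:

```python
# Minimal sketch: enabling LangSmith tracing via environment variables.
# The key and project name are placeholders, not real values.
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"       # turn tracing on
os.environ["LANGCHAIN_API_KEY"] = "ls__your-key"  # LangSmith API key
os.environ["LANGCHAIN_PROJECT"] = "discourse-ai"  # project to log runs under

from langsmith import traceable

@traceable(run_type="llm", name="proofread")
def call_llm(prompt: str) -> str:
    # The real model call would go here; the decorator records the function's
    # inputs and outputs as a run in LangSmith.
    return "placeholder response"
```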

Has the Discourse team thought about how we can configure LLM tracing? Also, any thoughts on how we can implement this prior to discourse-ai officially supporting it?


Hahahaha, I wish.

We log every single request and response to LLMs in a table, and allow admins to query those at any time via Data Explorer. Have you tried this already?

{
  "max_tokens": 2000,
  "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
  "temperature": 0,
  "stop": [
    "\n</output>"
  ],
  "messages": [
    {
      "role": "system",
      "content": "You are a markdown proofreader. You correct egregious typos and phrasing issues but keep the user's original voice.\nYou do not touch code blocks. I will provide you with text to proofread. If nothing needs fixing, then you will echo the text back.\nYou will find the text between <input></input> XML tags.\nYou will ALWAYS return the corrected text between <output></output> XML tags.\n\n"
    },
    {
      "role": "user",
      "content": "<input>We log every single request and response to LLMs in a table, and allow admins to query those at any time via Data Explorer. Have you tried already?</input>"
    }
  ]
}
{
  "id": "chat-45cd241b6e0f4a58840fcc9f49dfa56a",
  "object": "chat.completion",
  "created": 1722528517,
  "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "<output>We log every single request and response to LLMs in a table, and allow admins to query those at any time via Data Explorer. Have you tried this already?</output>",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 135,
    "total_tokens": 174,
    "completion_tokens": 39
  }
}

Creating evals for our features is certainly on our roadmap for 3.4, especially to tweak our Related Topics and Summarization features.


I didn’t say that was all there was to it :wink: But I guess it doesn’t matter, since I think the LLM calls are made from Ruby.

I haven’t yet, but this is brilliant - thank you! Theoretically, I could export these and programmatically create traces in LangSmith for evals and experiments.
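Something like this rough sketch is what I have in mind — here loading the exported logs into a LangSmith dataset for evals rather than full traces. The field names for the exported rows are placeholders I'd adjust to whatever the Data Explorer export actually contains:

```python
# Rough sketch: turning exported Discourse AI request/response logs into a
# LangSmith dataset for evals. The field names in `exported_rows` are
# assumptions; adjust them to match the actual export.
import json
from langsmith import Client

client = Client()  # reads LANGCHAIN_API_KEY from the environment

dataset = client.create_dataset(dataset_name="discourse-ai-logs")

# Placeholder rows standing in for an export of the logged payloads.
exported_rows = [
    {
        "request_payload": '{"messages": [{"role": "user", "content": "<input>...</input>"}]}',
        "response_payload": '{"choices": [{"message": {"content": "<output>...</output>"}}]}',
    },
]

for row in exported_rows:
    request = json.loads(row["request_payload"])
    response = json.loads(row["response_payload"])
    client.create_example(
        inputs={"messages": request.get("messages", [])},
        outputs={"completion": response["choices"][0]["message"]["content"]},
        dataset_id=dataset.id,
    )
```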
