How to add a new Chat Bot connected to a self-hosted LLM?

I want to add a new “Chat Bot” and link it to a self-hosted LLM.
I have tried the “ai hugging face model display name” setting, but the value doesn’t seem to appear anywhere; perhaps I have to reference it in the prompts associated with a persona?
I have also tried to “create” a new bot via the “ai bot enable chat bots” drop-down, but anything I create shows up in the chatbot drop-down as “[en.discourse_ai.ai_bot.bot_names.XXXX]”, where XXXX is the name I provided.
Any pointers to documentation or a guide on how to do this would be appreciated.

Can anyone offer suggestions, or is this a known limitation?

@Roman_Rizzi is working on refactoring this section; expect more news in the coming weeks.


Am I interpreting this correctly that it is currently not possible to use a self-hosted LLM, but that this will change soon?

It is not possible atm, but hopefully in a week or 2 we will have this working.

Thanks. I was surprised it didn’t work, since OpenAI is supported. I think many people run their own LLMs behind an OpenAI-compatible endpoint. I will look forward to the update in 2 weeks :slight_smile:
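
For anyone unfamiliar, this is roughly what talking to a self-hosted model through an OpenAI-compatible endpoint looks like (the base URL, port, and model name below are just placeholders for whatever your local server, e.g. vLLM or llama.cpp’s server, exposes):

```python
# Minimal sketch: querying a self-hosted LLM through an OpenAI-compatible
# endpoint. The base_url, port, and model name are assumptions; substitute
# whatever your local server (vLLM, llama.cpp server, etc.) actually serves.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # local server, not api.openai.com
    api_key="not-needed",                 # most local servers ignore the key
)

response = client.chat.completions.create(
    model="mistral-7b-instruct",          # hypothetical local model name
    messages=[{"role": "user", "content": "Summarise this topic in one line."}],
)
print(response.choices[0].message.content)
```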


Out of interest, @Isambard, what’s your estimate of how much a sufficiently powerful local LLM will cost you to host per month (dollar equivalent)?

A minimum of about $5 per month in additional electricity for the GPU at idle, although in reality the incremental cost for Discourse is zero, since I already run the LLM for other purposes.

But for sure, it would be more economical for small forums with low usage to use an LLM as a service. Though at the scale of Discourse’s hosted offering, I suspect it might make sense to host internally (and also to develop knowledge of an area that is likely to be important).


And $15k for the A100?

Which model in particular are you running locally?


I’m running several different things. For Discourse, I will run a 7B model based on Mistral and fine-tuned for the tasks. I’m looking at various BERT-like models for classification tasks and am still undecided on embeddings. This runs on a second-hand 3090 Ti that I bought for $700.
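
For context, a classification task like that can be run locally in just a few lines with the Hugging Face transformers library (the checkpoint name below is only a placeholder, not the model I’d actually pick):

```python
# Rough sketch of running a BERT-like classifier locally with transformers.
# The checkpoint is a stock example, not a recommendation for forum tasks.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # example checkpoint
    device=0,  # run on the GPU
)

print(classifier("This forum post was really helpful, thanks!"))
# [{'label': 'POSITIVE', 'score': 0.99...}]
```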

I would love to have an A100, but instead I built a separate 4-GPU system ‘on the cheap’ for only $1,000 that runs Llama 3 70B (Q4) at over 20 tok/s.
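
The rough memory arithmetic for why a 70B model at 4-bit has to be spread across several cards looks like this (the bytes-per-parameter figure and cache allowance are rules of thumb, not measurements):

```python
# Rough VRAM estimate for a 70B-parameter model quantised to ~4 bits per weight.
# The 0.55 bytes/param and the KV-cache allowance are assumed rules of thumb.
params = 70e9
bytes_per_param = 0.55                       # ~4-bit weights + quantisation metadata
weights_gb = params * bytes_per_param / 1e9  # ~38.5 GB for the weights alone
kv_cache_gb = 5.0                            # assumed allowance for KV cache / activations

total_gb = weights_gb + kv_cache_gb
print(f"~{total_gb:.0f} GB of VRAM needed, too big for any single consumer card,")
print("so the weights have to be sharded across multiple GPUs.")
```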

For sure, in many/most cases it would make sense to just go with a provider; however, it might make sense to DIY if:

  • You want to learn
  • You want control and certainty over your models (so you don’t lose access to them, and aren’t beholden to a company to use their non-public embeddings)
  • You have a lot of bulk processing to do which would be cheaper to do in-house
  • You want reserved and reliable capacity (there are limits on both requests and tokens available from providers) for bulk processing

I benchmarked the 3090 and was getting a maximum sustained throughput of around 2600 tokens per second running Llama 3 8B at FP16. I live in a region with expensive electricity, but running continuously at a 285 W power limit, it would cost around $0.007 per million output tokens, or roughly $0.01 per million tokens if you fully depreciate the equipment cost over 3 years.
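
For anyone who wants to plug in their own numbers, the back-of-the-envelope calculation looks like this (the $0.25/kWh electricity price is an assumption; adjust it to your own rate):

```python
# Back-of-the-envelope cost per million output tokens for a self-hosted GPU.
# Inputs: 2600 tok/s sustained, 285 W power limit, $700 hardware cost,
# and an assumed electricity price of $0.25/kWh.
throughput_tps = 2600          # sustained output tokens per second
power_kw = 0.285               # GPU power limit in kilowatts
electricity_per_kwh = 0.25     # assumed $/kWh
hardware_cost = 700            # purchase price of the card in $
depreciation_years = 3

seconds_per_million = 1_000_000 / throughput_tps            # ~385 s per 1M tokens
energy_kwh = power_kw * seconds_per_million / 3600          # ~0.03 kWh
electricity_cost = energy_kwh * electricity_per_kwh         # ~$0.008 per 1M tokens

lifetime_tokens_m = throughput_tps * depreciation_years * 365 * 24 * 3600 / 1_000_000
depreciation_cost = hardware_cost / lifetime_tokens_m       # ~$0.003 per 1M tokens

print(f"electricity only: ${electricity_cost:.4f} per 1M tokens")
print(f"with depreciation: ${electricity_cost + depreciation_cost:.4f} per 1M tokens")
```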

This compares quite favourably to Claude Haiku, provided you have a reasonable utilization rate.
