I want to add a new “Chat Bot” and link it to a self-hosted LLM.
I have tried to use the “ai hugging face model display name” setting, but that value doesn’t seem to appear anywhere; perhaps I have to reference it in the prompts associated with a persona?
I have also tried to “create” a new bot via the “ai bot enable chat bots” drop-down, but anything I create shows up in the chat bot drop-down as “[en.discourse_ai.ai_bot.bot_names.XXXX]”, where XXXX is the name I provided.
Any tips or pointers to documentation or a guide on how to do this would be appreciated.
Can anyone offer any suggestions, or is this a known limitation?
@Roman_Rizzi is working on refactoring this section; expect more news in the coming weeks.
Am I interpreting this correctly that it is currently not possible to use a self-hosted LLM, but that this will change soon?
It is not possible atm, but hopefully in a week or 2 we will have this working.
Thanks. I was surprised it didn’t work, since OpenAI is supported. I think many people run their own LLMs behind an OpenAI-compatible endpoint. I will look forward to the update in 2 weeks.
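For context, this is what I mean by an OpenAI-compatible endpoint: local servers such as vLLM or llama.cpp’s server expose the same chat completions API, so the standard client only needs a different base URL. A rough sketch (the URL, key, and model name below are just placeholders for a local setup):

```python
# Sketch: pointing the standard OpenAI client at a self-hosted,
# OpenAI-compatible server (e.g. vLLM or llama.cpp's server).
# The base_url and model name are placeholders for a local setup.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # local inference server
    api_key="not-needed",                 # most local servers ignore the key
)

response = client.chat.completions.create(
    model="mistral-7b-instruct",          # whatever model the server has loaded
    messages=[{"role": "user", "content": "Summarise this topic in one sentence."}],
)
print(response.choices[0].message.content)
```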
Out of interest @Isambard what’s your estimate for how much a sufficiently powerful local LLM will cost you to host on a monthly basis (dollar equivalent)?
A minimum of about $5 per month in additional electricity to keep the GPU idling, although in reality the incremental cost for Discourse is zero, since I already run the LLM for other purposes.
But for sure, it would be more economical for small forums with low usage to use an LLM as a service. At the scale of Discourse’s hosted offering, though, I suspect it might make sense to host internally (and also to develop knowledge of an area that is likely to be important).
And $15k for the A100?
What model particularly are you running locally?
I’m running several different things. For Discourse stuff, I will run a 7B model based on Mistral and fine-tuned for the tasks. I’m looking at various BERT-like models for classification tasks and am still undecided on the embeddings. This runs on a second-hand 3090 Ti which I bought for $700.
I would love to have an A100, but instead I built a separate 4-GPU system ‘on the cheap’ for only $1,000 that runs Llama 3 70B Q4 at over 20 tok/s.
In many (if not most) cases it would make sense to just go with a provider; however, DIY might make sense if:
- You want to learn
- You want control and certainty over your models (so you don’t lose access to them, and aren’t beholden to a company to keep using their non-public embeddings)
- You have a lot of bulk processing to do which would be cheaper to do in-house
- You want reserved and reliable capacity (there are limits on both requests and tokens available from providers) for bulk processing
I benchmarked the 3090 and was getting a maximum sustained throughput of around 2600 tokens per second running Llama 3 8B FP16. I live in an expensive electricity region, but running continuously at a 285 W power limit it would cost around $0.007 per million output tokens, or roughly $0.01 per million tokens if you fully depreciate the equipment cost over 3 years.
This compares quite favourably to Claude Haiku, provided you have a reasonable utilization rate.
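Rough math behind those numbers, as a sketch (the ~$0.25/kWh electricity rate is an assumption for an expensive region, not an exact figure from my bill):

```python
# Back-of-envelope cost per million output tokens for the 3090 benchmark.
# Assumptions: ~$0.25/kWh electricity, $700 card depreciated over
# 3 years of continuous use.
throughput_tok_s = 2600      # sustained Llama 3 8B FP16 throughput
power_w = 285                # power limit during the run
electricity_per_kwh = 0.25   # assumed rate

hours_per_million = 1_000_000 / throughput_tok_s / 3600            # ~0.107 h
electricity_cost = hours_per_million * (power_w / 1000) * electricity_per_kwh
depreciation_per_hour = 700 / (3 * 365 * 24)                        # ~$0.027/h
total_cost = electricity_cost + hours_per_million * depreciation_per_hour

print(f"electricity only:  ${electricity_cost:.4f} / 1M tokens")    # ~= $0.008
print(f"with depreciation: ${total_cost:.4f} / 1M tokens")          # ~= $0.010
```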
I made an interesting discovery: the web server that I’m hosting my forum on has sufficient grunt to run a small LLM at modest speeds (6 tok/s without batching) even without a GPU. This will be useful for offline/background tasks.
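If anyone wants to try the same, this is roughly how I’d run it CPU-only with llama-cpp-python, as a sketch (the GGUF path, model choice, and thread count are placeholders for my own setup):

```python
# Sketch: CPU-only inference with llama-cpp-python for background tasks.
# The model path and thread count are placeholders; pick a small
# quantised model that fits in the web server's RAM.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",
    n_threads=8,   # leave spare cores so the forum itself isn't starved
    n_ctx=4096,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Suggest tags for: 'My SSL cert expired after renewal'"}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```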