An intern deployed our Discourse site on DigitalOcean, with the Discourse AI plugin connected to an OpenAI API endpoint. The site is working great. The intern suggested looking at HuggingFace TGI, and I’m trying to give them guidance on whether they are on the right track. I believe they are suggesting self-hosted HuggingFace TGI to reduce costs. However, when I look at the GPU hosting costs, it seems expensive.
I could ask the intern to propose specific services and costs, but I’m trying to help with strategic guidance. The alternative is for the intern to continue testing OpenAI, Anthropic, and Gemini.
Is there any advice on what I should assign the intern?
The basic idea is to implement Discourse AI on a production deployment of Discourse and then ask the customer (the one funding the community) to pay some additional service fee to maintain the AI and promote the new features.
As for intern assignments, I could also ask them to look at the Hugging Face Inference API. Is it cheaper than using the OpenAI API?
Is anyone using specific services from Google Cloud, AWS, or Azure to host TGI?
For example, on AWS, should they look at g4dn.xlarge or g5.xlarge?
For a single instance, it will be hard to beat API pricing: with an API you pay per call, whereas with TGI you pay for every hour the server is running, whether or not it is serving requests.
Let’s say you are running Llama 3.1 8B on a g6.xlarge; that will cost you approximately $600 a month. The same money could buy you around 450M tokens from Anthropic Claude 3.5 Haiku.
Running your own LLM makes sense when you need either privacy or scale.
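To make that comparison concrete, here is a rough sketch of the arithmetic. The only inputs taken from this thread are the ~$600/month server figure and the ~450M tokens figure; the function names and the example monthly volumes are hypothetical, and real API pricing differs between input and output tokens, so treat the per-million price as a blended estimate.

```python
# Rough break-even sketch: flat monthly server cost vs. pay-per-token API.
# Figures from the thread: ~$600/month server, ~450M tokens for the same money.
# Everything else (names, example volumes) is an illustrative assumption.

def implied_price_per_million(monthly_usd: float, tokens: float) -> float:
    """Blended API price per 1M tokens implied by 'X dollars buys Y tokens'."""
    return monthly_usd / (tokens / 1_000_000)

def api_monthly_cost(tokens_per_month: float, price_per_million_usd: float) -> float:
    """What the API would cost for a given monthly token volume."""
    return tokens_per_month / 1_000_000 * price_per_million_usd

def cheaper_option(tokens_per_month: float, server_monthly_usd: float,
                   price_per_million_usd: float) -> str:
    """Return 'api' or 'self-hosted', whichever costs less at this volume."""
    api_cost = api_monthly_cost(tokens_per_month, price_per_million_usd)
    return "api" if api_cost < server_monthly_usd else "self-hosted"

SERVER_MONTHLY_USD = 600.0          # from the thread
TOKENS_FOR_SAME_MONEY = 450_000_000  # from the thread

price = implied_price_per_million(SERVER_MONTHLY_USD, TOKENS_FOR_SAME_MONEY)

# Hypothetical monthly volumes for a community site
for volume in (5_000_000, 50_000_000, 1_000_000_000):
    cost = api_monthly_cost(volume, price)
    print(f"{volume:>13,} tokens: API ~${cost:,.2f} -> {cheaper_option(volume, SERVER_MONTHLY_USD, price)}")
```

Under these assumptions, the API stays cheaper until monthly usage approaches the 450M-token mark, which is why a single self-hosted instance rarely pays off on cost alone.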
Thank you for your response. $600/month for Llama 3.1 8B on a g6.xlarge would be a reasonable cost, but as you graciously pointed out, the API would be cheaper. Thus, we’ll likely go with OpenAI and the other APIs. What are the privacy concerns?
For the purpose of experimenting with HuggingFace TGI, is there anything cheaper than $600/month that we could use for testing? For example, can the intern turn off the GPU instance when they are not working? I’m trying to figure out what to recommend to them. I am somewhat confused about the costs of GPU-enabled containers, and I don’t want to put the burden of the cost recommendation on the intern. If they make a mistake with the purchase of a container, they may feel bad.
What I’d like to do is buy the resources for them, then have them test HuggingFace TGI on the resources I purchased. They can then report back on any differences in performance or output quality.
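For budgeting the experiment, a back-of-the-envelope sketch of what stopping the instance outside working hours might save. It assumes the ~$600/month figure above reflects an instance billed by the hour and running around the clock, and that a stopped instance accrues no compute charges (on most clouds, attached storage is still billed while stopped, which is not modeled here). The function names and the 40 hours/week schedule are hypothetical.

```python
# Sketch: cost of a GPU instance that only runs during working hours.
# Assumes hourly billing and that the ~$600/month figure from this thread
# is for 24/7 operation. Storage charges while stopped are NOT included.

HOURS_PER_MONTH = 730        # average hours in a month (assumption)
WEEKS_PER_MONTH = 52 / 12    # average weeks in a month (assumption)

def hourly_rate_from_monthly(monthly_usd: float) -> float:
    """Hourly rate implied by an always-on monthly cost."""
    return monthly_usd / HOURS_PER_MONTH

def part_time_monthly_cost(monthly_usd: float, hours_per_week: float) -> float:
    """Approximate monthly compute cost if the instance runs hours_per_week."""
    hours_used = hours_per_week * WEEKS_PER_MONTH
    return hourly_rate_from_monthly(monthly_usd) * hours_used

always_on = 600.0  # from the thread
for hours in (10, 20, 40):  # hypothetical intern schedules
    print(f"{hours} h/week: ~${part_time_monthly_cost(always_on, hours):,.2f}/month")
```

The point of the sketch is only that compute cost scales with hours running, so a testing budget can be set from the number of hours the intern actually needs the GPU.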