Self-Hosting an Open Source LLM for Discourse AI

The Discourse AI plugin has many features that require an LLM to be enabled, such as Summarization, AI Helper, AI Search, and AI Bot. While you can use a third-party API, as covered in Configure API Keys for OpenAI or Configure API Keys for Anthropic, we built Discourse AI from day one not to be locked into those.

Running with HuggingFace TGI

HuggingFace provides an awesome container image that can get you running quickly.

For example:

mkdir -p /opt/tgi-cache
docker run --rm --gpus all --shm-size 1g -p 8080:80 \
  -v /opt/tgi-cache:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.2

This should get you up and running with a local instance of Mistral 7B Instruct listening on localhost port 8080, which you can test with:

curl http://localhost:8080/ \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs":"<s>[INST] What is your favourite condiment? [/INST] Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!</s> [INST] Do you have mayonnaise recipes? [/INST]","parameters":{"max_new_tokens":500, "temperature":0.5,"top_p": 0.9}}'

Running with vLLM

Another option for self-hosting the LLMs Discourse AI supports is vLLM, a very popular project licensed under the Apache License.

Here is how to get started with a model:

mkdir -p /opt/vllm-cache
docker run --gpus all \
  -v /opt/vllm-cache:/root/.cache/huggingface \
  -p 8080:8000 --ipc=host vllm/vllm-openai:latest \
  --model mistralai/Mistral-7B-Instruct-v0.2

You can test it with:

curl -X POST http://localhost:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "mistralai/Mistral-7B-Instruct-v0.2",
      "prompt": "<s> [INST] What was the latest released hero for Dota 2? [/INST] The latest released hero for Dota 2 was",
      "max_tokens": 200
    }'

Making it available for your Discourse instance

Most of the time you will be running this on a dedicated server because of the GPU requirement. When doing so, I recommend fronting it with a reverse proxy that performs TLS termination and secures the endpoint so that it can only be reached by your Discourse instance.
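
If you don't already have a preferred reverse proxy, here is a minimal sketch using Caddy, which obtains TLS certificates automatically. The domain is a placeholder, and you would still want to restrict access, for example with an IP allowlist or an API key check at the proxy:

# Minimal sketch: Caddy terminates TLS and proxies to the LLM on port 8080.
# "llm.example.com" is a placeholder domain pointed at this server.
mkdir -p /opt/caddy-data
docker run -d --name caddy --network host \
  -v /opt/caddy-data:/data \
  caddy:latest \
  caddy reverse-proxy --from llm.example.com --to localhost:8080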

Configuring Discourse AI

Discourse AI ships site settings to configure the inference server for open source models. You should point it at your server using either ai_hugging_face_api_url or ai_vllm_endpoint, depending on which inference software you picked.
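
If you prefer the command line over the admin UI, these site settings can also be set from the Rails console on a standard Discourse Docker install. A sketch, with a placeholder URL:

cd /var/discourse
./launcher enter app
rails c
# Then, inside the console (placeholder URL; use ai_hugging_face_api_url for TGI):
# SiteSetting.ai_vllm_endpoint = "https://llm.example.com"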

After that, change each module to use the model you are running in its model selection settings (a quick sanity check follows this list):

  • ai_helper_model
  • ai_embeddings_semantic_search_hyde_model
  • summarization strategy
  • ai_bot_enabled_chat_bots
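
The sanity check mentioned above: before flipping each module on, it is worth confirming that your Discourse host can reach the inference endpoint at all. This example assumes a TGI backend behind the placeholder URL; adjust the path and payload for vLLM:

curl https://llm.example.com/ \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs":"[INST] Hello [/INST]","parameters":{"max_new_tokens":16}}'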

For anyone searching this topic with/for:
#Llava-Api-keys

I'm using vLLM too. I would also recommend the openchat v3.5 0106 model, a 7B parameter model that performs very well.

I'm actually running it 4-bit quantized so that it runs faster.
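
For reference, here is a sketch of what that can look like with the vLLM image from the guide above, assuming a pre-quantized AWQ checkpoint such as TheBloke/openchat-3.5-0106-AWQ (the checkpoint choice is illustrative):

docker run --gpus all \
  -v /opt/vllm-cache:/root/.cache/huggingface \
  -p 8080:8000 --ipc=host vllm/vllm-openai:latest \
  --model TheBloke/openchat-3.5-0106-AWQ \
  --quantization awq

# vLLM loads the AWQ weights and serves the same OpenAI-compatible API,
# so the earlier curl test works unchanged apart from the "model" field.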