Discourse AI - Self-Hosted Guide

Is there any particular reason these Docker commands aren’t in detached mode (missing -d)?

If you mean “Shouldn’t I launch these with -d?”, the answer is probably yes.

If you really mean “Why didn’t the OP tell me to launch these commands with -d?”, I think they are intended to be just examples of how you might start them up and make them work at all. In practice, you’ll want to do some <other stuff> to launch them in a way that would make them useful in production.
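As a rough sketch of what some of that “other stuff” might look like: run the container detached, give it a name, and add a restart policy so it comes back after a crash or reboot. The image name, container name, and container port below are placeholders (assumptions, not taken from the guide), and the command is only printed here rather than executed.

```shell
# Placeholders (assumptions, not from the guide): substitute the real values.
IMAGE="example/ai-service:latest"
NAME="ai-service"
HOST_PORT=6666        # use a different host port for each service
CONTAINER_PORT=80     # assumed; check what the image actually listens on

# -d runs in the background; --restart unless-stopped brings the container
# back after a crash or reboot; --name makes it easy to find in `docker ps`.
CMD="docker run -d --restart unless-stopped --name $NAME -p $HOST_PORT:$CONTAINER_PORT $IMAGE"

# Dry run: print the command. Run $CMD instead of echoing it to launch for real.
echo "$CMD"
```

Printing first makes it easy to double-check ports and names before anything is started.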

That’s exactly my question and you hit the nail right on the head. It’s been a while since I configured my Docker instances, but this is coming back now. When you say “other stuff” that I should do to make them useful for production, is there something else that should be screaming at me “DO THIS!” (besides the obvious of changing the port number from the same 6666 in each of the docker instances)?

OK. So for the pipe-separated API keys, are those completely arbitrary? As the service host, do we just specify whatever alphanumeric key(s) we want to accept from the client?

How is changing the port any less “Obvious” than having it run in the background?

That’s the thing. Without having intimate knowledge of what you think is obvious, it’s impossible to answer the question. Mostly, if you are not pretty sure that you know how to make the stuff useful, then you probably need help that you can’t get here. :person_shrugging:

Because I have run dozens of Docker containers in the past. I just haven’t touched Docker, or dived in deep, in the last two years. It wasn’t obvious at first since I hadn’t touched it in a while, but this fundamental Docker knowledge came back during the discussion.

That’s the rub, though. Sometimes the obvious isn’t obvious, even for those like myself who have experience with different Docker systems. One could interpret what you said as “you should know the answer to the question before you ask it.” Understand that some of us run communities as a volunteer service and do not spend 24/7 learning the most intimate details of Discourse down to the Postgres data structures and such. I feel like you were shutting me down, and that wasn’t appreciated in what should be a community forum where everyone should be freely and happily helping each other.

To the point here, I’ve done some Googling to try to ascertain how API_KEYS is supposed to be used and have come up short. I understand I may be missing the obvious, and that might be downright frustrating for a Discourse professional like yourself with extensive knowledge of the platform down to the lowest level, but I am trying to have a community discussion here so that others who aren’t necessarily at your skill level yet can benefit too. After all, the point is for people other than the developers of Discourse to be able to use this software as well.

I feel your pain. Coming across some service I have running and little idea how I started it happens more than I’d like.

Right. Even instructions that we write for ourselves don’t make sense when we need them.

Sorry. I didn’t mean to be rude or mean, and it looks like I was. My point was just that it’s hard enough supporting people running the Standard Install, so figuring out what your skills are, how you’re planning to launch it, whether it’ll be on the open internet, and whether you know how to or want to have it protected with https (you probably do if you think you’re protecting it with API keys) is hard.

Yeah. If you’re putting this somewhere that someone else can contact it, I think you’ll want to define that API_KEYS variable and find some way to generate some random-ish thing to use as a key. And then you’d enter that same key in the settings of the plugin. That’s what I did. I didn’t check that using the wrong key would break it, which TBH, I think I should have. Maybe I’ll do that on the instance I’m about to add the plugin to.

But it might be nicer if the OP included the -d and set the API_KEYS env variable.

The API_KEYS env knob is an optional one you can use if you, for any reason, want to restrict the service to only clients who supply one of the configured API_KEYS in their header.

It’s something you don’t really need if running it internally for a single instance, but it may be useful if running it across the internet or in a shared environment.
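For anyone wondering how to come up with the “random-ish thing”, here is one possible sketch. It generates two keys from /dev/urandom and joins them with a pipe, as discussed earlier in the thread. The `gen_key` helper and the `IMAGE` placeholder are made up for illustration, and the docker command is only printed, not run.

```shell
# Hypothetical helper: 32 random bytes, hex-encoded (64 characters).
gen_key() {
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

KEY_A="$(gen_key)"
KEY_B="$(gen_key)"

# Several accepted keys are joined with a pipe.
API_KEYS="$KEY_A|$KEY_B"

# Dry run: show how the list would be passed with -e. IMAGE is a placeholder.
echo docker run -d -e "API_KEYS=$API_KEYS" IMAGE
```

One of the generated keys would then go into the matching API key setting of the plugin, so that the client and the service agree.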

Thanks @Falco and @pfaffman for your help, and sorry if I derailed things here! Help from both of you has been greatly appreciated! :smiley:

Can all of these services be used by multiple Discourse installations, or should these be run on a per-site basis?

They are all safe to share between instances.

Is it still possible to use Summarization with OpenAI API keys?

Yes, fill in the keys and pick an OpenAI model in the summarization setting.

There is one small issue if the topic is in a language other than English, or in a minor language: sometimes the summary uses the right language, and then it suddenly switches to English. Either way, the language change seems to happen totally at random.

I’m testing the Summarization endpoint:

docker run -d --rm --gpus all --shm-size 1g -p 80:80 -v /mnt:/data -e GPTQ_BITS=4 -e GPTQ_GROUPSIZE=32 -e REVISION=gptq-4bit-32g-actorder_True ghcr.io/huggingface/text-generation-inference:latest --model-id TheBloke/Upstage-Llama-2-70B-instruct-v2-GPTQ --max-batch-prefill-tokens=12000 --max-total-tokens=12000 --max-input-length=10000 --quantize=gptq --sharded=true --num-shard=$(lspci | grep NVIDIA | wc -l | tr -d '\n') --rope-factor=2

However, when I run it I get the following error. This machine has two Tesla T4s and no other process is accessing the GPUs. See usage below.

user@gpu2-hc1node:~$ sudo docker logs -f 68e27eb51ee1
2023-12-14T21:30:12.861320Z  INFO text_generation_launcher: Args { model_id: "TheBloke/Upstage-Llama-2-70B-instruct-v2-GPTQ", revision: Some("gptq-4bit-32g-actorder_True"), validation_workers: 2, sharded: Some(true), num_shard: Some(2), quantize: Some(Gptq), speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 10000, max_total_tokens: 12000, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 12000, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "68e27eb51ee1", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: Some(2.0), json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2023-12-14T21:30:12.861350Z  INFO text_generation_launcher: Sharding model on 2 processes
2023-12-14T21:30:12.861441Z  INFO download: text_generation_launcher: Starting download process.
2023-12-14T21:30:19.986231Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.

2023-12-14T21:30:20.771527Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2023-12-14T21:30:20.771941Z  INFO shard-manager: text_generation_launcher: Starting shard rank=1
2023-12-14T21:30:20.771967Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2023-12-14T21:30:27.769624Z  WARN text_generation_launcher: Disabling exllama v2 and using v1 instead because there are issues when sharding

2023-12-14T21:30:27.997163Z  WARN text_generation_launcher: Disabling exllama v2 and using v1 instead because there are issues when sharding

2023-12-14T21:30:28.046134Z  WARN text_generation_launcher: Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2

2023-12-14T21:30:28.071687Z  WARN text_generation_launcher: Could not import Mistral model: Mistral model requires flash attn v2

2023-12-14T21:30:28.072298Z  WARN text_generation_launcher: Could not import Mixtral model: Mistral model requires flash attn v2

2023-12-14T21:30:28.241375Z  WARN text_generation_launcher: Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2

2023-12-14T21:30:28.262756Z  WARN text_generation_launcher: Could not import Mistral model: Mistral model requires flash attn v2

2023-12-14T21:30:28.263363Z  WARN text_generation_launcher: Could not import Mixtral model: Mistral model requires flash attn v2

2023-12-14T21:30:30.786133Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2023-12-14T21:30:30.786133Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2023-12-14T21:30:40.348755Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 191, in get_multi_weights_col
    qweight = torch.cat(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacty of 14.76 GiB of which 74.75 MiB is free. Process 19973 has 14.68 GiB memory in use. Of the allocated memory 13.73 GiB is allocated by PyTorch, and 74.36 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

nvidia-smi after the model crashes.

Thu Dec 14 15:39:55 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.223.02   Driver Version: 470.223.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:86:00.0 Off |                    0 |
| N/A   54C    P0    28W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:AF:00.0 Off |                    0 |
| N/A   55C    P0    28W /  70W |      0MiB / 15109MiB |      5%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

When I start the model, I can see GPU usage increase to about 100% on both GPUs, and then it crashes.

Two T4s are far too little for that model. You can try something like a prompt-compatible 7B model on those.
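A back-of-the-envelope check makes the gap concrete. This sketch counts only the quantized weights (70 billion parameters at 4 bits each, per the GPTQ settings in the command above) and compares them with the 15109 MiB per card reported by nvidia-smi. It ignores the KV cache, activations, and runtime overhead, so the real requirement is higher still.

```shell
# Weights only: 70 billion parameters at 4 bits each.
PARAMS_BILLION=70
BITS_PER_PARAM=4
WEIGHTS_GB=$(( PARAMS_BILLION * BITS_PER_PARAM / 8 ))             # decimal GB

# Convert decimal GB to MiB for a like-for-like comparison.
WEIGHTS_MIB=$(( WEIGHTS_GB * 1000 * 1000 * 1000 / 1024 / 1024 ))

# Two T4s at 15109 MiB each, per the nvidia-smi output above.
TOTAL_MIB=$(( 2 * 15109 ))

echo "weights: ${WEIGHTS_MIB} MiB, available: ${TOTAL_MIB} MiB"
# prints "weights: 33378 MiB, available: 30218 MiB"
```

So the weights alone exceed the combined memory of both cards, before any room is left for the 12000-token context requested on the command line.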

I was able to get the following model to run on a T4:

sudo docker run --gpus all --shm-size 1g -p 80:80 -v /home/deeznnutz/discourse/data:/data ghcr.io/huggingface/text-generation-inference:1.3 --model-id tiiuae/falcon-7b-instruct --max-batch-prefill-tokens 2048

I am able to test it locally and it works:

curl https://Public_URL/generate -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' -H 'Content-Type: application/json'
{"generated_text":"\nDeep learning is a branch of machine learning that uses artificial neural networks to learn and make decisions."}

However, when I try to run it in Discourse, with these settings

ai summarization discourse service api endpoint: https://URL/generate/
ai summarization discourse service api key: random numbers
summarization strategy: Discourse AI's long-t5-tglobal....-book-summary

I get the following error.

Message (6 copies reported)

Job exception: Net::HTTPBadResponse


Backtrace

/var/www/discourse/plugins/discourse-ai/lib/inference/discourse_classifier.rb:13:in `perform!'
/var/www/discourse/plugins/discourse-ai/lib/summarization/strategies/truncate_content.rb:46:in `completion'
/var/www/discourse/plugins/discourse-ai/lib/summarization/strategies/truncate_content.rb:42:in `summarize_with_truncation'
/var/www/discourse/plugins/discourse-ai/lib/summarization/strategies/truncate_content.rb:23:in `summarize'
/var/www/discourse/app/services/topic_summarization.rb:38:in `summarize'
/var/www/discourse/app/jobs/regular/stream_topic_summary.rb:25:in `execute'
/var/www/discourse/app/jobs/base.rb:292:in `block (2 levels) in perform'
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/rails_multisite-5.0.0/lib/rails_multisite/connection_management.rb:82:in `with_connection'
/var/www/discourse/app/jobs/base.rb:279:in `block in perform'
/var/www/discourse/app/jobs/base.rb:275:in `each'

You need to set the URL of that service under ai_hugging_face_api_url

Looks like the available summarization strategies do not support the model I’m running.

ghcr.io/huggingface/text-generation-inference:1.3 --model-id tiiuae/falcon-7b-instruct

I’m just wondering: if you install and run the Toxicity classification service, how do you deactivate or properly uninstall it? Thanks.