Self-Hosted Guide for Discourse AI

I’m testing the Summarization endpoint:

docker run -d --rm --gpus all --shm-size 1g -p 80:80 -v /mnt:/data \
  -e GPTQ_BITS=4 -e GPTQ_GROUPSIZE=32 -e REVISION=gptq-4bit-32g-actorder_True \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id TheBloke/Upstage-Llama-2-70B-instruct-v2-GPTQ \
  --max-batch-prefill-tokens=12000 --max-total-tokens=12000 --max-input-length=10000 \
  --quantize=gptq --sharded=true --num-shard=$(lspci | grep NVIDIA | wc -l | tr -d '\n') \
  --rope-factor=2
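
For a sanity check once the server is up, I plan to hit TGI's /generate endpoint directly with a request along these lines (the prompt and parameters here are only placeholders, not what the Discourse AI plugin itself sends):

# placeholder request against TGI's /generate endpoint, on port 80 as mapped above
curl http://localhost:80/generate -X POST -H 'Content-Type: application/json' \
  -d '{"inputs": "Summarize the following text: ...", "parameters": {"max_new_tokens": 200}}'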

However, when I start the container I get the following error. This machine has two Tesla T4s, and no other process is accessing the GPUs (see the nvidia-smi output below).

user@gpu2-hc1node:~$ sudo docker logs -f 68e27eb51ee1
2023-12-14T21:30:12.861320Z  INFO text_generation_launcher: Args { model_id: "TheBloke/Upstage-Llama-2-70B-instruct-v2-GPTQ", revision: Some("gptq-4bit-32g-actorder_True"), validation_workers: 2, sharded: Some(true), num_shard: Some(2), quantize: Some(Gptq), speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 10000, max_total_tokens: 12000, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 12000, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "68e27eb51ee1", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: Some(2.0), json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2023-12-14T21:30:12.861350Z  INFO text_generation_launcher: Sharding model on 2 processes
2023-12-14T21:30:12.861441Z  INFO download: text_generation_launcher: Starting download process.
2023-12-14T21:30:19.986231Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.

2023-12-14T21:30:20.771527Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2023-12-14T21:30:20.771941Z  INFO shard-manager: text_generation_launcher: Starting shard rank=1
2023-12-14T21:30:20.771967Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2023-12-14T21:30:27.769624Z  WARN text_generation_launcher: Disabling exllama v2 and using v1 instead because there are issues when sharding

2023-12-14T21:30:27.997163Z  WARN text_generation_launcher: Disabling exllama v2 and using v1 instead because there are issues when sharding

2023-12-14T21:30:28.046134Z  WARN text_generation_launcher: Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2

2023-12-14T21:30:28.071687Z  WARN text_generation_launcher: Could not import Mistral model: Mistral model requires flash attn v2

2023-12-14T21:30:28.072298Z  WARN text_generation_launcher: Could not import Mixtral model: Mistral model requires flash attn v2

2023-12-14T21:30:28.241375Z  WARN text_generation_launcher: Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2

2023-12-14T21:30:28.262756Z  WARN text_generation_launcher: Could not import Mistral model: Mistral model requires flash attn v2

2023-12-14T21:30:28.263363Z  WARN text_generation_launcher: Could not import Mixtral model: Mistral model requires flash attn v2

2023-12-14T21:30:30.786133Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2023-12-14T21:30:30.786133Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2023-12-14T21:30:40.348755Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 191, in get_multi_weights_col
    qweight = torch.cat(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacty of 14.76 GiB of which 74.75 MiB is free. Process 19973 has 14.68 GiB memory in use. Of the allocated memory 13.73 GiB is allocated by PyTorch, and 74.36 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Here is the nvidia-smi output after the model crashes:

Thu Dec 14 15:39:55 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.223.02   Driver Version: 470.223.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:86:00.0 Off |                    0 |
| N/A   54C    P0    28W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:AF:00.0 Off |                    0 |
| N/A   55C    P0    28W /  70W |      0MiB / 15109MiB |      5%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

When I start the model, I can see GPU usage increase to about 100% on both GPUs, and then the container crashes.
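
For reference, this is how I watch usage climb while the shards load (any equivalent GPU monitor would show the same thing):

# refresh nvidia-smi every second while the shards load
watch -n 1 nvidia-smi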