We are running the full model, but in its smallest version, backed by Mistral 7B. It takes 21GB of VRAM on our single-A100 servers, and it runs via the ghcr.io/xfalcox/llava:latest container image.
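
For a sense of how a backend might talk to that container once it's up, here is a minimal sketch of an HTTP client. The port, endpoint path, and request/response schema are assumptions for illustration only; check the container's documentation for its actual API.

```python
# Minimal sketch of calling the LLaVA microservice over HTTP.
# Port, path, and payload shape are assumed, not taken from the container's docs.
import base64
import requests

LLAVA_URL = "http://localhost:8080/predict"  # assumed host/port/path


def caption_image(path: str, prompt: str = "Describe this image.") -> str:
    """Send an image plus a prompt to the LLaVA service and return its reply."""
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = requests.post(
        LLAVA_URL,
        json={"image": image_b64, "prompt": prompt},  # assumed request schema
        timeout=120,
    )
    response.raise_for_status()
    return response.json().get("content", "")  # assumed response field


if __name__ == "__main__":
    print(caption_image("example.jpg"))
```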
Sadly, the ecosystem for multi-modal models isn't as mature as the one for text-to-text models, so we can't yet leverage inference servers like vLLM or TGI and are left with these one-off microservices. This may change this year, since multimodal support is on the vLLM roadmap, but until then we can at least test the waters with these services.