vLLM is a high-throughput, memory-efficient inference engine that makes serving large language models such as LLaMA-2 straightforward, across platforms ranging from ROCm to various clouds via SkyPilot. This guide documents a Docker deployment issue related to NVIDIA GPU access and walks through its resolution.
Deploying vLLM with Docker requires NVIDIA GPU support, but Docker can fail to access the GPU and report errors. The steps below resolve that failure so vLLM can use the GPU resources it needs.
Problem
While deploying vLLM with the following Docker command:
docker run --runtime nvidia --gpus all ...
I was met with:
docker: Error response from daemon: unknown or invalid runtime name: nvidia.
Removing --runtime nvidia led to a different error: Docker could not select a device driver with GPU capabilities.
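Before changing anything in Docker, it is worth confirming that the host's NVIDIA driver itself works; if nvidia-smi fails on the host, no container runtime configuration will help. A minimal check, assuming the NVIDIA driver is already installed on the host:
# Confirm the host NVIDIA driver is working before touching Docker
nvidia-smi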
Resolution Steps:
Verifying nvidia-docker2 Installation
It’s crucial to have nvidia-docker2 installed for Docker to interface with NVIDIA GPUs. Begin by verifying its presence:
# Check if nvidia-docker2 is installed
dpkg -l | grep nvidia-docker
docker info | grep nvidia
Installing nvidia-docker2
If nvidia-docker2 is missing, install it to bridge Docker with NVIDIA GPUs:
# Update package lists and install nvidia-docker2
sudo apt-get update
sudo apt-get install -y nvidia-docker2
# Restart Docker to apply changes
sudo systemctl restart docker
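On a stock Ubuntu system, apt may not find the nvidia-docker2 package because it is distributed from NVIDIA's own repository rather than the default Ubuntu archives. The sketch below follows NVIDIA's legacy nvidia-docker repository setup; the exact URLs and steps may differ for newer releases, so check the current NVIDIA Container Toolkit documentation first:
# Add NVIDIA's package repository (only needed if apt cannot find nvidia-docker2)
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
    sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update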
Confirming the Installation
Ensure the installation was successful by inspecting Docker’s runtime configuration and checking the installed version:
# Confirm Docker's runtime configuration for NVIDIA
cat /etc/docker/daemon.json
>>> {
      "runtimes": {
        "nvidia": {
          "path": "nvidia-container-runtime",
          "runtimeArgs": []
        }
      }
    }
# Verify the nvidia-docker2 version
dpkg -l | grep nvidia-docker
# Check Docker runtimes for NVIDIA support
docker info | grep nvidia
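As an optional end-to-end check, you can run nvidia-smi inside a plain CUDA container before pulling the vLLM image. The CUDA image tag below is only an example and may need to be adjusted to one currently published on Docker Hub:
# Sanity check: the GPU should be visible from inside a container
docker run --rm --runtime nvidia --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi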
Successful vLLM Deployment
With NVIDIA GPU support enabled, run the following Docker command to deploy vLLM:
# Deploy vLLM with NVIDIA GPU support
docker run --runtime nvidia --gpus all \
    -v /home/likxun/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=$(cat /home/likxun/.cache/huggingface/token)" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model TheBloke/Mistral-7B-Instruct-v0.1-GPTQ \
    --quantization "gptq" \
    --dtype "half"
Testing the Deployment
Verify that the vLLM server is operational by sending a test request:
# Test the vLLM deployment
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ",
        "messages": [{"role": "user", "content": "What is 2+2?"}]
    }'
Expected Output:
{
  "id": "cmpl-afbe2ffa3e0d4779ba28ff8afae5b6a9",
  "object": "chat.completion",
  "created": 1711946527,
  "model": "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": " 2+2 is 4."
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 16,
    "total_tokens": 25,
    "completion_tokens": 9
  }
}
This response indicates that the vLLM deployment is successful and capable of processing requests.
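Since the server exposes an OpenAI-compatible API, the same deployment can also be exercised through the plain completions endpoint; the prompt and max_tokens values below are arbitrary examples:
# Example request against the OpenAI-compatible completions endpoint
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ",
        "prompt": "2+2=",
        "max_tokens": 8
    }'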
Conclusion
This guide provided a detailed walkthrough for resolving Docker and NVIDIA GPU integration issues, ensuring a successful vLLM deployment. By following these steps, users can overcome common hurdles, enabling efficient and effective model serving with GPU support.