The latest vLLM Docker image does not yet fully support Qwen2.5-VL; you would need to manually update the transformers library and patch the bitsandbytes (bnb) quantization code. If you want an easy way to run the quantized or full-size model, you can use my re-packed image.
If any of you are interested in trying this model in Docker, with or without quantization, you can use my re-packed vLLM image: motorbottle/vllm-qwen2_5_vl-fixed:v0.1.0. Please note that the NVIDIA container runtime is required.
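Before pulling the image, you can confirm the NVIDIA runtime works. This is just a quick sanity check under the assumption that the NVIDIA Container Toolkit is installed; the CUDA base image tag is an arbitrary example:

```shell
# Should print your GPU table; if this fails, install/configure
# the NVIDIA Container Toolkit before continuing.
sudo docker run --rm --runtime nvidia --gpus all \
  nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```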
How to start the container:
sudo docker run --runtime nvidia --gpus all --ipc=host -p 18434:8000 \
-v hf_cache:/root/.cache/huggingface -d \
-e HF_HUB_ENABLE_HF_TRANSFER=0 \
--name qwen2.5-vl-72b \
--entrypoint "python3" motorbottle/vllm-qwen2_5_vl-fixed:v0.1.0 \
-m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-VL-72B-Instruct \
--tensor-parallel-size 4 --trust-remote-code --max-model-len 32768 --dtype bfloat16 --quantization bitsandbytes --load-format bitsandbytes
Remove these flags if you want to run at full precision:
--quantization bitsandbytes --load-format bitsandbytes
Change this according to how many GPUs you have available:
--tensor-parallel-size 4
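Once the container is up, it exposes an OpenAI-compatible API on the mapped host port (18434 in the command above). A minimal sketch of an image-understanding request; the image URL here is a placeholder, and the model name must match the `--model` argument:

```shell
# Send one multimodal chat completion to the vLLM OpenAI-compatible server.
curl http://localhost:18434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-VL-72B-Instruct",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
        {"type": "text", "text": "Describe this image."}
      ]
    }],
    "max_tokens": 128
  }'
```

The model may take several minutes to load the first time, so wait until the container logs show the server is ready before sending requests.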