The latest vLLM Docker image does not yet fully support Qwen2.5-VL: you need to manually update the transformers library and patch the bitsandbytes (bnb) quantization code. If any of you are interested in trying this model in Docker, with or without quantization, you can use my re-packed vLLM image motorbottle/vllm-qwen2_5_vl-fixed:v0.1.0. Please note that the NVIDIA container runtime is required.
How to start the container:
sudo docker run --runtime nvidia --gpus all --ipc=host -p 18434:8000 \
  -v hf_cache:/root/.cache/huggingface -d \
  -e HF_HUB_ENABLE_HF_TRANSFER=0 \
  --name qwen2.5-vl-72b \
  --entrypoint "python3" motorbottle/vllm-qwen2_5_vl-fixed:v0.1.0 \
  -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-VL-72B-Instruct \
  --tensor-parallel-size 4 --trust-remote-code --max-model-len 32768 \
  --dtype bfloat16 --quantization bitsandbytes --load-format bitsandbytes
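Once the container is up, it serves an OpenAI-compatible API on port 18434 (per the `-p 18434:8000` mapping above). A minimal request sketch; the image URL is a placeholder, and you would send the payload with any HTTP client:

```python
import json

# Endpoint assumed from the port mapping in the docker run command above.
url = "http://localhost:18434/v1/chat/completions"

# OpenAI-compatible chat payload with one image plus a text prompt.
payload = {
    "model": "Qwen/Qwen2.5-VL-72B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                # Placeholder image URL -- replace with your own.
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/cat.jpg"}},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
    "max_tokens": 256,
}

# e.g. requests.post(url, json=payload) once the server is running.
print(json.dumps(payload, indent=2))
```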
Remove these flags if you want to run at full precision:
--quantization bitsandbytes --load-format bitsandbytes
Change this according to the number of GPUs you have available:
--tensor-parallel-size 4
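A rough back-of-the-envelope sketch for picking a tensor-parallel size: weight memory per GPU is roughly total parameters times bytes per parameter, divided by the number of GPUs. This is illustrative only; it ignores the KV cache, activations, and vLLM overhead:

```python
# Rough per-GPU weight-memory estimate (GB) for choosing --tensor-parallel-size.
# Illustrative arithmetic only: real usage adds KV cache and runtime overhead.
def per_gpu_weight_gb(n_params_billion: float,
                      bytes_per_param: float,
                      tp_size: int) -> float:
    return n_params_billion * bytes_per_param / tp_size

# 72B model at bfloat16 (2 bytes/param) split across 4 GPUs:
print(per_gpu_weight_gb(72, 2, 4))    # 36.0 GB of weights per GPU
# With 4-bit bitsandbytes quantization (~0.5 bytes/param):
print(per_gpu_weight_gb(72, 0.5, 4))  # 9.0 GB of weights per GPU
```

This is why the bitsandbytes flags make the 72B model feasible on far smaller GPUs than full bfloat16 would require.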