The latest vLLM Docker image does not yet fully support Qwen2.5-VL; you would need to manually update the transformers library and patch the bitsandbytes (bnb) quantization code. If you want an easy way to run the quantized or full-size model, you can use my re-packed image.
If any of you are interested in trying this model in Docker, with or without quantization, you can use my re-packed vLLM image: motorbottle/vllm-qwen2_5_vl-fixed:v0.1.0. Please note that the NVIDIA container runtime is required.
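Before pulling the image, you can confirm the NVIDIA runtime works. This is just a quick sanity check under the assumption that the NVIDIA Container Toolkit is installed; the CUDA base image tag is an arbitrary example:

```shell
# Should print your GPU table; if this fails, install/configure
# the NVIDIA Container Toolkit before continuing.
sudo docker run --rm --runtime nvidia --gpus all \
  nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```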
How to start the container:
sudo docker run --runtime nvidia --gpus all --ipc=host -p 18434:8000 \
-v hf_cache:/root/.cache/huggingface -d \
-e HF_HUB_ENABLE_HF_TRANSFER=0 \
--name qwen2.5-vl-72b \
--entrypoint "python3" motorbottle/vllm-qwen2_5_vl-fixed:v0.1.0 \
-m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-VL-72B-Instruct \
--tensor-parallel-size 4 --trust-remote-code --max-model-len 32768 --dtype bfloat16 --quantization bitsandbytes --load-format bitsandbytes
Remove these flags if you want to run at full precision:
--quantization bitsandbytes --load-format bitsandbytes
Change this according to how many GPUs you have available:
--tensor-parallel-size 4
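Once the container is up, it exposes an OpenAI-compatible API on the mapped host port (18434 in the command above). A minimal sketch of an image-understanding request; the image URL here is a placeholder, and the model name must match the `--model` argument:

```shell
# Send one multimodal chat completion to the vLLM OpenAI-compatible server.
curl http://localhost:18434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-VL-72B-Instruct",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
        {"type": "text", "text": "Describe this image."}
      ]
    }],
    "max_tokens": 128
  }'
```

The model may take several minutes to load the first time, so wait until the container logs show the server is ready before sending requests.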