Machine configuration:
- Supermicro 7048GR-TR
- E5-2680 v4 x2 (14 cores x2, 56 threads total)
- Later swapped for E5-2686 v4 x2 (18 cores x2, 72 threads total)
- 320 GB DDR4 ECC
- RTX 3080 20 GB x1
- Total cost: roughly 8,000 RMB
The setup follows this guide, using Docker plus the NVIDIA Container Toolkit: https://github.com/ubergarm/r1-ktransformers-guide
Still to test: update to the latest 0.2.4 release and set export USE_NUMA=1, see: https://github.com/kvcache-ai/ktransformers/issues/769
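A minimal sketch of what that untested update could look like, reusing the rebuild steps from the install section below; the v0.2.4 tag name is an assumption, and I have not verified this on the machine:
cd /workspace/ktransformers
git fetch --tags
git checkout v0.2.4   # assumed tag name for the 0.2.4 release
git submodule update --init
apt install -y libnuma-dev   # NUMA headers required when building with USE_NUMA
export USE_NUMA=1            # build with NUMA support as discussed in the linked issue
KTRANSFORMERS_FORCE_BUILD=TRUE uv pip install . --no-build-isolation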
Model tested: unsloth's 2.51-bit dynamically quantized build (DeepSeek-R1-UD-Q2_K_XL).
Output speed test history:
- Flashinfer disabled (video: https://www.bilibili.com/video/BV15hQHYXEJc): average 4.7 tokens/s
- Flashinfer enabled: average 5.0 tokens/s
- Flashinfer enabled, --cpu_infer raised from 16 to 28 in the launch command: average 5.3 tokens/s
- Flashinfer enabled, both CPUs replaced with E5-2686 v4, --cpu_infer set to 34: average 6.0 tokens/s
- Flashinfer disabled (back to Triton), both CPUs E5-2686 v4, --cpu_infer 34, and the `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` launch environment variable removed: average 6.5 tokens/s
Docker Compose configuration:
services:
  deepseek:
    image: nvidia/cuda:12.6.3-cudnn-devel-ubuntu22.04
    runtime: nvidia
    container_name: deepseek
    environment:
      - NVIDIA_VISIBLE_DEVICES=2 # My RTX 3080
    volumes:
      - ./model:/model
      - ./workspace:/workspace
    shm_size: '310g' # shared memory size
    ports:
      - "18434:8080"
    stdin_open: true
    tty: true
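With this compose file in place, the container can be started and entered like so (standard Docker Compose usage; all remaining steps below are run inside this shell):
docker compose up -d deepseek
docker compose exec deepseek bash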
Install commands (for reference):
apt-get update
apt-get install -y git curl build-essential cmake
curl -LsSf https://astral.sh/uv/install.sh | sh
# reopen terminal after uv installation
cd workspace
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git submodule init # submodules: third_party/llama.cpp, third_party/pybind11
git submodule update
git checkout 7a19f3b
git rev-parse --short HEAD # 7a19f3b
uv venv ./venv --python 3.11 --python-preference=only-managed
source venv/bin/activate
uv pip install flashinfer-python
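# Optional sanity check before continuing (assumption: a successful import is enough to confirm the wheel installed):
# $ python -c "import flashinfer"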
# Find your GPU ARCH here: `nvidia-smi --query-gpu=compute_cap --format=csv`
# This example is for RTX 3080/3090TI and RTX A6000
export TORCH_CUDA_ARCH_LIST="8.6"
# The first inference after startup will be slow as it must JIT compile
# 2025-02-27 12:24:22,992 - INFO - flashinfer.jit: Loading JIT ops: batch_mla_attention_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_ckv_512_head_dim_kpe_64
# 2025-02-27 12:24:42,108 - INFO - flashinfer.jit: Finished loading JIT ops: batch_mla_attention_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_ckv_512_head_dim_kpe_64
uv pip install -r requirements-local_chat.txt
uv pip install setuptools wheel packaging
# If you have enough CPU cores and memory you can speed up builds
# $ export MAX_JOBS=18
# $ export CMAKE_BUILD_PARALLEL_LEVEL=18
# Install flash_attn
uv pip install flash_attn --no-build-isolation
# ONLY IF you have Intel dual socket and >1TB RAM to hold 2x copies of entire model in RAM (one copy per socket)
# Dual socket AMD EPYC NPS0 probably makes this not needed?
# $ apt install libnuma-dev
# $ export USE_NUMA=1
# Install ktransformers
KTRANSFORMERS_FORCE_BUILD=TRUE uv pip install . --no-build-isolation
# DONE, Continue below!
Launch command:
python3 ktransformers/server/main.py \
--gguf_path /model/ \
--model_path deepseek-ai/DeepSeek-R1 \
--model_name unsloth/DeepSeek-R1-UD-Q2_K_XL \
--cpu_infer 34 \
--max_new_tokens 8192 \
--cache_lens 32768 \
--total_context 32768 \
--cache_q4 true \
--temperature 0.6 \
--top_p 0.95 \
--optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml \
--force_think \
--use_cuda_graph \
--host 0.0.0.0 \
--port 8080
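Once the server is up, a quick smoke test can be run from the host (port 18434 maps to 8080 in the container). This assumes an OpenAI-compatible /v1/chat/completions route and that the model field should match --model_name above; I have not re-verified either against this exact ktransformers commit:
curl http://localhost:18434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "unsloth/DeepSeek-R1-UD-Q2_K_XL", "messages": [{"role": "user", "content": "Hello"}], "temperature": 0.6, "max_tokens": 256}'
Expect the first request after startup to be slow while flashinfer JIT-compiles its kernels (see the log lines in the install section).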