Old E5 v4 Server & RTX 3080: KTransformers vs. DeepSeek-R1/V3, Output Reaches 9.6 t/s (Updated with Qwen3 235B Tests)

Machine Configuration

  • Supermicro 7048GR-TR
  • E5-2680 v4 x2 (14 cores x2, 56 threads total)
  • Later replaced with E5-2686 v4 x2 (18 cores x2, 72 threads total)
  • 512 GB DDR4 ECC running at 2133 MT/s
  • RTX 3080 20G x1
  • Total cost: about 9,000 RMB

Tests (new, V0.3, updated 2025-05-02)

Test results:

  • New test, Qwen3 unsloth/Qwen3-235B-A22B in UD-Q4_K_XL: prefill averages 23 token/s, decode averages 8.3 token/s
  • V3-0324, unsloth's UD_Q2_K_XL: prefill averages 30 token/s, decode averages 9.6 token/s, the same as on 0.2.4 and a noticeable improvement over 0.2.3

Installation:

Same as 0.2.4: follow the official Dockerfile step by step. The current problem is a torch version compatibility error (the Dockerfile install path automatically upgrades torch to 2.7, which then errors at runtime), so torch needs to be downgraded to 2.5.1. The model_path issue is also still unresolved; as with 0.2.4, the absolute path has to be specified manually.
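
A minimal sketch of the downgrade, run inside the container before building ktransformers; the cu121 wheel index is an assumption based on the pytorch/pytorch:2.5.1-cuda12.1 base image used further below:

# Pin torch back to 2.5.1 (cu121 build); pick the index matching your CUDA version.
pip install "torch==2.5.1" --index-url https://download.pytorch.org/whl/cu121
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"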

Launch parameters:

TORCH_CUDA_ARCH_LIST="8.6" USE_NUMA=1 python3 ktransformers/server/main.py \
    --gguf_path /modelv3_new/ \
    --model_path /root/.cache/huggingface/hub/models--Qwen--Qwen3-235B-A22B/snapshots/b51c4308ed84804fa6722b20722cd91e3cd17808 \
    --model_name Qwen3Moe \
    --architectures Qwen3MoeForCausalLM \
    --cpu_infer 56 \
    --max_new_tokens 4096 \
    --cache_lens 16384 \
    --temperature 0.3 \
    --cache_8bit true \
    --optimize_config_path ktransformers/optimize/optimize_rules/Qwen3Moe-serve.yaml \
    --chunk_size 256 \
    --max_batch_size 4 \
    --backend_type balance_serve \
    --host 0.0.0.0 \
    --port 8080
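
Once the server is up, a quick smoke test can be sent to its OpenAI-compatible chat endpoint. This is only a sketch: it assumes the 19434:8080 host port mapping from the docker compose file further below, and that the model field can be any label.

# Hedged smoke test; adjust host and port to your own mapping.
curl -s http://127.0.0.1:19434/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "Qwen3Moe", "messages": [{"role": "user", "content": "Briefly introduce yourself."}], "max_tokens": 128}'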

Tests (V0.2.4, updated 2025-04-05)

Test results:

  • V3-0324, unsloth's UD_Q2_K_XL: prefill averages 30 token/s, decode averages 9.6 token/s, a noticeable improvement over 0.2.3
  • There are still open questions about how to set the parameters and tune performance; they can be followed in this issue. In short, cpu_infer has a large impact on inference speed and it takes several runs to find the best value; see the table and the topology sketch below.
Setup                                       Performance
USE_NUMA=1, HT on, cpu_infer=56             9.65 token/s
USE_NUMA=1, HT on, cpu_infer=42             9.54 token/s
USE_NUMA=1, HT on, cpu_infer=36             9.45 token/s
USE_NUMA=1, HT on, cpu_infer=68             9.25 token/s
USE_NUMA not set, HT on, cpu_infer=68       8.2 token/s
USE_NUMA not set, HT on, cpu_infer=36       7.3 token/s
USE_NUMA not set, HT off, cpu_infer=34      6.8 token/s
USE_NUMA=1, HT off, cpu_infer=34            6.5 token/s (Needs Validation)
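
The cpu_infer values above roughly bracket the physical core count (36) and the thread count (72) of the two E5-2686 v4 CPUs. Before sweeping, it can help to print the CPU and NUMA topology; this is a generic sketch and assumes the numactl package is installed (libnuma-dev alone only provides the development headers).

# Print socket/core/thread counts and NUMA layout to pick cpu_infer candidates.
lscpu | grep -E '^(CPU\(s\)|Core\(s\) per socket|Socket|NUMA)'
numactl --hardware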

docker compose configuration:

services:
  deepseek:
    image: pytorch/pytorch:2.5.1-cuda12.1-cudnn9-devel
    runtime: nvidia
    container_name: deepseek
    privileged: true
    environment:
      - NVIDIA_VISIBLE_DEVICES=2 # Note: privileged: true is needed to enable USE_NUMA, but it makes this setting ineffective. When USE_NUMA is off, I use it to pin GPU 2
    volumes:
      - /home/motor/disk1/dsr1:/modelr1 # mount the folders holding the model GGUF files
      - /home/motor/disk1/DSv3:/modelv3
      - /home/motor/disk1/dsv3_new:/modelv3_new
      - ./workspace:/workspace
    ipc: host
    ports:
      - "19434:8080"
    stdin_open: true
    tty: true
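
A short usage sketch for this compose file: start the container, open a shell inside it, and run the build steps from the next section in /workspace.

docker compose up -d deepseek
docker exec -it deepseek bash
# Inside the container, check which GPUs are visible (with privileged: true, all of them will be).
nvidia-smi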

Setup + build steps:

This follows the official Dockerfile directly, with a few changes (I enabled USE_NUMA=1).

cd /workspace

export CPU_INSTRUCT=NATIVE
export CUDA_HOME=/usr/local/cuda
export MAX_JOBS=64
# Set this according to your machine's thread count; mine has 72 threads
export CMAKE_BUILD_PARALLEL_LEVEL=64
# Set this according to your machine's thread count; mine has 72 threads
# Find your GPU ARCH here: `nvidia-smi --query-gpu=compute_cap --format=csv`
# This example is for my RTX2080ti + RTX 3080
export TORCH_CUDA_ARCH_LIST="7.5;8.6"

apt update -y

apt install -y --no-install-recommends \
    libtbb-dev \
    libssl-dev \
    libcurl4-openssl-dev \
    libaio1 \
    libaio-dev \
    libfmt-dev \
    libgflags-dev \
    zlib1g-dev \
    patchelf \
    git \
    wget \
    vim \
    gcc \
    g++ \
    cmake \
    libnuma-dev

git clone https://github.com/kvcache-ai/ktransformers.git

rm -rf /var/lib/apt/lists/*

cd /workspace/ktransformers

git submodule update --init --recursive

# git checkout a5608dc # The current latest version fails to build; after checking out a5608dc (0.2.4post1) it compiles successfully

pip install --upgrade pip

pip install ninja pyproject numpy cpufeature aiohttp zmq openai

pip install flash-attn

# Install ktransformers itself (USE_NUMA enabled, built from source)
CPU_INSTRUCT=${CPU_INSTRUCT} \
    USE_BALANCE_SERVE=1 \
    USE_NUMA=1 \
    KTRANSFORMERS_FORCE_BUILD=TRUE \
    TORCH_CUDA_ARCH_LIST="7.5;8.6" \
    pip install . --no-build-isolation --verbose

pip install third_party/custom_flashinfer/
# Clean the pip cache
pip cache purge

# Copy the C++ runtime library
cp /usr/lib/x86_64-linux-gnu/libstdc++.so.6 /opt/conda/lib/
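
A quick post-build sanity check, as a sketch: the import should load the compiled extensions without errors, and pip should report the locally built version.

python3 -c "import ktransformers; print('ktransformers import OK')"
pip show ktransformers | head -n 2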

Run:

First, turn swap off so that parts of the model do not get loaded into swap:

sudo swapoff -a
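
Optionally confirm that swap is really off and see how much RAM is free before loading the model (with USE_NUMA=1 the weights are kept once per socket, so budget roughly twice the GGUF size; see the NUMA note in the 0.2.3 install commands further below).

# Swap should show 0B, and "available" should comfortably exceed what the model needs.
free -h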

Note: the current version may hang while loading the model config; the workaround is to point --model_path directly at the absolute path of the model's config files. Follow this issue for details.
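
As a sketch of that workaround, assuming the huggingface_hub CLI: download only the config and tokenizer files, then list the snapshot directory to pass as --model_path; the weights themselves still come from the local GGUF folders.

pip install -U "huggingface_hub[cli]"
# Only the JSON config/tokenizer files (and any remote code) are needed here.
huggingface-cli download deepseek-ai/DeepSeek-V3-0324 --include "*.json" "*.py"
ls -d /root/.cache/huggingface/hub/models--deepseek-ai--DeepSeek-V3-0324/snapshots/*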

# V3
TORCH_CUDA_ARCH_LIST="8.6" USE_NUMA=1 python3 ktransformers/server/main.py \
    --gguf_path /modelv3/ \
    --model_path /root/.cache/huggingface/hub/models--deepseek-ai--DeepSeek-V3-0324/snapshots/e9b33add76883f293d6bf61f6bd89b497e80e335 \
    --model_name DeepSeek-V3-0324 \
    --cpu_infer 56 \
    --max_new_tokens 4096 \
    --cache_lens 16384 \
    --temperature 0.3 \
    --cache_8bit true \
    --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-serve.yaml \
    --chunk_size 256 \
    --max_batch_size 4 \
    --backend_type balance_serve \
    --host 0.0.0.0 \
    --port 8080

# R1
TORCH_CUDA_ARCH_LIST="8.6" USE_NUMA=1 python3 ktransformers/server/main.py \
    --gguf_path /modelr1/ \
    --model_path /root/.cache/huggingface/hub/models--deepseek-ai--DeepSeek-R1/snapshots/56d4cbbb4d29f4355bab4b9a39ccb717a14ad5ad \
    --model_name DeepSeek-R1 \
    --cpu_infer 56 \
    --max_new_tokens 4096 \
    --cache_lens 16384 \
    --temperature 0.3 \
    --cache_8bit true \
    --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-serve.yaml \
    --chunk_size 256 \
    --max_batch_size 4 \
    --backend_type balance_serve \
    --host 0.0.0.0 \
    --port 8080

For the recommended settings when running DeepSeek-V3-0324, refer to this Unsloth document.

Tests (old, V0.2.3)

The setup guide I followed (version 0.2.3, configured with Docker + NVIDIA Container Toolkit): https://github.com/ubergarm/r1-ktransformers-guide

To test: update to the latest 0.2.4 and set export USE_NUMA=1; see https://github.com/kvcache-ai/ktransformers/issues/769

Test model: unsloth's 2.51-bit dynamic quantization.

Output speed test history:

  • Without Flashinfer (https://www.bilibili.com/video/BV15hQHYXEJc): average 4.7 token/s
  • With Flashinfer: average 5.0 token/s
  • With Flashinfer, --cpu_infer in the launch command raised from 16 to 28: average 5.3 token/s
  • With Flashinfer, both CPUs replaced with 2686 v4, --cpu_infer raised to 34: average 6.0 token/s
  • Flashinfer disabled (back to Triton), both CPUs 2686 v4, --cpu_infer 34, and the `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` launch parameter removed: average 6.5 token/s

Docker compose configuration:

services:
  deepseek:
    image: nvidia/cuda:12.6.3-cudnn-devel-ubuntu22.04
    runtime: nvidia
    container_name: deepseek
    environment:
      - NVIDIA_VISIBLE_DEVICES=2 # My RTX3080
    volumes:
      - ./model:/model
      - ./workspace:/workspace
    shm_size: '310g'  # shared memory size
    ports:
      - "18434:8080"
    stdin_open: true
    tty: true

Install commands (for reference):

apt-get update
apt-get install git -y
apt install curl -y
apt-get install build-essential cmake
curl -LsSf https://astral.sh/uv/install.sh | sh

# reopen terminal after uv installation

cd workspace
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git submodule init # submodules: third_party/llama.cpp, third_party/pybind11
git submodule update
# git checkout 7a19f3b # To run V0.2.3
git rev-parse --short HEAD # 7a19f3b
uv venv ./venv --python 3.11 --python-preference=only-managed
source venv/bin/activate

uv pip install flashinfer-python
# Find your GPU ARCH here: `nvidia-smi --query-gpu=compute_cap --format=csv`
# This example is for RTX 3080/3090TI and RTX A6000
export TORCH_CUDA_ARCH_LIST="8.6"
# The first inference after startup will be slow as it must JIT compile
# 2025-02-27 12:24:22,992 - INFO - flashinfer.jit: Loading JIT ops: batch_mla_attention_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_ckv_512_head_dim_kpe_64
# 2025-02-27 12:24:42,108 - INFO - flashinfer.jit: Finished loading JIT ops: batch_mla_attention_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_ckv_512_head_dim_kpe_64

uv pip install -r requirements-local_chat.txt
uv pip install setuptools wheel packaging

# If you have enough CPU cores and memory you can speed up builds
# $ export MAX_JOBS=18
# $ export CMAKE_BUILD_PARALLEL_LEVEL=18

# Install flash_attn
uv pip install flash_attn --no-build-isolation

# ONLY IF you have Intel dual socket and >1TB RAM to hold 2x copies of entire model in RAM (one copy per socket)
# Dual socket AMD EPYC NPS0 probably makes this not needed?
# $ apt install libnuma-dev
# $ export USE_NUMA=1

# Install ktransformers
KTRANSFORMERS_FORCE_BUILD=TRUE uv pip install . --no-build-isolation

# DONE, Continue below!

Launch parameters:

# R1
python3 ktransformers/server/main.py \
    --gguf_path /model/ \
    --model_path deepseek-ai/DeepSeek-R1 \
    --model_name unsloth/DeepSeek-R1-UD-Q2_K_XL \
    --cpu_infer 68 \
    --max_new_tokens 8192 \
    --cache_lens 32768 \
    --total_context 32768 \
    --cache_q4 true \
    --temperature 0.6 \
    --top_p 0.95 \
    --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml \
    --force_think \
    --use_cuda_graph \
    --host 0.0.0.0 \
    --port 8080

#V3-0324
TORCH_CUDA_ARCH_LIST="8.6" python3 ktransformers/server/main.py \
    --gguf_path /modelv3_new/ \
    --model_path deepseek-ai/DeepSeek-V3-0324 \
    --model_name DeepSeek-V3 \
    --cpu_infer 68 \
    --max_new_tokens 4096 \
    --cache_lens 8192 \
    --total_context 8192 \
    --cache_8bit true \
    --temperature 0.3 \
    --top_p 0.95 \
    --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml \
    --use_cuda_graph \
    --host 0.0.0.0 \
    --port 8080

For the recommended settings when running DeepSeek-V3-0324, refer to this Unsloth document.
