Running DeepSeek-R1 with KTransformers on an Old Server & RTX 3080

Machine configuration:

  • Supermicro 7048GR-TR
  • E5-2680 v4 x2 (14 cores each, 56 threads total)
  • Later swapped for E5-2686 v4 x2 (18 cores each, 72 threads total)
  • DDR4 ECC, 320 GB
  • RTX 3080 20 GB x1
  • Total cost: roughly 8,000 RMB

The setup followed this guide, using Docker plus the NVIDIA Container Toolkit: https://github.com/ubergarm/r1-ktransformers-guide
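
On the host side, Docker needs the NVIDIA runtime registered before the container can see the GPU. A minimal sketch, assuming the NVIDIA Container Toolkit is already installed from NVIDIA's apt repository (see its official docs for the repository setup):

sudo nvidia-ctk runtime configure --runtime=docker   # adds the nvidia runtime to /etc/docker/daemon.json
sudo systemctl restart docker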

To test: update to the latest release 0.2.4 and set export USE_NUMA=1; see: https://github.com/kvcache-ai/ktransformers/issues/769
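
A minimal rebuild sketch for that experiment, reusing the build steps from the installation section below; the v0.2.4 tag name and the /workspace/ktransformers path are assumptions and this is untested here:

# inside the container, with the venv from the installation section activated
apt install -y libnuma-dev
export USE_NUMA=1                          # NUMA-aware build, per the issue linked above
cd /workspace/ktransformers
git fetch --tags && git checkout v0.2.4    # assumed tag name for release 0.2.4
git submodule update --init
KTRANSFORMERS_FORCE_BUILD=TRUE uv pip install . --no-build-isolation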

Test model: unsloth's 2.51-bit dynamic quantization (DeepSeek-R1-UD-Q2_K_XL).
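
For reference, one way to pull that quant into the ./model directory mounted by the compose file below; the repository and folder names follow unsloth's Hugging Face layout, and --gguf_path in the launch command should point at whichever folder actually holds the .gguf shards:

uv pip install "huggingface_hub[cli]"
huggingface-cli download unsloth/DeepSeek-R1-GGUF \
  --include "DeepSeek-R1-UD-Q2_K_XL/*" \
  --local-dir ./model
# shards land in ./model/DeepSeek-R1-UD-Q2_K_XL/, i.e. /model/DeepSeek-R1-UD-Q2_K_XL inside the container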

Output speed test history (see the throughput sketch after this list):

  • Flashinfer disabled (https://www.bilibili.com/video/BV15hQHYXEJc): average 4.7 token/s
  • Flashinfer enabled: average 5.0 token/s
  • Flashinfer enabled, --cpu_infer raised from 16 to 28 in the launch command: average 5.3 token/s
  • Flashinfer enabled, both CPUs swapped for E5-2686 v4, --cpu_infer set to 34: average 6.0 token/s
  • Flashinfer disabled (back to Triton), both CPUs swapped for E5-2686 v4, --cpu_infer set to 34, and the `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` launch parameter removed: average 6.5 token/s
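
A rough way to eyeball decode speed from the host, assuming the server exposes the OpenAI-compatible /v1/chat/completions route on the 18434 port mapping from the compose file below. Each SSE data chunk is counted as roughly one token; divide the count by the wall time printed by `time`. This is only an approximation, not necessarily how the averages above were measured:

time curl -sN http://localhost:18434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"DeepSeek-R1","messages":[{"role":"user","content":"Briefly explain MoE offloading."}],"stream":true,"max_tokens":256}' \
  | grep -c '^data: {'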

Docker Compose configuration:

services:
  deepseek:
    image: nvidia/cuda:12.6.3-cudnn-devel-ubuntu22.04
    runtime: nvidia
    container_name: deepseek
    environment:
      - NVIDIA_VISIBLE_DEVICES=2 # My RTX3080
    volumes:
      - ./model:/model
      - ./workspace:/workspace
    shm_size: '310g'  # shared memory size
    ports:
      - "18434:8080"
    stdin_open: true
    tty: true
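
With the above saved as docker-compose.yml (filename assumed) next to the model/ and workspace/ directories, bring the container up and open a shell in it; the installation commands below are run inside that shell:

docker compose up -d
docker exec -it deepseek bash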

Installation commands (for reference):

apt-get update
apt-get install git -y
apt install curl -y
apt-get install build-essential cmake
curl -LsSf https://astral.sh/uv/install.sh | sh

# reopen terminal after uv installation

cd workspace
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git submodule init # submodules: third_party/llama.cpp, third_party/pybind11
git submodule update
git checkout 7a19f3b
git rev-parse --short HEAD # 7a19f3b
uv venv ./venv --python 3.11 --python-preference=only-managed
source venv/bin/activate

uv pip install flashinfer-python
# Find your GPU ARCH here: `nvidia-smi --query-gpu=compute_cap --format=csv`
# This example is for RTX 3080/3090TI and RTX A6000
export TORCH_CUDA_ARCH_LIST="8.6"
# The first inference after startup will be slow as it must JIT compile
# 2025-02-27 12:24:22,992 - INFO - flashinfer.jit: Loading JIT ops: batch_mla_attention_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_ckv_512_head_dim_kpe_64
# 2025-02-27 12:24:42,108 - INFO - flashinfer.jit: Finished loading JIT ops: batch_mla_attention_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_ckv_512_head_dim_kpe_64

uv pip install -r requirements-local_chat.txt
uv pip install setuptools wheel packaging

# If you have enough CPU cores and memory you can speed up builds
# $ export MAX_JOBS=18
# $ export CMAKE_BUILD_PARALLEL_LEVEL=18

# Install flash_attn
uv pip install flash_attn --no-build-isolation

# ONLY IF you have Intel dual socket and >1TB RAM to hold 2x copies of entire model in RAM (one copy per socket)
# Dual socket AMD EPYC NPS0 probably makes this not needed?
# $ apt install libnuma-dev
# $ export USE_NUMA=1

# Install ktransformers
KTRANSFORMERS_FORCE_BUILD=TRUE uv pip install . --no-build-isolation

# DONE, Continue below!

Launch parameters:

python3 ktransformers/server/main.py \
    --gguf_path /model/ \
    --model_path deepseek-ai/DeepSeek-R1 \
    --model_name unsloth/DeepSeek-R1-UD-Q2_K_XL \
    --cpu_infer 34 \
    --max_new_tokens 8192 \
    --cache_lens 32768 \
    --total_context 32768 \
    --cache_q4 true \
    --temperature 0.6 \
    --top_p 0.95 \
    --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml \
    --force_think \
    --use_cuda_graph \
    --host 0.0.0.0 \
    --port 8080