Running Qwen3.5-122B on dual RTX 4080 Super 32G cards and driving OpenClaw

My setup

SuperMicro 7048
CPU: 2 x E5-2686 v4
RAM: 8 x 16 GB DDR4-2133
GPU: 2 x RTX 4080 Super 32G
OS: Ubuntu 22.04 with Docker

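Choosing llama.cpp over vLLM to run the model

I use the unsloth/Qwen3.5-122B-A10B-UD-Q2_K_XL dynamic quantization, which is only about 42 GB. The critical weights are kept at Q8_0 and Q6, while the expert (Exp) weights use more aggressive 2-5 bit quantization. In my tests it performs very well on OpenClaw tasks. Even with limited PCIe bandwidth (an old server, PCIe 3.0), it reaches about 60-70 tokens/s of output for a single request, with prefill at about 1000-1500 tokens/s.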
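As a rough sanity check on why this quantization fits on two cards, the arithmetic can be sketched as follows. It uses only the figures quoted above (about 42 GB of weights, two 32 GB cards). The helper name vram_headroom_gb is made up for illustration, and the result ignores CUDA context and compute-buffer overhead, so it is best read as an upper bound on what remains for the KV cache.

```shell
#!/usr/bin/env bash
# Rough VRAM headroom estimate in whole GB (integer arithmetic).
# Hypothetical helper for illustration; not part of llama.cpp.
# $1 = model size in GB, $2 = VRAM per card in GB, $3 = number of cards
vram_headroom_gb() {
  echo $(( $2 * $3 - $1 ))
}

# Figures from this post: ~42 GB model, 2 x 32 GB cards.
vram_headroom_gb 42 32 2   # prints 22
```

That remaining space has to hold the KV cache and runtime buffers, which is presumably why the launch commands quantize the KV cache to q8_0.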

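2026/05/20 update: MTP-accelerated inference

unsloth has published new weight files that support MTP (multi-token prediction) inference. With the latest llama.cpp, output speed rises to about 90 tokens/s, roughly 30-50% faster than the earlier 60-70 tokens/s, and the new weights no longer need the Jinja template modification. The downside is that mmproj image input and more than one parallel request are not supported for now.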
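To make the speedup figure concrete, here is a small sketch that computes the gain against each end of the earlier 60-70 tokens/s range. The function name speedup_pct is hypothetical, and the result is rounded down because the shell only does integer arithmetic.

```shell
#!/usr/bin/env bash
# Percentage speedup of a new throughput over an old one (integer percent, rounded down).
# Hypothetical helper for illustration.
# $1 = old tokens/s, $2 = new tokens/s
speedup_pct() {
  echo $(( ($2 - $1) * 100 / $1 ))
}

speedup_pct 70 90   # prints 28
speedup_pct 60 90   # prints 50
```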

Download the weight files in advance. I keep them in /disk1/Checkpoints and map that directory with -v /disk1/Checkpoints:/models; adjust this to your own preference:

sudo docker run -d --name llamacpp-qwen-122b-mtp \
  --gpus '"device=1,2"' \
  -v /disk1/Checkpoints:/models \
  -p 18434:8080 \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/Qwen3.5-122B-A10B-UD-Q2_K_XL.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --n-gpu-layers -1 \
  --parallel 1 \
  --ctx-size 524288 \
  --rope-scaling yarn \
  --rope-scale 2 \
  --yarn-orig-ctx 262144 \
  --override-kv qwen35moe.context_length=int:524288 \
  --flash-attn on \
  --batch-size 2048 \
  --ubatch-size 1024 \
  --cont-batching \
  --no-mmap \
  --mlock \
  --split-mode layer \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --prio 3 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --spec-type draft-mtp \
  --spec-draft-n-max 3
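Once the container is up, a quick smoke test can confirm that the server has finished loading and answers requests. The sketch below assumes the port mapping from the command above (host port 18434) and uses llama-server's /health and OpenAI-compatible /v1/chat/completions endpoints. The function names and the prompt are made up for illustration, and the payload builder does no JSON escaping, so keep the prompt free of quotes and backslashes.

```shell
#!/usr/bin/env bash
# Assumption: the container is reachable on localhost:18434.
BASE_URL="${BASE_URL:-http://localhost:18434}"

# Build a minimal OpenAI-style chat request body.
# $1 = user prompt (no quotes or backslashes; this sketch does not escape JSON).
chat_payload() {
  printf '{"messages":[{"role":"user","content":"%s"}],"max_tokens":64}' "$1"
}

# Poll /health until the model is loaded; give up after 60 tries (about 5 minutes).
wait_until_ready() {
  tries=0
  until curl -sf "$BASE_URL/health" >/dev/null; do
    tries=$((tries + 1))
    [ "$tries" -ge 60 ] && return 1
    sleep 5
  done
}

# Send one chat completion request and print the raw JSON response.
ask() {
  curl -s "$BASE_URL/v1/chat/completions" \
    -H 'Content-Type: application/json' \
    -d "$(chat_payload "$1")"
}

# Usage (requires the running container):
#   wait_until_ready && ask "Say hello in one sentence."
```

Loading about 42 GB with --no-mmap can take a while on an older server, so polling /health first avoids mistaking a still-loading server for a broken one.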

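OpenClaw works with the normal OpenAI Completions settings, and reasoning can be enabled.

2026/03/06: Using a Jinja template optimized for OpenClaw

When OpenClaw runs against llama.cpp, it gets a role compatibility error. The main cause is that OpenClaw sends its system prompt with the developer role instead of system. The fix is to use the modified Jinja template below, which adds support for the developer role. Save it as edited_chat_template.jinja.

Error message: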

got exception: {"error":{"code":500,"message":"\n------------\nWhile executing CallExpression at line 144, column 28 in source:\n... {%- else %}↵ {{- raise_exception('Unexpected message role.') }}↵ {%- ...\n ^\nError: Jinja Exception: Unexpected message role.","type":"server_error"}}

{%- set image_count = namespace(value=0) %}
{%- set video_count = namespace(value=0) %}
{%- set system_roles = ['system', 'developer'] %}

{%- macro render_content(content, do_vision_count, is_system_content=false) %}
    {%- if content is string %}
        {{- content }}
    {%- elif content is iterable and content is not mapping %}
        {%- for item in content %}
            {%- if 'image' in item or 'image_url' in item or item.type == 'image' %}
                {%- if is_system_content %}
                    {{- raise_exception('System message cannot contain images.') }}
                {%- endif %}
                {%- if do_vision_count %}
                    {%- set image_count.value = image_count.value + 1 %}
                {%- endif %}
                {%- if add_vision_id %}
                    {{- 'Picture ' ~ image_count.value ~ ': ' }}
                {%- endif %}
                {{- '<|vision_start|><|image_pad|><|vision_end|>' }}
            {%- elif 'video' in item or item.type == 'video' %}
                {%- if is_system_content %}
                    {{- raise_exception('System message cannot contain videos.') }}
                {%- endif %}
                {%- if do_vision_count %}
                    {%- set video_count.value = video_count.value + 1 %}
                {%- endif %}
                {%- if add_vision_id %}
                    {{- 'Video ' ~ video_count.value ~ ': ' }}
                {%- endif %}
                {{- '<|vision_start|><|video_pad|><|vision_end|>' }}
            {%- elif 'text' in item %}
                {{- item.text }}
            {%- else %}
                {{- raise_exception('Unexpected item type in content.') }}
            {%- endif %}
        {%- endfor %}
    {%- elif content is none or content is undefined %}
        {{- '' }}
    {%- else %}
        {{- raise_exception('Unexpected content type.') }}
    {%- endif %}
{%- endmacro %}

{%- if not messages %}
    {{- raise_exception('No messages provided.') }}
{%- endif %}

{%- if tools and tools is iterable and tools is not mapping %}
    {{- '<|im_start|>system\n' }}
    {{- "# Tools\n\nYou have access to the following functions:\n\n<tools>" }}
    {%- for tool in tools %}
        {{- "\n" }}
        {{- tool | tojson }}
    {%- endfor %}
    {{- "\n</tools>" }}
    {{- '\n\nIf you choose to call a function ONLY reply in the following format with NO suffix:\n\n<tool_call>\n<function=example_function_name>\n<parameter=example_parameter_1>\nvalue_1\n</parameter>\n<parameter=example_parameter_2>\nThis is the value for the second parameter\nthat can span\nmultiple lines\n</parameter>\n</function>\n</tool_call>\n\n<IMPORTANT>\nReminder:\n- Function calls MUST follow the specified format: an inner <function=...></function> block must be nested within <tool_call></tool_call> XML tags\n- Required parameters MUST be specified\n- You may provide optional reasoning for your function call in natural language BEFORE the function call, but NOT after\n- If there is no function call available, answer the question like normal with your current knowledge and do not tell the user about function calls\n</IMPORTANT>' }}
    {%- if messages[0].role in system_roles %}
        {%- set content = render_content(messages[0].content, false, true)|trim %}
        {%- if content %}
            {{- '\n\n' + content }}
        {%- endif %}
    {%- endif %}
    {{- '<|im_end|>\n' }}
{%- else %}
    {%- if messages[0].role in system_roles %}
        {%- set content = render_content(messages[0].content, false, true)|trim %}
        {{- '<|im_start|>system\n' + content + '<|im_end|>\n' }}
    {%- endif %}
{%- endif %}

{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for message in messages[::-1] %}
    {%- set index = (messages|length - 1) - loop.index0 %}
    {%- if ns.multi_step_tool and message.role == "user" %}
        {%- set content = render_content(message.content, false)|trim %}
        {%- if not(content.startswith('<tool_response>') and content.endswith('</tool_response>')) %}
            {%- set ns.multi_step_tool = false %}
            {%- set ns.last_query_index = index %}
        {%- endif %}
    {%- endif %}
{%- endfor %}

{%- if ns.multi_step_tool %}
    {{- raise_exception('No user query found in messages.') }}
{%- endif %}

{%- for message in messages %}
    {%- set content = render_content(message.content, true)|trim %}
    {%- if message.role in system_roles %}
        {%- if not loop.first %}
            {{- raise_exception('System/developer message must be at the beginning.') }}
        {%- endif %}
    {%- elif message.role == "user" %}
        {{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
    {%- elif message.role == "assistant" %}
        {%- set reasoning_content = '' %}
        {%- if message.reasoning_content is string %}
            {%- set reasoning_content = message.reasoning_content %}
        {%- else %}
            {%- if '</think>' in content %}
                {%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
                {%- set content = content.split('</think>')[-1].lstrip('\n') %}
            {%- endif %}
        {%- endif %}
        {%- set reasoning_content = reasoning_content|trim %}
        {%- if loop.index0 > ns.last_query_index %}
            {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content + '\n</think>\n\n' + content }}
        {%- else %}
            {{- '<|im_start|>' + message.role + '\n' + content }}
        {%- endif %}
        {%- if message.tool_calls and message.tool_calls is iterable and message.tool_calls is not mapping %}
            {%- for tool_call in message.tool_calls %}
                {%- if tool_call.function is defined %}
                    {%- set tool_call = tool_call.function %}
                {%- endif %}
                {%- if loop.first %}
                    {%- if content|trim %}
                        {{- '\n\n<tool_call>\n<function=' + tool_call.name + '>\n' }}
                    {%- else %}
                        {{- '<tool_call>\n<function=' + tool_call.name + '>\n' }}
                    {%- endif %}
                {%- else %}
                    {{- '\n<tool_call>\n<function=' + tool_call.name + '>\n' }}
                {%- endif %}
                {%- if tool_call.arguments is defined %}
                    {%- for args_name, args_value in tool_call.arguments|items %}
                        {{- '<parameter=' + args_name + '>\n' }}
                        {%- set args_value = args_value | tojson | safe if args_value is mapping or (args_value is sequence and args_value is not string) else args_value | string %}
                        {{- args_value }}
                        {{- '\n</parameter>\n' }}
                    {%- endfor %}
                {%- endif %}
                {{- '</function>\n</tool_call>' }}
            {%- endfor %}
        {%- endif %}
        {{- '<|im_end|>\n' }}
    {%- elif message.role == "tool" %}
        {%- if loop.previtem and loop.previtem.role != "tool" %}
            {{- '<|im_start|>user' }}
        {%- endif %}
        {{- '\n<tool_response>\n' }}
        {{- content }}
        {{- '\n</tool_response>' }}
        {%- if not loop.last and loop.nextitem.role != "tool" %}
            {{- '<|im_end|>\n' }}
        {%- elif loop.last %}
            {{- '<|im_end|>\n' }}
        {%- endif %}
    {%- else %}
        {{- raise_exception('Unexpected message role.') }}
    {%- endif %}
{%- endfor %}

{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
    {%- if enable_thinking is defined and enable_thinking is false %}
        {{- '<think>\n\n</think>\n\n' }}
    {%- else %}
        {{- '<think>\n' }}
    {%- endif %}
{%- endif %}

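Launch command

Replace <Your-Path-of-Checkpoints> with the directory where you keep the gguf file and the jinja file.

--gpus '"device=1,2"' is specific to my server, which has five GPUs installed. If you only have two GPUs, you can use --gpus all instead.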

sudo docker run -d --name llama-qwen \
  --gpus '"device=1,2"' \
  -v /<Your-Path-of-Checkpoints>:/models \
  -p 18434:8000 \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/Qwen3.5-122B-A10B-UD-Q2_K_XL.gguf \
  --host 0.0.0.0 \
  --port 8000 \
  --n-gpu-layers -1 \
  --parallel 2 \
  --ctx-size 262144 \
  --flash-attn on \
  --batch-size 4096 \
  --ubatch-size 2048 \
  --cont-batching \
  --no-mmap \
  --mlock \
  --split-mode layer \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --jinja \
  --chat-template-file /models/edited_chat_template.jinja

OpenClaw works with the normal OpenAI Completions settings, and reasoning can be enabled.
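To check that the edited template accepts the developer role, one option is to send a request shaped like OpenClaw's and look only at the HTTP status. This is a sketch, assuming the host port 18434 from the launch command. The function names and message texts are hypothetical, and the payload builder does no JSON escaping.

```shell
#!/usr/bin/env bash
# Assumption: the container is reachable on localhost:18434.
BASE_URL="${BASE_URL:-http://localhost:18434}"

# Request body whose first message uses the "developer" role, as OpenClaw does.
# $1 = developer prompt, $2 = user prompt (no quotes or backslashes).
developer_payload() {
  printf '{"messages":[{"role":"developer","content":"%s"},{"role":"user","content":"%s"}],"max_tokens":32}' "$1" "$2"
}

# Print only the HTTP status code of a chat completion request.
developer_role_status() {
  curl -s -o /dev/null -w '%{http_code}' "$BASE_URL/v1/chat/completions" \
    -H 'Content-Type: application/json' \
    -d "$(developer_payload "You are a coding agent." "ping")"
}

# Usage (requires the running container):
#   developer_role_status
```

A 500 here would point to the "Unexpected message role" error quoted earlier, while a 200 suggests the template rendered the developer message as a system prompt.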
