1. Introduction to LMCache

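TTFT (time to first token) is the time from when a request is sent until the model generates its first token. During the prefill stage the model must encode the whole input context into a KV Cache before it can start generating, so producing the first token requires a large amount of computation, which makes TTFT high.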

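One way to reduce TTFT is to cache the KV Cache computed during prefill. The next time the same context appears, the cached KV Cache is reused directly instead of being recomputed, which can cut TTFT substantially.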
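The reuse idea can be sketched with a toy prefix cache. This is only an illustration, not LMCache's actual implementation: the names `chunk_keys` and `ToyKVStore` are hypothetical, the KV data is represented by placeholder strings, and the 256-token chunk size simply mirrors the `LMCACHE_CHUNK_SIZE` setting used in the benchmarks. Each key hashes the whole prefix up to the end of a chunk, so a chunk only counts as a hit when everything before it matches as well.

```python
import hashlib

CHUNK_SIZE = 256  # tokens per chunk; mirrors LMCACHE_CHUNK_SIZE, otherwise arbitrary


def chunk_keys(token_ids, chunk_size=CHUNK_SIZE):
    """Return one key per full chunk; each key covers the prefix up to that chunk."""
    keys = []
    digest = hashlib.sha256()
    full_len = len(token_ids) - len(token_ids) % chunk_size
    for start in range(0, full_len, chunk_size):
        chunk = token_ids[start:start + chunk_size]
        # Feed the chunk into the running hash so the key depends on the entire prefix.
        digest.update(",".join(map(str, chunk)).encode() + b";")
        keys.append(digest.copy().hexdigest())
    return keys


class ToyKVStore:
    """Hypothetical in-memory store mapping prefix keys to KV placeholders."""

    def __init__(self):
        self._store = {}

    def put(self, token_ids, kv_chunks):
        """Store one KV placeholder per full chunk of token_ids."""
        for key, kv in zip(chunk_keys(token_ids), kv_chunks):
            self._store[key] = kv

    def cached_prefix_len(self, token_ids):
        """Number of leading tokens whose KV data is already cached."""
        hit = 0
        for key in chunk_keys(token_ids):
            if key not in self._store:
                break
            hit += CHUNK_SIZE
        return hit


store = ToyKVStore()
first = list(range(600))             # 2 full chunks plus 88 leftover tokens
store.put(first, ["kv0", "kv1"])
second = first[:512] + [9999] * 300  # shares the first 512 tokens with `first`
print(store.cached_prefix_len(second))  # 512
```

In this sketch only full chunks are cached, so the 88 leftover tokens of the first request would still be recomputed on a repeat.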

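For model inference, https://github.com/LMCache/LMCache is an open-source project dedicated to caching the KV Cache. It supports storing the KV Cache in multiple storage backends, including CPU memory, disk, Redis, GDS, and Nixl. For details, see https://docs.lmcache.ai/kv_cache/storage_backends/index.html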

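LMCache also provides a tool for calculating KV Cache size at https://lmcache.ai/kv_cache_calculator.html . Taking about 4k Chinese characters as an estimate, roughly 2k tokens, the KV Cache takes 106 MB, so the storage overhead is very large. Although LMCache offers cache eviction policies such as LRU, FIFO, LFU, and MRU, production deployments usually still need a large-capacity storage backend such as Redis, 3FS, or large disks.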
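For a rough feel of where such numbers come from, the usual size formula (one key vector and one value vector per token, per layer, per KV head) can be sketched as below. The model settings are assumptions taken from the published Qwen2.5-7B-Instruct configuration (28 layers, 4 KV heads, head dimension 128, 2-byte values), not from this article, and the sketch does not try to reproduce the 106 MB figure because the calculator inputs behind it are not stated. Under these assumptions a 256-token chunk comes to 14 MiB and 776 tokens to about 0.0414 GiB, which is consistent with the 14M cache files and the "size: 0.0414 gb" log line shown in the benchmark sections.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV Cache: a key and a value vector per token, layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * num_tokens


# Assumed Qwen2.5-7B-Instruct settings (from its published config, not this article).
QWEN25_7B = dict(num_layers=28, num_kv_heads=4, head_dim=128, dtype_bytes=2)

per_token = kv_cache_bytes(1, **QWEN25_7B)
chunk = kv_cache_bytes(256, **QWEN25_7B)
print(per_token)      # 57344 bytes per token
print(chunk / 2**20)  # 14.0 MiB per 256-token chunk
```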

Next, we use a few benchmarks to show the effect of LMCache.

2. Caching to Memory

  • Start the LMCache environment
nerdctl run -it \
        -p 8000:8000 \
        --gpus all \
        --ipc=host \
        --ulimit memlock=-1 \
        --ulimit stack=67108864 \
        --name lmcache \
        --volume /data/models:/data/models \
        --entrypoint /bin/bash \
        lmcache/vllm-openai:v0.3.6

All the other tests also use environments created from this image. The test device is an NVIDIA A100-SXM4-80GB.

  • Set the environment variables
unset $(env | awk -F= '/^LMCACHE_/ {print $1}')
# Specify LMCache V1
export LMCACHE_USE_EXPERIMENTAL=True
# 256 Tokens per KV Chunk
export LMCACHE_CHUNK_SIZE=256
# Enable CPU memory backend
export LMCACHE_LOCAL_CPU=True # default
# 50 GB of Pinned CPU memory
export LMCACHE_MAX_LOCAL_CPU_SIZE=50 # default 5.0
  • Start the model server
export CUDA_VISIBLE_DEVICES=7
/opt/venv/bin/vllm serve \
    /data/models/Qwen2.5-7B-Instruct \
    --no-enable-prefix-caching \
    --max-model-len 16384 \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
  • First run
/opt/venv/bin/vllm bench serve \
  --backend openai \
  --model /data/models/Qwen2.5-7B-Instruct \
  --dataset-name sharegpt \
  --dataset-path /data/models/ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 1024 \
  --request-rate 16
============ Serving Benchmark Result ============
Successful requests:                     1024
Request rate configured (RPS):           16.00
Benchmark duration (s):                  72.91
Total input tokens:                      225502
Total generated tokens:                  202560
Request throughput (req/s):              14.04
Output token throughput (tok/s):         2778.23
Total Token throughput (tok/s):          5871.13
---------------Time to First Token----------------
Mean TTFT (ms):                          62.06
Median TTFT (ms):                        55.99
P99 TTFT (ms):                           140.46
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          20.90
Median TPOT (ms):                        20.73
P99 TPOT (ms):                           36.28
---------------Inter-token Latency----------------
Mean ITL (ms):                           20.39
Median ITL (ms):                         15.81
P99 ITL (ms):                            72.54
==================================================
  • Second run (same command)
============ Serving Benchmark Result ============
Successful requests:                     1024
Request rate configured (RPS):           16.00
Benchmark duration (s):                  72.35
Total input tokens:                      225502
Total generated tokens:                  202945
Request throughput (req/s):              14.15
Output token throughput (tok/s):         2805.13
Total Token throughput (tok/s):          5922.04
---------------Time to First Token----------------
Mean TTFT (ms):                          32.65
Median TTFT (ms):                        32.43
P99 TTFT (ms):                           44.34
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          15.00
Median TPOT (ms):                        15.07
P99 TPOT (ms):                           16.15
---------------Inter-token Latency----------------
Mean ITL (ms):                           14.99
Median ITL (ms):                         14.72
P99 ITL (ms):                            19.05
==================================================
  • Check the logs
(EngineCore_DP0 pid=18318) [2025-09-18 05:07:16,918] LMCache INFO: Retrieved 776 out of total 776 out of total 776 tokens. size: 0.0414 gb, cost 2.0837 ms, throughput: 19.8891 GB/s; (cache_engine.py:519:lmcache.v1.cache_engine)

The log contains the KV Cache related entries. In this line, all 776 tokens of the request were retrieved from the cache.
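To see how many tokens are served from the cache over a whole run, lines like the one above can be aggregated with a small script. This is a sketch that assumes the log format stays exactly as in the sample line; the format may differ between LMCache versions, and the `summarize` helper is a hypothetical name.

```python
import re

# Matches the token counts, size and cost in an LMCache "Retrieved ..." log line.
RETRIEVED = re.compile(
    r"Retrieved (?P<hit>\d+) out of total (?P<total>\d+).*?"
    r"size: (?P<size_gb>[\d.]+) gb, cost (?P<cost_ms>[\d.]+) ms"
)


def summarize(lines):
    """Sum retrieved tokens, requested tokens and transferred size over log lines."""
    hit = total = 0
    size_gb = 0.0
    for line in lines:
        m = RETRIEVED.search(line)
        if m is None:
            continue  # not a retrieval line
        hit += int(m["hit"])
        total += int(m["total"])
        size_gb += float(m["size_gb"])
    return {"hit_tokens": hit, "total_tokens": total, "size_gb": size_gb}


sample = [
    "(EngineCore_DP0 pid=18318) [2025-09-18 05:07:16,918] LMCache INFO: "
    "Retrieved 776 out of total 776 out of total 776 tokens. size: 0.0414 gb, "
    "cost 2.0837 ms, throughput: 19.8891 GB/s; "
    "(cache_engine.py:519:lmcache.v1.cache_engine)",
    "unrelated log line",
]
print(summarize(sample))  # {'hit_tokens': 776, 'total_tokens': 776, 'size_gb': 0.0414}
```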

  • Summary
Metric (mean)    First run    Second run    Reduction
TTFT             62.06 ms     32.65 ms      47%
TPOT             20.90 ms     15.00 ms      28%
ITL              20.39 ms     14.99 ms      26%

3. Caching to Disk

  • Set the environment variables
unset $(env | awk -F= '/^LMCACHE_/ {print $1}')
export LMCACHE_CHUNK_SIZE=256
export LMCACHE_LOCAL_DISK="file:///data/models/lmcache/"
# 50GB of disk space
export LMCACHE_MAX_LOCAL_DISK_SIZE=50
export LMCACHE_LOCAL_CPU=False
# Inner quotes must be double quotes; nested single quotes would be stripped by the shell
export LMCACHE_EXTRA_CONFIG='{"use_odirect": True}'
export LMCACHE_USE_EXPERIMENTAL=True
  • Start the model server
export CUDA_VISIBLE_DEVICES=7
/opt/venv/bin/vllm serve \
    /data/models/Qwen2.5-7B-Instruct \
    --no-enable-prefix-caching \
    --max-model-len 16384 \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
  • First run
/opt/venv/bin/vllm bench serve \
  --backend openai \
  --model /data/models/Qwen2.5-7B-Instruct \
  --dataset-name sharegpt \
  --dataset-path /data/models/ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 1024 \
  --request-rate 16
============ Serving Benchmark Result ============
Successful requests:                     1024
Request rate configured (RPS):           16.00
Benchmark duration (s):                  72.92
Total input tokens:                      225502
Total generated tokens:                  202927
Request throughput (req/s):              14.04
Output token throughput (tok/s):         2783.03
Total Token throughput (tok/s):          5875.66
---------------Time to First Token----------------
Mean TTFT (ms):                          63.79
Median TTFT (ms):                        57.74
P99 TTFT (ms):                           145.83
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          21.34
Median TPOT (ms):                        21.15
P99 TPOT (ms):                           37.31
---------------Inter-token Latency----------------
Mean ITL (ms):                           20.78
Median ITL (ms):                         15.88
P99 ITL (ms):                            76.45
==================================================
  • Second run (same command)
============ Serving Benchmark Result ============
Successful requests:                     1024
Request rate configured (RPS):           16.00
Benchmark duration (s):                  72.49
Total input tokens:                      225502
Total generated tokens:                  201717
Request throughput (req/s):              14.13
Output token throughput (tok/s):         2782.66
Total Token throughput (tok/s):          5893.42
---------------Time to First Token----------------
Mean TTFT (ms):                          39.40
Median TTFT (ms):                        37.25
P99 TTFT (ms):                           89.50
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          16.21
Median TPOT (ms):                        15.99
P99 TPOT (ms):                           20.42
---------------Inter-token Latency----------------
Mean ITL (ms):                           16.17
Median ITL (ms):                         14.85
P99 ITL (ms):                            34.62
==================================================
  • Check the logs
(EngineCore_DP0 pid=21129) [2025-09-18 05:23:16,568] LMCache INFO: Retrieved 12 out of total 12 out of total 12 tokens. size: 0.0006 gb, cost 0.6591 ms, throughput: 0.9724 GB/s; (cache_engine.py:519:lmcache.v1.cache_engine)
(APIServer pid=20851) INFO 09-18 05:23:22 [loggers.py:123] Engine 000: Avg prompt throughput: 1700.5 tokens/s, Avg generation throughput: 1847.9 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
  • Inspect the cache files
ls -alh /data/models/lmcache/
-rw-r--r-- 1 root root   14M Sep 18 05:20 vllm@-data-models-Qwen2.5-7B-Instruct@1@0@f2f002abf32763.pt
-rw-r--r-- 1 root root   14M Sep 18 05:20 vllm@-data-models-Qwen2.5-7B-Instruct@1@0@f838a2991593dd7.pt
-rw-r--r-- 1 root root   12M Sep 18 05:20 vllm@-data-models-Qwen2.5-7B-Instruct@1@0@fb7cf79a0adacc1.pt
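To check how much of the 50 GB disk budget the cache is using, the file sizes can be summed with a few lines of Python. This is a minimal sketch; the `cache_usage` helper is a hypothetical name, and the `*.pt` pattern is assumed from the file names in the listing above.

```python
from pathlib import Path


def cache_usage(cache_dir, pattern="*.pt"):
    """Return (file_count, total_bytes) for cache files directly under cache_dir."""
    sizes = [p.stat().st_size for p in Path(cache_dir).glob(pattern) if p.is_file()]
    return len(sizes), sum(sizes)


if __name__ == "__main__":
    # Directory taken from LMCACHE_LOCAL_DISK in this section.
    count, total = cache_usage("/data/models/lmcache/")
    print(f"{count} files, {total / 2**30:.2f} GiB")
```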
  • Summary
Metric (mean)    First run    Second run    Reduction
TTFT             63.79 ms     39.40 ms      38%
TPOT             21.34 ms     16.21 ms      24%
ITL              20.78 ms     16.17 ms      22%

4. Caching to Redis

  • Start Redis
nerdctl run -d --name redis -p 6379:6379 redis:7
  • Set the environment variables
unset $(env | awk -F= '/^LMCACHE_/ {print $1}')
# Specify LMCache V1
export LMCACHE_USE_EXPERIMENTAL=True
# 256 Tokens per KV Chunk
export LMCACHE_CHUNK_SIZE=256
# Redis host
export LMCACHE_REMOTE_URL="redis://x.x.x.x:6379"
# Redis Sentinel hosts (for high availability)
# export LMCACHE_REMOTE_URL="redis-sentinel://localhost:26379,localhost:26380,localhost:26381"
# LMCache Server host
# export LMCACHE_REMOTE_URL="lm://localhost:65432"

# How to serialize and deserialize KV cache on remote transmission
export LMCACHE_REMOTE_SERDE="naive" # "naive" (default) or "cachegen"
  • Start the model server
export CUDA_VISIBLE_DEVICES=7
/opt/venv/bin/vllm serve \
    /data/models/Qwen2.5-7B-Instruct \
    --no-enable-prefix-caching \
    --max-model-len 16384 \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
  • First run
/opt/venv/bin/vllm bench serve \
  --backend openai \
  --model /data/models/Qwen2.5-7B-Instruct \
  --dataset-name sharegpt \
  --dataset-path /data/models/ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 1024 \
  --request-rate 16
============ Serving Benchmark Result ============
Successful requests:                     1024
Request rate configured (RPS):           16.00
Benchmark duration (s):                  72.90
Total input tokens:                      225502
Total generated tokens:                  202337
Request throughput (req/s):              14.05
Output token throughput (tok/s):         2775.41
Total Token throughput (tok/s):          5868.57
---------------Time to First Token----------------
Mean TTFT (ms):                          67.79
Median TTFT (ms):                        60.94
P99 TTFT (ms):                           165.84
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          22.20
Median TPOT (ms):                        21.75
P99 TPOT (ms):                           40.42
---------------Inter-token Latency----------------
Mean ITL (ms):                           21.43
Median ITL (ms):                         15.96
P99 ITL (ms):                            78.68
==================================================
  • Second run (same command)
============ Serving Benchmark Result ============
Successful requests:                     1024
Request rate configured (RPS):           16.00
Benchmark duration (s):                  72.91
Total input tokens:                      225502
Total generated tokens:                  202978
Request throughput (req/s):              14.04
Output token throughput (tok/s):         2783.98
Total Token throughput (tok/s):          5876.88
---------------Time to First Token----------------
Mean TTFT (ms):                          50.34
Median TTFT (ms):                        39.07
P99 TTFT (ms):                           142.44
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          18.80
Median TPOT (ms):                        17.32
P99 TPOT (ms):                           35.65
---------------Inter-token Latency----------------
Mean ITL (ms):                           18.17
Median ITL (ms):                         15.13
P99 ITL (ms):                            66.43
==================================================
  • Check the logs
(EngineCore_DP0 pid=23013) [2025-09-18 05:29:58,971] LMCache INFO: Storing KV cache for 776 out of 776 tokens (skip_leading_tokens=0) for request cmpl-benchmark-serving1022-0 (vllm_v1_adapter.py:988:lmcache.integration.vllm.vllm_v1_adapter)
  • Inspect the cache in Redis
nerdctl exec -it redis redis-cli KEYS "*"
4087) "vllm@/data/models/Qwen2.5-7B-Instruct@1@0@-7542097a982a0d29metadata"
4088) "vllm@/data/models/Qwen2.5-7B-Instruct@1@0@ab0b65969d69b56metadata"
4089) "vllm@/data/models/Qwen2.5-7B-Instruct@1@0@-49f1c05dccbcca9metadata"
4090) "vllm@/data/models/Qwen2.5-7B-Instruct@1@0@-2ede777a488b6923kv_bytes"
4091) "vllm@/data/models/Qwen2.5-7B-Instruct@1@0@-27a856291a779d38kv_bytes"
4092) "vllm@/data/models/Qwen2.5-7B-Instruct@1@0@7522166acc9c0267kv_bytes"
nerdctl exec -it redis redis-cli MEMORY USAGE "vllm@/data/models/Qwen2.5-7B-Instruct@1@0@-27a856291a779d38kv_bytes"
(integer) 524408

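MEMORY USAGE reports 524408 bytes for this key, about 0.5 MB, which is considerably smaller than the 12 MB to 14 MB chunk files produced by the disk backend.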

  • Summary
Metric (mean)    First run    Second run    Reduction
TTFT             67.79 ms     50.34 ms      26%
TPOT             22.20 ms     18.80 ms      15%
ITL              21.43 ms     18.17 ms      15%

5. Control Without LMCache

  • Start a container without LMCache
nerdctl run -it \
        -p 8000:8000 \
        --gpus all \
        --ipc=host \
        --ulimit memlock=-1 \
        --ulimit stack=67108864 \
        --name vllm \
        --volume /data/models:/data/models \
        --entrypoint /bin/bash \
        vllm/vllm-openai:v0.10.1.1
  • Start the model server
export CUDA_VISIBLE_DEVICES=7
python3 -m vllm.entrypoints.openai.api_server \
  --model /data/models/Qwen2.5-7B-Instruct \
  --served-model-name /data/models/Qwen2.5-7B-Instruct \
  --port 8000 \
  --gpu_memory_utilization 0.8 \
  --max-model-len 4096 \
  --max-seq-len-to-capture 8192 \
  --max-num-seqs 128 \
  --enforce-eager \
  --no-enable-prefix-caching
  • First run
vllm bench serve \
  --backend openai \
  --model /data/models/Qwen2.5-7B-Instruct \
  --dataset-name sharegpt \
  --dataset-path /data/models/ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 1024 \
  --request-rate 16
============ Serving Benchmark Result ============
Successful requests:                     1024
Request rate configured (RPS):           16.00
Benchmark duration (s):                  73.42
Total input tokens:                      225502
Total generated tokens:                  203130
Request throughput (req/s):              13.95
Output token throughput (tok/s):         2766.85
Total Token throughput (tok/s):          5838.43
---------------Time to First Token----------------
Mean TTFT (ms):                          61.55
Median TTFT (ms):                        54.89
P99 TTFT (ms):                           174.88
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          21.49
Median TPOT (ms):                        21.27
P99 TPOT (ms):                           36.38
---------------Inter-token Latency----------------
Mean ITL (ms):                           21.00
Median ITL (ms):                         16.70
P99 ITL (ms):                            72.61
==================================================
  • Second run (same command)
============ Serving Benchmark Result ============
Successful requests:                     1024
Request rate configured (RPS):           16.00
Benchmark duration (s):                  73.84
Total input tokens:                      225502
Total generated tokens:                  203659
Request throughput (req/s):              13.87
Output token throughput (tok/s):         2758.13
Total Token throughput (tok/s):          5812.08
---------------Time to First Token----------------
Mean TTFT (ms):                          59.70
Median TTFT (ms):                        54.41
P99 TTFT (ms):                           139.79
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          21.31
Median TPOT (ms):                        21.08
P99 TPOT (ms):                           36.62
---------------Inter-token Latency----------------
Mean ITL (ms):                           20.78
Median ITL (ms):                         16.63
P99 ITL (ms):                            71.70
==================================================
  • Summary
Metric (mean)    First run    Second run    Reduction
TTFT             61.55 ms     59.70 ms      3%
TPOT             21.49 ms     21.31 ms      1%
ITL              21.00 ms     20.78 ms      1%

6. Conclusion

This article uses benchmarks to show the effect of LMCache with three cache backends: memory, disk, and Redis.

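With the Qwen2.5-7B-Instruct model on an NVIDIA A100-SXM4-80GB, at a configured request rate of 16 requests per second, the reduction in mean latency from the first run to the second run is as follows: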

Cache backend    TTFT reduction    TPOT reduction    ITL reduction
Memory           47%               28%               26%
Disk             38%               24%               22%
Redis            26%               15%               15%
No LMCache       3%                1%                1%
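The percentages can be recomputed from the mean latencies reported in the benchmark outputs. The sketch below copies those means and applies (first - second) / first; the `RESULTS` layout and the `reduction_pct` helper are illustrative names only.

```python
# Mean latencies in ms as (first run, second run), copied from the benchmark outputs.
RESULTS = {
    "memory":     {"TTFT": (62.06, 32.65), "TPOT": (20.90, 15.00), "ITL": (20.39, 14.99)},
    "disk":       {"TTFT": (63.79, 39.40), "TPOT": (21.34, 16.21), "ITL": (20.78, 16.17)},
    "redis":      {"TTFT": (67.79, 50.34), "TPOT": (22.20, 18.80), "ITL": (21.43, 18.17)},
    "no LMCache": {"TTFT": (61.55, 59.70), "TPOT": (21.49, 21.31), "ITL": (21.00, 20.78)},
}


def reduction_pct(first_ms, second_ms):
    """Relative reduction of the second run versus the first, in percent."""
    return (first_ms - second_ms) / first_ms * 100


for backend, metrics in RESULTS.items():
    row = ", ".join(f"{name} {reduction_pct(*pair):.1f}%" for name, pair in metrics.items())
    print(f"{backend}: {row}")
```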
