Deploying AI Inference Services on Kubernetes
灵阙教研团队
Updated 2026-02-28
Overview
Deploying AI inference services on a Kubernetes cluster raises a few distinct challenges: GPU scheduling, managing large model images, inference latency optimization, and autoscaling policy. This article walks through a complete, engineering-oriented approach to serving AI models on K8s.
GPU Resource Management
Installing the NVIDIA Device Plugin
# Install the NVIDIA GPU Operator (recommended: manages driver, toolkit and device plugin together)
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set mig.strategy=single
# Verify GPUs are visible as allocatable resources
kubectl get nodes -o jsonpath='{.items[*].status.allocatable}' | jq '.["nvidia.com/gpu"]'
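If the allocatable count looks right, a quick end-to-end check is to schedule a throwaway pod that requests one GPU and runs nvidia-smi. A minimal sketch (the CUDA image tag is only an example; any CUDA base image works):
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04   # example tag
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: "1"
If the pod's logs show the GPU, scheduling and the driver stack are working.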
Requesting GPU Resources
# A basic Pod that requests a GPU
apiVersion: v1
kind: Pod
metadata:
  name: inference-pod
spec:
  containers:
  - name: inference
    image: myregistry/inference-server:v3
    resources:
      requests:
        cpu: "4"
        memory: "16Gi"
        nvidia.com/gpu: "1"        # request 1 GPU
      limits:
        cpu: "8"
        memory: "32Gi"
        nvidia.com/gpu: "1"
    env:
    - name: CUDA_VISIBLE_DEVICES
      value: "0"
    volumeMounts:
    - name: model-cache
      mountPath: /models
    - name: shm
      mountPath: /dev/shm          # shared memory, needed by PyTorch DataLoader workers
  volumes:
  - name: model-cache
    persistentVolumeClaim:
      claimName: model-pvc
  - name: shm
    emptyDir:
      medium: Memory
      sizeLimit: "8Gi"
  nodeSelector:
    accelerator: nvidia-a100       # pin to a specific GPU model
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
GPU Sharing (MIG and Time-slicing)
For models with light inference load, dedicating an entire GPU to one pod is wasteful. There are two ways to share a GPU:
# Option 1: NVIDIA MIG (A100-class data-center GPUs, hardware-level isolation)
# Partition a single A100 into multiple independent GPU instances
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-config
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-3g.20gb:                 # 20GB of memory per instance
        - devices: all
          mig-enabled: true
          mig-devices:
            "3g.20gb": 2           # carve out 2 instances
---
# Option 2: Time-slicing (works on any GPU, software-level time sharing)
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4            # each physical GPU is advertised as 4
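Neither ConfigMap takes effect on its own; the GPU Operator's device plugin has to be pointed at it. With the Helm chart this is done roughly as follows (a sketch; `any` is the ConfigMap key used above, and the values keys follow recent GPU Operator releases):
# Wire the time-slicing ConfigMap into the GPU Operator's device plugin
helm upgrade gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --reuse-values \
  --set devicePlugin.config.name=time-slicing-config \
  --set devicePlugin.config.default=any
After the device plugin restarts, the nvidia.com/gpu allocatable count on each node should report 4x the physical GPU count.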
Inference Serving Frameworks
Option 1: NVIDIA Triton Inference Server
Triton is the most mature production-grade inference server, supporting multiple frameworks, dynamic batching, and model ensembles.
# Model repository layout
model_repository/
├── text_classifier/
│   ├── config.pbtxt
│   ├── 1/                  # version 1
│   │   └── model.onnx
│   └── 2/                  # version 2
│       └── model.onnx
└── text_embedder/
    ├── config.pbtxt
    └── 1/
        └── model.plan      # TensorRT engine
# config.pbtxt
name: "text_classifier"
platform: "onnxruntime_onnx"
max_batch_size: 64
input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ -1 ]             # dynamic sequence length
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]
output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]
# Dynamic batching
dynamic_batching {
  preferred_batch_size: [ 8, 16, 32 ]
  max_queue_delay_microseconds: 100000   # wait at most 100ms to assemble a batch
}
# Model instances
instance_group [
  {
    count: 2                 # 2 model instances per GPU
    kind: KIND_GPU
  }
]
Kubernetes deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: triton
  template:
    metadata:
      labels:
        app: triton
    spec:
      containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:24.01-py3
        args:
        - tritonserver
        - --model-repository=s3://models/repository
        - --model-control-mode=poll
        - --repository-poll-secs=30
        ports:
        - containerPort: 8000    # HTTP
        - containerPort: 8001    # gRPC
        - containerPort: 8002    # metrics
        resources:
          requests:
            nvidia.com/gpu: "1"
          limits:
            nvidia.com/gpu: "1"
        readinessProbe:
          httpGet:
            path: /v2/health/ready
            port: 8000
          initialDelaySeconds: 30
        livenessProbe:
          httpGet:
            path: /v2/health/live
            port: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: triton-service
spec:
  selector:
    app: triton
  ports:
  - name: http
    port: 8000
  - name: grpc
    port: 8001
  - name: metrics
    port: 8002
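Once the Service is up, clients talk to Triton over the KServe v2 inference protocol. A minimal HTTP call against the text_classifier model above (the token IDs are made up for illustration):
curl -s http://triton-service:8000/v2/models/text_classifier/infer \
  -H 'Content-Type: application/json' \
  -d '{
        "inputs": [
          {"name": "input_ids",      "shape": [1, 6], "datatype": "INT64", "data": [101, 2023, 2003, 1037, 3231, 102]},
          {"name": "attention_mask", "shape": [1, 6], "datatype": "INT64", "data": [1, 1, 1, 1, 1, 1]}
        ]
      }'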
Option 2: KServe (Kubernetes-native ML Serving)
KServe provides a higher-level abstraction with built-in traffic management, autoscaling, and A/B testing.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-service
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 5
    scaleMetric: concurrency
    scaleTarget: 10                # target ~10 concurrent requests per replica
    canaryTrafficPercent: 10       # canary rollout: send 10% of traffic to the new revision
    model:
      modelFormat:
        name: pytorch
      runtime: kserve-torchserve
      storageUri: "s3://models/llama-8b"
      resources:
        requests:
          cpu: "4"
          memory: "16Gi"
          nvidia.com/gpu: "1"
        limits:
          cpu: "8"
          memory: "32Gi"
          nvidia.com/gpu: "1"
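KServe exposes the model behind its ingress gateway using the v1 data-plane protocol; a call looks roughly like this (the host, domain, and payload shape are illustrative and depend on your ingress setup and the model handler):
curl -s http://<ingress-host>/v1/models/llm-service:predict \
  -H "Host: llm-service.default.example.com" \
  -H "Content-Type: application/json" \
  -d '{"instances": [{"prompt": "Hello"}]}'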
Option 3: vLLM (built for large language models)
vLLM uses PagedAttention, which gives 2-4x better GPU memory utilization than serving directly with vanilla HuggingFace Transformers.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - --model=meta-llama/Llama-3.1-8B-Instruct
        - --tensor-parallel-size=1
        - --max-model-len=8192
        - --gpu-memory-utilization=0.9
        - --enable-prefix-caching
        ports:
        - containerPort: 8000
        resources:
          requests:
            nvidia.com/gpu: "1"
          limits:
            nvidia.com/gpu: "1"
        env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: token
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-pvc
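The vllm-openai image serves an OpenAI-compatible API on port 8000, so once a Service fronts this Deployment (a ClusterIP named vllm-server is assumed here), any OpenAI client or plain HTTP call works:
curl -s http://vllm-server:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 64
      }'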
Autoscaling
HPA Based on GPU Utilization
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-server
  minReplicas: 1
  maxReplicas: 8
  metrics:
  # GPU utilization
  - type: Pods
    pods:
      metric:
        name: DCGM_FI_DEV_GPU_UTIL
      target:
        type: AverageValue
        averageValue: "70"         # scale out when average GPU utilization exceeds 70%
  # Pending-request queue length
  - type: Pods
    pods:
      metric:
        name: inference_queue_size
      target:
        type: AverageValue
        averageValue: "10"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Pods
        value: 2
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 1
        periodSeconds: 120
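Note that DCGM_FI_DEV_GPU_UTIL is not visible to the HPA out of the box: it must be scraped from dcgm-exporter by Prometheus and then exposed through the custom metrics API, typically via prometheus-adapter. A sketch of an adapter rule (the label names are assumptions and depend on how dcgm-exporter is scraped in your cluster):
rules:
- seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "DCGM_FI_DEV_GPU_UTIL"
  metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'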
Event-driven Scaling with KEDA
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-scaler
spec:
  scaleTargetRef:
    name: inference-server
  minReplicaCount: 1               # keep 1 replica while there is any activity
  maxReplicaCount: 10
  idleReplicaCount: 0              # allow scaling all the way to zero when fully idle
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: inference_requests_pending
      query: sum(inference_requests_pending)
      threshold: "5"
  - type: rabbitmq
    metadata:
      host: amqp://rabbitmq:5672
      queueName: inference-queue
      queueLength: "10"
Model Warmup and Caching
# Use an init container to pre-download the model. The warmup requests are
# sent from a postStart hook rather than a second init container: an init
# container cannot reach the inference server over localhost, because the
# main container has not started yet.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-with-warmup
spec:
  template:
    spec:
      initContainers:
      # Stage 1: download the model
      - name: model-downloader
        image: amazon/aws-cli
        command: ["aws", "s3", "sync", "s3://models/v3/", "/models/"]
        volumeMounts:
        - name: model-volume
          mountPath: /models
      containers:
      - name: inference
        image: myregistry/inference-server:v3
        volumeMounts:
        - name: model-volume
          mountPath: /models
        # Stage 2: warm the model once the server process is up
        lifecycle:
          postStart:
            exec:
              command:
              - python
              - -c
              - |
                import json, time
                import requests
                warmup_inputs = json.load(open("/warmup/inputs.json"))
                # Retry until the local server starts accepting connections
                for _ in range(60):
                    try:
                        for inp in warmup_inputs:
                            requests.post("http://localhost:8080/predict", json=inp, timeout=10)
                        print("Warmup completed")
                        break
                    except requests.exceptions.ConnectionError:
                        time.sleep(2)
      volumes:
      - name: model-volume
        emptyDir:
          sizeLimit: 50Gi
Monitoring and Observability
# ServiceMonitor for Prometheus
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: inference-monitor
spec:
  selector:
    matchLabels:
      app: inference-server
  endpoints:
  - port: metrics
    interval: 15s
    path: /metrics
Key metrics to watch:

| Metric | Description | Alert threshold |
|---|---|---|
| `inference_latency_p99` | P99 inference latency | > 500ms |
| `gpu_memory_used_bytes` | GPU memory usage | > 90% |
| `gpu_utilization_percent` | GPU compute utilization | sustained < 20% (wasted capacity) or > 95% (overloaded) |
| `inference_queue_size` | Requests waiting to be served | > 100 |
| `model_load_time_seconds` | Model load time | > 120s |
| `inference_errors_total` | Inference error count | any increase |
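With the ServiceMonitor in place, the thresholds above can be turned into Prometheus alerts. One of them expressed as a PrometheusRule, as a sketch (it assumes inference_latency_p99 is exported as a gauge in seconds):
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: inference-alerts
spec:
  groups:
  - name: inference
    rules:
    - alert: InferenceLatencyP99High
      expr: inference_latency_p99 > 0.5    # assumes the gauge is reported in seconds
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "P99 inference latency has exceeded 500ms for 5 minutes"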
Cost Optimization
Spot/Preemptible Instances
# Use Spot instances to cut GPU cost (typically 60-70% cheaper)
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-spot: "true"               # GKE
        # kubernetes.azure.com/scalesetpriority: spot   # AKS
      tolerations:
      - key: cloud.google.com/gke-spot
        operator: Equal
        value: "true"
        effect: NoSchedule
      # Must be paired with graceful termination
      terminationGracePeriodSeconds: 30
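Since Spot nodes come and go, it also helps to cap voluntary disruptions (node drains, autoscaler scale-down) with a PodDisruptionBudget so at least one replica stays up; the app label below is an assumption and should match your Deployment:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: inference-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: inference-server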
Multi-priority Queues
# High-priority requests go to On-Demand GPUs, low-priority ones to Spot
import asyncio
from enum import Enum

import httpx


class Priority(Enum):
    REALTIME = 0   # On-Demand, SLA < 200ms
    STANDARD = 1   # On-Demand, SLA < 2s
    BATCH = 2      # Spot, no SLA


class InferenceRouter:
    def __init__(self):
        # Per-priority queues, for requests that are buffered rather than routed immediately
        self.queues = {p: asyncio.Queue() for p in Priority}
        self.endpoints = {
            Priority.REALTIME: "http://inference-ondemand:8000",
            Priority.STANDARD: "http://inference-ondemand:8000",
            Priority.BATCH: "http://inference-spot:8000",
        }

    async def _call(self, endpoint: str, request: dict, timeout: float) -> dict:
        # Minimal HTTP forwarding helper (added as a sketch; the /predict path
        # is an assumption about the backend's API)
        async with httpx.AsyncClient(timeout=timeout) as client:
            resp = await client.post(f"{endpoint}/predict", json=request)
            resp.raise_for_status()
            return resp.json()

    async def route(self, request: dict, priority: Priority) -> dict:
        endpoint = self.endpoints[priority]
        # When the Spot pool is unavailable, fall back to On-Demand automatically
        try:
            return await self._call(endpoint, request, timeout=5)
        except Exception:
            if priority == Priority.BATCH:
                return await self._call(
                    self.endpoints[Priority.STANDARD],
                    request,
                    timeout=30,
                )
            raise
Summary
Key points for deploying AI inference services on K8s:
- GPU scheduling: install the GPU Operator, and use MIG or time-slicing where appropriate to raise utilization
- Serving framework: Triton for classification-style models, vLLM for large language models, KServe when you want a Kubernetes-native experience
- Autoscaling: combine GPU utilization and queue length as scaling signals, and allow scale-to-zero
- Cost control: Spot instances, multi-priority queues, and model quantization
- Observability: end-to-end monitoring of GPU metrics, inference latency, and error rates
Maurice | maurice_wen@proton.me