Deploying AI Inference Services on Kubernetes

Overview

Deploying AI inference services on a Kubernetes cluster raises several distinctive challenges: GPU scheduling, managing images that contain large models, inference latency optimization, and autoscaling policy. This article presents, from an engineering-practice perspective, a complete approach to serving AI inference on K8s.

GPU Resource Management

Installing the NVIDIA Device Plugin

# Install the NVIDIA GPU Operator (recommended: manages driver, toolkit, and device plugin together)
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set mig.strategy=single

# Verify the GPUs are allocatable
kubectl get nodes -o jsonpath='{.items[*].status.allocatable}' | jq '.["nvidia.com/gpu"]'

Requesting GPU Resources

# Basic GPU Pod configuration
apiVersion: v1
kind: Pod
metadata:
  name: inference-pod
spec:
  containers:
    - name: inference
      image: myregistry/inference-server:v3
      resources:
        requests:
          cpu: "4"
          memory: "16Gi"
          nvidia.com/gpu: "1"       # request 1 GPU
        limits:
          cpu: "8"
          memory: "32Gi"
          nvidia.com/gpu: "1"
      # Note: no CUDA_VISIBLE_DEVICES needed here -- the device plugin
      # already limits the container to its allocated GPU(s)
      volumeMounts:
        - name: model-cache
          mountPath: /models
        - name: shm
          mountPath: /dev/shm      # shared memory, needed by PyTorch DataLoader workers
  volumes:
    - name: model-cache
      persistentVolumeClaim:
        claimName: model-pvc
    - name: shm
      emptyDir:
        medium: Memory
        sizeLimit: "8Gi"
  nodeSelector:
    accelerator: nvidia-a100      # pin to a specific GPU model
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule

GPU Sharing (MIG and Time-slicing)

For models with light inference load, dedicating a whole GPU is wasteful. There are two ways to share one:

# Option 1: NVIDIA MIG (A100/H100-class GPUs, hardware-level isolation)
# Partition one A100 into multiple independent GPU instances
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-config
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-3g.20gb:               # 20 GB of GPU memory per instance
        - devices: all
          mig-enabled: true
          mig-devices:
            "3g.20gb": 2         # carve out 2 instances

# Option 2: Time-slicing (any GPU, software-level time sharing)
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4          # advertise each physical GPU as 4 schedulable GPUs

Inference Serving Frameworks

Option 1: NVIDIA Triton Inference Server

Triton is the most mature production-grade inference server, with multi-framework support, dynamic batching, and model ensembles.

# Model repository layout
model_repository/
├── text_classifier/
│   ├── config.pbtxt
│   ├── 1/                    # version 1
│   │   └── model.onnx
│   └── 2/                    # version 2
│       └── model.onnx
└── text_embedder/
    ├── config.pbtxt
    └── 1/
        └── model.plan        # TensorRT engine

# config.pbtxt
name: "text_classifier"
platform: "onnxruntime_onnx"
max_batch_size: 64

input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ -1 ]             # dynamic sequence length
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]

output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]

# Dynamic batching configuration
dynamic_batching {
  preferred_batch_size: [ 8, 16, 32 ]
  max_queue_delay_microseconds: 100000   # wait up to 100 ms to fill a batch
}

# Model instance configuration
instance_group [
  {
    count: 2                  # 2 model instances per GPU
    kind: KIND_GPU
  }
]
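The interplay between preferred_batch_size and max_queue_delay_microseconds can be illustrated with a small self-contained sketch. This is not Triton's actual scheduler, just the collect-until-full-or-deadline idea; the constants mirror the config above, and the queue protocol is invented:

```python
import asyncio
import time

PREFERRED_BATCH_SIZE = 8
MAX_QUEUE_DELAY_S = 0.1   # 100 ms, mirroring max_queue_delay_microseconds

async def batcher(queue: asyncio.Queue, batches: list):
    """Gather requests until the preferred size or the delay budget, whichever first."""
    while True:
        first = await queue.get()
        if first is None:            # shutdown sentinel
            return
        batch = [first]
        deadline = time.monotonic() + MAX_QUEUE_DELAY_S
        while len(batch) < PREFERRED_BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                item = await asyncio.wait_for(queue.get(), timeout=remaining)
            except asyncio.TimeoutError:
                break
            if item is None:
                batches.append(batch)  # flush and stop on sentinel
                return
            batch.append(item)
        batches.append(batch)        # hand the batch to the model

async def main():
    queue: asyncio.Queue = asyncio.Queue()
    batches: list = []
    task = asyncio.create_task(batcher(queue, batches))
    for i in range(10):              # burst of 10 requests
        queue.put_nowait(i)
    queue.put_nowait(None)
    await task
    return batches

batches = asyncio.run(main())
print([len(b) for b in batches])     # [8, 2]: one full batch, one partial
```

A burst of 10 requests yields one full batch of 8 and a trailing partial batch of 2, exactly the latency/throughput trade-off the config encodes.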

K8s deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: triton
  template:
    metadata:
      labels:
        app: triton
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.01-py3
          args:
            - tritonserver
            - --model-repository=s3://models/repository
            - --model-control-mode=poll
            - --repository-poll-secs=30
          ports:
            - containerPort: 8000     # HTTP
            - containerPort: 8001     # gRPC
            - containerPort: 8002     # Metrics
          resources:
            requests:
              nvidia.com/gpu: "1"
            limits:
              nvidia.com/gpu: "1"
          readinessProbe:
            httpGet:
              path: /v2/health/ready
              port: 8000
            initialDelaySeconds: 30
          livenessProbe:
            httpGet:
              path: /v2/health/live
              port: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: triton-service
spec:
  selector:
    app: triton
  ports:
    - name: http
      port: 8000
    - name: grpc
      port: 8001
    - name: metrics
      port: 8002

Option 2: KServe (K8s-native ML serving)

KServe provides a higher-level abstraction with built-in traffic management, autoscaling, and A/B testing.

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-service
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 5
    scaleTarget: 10              # target concurrency of 10 per replica
    scaleMetric: concurrency
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://models/llama-8b"
      resources:
        requests:
          cpu: "4"
          memory: "16Gi"
          nvidia.com/gpu: "1"
        limits:
          cpu: "8"
          memory: "32Gi"
          nvidia.com/gpu: "1"
      runtime: kserve-torchserve

    # Canary release: route 10% of traffic to the latest revision
    canaryTrafficPercent: 10
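The semantics of a percentage-based canary split can be sketched as hash-based routing. KServe/Knative actually split traffic at the revision-routing layer; this toy (pick_revision and the request-id scheme are invented for illustration) only shows what "10% of traffic" means:

```python
import hashlib

CANARY_PERCENT = 10

def pick_revision(request_id: str) -> str:
    """Hash the request id into [0, 100) and compare to the canary share."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < CANARY_PERCENT else "stable"

# Over many requests, roughly 10% land on the canary revision
hits = sum(pick_revision(f"req-{i}") == "canary" for i in range(10_000))
print(f"canary share: {hits / 100:.1f}%")   # prints a share close to 10%
```

Hashing on a stable key (rather than random choice) also gives sticky routing: the same request id always hits the same revision.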

Option 3: vLLM (purpose-built for large language models)

vLLM uses PagedAttention to manage the KV cache, achieving 2-4x better GPU memory utilization than naive HuggingFace serving.
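The idea behind PagedAttention can be sketched in a few lines: KV-cache memory is carved into fixed-size blocks handed to sequences on demand, so no sequence reserves its worst-case length up front. PagedKVCache, the block size, and the capacities below are illustrative, not vLLM's actual API:

```python
BLOCK_SIZE = 16          # tokens per KV-cache block (illustrative)

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}   # seq id -> block ids

    def append_token(self, seq_id: str, token_count: int) -> None:
        """Ensure seq_id has enough blocks for token_count tokens."""
        table = self.block_tables.setdefault(seq_id, [])
        needed = -(-token_count // BLOCK_SIZE)          # ceil division
        while len(table) < needed:
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; preempt a sequence")
            table.append(self.free_blocks.pop())

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=8)      # 8 * 16 = 128 tokens of KV capacity
cache.append_token("seq-a", 40)         # needs ceil(40/16) = 3 blocks
cache.append_token("seq-b", 20)         # needs 2 blocks
print(len(cache.free_blocks))           # 3 blocks still free
cache.free("seq-a")
print(len(cache.free_blocks))           # 6 after seq-a finishes
```

Because blocks are returned as soon as a sequence finishes, internal fragmentation stays bounded by one block per sequence instead of the sequence's full max length.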

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model=meta-llama/Llama-3.1-8B-Instruct
            - --tensor-parallel-size=1
            - --max-model-len=8192
            - --gpu-memory-utilization=0.9
            - --enable-prefix-caching
          ports:
            - containerPort: 8000
          resources:
            requests:
              nvidia.com/gpu: "1"
            limits:
              nvidia.com/gpu: "1"
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: token
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc

Autoscaling

HPA Based on GPU Utilization

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-server
  minReplicas: 1
  maxReplicas: 8
  metrics:
    # GPU utilization (exported by DCGM)
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL
        target:
          type: AverageValue
          averageValue: "70"       # scale out when GPU utilization > 70%
    # Request queue length
    - type: Pods
      pods:
        metric:
          name: inference_queue_size
        target:
          type: AverageValue
          averageValue: "10"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120
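For intuition, the HPA evaluates each metric with the standard formula desiredReplicas = ceil(currentReplicas × currentMetricValue / target) and takes the maximum across metrics. A quick sketch with the numbers from the manifest above:

```python
import math

def desired_replicas(current: int, metric: float, target: float,
                     min_r: int = 1, max_r: int = 8) -> int:
    """Kubernetes HPA formula, clamped to the min/max replica bounds."""
    desired = math.ceil(current * metric / target)
    return max(min_r, min(max_r, desired))

# 4 replicas averaging 90% GPU utilization against the 70% target:
print(desired_replicas(4, 90, 70))    # 6
# Queue metric: 25 pending per pod vs. the target of 10:
print(desired_replicas(4, 25, 10))    # 8 -> capped at maxReplicas
# The HPA acts on the larger of the two, so it scales to 8.
```

The scaleUp/scaleDown behavior policies then rate-limit how fast the deployment may move toward that desired count.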

Event-Driven Scaling with KEDA

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-scaler
spec:
  scaleTargetRef:
    name: inference-server
  minReplicaCount: 0           # allow scaling to zero when there is no work
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: inference_requests_pending
        query: sum(inference_requests_pending)
        threshold: "5"
    - type: rabbitmq
      metadata:
        host: amqp://rabbitmq:5672
        queueName: inference-queue
        queueLength: "10"

Model Warmup and Caching

# Preload the model with init containers
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-with-warmup
spec:
  template:
    spec:
      initContainers:
        # Stage 1: download the model
        - name: model-downloader
          image: amazon/aws-cli
          command: ["aws", "s3", "sync", "s3://models/v3/", "/models/"]
          volumeMounts:
            - name: model-volume
              mountPath: /models
        # Stage 2: warm the OS page cache. An init container cannot send
        # requests to the inference server (the main container starts only
        # after all init containers finish), so request-based warmup belongs
        # in a postStart hook or startupProbe; here we pre-read the model
        # files so the first real load is fast.
        - name: model-warmup
          image: myregistry/inference-server:v3
          command:
            - python
            - -c
            - |
              import pathlib
              # Read every model file once to pull it into the page cache
              for f in pathlib.Path("/models").rglob("*"):
                  if f.is_file():
                      with f.open("rb") as fh:
                          while fh.read(1 << 20):
                              pass
              print("Warmup completed")
          volumeMounts:
            - name: model-volume
              mountPath: /models
      containers:
        - name: inference
          image: myregistry/inference-server:v3
          volumeMounts:
            - name: model-volume
              mountPath: /models
      volumes:
        - name: model-volume
          emptyDir:
            sizeLimit: 50Gi
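Request-based warmup, once the server is reachable (e.g., from a postStart hook or a one-off Job), needs a set of representative payloads. A hypothetical generator and send loop -- build_warmup_inputs, the payload shapes, and the endpoint are all assumptions:

```python
def build_warmup_inputs(seq_lens=(16, 128, 512), batch_sizes=(1, 8)):
    """Representative payloads covering common batch sizes and sequence lengths."""
    return [
        {
            "input_ids": [[101] * sl for _ in range(bs)],
            "attention_mask": [[1] * sl for _ in range(bs)],
        }
        for bs in batch_sizes
        for sl in seq_lens
    ]

def run_warmup(post, url="http://inference:8080/predict"):
    """post(url, json=payload) is injected, e.g. requests.post in a real run."""
    sent = 0
    for payload in build_warmup_inputs():
        post(url, json=payload)
        sent += 1
    return sent

# Example with a no-op poster (a real run would pass requests.post):
print(run_warmup(lambda url, json: None))   # 6 warmup requests sent
```

Covering the shapes the model will actually see matters because many runtimes (TensorRT, torch.compile) specialize kernels per shape on first use.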

Monitoring and Observability

# ServiceMonitor for Prometheus
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: inference-monitor
spec:
  selector:
    matchLabels:
      app: inference-server
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics

Key metrics to monitor:

Metric                     Description                        Alert threshold
inference_latency_p99      P99 inference latency              > 500 ms
gpu_memory_used_bytes      GPU memory in use                  > 90%
gpu_utilization_percent    GPU compute utilization            sustained < 20% (waste) or > 95% (overload)
inference_queue_size       Requests waiting for inference     > 100
model_load_time_seconds    Model load time                    > 120 s
inference_errors_total     Inference error count              any increase
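The P99 latency row is typically computed from Prometheus histogram buckets. A sketch of the linear interpolation behind PromQL's histogram_quantile (the bucket bounds and counts below are made up):

```python
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """buckets: sorted (upper_bound_seconds, cumulative_count) pairs."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if count == prev_count:      # empty bucket: no interpolation possible
                return bound
            # linear interpolation within this bucket
            return prev_bound + (bound - prev_bound) * (
                (rank - prev_count) / (count - prev_count)
            )
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# cumulative counts: 900 requests <= 0.1 s, 990 <= 0.25 s, 1000 <= 0.5 s
buckets = [(0.1, 900), (0.25, 990), (0.5, 1000)]
print(histogram_quantile(0.99, buckets))   # 0.25 -> p99 lands on the 0.25 s bound
```

This is why bucket boundaries matter: the estimate can never be more precise than the bucket the quantile falls into, so place a bucket edge near your SLO (e.g., 0.5 s for the 500 ms alert).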

Cost Optimization

Spot/Preemptible Instances

# Use Spot instances to cut GPU cost (typically 60-70% cheaper)
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-spot: "true"    # GKE
        # kubernetes.azure.com/scalesetpriority: spot  # AKS
      tolerations:
        - key: cloud.google.com/gke-spot
          operator: Equal
          value: "true"
          effect: NoSchedule
      # must be paired with graceful termination
      terminationGracePeriodSeconds: 30
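terminationGracePeriodSeconds only helps if the server actually drains on SIGTERM. A minimal sketch of that shutdown path (the drain loop and in-flight model are invented for illustration):

```python
import signal
import threading
import time

# On SIGTERM (Spot preemption or pod eviction): stop accepting new work,
# then drain in-flight requests within the grace-period budget.
shutting_down = threading.Event()

def handle_sigterm(signum, frame):
    shutting_down.set()            # readiness probe should now fail too

signal.signal(signal.SIGTERM, handle_sigterm)

def drain(in_flight: list, grace_seconds: float = 30.0) -> bool:
    """Wait for in-flight requests to finish; True if fully drained in time."""
    deadline = time.monotonic() + grace_seconds
    while in_flight and time.monotonic() < deadline:
        in_flight.pop()            # stand-in for "a request completed"
        time.sleep(0.001)
    return not in_flight

# Simulate a preemption notice:
signal.raise_signal(signal.SIGTERM)
print(shutting_down.is_set())      # True: no new work accepted
print(drain(list(range(5))))       # True: all 5 in-flight requests drained
```

Failing the readiness probe as soon as the flag is set removes the pod from Service endpoints, so no new requests arrive while the drain runs.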

Multi-Priority Queues

# Route high-priority requests to On-Demand GPUs, low-priority to Spot
import asyncio
from enum import Enum

class Priority(Enum):
    REALTIME = 0   # On-Demand, SLA < 200 ms
    STANDARD = 1   # On-Demand, SLA < 2 s
    BATCH = 2      # Spot, no SLA

class InferenceRouter:
    def __init__(self):
        # One queue per priority class (drained by batch workers, elided here)
        self.queues = {p: asyncio.Queue() for p in Priority}
        self.endpoints = {
            Priority.REALTIME: "http://inference-ondemand:8000",
            Priority.STANDARD: "http://inference-ondemand:8000",
            Priority.BATCH: "http://inference-spot:8000",
        }

    async def route(self, request, priority: Priority):
        endpoint = self.endpoints[priority]
        # _call (HTTP client helper, elided here) sends the request;
        # fall back from Spot to On-Demand automatically on failure
        try:
            return await self._call(endpoint, request, timeout=5)
        except Exception:
            if priority == Priority.BATCH:
                return await self._call(
                    self.endpoints[Priority.STANDARD],
                    request,
                    timeout=30,
                )
            raise
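To exercise the fallback path without real endpoints, here is a self-contained variant with a stubbed _call; FakeRouter, its endpoint names, and the failure mode are hypothetical (the Priority enum is repeated so the sketch runs on its own):

```python
import asyncio
from enum import Enum

class Priority(Enum):
    REALTIME = 0
    STANDARD = 1
    BATCH = 2

class FakeRouter:
    """Same routing logic as InferenceRouter, with _call stubbed out."""
    def __init__(self, spot_up: bool):
        self.spot_up = spot_up
        self.endpoints = {
            Priority.REALTIME: "ondemand",
            Priority.STANDARD: "ondemand",
            Priority.BATCH: "spot",
        }

    async def _call(self, endpoint, request, timeout):
        if endpoint == "spot" and not self.spot_up:
            raise ConnectionError("spot pool preempted")
        return f"{endpoint}:{request}"

    async def route(self, request, priority: Priority):
        endpoint = self.endpoints[priority]
        try:
            return await self._call(endpoint, request, timeout=5)
        except Exception:
            if priority == Priority.BATCH:
                # batch traffic degrades to the On-Demand pool
                return await self._call(
                    self.endpoints[Priority.STANDARD], request, timeout=30
                )
            raise

print(asyncio.run(FakeRouter(spot_up=False).route("req-1", Priority.BATCH)))  # ondemand:req-1
print(asyncio.run(FakeRouter(spot_up=True).route("req-2", Priority.BATCH)))   # spot:req-2
```

When the Spot pool is preempted, batch requests transparently land on On-Demand; realtime and standard traffic would instead surface the error, since silently degrading their SLA is worse than failing fast.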

Summary

Key points for serving AI inference on K8s:

  1. GPU scheduling: install the GPU Operator; use MIG/Time-slicing judiciously to raise utilization
  2. Serving framework: Triton for classic models, vLLM for large language models, KServe for a K8s-native experience
  3. Autoscaling: combine GPU utilization and queue-length metrics; allow scale-to-zero
  4. Cost control: Spot instances + multi-priority queues + model quantization
  5. Observability: end-to-end monitoring of GPU metrics, inference latency, and error rates

Maurice | maurice_wen@proton.me