Fault Tolerance for AI Systems: Degradation, Circuit Breaking, and Fallback

LLM circuit breaker patterns, graceful degradation strategies, fallback model chains, retry mechanisms, and production fault-tolerance architecture

Introduction

LLM applications face reliability challenges well beyond those of a traditional API service. Provider APIs time out intermittently, model output quality fluctuates, token quotas run dry, and content filters reject legitimate requests; any single failing link takes the user request down with it. Worse still, LLM calls are usually synchronous and blocking, so one 30-second timeout pins a user connection for its full duration.

This article is a systematic walkthrough of building a production-grade fault-tolerance stack for LLM applications.

Failure Mode Analysis

A Taxonomy of LLM System Failures

Failure type        Trigger                           Impact                      Recovery
Provider outage     Cloud-wide service failure        All requests fail           Minutes to hours
Rate limiting       RPM/TPM quota exceeded            Some requests get 429       Seconds to minutes
Timeout             Model overload / network jitter   Single request fails        Retry immediately
Content filtering   Safety review triggered           Output truncated/refused    Adjust the prompt
Quality regression  Model update / config change      Output quality drops        Hours to days
Token overflow      Input/output exceeds context      Request rejected            Truncate immediately
Malformed output    Output in unexpected format       Parsing fails               Retry or degrade

Failure Propagation Chain

How a single point of failure cascades:

User request → API Gateway → LLM Provider (down!)
                            │
                            ▼
              Thread blocked for 30s (waiting on timeout)
                            │
                            ▼
              Connection pool exhausted (all threads occupied)
                            │
                            ▼
              Gateway returns 503 (all users affected)
                            │
                            ▼
              Frontend retry storm (exponential amplification)
                            │
                            ▼
              System avalanche (complete outage)
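The first break in this chain is bounding the wait. A minimal sketch of that idea, assuming nothing from the implementations below (the helper name `withDeadline` is illustrative):

```typescript
// Hypothetical helper: race a call against a hard deadline so a hung provider
// surfaces as a fast, handleable error instead of a thread blocked for 30s.
function withDeadline<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`deadline of ${ms}ms exceeded`)),
      ms,
    );
  });
  // Whichever settles first wins; always clear the timer to avoid leaks
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}
```

Wrapping every provider call this way converts the "thread blocked for 30s" step above into an immediate, classifiable failure that the retry and fallback layers can act on.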

Circuit Breaker Pattern

The Three-State Circuit Breaker

┌──────────────────────────────────────────────────┐
│                                                  │
│          ┌────────────┐                          │
│     ┌───▶│   CLOSED   │◀──── success count met   │
│     │    │ (pass all) │                          │
│     │    └──────┬─────┘                          │
│     │           │                                │
│     │   failures exceed threshold                │
│     │           │                                │
│     │    ┌──────▼─────┐                          │
│     │    │    OPEN    │                          │
│     │    │ (fail fast)│──── timeout → half-open  │
│     │    └──────┬─────┘                          │
│     │           │                                │
│     │    ┌──────▼─────┐                          │
│     │    │ HALF-OPEN  │                          │
│     └────│ (probing)  │──── probe fails → OPEN   │
│          └────────────┘                          │
└──────────────────────────────────────────────────┘

A Production-Grade Circuit Breaker Implementation

// src/resilience/circuit-breaker.ts
interface CircuitBreakerConfig {
  name: string;
  failureThreshold: number;       // Failures to trip
  recoveryTimeoutMs: number;      // Wait before half-open
  halfOpenMaxRequests: number;    // Test requests in half-open
  monitorWindowMs: number;        // Sliding window for counting
  onStateChange?: (from: string, to: string) => void;
}

class CircuitBreaker {
  private state: "closed" | "open" | "half-open" = "closed";
  private failures: number[] = [];
  private halfOpenSuccesses = 0;
  private halfOpenFailures = 0;
  private openedAt = 0;

  constructor(private config: CircuitBreakerConfig) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (!this.canExecute()) {
      throw new CircuitOpenError(
        `Circuit ${this.config.name} is OPEN, failing fast`
      );
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private canExecute(): boolean {
    switch (this.state) {
      case "closed":
        return true;

      case "open":
        if (Date.now() - this.openedAt >= this.config.recoveryTimeoutMs) {
          this.transitionTo("half-open");
          return true;
        }
        return false;

      case "half-open":
        return (
          this.halfOpenSuccesses + this.halfOpenFailures <
          this.config.halfOpenMaxRequests
        );
    }
  }

  private onSuccess(): void {
    if (this.state === "half-open") {
      this.halfOpenSuccesses++;
      if (this.halfOpenSuccesses >= this.config.halfOpenMaxRequests) {
        this.transitionTo("closed");
      }
    } else if (this.state === "closed") {
      // Clear old failures
      this.pruneFailures();
    }
  }

  private onFailure(): void {
    if (this.state === "half-open") {
      // A single failed probe reopens the circuit; count it so the
      // half-open request budget in canExecute() stays accurate
      this.halfOpenFailures++;
      this.transitionTo("open");
      return;
    }

    this.failures.push(Date.now());
    this.pruneFailures();

    if (this.failures.length >= this.config.failureThreshold) {
      this.transitionTo("open");
    }
  }

  private pruneFailures(): void {
    const cutoff = Date.now() - this.config.monitorWindowMs;
    this.failures = this.failures.filter(t => t > cutoff);
  }

  private transitionTo(newState: typeof this.state): void {
    const oldState = this.state;
    this.state = newState;

    if (newState === "open") {
      this.openedAt = Date.now();
    }
    if (newState === "half-open") {
      this.halfOpenSuccesses = 0;
      this.halfOpenFailures = 0;
    }
    if (newState === "closed") {
      this.failures = [];
    }

    this.config.onStateChange?.(oldState, newState);
  }
}

class CircuitOpenError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "CircuitOpenError";
  }
}
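The windowed counting that `onFailure` and `pruneFailures` implement can be distilled into a standalone predicate. A sketch for illustration only (the real class additionally manages state transitions):

```typescript
// Sliding-window trip check: only failures inside the window count toward
// the threshold, so old incidents age out instead of accumulating forever.
function shouldTrip(
  failureTimestamps: number[],
  now: number,
  windowMs: number,
  threshold: number,
): boolean {
  const recent = failureTimestamps.filter((t) => now - t <= windowMs);
  return recent.length >= threshold;
}
```

With a 60-second window and a threshold of 3, three failures from an hour ago do not trip the breaker, but three in the last minute do.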

Fallback Model Chains

Tiered Degradation Strategy

Degradation tiers:

Level 0 (optimal):  Claude Sonnet 4          → highest quality
Level 1 (backup):   GPT-4o                   → high-quality alternative
Level 2 (economy):  Gemini Flash             → fast and cheap
Level 3 (local):    Local Llama 8B           → no external dependency
Level 4 (template): Rule engine + templates  → zero LLM calls
Level 5 (floor):    Canned responses         → no reasoning ability
Implementing the Fallback Chain

// src/resilience/fallback-chain.ts
interface FallbackProvider {
  name: string;
  model: string;
  breaker: CircuitBreaker;
  call: (request: LLMRequest) => Promise<LLMResponse>;
  costPerToken: number;
  qualityTier: "premium" | "standard" | "economy" | "fallback";
}

class FallbackChain {
  constructor(private providers: FallbackProvider[]) {}

  async execute(request: LLMRequest): Promise<FallbackResult> {
    // Keep the attempt log local so concurrent execute() calls cannot
    // interleave their records through shared instance state
    const attempts: AttemptRecord[] = [];

    for (const provider of this.providers) {
      const start = Date.now();

      try {
        const response = await provider.breaker.execute(
          () => this.callWithTimeout(provider, request)
        );

        attempts.push({
          provider: provider.name,
          model: provider.model,
          status: "success",
          latencyMs: Date.now() - start,
        });

        return {
          response,
          provider: provider.name,
          model: provider.model,
          qualityTier: provider.qualityTier,
          attempts,
          degraded: provider !== this.providers[0],
        };
      } catch (error) {
        attempts.push({
          provider: provider.name,
          model: provider.model,
          status: "failed",
          latencyMs: Date.now() - start,
          error: this.classifyError(error),
        });
      }
    }

    // All providers exhausted
    return this.templateFallback(attempts);
  }

  private async callWithTimeout(
    provider: FallbackProvider,
    request: LLMRequest,
  ): Promise<LLMResponse> {
    const timeout = this.getTimeout(provider.qualityTier);
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeout);

    try {
      return await provider.call({ ...request, signal: controller.signal });
    } finally {
      clearTimeout(timer);
    }
  }

  private getTimeout(tier: string): number {
    const timeouts: Record<string, number> = {
      premium: 30_000,
      standard: 20_000,
      economy: 15_000,
      fallback: 10_000,
    };
    return timeouts[tier] ?? 15_000;
  }

  private classifyError(error: unknown): string {
    if (error instanceof CircuitOpenError) return "circuit_open";
    if (error instanceof Error) {
      if (error.name === "AbortError") return "timeout";
      if (error.message.includes("429")) return "rate_limited";
      if (error.message.includes("500")) return "server_error";
      if (error.message.includes("content_filter")) return "content_filtered";
    }
    return "unknown";
  }

  private templateFallback(attempts: AttemptRecord[]): FallbackResult {
    // Last resort: use template-based response
    const response: LLMResponse = {
      content: "I apologize, but I am currently unable to process your request. "
        + "Please try again in a few minutes or contact support.",
      model: "template-fallback",
      usage: { inputTokens: 0, outputTokens: 0 },
    };

    return {
      response,
      provider: "template",
      model: "fallback",
      qualityTier: "fallback",
      attempts,
      degraded: true,
    };
  }
}
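Stripped of breakers, timeouts, and telemetry, the chain's core loop is "first success wins, remember why everything earlier failed". A self-contained sketch of just that pattern (names are illustrative):

```typescript
// Try candidates in priority order; return the first success along with the
// reasons every earlier candidate failed (useful for logging degradation).
async function firstSuccess<T>(
  candidates: Array<{ name: string; run: () => Promise<T> }>,
): Promise<{ value: T; provider: string; errors: string[] }> {
  const errors: string[] = [];
  for (const c of candidates) {
    try {
      return { value: await c.run(), provider: c.name, errors };
    } catch (e) {
      errors.push(`${c.name}: ${e instanceof Error ? e.message : String(e)}`);
    }
  }
  throw new Error(`all candidates failed: ${errors.join("; ")}`);
}
```

Returning the accumulated `errors` alongside the winner is what lets the caller see that the response was served degraded, not just that it was served.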

Retry Strategies

Smart Retry

// src/resilience/retry.ts
interface RetryConfig {
  maxRetries: number;
  initialDelayMs: number;
  maxDelayMs: number;
  backoffMultiplier: number;
  retryableErrors: Set<string>;
  jitterFactor: number;           // 0-1, prevents thundering herd
}

class SmartRetrier {
  constructor(
    private config: RetryConfig = {
      maxRetries: 3,
      initialDelayMs: 1000,
      maxDelayMs: 30000,
      backoffMultiplier: 2,
      retryableErrors: new Set([
        "timeout",
        "rate_limited",
        "server_error",
        "connection_reset",
      ]),
      jitterFactor: 0.3,
    }
  ) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    let lastError: Error | undefined;
    let delay = this.config.initialDelayMs;

    for (let attempt = 0; attempt <= this.config.maxRetries; attempt++) {
      try {
        return await fn();
      } catch (error) {
        lastError = error as Error;
        const errorType = this.classifyError(error);

        // Don't retry non-retryable errors
        if (!this.config.retryableErrors.has(errorType)) {
          throw error;
        }

        // Don't retry after last attempt
        if (attempt === this.config.maxRetries) {
          break;
        }

        // Special handling for rate limits
        if (errorType === "rate_limited") {
          const retryAfter = this.extractRetryAfter(error);
          if (retryAfter) {
            delay = retryAfter * 1000;
          }
        }

        // Add jitter to prevent thundering herd
        const jitter = delay * this.config.jitterFactor * Math.random();
        const actualDelay = Math.min(delay + jitter, this.config.maxDelayMs);

        await sleep(actualDelay);

        // Exponential backoff
        delay = Math.min(
          delay * this.config.backoffMultiplier,
          this.config.maxDelayMs,
        );
      }
    }

    throw lastError;
  }

  private classifyError(error: unknown): string {
    if (!(error instanceof Error)) return "unknown";
    const msg = error.message.toLowerCase();
    if (msg.includes("timeout") || msg.includes("etimedout")) return "timeout";
    if (msg.includes("429") || msg.includes("rate")) return "rate_limited";
    if (msg.includes("500") || msg.includes("502") || msg.includes("503")) return "server_error";
    if (msg.includes("econnreset") || msg.includes("econnrefused")) return "connection_reset";
    if (msg.includes("content") && msg.includes("filter")) return "content_filtered";
    return "unknown";
  }

  private extractRetryAfter(error: unknown): number | undefined {
    // Extract Retry-After header value
    if (error instanceof Error && "headers" in error) {
      const headers = (error as any).headers;
      const retryAfter = headers?.["retry-after"];
      if (retryAfter) return parseInt(retryAfter, 10);
    }
    return undefined;
  }
}

function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}
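Before jitter and the Retry-After override, the base schedule the retrier produces is plain capped exponential backoff. A sketch of just that arithmetic:

```typescript
// Capped exponential backoff: the delay grows by `multiplier` per attempt
// and never exceeds `capMs`. Jitter would be layered on top of each entry.
function backoffSchedule(
  initialMs: number,
  multiplier: number,
  capMs: number,
  attempts: number,
): number[] {
  const delays: number[] = [];
  let delay = initialMs;
  for (let i = 0; i < attempts; i++) {
    delays.push(Math.min(delay, capMs));
    delay = Math.min(delay * multiplier, capMs);
  }
  return delays;
}

// backoffSchedule(1_000, 2, 30_000, 6) → [1000, 2000, 4000, 8000, 16000, 30000]
```

The cap matters: without it, attempt six of an uncapped doubling schedule would already wait over half a minute, which is longer than most user-facing requests can afford.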

Output Validation and Repair

Guarding Structured Output

// src/resilience/output-guard.ts
import { z } from "zod";

class OutputGuard<T> {
  constructor(
    private schema: z.ZodSchema<T>,
    private repairAttempts: number = 2,
  ) {}

  async validate(
    rawOutput: string,
    repairFn?: (output: string, error: string) => Promise<string>,
  ): Promise<T> {
    // Attempt 1: Direct parse
    const parsed = this.tryParse(rawOutput);
    if (parsed.success) return parsed.data;

    // Attempt 2-N: Repair and retry, feeding each attempt the latest error
    // rather than the stale error from the first parse
    let currentOutput = rawOutput;
    let lastError = parsed.error;
    for (let i = 0; i < this.repairAttempts; i++) {
      if (!repairFn) break;

      currentOutput = await repairFn(currentOutput, lastError);
      const repaired = this.tryParse(currentOutput);
      if (repaired.success) return repaired.data;
      lastError = repaired.error;
    }

    throw new OutputValidationError(
      `Failed to parse output after ${this.repairAttempts} repair attempts`,
      rawOutput,
      lastError,
    );
  }

  private tryParse(output: string): { success: true; data: T } | { success: false; error: string } {
    try {
      // Try JSON extraction from markdown code blocks
      const jsonMatch = output.match(/```(?:json)?\s*([\s\S]*?)```/);
      const jsonStr = jsonMatch ? jsonMatch[1].trim() : output.trim();

      const data = JSON.parse(jsonStr);
      const validated = this.schema.parse(data);
      return { success: true, data: validated };
    } catch (error) {
      return {
        success: false,
        error: error instanceof Error ? error.message : String(error),
      };
    }
  }
}

class OutputValidationError extends Error {
  constructor(
    message: string,
    public readonly rawOutput: string,
    public readonly parseError: string,
  ) {
    super(message);
    this.name = "OutputValidationError";
  }
}

// Usage example
const responseSchema = z.object({
  answer: z.string().min(1),
  confidence: z.number().min(0).max(1),
  sources: z.array(z.string()),
});

const guard = new OutputGuard(responseSchema);

const validated = await guard.validate(
  llmOutput,
  async (output, error) => {
    // Use LLM to repair the output
    return await callLLM(`Fix this JSON: ${output}\nError: ${error}`);
  },
);
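The fence-stripping step inside `tryParse` deserves a standalone look, since models routinely wrap JSON in markdown code blocks even when told not to. A minimal sketch of that extraction alone:

```typescript
// Pull a JSON payload out of an optional ```json ... ``` fence; fall back to
// treating the whole string as JSON when no fence is present.
function extractJson(output: string): unknown {
  const match = output.match(/```(?:json)?\s*([\s\S]*?)```/);
  return JSON.parse((match?.[1] ?? output).trim());
}
```

This only normalizes the envelope; schema validation (the zod step above) is still needed to guarantee the payload's shape.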

Fault-Tolerance Architecture Overview

┌──────────────────────────────────────────────────────────────┐
│             Complete fault-tolerance architecture            │
│                                                              │
│  Request ──→ [Rate limiter] ──→ [Cache lookup] ──→ [Router]  │
│                    │                 │                │      │
│            degrade on limit    hit: return     pick provider │
│                                         │                    │
│                              ┌──────────▼───────────┐        │
│                              │  Retry + breaker     │        │
│                              │                      │        │
│                              │  Provider A (primary)│        │
│                              │      │ on failure    │        │
│                              │  Provider B (backup) │        │
│                              │      │ on failure    │        │
│                              │  Local model (tier 3)│        │
│                              │      │ on failure    │        │
│                              │  Template engine     │        │
│                              └──────────┬───────────┘        │
│                                         │                    │
│                              ┌──────────▼───────────┐        │
│                              │  Validate + repair   │        │
│                              │  Schema / safety     │        │
│                              └──────────┬───────────┘        │
│                                         │                    │
│                              ┌──────────▼───────────┐        │
│                              │  Cache write + logs  │        │
│                              │  Metrics + traces    │        │
│                              └──────────────────────┘        │
└──────────────────────────────────────────────────────────────┘
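Each box in the diagram is a decorator around a plain async call. A hedged sketch of the composition idea only (types and names are illustrative, not taken from the implementations above):

```typescript
type Call<T> = () => Promise<T>;
type Layer = <T>(next: Call<T>) => Call<T>;

// Compose resilience layers outermost-first: compose(rateLimit, retry, breaker)
// wraps the raw call so the rate limiter runs first and the breaker runs last,
// closest to the provider.
function compose(...layers: Layer[]): Layer {
  return <T>(next: Call<T>): Call<T> =>
    layers.reduceRight<Call<T>>((wrapped, layer) => layer(wrapped), next);
}
```

Keeping each concern a separate wrapper is what lets the breaker, retrier, and fallback chain be tested and tuned independently instead of living in one tangled request handler.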

Summary

  1. The circuit breaker is the first line of defense: failing fast is vastly better than failing slowly, and a single 30-second timeout can take down the whole system.
  2. Give the fallback chain real depth: at least three tiers (primary provider, backup provider, local model/templates), each behind its own circuit breaker.
  3. Retry intelligently: retry only recoverable errors, use exponential backoff with jitter, and honor the Retry-After header.
  4. Output validation is not optional: LLM output is nondeterministic, so validate it against a schema and run the repair chain when parsing fails.
  5. Make degradation transparent to users: when a lower-quality fallback served the request, flag the degraded status in the response so the user is informed.

Maurice | maurice_wen@proton.me