AI System Fault Tolerance: Degradation, Circuit Breaking, and Fallback
灵阙教研团队
Updated 2026-02-28
LLM circuit-breaker patterns, graceful degradation strategies, fallback model chains, retry mechanisms, and production fault-tolerance architecture
Introduction
LLM applications face reliability challenges that go well beyond those of traditional API services. Provider APIs time out intermittently, model output quality fluctuates, token quotas run dry, and content filters produce false positives; any one of these failures turns into a failed user request. Worse, LLM calls are usually synchronous and blocking, so a 30-second timeout pins a user connection for the full 30 seconds.
This article walks through how to build a production-grade fault-tolerance stack for LLM applications.
Failure Mode Analysis
A Taxonomy of LLM System Failures
| Failure type | Trigger | Impact | Frequency | Recovery time |
|---|---|---|---|---|
| Provider outage | Global cloud-service failure | All requests fail | Low | Minutes to hours |
| Rate limiting | RPM/TPM quota exceeded | Some requests get 429 | High | Seconds to minutes |
| Timeout | Model overload / network jitter | Single request fails | Medium | Immediate retry |
| Content filtering | Safety review triggered | Output truncated or refused | Medium | Adjust the prompt |
| Quality regression | Model update / config change | Output quality drops | Low | Hours to days |
| Token overflow | Input/output exceeds the context window | Request rejected | Medium | Truncate immediately |
| Format error | Model output in an unexpected shape | Parsing fails | Medium | Retry / degrade |
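The taxonomy above can be encoded as a policy table that the retry and fallback layers consult before deciding what to do with a failure. This is a hypothetical sketch; `FAILURE_POLICY` and its flags are illustrative names, not part of the article's codebase:

```typescript
// Which failure kinds are worth retrying in place, and which should
// trigger a switch to a fallback provider instead.
type FailureKind =
  | "provider_down" | "rate_limited" | "timeout" | "content_filtered"
  | "quality_degraded" | "token_overflow" | "format_error";

const FAILURE_POLICY: Record<FailureKind, { retryable: boolean; fallback: boolean }> = {
  provider_down:    { retryable: false, fallback: true },  // switch providers
  rate_limited:     { retryable: true,  fallback: true },  // back off, then switch
  timeout:          { retryable: true,  fallback: true },
  content_filtered: { retryable: false, fallback: false }, // needs a prompt change
  quality_degraded: { retryable: false, fallback: true },
  token_overflow:   { retryable: false, fallback: false }, // truncate the input instead
  format_error:     { retryable: true,  fallback: true },  // re-ask, or degrade
};
```

Keeping the policy in one table means the retrier and the fallback chain cannot drift apart in how they classify the same error.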
Failure Propagation Chains
How a single point of failure cascades:
User request → API Gateway → LLM Provider (down!)
                                  │
                                  ▼
                 Thread blocked for 30 s (waiting on the timeout)
                                  │
                                  ▼
                 Connection pool exhausted (every thread tied up)
                                  │
                                  ▼
                 Gateway returns 503 (every user affected)
                                  │
                                  ▼
                 Frontend retry storm (exponential amplification)
                                  │
                                  ▼
                 System avalanche (total outage)
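The cascade starts with a blocked thread, so the first countermeasure is a hard per-call timeout. A minimal sketch using AbortController; `withTimeout` and `slowLLMCall` are illustrative names, not part of the article's codebase:

```typescript
// Wrap any provider call with a hard deadline so a slow request can
// never pin a connection for the provider's full 30 s timeout.
async function withTimeout<T>(
  fn: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fn(controller.signal);
  } finally {
    clearTimeout(timer); // never leak the timer on success
  }
}

// Stand-in for a slow provider call that honors the abort signal.
function slowLLMCall(signal: AbortSignal): Promise<string> {
  return new Promise((resolve, reject) => {
    const t = setTimeout(() => resolve("response"), 30_000);
    signal.addEventListener("abort", () => {
      clearTimeout(t);
      reject(new Error("timeout"));
    });
  });
}
```

With a 2 s budget the caller's thread is released after 2 s, regardless of what the provider does.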
Circuit Breaker Pattern
The Three-State Circuit Breaker
┌──────────────────────────────────────────────────────┐
│                                                      │
│       ┌────────────┐                                 │
│  ┌───▶│   CLOSED   │◀──── probe successes reach      │
│  │    │ (pass all) │      the threshold              │
│  │    └──────┬─────┘                                 │
│  │           │                                       │
│  │    failures exceed the threshold                  │
│  │           │                                       │
│  │    ┌──────▼─────┐                                 │
│  │    │    OPEN    │                                 │
│  │    │ (fail fast)│──── recovery timeout elapses,   │
│  │    └──────┬─────┘     enter half-open             │
│  │           │                                       │
│  │    ┌──────▼─────┐                                 │
│  │    │ HALF-OPEN  │                                 │
│  └────│  (probing) │──── probe fails, back to OPEN   │
│       └────────────┘                                 │
└──────────────────────────────────────────────────────┘
A Production-Grade Circuit Breaker Implementation
// src/resilience/circuit-breaker.ts
interface CircuitBreakerConfig {
name: string;
failureThreshold: number; // Failures to trip
recoveryTimeoutMs: number; // Wait before half-open
halfOpenMaxRequests: number; // Test requests in half-open
monitorWindowMs: number; // Sliding window for counting
onStateChange?: (from: string, to: string) => void;
}
class CircuitBreaker {
private state: "closed" | "open" | "half-open" = "closed";
private failures: number[] = [];
private halfOpenSuccesses = 0;
private halfOpenFailures = 0;
private openedAt = 0;
constructor(private config: CircuitBreakerConfig) {}
async execute<T>(fn: () => Promise<T>): Promise<T> {
if (!this.canExecute()) {
throw new CircuitOpenError(
`Circuit ${this.config.name} is OPEN, failing fast`
);
}
try {
const result = await fn();
this.onSuccess();
return result;
} catch (error) {
this.onFailure();
throw error;
}
}
private canExecute(): boolean {
switch (this.state) {
case "closed":
return true;
case "open":
if (Date.now() - this.openedAt >= this.config.recoveryTimeoutMs) {
this.transitionTo("half-open");
return true;
}
return false;
case "half-open":
return (
this.halfOpenSuccesses + this.halfOpenFailures <
this.config.halfOpenMaxRequests
);
}
}
private onSuccess(): void {
if (this.state === "half-open") {
this.halfOpenSuccesses++;
if (this.halfOpenSuccesses >= this.config.halfOpenMaxRequests) {
this.transitionTo("closed");
}
} else if (this.state === "closed") {
// Clear old failures
this.pruneFailures();
}
}
  private onFailure(): void {
    if (this.state === "half-open") {
      this.halfOpenFailures++; // count the failed probe before tripping
      this.transitionTo("open");
      return;
    }
this.failures.push(Date.now());
this.pruneFailures();
if (this.failures.length >= this.config.failureThreshold) {
this.transitionTo("open");
}
}
private pruneFailures(): void {
const cutoff = Date.now() - this.config.monitorWindowMs;
this.failures = this.failures.filter(t => t > cutoff);
}
private transitionTo(newState: typeof this.state): void {
const oldState = this.state;
this.state = newState;
if (newState === "open") {
this.openedAt = Date.now();
}
if (newState === "half-open") {
this.halfOpenSuccesses = 0;
this.halfOpenFailures = 0;
}
if (newState === "closed") {
this.failures = [];
}
this.config.onStateChange?.(oldState, newState);
}
}
class CircuitOpenError extends Error {
constructor(message: string) {
super(message);
this.name = "CircuitOpenError";
}
}
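The class above tracks failures in a sliding window; a stripped-down functional version makes the trip and fail-fast behavior easier to see in isolation. This is an illustrative sketch (`createBreaker` is a hypothetical name, not the implementation above):

```typescript
// Minimal breaker: trip after `threshold` consecutive failures, fail fast
// while open, and let a single probe through after `recoveryMs`.
function createBreaker(threshold: number, recoveryMs: number) {
  let failures = 0;
  let openedAt = 0;
  let open = false;
  return async function guard<T>(fn: () => Promise<T>): Promise<T> {
    if (open) {
      if (Date.now() - openedAt < recoveryMs) {
        throw new Error("circuit open: failing fast");
      }
      open = false;              // half-open: allow one probe through
      failures = threshold - 1;  // a failed probe re-opens immediately
    }
    try {
      const result = await fn();
      failures = 0; // any success heals the breaker
      return result;
    } catch (err) {
      failures++;
      if (failures >= threshold) {
        open = true;
        openedAt = Date.now();
      }
      throw err;
    }
  };
}
```

The key property either version must preserve: once open, the guarded function is not invoked at all, so a dead provider costs microseconds instead of a 30 s timeout.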
Fallback Model Chains
Tiered Degradation Strategy
Degradation levels:
Level 0 (best):        Claude Sonnet 4          → highest quality
Level 1 (runner-up):   GPT-4o                   → high-quality alternative
Level 2 (economy):     Gemini Flash             → fast and cheap
Level 3 (local):       local Llama 8B           → zero external dependencies
Level 4 (template):    rule engine + templates  → zero LLM calls
Level 5 (last resort): canned responses         → no reasoning ability
Fallback Chain Implementation
// src/resilience/fallback-chain.ts
interface FallbackProvider {
name: string;
model: string;
breaker: CircuitBreaker;
call: (request: LLMRequest) => Promise<LLMResponse>;
costPerToken: number;
qualityTier: "premium" | "standard" | "economy" | "fallback";
}
class FallbackChain {
private providers: FallbackProvider[];
private attempts: AttemptRecord[] = [];
constructor(providers: FallbackProvider[]) {
this.providers = providers;
}
async execute(request: LLMRequest): Promise<FallbackResult> {
this.attempts = [];
for (const provider of this.providers) {
const start = Date.now();
try {
const response = await provider.breaker.execute(
() => this.callWithTimeout(provider, request)
);
this.attempts.push({
provider: provider.name,
model: provider.model,
status: "success",
latencyMs: Date.now() - start,
});
return {
response,
provider: provider.name,
model: provider.model,
qualityTier: provider.qualityTier,
attempts: this.attempts,
degraded: provider !== this.providers[0],
};
} catch (error) {
this.attempts.push({
provider: provider.name,
model: provider.model,
status: "failed",
latencyMs: Date.now() - start,
error: this.classifyError(error),
});
}
}
// All providers exhausted
return this.templateFallback(request);
}
private async callWithTimeout(
provider: FallbackProvider,
request: LLMRequest,
): Promise<LLMResponse> {
const timeout = this.getTimeout(provider.qualityTier);
const controller = new AbortController();
const timer = setTimeout(() => controller.abort(), timeout);
try {
return await provider.call({ ...request, signal: controller.signal });
} finally {
clearTimeout(timer);
}
}
private getTimeout(tier: string): number {
const timeouts: Record<string, number> = {
premium: 30_000,
standard: 20_000,
economy: 15_000,
fallback: 10_000,
};
return timeouts[tier] ?? 15_000;
}
private classifyError(error: unknown): string {
if (error instanceof CircuitOpenError) return "circuit_open";
if (error instanceof Error) {
if (error.name === "AbortError") return "timeout";
if (error.message.includes("429")) return "rate_limited";
if (error.message.includes("500")) return "server_error";
if (error.message.includes("content_filter")) return "content_filtered";
}
return "unknown";
}
private templateFallback(request: LLMRequest): FallbackResult {
// Last resort: use template-based response
const response: LLMResponse = {
content: "I apologize, but I am currently unable to process your request. "
+ "Please try again in a few minutes or contact support.",
model: "template-fallback",
usage: { inputTokens: 0, outputTokens: 0 },
};
return {
response,
provider: "template",
model: "fallback",
qualityTier: "fallback",
attempts: this.attempts,
degraded: true,
};
}
}
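Stripped of circuit breakers and timeouts, the heart of `FallbackChain.execute` is a loop that returns the first healthy provider's answer, records every attempt, and flags the result as degraded whenever a non-primary tier served it. A self-contained illustrative sketch (`firstHealthy` and these types are hypothetical names):

```typescript
interface Provider {
  name: string;
  call: (prompt: string) => Promise<string>;
}

interface ChainResult {
  content: string;
  provider: string;
  degraded: boolean;
  attempts: { provider: string; error?: string }[];
}

async function firstHealthy(providers: Provider[], prompt: string): Promise<ChainResult> {
  const attempts: { provider: string; error?: string }[] = [];
  for (const p of providers) {
    try {
      const content = await p.call(prompt);
      attempts.push({ provider: p.name });
      return { content, provider: p.name, degraded: p !== providers[0], attempts };
    } catch (err) {
      // Record the failure and fall through to the next tier.
      attempts.push({ provider: p.name, error: String(err) });
    }
  }
  // Every tier failed: canned template answer, clearly marked as degraded.
  return { content: "Service temporarily unavailable.", provider: "template", degraded: true, attempts };
}
```

Keeping the attempt log local to the call (rather than on the instance) also makes the sketch safe under concurrent requests.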
Retry Strategies
Smart Retries
// src/resilience/retry.ts
interface RetryConfig {
maxRetries: number;
initialDelayMs: number;
maxDelayMs: number;
backoffMultiplier: number;
retryableErrors: Set<string>;
jitterFactor: number; // 0-1, prevents thundering herd
}
class SmartRetrier {
constructor(
private config: RetryConfig = {
maxRetries: 3,
initialDelayMs: 1000,
maxDelayMs: 30000,
backoffMultiplier: 2,
retryableErrors: new Set([
"timeout",
"rate_limited",
"server_error",
"connection_reset",
]),
jitterFactor: 0.3,
}
) {}
async execute<T>(fn: () => Promise<T>): Promise<T> {
let lastError: Error | undefined;
let delay = this.config.initialDelayMs;
for (let attempt = 0; attempt <= this.config.maxRetries; attempt++) {
try {
return await fn();
} catch (error) {
lastError = error as Error;
const errorType = this.classifyError(error);
// Don't retry non-retryable errors
if (!this.config.retryableErrors.has(errorType)) {
throw error;
}
// Don't retry after last attempt
if (attempt === this.config.maxRetries) {
break;
}
// Special handling for rate limits
if (errorType === "rate_limited") {
const retryAfter = this.extractRetryAfter(error);
if (retryAfter) {
delay = retryAfter * 1000;
}
}
// Add jitter to prevent thundering herd
const jitter = delay * this.config.jitterFactor * Math.random();
const actualDelay = Math.min(delay + jitter, this.config.maxDelayMs);
await sleep(actualDelay);
// Exponential backoff
delay = Math.min(
delay * this.config.backoffMultiplier,
this.config.maxDelayMs,
);
}
}
throw lastError ?? new Error("retry budget exhausted");
}
private classifyError(error: unknown): string {
if (!(error instanceof Error)) return "unknown";
const msg = error.message.toLowerCase();
if (msg.includes("timeout") || msg.includes("etimedout")) return "timeout";
if (msg.includes("429") || msg.includes("rate")) return "rate_limited";
if (msg.includes("500") || msg.includes("502") || msg.includes("503")) return "server_error";
if (msg.includes("econnreset") || msg.includes("econnrefused")) return "connection_reset";
if (msg.includes("content") && msg.includes("filter")) return "content_filtered";
return "unknown";
}
  private extractRetryAfter(error: unknown): number | undefined {
    // Extract the Retry-After header value (in seconds)
    if (error instanceof Error && "headers" in error) {
      const headers = (error as any).headers;
      const retryAfter = parseInt(headers?.["retry-after"], 10);
      if (Number.isFinite(retryAfter)) return retryAfter;
    }
    return undefined;
  }
}
function sleep(ms: number): Promise<void> {
return new Promise(resolve => setTimeout(resolve, ms));
}
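The backoff schedule in `SmartRetrier` (before jitter is added) can be computed in isolation, which makes the capping behavior easy to check. A small illustrative helper; `backoffDelays` is a hypothetical name, not part of the article's code:

```typescript
// Base exponential-backoff schedule: delay doubles (or multiplies by
// `multiplier`) each attempt, capped at `capMs`. Jitter is added on top
// of these values at call time.
function backoffDelays(retries: number, initialMs: number, multiplier: number, capMs: number): number[] {
  const delays: number[] = [];
  let delay = initialMs;
  for (let i = 0; i < retries; i++) {
    delays.push(Math.min(delay, capMs));
    delay = Math.min(delay * multiplier, capMs);
  }
  return delays;
}

// backoffDelays(4, 1000, 2, 5000) → [1000, 2000, 4000, 5000]
```

Note that the cap applies both to the stored delay and to the emitted value, so the schedule flattens out at `capMs` rather than growing without bound.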
Output Validation and Repair
Guarding Structured Output
// src/resilience/output-guard.ts
import { z } from "zod";
class OutputGuard<T> {
constructor(
private schema: z.ZodSchema<T>,
private repairAttempts: number = 2,
) {}
  async validate(
    rawOutput: string,
    repairFn?: (output: string, error: string) => Promise<string>,
  ): Promise<T> {
    // Attempt 1: Direct parse
    const parsed = this.tryParse(rawOutput);
    if (parsed.success) return parsed.data;
    // Attempt 2-N: Repair and retry, feeding the latest error back in
    let currentOutput = rawOutput;
    let lastError = parsed.error;
    for (let i = 0; i < this.repairAttempts; i++) {
      if (!repairFn) break;
      currentOutput = await repairFn(currentOutput, lastError);
      const repaired = this.tryParse(currentOutput);
      if (repaired.success) return repaired.data;
      lastError = repaired.error;
    }
    throw new OutputValidationError(
      `Failed to parse output after ${this.repairAttempts} repair attempts`,
      rawOutput,
      lastError,
    );
  }
private tryParse(output: string): { success: true; data: T } | { success: false; error: string } {
try {
// Try JSON extraction from markdown code blocks
const jsonMatch = output.match(/```(?:json)?\s*([\s\S]*?)```/);
const jsonStr = jsonMatch ? jsonMatch[1].trim() : output.trim();
const data = JSON.parse(jsonStr);
const validated = this.schema.parse(data);
return { success: true, data: validated };
} catch (error) {
return {
success: false,
error: error instanceof Error ? error.message : String(error),
};
}
}
}
class OutputValidationError extends Error {
  constructor(
    message: string,
    public rawOutput: string,
    public parseError: string,
  ) {
    super(message);
    this.name = "OutputValidationError";
  }
}
// Usage example
const responseSchema = z.object({
answer: z.string().min(1),
confidence: z.number().min(0).max(1),
sources: z.array(z.string()),
});
const guard = new OutputGuard(responseSchema);
const validated = await guard.validate(
llmOutput,
async (output, error) => {
// Use LLM to repair the output
return await callLLM(`Fix this JSON: ${output}\nError: ${error}`);
},
);
Fault-Tolerance Architecture Overview
┌──────────────────────────────────────────────────────────────┐
│              Complete Fault-Tolerance Architecture           │
│                                                              │
│  Request ──→ [Rate limit] ──→ [Cache lookup] ──→ [Routing]   │
│                  │                 │                │        │
│          degrade on limit    hit: return it   pick provider  │
│                                                 │            │
│                      ┌──────────▼───────────┐                │
│                      │  Retry + breakers    │                │
│                      │                      │                │
│                      │  Provider A (primary)│                │
│                      │      │ fails         │                │
│                      │  Provider B (backup) │                │
│                      │      │ fails         │                │
│                      │  Local model (degraded)               │
│                      │      │ fails         │                │
│                      │  Template engine (last resort)        │
│                      └──────────┬───────────┘                │
│                                 │                            │
│                      ┌──────────▼───────────┐                │
│                      │ Output validation +  │                │
│                      │ repair (schema /     │                │
│                      │ safety filtering)    │                │
│                      └──────────┬───────────┘                │
│                                 │                            │
│                      ┌──────────▼───────────┐                │
│                      │ Write cache + logs   │                │
│                      │ Metrics + traces     │                │
│                      └──────────────────────┘                │
└──────────────────────────────────────────────────────────────┘
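The layers in the diagram compose mechanically: each tier gets a bounded retry budget, and when it is spent, control drops down the chain until the template floor answers. A minimal illustrative composition (`resilientCall` is a hypothetical name, and real code would insert breakers and timeouts around each `call()`):

```typescript
// Try each provider tier up to maxRetries+1 times before moving on;
// the canned template answer is the guaranteed floor.
async function resilientCall(
  providers: Array<() => Promise<string>>,
  maxRetries: number,
): Promise<string> {
  for (const call of providers) {
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        return await call();
      } catch {
        // Swallow and retry this tier, then move down the chain.
      }
    }
  }
  return "canned fallback response"; // template tier: no LLM involved
}
```

The important invariant is that the function always resolves: every failure path ends at the template tier rather than at the user.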
Summary
- Circuit breakers are the first line of defense: failing fast beats failing slowly by a wide margin, and a 30-second timeout can take down an entire system.
- Fallback chains need real depth: at least three tiers (primary provider, backup provider, local/template), each guarded by its own circuit breaker.
- Retries must be smart: retry only recoverable errors, use exponential backoff with jitter, and honor the Retry-After header.
- Output validation is not optional: LLM output is nondeterministic, so validate it against a schema and attempt recovery with a repair chain on failure.
- Degradation should be transparent to users: when a lower-quality fallback serves the request, mark the degraded state in the response so users are informed.
Maurice | maurice_wen@proton.me