AI Gateway Design: Model Routing and Load Balancing
Original · 灵阙教研团队
About 10 minutes reading time
Updated 2026-02-28
LLM API gateway architecture patterns, intelligent routing strategies, fallback chain design, cost tracking, and token metering
Introduction
When an organization uses multiple LLM providers at once (OpenAI, Anthropic, Google, open-source models), hardcoding API calls in business code leads to tight coupling, painful migrations, and runaway costs. An AI gateway is a unified access layer that handles the cross-cutting concerns: model routing, load balancing, cost control, observability, and fault tolerance.
This article walks through building a production-grade AI gateway, from architecture design to engineering implementation.
Architecture Design
Overall Topology
┌─────────────────────────────────────────────────────────────┐
│                  Business Application Layer                 │
│  ┌─────────┐  ┌──────────┐  ┌──────────┐  ┌─────────────┐   │
│  │ ChatBot │  │ Code Gen │  │ RAG App  │  │ Agent System│   │
│  └────┬────┘  └─────┬────┘  └─────┬────┘  └──────┬──────┘   │
└───────┼─────────────┼─────────────┼──────────────┼──────────┘
        │             │             │              │
        ▼             ▼             ▼              ▼
┌─────────────────────────────────────────────────────────────┐
│                       AI Gateway Layer                      │
│                                                             │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌────────────┐   │
│  │  Router  │  │ Limiter  │  │  Cache   │  │ Telemetry  │   │
│  └────┬─────┘  └──────────┘  └──────────┘  └────────────┘   │
│       │                                                     │
│  ┌────▼────────────────────────────────────────────────┐    │
│  │              Fallback Chain Manager                 │    │
│  └────┬──────────┬──────────┬──────────┬───────────────┘    │
└───────┼──────────┼──────────┼──────────┼────────────────────┘
        │          │          │          │
   ┌────▼───┐ ┌────▼────┐ ┌───▼────┐ ┌───▼────────┐
   │ OpenAI │ │Anthropic│ │ Google │ │ vLLM       │
   │ GPT-4o │ │ Claude  │ │ Gemini │ │(self-host) │
   └────────┘ └─────────┘ └────────┘ └────────────┘
Core Module Responsibilities
| Module | Responsibility | Key Metric |
|---|---|---|
| Router | Select a provider by scene/model/cost/latency | Routing decision latency <1ms |
| Rate limiter | Enforce RPM/TPM limits to stay within provider quotas | Token bucket / sliding window accuracy |
| Cache layer | Semantic cache + exact-match cache | Cache hit rate >30% |
| Fallback manager | Automatic failover and circuit-breaker recovery | Failover latency <100ms |
| Telemetry | Token metering, latency tracing, quality scoring | Data completeness >99.9% |
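These modules compose into a single request pipeline: cache lookup first, then rate limiting, then routing and the provider call. A minimal sketch of that flow, with every helper here (`checkRateLimit`, `callModel`, the in-memory cache) a simplified stand-in for the real modules described later:

```typescript
// Minimal sketch of the gateway request pipeline. All helpers here are
// illustrative stubs, not the production modules.
interface GatewayRequest {
  scene: string;
  tier: string;
  prompt: string;
}

interface GatewayResponse {
  content: string;
  cached: boolean;
}

// Exact-match cache keyed by scene + prompt (stand-in for the cache layer)
const responseCache = new Map<string, string>();

// Stub limiter: always admits in this sketch
function checkRateLimit(_provider: string): boolean {
  return true;
}

async function handleRequest(
  req: GatewayRequest,
  callModel: (req: GatewayRequest) => Promise<string>,
): Promise<GatewayResponse> {
  const key = `${req.scene}:${req.prompt}`;
  // 1. Cache lookup: the cheapest inference is the one you skip
  const hit = responseCache.get(key);
  if (hit !== undefined) return { content: hit, cached: true };
  // 2. Rate-limit check (per provider in the real gateway)
  if (!checkRateLimit("any")) throw new Error("Rate limited");
  // 3. Route and call with fallback (delegated to callModel here)
  const content = await callModel(req);
  responseCache.set(key, content);
  return { content, cached: false };
}
```

The ordering matters: a cache hit never touches the limiter or a provider, so cached traffic is free in both dollars and quota.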
Intelligent Routing Strategies
Scene-Based Routing
Different task scenes place different demands on model capability; the router picks the best-suited model for each scene type:
// src/router/scene-router.ts
interface ProviderModel {
  provider: string;
  model: string;
}

interface RouteConfig {
  scene: string;
  tiers: {
    premium: ProviderModel[];
    balanced: ProviderModel[];
    fast: ProviderModel[];
  };
}

const routeTable: RouteConfig[] = [
  {
    scene: "chat",
    tiers: {
      premium: [{ provider: "anthropic", model: "claude-sonnet-4-20250514" }],
      balanced: [{ provider: "openai", model: "gpt-4o" }],
      fast: [{ provider: "google", model: "gemini-2.0-flash" }],
    },
  },
  {
    scene: "code",
    tiers: {
      premium: [{ provider: "anthropic", model: "claude-sonnet-4-20250514" }],
      balanced: [{ provider: "openai", model: "gpt-4o" }],
      fast: [{ provider: "google", model: "gemini-2.0-flash" }],
    },
  },
  {
    scene: "vision",
    tiers: {
      premium: [{ provider: "google", model: "gemini-2.5-pro" }],
      balanced: [{ provider: "openai", model: "gpt-4o" }],
      fast: [{ provider: "google", model: "gemini-2.0-flash" }],
    },
  },
  {
    scene: "embedding",
    tiers: {
      premium: [{ provider: "openai", model: "text-embedding-3-large" }],
      balanced: [{ provider: "openai", model: "text-embedding-3-small" }],
      fast: [{ provider: "local", model: "bge-m3" }],
    },
  },
];

function routeRequest(
  scene: string,
  tier: keyof RouteConfig["tiers"] = "balanced",
): ProviderModel {
  const config = routeTable.find(r => r.scene === scene);
  if (!config) throw new Error(`Unknown scene: ${scene}`);
  const candidates = config.tiers[tier];
  if (!candidates?.length) throw new Error(`No models for ${scene}/${tier}`);
  // Check availability before returning
  for (const candidate of candidates) {
    if (getCircuitBreaker(candidate.provider).isAvailable()) {
      return candidate;
    }
  }
  // All candidates down: degrade one tier, and give up at the bottom
  // tier instead of recursing forever
  if (tier === "fast") {
    throw new Error(`All providers unavailable for scene: ${scene}`);
  }
  return routeRequest(scene, tier === "premium" ? "balanced" : "fast");
}
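The route table above returns the first available candidate in a tier. When a tier lists several equivalent models, the gateway can instead load-balance traffic across them. A weighted-random sketch (the `weight` field and the 70/30 split below are hypothetical, not part of the route table above):

```typescript
interface WeightedCandidate {
  provider: string;
  model: string;
  weight: number; // Relative share of traffic, e.g. 70 vs 30
}

// Pick one candidate with probability proportional to its weight.
// `rand` is injectable so the selection is testable deterministically.
function pickWeighted(
  candidates: WeightedCandidate[],
  rand: () => number = Math.random,
): WeightedCandidate {
  const total = candidates.reduce((sum, c) => sum + c.weight, 0);
  let r = rand() * total;
  for (const c of candidates) {
    r -= c.weight;
    if (r <= 0) return c;
  }
  return candidates[candidates.length - 1]; // Guard against float rounding
}
```

Combined with the circuit-breaker check, this spreads load inside a tier while still skipping providers whose breaker is open.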
Cost-Aware Routing
// src/router/cost-router.ts
interface ModelPricing {
  provider: string;
  model: string;
  inputPer1M: number;   // USD per 1M input tokens
  outputPer1M: number;  // USD per 1M output tokens
  cachedPer1M?: number; // USD per 1M cached input tokens
}

const pricing: ModelPricing[] = [
  { provider: "openai", model: "gpt-4o", inputPer1M: 2.50, outputPer1M: 10.00 },
  { provider: "openai", model: "gpt-4o-mini", inputPer1M: 0.15, outputPer1M: 0.60 },
  { provider: "anthropic", model: "claude-sonnet-4-20250514", inputPer1M: 3.00, outputPer1M: 15.00 },
  { provider: "anthropic", model: "claude-haiku-3.5", inputPer1M: 0.80, outputPer1M: 4.00 },
  { provider: "google", model: "gemini-2.0-flash", inputPer1M: 0.10, outputPer1M: 0.40 },
  { provider: "google", model: "gemini-2.5-pro", inputPer1M: 1.25, outputPer1M: 10.00 },
];

function estimateCost(
  model: ModelPricing,
  estimatedInputTokens: number,
  estimatedOutputTokens: number,
): number {
  return (
    (estimatedInputTokens / 1_000_000) * model.inputPer1M +
    (estimatedOutputTokens / 1_000_000) * model.outputPer1M
  );
}

function routeByCost(
  scene: string,
  maxCostUsd: number,
  estimatedInputTokens: number,
  estimatedOutputTokens: number,
): ProviderModel {
  const candidates = getSceneCandidates(scene);
  // Filter by budget, sort by quality (premium first)
  const affordable = candidates
    .map(c => ({
      ...c,
      cost: estimateCost(
        pricing.find(p => p.model === c.model)!,
        estimatedInputTokens,
        estimatedOutputTokens,
      ),
    }))
    .filter(c => c.cost <= maxCostUsd)
    .sort((a, b) => b.cost - a.cost); // Higher cost = higher quality (heuristic)
  if (!affordable.length) {
    throw new Error(`No model within budget $${maxCostUsd} for ${scene}`);
  }
  return affordable[0];
}
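As a sanity check on the formula: a request with 2,000 input tokens and 500 output tokens on gpt-4o ($2.50 in / $10.00 out per 1M tokens) costs (2000/1M)·2.50 + (500/1M)·10.00 = $0.01. The `costUsd` helper below just mirrors `estimateCost` so the arithmetic stands alone:

```typescript
// Worked example of the per-request cost formula used by estimateCost
const gpt4o = { inputPer1M: 2.50, outputPer1M: 10.00 };

function costUsd(
  inTok: number,
  outTok: number,
  p: { inputPer1M: number; outputPer1M: number },
): number {
  return (inTok / 1_000_000) * p.inputPer1M + (outTok / 1_000_000) * p.outputPer1M;
}

const c = costUsd(2_000, 500, gpt4o); // 0.005 + 0.005 = 0.01 USD
```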
Fallback Chains and Fault Tolerance
Circuit Breaker Implementation
// src/resilience/circuit-breaker.ts
enum CircuitState {
  CLOSED = "closed",       // Normal operation
  OPEN = "open",           // Failing, reject requests
  HALF_OPEN = "half_open", // Testing recovery
}

interface CircuitConfig {
  failureThreshold: number; // Failures before opening
  recoveryTimeout: number;  // ms before trying half-open
  successThreshold: number; // Successes in half-open to close
  monitorWindow: number;    // ms window for failure counting
}

class CircuitBreaker {
  private state: CircuitState = CircuitState.CLOSED;
  private failures: number[] = [];
  private successes = 0;
  private lastStateChange = Date.now();

  constructor(
    private provider: string,
    private config: CircuitConfig = {
      failureThreshold: 5,
      recoveryTimeout: 30_000,
      successThreshold: 3,
      monitorWindow: 60_000,
    },
  ) {}

  isAvailable(): boolean {
    if (this.state === CircuitState.CLOSED) return true;
    if (this.state === CircuitState.OPEN) {
      // Check if the recovery timeout has passed
      if (Date.now() - this.lastStateChange > this.config.recoveryTimeout) {
        this.transition(CircuitState.HALF_OPEN);
        return true;
      }
      return false;
    }
    // HALF_OPEN: allow limited traffic
    return true;
  }

  recordSuccess(): void {
    if (this.state === CircuitState.HALF_OPEN) {
      this.successes++;
      if (this.successes >= this.config.successThreshold) {
        this.transition(CircuitState.CLOSED);
      }
    }
    // Reset failure window
    this.failures = [];
  }

  recordFailure(_error: Error): void {
    const now = Date.now();
    this.failures.push(now);
    // Drop failures outside the monitoring window
    this.failures = this.failures.filter(
      t => now - t < this.config.monitorWindow,
    );
    if (this.state === CircuitState.HALF_OPEN) {
      this.transition(CircuitState.OPEN);
      return;
    }
    if (this.failures.length >= this.config.failureThreshold) {
      this.transition(CircuitState.OPEN);
    }
  }

  private transition(newState: CircuitState): void {
    console.log(
      `CircuitBreaker [${this.provider}]: ${this.state} -> ${newState}`,
    );
    this.state = newState;
    this.lastStateChange = Date.now();
    this.successes = 0;
    if (newState === CircuitState.CLOSED) {
      this.failures = [];
    }
  }
}
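The router and fallback executor both call a `getCircuitBreaker(provider)` helper that is not shown in the class above. A minimal sketch of that registry, assuming one breaker per provider name (the `CircuitBreaker` stub here stands in for the full class, since only the per-provider identity matters):

```typescript
// Stand-in for the full CircuitBreaker class; only identity matters here
class CircuitBreaker {
  constructor(public provider: string) {}
  isAvailable(): boolean {
    return true;
  }
}

const breakers = new Map<string, CircuitBreaker>();

// One breaker per provider, created lazily on first use, so every
// request to a given provider shares failure state
function getCircuitBreaker(provider: string): CircuitBreaker {
  let breaker = breakers.get(provider);
  if (!breaker) {
    breaker = new CircuitBreaker(provider);
    breakers.set(provider, breaker);
  }
  return breaker;
}
```

Sharing one breaker per provider (rather than per request or per model) is what lets a failure observed by one request protect all the others.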
Fallback Chain Executor
// src/resilience/fallback-chain.ts
interface FallbackResult {
  response: LLMResponse;
  provider: string;
  model: string;
  attempts: AttemptRecord[];
}

interface AttemptRecord {
  provider: string;
  model: string;
  latencyMs: number;
  error?: string;
}

async function executeWithFallback(
  request: LLMRequest,
  chain: ProviderModel[],
): Promise<FallbackResult> {
  const attempts: AttemptRecord[] = [];
  for (const { provider, model } of chain) {
    const breaker = getCircuitBreaker(provider);
    if (!breaker.isAvailable()) {
      attempts.push({
        provider, model, latencyMs: 0,
        error: "Circuit breaker OPEN",
      });
      continue;
    }
    const start = Date.now();
    try {
      const response = await callProvider(provider, model, request, {
        timeout: 30_000,
        retries: 1,
      });
      breaker.recordSuccess();
      attempts.push({ provider, model, latencyMs: Date.now() - start });
      return { response, provider, model, attempts };
    } catch (err) {
      // `catch` binds `unknown` under strict TS; normalize to Error first
      const error = err instanceof Error ? err : new Error(String(err));
      const latencyMs = Date.now() - start;
      breaker.recordFailure(error);
      attempts.push({
        provider, model, latencyMs,
        error: error.message,
      });
      // Log for observability
      console.error(
        `Fallback: ${provider}/${model} failed (${latencyMs}ms): ${error.message}`,
      );
    }
  }
  throw new FallbackExhaustedError(
    "All providers failed for request",
    attempts,
  );
}
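`callProvider` above takes a `timeout` option; one way to enforce such a deadline in plain TypeScript is to race the call against a timer (a sketch: real provider SDKs usually accept an `AbortSignal` instead, which is preferable because it actually cancels the request):

```typescript
// Race a promise against a timer; reject with a timeout error if the
// timer wins. Note: this stops waiting but does NOT cancel the
// underlying request - pass an AbortSignal for true cancellation.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms);
  });
  // Clear the timer either way so the process can exit promptly
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

In the fallback loop, a timeout then surfaces as an ordinary thrown error: the breaker records it and the chain moves on to the next provider.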
Token Metering and Cost Tracking
Accurate Token Counting
// src/metering/token-counter.ts
import { encode } from "gpt-tokenizer"; // tiktoken-compatible

interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
  cachedTokens?: number;
  totalTokens: number;
  estimatedCostUsd: number;
}

function countTokens(
  provider: string,
  model: string,
  messages: Message[],
  response: LLMResponse,
): TokenUsage {
  // Most providers return authoritative token counts in the response
  const usage = response.usage;
  if (usage) {
    const modelPricing = getPricing(provider, model);
    const inputTokens = usage.input_tokens ?? usage.prompt_tokens;
    const outputTokens = usage.output_tokens ?? usage.completion_tokens;
    return {
      inputTokens,
      outputTokens,
      cachedTokens: usage.cache_read_input_tokens ?? 0,
      totalTokens: inputTokens + outputTokens,
      estimatedCostUsd: calculateCost(modelPricing, usage),
    };
  }
  // Fallback: client-side estimation with a GPT tokenizer; counts are
  // only approximate for non-OpenAI models
  const inputText = messages.map(m => m.content).join(" ");
  const inputTokens = encode(inputText).length;
  const outputTokens = encode(response.content).length;
  return {
    inputTokens,
    outputTokens,
    totalTokens: inputTokens + outputTokens,
    estimatedCostUsd: 0, // No provider usage data; left at 0 and flagged upstream
  };
}
Cost Dashboard Data Model
-- Cost tracking schema
CREATE TABLE llm_usage_log (
  id BIGSERIAL PRIMARY KEY,
  request_id UUID NOT NULL,
  timestamp TIMESTAMPTZ DEFAULT NOW(),
  -- Routing info
  scene TEXT NOT NULL,
  tier TEXT NOT NULL,
  provider TEXT NOT NULL,
  model TEXT NOT NULL,
  -- Token usage
  input_tokens INTEGER NOT NULL,
  output_tokens INTEGER NOT NULL,
  cached_tokens INTEGER DEFAULT 0,
  total_tokens INTEGER GENERATED ALWAYS AS (input_tokens + output_tokens) STORED,
  -- Cost
  cost_usd NUMERIC(10, 6) NOT NULL,
  -- Performance
  latency_ms INTEGER NOT NULL,
  ttft_ms INTEGER, -- Time to First Token
  -- Context
  user_id TEXT,
  team_id TEXT,
  app_id TEXT NOT NULL,
  was_fallback BOOLEAN DEFAULT FALSE,
  fallback_chain JSONB, -- Full attempt history
  cache_hit BOOLEAN DEFAULT FALSE
);

-- Daily cost aggregation view
CREATE MATERIALIZED VIEW daily_cost_by_team AS
SELECT
  DATE_TRUNC('day', timestamp) AS day,
  team_id,
  app_id,
  provider,
  model,
  COUNT(*) AS request_count,
  SUM(input_tokens) AS total_input_tokens,
  SUM(output_tokens) AS total_output_tokens,
  SUM(cost_usd) AS total_cost_usd,
  AVG(latency_ms) AS avg_latency_ms,
  PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY latency_ms) AS p99_latency_ms,
  SUM(CASE WHEN was_fallback THEN 1 ELSE 0 END) AS fallback_count,
  SUM(CASE WHEN cache_hit THEN 1 ELSE 0 END) AS cache_hit_count
FROM llm_usage_log
GROUP BY 1, 2, 3, 4, 5;

-- Query: monthly cost by team with per-request efficiency
SELECT
  team_id,
  SUM(total_cost_usd) AS monthly_cost,
  SUM(request_count) AS monthly_requests,
  ROUND(SUM(total_cost_usd) / SUM(request_count) * 1000, 2) AS cost_per_1k_requests,
  ROUND(SUM(cache_hit_count)::NUMERIC / SUM(request_count) * 100, 1) AS cache_hit_pct
FROM daily_cost_by_team
WHERE day >= DATE_TRUNC('month', NOW())
GROUP BY team_id
ORDER BY monthly_cost DESC;
Rate Limiting
Token Bucket Limiter
// src/rate-limiter/token-bucket.ts
class TokenBucketLimiter {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private maxTokens: number,  // Bucket capacity
    private refillRate: number, // Tokens added per second
  ) {
    this.tokens = maxTokens;
    this.lastRefill = Date.now();
  }

  tryConsume(cost: number = 1): boolean {
    this.refill();
    if (this.tokens >= cost) {
      this.tokens -= cost;
      return true;
    }
    return false;
  }

  private refill(): void {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.maxTokens,
      this.tokens + elapsed * this.refillRate,
    );
    this.lastRefill = now;
  }

  getWaitTime(cost: number = 1): number {
    this.refill();
    if (this.tokens >= cost) return 0;
    return Math.ceil(((cost - this.tokens) / this.refillRate) * 1000);
  }
}

// Per-provider rate limits (example values only; actual quotas vary by
// provider and account tier -- check your own dashboard)
const providerLimits = {
  openai: {
    rpm: new TokenBucketLimiter(500, 500 / 60),         // 500 RPM
    tpm: new TokenBucketLimiter(200_000, 200_000 / 60), // 200K TPM
  },
  anthropic: {
    rpm: new TokenBucketLimiter(1000, 1000 / 60),
    tpm: new TokenBucketLimiter(400_000, 400_000 / 60),
  },
  google: {
    rpm: new TokenBucketLimiter(1000, 1000 / 60),
    tpm: new TokenBucketLimiter(4_000_000, 4_000_000 / 60),
  },
};
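A request has to pass both buckets: one RPM token and its estimated token count from the TPM bucket, and a rejected request should charge neither. A sketch of that admission check (the class is redefined here so the example is self-contained, and the `peek` method is an addition not present in the limiter above):

```typescript
// Self-contained copy of the limiter, extended with `peek` so both
// buckets can be checked before either is charged.
class TokenBucketLimiter {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private maxTokens: number, private refillRate: number) {
    this.tokens = maxTokens;
  }

  private refill(): void {
    const now = Date.now();
    this.tokens = Math.min(
      this.maxTokens,
      this.tokens + ((now - this.lastRefill) / 1000) * this.refillRate,
    );
    this.lastRefill = now;
  }

  peek(cost: number): boolean {
    this.refill();
    return this.tokens >= cost;
  }

  tryConsume(cost: number): boolean {
    this.refill();
    if (this.tokens < cost) return false;
    this.tokens -= cost;
    return true;
  }
}

// Admit a request only if BOTH buckets have capacity, consuming from
// both or from neither
function admit(
  rpm: TokenBucketLimiter,
  tpm: TokenBucketLimiter,
  estimatedTokens: number,
): boolean {
  if (!rpm.peek(1) || !tpm.peek(estimatedTokens)) return false;
  return rpm.tryConsume(1) && tpm.tryConsume(estimatedTokens);
}
```

Because the gateway only knows the output token count after the response arrives, `estimatedTokens` is a pre-call estimate; some gateways reconcile the bucket afterwards with the actual usage.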
Production Deployment Reference Architecture
┌─────────────┐
│ CloudFlare │
│ (CDN) │
└──────┬──────┘
│
┌──────▼──────┐
│    Nginx    │
│ (TLS term.) │
└──────┬──────┘
│
┌────────────┼────────────┐
│ │ │
┌─────▼────┐ ┌────▼─────┐ ┌────▼─────┐
│ Gateway │ │ Gateway │ │ Gateway │
│ Pod #1 │ │ Pod #2 │ │ Pod #3 │
└─────┬────┘ └────┬─────┘ └────┬─────┘
│ │ │
┌─────▼────────────▼────────────▼─────┐
│ Redis │
│ (Rate Limits + Semantic Cache) │
└─────┬────────────┬──────────────────┘
│ │
┌─────▼────┐ ┌────▼──────┐
│PostgreSQL│ │ClickHouse │
│(Config) │ │(Analytics)│
└──────────┘ └───────────┘
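Note that the in-process token buckets from the previous section break down in this topology: with three gateway pods, each pod would enforce the full quota independently, tripling the effective limit. That is why the diagram keeps rate-limit state in Redis. The core idea is an atomic increment on a per-window key; a fixed-window sketch against a minimal KV interface (the in-memory `MemoryKV` stands in for a Redis client's `INCR`/`EXPIRE`, which should be issued atomically, e.g. via a Lua script):

```typescript
// Minimal KV abstraction: a Redis client would implement incr with
// INCR + EXPIRE on the same key.
interface WindowKV {
  incr(key: string, ttlSeconds: number): Promise<number>;
}

// In-memory stand-in so the sketch runs without Redis (TTL elided)
class MemoryKV implements WindowKV {
  private counts = new Map<string, number>();
  async incr(key: string, _ttlSeconds: number): Promise<number> {
    const next = (this.counts.get(key) ?? 0) + 1;
    this.counts.set(key, next);
    return next;
  }
}

// Fixed-window limit: at most `limit` requests per `windowSeconds`,
// shared by every pod that talks to the same KV store
async function allowRequest(
  kv: WindowKV,
  provider: string,
  limit: number,
  windowSeconds: number,
  nowMs: number = Date.now(),
): Promise<boolean> {
  const window = Math.floor(nowMs / (windowSeconds * 1000));
  const count = await kv.incr(`rl:${provider}:${window}`, windowSeconds);
  return count <= limit;
}
```

Fixed windows allow short bursts at window boundaries; a sliding-window or Redis-side token-bucket script tightens that at the cost of more state per key.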
Summary
- Unified interface, multiple backends: business code only needs to know the scene and quality tier, never the specific provider.
- Fallback is a necessity: every single provider goes down eventually; a fallback chain keeps the service available.
- Circuit breakers prevent cascading failures: identify failing providers quickly instead of making users wait out timeouts.
- Cost is only controllable when it is visible: record token usage and cost per request, and build dashboards by team and application.
- Caching is the cheapest inference: a semantic cache drives the cost of repeated queries to zero; even a 30% hit rate yields substantial savings.
Maurice | maurice_wen@proton.me