Observability for Enterprise Agent Platforms
Lingque Teaching and Research Team (灵阙教研团队)
Updated 2026-02-28
Logging, Tracing, Metrics: Building Full-Stack Observability for Agent Systems
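Why Agent observability differs from traditional systems
Observability for traditional microservices focuses on how a request flows between services. Observability for an Agent system must answer several additional questions:
- Why did the Agent make this decision? (a traceable reasoning chain)
- How well do tool calls perform? (per-tool success rate and latency)
- How many tokens were consumed, and at what cost? (cost observability specific to LLMs)
- How do multiple Agents collaborate? (traceable inter-Agent communication)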
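Traditional microservice observability:
Request ──→ Service A ──→ Service B ──→ Response
Focus: latency, error rate, throughput
Agent observability (additional dimensions):
Task ──→ Agent ──→ [Think → Tool Call → Observe]×N ──→ Result
Focus: reasoning quality, tool efficiency, token cost, decision path, Agent collaboration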
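How the three pillars of observability map onto Agent systems
┌───────────────────────────────────────────────────────────────────────────────────┐
│ The three pillars of Agent observability                                          │
├───────────────────────────┬───────────────────────────┬───────────────────────────┤
│ Logging                   │ Tracing                   │ Metrics                   │
├───────────────────────────┼───────────────────────────┼───────────────────────────┤
│ Structured event records  │ End-to-end execution path │ Aggregated statistics     │
│ - Reasoning process       │ - Task → Agent            │ - Success rate            │
│ - Tool call details       │ - Agent → Tool            │ - Latency distribution    │
│ - Error stack traces      │ - Agent → Agent           │ - Token consumption       │
│ - Decision rationale      │ - Multi-step correlation  │ - Cost                    │
├───────────────────────────┼───────────────────────────┼───────────────────────────┤
│ Purpose:                  │ Purpose:                  │ Purpose:                  │
│ Debug a single run        │ Understand the full path  │ Monitor system health     │
│ Audit and compliance      │ Locate bottlenecks        │ Trend analysis            │
│ Knowledge retention       │ Root-cause analysis       │ Trigger alerts            │
└───────────────────────────┴───────────────────────────┴───────────────────────────┘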
1. Structured Logging
Log event types
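from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class AgentLogEvent:
    """Standard structure for an Agent log event."""

    # Event type enumeration
    class EventType(Enum):
        TASK_START = "task.start"
        TASK_END = "task.end"
        AGENT_THINK = "agent.think"
        AGENT_DECIDE = "agent.decide"
        TOOL_CALL_START = "tool.call.start"
        TOOL_CALL_END = "tool.call.end"
        TOOL_CALL_ERROR = "tool.call.error"
        LLM_REQUEST = "llm.request"
        LLM_RESPONSE = "llm.response"
        MEMORY_READ = "memory.read"
        MEMORY_WRITE = "memory.write"
        HITL_REQUEST = "hitl.request"
        HITL_RESPONSE = "hitl.response"
        ESCALATION = "escalation"
        HANDOFF = "agent.handoff"
        SECURITY_EVENT = "security.event"

    def __init__(self,
                 event_type: EventType,
                 trace_id: str,
                 span_id: str,
                 agent_id: str,
                 data: dict,
                 timestamp: Optional[datetime] = None):
        self.event_type = event_type
        self.trace_id = trace_id
        self.span_id = span_id
        self.agent_id = agent_id
        self.data = data
        # Timezone-aware UTC keeps timestamps comparable across hosts
        self.timestamp = timestamp or datetime.now(timezone.utc)
        self.level = self._infer_level()

    def _infer_level(self) -> str:
        # Illustrative mapping; adjust it to your own logging policy
        if self.event_type.value.endswith(".error"):
            return "ERROR"
        if self.event_type in (self.EventType.SECURITY_EVENT,
                               self.EventType.ESCALATION):
            return "WARNING"
        if self.event_type is self.EventType.AGENT_THINK:
            return "DEBUG"
        return "INFO"

    def to_dict(self) -> dict:
        base = {
            "timestamp": self.timestamp.isoformat(),
            "level": self.level,
            "event": self.event_type.value,
            "trace_id": self.trace_id,
            "span_id": self.span_id,
            "agent_id": self.agent_id,
        }
        # Standard fields win if `data` reuses one of their names
        extra = {k: v for k, v in self.data.items() if k not in base}
        return {**base, **extra}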
Sample log output
{"timestamp":"2026-02-28T10:15:30.123Z","level":"INFO","event":"task.start","trace_id":"tr_abc123","agent_id":"supervisor","task":"Analyze the Q4 financial report","user_id":"u_789"}
{"timestamp":"2026-02-28T10:15:30.456Z","level":"DEBUG","event":"agent.think","trace_id":"tr_abc123","agent_id":"supervisor","reasoning":"Need to fetch the financial report data first, then analyze it","confidence":0.92}
{"timestamp":"2026-02-28T10:15:30.789Z","level":"INFO","event":"tool.call.start","trace_id":"tr_abc123","span_id":"sp_001","agent_id":"supervisor","tool":"file_read","params":{"path":"/data/q4_report.xlsx"}}
{"timestamp":"2026-02-28T10:15:31.234Z","level":"INFO","event":"tool.call.end","trace_id":"tr_abc123","span_id":"sp_001","agent_id":"supervisor","tool":"file_read","status":"success","duration_ms":445,"result_size_bytes":15234}
{"timestamp":"2026-02-28T10:15:31.567Z","level":"INFO","event":"llm.request","trace_id":"tr_abc123","span_id":"sp_002","agent_id":"supervisor","model":"claude-opus-4-6","prompt_tokens":2340,"temperature":0.0}
{"timestamp":"2026-02-28T10:15:35.890Z","level":"INFO","event":"llm.response","trace_id":"tr_abc123","span_id":"sp_002","agent_id":"supervisor","model":"claude-opus-4-6","completion_tokens":890,"duration_ms":4323,"cost_usd":0.038}
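One way to emit lines of this shape is Python's standard `logging` module with a small JSON formatter. The sketch below is only an illustration, not the platform's actual logger: `JsonLineFormatter`, `build_logger`, and the convention of passing structured fields through `extra={"fields": ...}` are hypothetical names introduced here.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonLineFormatter(logging.Formatter):
    """Render each record as one JSON object per line (hypothetical helper)."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Structured fields arrive via extra={"fields": {...}} (assumed convention)
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry, ensure_ascii=False)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("agent-platform-demo")
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonLineFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
log.info("tool.call.end", extra={"fields": {
    "trace_id": "tr_abc123", "span_id": "sp_001", "agent_id": "supervisor",
    "tool": "file_read", "status": "success", "duration_ms": 445,
}})
line = json.loads(buffer.getvalue())
print(line["event"], line["duration_ms"])
```

Keeping the event name in the log message and everything else in a flat field dictionary means each line stays directly queryable by `trace_id` in a log store such as Loki or Elasticsearch.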
Redacting sensitive data
import re


class LogSanitizer:
    """Redacts sensitive values from log entries."""

    REDACT_PATTERNS = {
        "api_key": r"(sk-|ghp_|gho_|Bearer\s+)[a-zA-Z0-9_-]{20,}",
        "password": r"(?i)(password|passwd|secret|token)\s*[:=]\s*\S+",
        "email": r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b",
        "credit_card": r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b",
    }

    # Allowlist of fields that are never redacted
    SAFE_FIELDS = {"trace_id", "span_id", "agent_id", "event",
                   "timestamp", "level", "duration_ms", "status"}

    def sanitize(self, log_entry: dict) -> dict:
        """Return a redacted copy of a log entry."""
        sanitized = {}
        for key, value in log_entry.items():
            if key in self.SAFE_FIELDS:
                sanitized[key] = value
            else:
                sanitized[key] = self._sanitize_value(value)
        return sanitized

    def _sanitize_value(self, value):
        """Redact strings, recursing into nested dicts and lists."""
        if isinstance(value, str):
            return self._redact_string(value)
        if isinstance(value, dict):
            return self.sanitize(value)
        if isinstance(value, (list, tuple)):
            return [self._sanitize_value(item) for item in value]
        return value

    def _redact_string(self, text: str) -> str:
        result = text
        for name, pattern in self.REDACT_PATTERNS.items():
            result = re.sub(pattern, f"[REDACTED:{name}]", result)
        return result
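Redaction is easiest to enforce when it runs inside the logging pipeline, so no caller can forget it. A minimal sketch under that assumption: the hypothetical `RedactingFilter` below applies a subset of the patterns from `REDACT_PATTERNS` (repeated here so the example is self-contained) to every record before a handler writes it.

```python
import io
import logging
import re

# Subset of the patterns shown above, repeated to keep the sketch self-contained
PATTERNS = {
    "api_key": r"(sk-|ghp_|gho_|Bearer\s+)[a-zA-Z0-9_-]{20,}",
    "password": r"(?i)(password|passwd|secret|token)\s*[:=]\s*\S+",
    "email": r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b",
}


def redact(text: str) -> str:
    """Replace every match of every pattern with a labeled placeholder."""
    for name, pattern in PATTERNS.items():
        text = re.sub(pattern, f"[REDACTED:{name}]", text)
    return text


class RedactingFilter(logging.Filter):
    """Hypothetical filter: redacts the fully formatted message of a record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = redact(record.getMessage())
        record.args = ()  # arguments are already merged into msg
        return True


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(RedactingFilter())

logger = logging.getLogger("redaction-demo")
logger.handlers.clear()
logger.propagate = False
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("user %s sent password=%s", "alice@example.com", "hunter2")
output = stream.getvalue().strip()
print(output)
```

Attaching the filter to the handler rather than to a single logger means records from any logger routed to that handler pass through the same redaction step.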
2. Distributed Tracing
OpenTelemetry integration
import json
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import (
    OTLPSpanExporter
)
from opentelemetry.trace import Status, StatusCode

# Initialize the tracer
provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317")
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent-platform")


class TracedAgent:
    """Agent with distributed tracing.

    _execute, _count_tokens, _calc_cost, and _get_tool are platform-specific
    helpers that are not shown here.
    """

    def __init__(self, name: str, llm, tools: list):
        self.name = name
        self.llm = llm
        self.tools = tools

    def run(self, task: str) -> str:
        with tracer.start_as_current_span(
            "agent.run",
            attributes={
                "agent.name": self.name,
                "agent.task": task[:200],  # truncate to keep attributes small
            },
            # Errors are recorded manually below; disable automatic recording
            # so the exception is not attached to the span twice.
            record_exception=False,
            set_status_on_exception=False,
        ) as span:
            try:
                result = self._execute(task)
                span.set_attribute("agent.status", "success")
                return result
            except Exception as e:
                span.set_status(Status(StatusCode.ERROR, str(e)))
                span.record_exception(e)
                raise

    def _call_llm(self, messages: list) -> str:
        with tracer.start_as_current_span(
            "llm.call",
            attributes={
                "llm.model": self.llm.model,
                "llm.prompt_tokens": self._count_tokens(messages),
            }
        ) as span:
            start = time.perf_counter()  # monotonic clock for durations
            response = self.llm.chat(messages)
            duration = time.perf_counter() - start
            span.set_attribute("llm.completion_tokens",
                               response.usage.completion_tokens)
            span.set_attribute("llm.total_tokens",
                               response.usage.total_tokens)
            span.set_attribute("llm.duration_ms", int(duration * 1000))
            span.set_attribute("llm.cost_usd",
                               self._calc_cost(response.usage))
            return response.content

    def _call_tool(self, tool_name: str, params: dict) -> str:
        with tracer.start_as_current_span(
            "tool.call",
            attributes={
                "tool.name": tool_name,
                "tool.params": json.dumps(params)[:500],
            }
        ) as span:
            tool = self._get_tool(tool_name)
            start = time.perf_counter()
            try:
                result = tool.execute(params)
                span.set_attribute("tool.status", "success")
                span.set_attribute("tool.result_size",
                                   len(str(result)))
                return result
            except Exception as e:
                span.set_attribute("tool.status", "error")
                span.set_attribute("tool.error", str(e))
                raise
            finally:
                # Record the duration on both the success and the error path
                span.set_attribute("tool.duration_ms",
                                   int((time.perf_counter() - start) * 1000))
Example trace structure
Trace: tr_abc123 (Analyze the Q4 financial report)
│
├── Span: agent.run [supervisor] (12.5s)
│ │
│ ├── Span: llm.call [claude-opus-4-6] (2.1s)
│ │ └── tokens: 2340 in / 150 out / $0.012
│ │
│ ├── Span: tool.call [file_read] (0.4s)
│ │ └── path: /data/q4_report.xlsx, status: success
│ │
│ ├── Span: agent.handoff [supervisor -> analyst] (0.1s)
│ │
│ ├── Span: agent.run [analyst] (8.2s)
│ │ │
│ │ ├── Span: llm.call [claude-opus-4-6] (3.5s)
│ │ │ └── tokens: 5600 in / 1200 out / $0.048
│ │ │
│ │ ├── Span: tool.call [python_execute] (2.1s)
│ │ │ └── code: "import pandas...", status: success
│ │ │
│ │ └── Span: llm.call [claude-opus-4-6] (2.6s)
│ │ └── tokens: 3800 in / 900 out / $0.034
│ │
│ └── Span: llm.call [claude-opus-4-6] (1.7s)
│ └── tokens: 2100 in / 500 out / $0.018
│
└── Total: 12.5s, 16590 tokens, $0.112
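The roll-up line at the bottom of the tree is a recursive sum over the spans. The sketch below recomputes it from the per-span figures in the example. `SpanNode` and `rollup` are hypothetical names for illustration only; a real tracing backend would compute this from exported span attributes.

```python
from dataclasses import dataclass, field


@dataclass
class SpanNode:
    """Simplified span: only the fields needed for a token/cost roll-up."""
    name: str
    duration_s: float
    tokens_in: int = 0
    tokens_out: int = 0
    cost_usd: float = 0.0
    children: list["SpanNode"] = field(default_factory=list)


def rollup(span: SpanNode) -> tuple[int, float]:
    """Return (total tokens, total cost) for a span and all its descendants."""
    tokens = span.tokens_in + span.tokens_out
    cost = span.cost_usd
    for child in span.children:
        child_tokens, child_cost = rollup(child)
        tokens += child_tokens
        cost += child_cost
    return tokens, cost


# The example trace, using the per-span numbers shown in the tree
analyst = SpanNode("agent.run[analyst]", 8.2, children=[
    SpanNode("llm.call", 3.5, 5600, 1200, 0.048),
    SpanNode("tool.call[python_execute]", 2.1),
    SpanNode("llm.call", 2.6, 3800, 900, 0.034),
])
supervisor = SpanNode("agent.run[supervisor]", 12.5, children=[
    SpanNode("llm.call", 2.1, 2340, 150, 0.012),
    SpanNode("tool.call[file_read]", 0.4),
    SpanNode("agent.handoff", 0.1),
    analyst,
    SpanNode("llm.call", 1.7, 2100, 500, 0.018),
])

total_tokens, total_cost = rollup(supervisor)
print(f"{supervisor.duration_s}s, {total_tokens} tokens, ${total_cost:.3f}")
```

Durations are not summed: the root span already covers its children, so the trace duration is simply the root span's duration.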
Multi-Agent tracing
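import json
from concurrent.futures import ThreadPoolExecutor

# `trace` and `tracer` come from the OpenTelemetry setup above.


class MultiAgentTracer:
    """Tracing for multi-Agent collaboration."""

    def trace_handoff(self, source_agent: str,
                      target_agent: str,
                      context: dict):
        """Trace a transfer of control between Agents."""
        with tracer.start_as_current_span(
            "agent.handoff",
            attributes={
                "handoff.source": source_agent,
                "handoff.target": target_agent,
                "handoff.reason": context.get("reason", ""),
                "handoff.context_size": len(json.dumps(context)),
            }
        ):
            pass  # The handoff span only marks the transfer point

    def trace_parallel_agents(self, agent_tasks: list[tuple]):
        """Trace Agents that run in parallel.

        Assumes each item is an (agent, task) pair where `agent` exposes
        `name` and `run(task)`, as TracedAgent does.
        """
        with tracer.start_as_current_span("agents.parallel") as parent:
            parent.set_attribute("parallel.count", len(agent_tasks))
            # Worker threads do not inherit the current context, so pass the
            # parent explicitly: every child span hangs off the parallel span.
            parent_ctx = trace.set_span_in_context(parent)

            def run_one(agent, task):
                with tracer.start_as_current_span(
                    "agent.parallel_task",
                    context=parent_ctx,
                    attributes={"agent.name": agent.name},
                ):
                    return agent.run(task)

            with ThreadPoolExecutor(
                    max_workers=max(1, len(agent_tasks))) as pool:
                futures = [pool.submit(run_one, agent, task)
                           for agent, task in agent_tasks]
                return [future.result() for future in futures]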
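For parallel Agents, each child span needs the parallel span as its parent, but a worker thread does not automatically see the caller's "current span". OpenTelemetry's Python SDK keeps the current span in a `contextvars`-based context by default, so the mechanism can be illustrated with the standard library alone. In the sketch below, `current_span`, `run_agent`, and `run_parallel` are hypothetical stand-ins, not OpenTelemetry APIs.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor
from typing import Optional

# Hypothetical stand-in for the tracing library's "current span" slot
current_span: contextvars.ContextVar[Optional[str]] = contextvars.ContextVar(
    "current_span", default=None)


def run_agent(agent_name: str) -> tuple[str, Optional[str]]:
    """Pretend Agent: report which span it sees as its parent."""
    return agent_name, current_span.get()


def run_parallel(agent_names: list[str]) -> dict[str, Optional[str]]:
    """Run Agents in worker threads, propagating the caller's context."""
    current_span.set("agents.parallel")
    with ThreadPoolExecutor(max_workers=len(agent_names)) as pool:
        futures = [
            # Copy the context once per task: a Context object cannot be
            # entered by two threads at the same time.
            pool.submit(contextvars.copy_context().run, run_agent, name)
            for name in agent_names
        ]
        return dict(future.result() for future in futures)


parents = run_parallel(["analyst", "researcher"])
print(parents)
```

Without the `copy_context().run` wrapper, the workers would usually start from an empty context and their spans would show up as unrelated root spans instead of children of the parallel span.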
3. Metrics
Core Agent metric definitions
from prometheus_client import Counter, Gauge, Histogram

# ---- Task-level metrics ----
task_total = Counter(
    "agent_tasks_total",
    "Total number of tasks",
    ["agent_name", "status"]  # status: success/failure/timeout
)
task_duration = Histogram(
    "agent_task_duration_seconds",
    "Task execution duration",
    ["agent_name"],
    buckets=[1, 5, 10, 30, 60, 120, 300, 600]
)

# ---- LLM call metrics ----
llm_calls = Counter(
    "agent_llm_calls_total",
    "Total LLM API calls",
    ["agent_name", "model", "status"]
)
llm_tokens = Counter(
    "agent_llm_tokens_total",
    "Total tokens consumed",
    ["agent_name", "model", "direction"]  # direction: input/output
)
llm_cost = Counter(
    "agent_llm_cost_usd_total",
    "Total LLM cost in USD",
    ["agent_name", "model"]
)
llm_latency = Histogram(
    "agent_llm_latency_seconds",
    "LLM API call latency",
    ["model"],
    buckets=[0.5, 1, 2, 5, 10, 30]
)

# ---- Tool call metrics ----
tool_calls = Counter(
    "agent_tool_calls_total",
    "Total tool calls",
    ["agent_name", "tool_name", "status"]
)
tool_latency = Histogram(
    "agent_tool_latency_seconds",
    "Tool call latency",
    ["tool_name"],
    buckets=[0.1, 0.5, 1, 5, 10, 30]
)
tool_fallback = Counter(
    "agent_tool_fallback_total",
    "Tool fallback triggered",
    ["tool_name", "fallback_tool"]
)

# ---- System health metrics ----
active_agents = Gauge(
    "agent_active_count",
    "Currently active agents",
    ["agent_name"]
)
memory_usage = Gauge(
    "agent_memory_usage_bytes",
    "Agent memory usage",
    ["agent_name", "memory_type"]  # memory_type: buffer/vector/episodic
)
queue_depth = Gauge(
    "agent_task_queue_depth",
    "Pending tasks in queue",
    ["priority"]
)
Metrics dashboard layout
┌─────────────────────────────────────────────────────────────────────┐
│ Agent Platform Dashboard                                            │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│ [Live overview]                                                     │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Active Agents│ │ Success rate │ │ Avg latency  │ │ Cost today   │ │
│ │ 12           │ │ 94.2%        │ │ 8.3s         │ │ $47.20       │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
│                                                                     │
│ [Task trend - 24h]                                                  │
│ Success  ██████████████████████████████████░░  94.2%                │
│ Failure  ██░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░   4.1%                │
│ Timeout  █░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░   1.7%                │
│                                                                     │
│ [LLM call distribution]           [Top 5 tool calls]                │
│ claude-opus-4-6    45%            file_read         234             │
│ gpt-4o             30%            web_search        189             │
│ gemini-2.5-pro     25%            code_execute      156             │
│                                   database_query     98             │
│                                   file_write         67             │
│                                                                     │
│ [Cost breakdown - by model]       [Error rate - by tool]            │
│ claude: $28.50                    shell_exec    8.2%                │
│ gpt-4o: $12.30                    web_search    3.1%                │
│ gemini:  $6.40                    db_query      1.5%                │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
4. Alert Rule Design
Alert tiers
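from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    """Minimal rule container; adapt the fields to your alerting system."""
    name: str
    condition: str  # PromQL expression
    severity: str
    message: str
    channels: list[str]
    runbook: Optional[str] = None


ALERT_RULES = [
    # P0: handle immediately
    AlertRule(
        name="agent_task_success_rate_critical",
        # Failure ratio: failed tasks divided by all tasks, not a raw rate
        condition="sum(rate(agent_tasks_total{status='failure'}[5m])) "
                  "/ sum(rate(agent_tasks_total[5m])) > 0.2",
        severity="critical",
        message="Agent task failure rate exceeds 20%",
        channels=["pagerduty", "slack-oncall"],
        runbook="https://wiki/runbooks/agent-high-failure-rate"
    ),
    AlertRule(
        name="agent_security_violation",
        condition="increase(agent_security_events_total"
                  "{severity='critical'}[1m]) > 0",
        severity="critical",
        message="Agent security violation detected",
        channels=["pagerduty", "slack-security"],
    ),
    # P1: handle within 1 hour
    AlertRule(
        name="agent_cost_anomaly",
        condition="rate(agent_llm_cost_usd_total[1h]) "
                  "> 2 * avg_over_time("
                  "rate(agent_llm_cost_usd_total[1h])[7d:1h])",
        severity="warning",
        message="Agent LLM cost anomaly: the current rate is more than "
                "2x the 7-day average",
        channels=["slack-ops"],
    ),
    AlertRule(
        name="agent_latency_high",
        # histogram_quantile needs the per-bucket rates, aggregated by `le`
        condition="histogram_quantile(0.99, sum by (le) ("
                  "rate(agent_task_duration_seconds_bucket[5m]))) > 120",
        severity="warning",
        message="Agent task P99 latency exceeds 2 minutes",
        channels=["slack-ops"],
    ),
    # P2: handle on the next business day
    AlertRule(
        name="agent_tool_fallback_rate",
        # Aggregate both sides to the shared `tool_name` label so they match
        condition="sum by (tool_name) (rate(agent_tool_fallback_total[1h])) "
                  "/ sum by (tool_name) (rate(agent_tool_calls_total[1h])) "
                  "> 0.1",
        severity="info",
        message="Tool fallback rate exceeds 10%",
        channels=["slack-dev"],
    ),
]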
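The P99 latency rule relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counters by interpolating linearly inside the bucket that contains the target rank. The sketch below is a simplified re-implementation of that idea for classic histograms, intended as illustration rather than an exact copy of Prometheus behavior. The bucket counts are made-up sample data; only the bucket bounds come from `agent_task_duration_seconds`.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    The last bucket must be the +Inf bucket holding the total count.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls beyond the last finite bucket: report its bound
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Linear interpolation inside the bucket containing the rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Made-up cumulative counts for the task duration buckets (seconds)
SAMPLE_BUCKETS = [
    (1, 10), (5, 40), (10, 70), (30, 90), (60, 96),
    (120, 98), (300, 100), (600, 100), (math.inf, 100),
]

p99 = histogram_quantile(0.99, SAMPLE_BUCKETS)
print(f"P99 = {p99:.1f}s, alert fires: {p99 > 120}")
```

With this sample data the 99th observation falls in the 120 to 300 second bucket, so the estimate lands between those bounds and crosses the 120 second threshold. The result can only be as precise as the bucket layout, which is why the bucket list should include the alert threshold as a boundary.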
5. Cost Observability
Tracking token consumption
from datetime import date


class CostTracker:
    """LLM cost tracking.

    The _query_*, _calc_efficiency, _budget_remaining, and
    _forecast_end_of_month helpers depend on your metrics store and are
    not shown here.
    """

    # Model pricing in USD per million tokens
    PRICING = {
        "claude-opus-4-6": {"input": 15.0, "output": 75.0},
        "claude-sonnet-4": {"input": 3.0, "output": 15.0},
        "gpt-4o": {"input": 2.5, "output": 10.0},
        "gemini-2.5-pro": {"input": 1.25, "output": 10.0},
    }

    def calculate_cost(self, model: str,
                       input_tokens: int,
                       output_tokens: int) -> float:
        """Calculate the cost of a single call."""
        # Unknown models fall back to a default price
        pricing = self.PRICING.get(model, {"input": 5.0, "output": 15.0})
        cost = (
            input_tokens * pricing["input"] / 1_000_000 +
            output_tokens * pricing["output"] / 1_000_000
        )
        return round(cost, 6)

    def get_daily_report(self) -> dict:
        """Generate the daily cost report."""
        return {
            "date": date.today().isoformat(),
            "total_cost_usd": self._query_total_cost(),
            "by_model": self._query_cost_by_model(),
            "by_agent": self._query_cost_by_agent(),
            "by_task_type": self._query_cost_by_task_type(),
            "token_efficiency": self._calc_efficiency(),
            "budget_remaining": self._budget_remaining(),
            "forecast_eom": self._forecast_end_of_month(),
        }
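As a worked example of the per-call formula, the sketch below aggregates cost by model from a few in-memory usage records. It is only an illustration of what a `by_model` breakdown might compute: the `(model, input_tokens, output_tokens)` record shape and the function names are assumptions, and the prices are copied from the `PRICING` table above.

```python
from collections import defaultdict

# USD per million tokens, copied from the PRICING table above
PRICING = {
    "claude-sonnet-4": {"input": 3.0, "output": 15.0},
    "gpt-4o": {"input": 2.5, "output": 10.0},
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call: tokens times the per-million price, per direction."""
    price = PRICING[model]
    return (input_tokens * price["input"]
            + output_tokens * price["output"]) / 1_000_000


def cost_by_model(usage_records: list[tuple[str, int, int]]) -> dict[str, float]:
    """Sum call costs per model from (model, input_tokens, output_tokens)."""
    totals: dict[str, float] = defaultdict(float)
    for model, input_tokens, output_tokens in usage_records:
        totals[model] += call_cost(model, input_tokens, output_tokens)
    return {model: round(cost, 6) for model, cost in totals.items()}


# Hypothetical usage records for illustration
records = [
    ("gpt-4o", 10_000, 2_000),           # 0.025 + 0.020 = 0.045
    ("gpt-4o", 4_000, 1_000),            # 0.010 + 0.010 = 0.020
    ("claude-sonnet-4", 10_000, 2_000),  # 0.030 + 0.030 = 0.060
]
report = cost_by_model(records)
print(report)
```

Output tokens are priced several times higher than input tokens in this table, so for the same workload the output volume tends to dominate the bill.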
Sample cost report
Daily Cost Report - 2026-02-28
================================
Total: $47.20 / Budget: $100.00 (47.2%)
Forecast EOM: $1,378 / Monthly Budget: $3,000 (45.9%)
By Model:
  claude-opus-4-6 $28.50 (60.4%) ████████████░░░░░░░░
  gpt-4o          $12.30 (26.1%) █████░░░░░░░░░░░░░░░
  gemini-2.5-pro   $6.40 (13.6%) ███░░░░░░░░░░░░░░░░░
By Agent:
  code-reviewer  $18.20 384 tasks $0.047/task
  research-agent $15.30 156 tasks $0.098/task
  data-analyst    $8.70  89 tasks $0.098/task
  support-agent   $5.00 412 tasks $0.012/task
Token Efficiency:
  Useful output / Total tokens: 34.2%
  Cache hit rate: 67.8%
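6. Observability Infrastructure
Recommended tech stack
Collection layer:
OpenTelemetry SDK ──→ OpenTelemetry Collector (OTLP) ──→ storage/analysis
Storage layer:
Traces ──→ Jaeger / Tempo / Datadog
Metrics ──→ Prometheus / VictoriaMetrics
Logs ──→ Loki / Elasticsearch
Presentation layer:
Grafana (unified dashboards)
Custom Agent debugging UI
Alerting layer:
Alertmanager / PagerDuty / Slack Webhook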
Example deployment configuration
# docker-compose.observability.yml
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.96.0
    ports:
      - "4317:4317"  # OTLP gRPC
      - "4318:4318"  # OTLP HTTP
    volumes:
      # The contrib image reads its default config from /etc/otelcol-contrib/
      - ./otel-config.yaml:/etc/otelcol-contrib/config.yaml
  prometheus:
    image: prom/prometheus:v2.51.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  grafana:
    image: grafana/grafana:10.4.0
    ports:
      - "3000:3000"
    volumes:
      - ./grafana/dashboards:/var/lib/grafana/dashboards
  jaeger:
    image: jaegertracing/all-in-one:1.55
    ports:
      - "16686:16686"  # UI
      - "14268:14268"  # Collector HTTP
  loki:
    image: grafana/loki:2.9.5
    ports:
      - "3100:3100"
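Engineering practice recommendations
- Start with traces: traces are the most valuable observability data because they tie the whole execution path together
- Cost observability is mandatory: an Agent system's spend can spiral out of control within minutes, so real-time monitoring is required
- Log redaction is a hard requirement: Agent logs may contain user data and API keys
- Sampling strategy: use tail-based sampling in production, and keep 100% of errors and slow requests
- Keep the reasoning chain: the Agent's thinking/reasoning is key debugging information, so keep at least a summary even when it is compressed
- Prevent alert fatigue: tier the alert rules, and aggregate low-priority alerts into batches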
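The tail-based sampling recommendation can be sketched as a decision made after a trace has finished, once its outcome and duration are known. In the example below, `TraceSummary`, `should_keep`, the 30 second slow threshold, and the 10% baseline rate are all assumptions chosen for illustration. In production this decision usually lives in the collector, for example in the OpenTelemetry Collector's tail sampling processor, rather than in application code.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class TraceSummary:
    """What is known about a trace once it has completed."""
    trace_id: str
    duration_s: float
    has_error: bool


def should_keep(trace: TraceSummary,
                slow_threshold_s: float = 30.0,
                baseline_rate: float = 0.1) -> bool:
    """Keep every error and slow trace; sample the rest at baseline_rate."""
    if trace.has_error or trace.duration_s >= slow_threshold_s:
        return True
    # Deterministic sampling: hash the trace id into a 64-bit bucket so
    # every component reaches the same decision for the same trace.
    digest = hashlib.sha256(trace.trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < baseline_rate * 2**64


traces = [
    TraceSummary("tr_error", 2.0, has_error=True),
    TraceSummary("tr_slow", 45.0, has_error=False),
    TraceSummary("tr_normal", 1.5, has_error=False),
]
decisions = {t.trace_id: should_keep(t) for t in traces}
print(decisions["tr_error"], decisions["tr_slow"])
```

Hashing the trace id instead of drawing a random number keeps the decision reproducible, which matters when several collector instances may see spans from the same trace.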
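References
- OpenTelemetry official documentation: the vendor-neutral standard for traces, metrics, and logs
- LangSmith / LangFuse: observability platforms built for LLM applications
- Arize Phoenix: open-source LLM observability tool
- Braintrust: combined Agent evaluation and observability platform
- Datadog LLM Observability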
Maurice | maurice_wen@proton.me