Observability for Enterprise Agent Platforms

Logging, Tracing, Metrics: Building Full-Stack Observability for Agent Systems


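Why Agent Observability Differs from Traditional Systems

Observability for traditional microservices focuses on how a request flows between services. Observability for an agent system must answer four additional questions:

  1. Why did the agent make this decision? (traceable reasoning chain)
  2. How well did the tool calls perform? (per-tool success rate and latency)
  3. How many tokens were consumed, and at what cost? (cost observability specific to LLMs)
  4. How do multiple agents collaborate? (traceable inter-agent communication)

Traditional microservice observability:
  Request ──→ Service A ──→ Service B ──→ Response
  Focus: latency, error rate, throughput

Agent observability (additional dimensions):
  Task ──→ Agent ──→ [Think → Tool Call → Observe]×N ──→ Result
  Focus: reasoning quality, tool efficiency, token cost, decision path, agent collaboration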

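Mapping the Three Pillars of Observability onto Agent Systems

┌───────────────────────────────────────────────────────────────────────────────┐
│                     Three Pillars of Agent Observability                      │
├──────────────────────────┬───────────────────────────┬────────────────────────┤
│ Logging                  │ Tracing                   │ Metrics                │
├──────────────────────────┼───────────────────────────┼────────────────────────┤
│ Structured event records │ End-to-end execution path │ Aggregated statistics  │
│ - Reasoning process      │ - Task → Agent            │ - Success rate         │
│ - Tool call details      │ - Agent → Tool            │ - Latency distribution │
│ - Error stack traces     │ - Agent → Agent           │ - Token consumption    │
│ - Decision rationale     │ - Multi-step correlation  │ - Cost                 │
├──────────────────────────┼───────────────────────────┼────────────────────────┤
│ Purpose:                 │ Purpose:                  │ Purpose:               │
│ Debug a single run       │ Understand the full path  │ Monitor system health  │
│ Audit and compliance     │ Locate bottlenecks        │ Trend analysis         │
│ Knowledge capture        │ Root cause analysis       │ Trigger alerts         │
└──────────────────────────┴───────────────────────────┴────────────────────────┘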

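1. Structured Logging

Log Event Types

from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class AgentLogEvent:
    """Standard structure for an agent log event."""

    # Event type enumeration
    class EventType(Enum):
        TASK_START = "task.start"
        TASK_END = "task.end"
        AGENT_THINK = "agent.think"
        AGENT_DECIDE = "agent.decide"
        TOOL_CALL_START = "tool.call.start"
        TOOL_CALL_END = "tool.call.end"
        TOOL_CALL_ERROR = "tool.call.error"
        LLM_REQUEST = "llm.request"
        LLM_RESPONSE = "llm.response"
        MEMORY_READ = "memory.read"
        MEMORY_WRITE = "memory.write"
        HITL_REQUEST = "hitl.request"
        HITL_RESPONSE = "hitl.response"
        ESCALATION = "escalation"
        HANDOFF = "agent.handoff"
        SECURITY_EVENT = "security.event"

    def __init__(self,
                 event_type: EventType,
                 trace_id: str,
                 span_id: str,
                 agent_id: str,
                 data: dict,
                 timestamp: Optional[datetime] = None):
        self.event_type = event_type
        self.trace_id = trace_id
        self.span_id = span_id
        self.agent_id = agent_id
        self.data = data
        # Timezone-aware UTC, so isoformat() carries an explicit offset
        self.timestamp = timestamp or datetime.now(timezone.utc)
        self.level = self._infer_level()

    def _infer_level(self) -> str:
        """Derive a log level from the event type (illustrative mapping; adjust to your policy)."""
        event_types = AgentLogEvent.EventType
        if self.event_type is event_types.TOOL_CALL_ERROR:
            return "ERROR"
        if self.event_type in (event_types.SECURITY_EVENT, event_types.ESCALATION):
            return "WARNING"
        if self.event_type is event_types.AGENT_THINK:
            return "DEBUG"
        return "INFO"

    def to_dict(self) -> dict:
        return {
            "timestamp": self.timestamp.isoformat(),
            "level": self.level,
            "event": self.event_type.value,
            "trace_id": self.trace_id,
            "span_id": self.span_id,
            "agent_id": self.agent_id,
            **self.data  # event-specific payload; same-named keys override the core fields
        }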

Sample Log Output

{"timestamp":"2026-02-28T10:15:30.123Z","level":"INFO","event":"task.start","trace_id":"tr_abc123","agent_id":"supervisor","task":"Analyze Q4 financial report","user_id":"u_789"}
{"timestamp":"2026-02-28T10:15:30.456Z","level":"DEBUG","event":"agent.think","trace_id":"tr_abc123","agent_id":"supervisor","reasoning":"Need to fetch the financial report data first, then analyze it","confidence":0.92}
{"timestamp":"2026-02-28T10:15:30.789Z","level":"INFO","event":"tool.call.start","trace_id":"tr_abc123","span_id":"sp_001","agent_id":"supervisor","tool":"file_read","params":{"path":"/data/q4_report.xlsx"}}
{"timestamp":"2026-02-28T10:15:31.234Z","level":"INFO","event":"tool.call.end","trace_id":"tr_abc123","span_id":"sp_001","agent_id":"supervisor","tool":"file_read","status":"success","duration_ms":445,"result_size_bytes":15234}
{"timestamp":"2026-02-28T10:15:31.567Z","level":"INFO","event":"llm.request","trace_id":"tr_abc123","span_id":"sp_002","agent_id":"supervisor","model":"claude-opus-4-6","prompt_tokens":2340,"temperature":0.0}
{"timestamp":"2026-02-28T10:15:35.890Z","level":"INFO","event":"llm.response","trace_id":"tr_abc123","span_id":"sp_002","agent_id":"supervisor","model":"claude-opus-4-6","completion_tokens":890,"duration_ms":4323,"cost_usd":0.038}
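
One way to emit lines in this shape is a custom formatter on Python's standard logging module. The sketch below is a minimal illustration rather than the platform's actual logger: JsonLineFormatter, build_logger, and the convention of passing event fields through extra={"fields": ...} are assumptions made for this example.

```python
import json
import logging
from datetime import datetime, timezone


class JsonLineFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Hypothetical convention: event fields arrive via extra={"fields": {...}}
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry, ensure_ascii=False)


def build_logger(name: str = "agent") -> logging.Logger:
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonLineFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # avoid duplicate output through the root logger
    return logger


logger = build_logger()
logger.info("task.start", extra={"fields": {
    "trace_id": "tr_abc123",
    "agent_id": "supervisor",
    "user_id": "u_789",
}})
```

Keeping one JSON object per line means log shippers can parse each line independently, and the trace_id field can later be used to join log lines with spans.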

Redacting Sensitive Information

import re


class LogSanitizer:
    """Redacts sensitive values from log entries."""

    REDACT_PATTERNS = {
        "api_key": r"(sk-|ghp_|gho_|Bearer\s+)[a-zA-Z0-9_-]{20,}",
        "password": r"(?i)(password|passwd|secret|token)\s*[:=]\s*\S+",
        "email": r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b",
        "credit_card": r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b",
    }

    # Allowlist of fields that are never redacted
    SAFE_FIELDS = {"trace_id", "span_id", "agent_id", "event",
                   "timestamp", "level", "duration_ms", "status"}

    def sanitize(self, log_entry: dict) -> dict:
        """Return a redacted copy of a log entry."""
        sanitized = {}
        for key, value in log_entry.items():
            if key in self.SAFE_FIELDS:
                sanitized[key] = value
            elif isinstance(value, str):
                sanitized[key] = self._redact_string(value)
            elif isinstance(value, dict):
                # Recurse into nested objects such as tool parameters
                sanitized[key] = self.sanitize(value)
            else:
                sanitized[key] = value
        return sanitized

    def _redact_string(self, text: str) -> str:
        """Apply every pattern in turn, replacing matches with a labeled marker."""
        result = text
        for name, pattern in self.REDACT_PATTERNS.items():
            result = re.sub(pattern, f"[REDACTED:{name}]", result)
        return result
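
The sanitize method above recurses into nested dicts but passes lists through unchanged, so a list of recipient addresses inside tool parameters would be logged as-is. The following sketch, using the same patterns and allowlist, shows one possible variant that also walks lists; the function name redact and the sample entry are hypothetical.

```python
import re

# Same patterns as REDACT_PATTERNS above, compiled once at import time
PATTERNS = {
    "api_key": re.compile(r"(sk-|ghp_|gho_|Bearer\s+)[a-zA-Z0-9_-]{20,}"),
    "password": re.compile(r"(?i)(password|passwd|secret|token)\s*[:=]\s*\S+"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "credit_card": re.compile(r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b"),
}
SAFE_FIELDS = {"trace_id", "span_id", "agent_id", "event",
               "timestamp", "level", "duration_ms", "status"}


def redact(value):
    """Recursively redact strings inside dicts and lists; other types pass through."""
    if isinstance(value, str):
        for name, pattern in PATTERNS.items():
            value = pattern.sub(f"[REDACTED:{name}]", value)
        return value
    if isinstance(value, dict):
        return {k: v if k in SAFE_FIELDS else redact(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [redact(item) for item in value]
    return value


# Hypothetical log entry with sensitive values nested inside a list and a dict
entry = {
    "trace_id": "tr_abc123",
    "params": {
        "recipients": ["alice@example.com"],
        "header": "Bearer abcdefghijklmnopqrstuvwxyz123456",
    },
}
clean = redact(entry)
print(clean)
```

Pattern-based redaction is a safety net and will miss secrets in formats it does not know; where possible, keep sensitive fields out of the log payload in the first place.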

2. Distributed Tracing

OpenTelemetry Integration

import json
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import (
    OTLPSpanExporter
)

# Initialize the tracer
provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317")
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("agent-platform")


class TracedAgent:
    """An agent instrumented with distributed tracing."""

    def __init__(self, name: str, llm, tools: list):
        self.name = name
        self.llm = llm
        self.tools = tools

    def run(self, task: str) -> str:
        with tracer.start_as_current_span(
            "agent.run",
            attributes={
                "agent.name": self.name,
                "agent.task": task[:200],  # truncate to keep the attribute short
            }
        ) as span:
            try:
                result = self._execute(task)
                span.set_attribute("agent.status", "success")
                return result
            except Exception as e:
                span.set_status(trace.StatusCode.ERROR, str(e))
                span.record_exception(e)
                raise

    def _call_llm(self, messages: list) -> str:
        with tracer.start_as_current_span(
            "llm.call",
            attributes={
                "llm.model": self.llm.model,
                "llm.prompt_tokens": self._count_tokens(messages),
            }
        ) as span:
            start = time.time()
            response = self.llm.chat(messages)
            duration = time.time() - start

            span.set_attribute("llm.completion_tokens",
                             response.usage.completion_tokens)
            span.set_attribute("llm.total_tokens",
                             response.usage.total_tokens)
            span.set_attribute("llm.duration_ms", int(duration * 1000))
            span.set_attribute("llm.cost_usd",
                             self._calc_cost(response.usage))

            return response.content

    def _call_tool(self, tool_name: str, params: dict) -> str:
        with tracer.start_as_current_span(
            "tool.call",
            attributes={
                "tool.name": tool_name,
                "tool.params": json.dumps(params)[:500],
            }
        ) as span:
            tool = self._get_tool(tool_name)
            start = time.time()
            try:
                result = tool.execute(params)
                span.set_attribute("tool.status", "success")
                span.set_attribute("tool.result_size",
                                 len(str(result)))
                return result
            except Exception as e:
                span.set_attribute("tool.status", "error")
                span.set_attribute("tool.error", str(e))
                raise
            finally:
                span.set_attribute("tool.duration_ms",
                                 int((time.time() - start) * 1000))

Example Trace Structure

Trace: tr_abc123 (Analyze Q4 financial report)
│
├── Span: agent.run [supervisor] (12.5s)
│   │
│   ├── Span: llm.call [claude-opus-4-6] (2.1s)
│   │   └── tokens: 2340 in / 150 out / $0.012
│   │
│   ├── Span: tool.call [file_read] (0.4s)
│   │   └── path: /data/q4_report.xlsx, status: success
│   │
│   ├── Span: agent.handoff [supervisor -> analyst] (0.1s)
│   │
│   ├── Span: agent.run [analyst] (8.2s)
│   │   │
│   │   ├── Span: llm.call [claude-opus-4-6] (3.5s)
│   │   │   └── tokens: 5600 in / 1200 out / $0.048
│   │   │
│   │   ├── Span: tool.call [python_execute] (2.1s)
│   │   │   └── code: "import pandas...", status: success
│   │   │
│   │   └── Span: llm.call [claude-opus-4-6] (2.6s)
│   │       └── tokens: 3800 in / 900 out / $0.034
│   │
│   └── Span: llm.call [claude-opus-4-6] (1.7s)
│       └── tokens: 2100 in / 500 out / $0.018
│
└── Total: 12.5s, 16590 tokens, $0.112
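
The totals line of a trace view is a roll-up of the per-span figures. As a rough sketch of that aggregation, the code below rebuilds the example tree and sums tokens and cost recursively; SpanNode and rollup are hypothetical names for this illustration and are not part of OpenTelemetry.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SpanNode:
    """Simplified span holding only the fields needed for the roll-up."""
    name: str
    tokens_in: int = 0
    tokens_out: int = 0
    cost_usd: float = 0.0
    children: list[SpanNode] = field(default_factory=list)


def rollup(span: SpanNode) -> tuple[int, float]:
    """Return (total tokens, total cost) for a span and all of its descendants."""
    tokens = span.tokens_in + span.tokens_out
    cost = span.cost_usd
    for child in span.children:
        child_tokens, child_cost = rollup(child)
        tokens += child_tokens
        cost += child_cost
    return tokens, cost


# The example trace: supervisor hands off to analyst, then summarizes
analyst = SpanNode("agent.run [analyst]", children=[
    SpanNode("llm.call", 5600, 1200, 0.048),
    SpanNode("tool.call [python_execute]"),
    SpanNode("llm.call", 3800, 900, 0.034),
])
supervisor = SpanNode("agent.run [supervisor]", children=[
    SpanNode("llm.call", 2340, 150, 0.012),
    SpanNode("tool.call [file_read]"),
    SpanNode("agent.handoff"),
    analyst,
    SpanNode("llm.call", 2100, 500, 0.018),
])

total_tokens, total_cost = rollup(supervisor)
print(total_tokens, round(total_cost, 3))  # 16590 0.112
```

The same recursion yields per-agent subtotals: calling rollup(analyst) gives the share of tokens and cost attributable to the analyst sub-agent.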

Multi-Agent Tracing

class MultiAgentTracer:
    """Tracing for multi-agent collaboration."""

    def trace_handoff(self, source_agent: str,
                       target_agent: str,
                       context: dict):
        """Trace a transfer of control between agents."""
        with tracer.start_as_current_span(
            "agent.handoff",
            attributes={
                "handoff.source": source_agent,
                "handoff.target": target_agent,
                "handoff.reason": context.get("reason", ""),
                # default=str keeps non-JSON values from raising during sizing
                "handoff.context_size": len(json.dumps(context, default=str)),
            }
        ):
            pass  # The handoff span only marks the transfer point

    def trace_parallel_agents(self, agent_tasks: list[tuple]):
        """Trace agents that run in parallel."""
        with tracer.start_as_current_span("agents.parallel") as parent:
            parent.set_attribute(
                "parallel.count", len(agent_tasks)
            )
            # Each parallel agent creates its own child span,
            # and every child span has this parallel span as its parent
            pass
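
If the agents in a handoff run in separate processes, the trace context has to travel with the handoff message so that the target agent's spans join the same trace. The W3C Trace Context format carries it in a traceparent header of the form 00-<trace id>-<parent span id>-<flags>. The sketch below illustrates that encoding with the standard library only; HandoffContext and the helper functions are hypothetical, and in a real deployment the OpenTelemetry propagators (inject and extract in opentelemetry.propagate) would normally handle this.

```python
import re
import secrets
from typing import NamedTuple, Optional


class HandoffContext(NamedTuple):
    """Minimal trace context passed from one agent to the next (hypothetical type)."""
    trace_id: str   # 32 lowercase hex characters
    span_id: str    # 16 lowercase hex characters
    sampled: bool


_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_context(sampled: bool = True) -> HandoffContext:
    """Start a new trace with random identifiers."""
    return HandoffContext(secrets.token_hex(16), secrets.token_hex(8), sampled)


def child_of(parent: HandoffContext) -> HandoffContext:
    """Same trace, new span id: what the target agent uses for its own span."""
    return HandoffContext(parent.trace_id, secrets.token_hex(8), parent.sampled)


def to_traceparent(ctx: HandoffContext) -> str:
    """Encode the context as a version-00 traceparent header value."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def from_traceparent(header: str) -> Optional[HandoffContext]:
    """Decode a traceparent value; return None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None  # all-zero identifiers are invalid
    return HandoffContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


# Supervisor side: attach the header to the handoff message
supervisor_ctx = new_context()
message = {"task": "analyze", "traceparent": to_traceparent(supervisor_ctx)}

# Analyst side: continue the same trace under a new span
received = from_traceparent(message["traceparent"])
analyst_ctx = child_of(received)
print(analyst_ctx.trace_id == supervisor_ctx.trace_id)
```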

3. Metrics

Core Agent Metric Definitions

from prometheus_client import Counter, Gauge, Histogram

# ---- Task-level metrics ----
task_total = Counter(
    "agent_tasks_total",
    "Total number of tasks",
    ["agent_name", "status"]  # status: success/failure/timeout
)

task_duration = Histogram(
    "agent_task_duration_seconds",
    "Task execution duration",
    ["agent_name"],
    buckets=[1, 5, 10, 30, 60, 120, 300, 600]
)

# ---- LLM call metrics ----
llm_calls = Counter(
    "agent_llm_calls_total",
    "Total LLM API calls",
    ["agent_name", "model", "status"]
)

llm_tokens = Counter(
    "agent_llm_tokens_total",
    "Total tokens consumed",
    ["agent_name", "model", "direction"]  # direction: input/output
)

llm_cost = Counter(
    "agent_llm_cost_usd_total",
    "Total LLM cost in USD",
    ["agent_name", "model"]
)

llm_latency = Histogram(
    "agent_llm_latency_seconds",
    "LLM API call latency",
    ["model"],
    buckets=[0.5, 1, 2, 5, 10, 30]
)

# ---- Tool call metrics ----
tool_calls = Counter(
    "agent_tool_calls_total",
    "Total tool calls",
    ["agent_name", "tool_name", "status"]
)

tool_latency = Histogram(
    "agent_tool_latency_seconds",
    "Tool call latency",
    ["tool_name"],
    buckets=[0.1, 0.5, 1, 5, 10, 30]
)

tool_fallback = Counter(
    "agent_tool_fallback_total",
    "Tool fallback triggered",
    ["tool_name", "fallback_tool"]
)

# ---- System health metrics ----
active_agents = Gauge(
    "agent_active_count",
    "Currently active agents",
    ["agent_name"]
)

memory_usage = Gauge(
    "agent_memory_usage_bytes",
    "Agent memory usage",
    ["agent_name", "memory_type"]  # buffer/vector/episodic
)

queue_depth = Gauge(
    "agent_task_queue_depth",
    "Pending tasks in queue",
    ["priority"]
)

Metrics Dashboard Layout

┌──────────────────────────────────────────────────────────────────────────┐
│                         Agent Platform Dashboard                         │
├──────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  [Live overview]                                                         │
│  ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │
│  │ Active agents │ │ Success rate  │ │  Avg latency  │ │  Cost today   │ │
│  │      12       │ │     94.2%     │ │     8.3s      │ │    $47.20     │ │
│  └───────────────┘ └───────────────┘ └───────────────┘ └───────────────┘ │
│                                                                          │
│  [Task trend - 24h]                                                      │
│  Success ████████████████████████████████░░░░  94.2%                     │
│  Failure ██░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░   4.1%                     │
│  Timeout █░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░   1.7%                     │
│                                                                          │
│  [LLM call share]            [Tool calls - Top 5]                        │
│  claude-opus-4-6  45%        file_read       234                         │
│  gpt-4o           30%        web_search      189                         │
│  gemini-2.5-pro   25%        code_execute    156                         │
│                              database_query   98                         │
│                              file_write       67                         │
│                                                                          │
│  [Cost by model]             [Error rate by tool]                        │
│  claude: $28.50              shell_exec  8.2%                            │
│  gpt-4o: $12.30              web_search  3.1%                            │
│  gemini: $6.40               db_query    1.5%                            │
│                                                                          │
└──────────────────────────────────────────────────────────────────────────┘

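4. Alert Rule Design

Alert Tiers

from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    """Minimal rule container assumed by the list below."""
    name: str
    condition: str  # PromQL expression
    severity: str
    message: str
    channels: list
    runbook: Optional[str] = None


ALERT_RULES = [
    # P0: handle immediately
    AlertRule(
        name="agent_task_success_rate_critical",
        # Failure ratio = failed tasks / all tasks, not the raw failure rate
        condition="sum(rate(agent_tasks_total{status='failure'}[5m])) "
                  "/ sum(rate(agent_tasks_total[5m])) > 0.2",
        severity="critical",
        message="Agent task failure rate exceeds 20%",
        channels=["pagerduty", "slack-oncall"],
        runbook="https://wiki/runbooks/agent-high-failure-rate"
    ),

    AlertRule(
        name="agent_security_violation",
        # agent_security_events_total is not among the metrics defined in
        # section 3 and has to be registered separately
        condition="increase(agent_security_events_total"
                  "{severity='critical'}[1m]) > 0",
        severity="critical",
        message="Agent security violation detected",
        channels=["pagerduty", "slack-security"],
    ),

    # P1: handle within 1 hour
    AlertRule(
        name="agent_cost_anomaly",
        condition="rate(agent_llm_cost_usd_total[1h]) "
                  "> 2 * avg_over_time("
                  "rate(agent_llm_cost_usd_total[1h])[7d:1h])",
        severity="warning",
        message="Agent LLM cost anomaly: current rate is more than "
                "2x the 7-day average",
        channels=["slack-ops"],
    ),

    AlertRule(
        name="agent_latency_high",
        # histogram_quantile operates on the rate of the _bucket series
        condition="histogram_quantile(0.99, sum by (le) ("
                  "rate(agent_task_duration_seconds_bucket[5m]))) > 120",
        severity="warning",
        message="Agent task P99 latency exceeds 2 minutes",
        channels=["slack-ops"],
    ),

    # P2: handle on the next business day
    AlertRule(
        name="agent_tool_fallback_rate",
        # sum() on both sides: the two counters carry different label sets
        condition="sum(rate(agent_tool_fallback_total[1h])) "
                  "/ sum(rate(agent_tool_calls_total[1h])) > 0.1",
        severity="info",
        message="Tool fallback rate exceeds 10%",
        channels=["slack-dev"],
    ),
]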
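
A "failure rate above 20%" condition describes a ratio: the increase in failed tasks divided by the increase in all tasks over the same window. The sketch below works through that arithmetic on two counter snapshots; the function name and the sample counter values are hypothetical and only illustrate what the alert expression is meant to compute.

```python
from typing import Mapping


def failure_ratio(before: Mapping[str, int], after: Mapping[str, int]) -> float:
    """Share of tasks that failed between two snapshots of per-status counters."""
    increases = {
        status: after[status] - before.get(status, 0) for status in after
    }
    total = sum(increases.values())
    return increases.get("failure", 0) / total if total else 0.0


# Hypothetical values of agent_tasks_total by status, five minutes apart
before = {"success": 900, "failure": 40, "timeout": 10}
after = {"success": 980, "failure": 70, "timeout": 15}

ratio = failure_ratio(before, after)   # 30 failures out of 115 new tasks
should_page = ratio > 0.2
print(f"{ratio:.1%} failed, page on-call: {should_page}")
```

Comparing only the failure counter's per-second rate against 0.2 would instead fire at 0.2 failures per second regardless of traffic volume, which is why the ratio form matters.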

5. Cost Observability

Token Consumption Tracking

from datetime import date


class CostTracker:
    """LLM cost tracking."""

    # Model pricing (USD per million tokens)
    PRICING = {
        "claude-opus-4-6": {"input": 15.0, "output": 75.0},
        "claude-sonnet-4": {"input": 3.0, "output": 15.0},
        "gpt-4o": {"input": 2.5, "output": 10.0},
        "gemini-2.5-pro": {"input": 1.25, "output": 10.0},
    }

    def calculate_cost(self, model: str,
                       input_tokens: int,
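                       output_tokens: int) -> float:
        """Compute the cost of a single call."""
        # Unknown models fall back to a placeholder price, so their cost is only an estimate
        pricing = self.PRICING.get(model, {"input": 5.0, "output": 15.0})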
        cost = (
            input_tokens * pricing["input"] / 1_000_000 +
            output_tokens * pricing["output"] / 1_000_000
        )
        return round(cost, 6)

    def get_daily_report(self) -> dict:
        """Generate the daily cost report."""
        return {
            "date": date.today().isoformat(),
            "total_cost_usd": self._query_total_cost(),
            "by_model": self._query_cost_by_model(),
            "by_agent": self._query_cost_by_agent(),
            "by_task_type": self._query_cost_by_task_type(),
            "token_efficiency": self._calc_efficiency(),
            "budget_remaining": self._budget_remaining(),
            "forecast_eom": self._forecast_end_of_month(),
        }
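
As a worked example of the per-call formula, the sketch below prices one hypothetical call using two rows copied from the PRICING table. Unlike calculate_cost above, this variant raises an error for an unknown model instead of falling back to a default price; that is a design choice made for this illustration, not a behavior taken from the original class.

```python
# USD per million tokens, copied from the PRICING table above
PRICING = {
    "claude-sonnet-4": {"input": 3.0, "output": 15.0},
    "gpt-4o": {"input": 2.5, "output": 10.0},
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one call; raises KeyError for a model without a price."""
    price = PRICING[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return round(total / 1_000_000, 6)


# Hypothetical call: 12,000 prompt tokens and 800 completion tokens
# 12,000 * 3.0 / 1e6 = 0.036 and 800 * 15.0 / 1e6 = 0.012
cost = call_cost("claude-sonnet-4", 12_000, 800)
print(cost)  # 0.048
```

Failing loudly on an unpriced model keeps a newly introduced model from silently distorting the daily report; a fallback price is more forgiving but should at least be logged.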

Sample Cost Report

Daily Cost Report - 2026-02-28
================================

Total: $47.20 / Budget: $100.00 (47.2%)
Forecast EOM: $1,378 / Monthly Budget: $3,000 (46.0%)

By Model:
  claude-opus-4-6    $28.50  (60.4%)  ██████████████████░░░░
  gpt-4o             $12.30  (26.1%)  ████████░░░░░░░░░░░░░
  gemini-2.5-pro     $6.40   (13.6%)  ████░░░░░░░░░░░░░░░░░

By Agent:
  code-reviewer       $18.20  384 tasks  $0.047/task
  research-agent      $15.30  156 tasks  $0.098/task
  data-analyst        $8.70   89 tasks   $0.098/task
  support-agent       $5.00   412 tasks  $0.012/task

Token Efficiency:
  Useful output / Total tokens: 34.2%
  Cache hit rate: 67.8%

6. Observability Infrastructure

Recommended Stack

Collection layer:
  OpenTelemetry SDK ──→ OpenTelemetry Collector (OTLP) ──→ storage / analysis

Storage layer:
  Traces  ──→ Jaeger / Tempo / Datadog
  Metrics ──→ Prometheus / VictoriaMetrics
  Logs    ──→ Loki / Elasticsearch

Presentation layer:
  Grafana (unified dashboards)
  Custom agent debugging UI

Alerting layer:
  Alertmanager / PagerDuty / Slack Webhook

Sample Deployment Configuration

# docker-compose.observability.yml
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.96.0
    ports:
      - "4317:4317"   # gRPC
      - "4318:4318"   # HTTP
    volumes:
      - ./otel-config.yaml:/etc/otelcol-contrib/config.yaml

  prometheus:
    image: prom/prometheus:v2.51.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana:10.4.0
    ports:
      - "3000:3000"
    volumes:
      - ./grafana/dashboards:/var/lib/grafana/dashboards

  jaeger:
    image: jaegertracing/all-in-one:1.55
    ports:
      - "16686:16686"  # UI
      - "14268:14268"  # Collector

  loki:
    image: grafana/loki:2.9.5
    ports:
      - "3100:3100"

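Engineering Practice Recommendations

  1. Start with traces: traces are the most valuable observability data because they connect the entire execution path.
  2. Cost observability is mandatory: an agent system's spend can spiral out of control within minutes, so real-time monitoring is required.
  3. Log redaction is a hard requirement: agent logs may contain user data and API keys.
  4. Sampling strategy: use tail-based sampling in production and capture 100% of errors and slow requests.
  5. Preserve the reasoning path: an agent's thinking/reasoning is key debugging information; even when it is compressed, keep a summary.
  6. Prevent alert fatigue: tier the alert rules, and batch low-priority alerts into aggregated notifications.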
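
The tail-based sampling recommendation can be sketched as a decision made once a trace has finished: keep every trace that contains an error or ran slowly, and keep a fixed fraction of the rest. The thresholds, the FinishedTrace type, and the keep_trace function below are assumptions for illustration; in production this decision is usually delegated to a component such as the OpenTelemetry Collector's tail sampling processor rather than written by hand.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class FinishedTrace:
    """Summary of a completed trace, available only after its last span ends."""
    trace_id: str
    duration_s: float
    has_error: bool


def keep_trace(trace: FinishedTrace,
               slow_threshold_s: float = 30.0,
               baseline_rate: float = 0.1) -> bool:
    """Keep all error and slow traces; sample the remainder at baseline_rate."""
    if trace.has_error or trace.duration_s >= slow_threshold_s:
        return True
    # Hash the trace id so every collector instance makes the same decision
    digest = hashlib.sha256(trace.trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < baseline_rate


traces = [
    FinishedTrace("tr_error", duration_s=2.0, has_error=True),
    FinishedTrace("tr_slow", duration_s=45.0, has_error=False),
    FinishedTrace("tr_normal", duration_s=3.0, has_error=False),
]
for t in traces:
    print(t.trace_id, keep_trace(t))
```

Deciding after the trace completes is what distinguishes tail-based from head-based sampling: the error and latency of the whole run are known, at the cost of buffering spans until the trace ends.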

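References

  • OpenTelemetry official documentation: the industry standard for observability instrumentation, including agent systems
  • LangSmith / LangFuse: observability platforms built for LLM applications
  • Arize Phoenix: open-source LLM observability tool
  • Braintrust: combined agent evaluation and observability platform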
  • Datadog LLM Observability

Maurice | maurice_wen@proton.me