AI Product Metrics Dashboard Design

From DAU to Cost-per-Query: Building Data Observability for AI Products


Why AI Products Need Their Own Metrics System

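The core metrics of a traditional SaaS product are DAU, retention, and conversion. An AI product must track all of these plus two dimensions of its own: model quality and inference cost. An AI product whose DAU grows 50% while its inference cost grows 200% may be heading toward death, because its cost per active user has doubled.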

This article covers metric design, dashboard layout, alert thresholds, the data pipeline, and a practical rollout with Grafana and Metabase.


1. A Layered Metrics Framework for AI Products

1.1 The Four-Layer Metrics Model

┌──────────────────────────────────────────────────────────────┐
│  Layer 1: Business Metrics                                   │
│  DAU/MAU, Revenue, Conversion, Churn                         │
│  -> Answers: does the product have business value?           │
├──────────────────────────────────────────────────────────────┤
│  Layer 2: Product Metrics                                    │
│  Session Duration, Feature Usage, Task Success               │
│  -> Answers: what do users use, and does it work well?       │
├──────────────────────────────────────────────────────────────┤
│  Layer 3: AI Quality Metrics                                 │
│  Accuracy, Latency, Hallucination Rate, CSAT                 │
│  -> Answers: is the AI good enough? Improving or degrading?  │
├──────────────────────────────────────────────────────────────┤
│  Layer 4: Infrastructure Metrics                             │
│  Cost/Query, GPU Util, Error Rate, Throughput                │
│  -> Answers: is the system healthy? Is the spend worth it?   │
└──────────────────────────────────────────────────────────────┘

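1.2 Core Metrics Matrix

Metric                   Layer  Collection method                     Refresh    Healthy threshold
-----------------------  -----  ------------------------------------  ---------  --------------------
DAU/MAU                  L1     Event tracking                        Real-time  DAU/MAU > 25%
Paid conversion rate     L1     Payment events                        -          > 3%
Monthly churn rate       L1     Subscription status                   -          < 5%
Session completion rate  L2     Event tracking                        Real-time  > 80%
Feature adoption rate    L2     Event tracking                        -          Top 3 features > 60%
AI accuracy              L3     Human review + automated evaluation   -          > 90%
Average latency          L3     APM                                   Real-time  P95 < 5s
Hallucination rate       L3     Automated detection + human sampling  -          < 3%
CSAT                     L3     User feedback                         -          > 4.0/5.0
Cost/Query               L4     Billing API                           Real-time  < ¥0.10
GPU utilization          L4     Monitoring agent                      Real-time  60-85%
Error rate               L4     Log aggregation                       Real-time  < 0.5%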
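
The thresholds in the matrix can be encoded as data and checked mechanically. The following is a minimal sketch, not a reference implementation: the metric keys (`dau_mau_ratio`, `latency_p95_s`, and so on) and the `snapshot` dict are hypothetical names chosen for illustration, and the "Top 3 features" adoption rule is left out because it needs per-feature data.

```python
import operator

# Healthy thresholds copied from the matrix above. The metric keys are
# hypothetical names chosen for this sketch.
# Each entry: (layer, comparison, bound). GPU utilization is a range.
THRESHOLDS = {
    "dau_mau_ratio":           ("L1", operator.gt, 0.25),
    "paid_conversion_rate":    ("L1", operator.gt, 0.03),
    "monthly_churn_rate":      ("L1", operator.lt, 0.05),
    "session_completion_rate": ("L2", operator.gt, 0.80),
    "ai_accuracy":             ("L3", operator.gt, 0.90),
    "latency_p95_s":           ("L3", operator.lt, 5.0),
    "hallucination_rate":      ("L3", operator.lt, 0.03),
    "csat":                    ("L3", operator.gt, 4.0),
    "cost_per_query_yuan":     ("L4", operator.lt, 0.10),
    "error_rate":              ("L4", operator.lt, 0.005),
}
GPU_UTIL_RANGE = (0.60, 0.85)


def unhealthy_metrics(snapshot: dict[str, float]) -> list[str]:
    """Return the names of metrics in `snapshot` that miss their threshold."""
    failing = []
    for name, value in snapshot.items():
        if name == "gpu_utilization":
            low, high = GPU_UTIL_RANGE
            healthy = low <= value <= high
        elif name in THRESHOLDS:
            _layer, compare, bound = THRESHOLDS[name]
            healthy = compare(value, bound)
        else:
            continue  # unknown metric: nothing to check against
        if not healthy:
            failing.append(name)
    return failing


# Hypothetical snapshot of current values.
snapshot = {"ai_accuracy": 0.885, "latency_p95_s": 3.8,
            "hallucination_rate": 0.032, "gpu_utilization": 0.72}
print(unhealthy_metrics(snapshot))
```

Keeping the thresholds in one table like this makes it easy to review them alongside the matrix and to reuse them for both dashboard coloring and alerting.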

2. Dashboard Layout Design

2.1 Executive Dashboard (Leadership View)

One screen shows the 6-8 most important metrics, so the whole picture can be read in 30 seconds:

┌─────────────────────────────────────────────────────────┐
│  AI Product Executive Dashboard           2026-02-28    │
├─────────────┬───────────────┬───────────────────────────┤
│  DAU         │  Revenue       │  AI Quality Score        │
│  12,847      │  ¥485,200      │  ████████░░  82/100     │
│  +12% WoW   │  +8% MoM       │  +3 pts MoM             │
├─────────────┼───────────────┼───────────────────────────┤
│  Retention   │  Cost/Query    │  CSAT                    │
│  D7: 45%     │  ¥0.067        │  4.2 / 5.0              │
│  D30: 28%    │  -15% MoM      │  +0.1 MoM               │
├─────────────┴───────────────┴───────────────────────────┤
│  [7-Day Trend: DAU]   ▁▂▃▃▅▆█                          │
│  [7-Day Trend: Rev]   ▃▃▄▅▅▆▇                          │
│  [7-Day Trend: CSAT]  ▅▅▆▆▆▇▇                          │
├─────────────────────────────────────────────────────────┤
│  Active Alerts: 1 WARNING (P95 latency > 4s)           │
└─────────────────────────────────────────────────────────┘

2.2 Operations Dashboard (Operations View)

This view focuses on user behavior and product usage:

┌─────────────────────────────────────────────────────────┐
│  Operations Dashboard                                    │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  [User Funnel]                                           │
│  Visit -> Signup -> Activate -> Retain -> Pay            │
│  100%  -> 22%    -> 68%      -> 45%    -> 8%            │
│                                                          │
│  [Feature Usage Heatmap]                                 │
│  Chat:           ████████████████  82%                   │
│  Doc Analysis:   ██████████░░░░░░  55%                   │
│  Report Gen:     ████████░░░░░░░░  42%                   │
│  API Access:     ████░░░░░░░░░░░░  18%                   │
│                                                          │
│  [Session Quality Distribution]                          │
│  Excellent (>0.8):  ████████░░  35%                      │
│  Good (0.5-0.8):    ██████████  45%                      │
│  Poor (<0.5):       ████░░░░░░  20%                      │
│                                                          │
│  [Top User Queries This Week]                            │
│  1. Invoice compliance check (2,847)                     │
│  2. Tax rate calculation (1,923)                         │
│  3. Report generation (1,456)                            │
│                                                          │
└─────────────────────────────────────────────────────────┘
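
The funnel percentages in the mock-up can only be step-to-step rates (68% follows 22%), so the share of visitors who end up paying is the product of all steps. A small sketch of that arithmetic, assuming the step-wise reading is what the mock-up intends:

```python
from itertools import accumulate
from operator import mul

# Step-to-step conversion rates read off the funnel mock-up above.
# Treating them as step-wise (not cumulative) is an assumption.
FUNNEL_STEPS = [
    ("Visit", 1.00),
    ("Signup", 0.22),
    ("Activate", 0.68),
    ("Retain", 0.45),
    ("Pay", 0.08),
]


def cumulative_conversion(steps):
    """Return (stage, share of original visitors reaching that stage)."""
    names = [name for name, _ in steps]
    rates = [rate for _, rate in steps]
    return list(zip(names, accumulate(rates, mul)))


for stage, share in cumulative_conversion(FUNNEL_STEPS):
    print(f"{stage:<9} {share:.2%}")
```

Under that reading roughly 0.54% of visitors become paying users, which is a more useful headline number than the 8% on the last step alone.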

2.3 AI Quality Dashboard (Model Quality View)

This dashboard is unique to AI products:

┌─────────────────────────────────────────────────────────┐
│  AI Quality Dashboard                                    │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  [Model Performance by Category]                         │
│  Category          Accuracy  Latency  Hallucination     │
│  Tax Classification  94.2%    1.2s     1.8%             │
│  Invoice Parsing     91.7%    2.3s     2.5%             │
│  Compliance Check    88.5%    3.8s     3.2%             │
│  Report Generation   86.3%    5.1s     4.1%             │
│                                                          │
│  [Quality Trend (30 Days)]                               │
│  Accuracy:   ▁▂▂▃▃▃▄▄▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇▇█████           │
│  Latency:    █▇▇▆▆▆▅▅▅▅▄▄▄▄▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▁           │
│                                                          │
│  [User Feedback Distribution]                            │
│  Thumbs Up:    ████████████  72%                         │
│  Thumbs Down:  ████░░░░░░░░  15%                         │
│  Regenerated:  ███░░░░░░░░░  13%                         │
│                                                          │
│  [Hallucination Detection]                               │
│  Auto-detected:   45 / day                               │
│  User-reported:   12 / day                               │
│  False positive:  8%                                     │
│                                                          │
└─────────────────────────────────────────────────────────┘

3. Alert Threshold Design

3.1 Tiered Alerting Strategy

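Level        Condition                                 Notification            Response time
-----------  ----------------------------------------  ----------------------  -------------
P0 Critical  Service fully unavailable / data breach   Phone + SMS + DingTalk  5 minutes
P1 High      Accuracy drops > 10% / error rate > 5%    SMS + DingTalk          15 minutes
P2 Medium    P95 latency > 8s / cost anomaly > 50%     DingTalk + email        1 hour
P3 Low       Minor metric deviation / trend warning    Email + daily report    24 hours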
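
The tiers above amount to a lookup from severity to channels and a response deadline. A minimal sketch of such a routing table follows; the channel identifiers (`"phone"`, `"dingtalk"`, ...) are hypothetical labels, and actual delivery to each channel is out of scope.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertPolicy:
    channels: tuple[str, ...]
    response_minutes: int


# Channels and response times mirror the table above; the channel
# identifiers are hypothetical labels for this sketch.
POLICIES = {
    "P0": AlertPolicy(("phone", "sms", "dingtalk"), 5),
    "P1": AlertPolicy(("sms", "dingtalk"), 15),
    "P2": AlertPolicy(("dingtalk", "email"), 60),
    "P3": AlertPolicy(("email", "daily_report"), 24 * 60),
}


def route_alert(severity: str, message: str) -> list[str]:
    """Return one notification line per channel for the given severity."""
    policy = POLICIES[severity]
    return [
        f"[{severity}] via {channel} "
        f"(respond within {policy.response_minutes} min): {message}"
        for channel in policy.channels
    ]


for line in route_alert("P1", "AI accuracy dropped 12% in last 24h"):
    print(line)
```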

3.2 AI-Specific Alert Rules

# alerting-rules.yaml
alerts:
  - name: accuracy_drop
    metric: ai.accuracy.rolling_24h
    condition: decrease > 5% compared to 7-day avg
    severity: P1
    message: "AI accuracy dropped {value}% in last 24h"

  - name: hallucination_spike
    metric: ai.hallucination.rate.1h
    condition: value > 5%
    severity: P1
    message: "Hallucination rate spiked to {value}%"

  - name: cost_anomaly
    metric: infra.cost_per_query.1h
    condition: value > 2x of 7-day avg
    severity: P2
    message: "Cost per query anomaly: {value} (avg: {avg})"

  - name: latency_degradation
    metric: ai.latency.p95.5m
    condition: value > 8000  # milliseconds
    severity: P2
    message: "P95 latency: {value}ms"

  - name: feedback_negative_surge
    metric: ai.feedback.negative_rate.1h
    condition: value > 25%
    severity: P2
    message: "Negative feedback rate: {value}%"

  - name: model_drift
    metric: ai.distribution.kl_divergence.daily
    condition: value > 0.15
    severity: P3
    message: "Model input distribution drift detected: KL={value}"
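
The `condition` strings above are written informally, so an evaluator has to pick an interpretation. The sketch below is one possible reading, under these assumptions: "decrease > 5%" is a relative drop against the 7-day average, the drift score is KL(current || reference) in nats over binned inputs, and all snapshot keys are hypothetical names.

```python
import math


def relative_drop(current: float, baseline: float) -> float:
    """Fractional decrease of `current` relative to `baseline` (0.05 == 5%)."""
    return (baseline - current) / baseline


def kl_divergence(p: list[float], q: list[float], eps: float = 1e-9) -> float:
    """KL(P || Q) in nats for two discrete distributions over the same bins.

    `eps` smooths empty bins so the logarithm stays defined.
    """
    p_total = sum(p) + eps * len(p)
    q_total = sum(q) + eps * len(q)
    total = 0.0
    for p_i, q_i in zip(p, q):
        p_n = (p_i + eps) / p_total
        q_n = (q_i + eps) / q_total
        total += p_n * math.log(p_n / q_n)
    return total


def fired_alerts(m: dict[str, float]) -> list[str]:
    """Apply four of the rules above to one snapshot of metric values."""
    fired = []
    if relative_drop(m["accuracy_24h"], m["accuracy_7d_avg"]) > 0.05:
        fired.append("accuracy_drop")
    if m["hallucination_rate_1h"] > 0.05:
        fired.append("hallucination_spike")
    if m["cost_per_query_1h"] > 2 * m["cost_per_query_7d_avg"]:
        fired.append("cost_anomaly")
    if m["input_kl_daily"] > 0.15:
        fired.append("model_drift")
    return fired


baseline_mix = [0.5, 0.3, 0.2]  # hypothetical share of queries per feature, last week
today_mix = [0.2, 0.3, 0.5]     # hypothetical share of queries per feature, today

snapshot = {
    "accuracy_24h": 0.84, "accuracy_7d_avg": 0.90,
    "hallucination_rate_1h": 0.02,
    "cost_per_query_1h": 0.15, "cost_per_query_7d_avg": 0.067,
    "input_kl_daily": kl_divergence(today_mix, baseline_mix),
}
print(fired_alerts(snapshot))
```

Whichever interpretation is chosen, it is worth writing it down next to the rule, since a 5% relative drop and a 5-point absolute drop fire at different times.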

4. Data Pipeline Architecture

4.1 End-to-End Data Flow

┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  Client      │    │  API         │    │  Stream      │    │  Storage     │
│  SDK         │───>│  Gateway     │───>│  Kafka       │───>│  ClickHouse  │
│              │    │              │    │              │    │              │
│  Events:     │    │  Enrich:     │    │  Topics:     │    │  Tables:     │
│  - click     │    │  - user      │    │  - events    │    │  - events    │
│  - query     │    │  - geo       │    │  - metrics   │    │  - metrics   │
│  - feedback  │    │  - device    │    │  - logs      │    │  - agg       │
│  - timing    │    │  - session   │    │              │    │              │
└──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘
                                                                   │
                                                                   ▼
                                                            ┌──────────────┐
                                                            │  Dashboard   │
                                                            │  Grafana /   │
                                                            │  Metabase    │
                                                            └──────────────┘
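
The gateway's enrichment step in the diagram can be sketched as a pure function that copies the client event and attaches request context before the event is published. Everything here is illustrative: the `request_context` shape and the mapping from event type to topic are assumptions, and no Kafka client is involved.

```python
from datetime import datetime, timezone
import uuid

# Hypothetical routing of event types to the three Kafka topics in the diagram.
TOPIC_BY_EVENT_TYPE = {
    "click": "events",
    "query": "events",
    "feedback": "events",
    "timing": "metrics",
    "error": "logs",
}


def enrich(raw_event: dict, request_context: dict) -> dict:
    """Gateway step: attach user, geo, device and session fields to a client event."""
    enriched = dict(raw_event)  # copy so the caller's dict is untouched
    enriched.setdefault("event_id", str(uuid.uuid4()))
    enriched.setdefault("timestamp", datetime.now(timezone.utc).isoformat())
    for field in ("user_id", "geo", "device", "session_id"):
        enriched[field] = request_context.get(field)
    return enriched


def topic_for(event: dict) -> str:
    """Pick the Kafka topic; unknown event types fall back to "events"."""
    return TOPIC_BY_EVENT_TYPE.get(event.get("event_type"), "events")


event = enrich(
    {"event_type": "query", "feature": "chat"},
    {"user_id": "u-1", "geo": "CN", "device": "ios", "session_id": "s-9"},
)
print(topic_for(event), event["user_id"], event["geo"])
```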

4.2 Event Tracking Schema

interface AIEvent {
  // Standard fields
  event_id: string;           // UUID
  timestamp: string;          // ISO 8601
  user_id: string;
  session_id: string;
  event_type: string;         // "query" | "feedback" | "action" | "error"

  // AI-specific fields
  model_id: string;           // "gpt-4" | "claude-3" | "custom-v2"
  prompt_tokens: number;
  completion_tokens: number;
  latency_ms: number;
  cost_cents: number;         // Cost in minor currency units; the queries in this article treat it as RMB fen (1/100 yuan)

  // Quality fields
  confidence_score: number;   // 0.0 - 1.0
  hallucination_detected: boolean;
  user_feedback: "positive" | "negative" | "neutral" | null;
  regeneration_count: number;

  // Context
  feature: string;            // "chat" | "doc_analysis" | "report"
  input_type: string;         // "text" | "file" | "image"
  output_type: string;        // "text" | "table" | "chart"

  // Metadata
  metadata: Record<string, unknown>;
}
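
A schema like this is only useful if events are checked against it before they reach storage. Below is a hedged Python sketch of such a check for dict-shaped events. The specific rules (which fields are required, treating missing numeric fields as 0) are assumptions for illustration, not part of the schema itself.

```python
ALLOWED_EVENT_TYPES = {"query", "feedback", "action", "error"}
ALLOWED_FEEDBACK = {"positive", "negative", "neutral", None}
NON_NEGATIVE_FIELDS = ("prompt_tokens", "completion_tokens", "latency_ms",
                       "cost_cents", "regeneration_count")


def validate_event(event: dict) -> list[str]:
    """Return a list of problems found in one AIEvent-shaped dict (empty if valid)."""
    problems = []
    for field in ("event_id", "timestamp", "user_id", "session_id", "model_id"):
        if not event.get(field):
            problems.append(f"missing {field}")
    if event.get("event_type") not in ALLOWED_EVENT_TYPES:
        problems.append("unknown event_type")
    for field in NON_NEGATIVE_FIELDS:
        value = event.get(field, 0)  # assumption: a missing counter counts as 0
        if not isinstance(value, (int, float)) or value < 0:
            problems.append(f"{field} must be a non-negative number")
    confidence = event.get("confidence_score", 0.0)
    if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
        problems.append("confidence_score must be within [0.0, 1.0]")
    if event.get("user_feedback") not in ALLOWED_FEEDBACK:
        problems.append("unknown user_feedback")
    return problems


sample = {
    "event_id": "e-1", "timestamp": "2026-02-28T10:00:00Z",
    "user_id": "u-1", "session_id": "s-1", "event_type": "query",
    "model_id": "custom-v2", "prompt_tokens": 120, "completion_tokens": 80,
    "latency_ms": 1200, "cost_cents": 6, "confidence_score": 0.92,
    "hallucination_detected": False, "user_feedback": None,
    "regeneration_count": 0,
}
print(validate_event(sample))
```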

4.3 Example Aggregation Query

-- Daily AI quality metrics
SELECT
    toDate(timestamp) AS date,
    model_id,
    feature,
    count() AS total_queries,
    avg(latency_ms) AS avg_latency,
    quantile(0.95)(latency_ms) AS p95_latency,
    avg(confidence_score) AS avg_confidence,
    countIf(hallucination_detected) / count() AS hallucination_rate,
    countIf(user_feedback = 'positive') /
        nullIf(countIf(user_feedback IS NOT NULL), 0) AS positive_rate,
    sum(cost_cents) / 100.0 AS total_cost_yuan,
    sum(cost_cents) / count() / 100.0 AS cost_per_query_yuan
FROM ai_events
WHERE event_type = 'query'
  AND timestamp >= today() - INTERVAL 30 DAY
GROUP BY date, model_id, feature
ORDER BY date DESC, total_queries DESC;
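
To make the meaning of each column concrete, the same aggregation can be reproduced over a small in-memory list of events. This is only an illustration of the query's logic, not a replacement for running it in ClickHouse. It assumes ISO 8601 timestamps whose first ten characters are the date, and it omits the P95 and confidence columns for brevity.

```python
from collections import defaultdict


def daily_quality(events: list[dict]) -> dict[tuple, dict]:
    """Aggregate query events per (date, model_id, feature), like the SQL above."""
    groups = defaultdict(list)
    for e in events:
        if e["event_type"] == "query":
            key = (e["timestamp"][:10], e["model_id"], e["feature"])
            groups[key].append(e)

    result = {}
    for key, rows in groups.items():
        n = len(rows)
        rated = [r for r in rows if r["user_feedback"] is not None]
        positive = sum(r["user_feedback"] == "positive" for r in rated)
        cost_cents = sum(r["cost_cents"] for r in rows)
        result[key] = {
            "total_queries": n,
            "avg_latency": sum(r["latency_ms"] for r in rows) / n,
            "hallucination_rate": sum(r["hallucination_detected"] for r in rows) / n,
            # Mirrors nullIf(..., 0): no rated queries gives None, not a division error.
            "positive_rate": positive / len(rated) if rated else None,
            "total_cost_yuan": cost_cents / 100.0,
            "cost_per_query_yuan": cost_cents / n / 100.0,
        }
    return result


# Hypothetical sample data.
events = [
    {"event_type": "query", "timestamp": "2026-02-28T10:00:00Z",
     "model_id": "custom-v2", "feature": "chat", "latency_ms": 1000,
     "hallucination_detected": False, "user_feedback": "positive", "cost_cents": 6},
    {"event_type": "query", "timestamp": "2026-02-28T11:00:00Z",
     "model_id": "custom-v2", "feature": "chat", "latency_ms": 3000,
     "hallucination_detected": True, "user_feedback": None, "cost_cents": 8},
    {"event_type": "feedback", "timestamp": "2026-02-28T11:05:00Z"},
]
print(daily_quality(events))
```

Note that `positive_rate` is computed over rated queries only, so a day with little feedback can show a high rate on a very small sample.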

5. Grafana in Practice

5.1 Dashboard Organization

Grafana Folder Structure:
  AI Product/
    ├── Executive Overview          # leadership dashboard
    ├── User & Product Metrics      # user and product metrics
    ├── AI Quality Monitoring       # AI quality monitoring
    ├── Cost & Infrastructure       # cost and infrastructure
    └── Alerts & Incidents          # alerts and incidents

5.2 Key Panel Configuration

{
  "dashboard": {
    "title": "AI Quality Monitoring",
    "panels": [
      {
        "title": "Accuracy by Feature (7-Day Rolling)",
        "type": "timeseries",
        "datasource": "ClickHouse",
        "targets": [{
          "rawSql": "SELECT toStartOfHour(timestamp) AS time, feature, avg(confidence_score) AS accuracy FROM ai_events WHERE timestamp >= now() - INTERVAL 7 DAY GROUP BY time, feature ORDER BY time"
        }],
        "fieldConfig": {
          "defaults": {
            "min": 0.7,
            "max": 1.0,
            "thresholds": {
              "steps": [
                { "value": 0.85, "color": "red" },
                { "value": 0.90, "color": "yellow" },
                { "value": 0.95, "color": "green" }
              ]
            }
          }
        }
      },
      {
        "title": "Cost per Query (Hourly)",
        "type": "stat",
        "datasource": "ClickHouse",
        "targets": [{
          "rawSql": "SELECT sum(cost_cents)/count()/100.0 AS cost FROM ai_events WHERE event_type='query' AND timestamp >= now() - INTERVAL 1 HOUR"
        }]
      }
    ]
  }
}
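
When the same panel is needed once per feature, generating the JSON is less error-prone than copying it by hand. The sketch below builds panels with the same abbreviated keys as the configuration above. It is not a complete Grafana dashboard model, and the helper names and the per-feature breakdown are assumptions made for illustration.

```python
import json

FEATURES = ["chat", "doc_analysis", "report"]  # values listed in the event schema


def stat_panel(title: str, sql: str, datasource: str = "ClickHouse") -> dict:
    """Build a panel dict with the same keys as the hand-written JSON above."""
    return {
        "title": title,
        "type": "stat",
        "datasource": datasource,
        "targets": [{"rawSql": sql}],
    }


def cost_per_query_sql(feature: str) -> str:
    """Hourly cost-per-query query for a single feature."""
    if feature not in FEATURES:
        # Only interpolate values from the fixed allow-list into SQL.
        raise ValueError(f"unknown feature: {feature}")
    return (
        "SELECT sum(cost_cents)/count()/100.0 AS cost FROM ai_events "
        f"WHERE event_type='query' AND feature='{feature}' "
        "AND timestamp >= now() - INTERVAL 1 HOUR"
    )


dashboard = {
    "dashboard": {
        "title": "Cost per Query by Feature",
        "panels": [
            stat_panel(f"Cost per Query: {f} (Hourly)", cost_per_query_sql(f))
            for f in FEATURES
        ],
    }
}
print(json.dumps(dashboard, indent=2))
```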

6. Metabase Setup for Business Analysis

6.1 When to Use Which Tool

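Dimension             Grafana                          Metabase
--------------------  -------------------------------  ------------------------------------------
Positioning           Real-time monitoring + alerting  Business analysis + self-service queries
Users                 Engineers / SRE                  Product managers / operations / management
Data refresh          Seconds                          Minutes
Visualization         Mostly time series               Tables / funnels / maps
Alerting              Native support                   Limited support
Self-service queries  Requires SQL                     Drag-and-drop visual builder
Recommended use       L3/L4 metrics                    L1/L2 metrics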

6.2 Core Metabase Question Configuration

Saved Questions:
  1. "Daily Active Users Trend"
     - Table: user_sessions
     - Group by: date, user_type
     - Visualization: Line chart

  2. "Feature Usage Breakdown"
     - Table: ai_events
     - Filter: event_type = 'query'
     - Group by: feature
     - Visualization: Bar chart

  3. "Conversion Funnel"
     - Custom SQL with CTE
     - Steps: Visit -> Signup -> First Query -> 10th Query -> Paid
     - Visualization: Funnel

  4. "Cost Analysis by Model"
     - Table: ai_events
     - Group by: model_id, week
     - Metrics: total_cost, avg_cost_per_query, total_queries
     - Visualization: Pivot table
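
As an illustration of what question 4 computes, the grouping by model and week can be written out in plain Python. This is a sketch of the calculation only; in Metabase it would be configured in the query builder or as SQL. ISO week labels and the sample events are assumptions.

```python
from collections import defaultdict
from datetime import date


def cost_by_model_and_week(events: list[dict]) -> dict[tuple[str, str], dict]:
    """Group query events by (model_id, ISO week) and compute cost metrics."""
    buckets = defaultdict(lambda: {"total_queries": 0, "total_cost_cents": 0})
    for e in events:
        if e["event_type"] != "query":
            continue
        iso = date.fromisoformat(e["timestamp"][:10]).isocalendar()
        week = f"{iso[0]}-W{iso[1]:02d}"  # ISO year and ISO week number
        bucket = buckets[(e["model_id"], week)]
        bucket["total_queries"] += 1
        bucket["total_cost_cents"] += e["cost_cents"]

    report = {}
    for key, bucket in buckets.items():
        total_cost = bucket["total_cost_cents"] / 100.0
        report[key] = {
            "total_queries": bucket["total_queries"],
            "total_cost": total_cost,
            "avg_cost_per_query": total_cost / bucket["total_queries"],
        }
    return report


# Hypothetical sample data.
events = [
    {"event_type": "query", "timestamp": "2026-02-28T09:00:00Z",
     "model_id": "gpt-4", "cost_cents": 6},
    {"event_type": "query", "timestamp": "2026-02-28T10:00:00Z",
     "model_id": "gpt-4", "cost_cents": 8},
    {"event_type": "query", "timestamp": "2026-03-02T10:00:00Z",
     "model_id": "gpt-4", "cost_cents": 10},
]
for key, row in sorted(cost_by_model_and_week(events).items()):
    print(key, row)
```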

7. A Metrics-Driven Decision Framework

7.1 Common Decision Scenarios

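Scenario                Metrics to watch                         Decision criteria
----------------------  ---------------------------------------  ------------------------------------------------
Ship a new model?       Accuracy + Latency + Cost                Accuracy >= current, Latency <= 1.5x, Cost <= 2x
Promote a new feature?  Feature Usage + CSAT + Retention Impact  Day 7 retention improves by > 2%
Adjust pricing?         Conversion + Churn + Revenue             Revenue +15% AND Churn < +2%
Downgrade the model?    Cost/Query + Accuracy Drop               Cost falls > 30% AND Accuracy falls < 3%
Scale out?              GPU Util + P95 Latency + Error Rate      GPU > 80% or P95 > 5s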
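
Three of these decision rules translate directly into predicates. The sketch below is one reading of them: it assumes the accuracy drop in the downgrade rule is measured in percentage points and the cost saving as a relative change. The `ModelStats` type and the sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModelStats:
    accuracy: float        # fraction, e.g. 0.91
    latency_p95_s: float   # seconds
    cost_per_query: float  # yuan


def should_ship_new_model(current: ModelStats, candidate: ModelStats) -> bool:
    """New-model rule: accuracy >= current, latency <= 1.5x, cost <= 2x."""
    return (
        candidate.accuracy >= current.accuracy
        and candidate.latency_p95_s <= 1.5 * current.latency_p95_s
        and candidate.cost_per_query <= 2.0 * current.cost_per_query
    )


def should_downgrade_model(current: ModelStats, cheaper: ModelStats) -> bool:
    """Downgrade rule: cost falls by > 30% and accuracy falls by < 3 points."""
    cost_saving = 1 - cheaper.cost_per_query / current.cost_per_query
    accuracy_loss = current.accuracy - cheaper.accuracy
    return cost_saving > 0.30 and accuracy_loss < 0.03


def should_scale_out(gpu_util: float, latency_p95_s: float) -> bool:
    """Scaling rule: GPU utilization above 80% or P95 latency above 5 s."""
    return gpu_util > 0.80 or latency_p95_s > 5.0


current = ModelStats(accuracy=0.90, latency_p95_s=2.0, cost_per_query=0.067)
candidate = ModelStats(accuracy=0.92, latency_p95_s=2.8, cost_per_query=0.10)
cheaper = ModelStats(accuracy=0.88, latency_p95_s=1.5, cost_per_query=0.04)

print(should_ship_new_model(current, candidate))
print(should_downgrade_model(current, cheaper))
print(should_scale_out(gpu_util=0.72, latency_p95_s=5.1))
```

Writing the criteria as code forces the ambiguous parts (relative versus absolute change, which latency percentile) to be decided once and applied consistently.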

7.2 A/B Testing Framework

# AI-specific A/B test configuration
AB_TEST_CONFIG = {
    "model_comparison": {
        "control": "gpt-4-turbo",
        "treatment": "claude-3-opus",
        "metrics": {
            "primary": "user_satisfaction_score",
            "secondary": ["accuracy", "latency_p95", "cost_per_query"],
            "guardrail": ["hallucination_rate", "error_rate"]
        },
        "split": 50,  # 50/50 split
        "min_sample": 1000,  # queries per arm
        "duration_days": 14,
        "success_criteria": {
            "primary_lift": 0.05,     # 5% improvement
            "guardrail_max_increase": 0.01  # No more than 1% increase
        }
    }
}
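
A configuration like this still needs a decision procedure. The following sketch shows one way the success criteria could be applied to per-arm results, under stated assumptions: the primary lift is relative, the guardrail allowance is an absolute increase, and the result dict keys are hypothetical. It deliberately does not check `duration_days` or statistical significance, which a real evaluation would need.

```python
def evaluate_ab_test(control: dict, treatment: dict,
                     min_sample: int = 1000,
                     primary_lift: float = 0.05,
                     guardrail_max_increase: float = 0.01) -> str:
    """Return "ship", "reject" or "keep_running" for one model comparison.

    Defaults mirror `min_sample` and `success_criteria` in the config above.
    """
    if min(control["samples"], treatment["samples"]) < min_sample:
        return "keep_running"

    # Guardrails: treatment may not exceed control by more than the allowance.
    for metric in ("hallucination_rate", "error_rate"):
        if treatment[metric] - control[metric] > guardrail_max_increase:
            return "reject"

    # Primary metric: relative lift in user satisfaction.
    lift = (treatment["user_satisfaction_score"]
            / control["user_satisfaction_score"] - 1)
    return "ship" if lift >= primary_lift else "reject"


# Hypothetical per-arm results.
control = {"samples": 1200, "user_satisfaction_score": 4.0,
           "hallucination_rate": 0.025, "error_rate": 0.004}
treatment = {"samples": 1150, "user_satisfaction_score": 4.3,
             "hallucination_rate": 0.028, "error_rate": 0.003}
print(evaluate_ab_test(control, treatment))
```

Checking guardrails before the primary metric means a treatment that users like but that hallucinates noticeably more is rejected rather than shipped.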

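Summary

Core principles of an AI product metrics system:

  1. Four distinct layers: business, product, AI quality, and infrastructure each do their own job
  2. AI-specific metrics are non-negotiable: accuracy, hallucination rate, and Cost/Query are the lifeline of an AI product
  3. Two complementary systems: Grafana handles monitoring and alerting, Metabase handles business analysis
  4. Tiered alert response: P0 pages someone by phone, P3 goes into the daily report; there is no one-size-fits-all response
  5. Data-driven decisions: every decision scenario has a matching set of metrics and decision criteria

Metrics are not the goal; decisions are. The ultimate aim of building dashboards is to let the team see a problem within 30 seconds, locate the cause within 5 minutes, and drive a fix within 1 hour.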


Maurice | maurice_wen@proton.me