Prompt Versioning and A/B Testing

Managing prompts as code assets: version control, regression testing, and continuous optimization


Why Prompts Need Version Management

In an LLM application, prompts occupy the same position that "business logic code" occupies in traditional software: a tiny edit to a prompt can cause a huge swing in output quality. Yet most teams still manage their prompts at the "copy and paste" stage.

A maturity model for prompt management:

Level 0: Hard-coded
  Prompts live inline in application code, scattered everywhere, with no version control

Level 1: Extracted to files
  Prompts live in standalone files (.txt/.yaml) checked into Git

Level 2: Templated
  Prompts use variables and a template engine, enabling parameterization and reuse

Level 3: Versioned
  Each prompt carries its own version number, and every change leaves an audit record

Level 4: Data-driven
  A/B testing, automated evaluation, and continuous data-driven optimization

1. Version Control System

Prompt repository structure

prompts/
  registry.yaml              # Prompt registry (metadata)
  code_review/
    v1.0.yaml                # Version 1.0
    v1.1.yaml                # Version 1.1 (minor improvements)
    v2.0.yaml                # Version 2.0 (rewrite)
    tests/
      test_cases.yaml        # Test cases
      golden_outputs.yaml    # Golden reference outputs
    CHANGELOG.md             # Change history
  data_analysis/
    v1.0.yaml
    tests/
      test_cases.yaml
  partials/
    safety_rules.yaml        # Shared safety-rules module
    output_json.yaml         # Shared JSON output format

Prompt metadata

# prompts/registry.yaml
prompts:
  code_review:
    current_version: "2.0"
    description: "代码审查提示词"
    owner: "platform-team"
    model_compatibility:
      - "claude-opus-4-6"
      - "gpt-4o"
    tags: ["code", "review", "security"]
    created_at: "2025-10-01"
    updated_at: "2026-02-15"

  data_analysis:
    current_version: "1.1"
    description: "数据分析提示词"
    owner: "data-team"
    model_compatibility:
      - "claude-opus-4-6"
      - "gemini-2.5-pro"
    tags: ["data", "analysis", "visualization"]

Version file format

# prompts/code_review/v2.0.yaml
metadata:
  version: "2.0"
  parent_version: "1.1"
  author: "alice"
  date: "2026-02-15"
  change_summary: "Rewrote the output format as a JSON Schema and added a security review dimension"
  breaking_changes: true
  migration_notes: "Output changed from Markdown to JSON; callers must update their parsing logic"

template:
  role: "高级代码审查专家"

  system: |
    You are a senior code reviewer with more than 10 years of software engineering experience.
    You review with security as the top priority, followed by correctness and maintainability.

  context: |
    Project: ${project_name}
    Tech stack: ${tech_stack}
    Coding standard: ${code_standard}

  task: |
    Review the following code change and produce structured review comments according to the review criteria.

  constraints:
    - "每个发现必须包含具体的行号和代码片段"
    - "严重级别分为 critical/major/minor/suggestion"
    - "必须提供可执行的修复建议"
    - "安全相关发现必须标注 OWASP 分类"

  output_schema:
    type: object
    properties:
      summary:
        type: string
        description: "100字以内的审查摘要"
      findings:
        type: array
        items:
          type: object
          properties:
            severity: { type: string, enum: [critical, major, minor, suggestion] }
            category: { type: string }
            file: { type: string }
            line: { type: integer }
            description: { type: string }
            suggestion: { type: string }
          required: [severity, category, description, suggestion]
      verdict:
        type: string
        enum: [approve, request_changes, reject]

  examples:
    - input: "def get_user(id):\n    return db.execute(f'SELECT * FROM users WHERE id = {id}')"
      output: |
        {"findings": [{"severity": "critical", "category": "security", "description": "SQL injection"}], "verdict": "reject"}

parameters:
  project_name: { type: string, required: true }
  tech_stack: { type: string, required: true }
  code_standard: { type: string, default: "PEP 8" }
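
Rendering such a version file is mostly string substitution. A minimal sketch, assuming the file format above and using Python's string.Template for the ${var} placeholders (the section names follow the YAML; everything else is illustrative):

from string import Template

def render(prompt: dict, params: dict) -> str:
    """Assemble the system/context/task sections, filling ${var} placeholders."""
    # Apply declared defaults for any parameter the caller omitted
    for pname, spec in prompt.get("parameters", {}).items():
        params.setdefault(pname, spec.get("default", ""))

    template = prompt["template"]
    sections = [template.get(key, "") for key in ("system", "context", "task")]
    return "\n\n".join(
        Template(section).safe_substitute(params) for section in sections if section
    )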

2. Change Management Workflow

Prompt change workflow

Propose change ──→ Author new version ──→ Automated tests ──→ Code review ──→ A/B test ──→ Full rollout
                                                │                  │              │
                                                v                  v              v
                                        Regression check      Peer review   Data validation
                                           (automated)          (human)     (stat. significance)

Changelog

<!-- prompts/code_review/CHANGELOG.md -->

# Code Review Prompt Changelog

## v2.0 (2026-02-15) [BREAKING]
- Output format changed from Markdown to JSON Schema
- Added a security review dimension (OWASP categories)
- Added a verdict field (approve/request_changes/reject)
- Test cases grew from 15 to 32
- Callers must update their parsing logic

## v1.1 (2026-01-20)
- Added a performance review dimension
- Improved the few-shot examples (more representative)
- Fixed: false positives on Python type hints

## v1.0 (2025-10-01)
- Initial version
- Covers security, correctness, and maintainability review

3. Automated Regression Testing

Test framework design

import json
from pathlib import Path

import jsonschema  # pip install jsonschema


class PromptTestRunner:
    """Automated prompt test runner."""

    def __init__(self, prompt_dir: str):
        self.prompt_dir = Path(prompt_dir)

    def run_tests(self, prompt_name: str,
                  version: str) -> TestReport:
        """Run all test cases for the given prompt version."""
        prompt = self._load_prompt(prompt_name, version)
        test_cases = self._load_test_cases(prompt_name)
        results = []

        for test_case in test_cases:
            result = self._run_single_test(prompt, test_case)
            results.append(result)

        return TestReport(
            prompt_name=prompt_name,
            version=version,
            total=len(results),
            passed=sum(1 for r in results if r.passed),
            failed=sum(1 for r in results if not r.passed),
            details=results
        )

    def _run_single_test(self, prompt: dict,
                         test_case: dict) -> TestResult:
        """Execute a single test case."""
        # Render the prompt template
        rendered = self._render(prompt, test_case["input_params"])

        # Call the LLM
        output = self._call_llm(rendered, test_case.get("input_text"))

        # Evaluate each assertion
        assertions = test_case.get("assertions", [])
        assertion_results = []

        for assertion in assertions:
            result = self._check_assertion(output, assertion)
            assertion_results.append(result)

        return TestResult(
            test_id=test_case["id"],
            passed=all(r.passed for r in assertion_results),
            assertions=assertion_results,
            output=output,
            duration_ms=self._last_duration_ms,
            token_usage=self._last_token_usage
        )

    def _check_assertion(self, output: str,
                         assertion: dict) -> AssertionResult:
        """Evaluate a single assertion."""
        assert_type = assertion["type"]

        if assert_type == "contains":
            # Output must contain the given text
            passed = assertion["value"] in output
            return AssertionResult(
                type=assert_type,
                passed=passed,
                expected=assertion["value"],
                actual=f"{'found' if passed else 'not found'} in output"
            )

        elif assert_type == "json_schema":
            # Output must parse as JSON and conform to the schema
            try:
                parsed = json.loads(output)
                jsonschema.validate(parsed, assertion["schema"])
                return AssertionResult(type=assert_type, passed=True)
            except (json.JSONDecodeError, jsonschema.ValidationError) as e:
                return AssertionResult(
                    type=assert_type, passed=False, error=str(e)
                )

        elif assert_type == "not_contains":
            # Output must NOT contain the given text (safety checks)
            passed = assertion["value"] not in output
            return AssertionResult(type=assert_type, passed=passed)

        elif assert_type == "llm_judge":
            # Grade output quality with an LLM judge
            score = self._llm_judge(output, assertion["criteria"])
            passed = score >= assertion.get("threshold", 0.7)
            return AssertionResult(
                type=assert_type, passed=passed,
                score=score, threshold=assertion.get("threshold", 0.7)
            )

        # Fail loudly on unknown assertion types instead of silently returning None
        raise ValueError(f"Unknown assertion type: {assert_type}")
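
The runner doubles as a CI gate. A minimal usage sketch, assuming the classes above and treating a failed case like a failed unit test (paths and names are illustrative):

runner = PromptTestRunner("prompts")
report = runner.run_tests("code_review", version="2.0")
print(f"{report.passed}/{report.total} test cases passed")

# Fail the pipeline when any case fails, as a normal test suite would
if report.failed > 0:
    raise SystemExit(1)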

Test case definitions

# prompts/code_review/tests/test_cases.yaml
test_cases:
  - id: "tc-001"
    name: "SQL 注入检测"
    category: "security"
    input_params:
      project_name: "test-app"
      tech_stack: "Python + SQLAlchemy"
    input_text: |
      def get_user(user_id):
          query = f"SELECT * FROM users WHERE id = {user_id}"
          return db.execute(query)
    assertions:
      - type: "json_schema"
        schema: { "$ref": "#/definitions/review_output" }
      - type: "contains"
        value: "critical"
      - type: "contains"
        value: "SQL"
      - type: "not_contains"
        value: "approve"

  - id: "tc-002"
    name: "安全代码应通过"
    category: "positive"
    input_params:
      project_name: "test-app"
      tech_stack: "Python + SQLAlchemy"
    input_text: |
      def get_user(user_id: int) -> User:
          return db.session.query(User).filter(User.id == user_id).first()
    assertions:
      - type: "json_schema"
        schema: { "$ref": "#/definitions/review_output" }
      - type: "not_contains"
        value: "critical"
      - type: "llm_judge"
        criteria: "审查结论是否合理?安全代码不应被报告重大问题。"
        threshold: 0.8

  - id: "tc-003"
    name: "空输入处理"
    category: "edge_case"
    input_params:
      project_name: "test-app"
      tech_stack: "Python"
    input_text: ""
    assertions:
      - type: "json_schema"
        schema: { "$ref": "#/definitions/review_output" }
      - type: "not_contains"
        value: "undefined"

Regression detection

from collections import defaultdict

import numpy as np
from scipy import stats


class RegressionDetector:
    """Detects regressions between two prompt versions."""

    def __init__(self, runner: PromptTestRunner):
        self.runner = runner

    def compare_versions(self, prompt_name: str,
                         old_version: str,
                         new_version: str,
                         num_runs: int = 3) -> RegressionReport:
        """Run both versions repeatedly and compare their test results."""
        old_results = []
        new_results = []

        for _ in range(num_runs):
            old_results.append(
                self.runner.run_tests(prompt_name, old_version)
            )
            new_results.append(
                self.runner.run_tests(prompt_name, new_version)
            )

        # Aggregate pass rates across runs
        old_pass_rate = np.mean([r.pass_rate for r in old_results])
        new_pass_rate = np.mean([r.pass_rate for r in new_results])

        # Two-sample t-test for statistical significance
        _, p_value = stats.ttest_ind(
            [r.pass_rate for r in old_results],
            [r.pass_rate for r in new_results]
        )

        regressions = self._find_regressions(old_results, new_results)

        return RegressionReport(
            old_version=old_version,
            new_version=new_version,
            old_pass_rate=old_pass_rate,
            new_pass_rate=new_pass_rate,
            delta=new_pass_rate - old_pass_rate,
            p_value=p_value,
            statistically_significant=p_value < 0.05,
            regressions=regressions,
            verdict="PASS" if not regressions else "FAIL"
        )

    def _find_regressions(self, old_results, new_results):
        """Identify the specific test cases that regressed."""
        regressions = []
        old_pass_counts = defaultdict(int)
        new_pass_counts = defaultdict(int)

        for report in old_results:
            for detail in report.details:
                if detail.passed:
                    old_pass_counts[detail.test_id] += 1

        for report in new_results:
            for detail in report.details:
                if detail.passed:
                    new_pass_counts[detail.test_id] += 1

        for test_id in old_pass_counts:
            old_rate = old_pass_counts[test_id] / len(old_results)
            new_rate = new_pass_counts.get(test_id, 0) / len(new_results)
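            # Flag cases that used to pass reliably but now mostly fail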
            if old_rate > 0.8 and new_rate < 0.5:
                regressions.append({
                    "test_id": test_id,
                    "old_pass_rate": old_rate,
                    "new_pass_rate": new_rate
                })

        return regressions
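
Wired into CI, the detector becomes a release gate. A minimal sketch, assuming the classes above (prompt names and the exit convention are illustrative):

runner = PromptTestRunner("prompts")
detector = RegressionDetector(runner)
report = detector.compare_versions("code_review", "1.1", "2.0", num_runs=3)

print(f"pass rate {report.old_pass_rate:.0%} -> {report.new_pass_rate:.0%} "
      f"(delta {report.delta:+.1%}, p = {report.p_value:.3f})")

# Block promotion when any specific test case regressed
if report.verdict == "FAIL":
    for reg in report.regressions:
        print(f"  REGRESSED {reg['test_id']}: "
              f"{reg['old_pass_rate']:.0%} -> {reg['new_pass_rate']:.0%}")
    raise SystemExit(1)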

4. A/B Testing Framework

Experiment design

import hashlib
from datetime import datetime

import numpy as np
from scipy import stats


class PromptExperiment:
    """A prompt A/B test experiment."""

    def __init__(self, name: str, config: ExperimentConfig):
        self.name = name
        self.config = config
        self.results = {"control": [], "treatment": []}

    def assign_variant(self, request_id: str) -> str:
        """Assign the request to an experiment arm."""
        # Deterministic bucketing keyed on the request ID, so assignment is
        # reproducible. A stable hash is required here: Python's built-in
        # hash() is salted per process and would reshuffle buckets on restart.
        digest = hashlib.sha256(f"{self.name}:{request_id}".encode()).hexdigest()
        if int(digest, 16) % 100 < self.config.treatment_percentage:
            return "treatment"
        return "control"

    def record_result(self, variant: str,
                      metrics: dict):
        """Record one observation for an experiment arm."""
        self.results[variant].append({
            "timestamp": datetime.now(),
            **metrics
        })

    def analyze(self) -> ExperimentAnalysis:
        """Analyze the experiment results."""
        control = self.results["control"]
        treatment = self.results["treatment"]

        if len(control) < 30 or len(treatment) < 30:
            return ExperimentAnalysis(
                status="insufficient_data",
                message=f"Need 30+ samples per group. "
                        f"Control: {len(control)}, "
                        f"Treatment: {len(treatment)}"
            )

        analyses = {}
        for metric_name in self.config.metrics:
            ctrl_values = [r[metric_name] for r in control]
            treat_values = [r[metric_name] for r in treatment]

            # Two-sample t-test
            t_stat, p_value = stats.ttest_ind(ctrl_values, treat_values)

            ctrl_mean = np.mean(ctrl_values)
            treat_mean = np.mean(treat_values)
            lift = (treat_mean - ctrl_mean) / ctrl_mean if ctrl_mean else 0

            analyses[metric_name] = {
                "control_mean": ctrl_mean,
                "treatment_mean": treat_mean,
                "lift": f"{lift:+.1%}",
                "p_value": p_value,
                "significant": p_value < 0.05,
                "recommendation": (
                    "ADOPT" if p_value < 0.05 and lift > 0
                    else "REJECT" if p_value < 0.05 and lift < 0
                    else "CONTINUE"  # 需要更多数据
                )
            }

        return ExperimentAnalysis(
            status="completed",
            sample_sizes={"control": len(control),
                         "treatment": len(treatment)},
            metrics=analyses
        )
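
A minimal end-to-end sketch of running an experiment, assuming the class above; ExperimentConfig mirrors the YAML below, and iter_requests/run_review are hypothetical application hooks:

config = ExperimentConfig(treatment_percentage=20,
                          metrics=["output_quality", "token_usage"])
experiment = PromptExperiment("code_review_v2_rollout", config)

for request_id, diff in iter_requests():                # hypothetical request source
    variant = experiment.assign_variant(request_id)
    version = "2.0" if variant == "treatment" else "1.1"
    metrics = run_review(diff, prompt_version=version)   # hypothetical; returns a metric dict
    experiment.record_result(variant, metrics)

print(experiment.analyze())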

A/B test configuration

# experiments/code_review_v2.yaml
experiment:
  name: "code_review_v2_rollout"
  description: "测试 v2.0 代码审查提示词的效果"
  start_date: "2026-02-20"
  end_date: "2026-03-05"

  variants:
    control:
      prompt: "code_review/v1.1"
      description: "当前线上版本"
    treatment:
      prompt: "code_review/v2.0"
      description: "新版 JSON Schema 输出"

  traffic_split:
    control: 80
    treatment: 20

  metrics:
    - name: "output_quality"
      description: "LLM judge score (0-1)"
      primary: true
      direction: "higher_is_better"

    - name: "schema_compliance"
      description: "Share of outputs conforming to the schema"
      direction: "higher_is_better"

    - name: "token_usage"
      description: "Average token consumption"
      direction: "lower_is_better"

    - name: "latency_ms"
      description: "Average latency (milliseconds)"
      direction: "lower_is_better"

  guardrails:
    min_sample_size: 100
    max_regression_allowed: 0.05  # Auto-rollback if the primary metric regresses > 5%
    auto_rollback: true
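
A minimal sketch of enforcing these guardrails, assuming the analysis object produced by PromptExperiment.analyze() above; the rollback hook is hypothetical:

def check_guardrails(analysis, guardrails: dict, primary_metric: str) -> None:
    """Auto-rollback when the primary metric regresses past the allowed limit."""
    if analysis.status != "completed":
        return  # Not enough data to act on yet
    if min(analysis.sample_sizes.values()) < guardrails["min_sample_size"]:
        return

    m = analysis.metrics[primary_metric]
    lift = (m["treatment_mean"] - m["control_mean"]) / m["control_mean"]
    # For a higher_is_better metric, a negative lift is a regression
    if m["significant"] and lift < -guardrails["max_regression_allowed"]:
        if guardrails.get("auto_rollback"):
            rollback_to_control()  # hypothetical deployment hook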

Experiment results report

Experiment Report: code_review_v2_rollout
==========================================

Duration: 2026-02-20 to 2026-02-28 (8 days)
Status: COMPLETED

Sample Sizes:
  Control (v1.1):   412 requests
  Treatment (v2.0):  98 requests

Metrics:
  output_quality (primary):
    Control:   0.72
    Treatment: 0.81
    Lift:      +12.5%
    p-value:   0.003
    Verdict:   SIGNIFICANT IMPROVEMENT

  schema_compliance:
    Control:   0.65
    Treatment: 0.94
    Lift:      +44.6%
    p-value:   < 0.001
    Verdict:   SIGNIFICANT IMPROVEMENT

  token_usage:
    Control:   2,340
    Treatment: 2,890
    Lift:      +23.5%
    p-value:   < 0.001
    Verdict:   SIGNIFICANT REGRESSION (cost increase)

  latency_ms:
    Control:   3,200
    Treatment: 3,800
    Lift:      +18.8%
    p-value:   0.012
    Verdict:   SIGNIFICANT REGRESSION (slower)

Recommendation:
  ADOPT with cost awareness.
  Quality and compliance significantly improved.
  Token usage and latency increased, acceptable tradeoff.
  Consider optimizing v2.0 to reduce token consumption.

5. Continuous Optimization Loop

The continuous prompt optimization loop:

  Release version ──→ Collect data ──→ Assess quality ──→ Identify issues
         ^                                                        │
         │                                                        v
         └──── A/B test ←── Author new version ←── Design improvements

Automated optimization suggestions

import json


class PromptOptimizer:
    """Generates automated prompt-optimization suggestions."""

    def analyze_failures(self, prompt_name: str,
                         recent_days: int = 7) -> list[dict]:
        """Analyze recent failure cases and produce improvement suggestions."""
        failures = self._get_recent_failures(prompt_name, recent_days)

        # Cluster failures by error pattern (see the sketch after this class)
        clusters = self._cluster_failures(failures)

        suggestions = []
        for cluster in clusters:
            # Ask an LLM to analyze the failure pattern and propose fixes
            analysis = llm.generate(f"""
Analyze the common pattern across the following prompt failure cases:

{json.dumps(cluster['samples'][:5], ensure_ascii=False)}

Current prompt:
{self._load_current_prompt(prompt_name)}

Give concrete suggestions for revising the prompt.""")

            suggestions.append({
                "cluster_size": cluster["count"],
                "failure_pattern": cluster["pattern"],
                "suggestion": analysis,
                "estimated_impact": cluster["count"] / len(failures)
            })

        return sorted(suggestions,
                      key=lambda x: x["estimated_impact"],
                      reverse=True)
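
The clustering step above is left abstract. A cheap stand-in for real clustering, assuming each failure record carries the failed assertion type and an error message (the grouping key is a deliberate simplification):

from collections import defaultdict

def cluster_failures_by_pattern(failures: list[dict]) -> list[dict]:
    """Group failures by (assertion type, first line of error text)."""
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for failure in failures:
        first_line = (failure.get("error") or "").splitlines()[:1]
        key = (failure["assertion_type"], first_line[0][:80] if first_line else "")
        groups[key].append(failure)

    clusters = [
        {"pattern": f"{atype}: {err}", "count": len(samples), "samples": samples}
        for (atype, err), samples in groups.items()
    ]
    return sorted(clusters, key=lambda c: c["count"], reverse=True)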

Engineering Practices

  1. Prompts are code: prompt changes go through the same PR review process as code
  2. Test coverage: every prompt gets at least 10 test cases covering positive, negative, and edge cases
  3. Multiple runs: LLM output is non-deterministic, so run each test case at least 3 times and take the consensus (see the sketch after this list)
  4. A/B test first: major prompt changes must be validated by an A/B test before a full rollout
  5. Track cost: prompt optimizations can increase token consumption, so track cost metrics alongside quality
  6. Model compatibility: the same prompt can behave very differently across models, so test each target model separately
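
A minimal sketch of the consensus rule from item 3, where run_once is a hypothetical wrapper that executes one test run and returns whether it passed:

def passes_with_consensus(run_once, num_runs: int = 3) -> bool:
    """A test case passes only if a majority of repeated runs pass."""
    outcomes = [run_once() for _ in range(num_runs)]
    return sum(outcomes) > num_runs / 2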

References

  • PromptFoo: open-source prompt testing framework
  • LangSmith: LangChain's prompt management and evaluation platform
  • Braintrust: prompt version management and A/B testing platform
  • Humanloop: prompt management and evaluation SaaS
  • DSPy: automated prompt optimization framework

Maurice | maurice_wen@proton.me