Meta-Prompting: Using AI to Optimize AI's Prompts

Automated prompt optimization, the DSPy framework, and evaluation-driven prompt evolution | 2026-02


1. The Core Idea of Meta-Prompting

Meta-Prompting means using an LLM to optimize prompts for an LLM. It is not a new concept: it simply hands the human task of prompt engineering over to AI as well, forming a self-improving loop.

Traditional Prompt Engineering:
  Human writes prompt -> Test -> Human edits prompt -> Test -> ...

Meta-Prompting:
  Human defines task + eval -> AI generates prompt -> Auto-eval
       -> AI improves prompt -> Auto-eval -> ... (loop until good enough)
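The loop in the diagram can be sketched generically. In this sketch, `propose` and `evaluate` are hypothetical injected callables (in practice an LLM call and a scoring function), so the skeleton stays model-agnostic:

```python
def meta_optimize(seed_prompt, propose, evaluate, target=0.9, max_iters=5):
    """Generic meta-prompting loop: evaluate, then let an optimizer propose revisions."""
    prompt, score = seed_prompt, evaluate(seed_prompt)
    for _ in range(max_iters):
        if score >= target:
            break  # good enough, stop early
        candidate = propose(prompt, score)   # e.g. an LLM rewriting the prompt
        cand_score = evaluate(candidate)     # e.g. accuracy on a test set
        if cand_score > score:               # keep only improvements
            prompt, score = candidate, cand_score
    return prompt, score
```

Every method in this article (manual loops, OPRO, DSPy, EvoPrompt) is a variation on which `propose` and `evaluate` you plug in.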

2. Meta-Prompting Approaches

2.1 Four Main Methods

Method              | Principle                                     | Tools              | Automation
Prompt generation   | Describe the task and let AI write the prompt | Manual / AI Studio | Low
Prompt optimization | Iterate on the prompt from eval feedback      | OPRO / APE         | Medium
Programmatic        | Treat prompts as program parameters           | DSPy               | High
Evolutionary search | Genetic-algorithm search over prompt space    | EvoPrompt          | High

2.2 Comparing the Methods

Automation Level vs Quality

Quality
  |
  |              DSPy
  |             /
  |       OPRO /
  |         / /
  |  APE  /  /
  |     / EvoPrompt
  |   /
  |  / Manual
  | /
  +-----------------> Automation
     Low      High

Trade-offs:
- Manual: Highest control, lowest scale
- APE/OPRO: Good balance, needs good eval
- DSPy: Best for pipelines, steep learning curve
- EvoPrompt: Creative exploration, expensive

3. Basic Meta-Prompting

3.1 The Prompt Generator

META_PROMPT_GENERATOR = """
You are an expert prompt engineer. Your task is to create an optimal
system prompt for a specific use case.

## Task Description
{task_description}

## Requirements
- The prompt should be clear, specific, and unambiguous
- Include role definition, rules, output format, and examples
- Use structured sections (Identity, Rules, Format, Examples)
- Anticipate edge cases and include handling instructions
- Optimize for the model: {target_model}

## Evaluation Criteria
The prompt will be evaluated on:
{eval_criteria}

## Output
Generate a complete system prompt ready for production use.
Include your reasoning for key design decisions.
"""

from openai import AsyncOpenAI

client = AsyncOpenAI()

async def generate_prompt(
    task_description: str,
    eval_criteria: list[str],
    target_model: str = "gpt-4o",
) -> str:
    response = await client.chat.completions.create(
        model="gpt-4o",  # use a strong model for meta-prompting
        messages=[
            {"role": "system", "content": "You are a world-class prompt engineer."},
            {"role": "user", "content": META_PROMPT_GENERATOR.format(
                task_description=task_description,
                eval_criteria="\n".join(f"- {c}" for c in eval_criteria),
                target_model=target_model,
            )},
        ],
        temperature=0.7,
        max_tokens=4096,
    )
    return response.choices[0].message.content

3.2 The Automated Optimization Loop

OPTIMIZER_PROMPT = """
You are a prompt optimization specialist.

## Current Prompt
{current_prompt}

## Evaluation Results
{eval_results}

## Failure Cases
{failure_cases}

## Task
Analyze why the prompt failed on these cases and generate an improved version.
Focus on:
1. What pattern do the failures have in common?
2. What instruction is missing or unclear?
3. How can you make the prompt more robust?

Output the improved prompt only, with brief annotations for changes.
"""

from typing import Callable

from openai import AsyncOpenAI

client = AsyncOpenAI()

async def optimize_prompt_loop(
    initial_prompt: str,
    test_dataset: list[dict],
    evaluator: Callable,
    max_iterations: int = 5,
    target_score: float = 0.90,
) -> tuple[str, float]:
    """Iteratively optimize a prompt using LLM feedback.

    Assumes an evaluate_prompt() helper that runs the prompt over the
    dataset and returns avg_score, pass_rate, and per-sample details.
    """
    current_prompt = initial_prompt
    best_prompt = initial_prompt
    best_score = 0.0

    for iteration in range(max_iterations):
        # Evaluate the current prompt
        results = await evaluate_prompt(current_prompt, test_dataset, evaluator)
        score = results["avg_score"]

        print(f"Iteration {iteration + 1}: Score = {score:.3f}")

        if score > best_score:
            best_prompt = current_prompt
            best_score = score

        if score >= target_score:
            print(f"Target reached at iteration {iteration + 1}")
            break

        # Collect up to five of the worst failure cases
        failures = [r for r in results["details"] if r["score"] < 0.5][:5]
        failure_text = "\n".join(
            f"Input: {f['input']}\nExpected: {f['expected']}\nGot: {f['output']}\n"
            for f in failures
        )

        # Generate an improved prompt from the failure analysis
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "user", "content": OPTIMIZER_PROMPT.format(
                    current_prompt=current_prompt,
                    eval_results=f"Score: {score:.3f}, Pass rate: {results['pass_rate']:.1%}",
                    failure_cases=failure_text,
                )},
            ],
            temperature=0.5,
        )
        current_prompt = response.choices[0].message.content

    return best_prompt, best_score

4. The DSPy Framework

4.1 DSPy's Core Philosophy

DSPy turns prompt engineering from "writing prompts" into "writing programs". Its core idea: declare what you want (a signature) instead of telling the model how to do it (a prompt).

import dspy

# Traditional approach: manually craft prompt
manual_prompt = """Given a question and context, provide a concise answer.
Be factual. Cite the context. Keep it under 50 words."""

# DSPy approach: declare the signature
class QA(dspy.Signature):
    """Answer the question based on the given context."""
    context: str = dspy.InputField(desc="Relevant information")
    question: str = dspy.InputField(desc="User's question")
    answer: str = dspy.OutputField(desc="Concise, factual answer")

# DSPy automatically optimizes the prompt behind the scenes
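Conceptually, DSPy compiles a signature into a concrete prompt. As a toy illustration (this is my own sketch, not DSPy's actual template engine), field declarations might render into prompt text like this:

```python
def render_signature(instructions: str, inputs: dict[str, str],
                     outputs: dict[str, str], values: dict[str, str]) -> str:
    """Toy rendering of a signature: instructions first, then labeled fields."""
    lines = [instructions, ""]
    for name, desc in inputs.items():
        # Input fields are filled with concrete values at call time
        lines.append(f"{name.capitalize()} ({desc}): {values[name]}")
    for name, desc in outputs.items():
        # Output fields are left blank for the model to complete
        lines.append(f"{name.capitalize()} ({desc}):")
    return "\n".join(lines)
```

The point of the abstraction is that an optimizer can rewrite the instructions and inject few-shot demos without you touching any prompt string by hand.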

4.2 Modular DSPy Pipelines

import dspy
from dspy.teleprompt import BootstrapFewShot

# Configure the LLM (dspy.Retrieve also needs a retrieval model,
# e.g. dspy.configure(lm=lm, rm=your_retriever))
lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

# Define modules
class RAGPipeline(dspy.Module):
    def __init__(self):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=5)
        self.generate = dspy.ChainOfThought(QA)

    def forward(self, question: str) -> dspy.Prediction:
        # Retrieve relevant passages
        context = self.retrieve(question).passages
        # Generate answer with chain-of-thought
        prediction = self.generate(
            context="\n".join(context),
            question=question,
        )
        return prediction

# Define evaluation metric
def metric(example, prediction, trace=None):
    """Evaluate if the answer is correct and grounded in the passages."""
    # Exact-match correctness against the gold answer
    correct = dspy.evaluate.answer_exact_match(example, prediction)
    # Grounding: the gold answer should appear in the retrieved passages
    grounded = dspy.evaluate.answer_passage_match(example, prediction)
    return correct and grounded

# Compile (optimize) the pipeline
trainset = [
    dspy.Example(question="What is RAG?",
                 answer="Retrieval Augmented Generation").with_inputs("question"),
    dspy.Example(question="Who created Python?",
                 answer="Guido van Rossum").with_inputs("question"),
    # ... more training examples
]

optimizer = BootstrapFewShot(metric=metric, max_bootstrapped_demos=4)
compiled_rag = optimizer.compile(RAGPipeline(), trainset=trainset)

# The compiled module has optimized prompts and few-shot examples
result = compiled_rag(question="What is vector search?")

4.3 DSPy Optimizers

Optimizer                        | Principle                                      | Best for
BootstrapFewShot                 | Bootstraps and selects the best few-shot demos | general classification / generation
BootstrapFewShotWithRandomSearch | Random search over demo combinations           | enough training data available
MIPRO                            | Jointly optimizes instructions and demos       | multi-stage pipelines
COPRO                            | Coordinate-ascent refinement of instructions   | complex chains
SignatureOptimizer               | Directly optimizes signature descriptions      | fine-grained control

5. OPRO (Optimization by PROmpting)

5.1 The OPRO Algorithm

from openai import AsyncOpenAI

client = AsyncOpenAI()

async def opro_optimize(
    task_description: str,
    test_cases: list[dict],
    n_candidates: int = 8,
    n_iterations: int = 10,
) -> str:
    """OPRO: use an LLM to propose prompt candidates, scored against test cases."""
    # Seed the history with a naive prompt so the first meta-prompt has context
    seed = f"Complete the following task: {task_description}"
    prompt_scores: list[tuple[str, float]] = [
        (seed, await evaluate_prompt_on_test_cases(seed, test_cases)),
    ]

    for iteration in range(n_iterations):
        # Show up to 10 scored prompts, worst to best, as in the OPRO paper
        history_text = "\n".join(
            f'Prompt: "{p}"\nScore: {s:.3f}'
            for p, s in sorted(prompt_scores, key=lambda x: x[1])[-10:]
        )

        # Generate new candidate prompts
        meta_prompt = f"""
Task: {task_description}

Previous prompts and their scores (higher is better):
{history_text}

Generate {n_candidates} new prompt candidates that might score higher.
Learn from the patterns in high-scoring prompts.
Output each prompt on a separate line, prefixed with "PROMPT: ".
"""
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": meta_prompt}],
            temperature=1.0,  # high temperature for diversity
        )

        # Extract candidates
        candidates = [
            line.replace("PROMPT: ", "").strip()
            for line in response.choices[0].message.content.split("\n")
            if line.strip().startswith("PROMPT:")
        ]

        # Evaluate each candidate
        for candidate in candidates:
            score = await evaluate_prompt_on_test_cases(candidate, test_cases)
            prompt_scores.append((candidate, score))

    # Return the best prompt found
    best_prompt, _ = max(prompt_scores, key=lambda x: x[1])
    return best_prompt
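The loops above call an `evaluate_prompt_on_test_cases` helper that is left undefined. A minimal sketch, assuming exact-match scoring; the `run` parameter (my addition, a hypothetical async `(prompt, input) -> output` callable) is shown explicitly so the evaluator stays model-agnostic and testable, whereas in the loops above it would close over a model client instead:

```python
from typing import Awaitable, Callable

async def evaluate_prompt_on_test_cases(
    prompt: str,
    test_cases: list[dict],
    run: Callable[[str, str], Awaitable[str]],
) -> float:
    """Score a prompt as the fraction of test cases answered exactly right."""
    passed = 0
    for case in test_cases:
        output = await run(prompt, case["input"])
        if output.strip() == case["expected"].strip():
            passed += 1
    return passed / len(test_cases)
```

In practice you would replace exact match with the multi-dimensional scoring from section 7.2.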

6. Evolutionary Search

6.1 The EvoPrompt Concept

import random

async def evo_prompt(
    initial_population: list[str],
    test_cases: list[dict],
    generations: int = 20,
    population_size: int = 10,
    mutation_rate: float = 0.3,
) -> str:
    """Evolutionary prompt optimization."""
    # Evaluate initial population
    population = []
    for prompt in initial_population:
        score = await evaluate_prompt_on_test_cases(prompt, test_cases)
        population.append({"prompt": prompt, "score": score})

    for gen in range(generations):
        # Selection: keep top 50%
        population.sort(key=lambda x: x["score"], reverse=True)
        survivors = population[:population_size // 2]

        # Crossover: combine pairs of survivors
        offspring = []
        for i in range(0, len(survivors) - 1, 2):
            child = await crossover(
                survivors[i]["prompt"],
                survivors[i + 1]["prompt"],
            )
            offspring.append(child)

        # Mutation: randomly modify some prompts
        mutants = []
        for item in survivors:
            if random.random() < mutation_rate:
                mutated = await mutate(item["prompt"])
                mutants.append(mutated)

        # Evaluate new candidates
        new_candidates = offspring + mutants
        for prompt in new_candidates:
            score = await evaluate_prompt_on_test_cases(prompt, test_cases)
            population.append({"prompt": prompt, "score": score})

        # Keep only top N
        population.sort(key=lambda x: x["score"], reverse=True)
        population = population[:population_size]

        print(f"Gen {gen + 1}: Best = {population[0]['score']:.3f}")

    return population[0]["prompt"]

from openai import AsyncOpenAI

client = AsyncOpenAI()

async def crossover(prompt_a: str, prompt_b: str) -> str:
    """Combine two prompts using an LLM."""
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"""
Combine the best aspects of these two prompts into one:

Prompt A: {prompt_a}

Prompt B: {prompt_b}

Output a single improved prompt that takes the strengths of both.
"""}],
        temperature=0.7,
    )
    return response.choices[0].message.content

async def mutate(prompt: str) -> str:
    """Randomly modify a prompt using an LLM."""
    mutations = [
        "Make the instructions more specific",
        "Add an edge case handling rule",
        "Simplify the language",
        "Add a constraint to reduce errors",
        "Rephrase for clarity",
    ]
    mutation = random.choice(mutations)
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"""
Modify this prompt by: {mutation}

Original: {prompt}

Output the modified prompt only.
"""}],
        temperature=0.8,
    )
    return response.choices[0].message.content

7. Designing the Evaluation Framework

7.1 Evaluation-Driven Optimization

Step                 | Description                      | Tooling
Define metrics       | Make "good" explicit             | human labels + LLM judge
Build a test set     | Cover normal and edge cases      | at least 50 samples
Automated evaluation | Score every change automatically | Python + LLM-as-Judge
Statistical testing  | Is the improvement significant?  | t-test / bootstrap
Regression detection | New versions must not regress    | benchmark test set
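The statistical-testing step needs no stats library. A minimal paired-bootstrap sketch (the helper name is mine) for the difference in mean score between two prompt versions evaluated on the same test cases:

```python
import random

def bootstrap_ci(scores_a: list[float], scores_b: list[float],
                 n_resamples: int = 10_000, seed: int = 0) -> tuple[float, float]:
    """95% bootstrap CI for mean(scores_b) - mean(scores_a), paired by test case."""
    assert len(scores_a) == len(scores_b), "paired scores required"
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample test cases with replacement
        da = sum(scores_a[i] for i in idx) / n
        db = sum(scores_b[i] for i in idx) / n
        diffs.append(db - da)
    diffs.sort()
    return diffs[int(0.025 * n_resamples)], diffs[int(0.975 * n_resamples)]
```

If the interval excludes 0, the new prompt's improvement is unlikely to be noise from the particular test set.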

7.2 Designing the Evaluation Function

import time
from dataclasses import dataclass

@dataclass
class EvalResult:
    score: float           # 0.0 - 1.0
    pass_rate: float       # fraction of samples scoring >= 0.7
    avg_latency_ms: float
    avg_cost_usd: float
    details: list[dict]

async def comprehensive_eval(
    prompt: str,
    test_set: list[dict],
    model: str = "gpt-4o-mini",
) -> EvalResult:
    """Evaluate a prompt across multiple dimensions.

    Assumes helpers: generate(), score_correctness(),
    score_format_compliance(), score_safety(), estimate_cost().
    """
    details = []

    for sample in test_set:
        start = time.time()
        output = await generate(prompt, sample["input"], model)
        latency = (time.time() - start) * 1000

        # Multi-dimensional scoring
        scores = {
            "correctness": await score_correctness(output, sample["expected"]),
            "format": score_format_compliance(output, sample.get("format")),
            "safety": score_safety(output),
        }
        avg_score = sum(scores.values()) / len(scores)

        details.append({
            "input": sample["input"],
            "output": output,
            "expected": sample["expected"],
            "scores": scores,
            "score": avg_score,
            "latency_ms": latency,
        })

    return EvalResult(
        score=sum(d["score"] for d in details) / len(details),
        pass_rate=sum(1 for d in details if d["score"] >= 0.7) / len(details),
        avg_latency_ms=sum(d["latency_ms"] for d in details) / len(details),
        avg_cost_usd=estimate_cost(details, model),
        details=details,
    )

8. Summary

Meta-Prompting marks the shift of prompt engineering from handcraft to automation. DSPy fits pipeline optimization with clear evaluation metrics; OPRO and EvoPrompt fit exploratory prompt search; a simple optimization loop covers most practical scenarios.

Core principles:

  1. Evaluation before optimization: without a good evaluation function, any optimization is blind
  2. Human-AI collaboration: AI searches the prompt space; humans define the goals and judgment criteria
  3. Incremental adoption: tune by hand first, and bring in automation when you hit a bottleneck
  4. Stay interpretable: an optimized prompt should remain understandable to humans

Maurice | maurice_wen@proton.me