Agent 记忆系统设计:短期、长期与工作记忆

记忆架构(Buffer/Summary/Entity/Vector)、对话窗口管理、RAG 记忆检索与情景记忆实战

引言

人类的记忆系统分为短期记忆(工作记忆,容量约 7 项)、长期记忆(近乎无限容量)和情景记忆(特定事件的回忆)。LLM Agent 面临类似的记忆挑战:上下文窗口有限(类似工作记忆容量),需要在多轮对话和多次会话之间保持连续性。

没有记忆系统的 Agent 就像一条"金鱼"——每次交互都从零开始,既无法积累经验,也无法建立持续的用户关系。本文系统性地设计 Agent 的三层记忆架构。

记忆架构全景

三层记忆模型

┌──────────────────────────────────────────────────────────┐
│                    Agent 记忆系统                          │
│                                                          │
│  ┌──────────────────────────────────────────────────┐    │
│  │  Layer 1: 工作记忆 (Working Memory)               │    │
│  │  容量: 上下文窗口 (128K tokens)                    │    │
│  │  时效: 当前会话                                    │    │
│  │  形式: 对话历史 + 当前任务状态                      │    │
│  └──────────────────────────┬───────────────────────┘    │
│                             │ 溢出/压缩                   │
│  ┌──────────────────────────▼───────────────────────┐    │
│  │  Layer 2: 短期记忆 (Short-term Memory)            │    │
│  │  容量: 中等 (最近 N 轮对话摘要)                     │    │
│  │  时效: 跨轮次,当前会话内                           │    │
│  │  形式: 摘要 + 关键实体 + 待办事项                   │    │
│  └──────────────────────────┬───────────────────────┘    │
│                             │ 沉淀                       │
│  ┌──────────────────────────▼───────────────────────┐    │
│  │  Layer 3: 长期记忆 (Long-term Memory)             │    │
│  │  容量: 无限 (向量数据库 + 结构化存储)               │    │
│  │  时效: 永久(跨会话持久化)                         │    │
│  │  形式: 向量索引 + 实体图谱 + 情景快照               │    │
│  └──────────────────────────────────────────────────┘    │
└──────────────────────────────────────────────────────────┘

记忆类型对比

| 记忆类型 | 容量 | 时效 | 存储 | 检索方式 | 适用场景 |
| --- | --- | --- | --- | --- | --- |
| Buffer Memory | 最近 K 轮 | 当前会话 | 内存 | 直接拼接 | 简单对话 |
| Summary Memory | 压缩摘要 | 当前会话 | 内存 | 前置摘要 | 长对话 |
| Entity Memory | 实体属性表 | 跨会话 | DB | 实体查找 | 用户画像 |
| Vector Memory | 无限 | 永久 | 向量 DB | 语义检索 | 知识积累 |
| Episodic Memory | 事件快照 | 永久 | DB | 时间 + 语义 | 经验学习 |
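上表的取舍可以写成一个简单的选择函数示意。注意 MemoryRequirement 与 select_memory_types 均为本文为说明而虚构的命名,并非某个库的 API:

```python
from dataclasses import dataclass

@dataclass
class MemoryRequirement:
    cross_session: bool      # 是否需要跨会话持久化
    structured: bool         # 是否需要结构化实体查询
    long_conversation: bool  # 对话是否可能超出上下文窗口

def select_memory_types(req: MemoryRequirement) -> list[str]:
    """按上表的权衡挑选记忆组件;Buffer 始终保留作为工作记忆。"""
    types = ["buffer"]
    if req.long_conversation:
        types.append("summary")   # 长对话需要渐进式摘要
    if req.structured:
        types.append("entity")    # 用户画像类需求用实体记忆
    if req.cross_session:
        types.append("vector")    # 跨会话学习落到向量记忆
    return types
```

各类型并不互斥,生产系统通常是多层叠加,后文的 MemoryManager 即是四层组合。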

工作记忆管理

对话窗口策略

# src/memory/conversation_window.py
from typing import Optional
from dataclasses import dataclass

@dataclass
class Message:
    role: str           # "system" | "user" | "assistant"
    content: str
    token_count: int = 0
    timestamp: float = 0.0

class ConversationWindow:
    """Manages the conversation context within token limits."""

    def __init__(
        self,
        max_tokens: int = 100_000,
        system_reserve: int = 5_000,
        output_reserve: int = 4_000,
    ):
        self.max_tokens = max_tokens
        self.system_reserve = system_reserve
        self.output_reserve = output_reserve
        self.messages: list[Message] = []
        self.system_message: Optional[Message] = None

    @property
    def available_tokens(self) -> int:
        used = sum(m.token_count for m in self.messages)
        # Reserve system_reserve when no system message is set yet,
        # so injecting one later stays within budget
        used += self.system_message.token_count if self.system_message else self.system_reserve
        return self.max_tokens - used - self.output_reserve

    def add_message(self, message: Message) -> list[Message]:
        """Add message, evicting oldest if necessary. Returns evicted messages."""
        evicted = []
        self.messages.append(message)

        # Evict oldest non-system messages until within budget
        while self.available_tokens < 0 and len(self.messages) > 2:
            # Keep at least the latest user message and assistant response
            evicted.append(self.messages.pop(0))

        return evicted

    def get_context(self) -> list[dict]:
        """Build the context for LLM call."""
        context = []
        if self.system_message:
            context.append({"role": "system", "content": self.system_message.content})
        for msg in self.messages:
            context.append({"role": msg.role, "content": msg.content})
        return context

    def get_summary_candidates(self, threshold: int = 10) -> list[Message]:
        """Get old messages that should be summarized."""
        if len(self.messages) > threshold:
            return self.messages[:len(self.messages) - threshold]
        return []
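add_message 中的淘汰循环本质上是一个纯粹的预算计算,下面用独立的小函数复演一遍(参数取值随意,仅作示意):

```python
def available_tokens(message_tokens: list[int], max_tokens: int,
                     output_reserve: int, system_tokens: int = 0) -> int:
    """与上文 ConversationWindow.available_tokens 等价的预算计算。"""
    return max_tokens - sum(message_tokens) - system_tokens - output_reserve

def evict_oldest(message_tokens: list[int], max_tokens: int,
                 output_reserve: int, keep_last: int = 2):
    """复演 add_message 的淘汰循环:从最旧一端弹出,直到预算非负或只剩 keep_last 条。"""
    msgs = list(message_tokens)
    evicted = []
    while available_tokens(msgs, max_tokens, output_reserve) < 0 and len(msgs) > keep_last:
        evicted.append(msgs.pop(0))
    return msgs, evicted

# 预算 100、输出预留 10:三条消息共 110 token,超出预算,淘汰最旧的一条
print(evict_oldest([40, 40, 30], max_tokens=100, output_reserve=10))  # → ([40, 30], [40])
```

注意正文中 Message.token_count 默认为 0,实际使用时需要调用方用 tokenizer 填充真实值,否则淘汰永远不会触发。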

自动摘要压缩

# src/memory/summary_memory.py
from .conversation_window import Message

class SummaryMemory:
    """Progressively summarizes conversation history."""

    def __init__(self, llm_client, max_summary_tokens: int = 2000):
        self.llm = llm_client
        self.max_summary_tokens = max_summary_tokens
        self.running_summary: str = ""
        self.summarized_count: int = 0

    async def compress(self, messages_to_compress: list[Message]) -> str:
        """Compress messages into an updated running summary."""
        if not messages_to_compress:
            return self.running_summary

        conversation = "\n".join([
            f"{m.role}: {m.content}" for m in messages_to_compress
        ])

        prompt = f"""Progressively summarize the conversation, incorporating new lines into the existing summary.

EXISTING SUMMARY:
{self.running_summary or '(none)'}

NEW CONVERSATION LINES:
{conversation}

Produce a concise summary that captures:
1. Key topics discussed
2. Important decisions made
3. User preferences expressed
4. Pending action items
5. Critical context for future responses

Keep the summary under {self.max_summary_tokens} tokens."""

        self.running_summary = await self.llm.generate(prompt, model="gpt-4o-mini")
        self.summarized_count += len(messages_to_compress)
        return self.running_summary

    def get_context_prefix(self) -> str:
        if not self.running_summary:
            return ""
        return f"[CONVERSATION HISTORY SUMMARY]\n{self.running_summary}\n[END SUMMARY]\n"
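SummaryMemory.compress 的滚动更新模式可以抽象为下面的骨架,其中 summarize 用桩函数代替真实的 LLM 调用(fake_summarize 仅为验证流程的假设实现,真实场景替换为一次模型调用):

```python
import asyncio

async def progressive_summarize(summarize, messages: list[str], batch_size: int = 4) -> str:
    """渐进式摘要:每批新消息与既有摘要一起交给 summarize,滚动更新。

    summarize(existing_summary, new_lines) -> new_summary,真实实现中对应一次 LLM 调用。
    """
    summary = ""
    for i in range(0, len(messages), batch_size):
        batch = messages[i:i + batch_size]
        summary = await summarize(summary, batch)
    return summary

# 桩实现:把"摘要"表示为已覆盖的消息数,只验证滚动逻辑本身
async def fake_summarize(existing: str, new_lines: list[str]) -> str:
    covered = int(existing or 0)
    return str(covered + len(new_lines))

print(asyncio.run(progressive_summarize(fake_summarize, ["m"] * 10)))  # → 10
```

这种模式的代价是每批压缩一次模型调用,但换来的是摘要始终有界,不随对话长度线性增长。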

长期记忆(Vector Memory)

记忆存储与检索

# src/memory/vector_memory.py
from datetime import datetime
from typing import Optional
from dataclasses import dataclass
import uuid

@dataclass
class MemoryEntry:
    id: str
    content: str
    memory_type: str      # "fact" | "preference" | "experience" | "instruction"
    user_id: str
    session_id: str
    importance: float     # 0-1, how important this memory is
    created_at: datetime
    last_accessed: datetime
    access_count: int = 0
    metadata: Optional[dict] = None

class VectorMemory:
    """Long-term memory backed by vector database."""

    def __init__(self, qdrant_client, embedding_model, collection: str = "agent_memory"):
        self.qdrant = qdrant_client
        self.embedder = embedding_model
        self.collection = collection

    async def store(
        self,
        content: str,
        memory_type: str,
        user_id: str,
        session_id: str,
        importance: float = 0.5,
        metadata: Optional[dict] = None,
    ) -> str:
        """Store a new memory entry."""
        memory_id = str(uuid.uuid4())
        embedding = await self.embedder.embed(content)
        now = datetime.utcnow()

        from qdrant_client import models
        self.qdrant.upsert(
            collection_name=self.collection,
            points=[models.PointStruct(
                id=memory_id,
                vector=embedding,
                payload={
                    "content": content,
                    "memory_type": memory_type,
                    "user_id": user_id,
                    "session_id": session_id,
                    "importance": importance,
                    "created_at": now.isoformat(),
                    "last_accessed": now.isoformat(),
                    "access_count": 0,
                    **(metadata or {}),
                },
            )],
        )
        return memory_id

    async def recall(
        self,
        query: str,
        user_id: str,
        limit: int = 5,
        memory_types: Optional[list[str]] = None,
        min_importance: float = 0.0,
    ) -> list[MemoryEntry]:
        """Retrieve relevant memories using semantic search."""
        query_embedding = await self.embedder.embed(query)

        # Build filter
        from qdrant_client import models
        must_conditions = [
            models.FieldCondition(
                key="user_id",
                match=models.MatchValue(value=user_id),
            ),
        ]

        if memory_types:
            must_conditions.append(
                models.FieldCondition(
                    key="memory_type",
                    match=models.MatchAny(any=memory_types),
                )
            )

        if min_importance > 0:
            must_conditions.append(
                models.FieldCondition(
                    key="importance",
                    range=models.Range(gte=min_importance),
                )
            )

        results = self.qdrant.search(
            collection_name=self.collection,
            query_vector=query_embedding,
            query_filter=models.Filter(must=must_conditions),
            limit=limit,
            score_threshold=0.5,
        )

        memories = []
        for hit in results:
            p = hit.payload
            memories.append(MemoryEntry(
                id=str(hit.id),
                content=p["content"],
                memory_type=p["memory_type"],
                user_id=p["user_id"],
                session_id=p["session_id"],
                importance=p["importance"],
                created_at=datetime.fromisoformat(p["created_at"]),
                last_accessed=datetime.fromisoformat(p["last_accessed"]),
                access_count=p.get("access_count", 0),
            ))

            # Update access metadata
            self.qdrant.set_payload(
                collection_name=self.collection,
                payload={
                    "last_accessed": datetime.utcnow().isoformat(),
                    "access_count": p.get("access_count", 0) + 1,
                },
                points=[hit.id],
            )

        return memories
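recall 的检索语义(先按 payload 过滤,再按相似度排阈值与排序截断)可以用一个不依赖 Qdrant 的内存版本示意。recall_in_memory 为本文虚构的演示函数,仅说明流程:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """余弦相似度,向量全零时返回 0。"""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def recall_in_memory(query_vec, entries, user_id=None, limit=5, score_threshold=0.5):
    """entries: [(vector, payload)]。过滤 user_id → 过相似度阈值 → 降序取前 limit 条。"""
    hits = []
    for vec, payload in entries:
        if user_id is not None and payload.get("user_id") != user_id:
            continue
        score = cosine(query_vec, vec)
        if score >= score_threshold:
            hits.append((score, payload))
    hits.sort(key=lambda h: h[0], reverse=True)
    return [payload for _, payload in hits[:limit]]
```

Qdrant 在服务端完成同样的过滤与排序:payload 过滤(must 条件)与向量检索一起执行,避免把无关用户的数据传回客户端再筛。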

实体记忆

结构化实体提取与存储

# src/memory/entity_memory.py
import json
from datetime import datetime

from .conversation_window import Message

class EntityMemory:
    """Tracks entities mentioned in conversations."""

    def __init__(self, llm_client, db):
        self.llm = llm_client
        self.db = db

    async def extract_entities(self, messages: list[Message]) -> list[dict]:
        """Extract entities from recent messages."""
        conversation = "\n".join([f"{m.role}: {m.content}" for m in messages[-4:]])

        prompt = f"""Extract key entities from this conversation.
For each entity, provide:
- name: entity identifier
- type: person|organization|product|topic|preference
- attributes: key-value pairs of information about the entity

Conversation:
{conversation}

Output as JSON array."""

        result = await self.llm.generate(prompt, model="gpt-4o-mini")
        # The model may wrap its output in a ```json fence; strip before parsing
        cleaned = result.strip().removeprefix("```json").removeprefix("```").removesuffix("```").strip()
        try:
            return json.loads(cleaned)
        except json.JSONDecodeError:
            return []

    async def update_entities(self, user_id: str, entities: list[dict]):
        """Merge new entity information with existing records."""
        for entity in entities:
            existing = await self.db.get_entity(user_id, entity["name"])

            if existing:
                # Merge attributes (new overrides old)
                merged_attrs = {**existing.get("attributes", {}), **entity.get("attributes", {})}
                await self.db.update_entity(
                    user_id, entity["name"],
                    attributes=merged_attrs,
                    last_mentioned=datetime.utcnow(),
                )
            else:
                await self.db.create_entity(
                    user_id=user_id,
                    name=entity["name"],
                    entity_type=entity["type"],
                    attributes=entity.get("attributes", {}),
                )

    async def get_relevant_entities(self, user_id: str, context: str) -> str:
        """Get entity information relevant to current context."""
        entities = await self.db.search_entities(user_id, context, limit=10)

        if not entities:
            return ""

        entity_text = "Known information about relevant entities:\n"
        for e in entities:
            attrs = ", ".join(f"{k}: {v}" for k, v in e["attributes"].items())
            entity_text += f"- {e['name']} ({e['type']}): {attrs}\n"

        return entity_text
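update_entities 的合并语义(同名实体的属性"新覆盖旧",其余字段保留)可以用内存字典复演一遍。upsert_entities 为演示用的假设函数,用 dict 代替真实 DB:

```python
from datetime import datetime, timezone

def upsert_entities(store: dict, entities: list[dict]) -> dict:
    """内存版 update_entities:存在则合并属性(新值覆盖旧值),否则新建记录。"""
    for e in entities:
        name = e["name"]
        if name in store:
            store[name]["attributes"] = {**store[name]["attributes"], **e.get("attributes", {})}
            store[name]["last_mentioned"] = datetime.now(timezone.utc)
        else:
            store[name] = {
                "type": e.get("type", "topic"),
                "attributes": e.get("attributes", {}),
                "last_mentioned": datetime.now(timezone.utc),
            }
    return store
```

"新覆盖旧"意味着实体画像以最近一次提及为准,这对会变化的属性(如所在城市)是合理默认,但也会丢失历史值;若需保留变更轨迹,可以改为追加版本化记录。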

记忆整合层

统一记忆管理器

# src/memory/memory_manager.py
from .conversation_window import ConversationWindow, Message
from .summary_memory import SummaryMemory
from .vector_memory import VectorMemory
from .entity_memory import EntityMemory

class MemoryManager:
    """Orchestrates all memory layers for an agent."""

    def __init__(
        self,
        conversation_window: ConversationWindow,
        summary_memory: SummaryMemory,
        vector_memory: VectorMemory,
        entity_memory: EntityMemory,
    ):
        self.window = conversation_window
        self.summary = summary_memory
        self.vector = vector_memory
        self.entity = entity_memory

    async def build_context(
        self,
        user_id: str,
        current_query: str,
    ) -> list[dict]:
        """Build the complete context for LLM, integrating all memory layers."""

        # Layer 1: Long-term memories (most relevant)
        relevant_memories = await self.vector.recall(
            query=current_query,
            user_id=user_id,
            limit=5,
        )

        # Layer 2: Entity information
        entity_context = await self.entity.get_relevant_entities(
            user_id=user_id,
            context=current_query,
        )

        # Layer 3: Conversation summary
        summary = self.summary.get_context_prefix()

        # Layer 4: Recent conversation (working memory)
        messages = self.window.get_context()

        # Inject memory context into system message
        memory_context = ""
        if relevant_memories:
            memory_context += "Relevant past interactions:\n"
            for m in relevant_memories:
                memory_context += f"- [{m.memory_type}] {m.content}\n"
            memory_context += "\n"

        if entity_context:
            memory_context += entity_context + "\n"

        if summary:
            memory_context += summary + "\n"

        # Prepend memory to system message
        if messages and messages[0]["role"] == "system":
            messages[0]["content"] = memory_context + messages[0]["content"]
        else:
            messages.insert(0, {"role": "system", "content": memory_context})

        return messages

    async def process_turn(
        self,
        user_message: Message,
        assistant_message: Message,
        user_id: str,
        session_id: str,
    ):
        """Process a completed conversation turn for memory updates."""

        # Add to working memory
        evicted = self.window.add_message(user_message)
        evicted += self.window.add_message(assistant_message)

        # Compress evicted messages into summary
        if evicted:
            await self.summary.compress(evicted)

        # Extract and update entities
        entities = await self.entity.extract_entities(
            [user_message, assistant_message]
        )
        if entities:
            await self.entity.update_entities(user_id, entities)

        # Store important information in long-term memory
        importance = await self._assess_importance(user_message, assistant_message)
        if importance > 0.6:
            await self.vector.store(
                content=f"User: {user_message.content}\nAssistant: {assistant_message.content}",
                memory_type="experience",
                user_id=user_id,
                session_id=session_id,
                importance=importance,
            )

    async def _assess_importance(
        self,
        user_msg: Message,
        assistant_msg: Message,
    ) -> float:
        """Assess if this turn contains important information worth remembering."""
        prompt = f"""Rate the importance of this conversation turn for long-term memory (0-1).
High importance: user preferences, decisions, personal information, key instructions.
Low importance: greetings, small talk, routine questions.

User: {user_msg.content}
Assistant: {assistant_msg.content}

Respond with only a number."""

        score_str = await self.summary.llm.generate(prompt, model="gpt-4o-mini")
        try:
            return float(score_str.strip())
        except ValueError:
            return 0.5

记忆遗忘与衰减

| 策略 | 机制 | 参数 | 效果 |
| --- | --- | --- | --- |
| 时间衰减 | importance *= decay^days_since_access | decay=0.95 | 旧记忆逐渐淡化 |
| 访问频率 | 频繁召回的记忆权重提升 | access_boost=0.1 | 常用记忆更易召回 |
| 容量限制 | 单用户记忆上限 | max_entries=10000 | 防止存储膨胀 |
| 主动遗忘 | 用户请求删除特定记忆 | N/A | 隐私合规 |
| 合并去重 | 语义相似的记忆合并 | similarity>0.95 | 减少冗余 |
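表中前两条策略可以直接写成公式化的小函数(decay 与 access_boost 取表中默认值,仅作示意):

```python
def decayed_importance(importance: float, days_since_access: float,
                       decay: float = 0.95) -> float:
    """时间衰减:importance *= decay^days_since_access,久不访问的记忆逐渐淡化。"""
    return importance * (decay ** days_since_access)

def recall_weight(importance: float, access_count: int,
                  access_boost: float = 0.1) -> float:
    """访问频率加成:每次召回提升权重 access_boost,封顶 1.0。"""
    return min(importance + access_count * access_boost, 1.0)
```

两者可以在 recall 排序时组合使用:先对 importance 做衰减,再叠加访问加成,使得"既重要又常用"的记忆排在最前。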

总结

  1. 三层记忆互补:工作记忆处理当前对话,短期记忆(摘要)保持会话连续性,长期记忆实现跨会话学习。
  2. 摘要是最经济的上下文管理:当对话超过窗口限制时,渐进式摘要比截断更好地保留了关键信息。
  3. 向量记忆实现"越用越聪明":每次交互中的重要信息沉淀为长期记忆,下次相似场景自动召回。
  4. 实体记忆构建用户画像:结构化的实体属性比自由文本更易于精确检索和更新。
  5. 记忆遗忘同样重要:无限增长的记忆会降低检索精度,适度的遗忘和合并是必要的。

Maurice | maurice_wen@proton.me