Agent Memory System Design: Short-term, Long-term, and Working Memory
灵阙教研团队
Updated 2026-02-28
Memory architectures (Buffer/Summary/Entity/Vector), conversation window management, RAG memory retrieval, and episodic memory in practice
Introduction
The human memory system is commonly divided into short-term memory (working memory, with a capacity of roughly 7 items), long-term memory (near-unlimited capacity), and episodic memory (recall of specific events). LLM agents face an analogous challenge: the context window is limited (akin to working-memory capacity), yet the agent must stay coherent across many turns and across sessions.
An agent without a memory system is a "goldfish": every interaction starts from zero, no experience accumulates, and no lasting relationship with the user can be built. This article works through a three-layer memory architecture for agents.
Memory Architecture Overview
The Three-Layer Memory Model
┌──────────────────────────────────────────────────────────┐
│                   Agent Memory System                    │
│                                                          │
│  ┌──────────────────────────────────────────────────┐    │
│  │ Layer 1: Working Memory                          │    │
│  │ Capacity: context window (128K tokens)           │    │
│  │ Lifetime: current session                        │    │
│  │ Form: dialogue history + current task state      │    │
│  └──────────────────────────┬───────────────────────┘    │
│                             │ overflow / compress        │
│  ┌──────────────────────────▼───────────────────────┐    │
│  │ Layer 2: Short-term Memory                       │    │
│  │ Capacity: medium (summary of last N turns)       │    │
│  │ Lifetime: across turns, within the session       │    │
│  │ Form: summary + key entities + action items      │    │
│  └──────────────────────────┬───────────────────────┘    │
│                             │ consolidate                │
│  ┌──────────────────────────▼───────────────────────┐    │
│  │ Layer 3: Long-term Memory                        │    │
│  │ Capacity: unlimited (vector DB + structured DB)  │    │
│  │ Lifetime: permanent (persists across sessions)   │    │
│  │ Form: vector index + entity graph + episodes     │    │
│  └──────────────────────────────────────────────────┘    │
└──────────────────────────────────────────────────────────┘
Memory Type Comparison
| Memory type | Capacity | Lifetime | Storage | Retrieval | Best for |
|---|---|---|---|---|---|
| Buffer Memory | last K turns | current session | in-process memory | direct concatenation | simple chat |
| Summary Memory | compressed summary | current session | in-process memory | prepended summary | long conversations |
| Entity Memory | entity-attribute table | cross-session | DB | entity lookup | user profiles |
| Vector Memory | unlimited | permanent | vector DB | semantic search | knowledge accumulation |
| Episodic Memory | event snapshots | permanent | DB | time + semantic | learning from experience |
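The simplest row in this table, Buffer Memory, amounts to keeping the last K turns and concatenating them into the prompt. A minimal sketch (the class name and structure are illustrative, not taken from any particular library):

```python
# Minimal Buffer Memory sketch: keep only the last K turns and
# concatenate them directly into the prompt context.
class BufferMemory:
    def __init__(self, k: int = 5):
        self.k = k  # number of (user, assistant) turns to keep
        self.turns: list[tuple[str, str]] = []

    def add_turn(self, user: str, assistant: str) -> None:
        self.turns.append((user, assistant))
        # Discard everything older than the last k turns.
        self.turns = self.turns[-self.k:]

    def as_context(self) -> str:
        return "\n".join(
            f"user: {u}\nassistant: {a}" for u, a in self.turns
        )


mem = BufferMemory(k=2)
for i in range(4):
    mem.add_turn(f"question {i}", f"answer {i}")

print(len(mem.turns))   # only the last 2 turns survive
print(mem.as_context())
```

The appeal is zero extra cost and zero information loss within the window; the drawback, as the table notes, is that anything older than K turns vanishes entirely, which is why the heavier memory types below exist.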
Working Memory Management
Conversation Window Strategy
# src/memory/conversation_window.py
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    role: str  # "system" | "user" | "assistant"
    content: str
    token_count: int = 0
    timestamp: float = 0.0


class ConversationWindow:
    """Manages the conversation context within token limits."""

    def __init__(
        self,
        max_tokens: int = 100_000,
        system_reserve: int = 5_000,
        output_reserve: int = 4_000,
    ):
        self.max_tokens = max_tokens
        self.system_reserve = system_reserve
        self.output_reserve = output_reserve
        self.messages: list[Message] = []
        self.system_message: Optional[Message] = None

    @property
    def available_tokens(self) -> int:
        used = sum(m.token_count for m in self.messages)
        if self.system_message:
            used += self.system_message.token_count
        else:
            # Reserve room for a system prompt that may be set later.
            used += self.system_reserve
        return self.max_tokens - used - self.output_reserve

    def add_message(self, message: Message) -> list[Message]:
        """Add a message, evicting the oldest ones if necessary.

        Returns the evicted messages so they can be summarized.
        """
        evicted: list[Message] = []
        self.messages.append(message)
        # Evict oldest messages until within budget, but always keep
        # at least the latest user/assistant pair.
        while self.available_tokens < 0 and len(self.messages) > 2:
            evicted.append(self.messages.pop(0))
        return evicted

    def get_context(self) -> list[dict]:
        """Build the message list for an LLM call."""
        context = []
        if self.system_message:
            context.append({"role": "system", "content": self.system_message.content})
        for msg in self.messages:
            context.append({"role": msg.role, "content": msg.content})
        return context

    def get_summary_candidates(self, threshold: int = 10) -> list[Message]:
        """Return messages older than the latest `threshold` for summarization."""
        if len(self.messages) > threshold:
            return self.messages[: len(self.messages) - threshold]
        return []
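The eviction policy can be exercised in isolation. A self-contained sketch (`Msg` and `evict_to_budget` are illustrative stand-ins for the `Message`/`ConversationWindow` pair above, with the same drop-oldest-first rule):

```python
from dataclasses import dataclass


@dataclass
class Msg:
    role: str
    content: str
    tokens: int


def evict_to_budget(
    messages: list[Msg], budget: int, keep_last: int = 2
) -> tuple[list[Msg], list[Msg]]:
    """Drop oldest messages until the total token count fits the budget.

    Returns (kept, evicted); always keeps the last `keep_last` messages
    so the current user/assistant pair survives.
    """
    kept = list(messages)
    evicted: list[Msg] = []
    while sum(m.tokens for m in kept) > budget and len(kept) > keep_last:
        evicted.append(kept.pop(0))
    return kept, evicted


msgs = [Msg("user", f"m{i}", 100) for i in range(10)]
kept, evicted = evict_to_budget(msgs, budget=350)
print(len(kept), len(evicted))  # 3 kept, 7 evicted
```

The evicted list is exactly what gets handed to the summarizer in the next section, so nothing is silently discarded.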
Automatic Summary Compression
# src/memory/summary_memory.py
class SummaryMemory:
    """Progressively summarizes conversation history."""

    def __init__(self, llm_client, max_summary_tokens: int = 2000):
        self.llm = llm_client
        self.max_summary_tokens = max_summary_tokens
        self.running_summary: str = ""
        self.summarized_count: int = 0

    async def compress(self, messages_to_compress: list[Message]) -> str:
        """Compress messages into an updated running summary."""
        if not messages_to_compress:
            return self.running_summary
        conversation = "\n".join(
            f"{m.role}: {m.content}" for m in messages_to_compress
        )
        prompt = f"""Progressively summarize the conversation, incorporating new lines into the existing summary.

EXISTING SUMMARY:
{self.running_summary or '(none)'}

NEW CONVERSATION LINES:
{conversation}

Produce a concise summary that captures:
1. Key topics discussed
2. Important decisions made
3. User preferences expressed
4. Pending action items
5. Critical context for future responses

Keep the summary under {self.max_summary_tokens} tokens."""
        self.running_summary = await self.llm.generate(prompt, model="gpt-4o-mini")
        self.summarized_count += len(messages_to_compress)
        return self.running_summary

    def get_context_prefix(self) -> str:
        if not self.running_summary:
            return ""
        return f"[CONVERSATION HISTORY SUMMARY]\n{self.running_summary}\n[END SUMMARY]\n"
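The progressive pattern above (old summary + new lines → new summary) can be verified without a real model. A self-contained sketch with a stub client (`EchoLLM` is a fake that keeps the tail of the prompt; a real implementation would call a model such as gpt-4o-mini):

```python
import asyncio


class EchoLLM:
    """Stub LLM client: 'summarizes' by keeping the last 80 characters.

    Only for demonstrating the data flow; not a real summarizer.
    """

    async def generate(self, prompt: str, model: str = "stub") -> str:
        return prompt[-80:]


class RunningSummary:
    """Minimal progressive summary: each batch folds into the prior summary."""

    def __init__(self, llm):
        self.llm = llm
        self.summary = ""

    async def compress(self, lines: list[str]) -> str:
        prompt = f"EXISTING SUMMARY:\n{self.summary}\nNEW LINES:\n" + "\n".join(lines)
        self.summary = await self.llm.generate(prompt)
        return self.summary


async def main():
    rs = RunningSummary(EchoLLM())
    await rs.compress(["user: hi", "assistant: hello"])
    # The second batch sees the first batch only through the summary.
    return await rs.compress(["user: my name is Ada"])


result = asyncio.run(main())
print(result)
```

The key property to notice is that the model never re-reads old raw messages; state is carried forward entirely by the summary string, which is what keeps the token cost bounded.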
Long-term Memory (Vector Memory)
Memory Storage and Retrieval
# src/memory/vector_memory.py
import uuid
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

from qdrant_client import models


@dataclass
class MemoryEntry:
    id: str
    content: str
    memory_type: str  # "fact" | "preference" | "experience" | "instruction"
    user_id: str
    session_id: str
    importance: float  # 0-1, how important this memory is
    created_at: datetime
    last_accessed: datetime
    access_count: int = 0
    metadata: Optional[dict] = None


class VectorMemory:
    """Long-term memory backed by a vector database."""

    def __init__(self, qdrant_client, embedding_model, collection: str = "agent_memory"):
        self.qdrant = qdrant_client
        self.embedder = embedding_model
        self.collection = collection

    async def store(
        self,
        content: str,
        memory_type: str,
        user_id: str,
        session_id: str,
        importance: float = 0.5,
        metadata: Optional[dict] = None,
    ) -> str:
        """Store a new memory entry."""
        memory_id = str(uuid.uuid4())
        embedding = await self.embedder.embed(content)
        now = datetime.utcnow()
        self.qdrant.upsert(
            collection_name=self.collection,
            points=[models.PointStruct(
                id=memory_id,
                vector=embedding,
                payload={
                    "content": content,
                    "memory_type": memory_type,
                    "user_id": user_id,
                    "session_id": session_id,
                    "importance": importance,
                    "created_at": now.isoformat(),
                    "last_accessed": now.isoformat(),
                    "access_count": 0,
                    **(metadata or {}),
                },
            )],
        )
        return memory_id

    async def recall(
        self,
        query: str,
        user_id: str,
        limit: int = 5,
        memory_types: Optional[list[str]] = None,
        min_importance: float = 0.0,
    ) -> list[MemoryEntry]:
        """Retrieve relevant memories using semantic search."""
        query_embedding = await self.embedder.embed(query)
        # Build filter: always scope to the user, optionally by type/importance
        must_conditions = [
            models.FieldCondition(
                key="user_id",
                match=models.MatchValue(value=user_id),
            ),
        ]
        if memory_types:
            must_conditions.append(
                models.FieldCondition(
                    key="memory_type",
                    match=models.MatchAny(any=memory_types),
                )
            )
        if min_importance > 0:
            must_conditions.append(
                models.FieldCondition(
                    key="importance",
                    range=models.Range(gte=min_importance),
                )
            )
        results = self.qdrant.search(
            collection_name=self.collection,
            query_vector=query_embedding,
            query_filter=models.Filter(must=must_conditions),
            limit=limit,
            score_threshold=0.5,
        )
        memories = []
        for hit in results:
            p = hit.payload
            memories.append(MemoryEntry(
                id=str(hit.id),
                content=p["content"],
                memory_type=p["memory_type"],
                user_id=p["user_id"],
                session_id=p["session_id"],
                importance=p["importance"],
                created_at=datetime.fromisoformat(p["created_at"]),
                last_accessed=datetime.fromisoformat(p["last_accessed"]),
                access_count=p.get("access_count", 0),
            ))
            # Update access metadata for each recalled memory
            self.qdrant.set_payload(
                collection_name=self.collection,
                payload={
                    "last_accessed": datetime.utcnow().isoformat(),
                    "access_count": p.get("access_count", 0) + 1,
                },
                points=[hit.id],
            )
        return memories
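The `recall` above ranks purely by vector similarity (with an importance floor). A common refinement, in the style of the Generative Agents paper, is to re-rank hits by a weighted sum of relevance, stored importance, and recency. A sketch with illustrative weights and decay rate (none of these values come from the code above):

```python
from datetime import datetime, timedelta


def memory_score(
    similarity: float,      # cosine similarity from vector search, 0-1
    importance: float,      # stored importance, 0-1
    last_accessed: datetime,
    now: datetime,
    decay: float = 0.99,    # per-hour recency decay (illustrative value)
    weights: tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> float:
    """Re-rank a recalled memory by relevance + importance + recency."""
    hours = max(0.0, (now - last_accessed).total_seconds() / 3600)
    recency = decay ** hours  # 1.0 if just accessed, decays toward 0
    w_sim, w_imp, w_rec = weights
    return w_sim * similarity + w_imp * importance + w_rec * recency


now = datetime(2026, 1, 2)
fresh = memory_score(0.7, 0.5, now - timedelta(hours=1), now)
stale = memory_score(0.7, 0.5, now - timedelta(hours=500), now)
print(fresh > stale)  # recent memories outrank equally relevant stale ones
```

Since `recall` already refreshes `last_accessed` and `access_count` on every hit, those payload fields are exactly the inputs this re-ranking needs.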
Entity Memory
Structured Entity Extraction and Storage
# src/memory/entity_memory.py
import json
from datetime import datetime


def parse_json(text: str) -> list[dict]:
    """Parse the model's JSON output, tolerating markdown code fences."""
    text = text.strip()
    if text.startswith("```"):
        text = text.strip("`").strip()
        if text.startswith("json"):
            text = text[4:]
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return []


class EntityMemory:
    """Tracks entities mentioned in conversations."""

    def __init__(self, llm_client, db):
        self.llm = llm_client
        self.db = db

    async def extract_entities(self, messages: list[Message]) -> list[dict]:
        """Extract entities from recent messages."""
        conversation = "\n".join(f"{m.role}: {m.content}" for m in messages[-4:])
        prompt = f"""Extract key entities from this conversation.

For each entity, provide:
- name: entity identifier
- type: person|organization|product|topic|preference
- attributes: key-value pairs of information about the entity

Conversation:
{conversation}

Output as JSON array."""
        result = await self.llm.generate(prompt, model="gpt-4o-mini")
        return parse_json(result)

    async def update_entities(self, user_id: str, entities: list[dict]):
        """Merge new entity information with existing records."""
        for entity in entities:
            existing = await self.db.get_entity(user_id, entity["name"])
            if existing:
                # Merge attributes (new values override old ones)
                merged_attrs = {**existing.get("attributes", {}), **entity.get("attributes", {})}
                await self.db.update_entity(
                    user_id, entity["name"],
                    attributes=merged_attrs,
                    last_mentioned=datetime.utcnow(),
                )
            else:
                await self.db.create_entity(
                    user_id=user_id,
                    name=entity["name"],
                    entity_type=entity["type"],
                    attributes=entity.get("attributes", {}),
                )

    async def get_relevant_entities(self, user_id: str, context: str) -> str:
        """Get entity information relevant to the current context."""
        entities = await self.db.search_entities(user_id, context, limit=10)
        if not entities:
            return ""
        entity_text = "Known information about relevant entities:\n"
        for e in entities:
            attrs = ", ".join(f"{k}: {v}" for k, v in e["attributes"].items())
            entity_text += f"- {e['name']} ({e['type']}): {attrs}\n"
        return entity_text
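The `db` object is never specified in this article; EntityMemory only assumes it exposes awaitable `get_entity`, `create_entity`, `update_entity`, and `search_entities` methods. One possible in-memory stand-in, useful for tests and for making the merge semantics concrete (everything here, including the naive substring "search", is an assumption):

```python
import asyncio


class InMemoryEntityDB:
    """Toy async entity store keyed by (user_id, name).

    Illustrates the db interface EntityMemory expects; not a production design.
    """

    def __init__(self):
        self._rows: dict[tuple[str, str], dict] = {}

    async def get_entity(self, user_id: str, name: str):
        return self._rows.get((user_id, name))

    async def create_entity(self, user_id: str, name: str, entity_type: str, attributes: dict):
        self._rows[(user_id, name)] = {"name": name, "type": entity_type, "attributes": attributes}

    async def update_entity(self, user_id: str, name: str, attributes: dict, **_):
        self._rows[(user_id, name)]["attributes"] = attributes

    async def search_entities(self, user_id: str, context: str, limit: int = 10):
        # Naive "search": entity name appears as a substring of the context.
        return [row for (uid, name), row in self._rows.items()
                if uid == user_id and name.lower() in context.lower()][:limit]


async def demo():
    db = InMemoryEntityDB()
    await db.create_entity("u1", "Ada", "person", {"role": "engineer"})
    old = (await db.get_entity("u1", "Ada"))["attributes"]
    # Merge semantics used by EntityMemory: new values override old ones.
    await db.update_entity("u1", "Ada", {**old, "role": "manager", "team": "infra"})
    return await db.search_entities("u1", "What is Ada working on?")


rows = asyncio.run(demo())
print(rows[0]["attributes"])
```

A real backend would replace the substring match with full-text or embedding search, but the interface contract stays the same.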
The Memory Integration Layer
A Unified Memory Manager
# src/memory/memory_manager.py
class MemoryManager:
    """Orchestrates all memory layers for an agent."""

    def __init__(
        self,
        conversation_window: ConversationWindow,
        summary_memory: SummaryMemory,
        vector_memory: VectorMemory,
        entity_memory: EntityMemory,
    ):
        self.window = conversation_window
        self.summary = summary_memory
        self.vector = vector_memory
        self.entity = entity_memory

    async def build_context(
        self,
        user_id: str,
        current_query: str,
    ) -> list[dict]:
        """Build the complete LLM context, integrating all memory layers."""
        # Layer 1: long-term memories (most relevant)
        relevant_memories = await self.vector.recall(
            query=current_query,
            user_id=user_id,
            limit=5,
        )
        # Layer 2: entity information
        entity_context = await self.entity.get_relevant_entities(
            user_id=user_id,
            context=current_query,
        )
        # Layer 3: conversation summary
        summary = self.summary.get_context_prefix()
        # Layer 4: recent conversation (working memory)
        messages = self.window.get_context()
        # Inject memory context into the system message
        memory_context = ""
        if relevant_memories:
            memory_context += "Relevant past interactions:\n"
            for m in relevant_memories:
                memory_context += f"- [{m.memory_type}] {m.content}\n"
            memory_context += "\n"
        if entity_context:
            memory_context += entity_context + "\n"
        if summary:
            memory_context += summary + "\n"
        # Prepend memory to the system message
        if messages and messages[0]["role"] == "system":
            messages[0]["content"] = memory_context + messages[0]["content"]
        else:
            messages.insert(0, {"role": "system", "content": memory_context})
        return messages

    async def process_turn(
        self,
        user_message: Message,
        assistant_message: Message,
        user_id: str,
        session_id: str,
    ):
        """Process a completed conversation turn for memory updates."""
        # Add to working memory
        evicted = self.window.add_message(user_message)
        evicted += self.window.add_message(assistant_message)
        # Compress evicted messages into the running summary
        if evicted:
            await self.summary.compress(evicted)
        # Extract and update entities
        entities = await self.entity.extract_entities(
            [user_message, assistant_message]
        )
        if entities:
            await self.entity.update_entities(user_id, entities)
        # Store important information in long-term memory
        importance = await self._assess_importance(user_message, assistant_message)
        if importance > 0.6:
            await self.vector.store(
                content=f"User: {user_message.content}\nAssistant: {assistant_message.content}",
                memory_type="experience",
                user_id=user_id,
                session_id=session_id,
                importance=importance,
            )

    async def _assess_importance(
        self,
        user_msg: Message,
        assistant_msg: Message,
    ) -> float:
        """Assess whether this turn contains information worth remembering."""
        prompt = f"""Rate the importance of this conversation turn for long-term memory (0-1).
High importance: user preferences, decisions, personal information, key instructions.
Low importance: greetings, small talk, routine questions.

User: {user_msg.content}
Assistant: {assistant_msg.content}

Respond with only a number."""
        score_str = await self.summary.llm.generate(prompt, model="gpt-4o-mini")
        try:
            # Clamp to [0, 1] in case the model returns an out-of-range value
            return max(0.0, min(1.0, float(score_str.strip())))
        except ValueError:
            return 0.5
Memory Forgetting and Decay
| Strategy | Mechanism | Parameter | Effect |
|---|---|---|---|
| Time decay | importance *= decay^days_since_access | decay=0.95 | old memories fade gradually |
| Access frequency | frequently recalled memories get a weight boost | access_boost=0.1 | frequently used memories are easier to recall |
| Capacity cap | per-user memory limit | max_entries=10000 | prevents storage bloat |
| Active forgetting | user-requested deletion of specific memories | N/A | privacy compliance |
| Merge & dedup | semantically similar memories are merged | similarity>0.95 | reduces redundancy |
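The time-decay rule in the first row can run as a periodic maintenance job over stored memories. A sketch using the table's decay=0.95 per day; the prune threshold of 0.05 is an assumed value, not from the table:

```python
def decay_importance(
    importance: float, days_since_access: float, decay: float = 0.95
) -> float:
    """Exponential time decay: importance *= decay ** days_since_access."""
    return importance * (decay ** days_since_access)


def prune(memories: list[dict], min_importance: float = 0.05) -> list[dict]:
    """Drop memories whose decayed importance falls below a floor.

    The 0.05 floor is an assumed threshold for illustration.
    """
    kept = []
    for m in memories:
        m = {**m, "importance": decay_importance(m["importance"], m["days_idle"])}
        if m["importance"] >= min_importance:
            kept.append(m)
    return kept


mems = [
    {"id": "a", "importance": 0.8, "days_idle": 2},    # recently accessed
    {"id": "b", "importance": 0.8, "days_idle": 120},  # idle for months
]
result = prune(mems)
print([m["id"] for m in result])  # only "a" survives
```

Because VectorMemory already records `last_accessed` on every recall, `days_idle` is directly computable, and the access-frequency boost from the table's second row can be folded into the same job by adding `access_boost * access_count` before pruning.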
Summary
- The three memory layers are complementary: working memory handles the current conversation, short-term memory (summaries) keeps the session coherent, and long-term memory enables learning across sessions.
- Summarization is the most economical form of context management: once a conversation exceeds the window limit, progressive summarization preserves key information far better than truncation.
- Vector memory makes the agent "smarter with use": important information from each interaction is consolidated into long-term memory and recalled automatically in similar future situations.
- Entity memory builds the user profile: structured entity attributes are easier to query and update precisely than free text.
- Forgetting matters as much as remembering: an unboundedly growing memory store degrades retrieval precision, so moderate forgetting and merging are necessary.
Maurice | maurice_wen@proton.me