Neo4j in Practice: Graph Databases in AI Applications
灵阙教研团队 (Lingque teaching and research team)
Updated 2026-02-28
An engineering guide to Cypher query basics, graph modeling patterns, GDS algorithms, vector indexes, and LLM integration
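Introduction
The role of graph databases in AI applications is shifting from "auxiliary storage" to "core reasoning engine". Neo4j, one of the most widely adopted graph databases, offers native vector indexes, the Graph Data Science (GDS) library, and an LLM integration toolchain, which makes it a common choice for knowledge graphs, GraphRAG, and recommendation systems. This article walks from Cypher basics to advanced AI integration, covering the practical path for using Neo4j in AI applications.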
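Cypher Query Basics
Core Syntax
Cypher is Neo4j's declarative graph query language. Its design philosophy is to make graph queries as intuitive as drawing the graph: nodes are written in parentheses and relationships as arrows between them.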
// === Creating nodes and relationships ===
// Create nodes
CREATE (p:Person {name: 'Zhang Wei', age: 35, title: 'CTO'})
CREATE (c:Company {name: 'TechCorp', founded: 2020, industry: 'AI'});
// Create a relationship between two existing nodes
MATCH (p:Person {name: 'Zhang Wei'}), (c:Company {name: 'TechCorp'})
CREATE (p)-[:WORKS_AT {since: 2021, role: 'CTO'}]->(c);
// === Query patterns ===
// Basic match
MATCH (p:Person)-[:WORKS_AT]->(c:Company)
WHERE c.industry = 'AI'
RETURN p.name, c.name, p.title;
// Multi-hop query (friends of friends: exactly two KNOWS hops)
MATCH (a:Person {name: 'Zhang Wei'})-[:KNOWS*2]->(fof:Person)
WHERE fof <> a
RETURN DISTINCT fof.name;
// Variable-length path (1 to 5 hops)
MATCH path = (a:Person)-[:REPORTS_TO*1..5]->(ceo:Person {title: 'CEO'})
RETURN path, length(path) AS depth;
// Shortest path (unbounded here; consider an upper bound on large graphs)
MATCH path = shortestPath(
  (a:Person {name: 'Zhang Wei'})-[:KNOWS*]-(b:Person {name: 'Li Na'})
)
RETURN path, length(path);
// Aggregation and sorting
MATCH (c:Company)<-[:WORKS_AT]-(p:Person)
RETURN c.name, count(p) AS employee_count, collect(p.name) AS employees
ORDER BY employee_count DESC
LIMIT 10;
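To make the traversal semantics concrete, the following is a minimal in-memory sketch of what the two-hop match and the shortest-path query compute. It is not how Neo4j executes them; the `KNOWS` adjacency map and the helper names `two_hop`, `undirected`, and `shortest_path` are illustrative assumptions, and the sample data is invented.

```python
from collections import deque

# Hypothetical directed KNOWS edges (illustrative data only).
KNOWS = {
    "Zhang Wei": ["Li Na", "Wang Fang"],
    "Li Na": ["Chen Jie"],
    "Wang Fang": ["Chen Jie", "Zhang Wei"],
    "Chen Jie": ["Zhao Lei"],
    "Zhao Lei": [],
}


def two_hop(graph, start):
    """Nodes reachable in exactly two directed hops, excluding the start node."""
    found = set()
    for friend in graph.get(start, ()):
        for fof in graph.get(friend, ()):
            if fof != start:
                found.add(fof)
    return found


def undirected(graph):
    """Build an undirected view, mirroring the direction-less -[:KNOWS*]- pattern."""
    view = {}
    for src, targets in graph.items():
        view.setdefault(src, set())
        for dst in targets:
            view[src].add(dst)
            view.setdefault(dst, set()).add(src)
    return view


def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path as a node list, or None."""
    if start == goal:
        return [start]
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt in parents:
                continue
            parents[nxt] = node
            if nxt == goal:
                path = [goal]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(nxt)
    return None
```

Calling `two_hop(KNOWS, "Zhang Wei")` plays the role of the `[:KNOWS*2]` match with the `fof <> a` filter, and `shortest_path(undirected(KNOWS), ...)` plays the role of `shortestPath` over an undirected pattern.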
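Common Query Patterns
// Pattern 1: Recommendation (collaborative filtering)
// Find people who watched the same movies as me, then recommend movies they watched that I have not
MATCH (me:Person {name: 'Zhang Wei'})-[:WATCHED]->(m:Movie)<-[:WATCHED]-(other:Person)
MATCH (other)-[:WATCHED]->(rec:Movie)
WHERE NOT (me)-[:WATCHED]->(rec)
RETURN rec.title, count(other) AS score
ORDER BY score DESC
LIMIT 10;
// Pattern 2: Influence (a PageRank use case)
// Follower count is only a simple in-degree proxy; PageRank in GDS also weighs who the followers are
MATCH (p:Person)-[:FOLLOWS]->(target:Person)
WITH target, count(p) AS followers
ORDER BY followers DESC
RETURN target.name, followers
LIMIT 20;
// Pattern 3: Community detection
// Plain grouping by city; algorithmic community detection uses GDS (for example Louvain)
MATCH (p:Person)-[:WORKS_AT]->(c:Company)-[:LOCATED_IN]->(city:City)
RETURN city.name, collect(DISTINCT c.name) AS companies,
       count(DISTINCT p) AS talent_pool
ORDER BY talent_pool DESC;
// Pattern 4: Knowledge graph question answering (outgoing facts about one entity)
MATCH (e:Entity {name: $entity_name})-[r]->(related)
RETURN type(r) AS relation, related.name, labels(related) AS types;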
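The collaborative-filtering pattern can be illustrated outside the database with a small sketch. The `WATCHED` map and the `recommend` function are hypothetical names with invented data. This sketch counts each overlapping viewer once per recommended movie, whereas the Cypher above counts one row per matching path, so a viewer who shares several movies with me weighs more there.

```python
from collections import Counter

# Hypothetical viewing history (illustrative data only).
WATCHED = {
    "Zhang Wei": {"Inception", "Interstellar"},
    "Li Na": {"Inception", "Tenet", "Dunkirk"},
    "Wang Fang": {"Interstellar", "Tenet"},
    "Chen Jie": {"Dunkirk"},
}


def recommend(watched, me, limit=10):
    """Score unseen movies by how many overlapping viewers watched them."""
    mine = watched[me]
    scores = Counter()
    for other, movies in watched.items():
        # Only people who share at least one movie with me contribute.
        if other == me or not (movies & mine):
            continue
        for movie in movies - mine:
            scores[movie] += 1
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]
```

With the sample data, `recommend(WATCHED, "Zhang Wei")` ranks "Tenet" above "Dunkirk" because two overlapping viewers watched it.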
Graph Modeling Patterns
Typical Graph Models for AI Applications
Knowledge graph model
(Person)─────[:WORKS_AT]────→(Company)
   │                             │
   │[:KNOWS]                     │[:LOCATED_IN]
   │                             │
   ▼                             ▼
(Person)                     (Location)
   │
   │[:AUTHORED]
   │
   ▼
(Paper)─────[:CITES]─────→(Paper)
   │
   │[:ABOUT]
   │
   ▼
(Topic)─────[:SUBTOPIC_OF]──→(Topic)
RAG knowledge base model
(Document)───[:HAS_CHUNK]───→(Chunk)
    │                           │
    │[:IN_COLLECTION]           │[:SIMILAR_TO]
    │                           │
    ▼                           ▼
(Collection)                 (Chunk)
    │                           │
    │                           │[:MENTIONS]
    │                           │
    │                           ▼
    │                        (Entity)
    │                           │
    │                           │[:RELATED_TO]
    │                           ▼
    │                        (Entity)
    └───[:TAGGED_WITH]────→(Tag)
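Modeling Best Practices
| Principle | Description | Example |
|---|---|---|
| Entity → node | Model things that exist independently as nodes | Person, Company, Product |
| Action → relationship | Model dynamic connections between entities as relationships | WORKS_AT, PURCHASED |
| Attribute → property | Attach descriptive information as properties | name, created_date |
| Intermediate entity | Promote an N:M relationship to a node when it carries rich properties | Order, Transaction |
| Avoid supernodes | As a rule of thumb, keep a single node below roughly 100,000 relationships | Shard by time or region |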
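The supernode guideline can be checked before loading data. Below is a minimal sketch that counts relationships per node in an edge list and flags nodes above a threshold. The `find_supernodes` name and the `(source, type, target)` tuple shape are assumptions for illustration; the default threshold simply mirrors the rule of thumb in the table.

```python
from collections import Counter


def find_supernodes(edges, threshold=100_000):
    """Return {node: degree} for nodes whose relationship count exceeds threshold.

    edges is an iterable of (source, rel_type, target) tuples; both endpoints
    of every relationship count toward the degree.
    """
    degree = Counter()
    for src, _rel_type, dst in edges:
        degree[src] += 1
        degree[dst] += 1
    return {node: count for node, count in degree.items() if count > threshold}
```

Nodes reported here are candidates for sharding, for example by splitting one `City` node into per-year or per-district nodes.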
GDS Graph Algorithms
Common Algorithm Categories
# Neo4j GDS algorithm categories and use cases.
# Complexities are rough textbook orders, not measured GDS performance.
gds_algorithms = {
    "centrality": {
        "PageRank": {
            "use_case": "Identify influential nodes",
            "complexity": "O(V + E) per iteration",
            "ai_application": "Entity importance ranking in KG",
        },
        "Betweenness": {
            "use_case": "Find bridge nodes",
            "complexity": "O(V * E)",
            "ai_application": "Key connector identification",
        },
        "Degree": {
            "use_case": "Count connections",
            "complexity": "O(V)",
            "ai_application": "Popularity/activity scoring",
        },
    },
    "community_detection": {
        "Louvain": {
            "use_case": "Find communities/clusters",
            "complexity": "O(V * log V)",
            "ai_application": "Topic clustering, user segmentation",
        },
        "Label Propagation": {
            "use_case": "Fast community detection",
            "complexity": "O(V + E) per iteration",
            "ai_application": "Real-time community assignment",
        },
    },
    "similarity": {
        "Node Similarity": {
            "use_case": "Find similar nodes by neighbors",
            "complexity": "O(V^2)",
            "ai_application": "Recommendation, entity matching",
        },
        "KNN": {
            "use_case": "K nearest neighbors",
            "complexity": "O(V * log V)",
            "ai_application": "Vector similarity search",
        },
    },
    "path_finding": {
        "Shortest Path": {
            "use_case": "Find optimal paths",
            "complexity": "O(V + E) unweighted; O((V + E) log V) with Dijkstra",
            "ai_application": "Relationship explanation",
        },
        "All Shortest Paths": {
            "use_case": "Find all optimal paths",
            "complexity": "O(V * E)",
            "ai_application": "Multi-hop reasoning",
        },
    },
}
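To show why PageRank costs one pass over nodes and edges per iteration, here is a minimal power-iteration sketch in plain Python. It is the normalized textbook variant, not the GDS implementation, so absolute scores will differ from `gds.pageRank` output; the function name and the edge-list input are assumptions for illustration.

```python
def pagerank(edges, damping=0.85, iterations=20):
    """Textbook PageRank on a list of (source, target) edges.

    Ranks are normalized to sum to 1; rank held by nodes without outgoing
    edges (dangling nodes) is redistributed uniformly.
    """
    nodes = sorted({node for edge in edges for node in edge})
    n = len(nodes)
    out_degree = {node: 0 for node in nodes}
    for src, _dst in edges:
        out_degree[src] += 1

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # One pass over the nodes for the dangling mass ...
        dangling = sum(rank[node] for node in nodes if out_degree[node] == 0)
        base = (1.0 - damping) / n + damping * dangling / n
        nxt = {node: base for node in nodes}
        # ... and one pass over the edges to push rank along them.
        for src, dst in edges:
            nxt[dst] += damping * rank[src] / out_degree[src]
        rank = nxt
    return rank
```

The `damping` and `iterations` parameters correspond in spirit to `dampingFactor` and `maxIterations` in the GDS configuration.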
GDS in Practice: Community Summaries in GraphRAG
// Step 1: Create an in-memory graph projection
CALL gds.graph.project(
  'knowledge-graph',
  ['Entity', 'Chunk'],
  {
    RELATED_TO: {orientation: 'UNDIRECTED'},
    MENTIONS: {orientation: 'UNDIRECTED'}
  }
);
// Step 2: Run Louvain community detection and write community_id back to the nodes
CALL gds.louvain.write('knowledge-graph', {
  writeProperty: 'community_id',
  maxLevels: 10,
  maxIterations: 20
})
YIELD communityCount, modularity
RETURN communityCount, modularity;
// Step 3: Get community members for LLM summarization
MATCH (e:Entity)
WITH e.community_id AS community, collect(e.name) AS members
WHERE size(members) >= 3
RETURN community, members, size(members) AS size
ORDER BY size DESC
LIMIT 50;
// Step 4: Run PageRank on the projected graph
// (the scores are global; group by community_id to rank entities within each community)
CALL gds.pageRank.write('knowledge-graph', {
  writeProperty: 'pagerank',
  maxIterations: 20,
  dampingFactor: 0.85
});
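The rows returned by Step 3 still have to be turned into LLM prompts. A minimal sketch of that step follows, assuming each row is a dict with `community` and `members` keys (the shape of the Step 3 result) and that PageRank scores are available as an optional name-to-score dict. The function name, the prompt wording, and the `max_members` cap are illustrative assumptions, not part of any Neo4j or GDS API.

```python
def build_community_prompts(rows, pagerank=None, max_members=20):
    """Build one summarization prompt per community.

    Members are ordered by descending PageRank (missing scores count as 0)
    so that the most central entities survive the max_members cut.
    """
    pagerank = pagerank or {}
    prompts = {}
    for row in rows:
        members = sorted(
            row["members"],
            key=lambda name: (-pagerank.get(name, 0.0), name),
        )
        shown = members[:max_members]
        prompts[row["community"]] = (
            "Summarize the common theme of the following entities "
            "in two or three sentences.\n"
            f"Entities ({len(shown)} of {len(members)}): " + ", ".join(shown)
        )
    return prompts
```

Each prompt can then be sent to whichever LLM client the application uses, and the returned summary stored on a per-community node for retrieval.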
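Vector Indexes
Native Vector Search in Neo4j
Neo4j has supported vector indexes natively since version 5.11, which lets it serve as both a graph database and a vector database. In 5.11 the index was created through a procedure; the CREATE VECTOR INDEX syntax used below requires a newer 5.x release.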
// Create a vector index
CREATE VECTOR INDEX chunk_embeddings IF NOT EXISTS
FOR (c:Chunk) ON (c.embedding)
OPTIONS {
  indexConfig: {
    `vector.dimensions`: 1536,
    `vector.similarity_function`: 'cosine'
  }
};
// Write a node that carries an embedding
CREATE (c:Chunk {
  text: 'Knowledge graphs enable structured reasoning...',
  source: 'paper_001',
  embedding: $embedding_vector
});
// Vector similarity search
CALL db.index.vector.queryNodes(
  'chunk_embeddings',
  10,               // top-K
  $query_embedding  // query vector
)
YIELD node, score
RETURN node.text, node.source, score
ORDER BY score DESC;
// Hybrid query: vector search + graph traversal
CALL db.index.vector.queryNodes('chunk_embeddings', 20, $query_embedding)
YIELD node AS chunk, score
WHERE score > 0.7
MATCH (chunk)-[:MENTIONS]->(entity:Entity)-[:RELATED_TO]-(related:Entity)
RETURN chunk.text, score,
       collect(DISTINCT entity.name) AS entities,
       collect(DISTINCT related.name) AS related_entities
ORDER BY score DESC
LIMIT 5;
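The `score` thresholds in these queries are easier to tune once the score scale is clear. The sketch below assumes that, for the cosine similarity function, Neo4j maps raw cosine from [-1, 1] into [0, 1] as (1 + cos) / 2, as its documentation describes; verify this against the version you run. Under that assumption, `score > 0.7` corresponds to a raw cosine above 0.4. The helper names and the brute-force `top_k` are illustrative only and stand in for the index lookup.

```python
import math


def cosine(a, b):
    """Raw cosine similarity in [-1, 1] for two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def normalized_score(a, b):
    """Cosine mapped to [0, 1], assuming the (1 + cos) / 2 normalization."""
    return (1.0 + cosine(a, b)) / 2.0


def top_k(query, vectors, k):
    """Brute-force nearest neighbours over a {name: vector} dict."""
    scored = [(name, normalized_score(query, vec)) for name, vec in vectors.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]
```

A brute-force scan like this is only practical for small collections; the vector index exists precisely to avoid comparing the query against every stored embedding.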
LLM Integration
Neo4j + LangChain GraphRAG
from langchain_community.graphs import Neo4jGraph
from langchain_community.vectorstores import Neo4jVector
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import GraphCypherQAChain

# Initialize Neo4j connection (use real credentials from configuration, not literals)
graph = Neo4jGraph(
    url="bolt://localhost:7687",
    username="neo4j",
    password="password",
)

# Vector store on top of Neo4j
vector_store = Neo4jVector.from_existing_graph(
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    url="bolt://localhost:7687",
    username="neo4j",
    password="password",
    index_name="chunk_embeddings",
    node_label="Chunk",
    text_node_properties=["text"],
    embedding_node_property="embedding",
)

# Text-to-Cypher chain
llm = ChatOpenAI(model="gpt-4o", temperature=0)
cypher_chain = GraphCypherQAChain.from_llm(
    llm=llm,
    graph=graph,
    verbose=True,
    validate_cypher=True,
    return_intermediate_steps=True,
    # Recent langchain-community releases require this explicit opt-in because
    # the chain executes LLM-generated Cypher; connect with a read-only user.
    allow_dangerous_requests=True,
)

# Query: natural language -> Cypher -> answer
result = cypher_chain.invoke({
    "query": "Which companies in Beijing have more than 100 employees working on AI?"
})
print(result["result"])
# Example of the Cypher the chain may generate (see result["intermediate_steps"]):
#   MATCH (c:Company)-[:LOCATED_IN]->(:City {name:'Beijing'})
#   MATCH (c)<-[:WORKS_AT]-(p:Person)
#   WHERE c.industry = 'AI'
#   WITH c, count(p) AS emp_count
#   WHERE emp_count > 100
#   RETURN c.name, emp_count
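Because a text-to-Cypher chain executes whatever the model writes, it can help to screen generated queries before they reach the database. The following is a minimal sketch of a keyword-based read-only check. The `is_read_only` helper is hypothetical and deliberately conservative: it can reject harmless queries (for example a property named `set`), and it does not detect procedures that write, so it complements a read-only database user rather than replacing one.

```python
import re

# Clauses that modify data or schema; matched case-insensitively as whole words.
_WRITE_CLAUSES = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP|FOREACH|LOAD\s+CSV)\b",
    re.IGNORECASE,
)
_SINGLE_QUOTED = re.compile(r"'(?:\\.|[^'\\])*'")
_DOUBLE_QUOTED = re.compile(r'"(?:\\.|[^"\\])*"')
_LINE_COMMENT = re.compile(r"//[^\n]*")


def is_read_only(cypher):
    """Return True if no write clause appears outside string literals and comments."""
    # Blank out string literals first so keywords inside them are ignored.
    stripped = _SINGLE_QUOTED.sub("''", cypher)
    stripped = _DOUBLE_QUOTED.sub('""', stripped)
    stripped = _LINE_COMMENT.sub("", stripped)
    return _WRITE_CLAUSES.search(stripped) is None
```

One way to use it is to inspect the generated statement from the chain's intermediate steps and refuse to return an answer when the check fails.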
Hybrid Retrieval: Vectors + Graph Structure
class HybridGraphRetriever:
    """Combine vector similarity with graph traversal for RAG."""

    def __init__(self, driver, embed_fn):
        self.driver = driver      # neo4j.Driver instance
        self.embed_fn = embed_fn  # callable: str -> list[float]

    def retrieve(self, query: str, top_k: int = 5,
                 graph_depth: int = 1) -> list[dict]:
        """
        1. Vector search for relevant chunks
        2. Graph traversal to enrich context
        3. Combine and rank

        Note: graph_depth is currently unused; the query expands exactly one hop.
        """
        query_embedding = self.embed_fn(query)
        with self.driver.session() as session:
            result = session.run("""
                // Step 1: Vector search (over-fetch, then filter by score)
                CALL db.index.vector.queryNodes(
                    'chunk_embeddings', $top_k * 3, $embedding
                )
                YIELD node AS chunk, score
                WHERE score > 0.6
                // Step 2: Get mentioned entities
                OPTIONAL MATCH (chunk)-[:MENTIONS]->(entity:Entity)
                // Step 3: Get related entities (1-hop)
                OPTIONAL MATCH (entity)-[:RELATED_TO]-(related:Entity)
                // Step 4: Get chunks mentioning related entities
                OPTIONAL MATCH (related_chunk:Chunk)-[:MENTIONS]->(related)
                WHERE related_chunk <> chunk
                RETURN chunk.text AS text,
                       score,
                       collect(DISTINCT entity.name) AS entities,
                       collect(DISTINCT related.name) AS related_entities,
                       collect(DISTINCT related_chunk.text)[0..2] AS related_texts
                ORDER BY score DESC
                LIMIT $top_k
            """, embedding=query_embedding, top_k=top_k)
            # Consume the result while the session is still open.
            return [dict(record) for record in result]
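The records returned by `retrieve` still need to be packed into a prompt context. Below is a minimal sketch of that step, assuming each record is a dict with the `text` and `related_texts` keys produced by the query above. The `build_context` name and the character budget are illustrative assumptions; a production system would more likely budget by tokens.

```python
def build_context(records, max_chars=2000):
    """Concatenate primary and related chunk texts, deduplicated, within a budget.

    Records are expected in descending score order, so higher-scoring chunks
    (and their related texts) are packed first.
    """
    seen = set()
    parts = []
    used = 0
    for record in records:
        candidates = [record["text"]] + list(record.get("related_texts") or [])
        for text in candidates:
            if not text or text in seen:
                continue
            if used + len(text) > max_chars:
                # Budget exhausted: stop rather than truncate mid-chunk.
                return "\n\n".join(parts)
            seen.add(text)
            parts.append(text)
            used += len(text)
    return "\n\n".join(parts)
```

Deduplication matters here because a chunk reached through a related entity is often also a direct vector hit.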
Performance Optimization
Indexing Strategy
| Index type | Use | Creation syntax |
|---|---|---|
| Range (called B-tree before Neo4j 5) | Exact match / range queries | CREATE INDEX FOR (n:Label) ON (n.prop) |
| Full-text | Text search | CREATE FULLTEXT INDEX FOR (n:Label) ON EACH [n.text] |
| Vector | Vector similarity | CREATE VECTOR INDEX ... OPTIONS {vector.dimensions: N} |
| Composite | Multi-property lookup | CREATE INDEX FOR (n:L) ON (n.a, n.b) |
Query Optimization Checklist
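// 1. Use PROFILE to inspect the query plan
PROFILE
MATCH (p:Person)-[:WORKS_AT]->(c:Company {industry: 'AI'})
RETURN p.name, c.name
// 2. Avoid Cartesian products
// Bad: multiple patterns with no connecting condition
MATCH (a:Person), (b:Company) // Cartesian product!
// Good: connect the patterns through a relationship
MATCH (a:Person)-[:WORKS_AT]->(b:Company)
// 3. Filter as early as possible
// Bad: the anchor predicate is separated from the pattern
// (the planner often pushes it down anyway, but the intent is easy to miss)
MATCH (p:Person)-[*1..5]->(target)
WHERE p.name = 'Zhang Wei'
// Good: anchor the pattern on the (ideally indexed) property
MATCH (p:Person {name: 'Zhang Wei'})-[*1..5]->(target)
// 4. Use parameterized queries
// Good: lets Neo4j reuse the cached query plan
MATCH (p:Person {name: $name}) RETURN p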
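On the application side, item 4 means values travel as parameters rather than being formatted into the query string. A minimal sketch follows. Property values can be parameters, but a label in this position cannot, so the sketch validates the label against an allowlist before interpolating it. `ALLOWED_LABELS` and `find_by_name` are hypothetical names for illustration.

```python
# Labels the application is willing to query (illustrative allowlist).
ALLOWED_LABELS = {"Person", "Company", "Entity"}


def find_by_name(label, name):
    """Return (query, parameters) for a lookup by name.

    The label is interpolated only after an allowlist check; the name is
    always passed as the $name parameter, never formatted into the text.
    """
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    query = f"MATCH (n:{label} {{name: $name}}) RETURN n"
    return query, {"name": name}
```

With the official Python driver, the returned pair can be passed straight through, for example `session.run(query, params)`. Because the query text stays identical across different names, the plan cache can be reused.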
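Conclusion
Neo4j's value in AI applications is expanding from storing graph data to serving as a graph-native reasoning engine. Native vector indexes let Neo4j act as both a vector database and a graph database; the GDS library provides efficient tools for community detection and entity importance ranking; and integrations with frameworks such as LangChain lower the barrier to building GraphRAG. For AI applications that need to combine structured relationship reasoning with semantic retrieval, Neo4j is among the most mature options available today.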
Maurice | maurice_wen@proton.me