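GraphRAG-03: Configuration System and Data Model

1. Configuration System (Config)

1.1 Module Overview

The Config module loads, parses, and validates GraphRAG configuration. It supports YAML configuration files, environment variable injection, and CLI argument overrides.

1.2 Configuration Loading Flow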

flowchart LR
    ConfigFile[YAML config file]
    EnvFile[.env file]
    EnvVars[Environment variables]
    CLIOverrides[CLI arguments]
    
    ConfigFile --> Parser[Config parser]
    EnvFile --> EnvReader[Environment variable reader]
    EnvVars --> EnvReader
    EnvReader --> Parser
    CLIOverrides --> Parser
    
    Parser --> Validator[Pydantic validation]
    Validator --> GraphRagConfig[GraphRagConfig object]

1.3 Core API

load_config

Function signature:

def load_config(
    root_dir: Path,
    config_filepath: Path | None = None,
    cli_overrides: dict[str, Any] | None = None,
) -> GraphRagConfig

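Parameters:

Parameter	Type	Required	Description
root_dir	Path	Yes	Project root directory
config_filepath	Path	No	Path to the configuration file (by default, settings.yaml is looked up)
cli_overrides	dict	No	CLI overrides as a flattened dictionary, e.g. `{'output.base_dir': '/tmp'}`

Core code: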

def load_config(root_dir, config_filepath=None, cli_overrides=None):
    # 1. Locate the configuration file
    config_path = _get_config_path(root_dir, config_filepath)

    # 2. Load the .env file
    _load_dotenv(config_path)

    # 3. Read the configuration file contents
    config_text = config_path.read_text(encoding="utf-8")

    # 4. Substitute ${VAR_NAME} references with environment variable values
    config_text = _parse_env_variables(config_text)

    # 5. Parse the YAML
    config_data = yaml.safe_load(config_text)

    # 6. Apply CLI overrides
    if cli_overrides:
        _apply_overrides(config_data, cli_overrides)

    # 7. Create and validate the configuration object
    return create_graphrag_config(config_data, root_dir=str(root_dir))
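
Step 6 takes a flattened dictionary with dotted keys and writes it into the nested configuration data. The following is a minimal sketch of how that could work; `apply_overrides` is a hypothetical stand-in written for illustration, not GraphRAG's actual `_apply_overrides`.

```python
from typing import Any


def apply_overrides(data: dict[str, Any], overrides: dict[str, Any]) -> None:
    """Write each dotted-key override into the nested dict, in place."""
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = data
        for part in parents:
            child = node.get(part)
            if not isinstance(child, dict):
                # Create the intermediate mapping (or replace a scalar with one).
                child = {}
                node[part] = child
            node = child
        node[leaf] = value


config_data = {"output": {"type": "file", "base_dir": "./output"}}
apply_overrides(config_data, {"output.base_dir": "/tmp", "cache.type": "memory"})
print(config_data)
```

Because overrides are applied after YAML parsing and before validation, an overridden value goes through the same validation as a value written in the file.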

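1.4 Configuration Structure

Main fields of GraphRagConfig:

Field	Type	Description
root_dir	str	Project root directory
models	dict[str, LanguageModelConfig]	Dictionary of language model configurations
input	InputConfig	Input configuration
storage	StorageConfig	Storage configuration
cache	CacheConfig	Cache configuration
chunks	ChunkingConfig	Text chunking configuration
extract_graph	ExtractGraphConfig	Entity extraction configuration
cluster_graph	ClusterGraphConfig	Graph clustering configuration
community_reports	CommunityReportsConfig	Community report configuration
local_search	LocalSearchConfig	Local search configuration
global_search	GlobalSearchConfig	Global search configuration
vector_store	dict[str, VectorStoreConfig]	Vector store configuration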

1.5 Configuration Example

# settings.yaml
root_dir: "."

# Model configuration
models:
  default:
    type: openai_chat
    model: gpt-4o-mini
    api_key: ${OPENAI_API_KEY}
    max_tokens: 4000
    temperature: 0.0
  
  embedding:
    type: openai_embedding
    model: text-embedding-3-small
    api_key: ${OPENAI_API_KEY}

# Input configuration
input:
  type: file
  base_dir: "./input"
  file_type: text
  file_pattern: ".*\\.txt$$"  # "$$" escapes the literal "$" so it is not read as a ${VAR} reference
  encoding: utf-8

# Storage configuration
storage:
  type: file  # or blob, cosmosdb
  base_dir: "./output"

output:
  base_dir: "./output"

# Cache configuration
cache:
  type: file  # or json, memory, none
  base_dir: "./cache"

# Text chunking configuration
chunks:
  size: 300
  overlap: 100
  encoding_model: cl100k_base
  strategy: tokens

# Entity extraction configuration
extract_graph:
  model_id: default
  entity_types:
    - organization
    - person
    - geo
    - event
  max_gleanings: 1
  concurrent_requests: 25

# Graph clustering configuration
cluster_graph:
  max_cluster_size: 10
  use_lcc: true
  seed: 0xDEADBEEF

# Community report configuration
community_reports:
  model_id: default
  max_length: 2000
  concurrent_requests: 25

# Local search configuration
local_search:
  chat_model_id: default
  text_embedding_model_id: embedding
  max_context_tokens: 5000
  top_k_entities: 10
  top_k_relationships: 10
  text_unit_prop: 0.5
  community_prop: 0.1

# Global search configuration
global_search:
  chat_model_id: default
  map_max_length: 1000
  reduce_max_length: 2000
  max_context_tokens: 8000

# Vector store configuration
vector_store:
  entity_description_embedding:
    type: lancedb
    db_uri: "./lancedb"
    container_name: entity-description
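
The `chunks` settings above (`size`, `overlap`) describe a sliding window over tokens: each chunk holds at most `size` tokens and shares `overlap` tokens with the previous one. The sketch below illustrates that idea only. It assumes a pre-tokenized list in place of the `cl100k_base` encoding, and `chunk_tokens` is a hypothetical helper, not GraphRAG's splitter.

```python
def chunk_tokens(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of `size` that overlap by `overlap` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap  # how far the window advances each time
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the window already reached the end of the input
    return chunks


tokens = [f"t{i}" for i in range(10)]
chunks = chunk_tokens(tokens, size=4, overlap=2)
for chunk in chunks:
    print(chunk)
```

With `size: 300` and `overlap: 100`, the window advances 200 tokens at a time, so roughly a third of each chunk repeats content from its predecessor.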

1.6 Environment Variable Injection

Syntax: reference an environment variable in the configuration file with ${VAR_NAME}.

Example .env file:

OPENAI_API_KEY=sk-...
AZURE_OPENAI_ENDPOINT=https://...
AZURE_OPENAI_API_KEY=...

Usage example:

models:
  default:
    api_key: ${OPENAI_API_KEY}
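
The `${VAR_NAME}` syntax matches what Python's standard `string.Template` understands, so the substitution step can be sketched with it. This is an illustration under that assumption; `substitute_env` is a hypothetical name, and the real `_parse_env_variables` may behave differently in its details.

```python
import os
from string import Template
from typing import Mapping, Optional


def substitute_env(text: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Replace ${VAR_NAME} references in raw config text before YAML parsing."""
    values = os.environ if env is None else env
    try:
        return Template(text).substitute(values)
    except KeyError as exc:
        # A referenced variable is missing: fail early with a readable message.
        raise ValueError(f"Environment variable not set: {exc.args[0]}") from exc


yaml_text = "models:\n  default:\n    api_key: ${OPENAI_API_KEY}\n"
resolved = substitute_env(yaml_text, {"OPENAI_API_KEY": "sk-test"})
print(resolved)
```

Under `string.Template` rules, a literal `$` in the file (for example at the end of a regular expression) has to be written as `$$`, otherwise substitution fails on an invalid placeholder.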

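2. Data Model

2.1 Module Overview

The Data Model module defines GraphRAG's core data structures, including entities, relationships, communities, and documents. All data models are implemented as dataclasses.

2.2 Core Data Model UML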

classDiagram
    class Identified {
        +str id
        +int short_id
    }
    
    class Named {
        +str title
    }
    
    class Document {
        +str text
        +dict metadata
        +from_dict()
    }
    
    class Entity {
        +str type
        +str description
        +list~float~ description_embedding
        +list~str~ community_ids
        +list~str~ text_unit_ids
        +int rank
        +dict attributes
        +from_dict()
    }
    
    class Relationship {
        +str source
        +str target
        +float weight
        +str description
        +list~float~ description_embedding
        +list~str~ text_unit_ids
        +int rank
        +dict attributes
        +from_dict()
    }
    
    class Community {
        +int level
        +int parent
        +list~int~ children
        +list~str~ entity_ids
        +list~str~ relationship_ids
        +list~str~ text_unit_ids
        +int size
        +str period
        +from_dict()
    }
    
    class CommunityReport {
        +int community
        +int level
        +str summary
        +str full_content
        +dict full_content_json
        +float rank
        +list~dict~ findings
        +from_dict()
    }
    
    class TextUnit {
        +str text
        +int n_tokens
        +list~str~ document_ids
        +list~str~ entity_ids
        +list~str~ relationship_ids
        +from_dict()
    }
    
    class Covariate {
        +str subject_id
        +str object_id
        +str type
        +str status
        +str description
        +list~str~ text_unit_ids
        +from_dict()
    }
    
    Identified <|-- Named
    Identified <|-- TextUnit
    Named <|-- Document
    Named <|-- Entity
    Named <|-- Community
    Named <|-- CommunityReport
    Identified <|-- Relationship
    Identified <|-- Covariate

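2.3 Entity

Fields:

Field	Type	Description
id	str	Unique entity identifier (UUID)
short_id	int	Human-readable ID
title	str	Entity name (unique)
type	str	Entity type (e.g. organization, person)
description	str	Entity description
description_embedding	list[float]	Vector embedding of the description
community_ids	list[str]	IDs of the communities the entity belongs to
text_unit_ids	list[str]	IDs of the text units that mention the entity
rank	int	Entity importance rank (based on degree centrality)
attributes	dict	Additional attributes (e.g. time range, attribute values)

from_dict usage: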

entity = Entity.from_dict({
    "id": "uuid-123",
    "title": "Microsoft",
    "type": "organization",
    "description": "A technology company",
    "degree": 50,  # mapped to rank
})
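
To make the `degree` to `rank` mapping concrete, here is a reduced dataclass sketch. `EntitySketch` and its `rank_key` parameter are illustrative assumptions covering only a few of the fields listed above; the real `Entity.from_dict` accepts more keys.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class EntitySketch:
    """A cut-down stand-in for Entity, for illustration only."""

    id: str
    title: str
    type: Optional[str] = None
    description: Optional[str] = None
    rank: int = 1
    attributes: dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_dict(cls, d: dict[str, Any], rank_key: str = "degree") -> "EntitySketch":
        # The source column name ("degree") differs from the model field ("rank").
        return cls(
            id=d["id"],
            title=d["title"],
            type=d.get("type"),
            description=d.get("description"),
            rank=d.get(rank_key, 1),
            attributes=d.get("attributes") or {},
        )


entity = EntitySketch.from_dict({
    "id": "uuid-123",
    "title": "Microsoft",
    "type": "organization",
    "description": "A technology company",
    "degree": 50,
})
print(entity.rank)
```

Keeping the key name as a parameter lets the same model load rows whose rank lives under a different column name.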

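2.4 Relationship

Fields:

Field	Type	Description
id	str	Unique relationship identifier
source	str	Source entity name
target	str	Target entity name
weight	float	Relationship weight (accumulated)
description	str	Relationship description
description_embedding	list[float]	Vector embedding of the description
text_unit_ids	list[str]	IDs of the text units that mention the relationship
rank	int	Relationship importance rank
attributes	dict	Additional attributes

Notes:

source and target hold entity names, not IDs.
weight is a float that expresses relationship strength; repeated mentions are summed.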
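
The accumulation of `weight` across mentions can be sketched as a grouped sum keyed by the entity names. This example assumes relationships are directed, so (A, B) and (B, A) are kept apart; `merge_relationships` is a hypothetical helper, not part of GraphRAG.

```python
from collections import defaultdict


def merge_relationships(
    mentions: list[tuple[str, str, float]],
) -> dict[tuple[str, str], float]:
    """Sum the weights of all mentions that share the same (source, target)."""
    totals: dict[tuple[str, str], float] = defaultdict(float)
    for source, target, weight in mentions:
        totals[(source, target)] += weight
    return dict(totals)


mentions = [
    ("Microsoft", "OpenAI", 1.0),
    ("Microsoft", "OpenAI", 2.0),
    ("OpenAI", "Microsoft", 1.0),
]
merged = merge_relationships(mentions)
print(merged)
```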

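2.5 Community

Fields:

Field	Type	Description
id	str	Unique community identifier
community	int	Community number (unique within a level)
level	int	Hierarchy level (0 = top level, coarsest granularity)
title	str	Community title
parent	int	Parent community number (-1 for a root community)
children	list[int]	Child community numbers
entity_ids	list[str]	IDs of the entities in the community
relationship_ids	list[str]	IDs of the relationships inside the community
text_unit_ids	list[str]	IDs of the text units the community covers
size	int	Community size (number of entities)
period	str	Creation date (ISO 8601)

Hierarchy:

level=0: top-level (coarsest) communities, whose parent is -1
level=1, 2, ...: progressively finer sub-communities
parent and children together form the hierarchy tree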
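
The `parent` and `children` fields are two views of the same tree. The sketch below derives the child lists from a parent map and walks from a community up to its root. It assumes community numbers are unique across the whole hierarchy in the sample data, and both helper names are hypothetical.

```python
def children_index(parent_of: dict[int, int]) -> dict[int, list[int]]:
    """Derive each community's child list from the parent map (-1 means root)."""
    children: dict[int, list[int]] = {community: [] for community in parent_of}
    for child, parent in parent_of.items():
        if parent != -1:
            children.setdefault(parent, []).append(child)
    return children


def path_to_root(community: int, parent_of: dict[int, int]) -> list[int]:
    """Follow parent links from a community until a root (parent == -1)."""
    path = [community]
    seen = {community}
    while parent_of.get(path[-1], -1) != -1:
        parent = parent_of[path[-1]]
        if parent in seen:
            raise ValueError("cycle detected in parent links")
        path.append(parent)
        seen.add(parent)
    return path


# Sample data: communities 0 and 1 are roots; 2 and 3 sit under 0; 4 sits under 2.
parent_of = {0: -1, 1: -1, 2: 0, 3: 0, 4: 2}
children = children_index(parent_of)
path = path_to_root(4, parent_of)
print(children)
print(path)
```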

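2.6 CommunityReport

Fields:

Field	Type	Description
id	str	Unique report identifier
community	int	Community number
level	int	Hierarchy level
title	str	Report title
summary	str	Community summary
full_content	str	Full report content
full_content_json	dict	Structured report data
rank	float	Report importance rank
findings	list[dict]	List of findings

Structure of findings: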

findings = [
    {
        "summary": "Microsoft is a major player in AI",
        "explanation": "The company has invested heavily..."
    },
    ...
]
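
Each finding is a small record with a `summary` and an `explanation`, which makes it straightforward to turn the list into readable text. The layout below (one heading per finding) is an assumption for illustration, and `render_findings` is a hypothetical helper.

```python
def render_findings(findings: list[dict[str, str]]) -> str:
    """Render findings as text sections: a heading line, then the explanation."""
    sections = [
        f"## {item['summary']}\n\n{item['explanation']}" for item in findings
    ]
    return "\n\n".join(sections)


sample_findings = [
    {"summary": "First finding", "explanation": "Details of the first finding."},
    {"summary": "Second finding", "explanation": "Details of the second finding."},
]
rendered = render_findings(sample_findings)
print(rendered)
```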

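2.7 TextUnit

Fields:

Field	Type	Description
id	str	Unique text unit identifier
text	str	Text unit content
n_tokens	int	Number of tokens
document_ids	list[str]	IDs of the source documents
entity_ids	list[str]	IDs of the entities mentioned
relationship_ids	list[str]	IDs of the relationships mentioned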

2.8 Parquet Schema

entities.parquet:

ENTITIES_FINAL_COLUMNS = [
    "id",
    "human_readable_id",
    "title",
    "type",
    "description",
    "text_unit_ids",
    "frequency",  # number of occurrences
    "degree",     # node degree in the graph (mapped to rank)
    "x",          # graph layout x coordinate (optional)
    "y",          # graph layout y coordinate (optional)
]

relationships.parquet:

RELATIONSHIPS_FINAL_COLUMNS = [
    "id",
    "human_readable_id",
    "source",
    "target",
    "description",
    "text_unit_ids",
    "weight",
    "rank",
]

communities.parquet:

COMMUNITIES_FINAL_COLUMNS = [
    "id",
    "human_readable_id",
    "community",
    "level",
    "title",
    "parent",
    "children",
    "entity_ids",
    "relationship_ids",
    "text_unit_ids",
    "period",
    "size",
]

3. Storage Layer

3.1 Storage Abstraction

The PipelineStorage interface:

from abc import ABC, abstractmethod


class PipelineStorage(ABC):
    @abstractmethod
    async def get(self, key: str) -> bytes | None:
        """Get the file contents."""

    @abstractmethod
    async def set(self, key: str, value: bytes) -> None:
        """Write the file."""

    @abstractmethod
    async def has(self, key: str) -> bool:
        """Check whether the file exists."""

    @abstractmethod
    async def delete(self, key: str) -> None:
        """Delete the file."""

    @abstractmethod
    async def list_keys(self, pattern: str | None = None) -> list[str]:
        """List file keys."""

    @abstractmethod
    def child(self, name: str) -> "PipelineStorage":
        """Create a storage scoped to a subdirectory."""
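
To show how the methods fit together, here is a minimal in-memory sketch that mirrors the method names listed above. It is written for illustration and is not one of GraphRAG's storage classes; treating `pattern` as a regular expression and implementing `child` as a key prefix are both assumptions.

```python
import asyncio
import re
from typing import Optional


class MemoryStorageSketch:
    """Dict-backed storage exposing the same method names as the interface."""

    def __init__(self, data: Optional[dict[str, bytes]] = None, prefix: str = "") -> None:
        # Child storages share the parent's dict and differ only by prefix.
        self._data: dict[str, bytes] = {} if data is None else data
        self._prefix = prefix

    def _full(self, key: str) -> str:
        return f"{self._prefix}{key}"

    async def get(self, key: str) -> Optional[bytes]:
        return self._data.get(self._full(key))

    async def set(self, key: str, value: bytes) -> None:
        self._data[self._full(key)] = value

    async def has(self, key: str) -> bool:
        return self._full(key) in self._data

    async def delete(self, key: str) -> None:
        self._data.pop(self._full(key), None)

    async def list_keys(self, pattern: Optional[str] = None) -> list[str]:
        keys = [k[len(self._prefix):] for k in self._data if k.startswith(self._prefix)]
        if pattern is not None:
            regex = re.compile(pattern)
            keys = [k for k in keys if regex.search(k)]
        return sorted(keys)

    def child(self, name: str) -> "MemoryStorageSketch":
        return MemoryStorageSketch(self._data, f"{self._prefix}{name}/")


async def demo() -> tuple:
    root = MemoryStorageSketch()
    output = root.child("output")
    await output.set("entities.parquet", b"placeholder bytes")
    return (
        await root.has("output/entities.parquet"),
        await output.list_keys(r"\.parquet$"),
        await output.get("missing"),
    )


result = asyncio.run(demo())
print(result)
```

The `child` call is what lets pipeline steps write under their own subdirectory without knowing where the root storage lives.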

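3.2 Storage Implementations

FilePipelineStorage:

Local filesystem storage
Suited to single-machine development and small deployments
Example path: ./output/entities.parquet

BlobPipelineStorage:

Azure Blob Storage
Suited to cloud deployments and large-scale production
Supports high availability and backup
Example path: container-name/output/entities.parquet

CosmosDBPipelineStorage:

Azure Cosmos DB storage
Suited to scenarios that need global distribution
Supports multi-region reads and writes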

3.3 Configuration Examples

Local files:

storage:
  type: file
  base_dir: "./output"

Azure Blob:

storage:
  type: blob
  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
  container_name: graphrag-output
  base_dir: "project1/output"

4. Cache Layer

4.1 Cache Interface

The PipelineCache interface:

from abc import ABC, abstractmethod
from typing import Any


class PipelineCache(ABC):
    @abstractmethod
    async def get(self, key: str) -> Any | None:
        """Get a cached value."""

    @abstractmethod
    async def set(self, key: str, value: Any) -> None:
        """Set a cached value."""

    @abstractmethod
    async def has(self, key: str) -> bool:
        """Check whether a cache entry exists."""

    @abstractmethod
    async def delete(self, key: str) -> None:
        """Delete a cache entry."""

    @abstractmethod
    def child(self, name: str) -> "PipelineCache":
        """Create a child cache namespace."""

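4.2 Cache Implementations

JsonPipelineCache:

Cache backed by JSON files
Suited to small caches (< 10 MB)
Example: ./cache/llm_cache.json

FilePipelineCache (recommended):

Cache backed by Parquet files
Suited to large caches (> 10 MB)
Efficient serialization and querying
Example: ./cache/llm_cache.parquet

MemoryPipelineCache:

In-memory cache
Suited to development and testing
Cleared when the process exits

NoopPipelineCache:

No-op cache (nothing is cached)
Suited to debugging or disabling caching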

4.3 Cache Key Generation

Cache key for an LLM call:

cache_key = gen_sha512_hash({
    "text": input_text,
    "prompt": prompt_template,
    "model": "gpt-4o-mini",
    "temperature": 0.0,
})

Cache key for an embedding:

cache_key = gen_sha512_hash({
    "text": input_text,
    "model": "text-embedding-3-small",
})
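
The idea behind both keys is that identical inputs hash to the same key, so a repeated call can be answered from the cache. The sketch below shows one way to get a stable SHA-512 key from a dictionary and use it in a get-or-compute pattern. `stable_hash`, `cached_call`, and `fake_llm` are hypothetical names; this is not the library's `gen_sha512_hash`.

```python
import hashlib
import json
from typing import Any, Callable


def stable_hash(payload: dict[str, Any]) -> str:
    """Hash a dict so that key order does not affect the result."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha512(canonical.encode("utf-8")).hexdigest()


cache: dict[str, str] = {}
stats = {"calls": 0}


def fake_llm(params: dict[str, Any]) -> str:
    """Stand-in for an expensive model call."""
    stats["calls"] += 1
    return f"response for {params['text']}"


def cached_call(params: dict[str, Any], compute: Callable[[dict[str, Any]], str]) -> str:
    """Return the cached result for these parameters, computing it on a miss."""
    key = stable_hash(params)
    if key not in cache:
        cache[key] = compute(params)
    return cache[key]


params = {
    "text": "hello",
    "prompt": "Summarize: {text}",
    "model": "gpt-4o-mini",
    "temperature": 0.0,
}
llm_key = stable_hash(params)
first = cached_call(params, fake_llm)
second = cached_call(params, fake_llm)
print(stats["calls"])
```

Including the model name and temperature in the hashed payload means that changing either one produces a different key, so stale responses from another model setup are not reused.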

4.4 Configuration Example

cache:
  type: file
  base_dir: "./cache"

5. Configuration Best Practices

5.1 Development Configuration

chunks:
  size: 300
  overlap: 50

extract_graph:
  concurrent_requests: 10
  entity_types: ["organization", "person"]

cache:
  type: memory

models:
  default:
    model: gpt-4o-mini

5.2 Production Configuration

chunks:
  size: 600
  overlap: 150

extract_graph:
  concurrent_requests: 50
  entity_types: ["organization", "person", "location", "event", "technology"]
  max_gleanings: 2

cache:
  type: file
  base_dir: "./cache"

storage:
  type: blob
  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
  container_name: graphrag-output

models:
  default:
    model: gpt-4-turbo

5.3 Configuration Validation

from graphrag.index.validate_config import validate_config_names

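# Validate model names, field names, and similar settings in the configuration
validate_config_names(config)

This document covered GraphRAG's configuration system, data model, storage layer, and cache layer. Configuring these modules appropriately enables flexible and efficient knowledge graph construction and querying.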
