Perform context-aware text splitting

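Text splitting for vector storage usually relies on sentences or other delimiters to keep related text together.

But many documents (such as Markdown files) have structure (headers) that can be used explicitly for splitting.

MarkdownHeaderTextSplitter lets you split a Markdown file on the headers you specify.

Each resulting chunk retains, in its metadata, the header(s) it came from.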
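
To make the header-to-metadata idea concrete, here is a minimal pure-Python sketch of what such a splitter does conceptually. It is not LangChain's implementation: the function name `split_by_headers` and the sample text are hypothetical, and the sketch ignores details such as fenced code blocks and resetting lower-level headers when a higher-level header appears.

```python
from typing import Dict, List, Tuple


def split_by_headers(
    text: str, headers_to_split_on: List[Tuple[str, str]]
) -> List[Dict]:
    """Group lines under the most recent matching header (illustrative only)."""
    # Check longer markers first so that "###" is never mistaken for "#".
    markers = sorted(headers_to_split_on, key=lambda h: len(h[0]), reverse=True)
    chunks: List[Dict] = []
    metadata: Dict[str, str] = {}
    lines: List[str] = []

    def flush() -> None:
        # Emit the text collected so far, tagged with the current header metadata.
        content = "\n".join(lines).strip()
        if content:
            chunks.append({"page_content": content, "metadata": dict(metadata)})
        lines.clear()

    for line in text.splitlines():
        stripped = line.strip()
        for marker, name in markers:
            if stripped.startswith(marker + " "):
                flush()
                metadata[name] = stripped[len(marker):].strip()
                break
        else:
            # Not a header line: it belongs to the current section.
            lines.append(line)
    flush()
    return chunks


# Hypothetical sample document with two third-level headers.
sample = "### Introduction\nIntro text.\n\n### Testing\nTest text."
chunks = split_by_headers(sample, [("###", "Section")])
for chunk in chunks:
    print(chunk)
```

Each chunk carries a `Section` entry in its metadata, which is the property the rest of this walkthrough relies on.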

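Chunks that carry header metadata work nicely with SelfQueryRetriever.

First, tell the retriever about our splits.

Then, query based on the document structure (e.g., "summarize the doc introduction").

Only chunks from that section of the document pass the filter and are used in chat / Q+A.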
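
The filter-by-section step can be sketched in plain Python. This is only an illustration under simplifying assumptions: `filter_by_section` is a hypothetical helper and the chunks are hard-coded, whereas the real retriever uses an LLM to derive the filter from the question and a vector store to apply it.

```python
from typing import Dict, List

# Hypothetical chunks, shaped like the output of a header-aware splitter.
chunks: List[Dict] = [
    {"page_content": "Metadata filtering pre-filters chunks.",
     "metadata": {"Section": "Introduction"}},
    {"page_content": "The retriever was tested on one query.",
     "metadata": {"Section": "Testing"}},
    {"page_content": "Filtering helps with long documents.",
     "metadata": {"Section": "Conclusion"}},
]


def filter_by_section(chunks: List[Dict], section: str) -> List[Dict]:
    """Keep only chunks whose 'Section' metadata equals the requested value."""
    return [c for c in chunks if c["metadata"].get("Section") == section]


# A question about the introduction should only ever see Introduction chunks.
intro_chunks = filter_by_section(chunks, "Introduction")
print([c["page_content"] for c in intro_chunks])
```

Semantic search then runs over the filtered subset only, so text from other sections cannot leak into the answer.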

Let's test this out on an example Notion page!

First, download the page as Markdown, following the instructions here.

# Load the Notion page as a Markdown file
from langchain.document_loaders import NotionDirectoryLoader

path = "../Notion_DB/"
loader = NotionDirectoryLoader(path)
docs = loader.load()
md_file = docs[0].page_content

# Create groups based on the section headers in our page
from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("###", "Section"),
]
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(md_file)

Now, perform text splitting on the header-grouped documents.

# Define our text splitter
from langchain.text_splitter import RecursiveCharacterTextSplitter

chunk_size = 500
chunk_overlap = 0
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)
all_splits = text_splitter.split_documents(md_header_splits)
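
When header-grouped documents are split further, each smaller chunk is expected to keep its parent's metadata, which is what makes the later filtering possible (the retrieved results further down show several chunks sharing one `Section` value). The sketch below illustrates that idea with a naive fixed-width splitter. `split_with_metadata` is a hypothetical helper, and it does not reproduce the recursive, separator-aware logic of `RecursiveCharacterTextSplitter`.

```python
from typing import Dict, List


def split_with_metadata(docs: List[Dict], chunk_size: int) -> List[Dict]:
    """Cut each document into fixed-width pieces, copying its metadata to every piece."""
    pieces: List[Dict] = []
    for doc in docs:
        text = doc["page_content"]
        for start in range(0, len(text), chunk_size):
            pieces.append(
                {
                    "page_content": text[start : start + chunk_size],
                    # Copy so later edits to one piece do not affect the others.
                    "metadata": dict(doc["metadata"]),
                }
            )
    return pieces


# Hypothetical 1200-character document belonging to one section.
header_docs = [{"page_content": "a" * 1200, "metadata": {"Section": "Introduction"}}]
pieces = split_with_metadata(header_docs, chunk_size=500)
print(len(pieces), {p["metadata"]["Section"] for p in pieces})
```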

This sets us up well to perform metadata filtering based on the document structure.

Let's start by building a vector store.

pip install chromadb

# Build the vector store, keeping the metadata
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())

Let's create a SelfQueryRetriever that can filter based on the metadata we defined.

# Create the retriever
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

# Define our metadata
metadata_field_info = [
    AttributeInfo(
        name="Section",
        description="Section of the document that the text comes from",
        type="string or list[string]",
    ),
]
document_content_description = "Major sections of the document"

# Define the self-query retriever
llm = OpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm, vectorstore, document_content_description, metadata_field_info, verbose=True
)

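We can see that we can query only for text in the Introduction section of the document!

# Test
retriever.get_relevant_documents("Summarize the Introduction section of the document")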

    query='Introduction' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='Section', value='Introduction') limit=None

    [Document(page_content='![Untitled](Auto-Evaluation%20of%20Metadata%20Filtering%2018502448c85240828f33716740f9574b/Untitled.png)', metadata={'Section': 'Introduction'}),
     Document(page_content='Q+A systems often use a two-step approach: retrieve relevant text chunks and then synthesize them into an answer. There many ways to approach this. For example, we recently [discussed](https://blog.langchain.dev/auto-evaluation-of-anthropic-100k-context-window/) the Retriever-Less option (at bottom in the below diagram), highlighting the Anthropic 100k context window model. Metadata filtering is an alternative approach that pre-filters chunks based on a user-defined criteria in a VectorDB using', metadata={'Section': 'Introduction'}),
     Document(page_content='metadata tags prior to semantic search.', metadata={'Section': 'Introduction'})]

We can also look at other parts of the document.

retriever.get_relevant_documents("Summarize the Testing section of the document")

    query='Testing' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='Section', value='Testing') limit=None

    [Document(page_content='![Untitled](Auto-Evaluation%20of%20Metadata%20Filtering%2018502448c85240828f33716740f9574b/Untitled%202.png)', metadata={'Section': 'Testing'}),
     Document(page_content='`SelfQueryRetriever` works well in [many cases](https://twitter.com/hwchase17/status/1656791488569954304/photo/1). For example, given [this test case](https://twitter.com/hwchase17/status/1656791488569954304?s=20):  \n![Untitled](Auto-Evaluation%20of%20Metadata%20Filtering%2018502448c85240828f33716740f9574b/Untitled%201.png)  \nThe query can be nicely broken up into semantic query and metadata filter:  \n```python\nsemantic query: "prompt injection"', metadata={'Section': 'Testing'}),
     Document(page_content='Below, we can see detailed results from the app:  \n- Kor extraction is above to perform the transformation between query and metadata format ✅\n- Self-querying attempts to filter using the episode ID (`252`) in the query and fails 🚫\n- Baseline returns docs from 3 different episodes (one from `252`), confusing the answer 🚫', metadata={'Section': 'Testing'}),
     Document(page_content='will use in retrieval [here](https://github.com/langchain-ai/auto-evaluator/blob/main/streamlit/kor_retriever_lex.py).', metadata={'Section': 'Testing'})]

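Now, we can create chat or Q+A apps that are aware of the explicit document structure.

Retaining document structure for metadata filtering can be especially helpful for complicated or long documents.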

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
qa_chain = RetrievalQA.from_chain_type(llm, retriever=retriever)
qa_chain.run("Summarize the Testing section of the document")

    query='Testing' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='Section', value='Testing') limit=None

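    'The Testing section of the document describes the evaluation of the `SelfQueryRetriever` component in comparison to a baseline model. The evaluation was performed on a test case where the query was broken down into a semantic query and a metadata filter. The results showed that the `SelfQueryRetriever` component was able to perform the transformation between query and metadata format, but failed to filter using the episode ID in the query. The baseline model returned documents from three different episodes, which confused the answer. The `SelfQueryRetriever` component works well in many cases and will be used in retrieval.'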