Retrievers

info

Head to Integrations for documentation on built-in retriever integrations with third-party tools.

A retriever is an interface that returns documents given an unstructured query. It is more general than a vector store. A retriever does not need to be able to store documents, only to return (or retrieve) them. Vector stores can be used as the backbone of a retriever, but there are other types of retrievers as well.

Get started

The public API of the BaseRetriever class in LangChain is as follows:

from abc import ABC, abstractmethod
from typing import Any, List
from langchain.schema import Document
from langchain.callbacks.manager import Callbacks

class BaseRetriever(ABC):
    ...
    def get_relevant_documents(
        self, query: str, *, callbacks: Callbacks = None, **kwargs: Any
    ) -> List[Document]:
        """Retrieve documents relevant to a query.
        Args:
            query: string to find relevant documents for
            callbacks: Callback manager or list of callbacks
        Returns:
            List of relevant documents
        """
        ...

    async def aget_relevant_documents(
        self, query: str, *, callbacks: Callbacks = None, **kwargs: Any
    ) -> List[Document]:
        """Asynchronously get documents relevant to a query.
        Args:
            query: string to find relevant documents for
            callbacks: Callback manager or list of callbacks
        Returns:
            List of relevant documents
        """
        ...

It's that simple! You can call get_relevant_documents or the async aget_relevant_documents method to retrieve documents relevant to a query, where "relevance" is defined by the specific retriever object you are calling.
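
For example, assuming some retriever instance already exists (the variable name below is hypothetical; any BaseRetriever behaves the same way), retrieval is a single call:

# `retriever` is assumed to be any BaseRetriever instance, e.g. one backed by a vector store
docs = retriever.get_relevant_documents("my query")
print(len(docs))  # number of relevant Document objects returned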

Of course, we also help construct retrievers that we think are useful. The main type of retriever we focus on is a vector store retriever, and that is what the rest of this guide covers.

To understand what a vector store retriever is, it's important to understand what a vector store is. So let's look at that.

By default, LangChain uses Chroma as the vector store to index and search embeddings. To walk through this tutorial, we first need to install chromadb.

pip install chromadb

This example showcases question answering over documents. We have chosen this as the getting-started example because it nicely combines a lot of different elements (text splitters, embeddings, vector stores) and shows how to use them in a chain.

Question answering over documents consists of four steps:

  1. Create an index
  2. Create a retriever from that index
  3. Create a question answering chain
  4. Ask questions!

Each of these steps has multiple substeps and potential configurations. In this tutorial we will focus primarily on (1). We will start by showing the one-liner for doing so, and then break down what is actually going on.

First, let's import some common classes we'll use no matter what.

from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

Next, in the generic setup, let's specify the document loader we want to use. You can download the state_of_the_union.txt file here.

from langchain.document_loaders import TextLoader
loader = TextLoader('../state_of_the_union.txt', encoding='utf8')

One Line Index Creation

To get started as quickly as possible, we can use the VectorstoreIndexCreator.

from langchain.indexes import VectorstoreIndexCreator

index = VectorstoreIndexCreator().from_loaders([loader])
    Running Chroma using direct local API.
    Using DuckDB in-memory for database. Data will be transient.

Now that the index is created, we can use it to ask questions of the data! Note that under the hood this is actually doing a few steps as well, which we will cover later in this guide.

query = "What did the president say about Ketanji Brown Jackson"
index.query(query)
    " The president said that Ketanji Brown Jackson is one of the nation's top legal minds, a former top litigator in private practice, a former federal public defender, and from a family of public school educators and police officers. He also said that she is a consensus builder and has received a broad range of support from the Fraternal Order of Police to former judges appointed by Democrats and Republicans."
query = "What did the president say about Ketanji Brown Jackson"
index.query_with_sources(query)
    {'question': 'What did the president say about Ketanji Brown Jackson',
     'answer': " The president said that he nominated Circuit Court of Appeals Judge Ketanji Brown Jackson, one of the nation's top legal minds, to continue Justice Breyer's legacy of excellence, and that she has received a broad range of support from the Fraternal Order of Police to former judges appointed by Democrats and Republicans.\n",
     'sources': '../state_of_the_union.txt'}

What is returned from the VectorstoreIndexCreator is a VectorStoreIndexWrapper, which provides these convenient query and query_with_sources functionalities. If we just wanted to access the vector store directly, we can also do that.

index.vectorstore
    <langchain.vectorstores.chroma.Chroma at 0x119aa5940>

If we then want to access the VectorstoreRetriever, we can do that with:

index.vectorstore.as_retriever()
    VectorStoreRetriever(vectorstore=<langchain.vectorstores.chroma.Chroma object at 0x119aa5940>, search_kwargs={})
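
Any retriever obtained this way supports the get_relevant_documents call shown earlier. A short sketch reusing the index we just built:

# Expose the vector store as a retriever and fetch documents relevant to a query
retriever = index.vectorstore.as_retriever()
docs = retriever.get_relevant_documents("What did the president say about Ketanji Brown Jackson")
print(len(docs))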

Walkthrough

Okay, so what's actually going on? How is this index being created?

A lot of the magic is hidden in this VectorstoreIndexCreator. What is it doing?

After the documents are loaded, there are three main steps:

  1. Splitting documents into chunks
  2. Creating embeddings for each document
  3. Storing documents and embeddings in a vector store

Let's walk through this in code.

documents = loader.load()

Next, we will split the documents into chunks.

from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
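
Before creating embeddings, it can be worth a quick sanity check on the split. A small sketch using only the objects defined so far:

# How many chunks did the splitter produce, and what does the first one look like?
print(len(texts))
print(texts[0].page_content[:200])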

We will then select which embeddings we want to use.

from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
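
If you are curious what an embedding actually is, you can embed a string by hand. A minimal sketch (embed_query is the standard Embeddings method; the vector length simply reflects the model's dimensionality):

# Embed a single query string; the result is a plain list of floats
vector = embeddings.embed_query("What did the president say about Ketanji Brown Jackson")
print(len(vector))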

We now create the vector store to use as the index.

from langchain.vectorstores import Chroma
db = Chroma.from_documents(texts, embeddings)
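
Before wrapping the store in a retriever, you can also query it directly. A brief sketch (similarity_search is the generic vector store method; the k value here is just an illustrative choice):

# Return the 4 chunks whose embeddings are closest to the query embedding
docs = db.similarity_search("What did the president say about Ketanji Brown Jackson", k=4)
print(docs[0].page_content[:200])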

So that's creating the index. Then we expose this index in a retriever interface.

retriever = db.as_retriever()
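
The retriever defaults to plain similarity search. As a hedged sketch of the common VectorStoreRetriever options (the arguments below are assumptions for illustration, not something this walkthrough requires), you can also limit how many chunks come back:

# Alternative retriever that only returns the top 2 most similar chunks
top_k_retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 2})
docs = top_k_retriever.get_relevant_documents("What did the president say about Ketanji Brown Jackson")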

Then, as before, we create a chain and use it to answer questions!

qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=retriever)
query = "What did the president say about Ketanji Brown Jackson"
qa.run(query)
    " The President said that Judge Ketanji Brown Jackson is one of the nation's top legal minds, a former top litigator in private practice, a former federal public defender, and from a family of public school educators and police officers. He said she is a consensus builder and has received a broad range of support from organizations such as the Fraternal Order of Police and former judges appointed by Democrats and Republicans."

VectorstoreIndexCreator is just a wrapper around all of this logic. It is configurable in the text splitter, embeddings, and vector store it uses. For example, you can configure it as below:

index_creator = VectorstoreIndexCreator(
    vectorstore_cls=Chroma,
    embedding=OpenAIEmbeddings(),
    text_splitter=CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
)
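
The configured creator is then used exactly like the one-liner above; a short follow-up sketch reusing the loader defined earlier:

# Build the index with the customized splitter, embeddings, and vector store
index = index_creator.from_loaders([loader])
index.query("What did the president say about Ketanji Brown Jackson")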

Hopefully this highlights what is going on under the hood of VectorstoreIndexCreator. While we think it's important to have a simple way to create indexes, we also think it's important to understand what's happening under the hood.