Vectara

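Vectara is an API platform for building GenAI applications. It provides an easy-to-use API for document indexing and querying that is managed by Vectara and optimized for performance and accuracy. See the Vectara API documentation for more information on how to use the API.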

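This notebook shows how to use the functionality of the Vectara integration with LangChain. Note that, unlike many other integrations in this category, Vectara provides an end-to-end managed service for Grounded Generation (also known as retrieval-augmented generation), which includes:

  1. A way to extract text from document files and chunk it into sentences.
  2. Its own embedding model and vector store: each text segment is encoded into a vector embedding and stored in Vectara's internal vector store.
  3. A query service that automatically encodes the query into an embedding and retrieves the most relevant text segments (including support for hybrid search).

All of these are supported in this LangChain integration.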

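Setup

You need a Vectara account to use Vectara with LangChain. To get started, follow these steps:

  1. If you don't already have one, sign up for a Vectara account. Once you have completed sign-up, you will have a Vectara customer ID. You can find your customer ID by clicking your name at the top right of the Vectara console window.
  2. Within your account you can create one or more corpora. Each corpus represents an area that stores text data ingested from input documents. To create a corpus, use the "Create Corpus" button, then provide a name and a description for it. Optionally, you can define filtering attributes and apply some advanced options. If you click on the corpus you created, you can see its name and corpus ID at the top.
  3. Next, you need to create an API key to access the corpus. In the corpus view, click the "Authorization" tab and then the "Create API Key" button. Give your key a name and choose whether it is a query-only or a query+index key. Click "Create" and you now have an active API key. Keep this key confidential.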

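To use LangChain with Vectara, you need three values: the customer ID, the corpus ID, and the API key. You can provide them to LangChain in two ways:

  1. Include these three variables in your environment: VECTARA_CUSTOMER_ID, VECTARA_CORPUS_ID, and VECTARA_API_KEY.

For example, you can set these variables using os.environ and getpass as follows: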

import os
import getpass

os.environ["VECTARA_CUSTOMER_ID"] = getpass.getpass("Vectara Customer ID:")
os.environ["VECTARA_CORPUS_ID"] = getpass.getpass("Vectara Corpus ID:")
os.environ["VECTARA_API_KEY"] = getpass.getpass("Vectara API Key:")
  2. Pass them as arguments to the Vectara vectorstore constructor:

vectorstore = Vectara(
    vectara_customer_id=vectara_customer_id,
    vectara_corpus_id=vectara_corpus_id,
    vectara_api_key=vectara_api_key
)
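
If the values live in the environment, it can help to check that all three are present before building the vector store. The following is a minimal sketch under that assumption; `load_vectara_credentials` is a hypothetical helper and not part of LangChain or Vectara. It only maps the three environment variable names listed above onto the constructor keyword names shown above.

```python
import os

# Environment variable names from the setup instructions.
REQUIRED_VARS = ("VECTARA_CUSTOMER_ID", "VECTARA_CORPUS_ID", "VECTARA_API_KEY")


def load_vectara_credentials(env=None):
    """Hypothetical helper: return the three values as constructor kwargs.

    Raises ValueError naming every variable that is missing or empty.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError("Missing Vectara settings: " + ", ".join(missing))
    # VECTARA_CUSTOMER_ID -> vectara_customer_id, and so on.
    return {name.lower(): env[name] for name in REQUIRED_VARS}
```

With the variables set, the result could be unpacked into the constructor, for example `Vectara(**load_vectara_credentials())`.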

Connecting to Vectara from LangChain

First, let's ingest documents using the from_documents() method. We assume here that you have added VECTARA_CUSTOMER_ID, VECTARA_CORPUS_ID, and a query+indexing VECTARA_API_KEY as environment variables.

from langchain.embeddings import FakeEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Vectara
from langchain.document_loaders import TextLoader

# Load the speech and split it into chunks of roughly 1000 characters.
loader = TextLoader("../../../state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# Vectara uses its own embedding model (see above); FakeEmbeddings is a placeholder.
vectara = Vectara.from_documents(
    docs,
    embedding=FakeEmbeddings(size=768),
    doc_metadata={"speech": "state-of-the-union"},
)
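
To make the chunking step concrete, here is a simplified sketch of paragraph-based splitting with a size cap and no overlap. It is not LangChain's CharacterTextSplitter implementation; the function name `split_by_paragraph` and the blank-line separator are assumptions made for illustration.

```python
def split_by_paragraph(text, chunk_size=1000, separator="\n\n"):
    """Greedily merge paragraphs into chunks of at most chunk_size characters.

    A single paragraph longer than chunk_size is kept whole rather than cut.
    """
    chunks = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            # Still fits (or nothing collected yet): keep growing this chunk.
            current = candidate
        else:
            # Adding the paragraph would exceed the cap: close the chunk.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk would correspond to one entry in `docs` above, which is why a larger `chunk_size` yields fewer, longer documents.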

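Vectara's indexing API provides a file upload API, where the file is handled directly by Vectara: it is pre-processed, chunked optimally, and added to the Vectara vector store. To use this, we added the add_files() method (as well as from_files()).

Let's see this in action. We pick two PDF documents to upload:

  1. The "I Have a Dream" speech by Dr. King
  2. Churchill's "We Shall Fight on the Beaches" speech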
import tempfile
import urllib.request

# Each entry pairs a PDF URL with the title stored as "speech" metadata.
urls = [
    [
        "https://www.gilderlehrman.org/sites/default/files/inline-pdfs/king.dreamspeech.excerpts.pdf",
        "I-have-a-dream",
    ],
    [
        "https://www.parkwayschools.net/cms/lib/MO01931486/Centricity/Domain/1578/Churchill_Beaches_Speech.pdf",
        "we shall fight on the beaches",
    ],
]

# Download each PDF to a temporary path.
files_list = []
for url, _ in urls:
    name = tempfile.NamedTemporaryFile().name
    urllib.request.urlretrieve(url, name)
    files_list.append(name)

# Upload the files; metadatas is aligned with files_list by position.
docsearch: Vectara = Vectara.from_files(
    files=files_list,
    embedding=FakeEmbeddings(size=768),
    metadatas=[{"url": url, "speech": title} for url, title in urls],
)
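
The download loop above keeps only the `.name` of a `NamedTemporaryFile`, so the temporary file itself is usually deleted once the object is discarded and `urlretrieve` then recreates a file at that path. A slightly more explicit variant is sketched below; `download_to_temp` and its `retrieve` parameter are hypothetical names introduced here so the helper can be exercised without network access.

```python
import os
import tempfile
import urllib.request


def download_to_temp(url, suffix=".pdf", retrieve=urllib.request.urlretrieve):
    """Download url into a temp file that persists until removed; return its path."""
    # delete=False keeps the file on disk after the handle is closed.
    handle = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    handle.close()  # only the reserved path is needed; retrieve reopens it
    retrieve(url, handle.name)
    return handle.name
```

The paths returned for each URL could be collected into `files_list` as before; since the files are no longer removed automatically, call `os.remove(path)` once the upload has finished.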

Similarity search

The simplest scenario for using Vectara is to perform a similarity search.

query = "What did the president say about Ketanji Brown Jackson"
found_docs = vectara.similarity_search(
    query, n_sentence_context=0, filter="doc.speech = 'state-of-the-union'"
)
print(found_docs[0].page_content)
    Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
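
The `filter` argument in the query above is a plain string of the form `doc.<field> = '<value>'`. When the field or value comes from a variable, a small helper can keep the quoting consistent. This is a minimal sketch: `metadata_filter` is a hypothetical helper, it covers only the single equality form used in this notebook, and it rejects values containing a single quote rather than assume an escaping rule.

```python
def metadata_filter(field, value):
    """Hypothetical helper: build a filter such as doc.speech = 'state-of-the-union'."""
    if not field.isidentifier():
        raise ValueError(f"unexpected field name: {field!r}")
    if "'" in value:
        # Escaping rules are not covered in this notebook, so refuse instead of guessing.
        raise ValueError("value must not contain a single quote")
    return f"doc.{field} = '{value}'"
```

The result can be passed directly, for example `filter=metadata_filter("speech", "state-of-the-union")`.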

Similarity search with score

Sometimes we want to perform the search and also obtain a relevance score, to know how good a particular result is.

query = "What did the president say about Ketanji Brown Jackson"
found_docs = vectara.similarity_search_with_score(
    query, filter="doc.speech = 'state-of-the-union'"
)
document, score = found_docs[0]
print(document.page_content)
print(f"\nScore: {score}")
    Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.

Score: 0.4917977

Now let's run a similarity search for content in the files we uploaded.

query = "We must forever conduct our struggle"
found_docs = vectara.similarity_search_with_score(
    query, filter="doc.speech = 'I-have-a-dream'"
)
print(found_docs[0])
print(found_docs[1])
    (Document(page_content='We must forever conduct our struggle on the high plane of dignity and discipline.', metadata={'section': '1'}), 0.7962591)
(Document(page_content='We must not allow our\ncreative protests to degenerate into physical violence. . . .', metadata={'section': '1'}), 0.25983918)
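
Because each result is a `(document, score)` pair, weak matches can be dropped with an ordinary list comprehension. The sketch below assumes a cutoff of 0.5, which is an arbitrary value chosen for illustration; `keep_confident` is a hypothetical helper, and plain strings stand in for the Document objects so the snippet runs on its own.

```python
def keep_confident(results, min_score):
    """Return only the (document, score) pairs whose score is at least min_score."""
    return [(doc, score) for doc, score in results if score >= min_score]


# Stand-in data: strings replace Document objects; scores are the two printed above.
sample_results = [
    ("We must forever conduct our struggle on the high plane of dignity and discipline.", 0.7962591),
    ("We must not allow our creative protests to degenerate into physical violence.", 0.25983918),
]
confident = keep_confident(sample_results, min_score=0.5)
```

With this cutoff only the first pair remains in `confident`; a suitable threshold depends on the corpus and the queries.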

Vectara as a retriever

Vectara, like all other LangChain vector stores, is most often used as a LangChain retriever:

retriever = vectara.as_retriever()
retriever
    VectaraRetriever(vectorstore=<langchain.vectorstores.vectara.Vectara object at 0x12772caf0>, search_type='similarity', search_kwargs={'lambda_val': 0.025, 'k': 5, 'filter': '', 'n_sentence_context': '0'})
query = "What did the president say about Ketanji Brown Jackson"
retriever.get_relevant_documents(query)[0]
    Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': '../../../state_of_the_union.txt'})