Question Answering with Activeloop's Deep Lake
In this tutorial, we are going to use LangChain + Activeloop's Deep Lake with GPT-4 to semantically search and ask questions over a group chat.
View a working demo here.
1. Install required packages
python3 -m pip install --upgrade langchain 'deeplake[enterprise]' openai tiktoken
2. Add API keys
import os
import getpass
from langchain.document_loaders import PyPDFLoader, TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import (
RecursiveCharacterTextSplitter,
CharacterTextSplitter,
)
from langchain.vectorstores import DeepLake
from langchain.chains import ConversationalRetrievalChain, RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.llms import OpenAI
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
activeloop_token = getpass.getpass("Activeloop Token:")
os.environ["ACTIVELOOP_TOKEN"] = activeloop_token
os.environ["ACTIVELOOP_ORG"] = getpass.getpass("Activeloop Org:")
org_id = os.environ["ACTIVELOOP_ORG"]
embeddings = OpenAIEmbeddings()
dataset_path = "hub://" + org_id + "/data"
3. Create sample data
You can use ChatGPT to generate a sample group chat conversation with three friends talking about their day. Reference real places and fictional names, and make it as funny and detailed as possible.
I have already generated such a chat in messages.txt. We can keep it simple and use it as our example.
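If you do not want to generate a chat log yourself, a minimal sketch that writes a small placeholder messages.txt so the rest of the tutorial can run end to end (the names and dialogue below are invented for illustration; only "The Hungry Lobster" is referenced later in this tutorial):

```python
# Write a small placeholder chat log. The speakers and messages are made up
# purely so the ingestion and QA steps below have something to work with.
sample_chat = """\
Alice: Just got back from The Hungry Lobster, best seafood in town!
Bob: No way, I spent the whole afternoon at Golden Gate Park.
Carol: You two have all the fun. I was stuck debugging at the office.
Alice: Come with us next time, Carol. The lobster rolls are unreal.
"""

with open("messages.txt", "w") as f:
    f.write(sample_chat)
```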
4. Ingest chat embeddings
We load the messages from the text file, chunk them, and upload them to the Activeloop vector store.
with open("messages.txt") as f:
state_of_the_union = f.read()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
pages = text_splitter.split_text(state_of_the_union)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
texts = text_splitter.create_documents(pages)
print(texts)
dataset_path = "hub://" + org_id + "/data"
embeddings = OpenAIEmbeddings()
db = DeepLake.from_documents(
texts, embeddings, dataset_path=dataset_path, overwrite=True
)
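Conceptually, the splitter settings above produce fixed-size chunks whose neighbours share `chunk_overlap` characters, so content near a boundary appears in both chunks. A simplified pure-Python sketch of that overlap behaviour (the real `RecursiveCharacterTextSplitter` additionally prefers to break on separators such as newlines rather than at exact character offsets):

```python
def chunk_with_overlap(text, chunk_size=1000, chunk_overlap=100):
    """Naive character chunking: consecutive chunks share `chunk_overlap`
    characters, mirroring the splitter settings used above."""
    step = chunk_size - chunk_overlap
    return [
        text[i : i + chunk_size]
        for i in range(0, max(len(text) - chunk_overlap, 1), step)
    ]

# 2500 characters of varied text -> 3 chunks of at most 1000 characters each,
# where each chunk's last 100 characters repeat as the next chunk's first 100.
text = "".join(str(i % 10) for i in range(2500))
chunks = chunk_with_overlap(text)
```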
Optional: You can also use Deep Lake's Managed Tensor Database as a hosted service and run queries there. To do so, specify the runtime parameter as {'tensor_db': True} when creating the vector store. This configuration enables queries to be executed on the Managed Tensor Database rather than on the client. Note that this functionality is not available for datasets stored locally or in memory. If a vector store has already been created outside the Managed Tensor Database, it can be transferred into it by following the prescribed steps.
# with open("messages.txt") as f:
# state_of_the_union = f.read()
# text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
# pages = text_splitter.split_text(state_of_the_union)
# text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
# texts = text_splitter.create_documents(pages)
# print(texts)
# dataset_path = "hub://" + org_id + "/data"
# embeddings = OpenAIEmbeddings()
# db = DeepLake.from_documents(
# texts, embeddings, dataset_path=dataset_path, overwrite=True, runtime={"tensor_db": True}
# )
5. Ask questions
Now we can ask questions and get answers back through semantic search:
db = DeepLake(dataset_path=dataset_path, read_only=True, embedding_function=embeddings)
retriever = db.as_retriever()
retriever.search_kwargs["distance_metric"] = "cos"
retriever.search_kwargs["k"] = 4
qa = RetrievalQA.from_chain_type(
llm=OpenAI(), chain_type="stuff", retriever=retriever, return_source_documents=False
)
# What was the restaurant the group was talking about called?
query = input("Enter query:")
# The Hungry Lobster
ans = qa({"query": query})
print(ans)
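The retriever above is configured with distance_metric "cos", meaning chunks are ranked by the cosine distance between the query embedding and each chunk embedding. A minimal pure-Python sketch of the underlying similarity measure (independent of Deep Lake, for illustration only):

```python
import math

def cosine_similarity(a, b):
    """Dot product of two vectors normalised by their magnitudes:
    1.0 for identical directions, 0.0 for orthogonal vectors.
    Cosine distance is then 1 - similarity, so smaller is closer."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```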