
QA using Activeloop's DeepLake

In this tutorial, we are going to use LangChain + Activeloop's Deep Lake with GPT-4 to semantically search and ask questions over a group chat.

View a working demo here.

1. Install required packages

python3 -m pip install --upgrade langchain 'deeplake[enterprise]' openai tiktoken

2. Add API keys

import os
import getpass
from langchain.document_loaders import PyPDFLoader, TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
)
from langchain.vectorstores import DeepLake
from langchain.chains import ConversationalRetrievalChain, RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.llms import OpenAI

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
activeloop_token = getpass.getpass("Activeloop Token:")
os.environ["ACTIVELOOP_TOKEN"] = activeloop_token
os.environ["ACTIVELOOP_ORG"] = getpass.getpass("Activeloop Org:")

org_id = os.environ["ACTIVELOOP_ORG"]
embeddings = OpenAIEmbeddings()

dataset_path = "hub://" + org_id + "/data"

3. Create sample data

You can generate a sample group chat conversation with ChatGPT using a prompt such as: generate a group chat conversation with three friends talking about their day, referencing real places and fictitious names. Make it as funny and detailed as possible.

I have already generated such a chat in messages.txt. We can keep it simple and use that as our example.
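
If you prefer to script this step instead of using the ChatGPT UI, a minimal sketch like the following should work with the ChatOpenAI class imported above (the model name, temperature, and exact prompt wording are assumptions, not part of the original tutorial):

from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage

# Hypothetical helper step: generate a sample chat and save it to messages.txt.
chat = ChatOpenAI(model_name="gpt-4", temperature=0.9)
prompt = (
    "Generate a group chat conversation with three friends talking about their day, "
    "referencing real places and fictitious names. Make it as funny and detailed as possible."
)
sample_chat = chat([HumanMessage(content=prompt)]).content

with open("messages.txt", "w") as f:
    f.write(sample_chat)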

4. Ingest chat embeddings

We load the messages from the text file, chunk them, and upload them to the Activeloop vector store.

with open("messages.txt") as f:
    state_of_the_union = f.read()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
pages = text_splitter.split_text(state_of_the_union)

# Split the raw chat into overlapping chunks suitable for embedding.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
texts = text_splitter.create_documents(pages)

print(texts)

# Embed the chunks and write them to the Deep Lake dataset.
dataset_path = "hub://" + org_id + "/data"
embeddings = OpenAIEmbeddings()
db = DeepLake.from_documents(
    texts, embeddings, dataset_path=dataset_path, overwrite=True
)
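
Before wiring up a QA chain, you can sanity-check the ingestion with a direct similarity search on the vector store we just created (the query string is only an example, reused from the question asked later in this tutorial):

# Quick sanity check: fetch the chunks most similar to a test query.
results = db.similarity_search(
    "What was the restaurant the group was talking about called?", k=2
)
for doc in results:
    print(doc.page_content[:200])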

Optional: You can also use Deep Lake's Managed Tensor Database as a hosting service and run queries there. To do so, specify the runtime parameter as {'tensor_db': True} when creating the vector store. This configuration enables queries to be executed on the Managed Tensor Database rather than on the client side. Note that this functionality is not applicable to datasets stored locally or in memory. If a vector store has already been created outside of the Managed Tensor Database, it can be transferred to the Managed Tensor Database by following the prescribed steps.

# with open("messages.txt") as f:
#     state_of_the_union = f.read()
# text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
# pages = text_splitter.split_text(state_of_the_union)

# text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
# texts = text_splitter.create_documents(pages)

# print(texts)

# dataset_path = "hub://" + org_id + "/data"
# embeddings = OpenAIEmbeddings()
# db = DeepLake.from_documents(
#     texts, embeddings, dataset_path=dataset_path, overwrite=True, runtime="tensor_db"
# )

5. Ask questions

Now we can ask a question and get an answer back with a semantic search:

db = DeepLake(dataset_path=dataset_path, read_only=True, embedding_function=embeddings)

retriever = db.as_retriever()
retriever.search_kwargs["distance_metric"] = "cos"
retriever.search_kwargs["k"] = 4

qa = RetrievalQA.from_chain_type(
    llm=OpenAI(), chain_type="stuff", retriever=retriever, return_source_documents=False
)

# What was the restaurant the group was talking about called?
query = input("Enter query:")

# The Hungry Lobster
ans = qa({"query": query})

print(ans)
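
The imports above also pull in ConversationalRetrievalChain and ChatOpenAI, which are not used above. As a hedged sketch, here is one way you might combine them with the same retriever to ask follow-up questions that carry chat history (the model name and question are illustrative, not part of the original tutorial):

# Optional sketch: a conversational variant that keeps chat history between questions.
conv_qa = ConversationalRetrievalChain.from_llm(
    ChatOpenAI(model_name="gpt-3.5-turbo"), retriever=retriever
)

chat_history = []
question = "What was the restaurant the group was talking about called?"
result = conv_qa({"question": question, "chat_history": chat_history})
chat_history.append((question, result["answer"]))
print(result["answer"])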