DocArray
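DocArray is a versatile, open-source tool for managing multimodal data. It lets you shape your data however you want and offers the flexibility to store and search it using various document index backends. Better still, you can use your DocArray document index to create a DocArrayRetriever and build LangChain applications on top of it.
This notebook is split into two sections. The first introduces all five supported document index backends: it shows how to set up and index each one, and how to build a DocArrayRetriever for finding relevant documents. The second picks one of these backends and walks through a basic example of using it.
Document Index Backends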
from langchain.retrievers import DocArrayRetriever
from docarray import BaseDoc
from docarray.typing import NdArray
import numpy as np
from langchain.embeddings import FakeEmbeddings
import random
embeddings = FakeEmbeddings(size=32)
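Before you build an index, you need to define your document schema. The schema determines which fields your documents have and what type of data each field holds.
For this demo, we create a somewhat arbitrary schema containing 'title' (str), 'title_embedding' (numpy array), 'year' (int), and 'color' (str).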
class MyDoc(BaseDoc):
    title: str
    title_embedding: NdArray[32]
    year: int
    color: str
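To illustrate why the schema matters, here is a minimal standard-library sketch of the same four fields with an explicit length check on the embedding. `PlainDoc` and `EMBEDDING_DIM` are hypothetical names introduced for this illustration only; DocArray performs its own validation, and this is just an analogy for what a fixed dimension of 32 implies.

```python
from dataclasses import dataclass
from typing import List

# Assumed to match NdArray[32] and FakeEmbeddings(size=32) from the cells above.
EMBEDDING_DIM = 32


@dataclass
class PlainDoc:
    """Hypothetical stand-in for MyDoc, written without DocArray."""

    title: str
    title_embedding: List[float]
    year: int
    color: str

    def __post_init__(self) -> None:
        # Reject embeddings whose length differs from the declared dimension.
        if len(self.title_embedding) != EMBEDDING_DIM:
            raise ValueError(
                f"expected {EMBEDDING_DIM} values, got {len(self.title_embedding)}"
            )


ok = PlainDoc("My document 0", [0.0] * EMBEDDING_DIM, 0, "red")

try:
    PlainDoc("bad", [0.0] * 8, 1, "blue")
    rejected = False
except ValueError:
    rejected = True
```

The embedding model and the schema have to agree on the vector size, which is why both use 32 in this notebook.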
InMemoryExactNNIndex
InMemoryExactNNIndex stores all documents in memory. It is a great starting point for small datasets, where you may not want to launch a database server.
Learn more here: https://docs.docarray.org/user_guide/storing/index_in_memory/
from docarray.index import InMemoryExactNNIndex
# initialize the index
db = InMemoryExactNNIndex[MyDoc]()
# index data
db.index(
    [
        MyDoc(
            title=f"My document {i}",
            title_embedding=embeddings.embed_query(f"query {i}"),
            year=i,
            color=random.choice(["red", "green", "blue"]),
        )
        for i in range(100)
    ]
)
# optionally, you can create a filter query
filter_query = {"year": {"$lte": 90}}
# create a retriever
retriever = DocArrayRetriever(
    index=db,
    embeddings=embeddings,
    search_field="title_embedding",
    content_field="title",
    filters=filter_query,
)
# find the relevant document
doc = retriever.get_relevant_documents("some query")
print(doc)
[Document(page_content='My document 56', metadata={'id': '1f33e58b6468ab722f3786b96b20afe6', 'year': 56, 'color': 'red'})]
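To make "exact nearest neighbor" concrete, the following sketch scores a query against every stored vector and keeps the best matches. It is a simplified illustration, not DocArray's implementation: the function names are hypothetical, and cosine similarity is an assumed choice of metric.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_search(
    query: Sequence[float], vectors: List[Sequence[float]], top_k: int = 1
) -> List[Tuple[int, float]]:
    """Brute-force search: compare the query with every vector, best first."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
hits = exact_search([0.9, 0.1], vectors, top_k=2)
print([index for index, _ in hits])  # [0, 2]
```

Because every vector is compared, the cost grows linearly with the number of documents, which is why this approach suits small datasets.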
HnswDocumentIndex
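HnswDocumentIndex is a lightweight document index implementation that runs entirely locally and is best suited for small to medium-sized datasets. It stores vectors on disk in hnswlib and keeps all other data in SQLite.
Learn more here: https://docs.docarray.org/user_guide/storing/index_hnswlib/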
from docarray.index import HnswDocumentIndex
# initialize the index
db = HnswDocumentIndex[MyDoc](work_dir="hnsw_index")
# index data
db.index(
    [
        MyDoc(
            title=f"My document {i}",
            title_embedding=embeddings.embed_query(f"query {i}"),
            year=i,
            color=random.choice(["red", "green", "blue"]),
        )
        for i in range(100)
    ]
)
# optionally, you can create a filter query
filter_query = {"year": {"$lte": 90}}
# create a retriever
retriever = DocArrayRetriever(
    index=db,
    embeddings=embeddings,
    search_field="title_embedding",
    content_field="title",
    filters=filter_query,
)
# find the relevant document
doc = retriever.get_relevant_documents("some query")
print(doc)
[Document(page_content='My document 28', metadata={'id': 'ca9f3f4268eec7c97a7d6e77f541cb82', 'year': 28, 'color': 'red'})]
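The filter {"year": {"$lte": 90}} used above reads as "keep documents whose year is at most 90". The sketch below evaluates that style of filter against plain dictionaries. It covers only a handful of comparison operators and is an illustration of the semantics, not the backend's actual filter engine; `OPERATORS` and `matches` are hypothetical names.

```python
import operator
from typing import Any, Dict

# Subset of comparison operators, assumed for illustration.
OPERATORS = {
    "$eq": operator.eq,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$gt": operator.gt,
    "$gte": operator.ge,
}


def matches(record: Dict[str, Any], filter_query: Dict[str, Dict[str, Any]]) -> bool:
    """Return True when the record satisfies every condition in the filter."""
    for field, conditions in filter_query.items():
        for op_name, expected in conditions.items():
            if not OPERATORS[op_name](record[field], expected):
                return False
    return True


records = [{"title": f"My document {i}", "year": i} for i in range(100)]
kept = [r for r in records if matches(r, {"year": {"$lte": 90}})]
print(len(kept))  # 91
```

With years 0 through 99 indexed, the filter leaves the 91 documents for years 0 through 90 as candidates for the vector search.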
WeaviateDocumentIndex
WeaviateDocumentIndex is a document index built on top of the Weaviate vector database.
Learn more here: https://docs.docarray.org/user_guide/storing/index_weaviate/
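# There is one small difference between the Weaviate backend and the others.
# Here, you need to "mark" the field used for vector search with 'is_embedding=True'.
# So, let's create a new schema for Weaviate that meets this requirement.
from pydantic import Field
class WeaviateDoc(BaseDoc):
    title: str
    title_embedding: NdArray[32] = Field(is_embedding=True)
    year: int
    color: str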
from docarray.index import WeaviateDocumentIndex
# initialize the index
dbconfig = WeaviateDocumentIndex.DBConfig(host="http://localhost:8080")
db = WeaviateDocumentIndex[WeaviateDoc](db_config=dbconfig)
# index data, using the Weaviate-specific schema defined above
db.index(
    [
        WeaviateDoc(
            title=f"My document {i}",
            title_embedding=embeddings.embed_query(f"query {i}"),
            year=i,
            color=random.choice(["red", "green", "blue"]),
        )
        for i in range(100)
    ]
)
# optionally, you can create a filter query
filter_query = {"path": ["year"], "operator": "LessThanEqual", "valueInt": "90"}
# create a retriever
retriever = DocArrayRetriever(
    index=db,
    embeddings=embeddings,
    search_field="title_embedding",
    content_field="title",
    filters=filter_query,
)
# find the relevant document
doc = retriever.get_relevant_documents("some query")
print(doc)
[Document(page_content='My document 17', metadata={'id': '3a5b76e85f0d0a01785dc8f9d965ce40', 'year': 17, 'color': 'red'})]
ElasticDocIndex
ElasticDocIndex is a document index built on top of ElasticSearch.
Learn more here: https://docs.docarray.org/user_guide/storing/index_elastic/
from docarray.index import ElasticDocIndex
# initialize the index
db = ElasticDocIndex[MyDoc](
    hosts="http://localhost:9200", index_name="docarray_retriever"
)
# index data
db.index(
    [
        MyDoc(
            title=f"My document {i}",
            title_embedding=embeddings.embed_query(f"query {i}"),
            year=i,
            color=random.choice(["red", "green", "blue"]),
        )
        for i in range(100)
    ]
)
# optionally, you can create a filter query
filter_query = {"range": {"year": {"lte": 90}}}
# create a retriever
retriever = DocArrayRetriever(
    index=db,
    embeddings=embeddings,
    search_field="title_embedding",
    content_field="title",
    filters=filter_query,
)
# find the relevant document
doc = retriever.get_relevant_documents("some query")
print(doc)
[Document(page_content='My document 46', metadata={'id': 'edbc721bac1c2ad323414ad1301528a4', 'year': 46, 'color': 'green'})]
QdrantDocumentIndex
QdrantDocumentIndex is a document index built on top of the Qdrant vector database.
Learn more here: https://docs.docarray.org/user_guide/storing/index_qdrant/
from docarray.index import QdrantDocumentIndex
from qdrant_client.http import models as rest
# initialize the index
qdrant_config = QdrantDocumentIndex.DBConfig(path=":memory:")
db = QdrantDocumentIndex[MyDoc](qdrant_config)
# index data
db.index(
    [
        MyDoc(
            title=f"My document {i}",
            title_embedding=embeddings.embed_query(f"query {i}"),
            year=i,
            color=random.choice(["red", "green", "blue"]),
        )
        for i in range(100)
    ]
)
# optionally, you can create a filter query
filter_query = rest.Filter(
    must=[
        rest.FieldCondition(
            key="year",
            range=rest.Range(
                gte=10,
                lt=90,
            ),
        )
    ]
)
WARNING:root:Payload indexes have no effect in the local Qdrant. Please use server Qdrant if you need payload indexes.
# create a retriever
retriever = DocArrayRetriever(
    index=db,
    embeddings=embeddings,
    search_field="title_embedding",
    content_field="title",
    filters=filter_query,
)
# find the relevant document
doc = retriever.get_relevant_documents("some query")
print(doc)
[Document(page_content='My document 80', metadata={'id': '97465f98d0810f1f330e4ecc29b13d20', 'year': 80, 'color': 'blue'})]
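Each backend above expects its own filter syntax, which is easy to mix up when switching between them. As a quick reference, the sketch below collects the dictionary-based forms of "year is at most N" shown in the previous sections into one hypothetical helper, `year_at_most`. The backend keys ("in_memory", "hnsw", "weaviate", "elastic") are labels invented for this illustration. Qdrant is left out because its filter is built from `qdrant_client` model objects rather than a plain dictionary.

```python
from typing import Any, Dict


def year_at_most(backend: str, max_year: int) -> Dict[str, Any]:
    """Build a 'year <= max_year' filter in the syntax each example above uses."""
    if backend in ("in_memory", "hnsw"):
        # MongoDB-style operator syntax.
        return {"year": {"$lte": max_year}}
    if backend == "weaviate":
        # The Weaviate example above passes the value as a string; kept as-is here.
        return {
            "path": ["year"],
            "operator": "LessThanEqual",
            "valueInt": str(max_year),
        }
    if backend == "elastic":
        # ElasticSearch range query.
        return {"range": {"year": {"lte": max_year}}}
    raise ValueError(f"no dictionary filter sketched for backend {backend!r}")


for name in ("in_memory", "hnsw", "weaviate", "elastic"):
    print(name, year_at_most(name, 90))
```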
Movie Retrieval using HnswDocumentIndex
movies = [
    {
        "title": "Inception",
        "description": "A thief who steals corporate secrets through the use of dream-sharing technology is given the task of planting an idea into the mind of a CEO.",
        "director": "Christopher Nolan",
        "rating": 8.8,
    },
    {
        "title": "The Dark Knight",
        "description": "When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice.",
        "director": "Christopher Nolan",
        "rating": 9.0,
    },
    {
        "title": "Interstellar",
        "description": "Interstellar explores the boundaries of human exploration as a group of astronauts venture through a wormhole in space. In their quest to ensure the survival of humanity, they confront the vastness of space-time and grapple with love and sacrifice.",
        "director": "Christopher Nolan",
        "rating": 8.6,
    },
    {
        "title": "Pulp Fiction",
        "description": "The lives of two mob hitmen, a boxer, a gangster's wife, and a pair of diner bandits intertwine in four tales of violence and redemption.",
        "director": "Quentin Tarantino",
        "rating": 8.9,
    },
    {
        "title": "Reservoir Dogs",
        "description": "When a simple jewelry heist goes horribly wrong, the surviving criminals begin to suspect that one of them is a police informant.",
        "director": "Quentin Tarantino",
        "rating": 8.3,
    },
    {
        "title": "The Godfather",
        "description": "An aging patriarch of an organized crime dynasty transfers control of his empire to his reluctant son.",
        "director": "Francis Ford Coppola",
        "rating": 9.2,
    },
]
import getpass
import os
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
OpenAI API Key: ········
from docarray import BaseDoc, DocList
from docarray.typing import NdArray
from langchain.embeddings.openai import OpenAIEmbeddings
# define schema for your movie documents
class MyDoc(BaseDoc):
    title: str
    description: str
    description_embedding: NdArray[1536]
    rating: float
    director: str
embeddings = OpenAIEmbeddings()
# get the "description" embeddings and create the documents
docs = DocList[MyDoc](
    [
        MyDoc(
            description_embedding=embeddings.embed_query(movie["description"]), **movie
        )
        for movie in movies
    ]
)
from docarray.index import HnswDocumentIndex
# initialize the index
db = HnswDocumentIndex[MyDoc](work_dir="movie_search")
# add data
db.index(docs)
Normal Retriever
from langchain.retrievers import DocArrayRetriever
# create a retriever
retriever = DocArrayRetriever(
    index=db,
    embeddings=embeddings,
    search_field="description_embedding",
    content_field="description",
)
# find the relevant document
doc = retriever.get_relevant_documents("movie about dreams")
print(doc)
[Document(page_content='A thief who steals corporate secrets through the use of dream-sharing technology is given the task of planting an idea into the mind of a CEO.', metadata={'id': 'f1649d5b6776db04fec9a116bbb6bbe5', 'title': 'Inception', 'rating': 8.8, 'director': 'Christopher Nolan'})]
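In the output above, the `content_field` value becomes `page_content`, while the remaining simple fields land in `metadata` and the embedding vector is left out. The sketch below reproduces that shape with plain dictionaries. It is an assumption inferred from the printed result rather than the retriever's confirmed logic, and `to_page_and_metadata`, the shortened description, and the `id` value are hypothetical.

```python
from typing import Any, Dict, Tuple


def to_page_and_metadata(
    record: Dict[str, Any], content_field: str
) -> Tuple[str, Dict[str, Any]]:
    """Split a record into page content and scalar-only metadata."""
    metadata = {
        name: value
        for name, value in record.items()
        # Assumed rule: keep scalar fields only, and skip the content field itself.
        if name != content_field and isinstance(value, (str, int, float, bool))
    }
    return record[content_field], metadata


record = {
    "id": "example-id",  # hypothetical identifier
    "title": "Inception",
    "description": "A thief plants an idea in a CEO's mind.",  # shortened for the sketch
    "description_embedding": [0.1, 0.2, 0.3],
    "rating": 8.8,
    "director": "Christopher Nolan",
}

page_content, metadata = to_page_and_metadata(record, content_field="description")
print(sorted(metadata))  # ['director', 'id', 'rating', 'title']
```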
Retriever with Filters
from langchain.retrievers import DocArrayRetriever
# create a retriever
retriever = DocArrayRetriever(
    index=db,
    embeddings=embeddings,
    search_field="description_embedding",
    content_field="description",
    filters={"director": {"$eq": "Christopher Nolan"}},
    top_k=2,
)
# find relevant documents
docs = retriever.get_relevant_documents("space travel")
print(docs)
[Document(page_content='Interstellar explores the boundaries of human exploration as a group of astronauts venture through a wormhole in space. In their quest to ensure the survival of humanity, they confront the vastness of space-time and grapple with love and sacrifice.', metadata={'id': 'ab704cc7ae8573dc617f9a5e25df022a', 'title': 'Interstellar', 'rating': 8.6, 'director': 'Christopher Nolan'}), Document(page_content='A thief who steals corporate secrets through the use of dream-sharing technology is given the task of planting an idea into the mind of a CEO.', metadata={'id': 'f1649d5b6776db04fec9a116bbb6bbe5', 'title': 'Inception', 'rating': 8.8, 'director': 'Christopher Nolan'})]
Retriever with MMR Search
from langchain.retrievers import DocArrayRetriever
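# create a retriever
retriever = DocArrayRetriever(
    index=db,
    embeddings=embeddings,
    search_field="description_embedding",
    content_field="description",
    filters={"rating": {"$gte": 8.7}},
    search_type="mmr",
    top_k=3,
)
# find relevant documents
docs = retriever.get_relevant_documents("action movies")
print(docs)
[Document(page_content="The lives of two mob hitmen, a boxer, a gangster's wife, and a pair of diner bandits intertwine in four tales of violence and redemption.", metadata={'id': 'e6aa313bbde514e23fbc80ab34511afd', 'title': 'Pulp Fiction', 'rating': 8.9, 'director': 'Quentin Tarantino'}), Document(page_content='A thief who steals corporate secrets through the use of dream-sharing technology is given the task of planting an idea into the mind of a CEO.', metadata={'id': 'f1649d5b6776db04fec9a116bbb6bbe5', 'title': 'Inception', 'rating': 8.8, 'director': 'Christopher Nolan'}), Document(page_content='When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice.', metadata={'id': '91dec17d4272041b669fd113333a65f7', 'title': 'The Dark Knight', 'rating': 9.0, 'director': 'Christopher Nolan'})]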
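MMR (maximal marginal relevance) trades relevance to the query against similarity to results that have already been picked, so near-duplicates are pushed down. The following is a minimal sketch of the general technique on toy 2-D vectors. The function names and the `lambda_mult` parameter are assumptions for illustration, and the retriever's internal implementation may differ.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(
    query: Sequence[float],
    candidates: List[Sequence[float]],
    top_k: int,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Greedy MMR: return candidate indices in the order they are selected."""
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < top_k:
        best_idx, best_score = remaining[0], -math.inf
        for i in remaining:
            relevance = cosine(query, candidates[i])
            # Highest similarity to anything already selected (0.0 if nothing yet).
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance - (1.0 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


query = [1.0, 0.2]
candidates = [[1.0, 0.1], [1.0, 0.12], [0.5, 0.9]]  # the first two are near-duplicates

print(mmr(query, candidates, top_k=2, lambda_mult=0.5))  # [1, 2]
print(mmr(query, candidates, top_k=2, lambda_mult=1.0))  # [1, 0]
```

With `lambda_mult=1.0` the ranking is purely by relevance, so both near-duplicates are returned. With 0.5, the second pick is the more distinct vector.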