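Annoy
Annoy (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings for finding points in space that are close to a given query point. It also creates large, read-only, file-based data structures that are memory-mapped so that many processes can share the same data.
This notebook shows how to use functionality related to the Annoy vector database.
Note: Annoy is read-only. Once the index is built, you cannot add any more embeddings!
If you want to progressively add new entries to your VectorStore, it is better to choose an alternative!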
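If switching to another store is not an option and the data changes only occasionally, one possible workaround is to keep the raw texts around and rebuild the whole index whenever entries are added. The sketch below illustrates that pattern only; `RebuildOnAddStore`, `build_fn`, and `fake_build` are hypothetical names, not part of Annoy or LangChain, and `fake_build` merely stands in for a real index builder.

```python
class RebuildOnAddStore:
    """Keeps the raw texts and rebuilds a read-only store on every addition."""

    def __init__(self, build_fn, texts=()):
        self._build_fn = build_fn      # hypothetical: callable(list_of_texts) -> store
        self._texts = list(texts)
        self.store = build_fn(self._texts) if self._texts else None

    def add_texts(self, new_texts):
        self._texts.extend(new_texts)
        # Full rebuild: the previous store is discarded, never updated in place.
        self.store = self._build_fn(self._texts)
        return self.store


def fake_build(texts):
    """Stand-in builder (assumption): returns an immutable snapshot of the texts."""
    return tuple(texts)


wrapper = RebuildOnAddStore(fake_build, ["pizza is great", "I love salad"])
wrapper.add_texts(["my car"])
print(len(wrapper.store))
```

In this notebook's setting, `build_fn` could be something like `lambda texts: Annoy.from_texts(texts, embeddings_func)`. Every addition then re-embeds and re-indexes all texts, so the cost grows with the collection size.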
#!pip install annoy
Create VectorStore from texts
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Annoy
embeddings_func = HuggingFaceEmbeddings()
texts = ["pizza is great", "I love salad", "my car", "a dog"]
# default metric is angular
vector_store = Annoy.from_texts(texts, embeddings_func)
# custom Annoy parameters are allowed; defaults are n_trees=100, n_jobs=-1, metric="angular"
vector_store_v2 = Annoy.from_texts(
    texts, embeddings_func, metric="dot", n_trees=100, n_jobs=1
)
vector_store.similarity_search("food", k=3)
[Document(page_content='pizza is great', metadata={}),
 Document(page_content='I love salad', metadata={}),
 Document(page_content='my car', metadata={})]
# the score is a distance metric, so lower is better
vector_store.similarity_search_with_score("food", k=3)
[(Document(page_content='pizza is great', metadata={}), 1.0944390296936035),
 (Document(page_content='I love salad', metadata={}), 1.1273186206817627),
 (Document(page_content='my car', metadata={}), 1.1580758094787598)]
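To make the scores above easier to read: Annoy documents its "angular" metric as the Euclidean distance between normalized vectors, which equals sqrt(2 * (1 - cos(u, v))). The following is a minimal pure-Python sketch of that formula and its inverse. The helper names `angular_distance` and `distance_to_cosine` are illustrative and not part of Annoy or LangChain.

```python
import math


def angular_distance(u, v):
    """Angular distance as documented by Annoy: sqrt(2 * (1 - cos(u, v)))."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = dot / (norm_u * norm_v)
    # Clamp to [-1, 1] to guard against tiny floating-point overshoot.
    cos = max(-1.0, min(1.0, cos))
    return math.sqrt(2.0 * (1.0 - cos))


def distance_to_cosine(distance):
    """Invert the formula: cos = 1 - d**2 / 2."""
    return 1.0 - distance * distance / 2.0


print(angular_distance([1.0, 0.0], [1.0, 0.0]))   # same direction
print(angular_distance([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(angular_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Under this formula the distance ranges from 0 (same direction) to 2 (opposite direction), and the top score of about 1.094 corresponds to a cosine similarity of roughly 0.40.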
Create VectorStore from docs
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
loader = TextLoader("../../../state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
docs[:5]
[Document(page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.', metadata={'source': '../../../state_of_the_union.txt'}),
Document(page_content='Groups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland. \n\nIn this struggle as President Zelenskyy said in his speech to the European Parliament “Light will win over darkness.” The Ukrainian Ambassador to the United States is here tonight. \n\nLet each of us here tonight in this Chamber send an unmistakable signal to Ukraine and to the world. \n\nPlease rise if you are able and show that, Yes, we the United States of America stand with the Ukrainian people. \n\nThroughout our history we’ve learned this lesson when dictators do not pay a price for their aggression they cause more chaos. \n\nThey keep moving. \n\nAnd the costs and the threats to America and the world keep rising. \n\nThat’s why the NATO Alliance was created to secure peace and stability in Europe after World War 2. \n\nThe United States is a member along with 29 other nations. \n\nIt matters. American diplomacy matters. American resolve matters.', metadata={'source': '../../../state_of_the_union.txt'}),
Document(page_content='Putin’s latest attack on Ukraine was premeditated and unprovoked. \n\nHe rejected repeated efforts at diplomacy. \n\nHe thought the West and NATO wouldn’t respond. And he thought he could divide us at home. Putin was wrong. We were ready. Here is what we did. \n\nWe prepared extensively and carefully. \n\nWe spent months building a coalition of other freedom-loving nations from Europe and the Americas to Asia and Africa to confront Putin. \n\nI spent countless hours unifying our European allies. We shared with the world in advance what we knew Putin was planning and precisely how he would try to falsely justify his aggression. \n\nWe countered Russia’s lies with truth. \n\nAnd now that he has acted the free world is holding him accountable. \n\nAlong with twenty-seven members of the European Union including France, Germany, Italy, as well as countries like the United Kingdom, Canada, Japan, Korea, Australia, New Zealand, and many others, even Switzerland.', metadata={'source': '../../../state_of_the_union.txt'}),
Document(page_content='We are inflicting pain on Russia and supporting the people of Ukraine. Putin is now isolated from the world more than ever. \n\nTogether with our allies –we are right now enforcing powerful economic sanctions
Create VectorStore via existing embeddings
embs = embeddings_func.embed_documents(texts)
data = list(zip(texts, embs))
vector_store_from_embeddings = Annoy.from_embeddings(data, embeddings_func)
vector_store_from_embeddings.similarity_search_with_score("food", k=3)
[(Document(page_content='pizza is great', metadata={}), 1.0944390296936035),
 (Document(page_content='I love salad', metadata={}), 1.1273186206817627),
 (Document(page_content='my car', metadata={}), 1.1580758094787598)]
Search via embeddings
motorbike_emb = embeddings_func.embed_query("motorbike")
vector_store.similarity_search_by_vector(motorbike_emb, k=3)
[Document(page_content='my car', metadata={}),
 Document(page_content='a dog', metadata={}),
 Document(page_content='pizza is great', metadata={})]
vector_store.similarity_search_with_score_by_vector(motorbike_emb, k=3)
[(Document(page_content='my car', metadata={}), 1.0870471000671387),
 (Document(page_content='a dog', metadata={}), 1.2095637321472168),
 (Document(page_content='pizza is great', metadata={}), 1.3254905939102173)]
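Searching by vector amounts to ranking the stored vectors by their distance to the query vector and returning the k closest. Annoy does this approximately using trees. The sketch below is an exact brute-force reference, not Annoy's algorithm, and can be handy for sanity-checking results on small collections. The toy 2-D vectors and the helper names are assumptions made for illustration.

```python
import math


def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def angular_distance(u, v):
    # Euclidean distance between unit vectors equals sqrt(2 * (1 - cos(u, v))).
    return math.dist(normalize(u), normalize(v))


def brute_force_search(query_vec, vectors, k=3):
    """Return up to k (index, distance) pairs, closest first, by exhaustive scan."""
    scored = sorted(
        (angular_distance(query_vec, vec), idx) for idx, vec in enumerate(vectors)
    )
    return [(idx, dist) for dist, idx in scored[:k]]


# Toy vectors standing in for real embeddings (hypothetical data).
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
for idx, dist in brute_force_search([1.0, 0.0], toy_vectors, k=3):
    print(idx, round(dist, 4))
```

The exhaustive scan costs time linear in the number of stored vectors per query, which is the cost an approximate index such as Annoy is designed to avoid.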
Search via docstore id
vector_store.index_to_docstore_id
{0: '2d1498a8-a37c-4798-acb9-0016504ed798',
1: '2d30aecc-88e0-4469-9d51-0ef7e9858e6d',
2: '927f1120-985b-4691-b577-ad5cb42e011c',
3: '3056ddcf-a62f-48c8-bd98-b9e57a3dfcae'}
some_docstore_id = 0  # texts[0]
vector_store.docstore._dict[vector_store.index_to_docstore_id[some_docstore_id]]
Document(page_content='pizza is great', metadata={})
# the same document has distance 0
vector_store.similarity_search_with_score_by_index(some_docstore_id, k=3)
[(Document(page_content='pizza is great', metadata={}), 0.0),
 (Document(page_content='I love salad', metadata={}), 1.0734446048736572),
 (Document(page_content='my car', metadata={}), 1.2895267009735107)]
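The lookup above goes through two mappings: the integer item index used by Annoy maps to a docstore id, and the docstore id maps to the stored document. A minimal sketch of that chain with plain dictionaries follows. The dict-based `docstore` and the `lookup` helper are stand-ins for illustration, not LangChain's actual classes.

```python
import uuid

texts = ["pizza is great", "I love salad", "my car", "a dog"]

# Annoy item index (int) -> docstore id (random UUID string)
index_to_docstore_id = {i: str(uuid.uuid4()) for i in range(len(texts))}

# docstore id (str) -> document; a plain dict stands in for the real docstore
docstore = {
    index_to_docstore_id[i]: {"page_content": text, "metadata": {}}
    for i, text in enumerate(texts)
}


def lookup(item_index):
    """Resolve an item index to its document via the docstore id."""
    doc_id = index_to_docstore_id[item_index]
    return docstore[doc_id]


print(lookup(0)["page_content"])
```

Keeping the indirection means the index only has to store integers and vectors, while document text and metadata live in a separate store.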
Save and load
vector_store.save_local("my_annoy_index_and_docstore")
saving config
loaded_vector_store = Annoy.load_local(
    "my_annoy_index_and_docstore", embeddings=embeddings_func
)
# the same document has distance 0
loaded_vector_store.similarity_search_with_score_by_index(some_docstore_id, k=3)
[(Document(page_content='pizza is great', metadata={}), 0.0),
(Document(page_content='I love salad', metadata={}), 1.0734446048736572),
(Document(page_content='my car', metadata={}), 1.2895267009735107)]
Construct from scratch
import uuid
from annoy import AnnoyIndex
from langchain.docstore.document import Document
from langchain.docstore.in_memory import InMemoryDocstore
metadatas = [{"x": "food"}, {"x": "food"}, {"x": "stuff"}, {"x": "animal"}]
# embeddings
embeddings = embeddings_func.embed_documents(texts)
# embedding dim
f = len(embeddings[0])
# index
metric = "angular"
index = AnnoyIndex(f, metric=metric)
for i, emb in enumerate(embeddings):
    index.add_item(i, emb)
# build the index with 10 trees; no items can be added after this call
index.build(10)
# docstore
documents = []
for i, text in enumerate(texts):
    metadata = metadatas[i] if metadatas else {}
    documents.append(Document(page_content=text, metadata=metadata))
index_to_docstore_id = {i: str(uuid.uuid4()) for i in range(len(documents))}
docstore = InMemoryDocstore(
    {index_to_docstore_id[i]: doc for i, doc in enumerate(documents)}
)
db_manually = Annoy(
    embeddings_func.embed_query, index, metric, docstore, index_to_docstore_id
)
db_manually.similarity_search_with_score("eating!", k=3)
[(Document(page_content='pizza is great', metadata={'x': 'food'}),
1.1314140558242798),
(Document(page_content='I love salad', metadata={'x': 'food'}),
1.1668788194656372),
(Document(page_content='my car', metadata={'x': 'stuff'}), 1.226445198059082)]