
FAISS

Facebook AI Similarity Search (Faiss) is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning.

Faiss documentation

This notebook shows how to use functionality related to the FAISS vector database.

pip install faiss-gpu # For CUDA 7.5+ supported GPUs.
# or
pip install faiss-cpu # For CPU installation

We want to use OpenAIEmbeddings, so we have to get the OpenAI API key.

import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

# Uncomment the following line if you need to initialize FAISS with no AVX2 optimization
# os.environ['FAISS_NO_AVX2'] = '1'
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.document_loaders import TextLoader

loader = TextLoader("../../../extras/modules/state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
db = FAISS.from_documents(docs, embeddings)

query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)
print(docs[0].page_content)
    Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.

Similarity search with score

There are some FAISS-specific methods. One of them is similarity_search_with_score, which lets you return not only the documents but also the distance score of the query to them. The returned distance score is L2 distance, so a lower score is better.

docs_and_scores = db.similarity_search_with_score(query)
docs_and_scores[0]
    (Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': '../../../state_of_the_union.txt'}),
0.36913747)
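To make "lower is better" concrete, here is the L2 (Euclidean) distance computed on toy 2-d vectors (hypothetical values, not real embeddings): identical vectors score 0.0, and the score grows as vectors diverge.

```python
import math

# Toy vectors (hypothetical, not actual embedding output) to illustrate the
# L2 distance that similarity_search_with_score returns.
def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(l2([1.0, 0.0], [1.0, 0.0]))  # identical vectors -> 0.0
print(l2([1.0, 0.0], [0.0, 1.0]))  # orthogonal unit vectors -> ~1.414
```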

It is also possible to search for documents similar to a given embedding vector using similarity_search_by_vector, which accepts an embedding vector as a parameter instead of a string.

embedding_vector = embeddings.embed_query(query)
docs = db.similarity_search_by_vector(embedding_vector)

Saving and loading

You can also save and load a FAISS index. This is useful so that you don't have to recreate it every time you use it.

db.save_local("faiss_index")
new_db = FAISS.load_local("faiss_index", embeddings)
docs = new_db.similarity_search(query)
docs[0]
    Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': '../../../state_of_the_union.txt'})

Serializing and de-serializing to bytes

You can pickle the FAISS index with these functions. If you use an embeddings model that is 90 MB in size (sentence-transformers/all-MiniLM-L6-v2 or any other model), the resulting pickle would be more than 90 MB, because the size of the model is included in the overall size. To overcome this, use the functions below. These functions serialize only the FAISS index, and the result is much smaller. This can be helpful if you wish to store the index in a database like SQL.

from langchain.embeddings.huggingface import HuggingFaceEmbeddings

pkl = db.serialize_to_bytes()  # serializes the faiss index
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
db = FAISS.deserialize_from_bytes(embeddings=embeddings, serialized=pkl)  # load the index
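Since the serialized index is just bytes, storing it in a SQL database is straightforward. A minimal sketch with the standard-library sqlite3 module (the table name and the placeholder payload are illustrative, not part of any FAISS API):

```python
import sqlite3

# Placeholder standing in for the bytes returned by db.serialize_to_bytes()
pkl = b"faiss index bytes"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE faiss_store (name TEXT PRIMARY KEY, payload BLOB)")
conn.execute("INSERT INTO faiss_store VALUES (?, ?)", ("my_index", pkl))
conn.commit()

# Read the blob back; pass it to FAISS.deserialize_from_bytes to rebuild the store
(restored,) = conn.execute(
    "SELECT payload FROM faiss_store WHERE name = ?", ("my_index",)
).fetchone()
```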

Merging

You can also merge two FAISS vector stores.

db1 = FAISS.from_texts(["foo"], embeddings)
db2 = FAISS.from_texts(["bar"], embeddings)
db1.docstore._dict
    {'068c473b-d420-487a-806b-fb0ccea7f711': Document(page_content='foo', metadata={})}
db2.docstore._dict
    {'807e0c63-13f6-4070-9774-5c6f0fbb9866': Document(page_content='bar', metadata={})}
db1.merge_from(db2)
db1.docstore._dict
    {'068c473b-d420-487a-806b-fb0ccea7f711': Document(page_content='foo', metadata={}),
'807e0c63-13f6-4070-9774-5c6f0fbb9866': Document(page_content='bar', metadata={})}

Similarity search with filtering

The FAISS vector store can also support filtering. Since FAISS does not natively support filtering, we have to do it manually: more results than k are fetched first and then filtered. You can filter the documents based on metadata. You can also set the fetch_k parameter when calling any search method to set how many documents you want to fetch before filtering. Here is a small example:

from langchain.schema import Document

list_of_documents = [
    Document(page_content="foo", metadata=dict(page=1)),
    Document(page_content="bar", metadata=dict(page=1)),
    Document(page_content="foo", metadata=dict(page=2)),
    Document(page_content="barbar", metadata=dict(page=2)),
    Document(page_content="foo", metadata=dict(page=3)),
    Document(page_content="bar burr", metadata=dict(page=3)),
    Document(page_content="foo", metadata=dict(page=4)),
    Document(page_content="bar bruh", metadata=dict(page=4)),
]
db = FAISS.from_documents(list_of_documents, embeddings)
results_with_scores = db.similarity_search_with_score("foo")
for doc, score in results_with_scores:
    print(f"Content: {doc.page_content}, Metadata: {doc.metadata}, Score: {score}")
    Content: foo, Metadata: {'page': 1}, Score: 5.159960813797904e-15
Content: foo, Metadata: {'page': 2}, Score: 5.159960813797904e-15
Content: foo, Metadata: {'page': 3}, Score: 5.159960813797904e-15
Content: foo, Metadata: {'page': 4}, Score: 5.159960813797904e-15

Now we make the same query call, but we filter for only page = 1.

results_with_scores = db.similarity_search_with_score("foo", filter=dict(page=1))
for doc, score in results_with_scores:
    print(f"Content: {doc.page_content}, Metadata: {doc.metadata}, Score: {score}")
    Content: foo, Metadata: {'page': 1}, Score: 5.159960813797904e-15
Content: bar, Metadata: {'page': 1}, Score: 0.3131446838378906

The same thing can also be done with max_marginal_relevance_search.

results = db.max_marginal_relevance_search("foo", filter=dict(page=1))
for doc in results:
    print(f"Content: {doc.page_content}, Metadata: {doc.metadata}")
    Content: foo, Metadata: {'page': 1}
Content: bar, Metadata: {'page': 1}

Here is an example of how to set the fetch_k parameter when calling similarity_search. Usually you want the fetch_k parameter >> the k parameter, because fetch_k is the number of documents that will be fetched before filtering. If you set fetch_k to a low number, you might not get enough documents to filter from.

results = db.similarity_search("foo", filter=dict(page=1), k=1, fetch_k=4)
for doc in results:
    print(f"Content: {doc.page_content}, Metadata: {doc.metadata}")
    Content: foo, Metadata: {'page': 1}
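The fetch-then-filter behavior described above can be sketched in plain Python. The candidate list below is hypothetical, standing in for the raw index results already sorted by ascending L2 score:

```python
# (content, metadata, score) tuples as if returned by the raw index,
# sorted by ascending L2 score (made-up values).
candidates = [
    ("foo", {"page": 1}, 0.0),
    ("foo", {"page": 2}, 0.0),
    ("bar", {"page": 1}, 0.31),
    ("baz", {"page": 3}, 0.52),
]

def search(filter_page, k=1, fetch_k=4):
    # Take the fetch_k nearest candidates, filter on metadata, truncate to k.
    kept = [c for c in candidates[:fetch_k] if c[1]["page"] == filter_page]
    return kept[:k]

print(search(filter_page=1, k=1, fetch_k=4))  # [('foo', {'page': 1}, 0.0)]
print(search(filter_page=3, k=1, fetch_k=2))  # [] -- fetch_k too small to reach the page-3 doc
```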

Delete

You can also delete ids. Note that the ids to delete should be the ids in the docstore.

db.delete([db.index_to_docstore_id[0]])
    True
# Is now missing
0 in db.index_to_docstore_id
    False
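To delete by something other than index position, you can collect docstore ids by inspecting the stored metadata. A sketch using a plain dict in place of db.docstore._dict (the ids and contents are made up):

```python
# Stand-in for db.docstore._dict: maps docstore id -> document
docstore = {
    "id-1": {"page_content": "foo", "metadata": {"page": 1}},
    "id-2": {"page_content": "bar", "metadata": {"page": 2}},
    "id-3": {"page_content": "baz", "metadata": {"page": 2}},
}

# Collect the ids of every document on page 2 ...
ids_to_delete = [i for i, d in docstore.items() if d["metadata"]["page"] == 2]
# ... then pass them to the real store: db.delete(ids_to_delete)
print(ids_to_delete)  # ['id-2', 'id-3']
```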