PGVector
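PGVector is an open-source vector similarity search tool for Postgres.
It supports:
- exact and approximate nearest neighbor search
- L2 distance, inner product, and cosine distance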
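To make the three distance measures in the list above concrete, here is a minimal pure-Python sketch of their textbook definitions. The function names are illustrative assumptions and are not part of pgvector or LangChain; the extension computes these measures inside Postgres.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def inner_product(a, b):
    """Inner (dot) product: larger means more similar."""
    return sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for the same direction, 1 for orthogonal vectors."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return 1.0 - inner_product(a, b) / (norm_a * norm_b)

u = [1.0, 0.0]
v = [0.0, 2.0]
print(l2_distance(u, v))      # sqrt(1 + 4)
print(inner_product(u, v))    # 0.0
print(cosine_distance(u, v))  # 1.0, because u and v are orthogonal
```

Cosine distance ignores vector length, while L2 distance and inner product do not, so the three measures can rank the same candidates differently unless the vectors are normalized.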
This notebook shows how to use the Postgres vector database (PGVector).
See the installation instructions.
# Install the required packages
pip install pgvector
pip install openai
pip install psycopg2-binary
pip install tiktoken
We want to use OpenAIEmbeddings, so we need to get the OpenAI API key.
import os
import getpass
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
OpenAI API Key:········
## Loading Environment Variables
```python
from typing import List, Tuple
from dotenv import load_dotenv
load_dotenv()
```
False
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores.pgvector import PGVector
from langchain.document_loaders import TextLoader
from langchain.docstore.document import Document
loader = TextLoader("../../../state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
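The chunk_size and chunk_overlap arguments used above control how the text is cut before embedding. The sketch below illustrates the idea with plain fixed-size windows. It is a simplification under stated assumptions: chunk_text is a hypothetical helper, and the real CharacterTextSplitter first splits on a separator and then merges pieces, so its chunk boundaries will differ.

```python
def chunk_text(text, chunk_size, chunk_overlap=0):
    """Cut text into windows of at most chunk_size characters (simplified sketch)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new window starts chunk_size - chunk_overlap characters after the previous one.
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

sample = "abcdefghij"
print(chunk_text(sample, chunk_size=4))                   # ['abcd', 'efgh', 'ij']
print(chunk_text(sample, chunk_size=4, chunk_overlap=2))  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

With an overlap of 0, as in this notebook, no text is repeated across chunks; a positive overlap repeats the tail of each chunk at the head of the next.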
# PGVector needs the connection string to the database.
CONNECTION_STRING = "postgresql+psycopg2://harrisonchase@localhost:5432/test3"
# # Alternatively, you can create it from environment variables.
# import os
# CONNECTION_STRING = PGVector.connection_string_from_db_params(
#     driver=os.environ.get("PGVECTOR_DRIVER", "psycopg2"),
#     host=os.environ.get("PGVECTOR_HOST", "localhost"),
#     port=int(os.environ.get("PGVECTOR_PORT", "5432")),
#     database=os.environ.get("PGVECTOR_DATABASE", "postgres"),
#     user=os.environ.get("PGVECTOR_USER", "postgres"),
#     password=os.environ.get("PGVECTOR_PASSWORD", "postgres"),
# )
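For readers who want to see what such a connection string looks like once assembled, here is a standalone sketch that builds a URL of the same shape as CONNECTION_STRING from the same PGVECTOR_* variables. build_connection_string is a hypothetical helper, not a LangChain function, and the percent-encoding of the password is an added assumption about what a URL parser expects.

```python
import os
from urllib.parse import quote_plus

def build_connection_string(env=os.environ):
    """Assemble a SQLAlchemy-style URL from PGVECTOR_* variables (hypothetical helper)."""
    driver = env.get("PGVECTOR_DRIVER", "psycopg2")
    host = env.get("PGVECTOR_HOST", "localhost")
    port = int(env.get("PGVECTOR_PORT", "5432"))
    database = env.get("PGVECTOR_DATABASE", "postgres")
    user = env.get("PGVECTOR_USER", "postgres")
    # Percent-encode the password so characters such as "@" or "/" do not break the URL.
    password = quote_plus(env.get("PGVECTOR_PASSWORD", "postgres"))
    return f"postgresql+{driver}://{user}:{password}@{host}:{port}/{database}"

# With no variables set, every field falls back to its default.
print(build_connection_string({}))
# postgresql+psycopg2://postgres:postgres@localhost:5432/postgres
```

Passing a plain dict instead of os.environ makes the helper easy to try without touching the real environment.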
## Similarity Search with Euclidean Distance (Default)
# The PGVector module will try to create a table named after the collection.
# So make sure the collection name is unique and the user has permission to create a table.
COLLECTION_NAME = "state_of_the_union_test"
db = PGVector.from_documents(
    embedding=embeddings,
    documents=docs,
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
)
query = "What did the president say about Ketanji Brown Jackson"
docs_with_score = db.similarity_search_with_score(query)
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)
--------------------------------------------------------------------------------
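Score:  0.18460171628856903
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.
Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.
One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.
And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.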
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
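Score:  0.18460171628856903
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.
Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.
One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.
And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.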
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
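Score:  0.18470284560586236
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.
Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.
One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.
And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.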
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
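Score:  0.21730864082247825
A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans.
And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system.
We can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling.
We’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers.
We’re putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster.
We’re securing commitments and supporting partners in South and Central America to host more refugees and secure their own borders.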
--------------------------------------------------------------------------------
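The scores printed above grow from the first result to the last, which is what one would expect if the score is a distance and a lower value means a closer match. As a rough illustration of exact nearest-neighbor ranking under that assumption, the sketch below scores every stored vector against the query and keeps the k smallest distances. The function name, the tiny 2-D vectors, and the labels are all made up for the example.

```python
import math

def top_k_by_l2(query_vec, items, k=4):
    """Exact nearest-neighbor search: score every item, return the k smallest distances."""
    scored = [(text, math.dist(query_vec, vec)) for text, vec in items]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Made-up 2-D "embeddings"; real embeddings have far more dimensions.
items = [
    ("nomination", [0.9, 0.1]),
    ("border", [0.2, 0.8]),
    ("economy", [0.5, 0.5]),
]
for text, score in top_k_by_l2([1.0, 0.0], items, k=2):
    print(text, round(score, 4))
```

A brute-force scan like this is exact but touches every row; approximate indexes trade a little recall for speed on large collections.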
## Working with vectorstore
Above, we created a vectorstore from scratch. However, we often want to work with an existing vectorstore. To do that, we can initialize it directly.
store = PGVector(
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
    embedding_function=embeddings,
)
## Add documents
We can add documents to the existing vectorstore.
store.add_documents([Document(page_content="foo")])
['048c2e14-1cf3-11ee-8777-e65801318980']
docs_with_score = db.similarity_search_with_score("foo")
docs_with_score[0]
(Document(page_content='foo', metadata={}), 3.3203430005457335e-09)
docs_with_score[1]
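(Document(page_content='A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. \n\nAnd if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. \n\nWe can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling. \n\nWe’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers. \n\nWe’re putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster. \n\nWe’re securing commitments and supporting partners in South and Central America to host more refugees and secure their own borders.', metadata={'source': '../../../state_of_the_union.txt'}),
0.2404395365581814)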
## Overriding a vectorstore
If you have an existing collection, you can override it by calling from_documents and setting pre_delete_collection=True.
db = PGVector.from_documents(
    documents=docs,
    embedding=embeddings,
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
    pre_delete_collection=True,
)
docs_with_score = db.similarity_search_with_score("foo")
docs_with_score[0]
(Document(page_content='A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. \n\nAnd if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. \n\nWe can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling. \n\nWe’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers. \n\nWe’re putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster. \n\nWe’re securing commitments and supporting partners in South and Central America to host more refugees and secure their own borders.', metadata={'source': '../../../state_of_the_union.txt'}),
0.2404115088144465)
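In the result above, the top match for "foo" is a chunk of the speech rather than the "foo" document, which suggests the earlier contents of the collection were dropped. The toy model below sketches that behavior with an in-memory dictionary. ToyStore and its methods are hypothetical stand-ins written for this illustration, not the PGVector class, and the model assumes that loading without the flag appends to an existing collection.

```python
class ToyStore:
    """In-memory stand-in for a vector store (hypothetical, not the LangChain class)."""

    _collections = {}  # collection name -> list of stored texts

    def __init__(self, collection_name):
        self.collection_name = collection_name
        self._collections.setdefault(collection_name, [])

    @classmethod
    def from_texts(cls, texts, collection_name, pre_delete_collection=False):
        if pre_delete_collection:
            # Drop whatever the collection held before loading the new texts.
            cls._collections.pop(collection_name, None)
        store = cls(collection_name)
        store.add_texts(texts)
        return store

    def add_texts(self, texts):
        self._collections[self.collection_name].extend(texts)

    def all_texts(self):
        return list(self._collections[self.collection_name])

toy = ToyStore.from_texts(["speech chunk"], "demo")
toy.add_texts(["foo"])
appended = ToyStore.from_texts(["speech chunk"], "demo")
print(appended.all_texts())  # ['speech chunk', 'foo', 'speech chunk']
rebuilt = ToyStore.from_texts(["speech chunk"], "demo", pre_delete_collection=True)
print(rebuilt.all_texts())   # ['speech chunk']
```

In this toy model, reloading without the flag leaves the old entries in place and adds duplicates, while reloading with the flag starts from an empty collection.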
## Using a VectorStore as a Retriever
retriever = store.as_retriever()
print(retriever)
tags=None metadata=None vectorstore=<langchain.vectorstores.pgvector.PGVector object at 0x29f94f880> search_type='similarity' search_kwargs={}
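The printed fields show that the retriever holds a reference to the vectorstore plus a search_type and search_kwargs. As a rough mental model only, a retriever can be thought of as a thin wrapper that forwards a query to the store's search and returns documents without scores. ToyRetriever and fake_search below are hypothetical names written for this sketch and do not reproduce LangChain's retriever class.

```python
class ToyRetriever:
    """Hypothetical wrapper: exposes a store's search as 'query in, documents out'."""

    def __init__(self, search_fn, search_kwargs=None):
        self.search_fn = search_fn  # callable(query, **kwargs) -> [(doc, score), ...]
        self.search_kwargs = search_kwargs or {}

    def get_relevant_documents(self, query):
        # Drop the scores; callers of a retriever only see the documents.
        return [doc for doc, _score in self.search_fn(query, **self.search_kwargs)]

def fake_search(query, k=4):
    """Stand-in for a scored similarity search over three fixed documents."""
    corpus = [
        ("doc about judges", 0.18),
        ("doc about borders", 0.22),
        ("doc about taxes", 0.30),
    ]
    return corpus[:k]

toy_retriever = ToyRetriever(fake_search, search_kwargs={"k": 2})
print(toy_retriever.get_relevant_documents("judges"))
# ['doc about judges', 'doc about borders']
```

Keeping the search arguments in one dictionary means the caller sets options such as k once, when the retriever is created, rather than on every query.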