
ClickHouse Vector Search

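ClickHouse is a fast, resource-efficient open-source database for real-time applications and analytics, with full SQL support and a wide range of functions that help users write analytical queries. Recently added data structures and distance search functions (such as L2Distance), together with approximate nearest neighbor search indexes, allow ClickHouse to be used as a high-performance, scalable vector database for storing and searching vectors with SQL.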

This notebook shows how to use the functionality related to ClickHouse vector search.
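To make the distance-based search concrete, here is a minimal pure-Python sketch of what an exact (brute-force) nearest neighbor search under L2 (Euclidean) distance computes. The names `l2_distance` and `nearest` are illustrative and not part of ClickHouse or LangChain; ClickHouse evaluates `L2Distance` in SQL and can use an approximate index rather than scanning every row.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query, vectors, k=1):
    """Return the k (index, distance) pairs closest to `query`, nearest first."""
    scored = [(i, l2_distance(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy 2-dimensional "embeddings" for illustration only.
vectors = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]
print(nearest([0.9, 1.2], vectors, k=2))
```

An approximate index trades a small amount of recall for speed by avoiding this full scan.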

Setting up the environment

Setting up a local ClickHouse server with docker (optional)

docker run -d -p 8123:8123 -p 9000:9000 --name langchain-clickhouse-server --ulimit nofile=262144:262144 clickhouse/clickhouse-server:23.4.2.11

Setting up the ClickHouse client driver

pip install clickhouse-connect

We want to use OpenAIEmbeddings, so we need to get an OpenAI API key.

import os
import getpass

# Use .get() so a missing variable prompts for the key instead of raising KeyError.
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Clickhouse, ClickhouseSettings
from langchain.document_loaders import TextLoader

loader = TextLoader("../../../state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
for d in docs:
    d.metadata = {"some": "metadata"}
settings = ClickhouseSettings(table="clickhouse_vector_search_example")
docsearch = Clickhouse.from_documents(docs, embeddings, config=settings)

query = "What did the president say about Ketanji Brown Jackson"
docs = docsearch.similarity_search(query)
    Inserting data...: 100%|██████████| 42/42 [00:00<00:00, 2801.49it/s]
print(docs[0].page_content)
    Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.

Get connection info and data schema

print(str(docsearch))
    default.clickhouse_vector_search_example @ localhost:8123

username: None

Table Schema:
---------------------------------------------------
|id |Nullable(String) |
|document |Nullable(String) |
|embedding |Array(Float32) |
|metadata |Object('json') |
|uuid |UUID |
---------------------------------------------------

Clickhouse table schema

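By default, the Clickhouse table is created automatically if it does not exist. Advanced users can pre-create the table with optimized settings. For a distributed Clickhouse cluster with sharding, the table engine should be configured as Distributed.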

print(f"Clickhouse Table DDL:\n\n{docsearch.schema}")
    Clickhouse Table DDL:

CREATE TABLE IF NOT EXISTS default.clickhouse_vector_search_example(
id Nullable(String),
document Nullable(String),
embedding Array(Float32),
metadata JSON,
uuid UUID DEFAULT generateUUIDv4(),
CONSTRAINT cons_vec_len CHECK length(embedding) = 1536,
INDEX vec_idx embedding TYPE annoy(100,'L2Distance') GRANULARITY 1000
) ENGINE = MergeTree ORDER BY uuid SETTINGS index_granularity = 8192
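For users who want to pre-create the table, the following is a minimal sketch that renders a DDL string of the same shape as the one printed above for a chosen embedding dimension. The helper `build_ddl` and its parameters are hypothetical and not a LangChain API. The commented-out lines assume a server reachable through `clickhouse_connect.get_client`; experimental features such as the JSON type and the annoy index may need to be enabled on the server first, and a sharded cluster would additionally need a Distributed table, which is not shown here.

```python
import re

# Only plain identifiers are accepted, since they are spliced into SQL text.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_ddl(database, table, dim, annoy_param=100, metric="L2Distance"):
    """Render a CREATE TABLE statement mirroring the schema above (hypothetical helper).

    `annoy_param` is the first annoy() argument as it appears in the printed DDL;
    its exact meaning depends on the ClickHouse version.
    """
    for name in (database, table, metric):
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    dim, annoy_param = int(dim), int(annoy_param)
    return (
        f"CREATE TABLE IF NOT EXISTS {database}.{table}(\n"
        "    id Nullable(String),\n"
        "    document Nullable(String),\n"
        "    embedding Array(Float32),\n"
        "    metadata JSON,\n"
        "    uuid UUID DEFAULT generateUUIDv4(),\n"
        f"    CONSTRAINT cons_vec_len CHECK length(embedding) = {dim},\n"
        f"    INDEX vec_idx embedding TYPE annoy({annoy_param},'{metric}') GRANULARITY 1000\n"
        ") ENGINE = MergeTree ORDER BY uuid SETTINGS index_granularity = 8192"
    )


ddl = build_ddl("default", "clickhouse_vector_search_example", dim=1536)
print(ddl)

# To run it against a server (assumes the local docker server from the setup step):
# import clickhouse_connect
# client = clickhouse_connect.get_client(host="localhost", port=8123)
# client.command(ddl)
```

The length constraint has to match the embedding model in use; 1536 is the value shown in the DDL above.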

Filtering

You have direct access to the ClickHouse SQL WHERE statement: the filter you pass is written as a WHERE clause that follows standard SQL.

NOTE: Be aware of SQL injection. This interface must not be called directly by end users.

If you customized column_map in your settings, you can search with a filter like this:

from langchain.vectorstores import Clickhouse, ClickhouseSettings
from langchain.document_loaders import TextLoader

loader = TextLoader("../../../state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()

for i, d in enumerate(docs):
    d.metadata = {"doc_id": i}

docsearch = Clickhouse.from_documents(docs, embeddings)
    Inserting data...: 100%|██████████| 42/42 [00:00<00:00, 6939.56it/s]
meta = docsearch.metadata_column
output = docsearch.similarity_search_with_relevance_scores(
    "What did the president say about Ketanji Brown Jackson?",
    k=4,
    where_str=f"{meta}.doc_id<10",
)
for d, dist in output:
    print(dist, d.metadata, d.page_content[:20] + "...")
    0.6779101415357189 {'doc_id': 0} Madam Speaker, Madam...
0.6997970363474885 {'doc_id': 8} And so many families...
0.7044504914336727 {'doc_id': 1} Groups of citizens b...
0.7053558702165094 {'doc_id': 6} And I’m taking robus...
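Because `where_str` is spliced into the SQL query, one cautious pattern is to build it only from validated pieces rather than from raw user text. The sketch below is an assumption about how one might do this; `doc_id_below` is a hypothetical helper, and `doc_id` is the metadata key used in the example above.

```python
import re

# Only plain identifiers are accepted as the metadata column name.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def doc_id_below(metadata_column, limit):
    """Build a filter like 'metadata.doc_id<10' from validated inputs (hypothetical helper)."""
    if not _IDENT.match(metadata_column):
        raise ValueError(f"unsafe column name: {metadata_column!r}")
    # int() raises ValueError for strings such as "10 OR 1=1".
    return f"{metadata_column}.doc_id<{int(limit)}"


print(doc_id_below("metadata", "10"))  # metadata.doc_id<10
```

The result could then be passed as `where_str=doc_id_below(docsearch.metadata_column, user_limit)`, where `user_limit` stands for whatever value the application received.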

Deleting your data

docsearch.drop()