ElasticSearch

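Elasticsearch is a distributed, RESTful search and analytics engine. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents.

This notebook shows how to use functionality related to the Elasticsearch database.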

Installation

See the Elasticsearch installation instructions.

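To connect to an Elasticsearch instance that does not require login credentials, pass the Elasticsearch URL and index name, together with the embedding object, to the constructor.

Example: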

from langchain import ElasticVectorSearch
from langchain.embeddings import OpenAIEmbeddings

embedding = OpenAIEmbeddings()
elastic_vector_search = ElasticVectorSearch(
    elasticsearch_url="http://localhost:9200",
    index_name="test_index",
    embedding=embedding,
)

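To connect to an Elasticsearch instance that requires login credentials, including Elastic Cloud, use the Elasticsearch URL format https://username:password@es_host:9243. For example, to connect to Elastic Cloud, create the Elasticsearch URL with the required authentication details and pass it to the ElasticVectorSearch constructor as the named parameter elasticsearch_url.

You can obtain your Elastic Cloud URL and login credentials by logging in to the Elastic Cloud console at https://cloud.elastic.co, selecting your deployment, and navigating to the "Deployments" page.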

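To obtain your Elastic Cloud password for the default "elastic" user:

  1. Log in to the Elastic Cloud console at https://cloud.elastic.co
  2. Go to "Security" > "Users"
  3. Locate the "elastic" user and click "Edit"
  4. Click "Reset password"
  5. Follow the prompts to reset the password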

The Elastic Cloud URL has the format https://username:password@cluster_id.region_id.gcp.cloud.es.io:9243.

Example:

from langchain import ElasticVectorSearch
from langchain.embeddings import OpenAIEmbeddings

embedding = OpenAIEmbeddings()

elastic_host = "cluster_id.region_id.gcp.cloud.es.io"
elasticsearch_url = f"https://username:password@{elastic_host}:9243"
elastic_vector_search = ElasticVectorSearch(
    elasticsearch_url=elasticsearch_url,
    index_name="test_index",
    embedding=embedding,
)
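
Credentials embedded in a URL generally need to be percent-encoded when they contain reserved characters such as "@", ":" or "/", otherwise the URL may be parsed incorrectly. The following is a minimal sketch using only the Python standard library. The helper name build_elasticsearch_url, the sample password, and the default port are illustrative assumptions and are not part of LangChain or the Elasticsearch client; it also assumes the client percent-decodes the user-info part of the URL, as is the usual convention.

```python
from urllib.parse import quote


def build_elasticsearch_url(username: str, password: str, host: str, port: int = 9243) -> str:
    """Return an https URL with percent-encoded basic-auth credentials.

    Hypothetical helper for illustration only.
    """
    # safe="" makes quote() escape "/" as well as "@" and ":".
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"https://{user}:{secret}@{host}:{port}"


# The password below is a made-up value that contains reserved characters.
url = build_elasticsearch_url(
    "elastic", "p@ss:word/1", "cluster_id.region_id.gcp.cloud.es.io"
)
print(url)
```

The resulting string can then be passed as elasticsearch_url in the same way as in the example above.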

pip install elasticsearch

import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
    OpenAI API Key: ········

Example

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import ElasticVectorSearch
from langchain.document_loaders import TextLoader

loader = TextLoader("../../../state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
db = ElasticVectorSearch.from_documents(
    docs, embeddings, elasticsearch_url="http://localhost:9200"
)

query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)
print(docs[0].page_content)
    In state after state, new laws have been passed, not only to suppress the vote, but to subvert entire elections. 

We cannot let this happen.

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
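
To make the ranking step behind similarity_search more concrete, here is a small pure-Python sketch of nearest-neighbour ranking by cosine similarity. It is illustrative only: the toy two-dimensional vectors, the helper names, the default k, and the choice of cosine similarity are assumptions, and in practice the scoring runs inside Elasticsearch on the stored embeddings rather than in client code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vector, doc_vectors, k=4):
    """Return the k (doc_id, score) pairs most similar to the query, best first."""
    scored = [
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in doc_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-ins for document embeddings (real embeddings have many more dimensions).
toy_vectors = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.6, 0.8],
    "doc_c": [0.0, 1.0],
}
top = rank_by_similarity([1.0, 0.0], toy_vectors, k=2)
print(top)
```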

ElasticKnnSearch Class

ElasticKnnSearch implements features that allow vectors and documents to be stored in Elasticsearch for use with approximate kNN search.
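
As a rough illustration of what an approximate kNN request looks like at the Elasticsearch level, the sketch below builds a search body that uses the `knn` option of the Elasticsearch 8.x search API. The helper name, the field name "vector", and the default values are assumptions for illustration; the body that ElasticKnnSearch actually sends may differ.

```python
def build_knn_body(query_vector, k=10, num_candidates=50, field="vector"):
    """Build a search body for approximate kNN (hypothetical helper).

    num_candidates is the number of candidates examined per shard;
    Elasticsearch rejects values smaller than k.
    """
    if num_candidates < k:
        raise ValueError("num_candidates must be at least k")
    return {
        "knn": {
            "field": field,  # assumed name of the dense_vector field
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": num_candidates,
        }
    }


body = build_knn_body([0.1, 0.2, 0.3], k=2, num_candidates=10)
print(body)
```

A larger num_candidates usually improves recall at the cost of latency, which is the main trade-off of approximate search.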

pip install langchain elasticsearch

from langchain.vectorstores.elastic_vector_search import ElasticKnnSearch
from langchain.embeddings import ElasticsearchEmbeddings
import elasticsearch

# Initialize ElasticsearchEmbeddings
model_id = "<model_id_from_es>"
dims = dim_count  # placeholder: replace dim_count with your model's embedding dimension
es_cloud_id = "ESS_CLOUD_ID"
es_user = "es_user"
es_password = "es_pass"
test_index = "<index_name>"
# input_field = "your_input_field" # if different from 'text_field'

# Generate the embedding object
embeddings = ElasticsearchEmbeddings.from_credentials(
    model_id,
    # input_field=input_field,
    es_cloud_id=es_cloud_id,
    es_user=es_user,
    es_password=es_password,
)

# Initialize ElasticKnnSearch
knn_search = ElasticKnnSearch(
    es_cloud_id=es_cloud_id,
    es_user=es_user,
    es_password=es_password,
    index_name=test_index,
    embedding=embeddings,
)

Test adding vectors

# Test the `add_texts` method
texts = ["Hello, world!", "Machine learning is fun.", "I love Python."]
knn_search.add_texts(texts)

# Test the `from_texts` method
new_texts = [
    "This is a new text.",
    "Elasticsearch is powerful.",
    "Python is great for data analysis.",
]
knn_search.from_texts(new_texts, dims=dims)
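
The dims argument matters because an Elasticsearch dense_vector field is typically mapped with a fixed number of dimensions. A minimal sketch of such an index mapping follows. The field names "text" and "vector", the helper name, and the dot_product similarity are assumptions here, not a statement of what the library creates.

```python
def build_knn_mapping(dims, similarity="dot_product"):
    """Return an index mapping with a text field and an indexed dense_vector field.

    Hypothetical helper; field names and similarity are illustrative choices.
    """
    if dims <= 0:
        raise ValueError("dims must be a positive integer")
    return {
        "properties": {
            "text": {"type": "text"},
            "vector": {
                "type": "dense_vector",
                "dims": dims,  # must match the length of every stored embedding
                "index": True,  # required for approximate kNN search
                "similarity": similarity,
            },
        }
    }


mapping = build_knn_mapping(384)  # 384 is an arbitrary example dimension
print(mapping["properties"]["vector"])
```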

Test kNN search using the query vector builder

# Test the `knn_search` method with model_id and query_text
query = "Hello"
knn_result = knn_search.knn_search(query=query, model_id=model_id, k=2)
print(f"kNN search results for query '{query}': {knn_result}")
print(
    f"The 'text' field value from the top hit is: '{knn_result['hits']['hits'][0]['_source']['text']}'"
)

# Test the `knn_hybrid_search` method
query = "Hello"
hybrid_result = knn_search.knn_hybrid_search(query=query, model_id=model_id, k=2)
print(f"Hybrid search results for query '{query}': {hybrid_result}")
print(
    f"The 'text' field value from the top hit is: '{hybrid_result['hits']['hits'][0]['_source']['text']}'"
)
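
When a model_id is passed instead of a vector, the query text can be embedded on the Elasticsearch side. A hedged sketch of such a request body, using the query_vector_builder option that Elasticsearch 8.7 and later accept inside `knn`, could look like this. The helper name, the field name, and the defaults are illustrative assumptions.

```python
def build_text_knn_body(query_text, model_id, k=10, num_candidates=50, field="vector"):
    """Build a kNN body that asks Elasticsearch to embed the query text itself.

    Hypothetical helper; model_id must refer to a text embedding model
    deployed in the cluster.
    """
    return {
        "knn": {
            "field": field,  # assumed name of the dense_vector field
            "k": k,
            "num_candidates": num_candidates,
            "query_vector_builder": {
                "text_embedding": {
                    "model_id": model_id,
                    "model_text": query_text,
                }
            },
        }
    }


text_body = build_text_knn_body("Hello", "<model_id_from_es>", k=2)
print(text_body)
```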

Test kNN search using a pre-generated vector

# Generate an embedding to test with
query_text = "Hello"
query_embedding = embeddings.embed_query(query_text)
print(
    f"Length of embedding: {len(query_embedding)}\nFirst two items in embedding: {query_embedding[:2]}"
)

# Test kNN search
knn_result = knn_search.knn_search(query_vector=query_embedding, k=2)
print(
    f"The 'text' field value from the top hit is: '{knn_result['hits']['hits'][0]['_source']['text']}'"
)

# Test hybrid search - requires both query_text and query_vector
knn_result = knn_search.knn_hybrid_search(
    query_vector=query_embedding, query=query_text, k=2
)
print(
    f"The 'text' field value from the top hit is: '{knn_result['hits']['hits'][0]['_source']['text']}'"
)
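
A hybrid request combines a lexical query with a kNN clause in one search body, and Elasticsearch adds the boosted scores of both. The sketch below shows one plausible shape for such a body. The helper name, the field names, and the boost values are assumptions chosen for illustration rather than the library's confirmed defaults.

```python
def build_hybrid_body(
    query_text,
    query_vector,
    k=10,
    num_candidates=50,
    knn_boost=0.9,
    query_boost=0.1,
    text_field="text",
    vector_field="vector",
):
    """Combine a match query and a kNN clause in one search body (hypothetical helper)."""
    return {
        # Lexical part: full-text match on the text field.
        "query": {
            "match": {text_field: {"query": query_text, "boost": query_boost}}
        },
        # Vector part: approximate nearest neighbours on the vector field.
        "knn": {
            "field": vector_field,
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": num_candidates,
            "boost": knn_boost,
        },
        "size": k,
    }


hybrid_body = build_hybrid_body("Hello", [0.1, 0.2], k=2)
print(hybrid_body)
```

Raising knn_boost relative to query_boost favours semantic matches, while the reverse favours exact keyword matches.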

Test the source option

# Test the `knn_search` method with model_id and query_text
query = "Hello"
knn_result = knn_search.knn_search(query=query, model_id=model_id, k=2, source=False)
assert "_source" not in knn_result["hits"]["hits"][0].keys()

# Test the `knn_hybrid_search` method
query = "Hello"
hybrid_result = knn_search.knn_hybrid_search(
    query=query, model_id=model_id, k=2, source=False
)
assert "_source" not in hybrid_result["hits"]["hits"][0].keys()

Test the fields option

# Test the `knn_search` method with model_id and query_text
query = "Hello"
knn_result = knn_search.knn_search(query=query, model_id=model_id, k=2, fields=["text"])
assert "text" in knn_result["hits"]["hits"][0]["fields"].keys()

# Test the `knn_hybrid_search` method
query = "Hello"
hybrid_result = knn_search.knn_hybrid_search(
    query=query, model_id=model_id, k=2, fields=["text"]
)
assert "text" in hybrid_result["hits"]["hits"][0]["fields"].keys()
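
The source and fields options change where the text lives in each hit: by default it sits under "_source", while values requested through fields come back under "fields" as lists. A small helper that copes with both shapes might look like the sketch below. The helper name and the sample hits are hypothetical.

```python
def extract_text(hit, field="text"):
    """Return the field value from a search hit, or None if it is absent.

    Checks "_source" first, then "fields", where Elasticsearch returns
    each value wrapped in a list.
    """
    source = hit.get("_source")
    if source is not None and field in source:
        return source[field]
    values = hit.get("fields", {}).get(field)
    if values:
        return values[0]
    return None


# Made-up hits that mimic the three response shapes.
hit_with_source = {"_id": "1", "_source": {"text": "Hello, world!"}}
hit_with_fields = {"_id": "2", "fields": {"text": ["I love Python."]}}
hit_without_text = {"_id": "3"}

print(extract_text(hit_with_source))
print(extract_text(hit_with_fields))
print(extract_text(hit_without_text))
```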

Test with an es client connection rather than cloud_id

from elasticsearch import Elasticsearch

# Create the Elasticsearch connection
es_connection = Elasticsearch(
    hosts=["https://es_cluster_url:port"], basic_auth=("user", "password")
)

# Instantiate ElasticsearchEmbeddings using es_connection
embeddings = ElasticsearchEmbeddings.from_es_connection(
    model_id,
    es_connection,
)

# Initialize ElasticKnnSearch
knn_search = ElasticKnnSearch(
    es_connection=es_connection, index_name=test_index, embedding=embeddings
)

# Test the `knn_search` method with model_id and query_text
query = "Hello"
knn_result = knn_search.knn_search(query=query, model_id=model_id, k=2)
print(f"kNN search results for query '{query}': {knn_result}")
print(
    f"The 'text' field value from the top hit is: '{knn_result['hits']['hits'][0]['_source']['text']}'"
)