
WebResearchRetriever

Given a query, this retriever will:

  • Formulate a set of related Google search queries
  • Search for each
  • Load all the resulting URLs
  • Then embed the consolidated page content and run a similarity search against the query (a rough sketch of this flow follows the list)
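
Conceptually, the flow looks roughly like the sketch below. This is a simplified illustration only, not the library's actual implementation; the number of generated queries, results per query, and chunk size are assumptions made for the example.

# Rough sketch of the web-research flow (illustrative only)
from langchain.chat_models import ChatOpenAI
from langchain.utilities import GoogleSearchAPIWrapper
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

def web_research(query: str, k: int = 4):
    llm = ChatOpenAI(temperature=0)
    search = GoogleSearchAPIWrapper()
    # 1. Ask the LLM for related search queries (assumed count: 3)
    questions = llm.predict(f"Write 3 Google search queries related to: {query}").splitlines()
    # 2. Run each query and collect the result URLs
    urls = {hit["link"] for q in questions for hit in search.results(q, num_results=3)}
    # 3. Load the pages and split them into chunks
    docs = WebBaseLoader(list(urls)).load()
    splits = RecursiveCharacterTextSplitter(chunk_size=1000).split_documents(docs)
    # 4. Embed the chunks and similarity-search them against the original query
    vectorstore = Chroma.from_documents(splits, OpenAIEmbeddings())
    return vectorstore.similarity_search(query, k=k)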
from langchain.retrievers.web_research import WebResearchRetriever


Simple usage

Specify the LLM to use for Google search query generation.

import os  
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models.openai import ChatOpenAI
from langchain.utilities import GoogleSearchAPIWrapper

# Vectorstore
vectorstore = Chroma(embedding_function=OpenAIEmbeddings(), persist_directory="./chroma_db_oai")

# LLM
llm = ChatOpenAI(temperature=0)

# Search
os.environ["GOOGLE_CSE_ID"] = "xxx"
os.environ["GOOGLE_API_KEY"] = "xxx"
search = GoogleSearchAPIWrapper()


# Initialize
web_research_retriever = WebResearchRetriever.from_llm(
    vectorstore=vectorstore,
    llm=llm,
    search=search,
)

Use RetrievalQAWithSourcesChain to retrieve docs and provide citations.

from langchain.chains import RetrievalQAWithSourcesChain  
user_input = "LLM Powered Autonomous Agents如何工作?"
qa_chain = RetrievalQAWithSourcesChain.from_chain_type(llm,retriever=web_research_retriever)
result = qa_chain({"question": user_input})
result


Fetching pages: 100%|###################################################################################################################################| 1/1 [00:00<00:00,  3.33it/s]  

{'question': 'How do LLM Powered Autonomous Agents work?',
'answer': 'LLM Powered Autonomous Agents work by using an LLM (large language model) as the core controller of the agent. The system also includes several key components such as planning, memory, and tool use. The agent system is designed to be a powerful general problem solver.',
'sources': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}

Here, we use the get_relevant_documents method to return the documents directly.

# Run
import logging
logging.basicConfig()
logging.getLogger("langchain.retrievers.web_research").setLevel(logging.INFO)
user_input = "What is Task Decomposition in LLM Powered Autonomous Agents?"
docs = web_research_retriever.get_relevant_documents(user_input)
INFO:langchain.retrievers.web_research:Generating questions for Google Search ...
INFO:langchain.retrievers.web_research:Questions for Google Search (raw): {'question': 'What is Task Decomposition in LLM Powered Autonomous Agents?', 'text': LineList(lines=['1. How do LLM powered autonomous agents utilize task decomposition?\n', '2. Can you explain the concept of task decomposition in LLM powered autonomous agents?\n', '3. What role does task decomposition play in the functioning of LLM powered autonomous agents?\n', '4. Why is task decomposition important for LLM powered autonomous agents?\n'])}
INFO:langchain.retrievers.web_research:Questions for Google Search: ['1. How do LLM powered autonomous agents utilize task decomposition?\n', '2. Can you explain the concept of task decomposition in LLM powered autonomous agents?\n', '3. What role does task decomposition play in the functioning of LLM powered autonomous agents?\n', '4. Why is task decomposition important for LLM powered autonomous agents?\n']
INFO:langchain.retrievers.web_research:Searching for relevat urls ...
INFO:langchain.retrievers.web_research:Searching for relevat urls ...
INFO:langchain.retrievers.web_research:Search results: [{'title': "LLM Powered Autonomous Agents | Lil'Log", 'link': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'snippet': 'Jun 23, 2023 ... Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\\n1." , "What are the subgoals for achieving XYZ?'}]
INFO:langchain.retrievers.web_research:Searching for relevat urls ...
INFO:langchain.retrievers.web_research:Search results: [{'title': "LLM Powered Autonomous Agents | Lil'Log", 'link': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'snippet': 'Jun 23, 2023 ... Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\\n1." , "What are the subgoals for achieving XYZ?" , (2)\xa0...'}]
INFO:langchain.retrievers.web_research:Searching for relevat urls ...
INFO:langchain.retrievers.web_research:Search results: [{'title': "LLM Powered Autonomous Agents | Lil'Log", 'link': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'snippet': 'Jun 23, 2023 ... In a LLM-powered autonomous agent system, LLM functions as the ... Task decomposition can be done (1) by LLM with simple prompting like\xa0...'}]
INFO:langchain.retrievers.web_research:Searching for relevat urls ...
INFO:langchain.retrievers.web_research:Search results: [{'title': "LLM Powered Autonomous Agents | Lil'Log", 'link': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'snippet': 'Jun 23, 2023 ... Agent System Overview In a LLM-powered autonomous agent system, ... Task decomposition can be done (1) by LLM with simple prompting like\xa0...'}]
INFO:langchain.retrievers.web_research:New URLs to load: []

For question answering over the retrieved documents, we can use load_qa_chain.

from langchain.chains.question_answering import load_qa_chain  
chain = load_qa_chain(llm, chain_type="stuff")
output = chain({"input_documents": docs, "question": user_input}, return_only_outputs=True)
output['output_text']


    'Task decomposition in LLM-powered autonomous agents refers to the process of breaking down a complex task into smaller, more manageable subgoals. This allows the agent to efficiently handle and execute the individual steps required to complete the overall task. By decomposing the task, the agent can prioritize and organize its actions, making it easier to plan and execute the necessary steps towards achieving the desired outcome.'

More flexibility

Pass an LLM chain with a custom prompt and output parsing.

import os
import re
from typing import List
from langchain.chains import LLMChain
from pydantic import BaseModel, Field
from langchain.prompts import PromptTemplate
from langchain.output_parsers.pydantic import PydanticOutputParser

# LLMChain
search_prompt = PromptTemplate(
    input_variables=["question"],
    template="""You are an assistant tasked with improving Google search
results. Generate FIVE Google search queries that are similar to
this question. The output should be a numbered list of questions and each
should have a question mark at the end: {question}""",
)

class LineList(BaseModel):
    """List of questions."""

    lines: List[str] = Field(description="Questions")

class QuestionListOutputParser(PydanticOutputParser):
    """Output parser for a list of numbered questions."""

    def __init__(self) -> None:
        super().__init__(pydantic_object=LineList)

    def parse(self, text: str) -> LineList:
        lines = re.findall(r"\d+\..*?\n", text)
        return LineList(lines=lines)

llm_chain = LLMChain(
    llm=llm,
    prompt=search_prompt,
    output_parser=QuestionListOutputParser(),
)
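
As a quick sanity check (the sample text below is hypothetical, not notebook output), the parser keeps only lines that begin with a number:

QuestionListOutputParser().parse("1. First question?\n2. Second question?\n")
# -> LineList(lines=['1. First question?\n', '2. Second question?\n'])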


# Initialize
web_research_retriever_llm_chain = WebResearchRetriever(
    vectorstore=vectorstore,
    llm_chain=llm_chain,
    search=search,
)

# Run
docs = web_research_retriever_llm_chain.get_relevant_documents(user_input)
INFO:langchain.retrievers.web_research:Generating questions for Google Search ...
INFO:langchain.retrievers.web_research:Questions for Google Search (raw): {'question': 'What is Task Decomposition in LLM Powered Autonomous Agents?', 'text': LineList(lines=['1. How do LLM powered autonomous agents use task decomposition?\n', '2. Why is task decomposition important for LLM powered autonomous agents?\n', '3. Can you explain the concept of task decomposition in LLM powered autonomous agents?\n', '4. What are the benefits of task decomposition in LLM powered autonomous agents?\n'])}
INFO:langchain.retrievers.web_research:Questions for Google Search: ['1. How do LLM powered autonomous agents use task decomposition?\n', '2. Why is task decomposition important for LLM powered autonomous agents?\n', '3. Can you explain the concept of task decomposition in LLM powered autonomous agents?\n', '4. What are the benefits of task decomposition in LLM powered autonomous agents?\n']
INFO:langchain.retrievers.web_research:Searching for relevat urls ...
INFO:langchain.retrievers.web_research:Searching for relevat urls ...
INFO:langchain.retrievers.web_research:Search results: [{'title': "LLM Powered Autonomous Agents | Lil'Log", 'link': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'snippet': 'Jun 23, 2023 ... Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\\n1." , "What are the subgoals for achieving XYZ?'}]
INFO:langchain.retrievers.web_research:Searching for relevat urls ...
INFO:langchain.retrievers.web_research:Search results: [{'title': "LLM Powered Autonomous Agents | Lil'Log", 'link': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'snippet': 'Jun 23, 2023 ... Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\\n1." , "What are the subgoals for achieving XYZ?" , (2)\xa0...'}]
INFO:langchain.retrievers.web_research:Searching for relevat urls ...
INFO:langchain.retrievers.web_research:Search results: [{'title': "LLM Powered Autonomous Agents | Lil'Log", 'link': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'snippet': 'Jun 23, 2023 ... Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\\n1." , "What are the subgoals for achieving XYZ?'}]
INFO:langchain.retrievers.web_research:Searching for relevat urls ...
INFO:langchain.retrievers.web_research:Search results: [{'title': "LLM Powered Autonomous Agents | Lil'Log", 'link': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'snippet': 'Jun 23, 2023 ... Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\\n1." , "What are the subgoals for achieving XYZ?'}]
INFO:langchain.retrievers.web_research:New URLs to load: ['https://lilianweng.github.io/posts/2023-06-23-agent/']
INFO:langchain.retrievers.web_research:Grabbing most relevant splits from urls ...
Fetching pages: 100%|###################################################################################################################################| 1/1 [00:00<00:00, 6.32it/s]
len(docs)
1

Run locally

Specify an LLM and embeddings that will run locally (e.g., on your laptop).

from langchain.llms import LlamaCpp
from langchain.embeddings import GPT4AllEmbeddings
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

n_gpu_layers = 1 # Metal set to 1 is enough.
n_batch = 512 # Should be between 1 and n_ctx, consider the amount of RAM of your Apple Silicon Chip.
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
llama = LlamaCpp(
    model_path="/Users/rlm/Desktop/Code/llama.cpp/llama-2-13b-chat.ggmlv3.q4_0.bin",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    n_ctx=4096,  # Context window
    max_tokens=1000,  # Max tokens to generate
    f16_kv=True,  # MUST set to True, otherwise you will run into problem after a couple of calls
    callback_manager=callback_manager,
    verbose=True,
)

vectorstore_llama = Chroma(embedding_function=GPT4AllEmbeddings(), persist_directory="./chroma_db_llama")

We supplied StreamingStdOutCallbackHandler(), so the model outputs (e.g., the generated questions) are streamed.

We also have logging on, so we can see them in the logs as well.
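
If logging was not already turned on in the section above, the same standard-library setup applies here:

import logging
logging.basicConfig()
logging.getLogger("langchain.retrievers.web_research").setLevel(logging.INFO)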

from langchain.chains import RetrievalQAWithSourcesChain
# Initialize
web_research_retriever = WebResearchRetriever.from_llm(
    vectorstore=vectorstore_llama,
    llm=llama,
    search=search,
)

# Run
user_input = "What is Task Decomposition in LLM Powered Autonomous Agents?"
qa_chain = RetrievalQAWithSourcesChain.from_chain_type(llama, retriever=web_research_retriever)
result = qa_chain({"question": user_input})
result