
Agent Trajectory

Agents can be difficult to evaluate holistically due to the breadth of actions and generations they can make. We recommend using multiple evaluation techniques appropriate to your use case. One way to evaluate an agent is to look at the whole trajectory of actions it took along with their responses.

Evaluators that do this can implement the AgentTrajectoryEvaluator interface. This walkthrough will show how to use the trajectory evaluator to grade an OpenAI functions agent.

For more information, check out the reference docs for TrajectoryEvalChain.

from langchain.evaluation import load_evaluator

evaluator = load_evaluator("trajectory")


Methods

Agent trajectory evaluators are used with the evaluate_agent_trajectory (and async aevaluate_agent_trajectory) methods, which accept the following arguments:

  • input (str) - The input to the agent.
  • prediction (str) - The final predicted response.
  • agent_trajectory (List[Tuple[AgentAction, str]]) - The intermediate steps forming the agent's trajectory.

They return a dictionary with the following values:

  • score: a float from 0 to 1, where 1 means "most effective" and 0 means "least effective"
  • reasoning: a "chain of thought" string from the LLM, generated prior to producing the score
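To make the argument shapes concrete, here is a pure-Python sketch of a one-step trajectory. Note this uses a namedtuple as a stand-in for LangChain's AgentAction class; the field names and example values are illustrative, not taken from the library:

```python
from collections import namedtuple

# Stand-in for LangChain's AgentAction: one intermediate step carries the
# tool name, the tool input, and the raw LLM output that produced the call.
AgentAction = namedtuple("AgentAction", ["tool", "tool_input", "log"])

# A trajectory is a list of (action, observation) pairs, where the
# observation is the string the tool returned.
agent_trajectory = [
    (
        AgentAction(
            tool="ping",
            tool_input={"url": "https://langchain.com", "return_error": False},
            log="Invoking ping...",
        ),
        "1 packets transmitted, 1 packets received",
    )
]
```

Each step pairs the action the agent chose with the observation that came back, which is exactly what the evaluator reasons over.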

Capturing Trajectory

The easiest way to return an agent's trajectory for evaluation (without using tracing callbacks like those in LangSmith) is to initialize the agent with return_intermediate_steps=True.

Below, create an example agent that we will call for evaluation.

import os
import subprocess

from langchain.chat_models import ChatOpenAI
from langchain.tools import tool
from langchain.agents import AgentType, initialize_agent

from pydantic import HttpUrl
from urllib.parse import urlparse


@tool
def ping(url: HttpUrl, return_error: bool) -> str:
    """Ping the fully specified url. Must include https:// in the url."""
    hostname = urlparse(str(url)).netloc
    completed_process = subprocess.run(
        ["ping", "-c", "1", hostname], capture_output=True, text=True
    )
    output = completed_process.stdout
    if return_error and completed_process.returncode != 0:
        return completed_process.stderr
    return output


@tool
def trace_route(url: HttpUrl, return_error: bool) -> str:
    """Trace the route to the specified url. Must include https:// in the url."""
    hostname = urlparse(str(url)).netloc
    completed_process = subprocess.run(
        ["traceroute", hostname], capture_output=True, text=True
    )
    output = completed_process.stdout
    if return_error and completed_process.returncode != 0:
        return completed_process.stderr
    return output


llm = ChatOpenAI(model="gpt-3.5-turbo-0613", temperature=0)
agent = initialize_agent(
    llm=llm,
    tools=[ping, trace_route],
    agent=AgentType.OPENAI_MULTI_FUNCTIONS,
    return_intermediate_steps=True,  # IMPORTANT!
)

result = agent("What's the latency like for https://langchain.com?")
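With return_intermediate_steps=True, the result dict carries the trajectory under "intermediate_steps" alongside "input" and "output", so you can also assert on raw tool observations yourself. For instance, pulling the round-trip time out of typical Unix ping output could be sketched like this (parse_latency_ms is a hypothetical helper, not part of LangChain or the agent above):

```python
import re
from typing import Optional


def parse_latency_ms(ping_output: str) -> Optional[float]:
    """Extract the first round-trip time (in ms) from `ping` stdout."""
    match = re.search(r"time[=<]([\d.]+)\s*ms", ping_output)
    return float(match.group(1)) if match else None


# Typical single-reply line from Unix `ping -c 1`:
sample = "64 bytes from 35.71.142.77: icmp_seq=0 ttl=245 time=13.2 ms"
print(parse_latency_ms(sample))  # 13.2
```

A parser like this could be applied to each observation string in result["intermediate_steps"] to spot-check what the agent actually saw.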


Evaluating the Trajectory

Pass the input, prediction, and trajectory to the evaluate_agent_trajectory method.

evaluation_result = evaluator.evaluate_agent_trajectory(
    prediction=result["output"],
    input=result["input"],
    agent_trajectory=result["intermediate_steps"],
)
evaluation_result
{
    'score': 1.0,
    'reasoning': "i. The final answer is helpful. It directly answers the user's question about the latency for the website https://langchain.com.\n\nii. The AI language model uses a logical sequence of tools to answer the question. It uses the 'ping' tool to measure the latency of the website, which is the correct tool for this task.\n\niii. The AI language model uses the tool in a helpful way. It inputs the URL into the 'ping' tool and correctly interprets the output to provide the latency in milliseconds.\n\niv. The AI language model does not use too many steps to answer the question. It only uses one step, which is appropriate for this type of question.\n\nv. The appropriate tool is used to answer the question. The 'ping' tool is the correct tool to measure website latency.\n\nGiven these considerations, the AI language model's performance is excellent. It uses the correct tool, interprets the output correctly, and provides a helpful and direct answer to the user's question."
}
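When grading many trajectories, you will usually want to aggregate these result dictionaries rather than inspect them one at a time. A minimal sketch (summarize is a hypothetical helper, and the 0.7 pass threshold is an arbitrary choice, not a LangChain default):

```python
def summarize(results, threshold=0.7):
    """Aggregate evaluator outputs into a mean score and a pass rate."""
    scores = [r["score"] for r in results]
    return {
        "mean_score": sum(scores) / len(scores),
        "pass_rate": sum(s >= threshold for s in scores) / len(scores),
    }


# Example over three evaluator outputs:
results = [{"score": 1.0}, {"score": 0.5}, {"score": 0.9}]
print(summarize(results))
```

Tracking a pass rate against a fixed threshold makes it easier to catch regressions when you change the agent's prompt or tools.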

Configuring the Evaluation LLM

If you don't select an LLM to use for evaluation, the load_evaluator function will use gpt-4 to power the evaluation chain. You can select any chat model for the agent trajectory evaluator, as shown below.

# %pip install anthropic
# ANTHROPIC_API_KEY=<YOUR ANTHROPIC API KEY>

from langchain.chat_models import ChatAnthropic

eval_llm = ChatAnthropic(temperature=0)
evaluator = load_evaluator("trajectory", llm=eval_llm)


evaluation_result = evaluator.evaluate_agent_trajectory(
    prediction=result["output"],
    input=result["input"],
    agent_trajectory=result["intermediate_steps"],
)
evaluation_result
{
    'score': 1.0,
    'reasoning': "Here is my detailed evaluation of the AI's response:\n\ni. The final answer is helpful, as it directly provides the latency measurement for the requested website.\n\nii. The sequence of using the ping tool to measure latency is logical for this question.\n\niii. The ping tool is used in a helpful way, with the website URL provided as input and the output latency measurement extracted.\n\niv. Only one step is used, which is appropriate for simply measuring latency. More steps are not needed.\n\nv. The ping tool is an appropriate choice to measure latency. \n\nIn summary, the AI uses an optimal single step approach with the right tool and extracts the needed output. The final answer directly answers the question in a helpful way.\n\nOverall"
}

Providing a List of Valid Tools

By default, the evaluator doesn't take into account the tools the agent is permitted to call. You can provide these to the evaluator via the agent_tools argument.

from langchain.evaluation import load_evaluator

evaluator = load_evaluator("trajectory", agent_tools=[ping, trace_route])


evaluation_result = evaluator.evaluate_agent_trajectory(
    prediction=result["output"],
    input=result["input"],
    agent_trajectory=result["intermediate_steps"],
)
evaluation_result
{
    'score': 1.0,
    'reasoning': "i. The final answer is helpful. It directly answers the user's question about the latency for the specified website.\n\nii. The AI language model uses a logical sequence of tools to answer the question. In this case, only one tool was needed to answer the question, and the model chose the correct one.\n\niii. The AI language model uses the tool in a helpful way. The 'ping' tool was used to determine the latency of the website, which was the information the user was seeking.\n\niv. The AI language model does not use too many steps to answer the question. Only one step was needed and used.\n\nv. The appropriate tool was used to answer the question. The 'ping' tool is designed to measure latency, which was the information the user was seeking.\n\nGiven these considerations, the AI language model's performance in answering this question is excellent."
}