Custom Trajectory Evaluator

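You can create your own custom trajectory evaluator by inheriting from the AgentTrajectoryEvaluator class and overriding the _evaluate_agent_trajectory method (and _aevaluate_agent_trajectory, if you need async support).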

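In this example, you will build a simple trajectory evaluator that uses an LLM to determine whether any of the agent's actions were unnecessary.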

from typing import Any, Optional, Sequence, Tuple

from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.evaluation import AgentTrajectoryEvaluator
from langchain.schema import AgentAction


class StepNecessityEvaluator(AgentTrajectoryEvaluator):
    """Checks whether the agent's trajectory contains any unnecessary steps."""

    def __init__(self) -> None:
        # Temperature 0 keeps the verdict as deterministic as possible.
        llm = ChatOpenAI(model="gpt-4", temperature=0.0)
        template = """Are any of the following steps unnecessary in answering {input}? Provide the verdict on a new line as a single "Y" for yes or "N" for no.

DATA
------
Steps: {trajectory}
------

Verdict:"""
        self.chain = LLMChain.from_string(llm, template)

    def _evaluate_agent_trajectory(
        self,
        *,
        prediction: str,
        input: str,
        agent_trajectory: Sequence[Tuple[AgentAction, str]],
        reference: Optional[str] = None,
        **kwargs: Any,
    ) -> dict:
        # Render each (action, observation) pair as one numbered line.
        vals = [
            f"{i}: Action=[{action.tool}] returned observation = [{observation}]"
            for i, (action, observation) in enumerate(agent_trajectory)
        ]
        trajectory = "\n".join(vals)
        response = self.chain.run(dict(trajectory=trajectory, input=input), **kwargs)
        # The verdict is expected on the last line of the model's reply.
        decision = response.split("\n")[-1].strip()
        score = 1 if decision == "Y" else 0
        return {"score": score, "value": decision, "reasoning": response}

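The example above returns a score of 1 if the language model predicts that any of the actions were unnecessary, and a score of 0 if all of them were predicted to be necessary. It returns the string decision as the 'value' and the full generated text as the 'reasoning', so you can audit the verdict.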
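The mapping from the model's reply to the result dict can be checked without calling an LLM. The sketch below copies the parsing logic from _evaluate_agent_trajectory into a standalone helper; the name parse_verdict is hypothetical, used only for illustration, and is not part of LangChain.

```python
from typing import Dict, Union


def parse_verdict(response: str) -> Dict[str, Union[int, str]]:
    """Mirror the evaluator's parsing: the last line of the reply is the verdict."""
    decision = response.split("\n")[-1].strip()
    # Only an exact "Y" counts as "an unnecessary step was found".
    score = 1 if decision == "Y" else 0
    return {"score": score, "value": decision, "reasoning": response}


print(parse_verdict("Watching TV does not help answer the question.\nY"))
print(parse_verdict("N"))
print(parse_verdict("y"))  # lowercase does not match, so the score is 0
```

The comparison is exact, so a reply ending in "y", "Yes", or a trailing blank line scores 0 even when the model meant yes. If that matters for your use case, normalizing the verdict before comparing is one option.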

You can call this evaluator to grade the intermediate steps of your agent's trajectory.

evaluator = StepNecessityEvaluator()

evaluator.evaluate_agent_trajectory(
    prediction="The answer is pi",
    input="What is today?",
    agent_trajectory=[
        (
            AgentAction(tool="ask", tool_input="What is today?", log=""),
            "tomorrow's yesterday",
        ),
        (
            AgentAction(tool="check_tv", tool_input="Watch tv for half hour", log=""),
            "bzzz",
        ),
    ],
)

Output:

{'score': 1, 'value': 'Y', 'reasoning': 'Y'}
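
To see what text the model is actually asked to judge for the call above, the trajectory formatting can be reproduced on its own. This is a minimal sketch that runs without LangChain: FakeAction is a hypothetical stand-in for AgentAction carrying the same three fields, and format_trajectory is an illustrative name for the list comprehension inside _evaluate_agent_trajectory.

```python
from typing import NamedTuple, Sequence, Tuple


class FakeAction(NamedTuple):
    """Hypothetical stand-in for langchain.schema.AgentAction."""

    tool: str
    tool_input: str
    log: str


def format_trajectory(steps: Sequence[Tuple[FakeAction, str]]) -> str:
    """Render each (action, observation) pair as one numbered line."""
    lines = [
        f"{i}: Action=[{action.tool}] returned observation = [{observation}]"
        for i, (action, observation) in enumerate(steps)
    ]
    return "\n".join(lines)


steps = [
    (FakeAction("ask", "What is today?", ""), "tomorrow's yesterday"),
    (FakeAction("check_tv", "Watch tv for half hour", ""), "bzzz"),
]
print(format_trajectory(steps))
# 0: Action=[ask] returned observation = [tomorrow's yesterday]
# 1: Action=[check_tv] returned observation = [bzzz]
```

Only the tool name and the observation reach the prompt; tool_input and log are not included, so the model judges necessity without seeing what was passed to each tool.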