Custom Pairwise Evaluator

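You can create your own pairwise string evaluator by inheriting from the PairwiseStringEvaluator class and overriding the _evaluate_string_pairs method (and also the _aevaluate_string_pairs method if you want to use the evaluator asynchronously).

In this example, you will create a simple custom evaluator that only reports whether the first prediction contains more whitespace-tokenized "words" than the second: the score is 1 if it does and 0 otherwise.

See the reference documentation for the PairwiseStringEvaluator interface for more information.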

from typing import Optional, Any
from langchain.evaluation import PairwiseStringEvaluator


class LengthComparisonPairwiseEvalutor(PairwiseStringEvaluator):
    """
    Custom evaluator to compare two strings.
    """

    def _evaluate_string_pairs(
        self,
        *,
        prediction: str,
        prediction_b: str,
        reference: Optional[str] = None,
        input: Optional[str] = None,
        **kwargs: Any,
    ) -> dict:
        # Score 1 when the first prediction has more whitespace-separated tokens, else 0.
        score = int(len(prediction.split()) > len(prediction_b.split()))
        return {"score": score}

evaluator = LengthComparisonPairwiseEvalutor()

evaluator.evaluate_string_pairs(
    prediction="The quick brown fox jumped over the lazy dog.",
    prediction_b="The quick brown fox jumped over the dog.",
)

Output:

{'score': 1}  
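
The introduction mentions that asynchronous use also requires overriding _aevaluate_string_pairs. A minimal sketch of what that could look like, reusing the same length comparison, is shown below. The class name AsyncLengthComparisonEvaluator is hypothetical, and the try/except fallback base class is an assumption added only so the snippet runs where langchain is not installed; it is not part of the library.

```python
import asyncio
from typing import Any, Optional

try:
    from langchain.evaluation import PairwiseStringEvaluator
except ImportError:
    # Assumption: stand-in base class so the sketch runs without langchain.
    class PairwiseStringEvaluator:  # type: ignore[no-redef]
        pass


class AsyncLengthComparisonEvaluator(PairwiseStringEvaluator):
    """Hypothetical evaluator providing both sync and async implementations."""

    def _evaluate_string_pairs(
        self,
        *,
        prediction: str,
        prediction_b: str,
        reference: Optional[str] = None,
        input: Optional[str] = None,
        **kwargs: Any,
    ) -> dict:
        score = int(len(prediction.split()) > len(prediction_b.split()))
        return {"score": score}

    async def _aevaluate_string_pairs(
        self,
        *,
        prediction: str,
        prediction_b: str,
        reference: Optional[str] = None,
        input: Optional[str] = None,
        **kwargs: Any,
    ) -> dict:
        # The comparison does no I/O, so the async path simply reuses the sync logic.
        return self._evaluate_string_pairs(
            prediction=prediction,
            prediction_b=prediction_b,
            reference=reference,
            input=input,
            **kwargs,
        )


# Calls the private coroutine directly so the sketch also works with the stand-in base.
result = asyncio.run(
    AsyncLengthComparisonEvaluator()._aevaluate_string_pairs(
        prediction="one two three",
        prediction_b="one two",
    )
)
print(result)
```

With the real base class you would normally go through the public aevaluate_string_pairs method rather than the underscore-prefixed one. Note that a tie in word count yields a score of 0, because the comparison is strict.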

LLM-Based Example

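The example above only illustrates the API and is not very useful in practice. Below, an LLM with custom instructions is used to build a simple preference scorer similar to the built-in PairwiseStringEvalChain. We will use ChatAnthropic for the evaluator chain.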

# %pip install anthropic  
# %env ANTHROPIC_API_KEY=YOUR_API_KEY

from typing import Optional, Any
from langchain.evaluation import PairwiseStringEvaluator
from langchain.chat_models import ChatAnthropic
from langchain.chains import LLMChain

class CustomPreferenceEvaluator(PairwiseStringEvaluator):
    """
    Custom evaluator to compare two strings using a custom LLMChain.
    """

    def __init__(self) -> None:
        llm = ChatAnthropic(model="claude-2", temperature=0)
        # One-shot prompt: a worked example followed by the actual question.
        self.eval_chain = LLMChain.from_string(
            llm,
            """Which option is preferred? Do not take order into account. Evaluate based on accuracy and helpfulness. If neither is preferred, respond with C. Provide your reasoning, then finish with Preference: A/B/C

Input: How do I get the path of the parent directory in python 3.8?
Option A: You can use the following code:

import os
os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

Option B: You can use the following code:

from pathlib import Path
Path(__file__).absolute().parent

Reasoning: Both options return the same result. However, since option B is more concise and easier to understand, it is preferred.
Preference: B

Which option is preferred? Do not take order into account. Evaluate based on accuracy and helpfulness. If neither is preferred, respond with C. Provide your reasoning, then finish with Preference: A/B/C
Input: {input}
Option A: {prediction}
Option B: {prediction_b}
Reasoning:""",
        )

    @property
    def requires_input(self) -> bool:
        return True

    @property
    def requires_reference(self) -> bool:
        return False

    def _evaluate_string_pairs(
        self,
        *,
        prediction: str,
        prediction_b: str,
        reference: Optional[str] = None,
        input: Optional[str] = None,
        **kwargs: Any,
    ) -> dict:
        result = self.eval_chain(
            {
                "input": input,
                "prediction": prediction,
                "prediction_b": prediction_b,
                # Stop before the model starts generating another question.
                "stop": ["Which option is preferred?"],
            },
            **kwargs,
        )

        response_text = result["text"]
        # Everything before "Preference:" is the reasoning; the rest is the verdict.
        reasoning, preference = response_text.split("Preference:", maxsplit=1)
        preference = preference.strip()
        # A -> 1.0, B -> 0.0, anything else (e.g. C) -> None
        score = 1.0 if preference == "A" else (0.0 if preference == "B" else None)
        return {"reasoning": reasoning.strip(), "value": preference, "score": score}
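
The parsing step at the end of _evaluate_string_pairs unpacks the result of split("Preference:", maxsplit=1) into two names, which raises a ValueError if the model's reply never contains that marker. A slightly more defensive version of the same parsing, written as a standalone helper, might look like this. The function name parse_preference is hypothetical and not part of LangChain.

```python
from typing import Optional, Tuple


def parse_preference(
    response_text: str,
) -> Tuple[str, Optional[str], Optional[float]]:
    """Split an LLM reply into (reasoning, preference, score)."""
    # partition splits on the first occurrence, like split(..., maxsplit=1),
    # but returns an empty marker instead of failing when it is absent.
    reasoning, marker, preference = response_text.partition("Preference:")
    if not marker:
        # No verdict found: keep the text as reasoning, leave the rest unset.
        return response_text.strip(), None, None
    preference = preference.strip()
    # A -> 1.0, B -> 0.0, anything else (e.g. C) -> None
    score = {"A": 1.0, "B": 0.0}.get(preference)
    return reasoning.strip(), preference, score


print(parse_preference("Option B is clearer. Preference: B"))
print(parse_preference("The model forgot to give a verdict."))
```

Whether a missing verdict should produce a None score or an error is a design choice; returning None keeps a long batch evaluation running, while raising makes prompt problems visible sooner.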

evaluator = CustomPreferenceEvaluator()

evaluator.evaluate_string_pairs(
    input="How do I import from a relative directory?",
    prediction="use importlib! importlib.import_module('.my_package', '.')",
    prediction_b="from .sibling import foo",
)

Output:

{'reasoning': 'Option B is preferred over option A for importing from a relative directory, because it is more straightforward and concise.\n\nOption A uses the importlib module, which allows importing a module by specifying the full name as a string. While this works, it is less clear compared to option B.\n\nOption B directly imports from the relative path using dot notation, which clearly shows that it is a relative import. This is the recommended way to do relative imports in Python.\n\nIn summary, option B is more accurate and helpful as it uses the standard Python relative import syntax.',  
'value': 'B',
'score': 0.0}

# Setting requires_input to return True adds extra validation, so no score
# is returned when the chain is not given enough data.

try:
    evaluator.evaluate_string_pairs(
        prediction="use importlib! importlib.import_module('.my_package', '.')",
        prediction_b="from .sibling import foo",
    )
except ValueError as e:
    print(e)

Output:

CustomPreferenceEvaluator requires an input string.
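
To make the effect of requires_input and requires_reference concrete, here is a rough standalone sketch of the kind of check that produces the error above. It is an illustration only, not the library's actual code: the function check_evaluation_args and its parameters are hypothetical, and the wording of the reference-string message is assumed by analogy with the input-string message shown in the output.

```python
from typing import Optional


def check_evaluation_args(
    evaluator_name: str,
    requires_input: bool,
    requires_reference: bool,
    input: Optional[str] = None,
    reference: Optional[str] = None,
) -> None:
    """Raise ValueError when a required argument is missing (illustrative only)."""
    if requires_input and input is None:
        raise ValueError(f"{evaluator_name} requires an input string.")
    if requires_reference and reference is None:
        # Assumed message wording, mirroring the input case.
        raise ValueError(f"{evaluator_name} requires a reference string.")


try:
    check_evaluation_args(
        "CustomPreferenceEvaluator",
        requires_input=True,
        requires_reference=False,
    )
except ValueError as e:
    message = str(e)
    print(message)
```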