Custom String Evaluator
You can make your own custom string evaluator by inheriting from the `StringEvaluator` class and implementing the `_evaluate_strings` method (and `_aevaluate_strings` for async support).
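In this example, you will create a perplexity evaluator using HuggingFace's `evaluate` library. Perplexity measures how well the generated text would be predicted by the model used to compute the metric. Lower values mean the model finds the text more predictable.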
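Concretely, perplexity is usually defined as the exponential of the average negative log-likelihood that the model assigns to each token, given the tokens before it: PPL = exp(-(1/N) Σ log p(x_i | x_<i)). The sketch below computes that quantity from a list of per-token probabilities. It assumes the standard natural-log definition. The `perplexity` helper is hypothetical, and the probabilities are made-up numbers rather than real model outputs. In the example that follows, the `evaluate` metric derives these probabilities from the model named by `model_id`.

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """Hypothetical helper: perplexity from each token's conditional probability.

    PPL = exp(-(1/N) * sum(log p(x_i | x_<i)))
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood per token (natural log).
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# Every token predicted with probability 0.25: perplexity is about 4.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
# One very unlikely token drives the perplexity up.
surprising = perplexity([0.9, 0.9, 0.9, 0.001])
print(uniform, surprising)
```

With a uniform 0.25 per token the result is about 4, as if the model were choosing among four equally likely tokens at each step. A single improbable token raises the value sharply, which is the same effect the LangChain sentence shows below.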
```python
# %pip install evaluate > /dev/null
from typing import Any, Optional

from evaluate import load
from langchain.evaluation import StringEvaluator


class PerplexityEvaluator(StringEvaluator):
    """Evaluate the perplexity of a predicted string."""

    def __init__(self, model_id: str = "gpt2"):
        self.model_id = model_id
        # Load HuggingFace's "perplexity" metric; scores are computed
        # with the language model identified by model_id.
        self.metric_fn = load(
            "perplexity", module_type="metric", model_id=self.model_id, pad_token=0
        )

    def _evaluate_strings(
        self,
        *,
        prediction: str,
        reference: Optional[str] = None,
        input: Optional[str] = None,
        **kwargs: Any,
    ) -> dict:
        # Perplexity depends only on the prediction; reference and input are unused.
        results = self.metric_fn.compute(
            predictions=[prediction], model_id=self.model_id
        )
        # compute() returns one perplexity per prediction; we passed exactly one.
        ppl = results["perplexities"][0]
        return {"score": ppl}
```
API reference: `StringEvaluator` from `langchain.evaluation`
```python
evaluator = PerplexityEvaluator()
evaluator.evaluate_strings(prediction="The rains in Spain fall mainly on the plain.")
```
```text
Using pad_token, but it is not set yet.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
{'score': 190.3675537109375}
```
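```python
# The perplexity is much higher because LangChain was introduced after 'gpt-2'
# was released, and the model has never seen it used in this context.
evaluator.evaluate_strings(prediction="The rains in Spain fall mainly on LangChain.")
```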
```text
Using pad_token, but it is not set yet.
{'score': 1982.0709228515625}
```
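The `PerplexityEvaluator` above implements only the synchronous `_evaluate_strings`. One way to also provide `_aevaluate_strings` is to run the blocking implementation in a worker thread with `asyncio.to_thread` (Python 3.9+). The sketch below is a hypothetical, self-contained stand-in. `WordCountEvaluator` and its word-count score are made up for illustration, and the class does not import LangChain. In real code it would inherit from `StringEvaluator`, as `PerplexityEvaluator` does. Whether LangChain's base class already supplies a default async fallback may depend on the installed version, so check before relying on either behavior.

```python
import asyncio
from typing import Any, Optional


class WordCountEvaluator:
    """Hypothetical stand-in mirroring the StringEvaluator hook names.

    In real code this would subclass langchain.evaluation.StringEvaluator.
    """

    def _evaluate_strings(
        self,
        *,
        prediction: str,
        reference: Optional[str] = None,
        input: Optional[str] = None,
        **kwargs: Any,
    ) -> dict:
        # Blocking work (e.g. a model forward pass) would go here.
        return {"score": len(prediction.split())}

    async def _aevaluate_strings(
        self,
        *,
        prediction: str,
        reference: Optional[str] = None,
        input: Optional[str] = None,
        **kwargs: Any,
    ) -> dict:
        # Run the synchronous implementation in a worker thread so the
        # event loop is not blocked while it executes.
        return await asyncio.to_thread(
            self._evaluate_strings,
            prediction=prediction,
            reference=reference,
            input=input,
            **kwargs,
        )


result = asyncio.run(
    WordCountEvaluator()._aevaluate_strings(prediction="The rains in Spain")
)
print(result)  # {'score': 4}
```

Offloading to a thread keeps the event loop responsive without duplicating the scoring logic. A natively async implementation would only pay off if the underlying metric exposed an async API of its own.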