Multi-Agent Simulated Environment: Petting Zoo

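In this example, we show how to define multi-agent simulations with simulated environments. As in our single-agent example with Gymnasium, we create an agent-environment loop in which the environment is defined externally. The main difference is that this interaction loop now involves multiple agents. We will use the Petting Zoo library, which is the multi-agent counterpart to Gymnasium.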

Install `pettingzoo` and other dependencies

pip install pettingzoo pygame rlcard

Import modules

import collections
import inspect
import tenacity

from langchain.chat_models import ChatOpenAI
from langchain.schema import (
    HumanMessage,
    SystemMessage,
)
from langchain.output_parsers import RegexParser

`GymnasiumAgent`

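Here we reproduce the same `GymnasiumAgent` defined in our Gymnasium example. If it fails to produce a valid action after the configured number of attempts, it simply takes a random action.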

class GymnasiumAgent:
    @classmethod
    def get_docs(cls, env):
        return env.unwrapped.__doc__

    def __init__(self, model, env):
        self.model = model
        self.env = env
        self.docs = self.get_docs(env)

        self.instructions = """
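Your goal is to maximize your return, i.e. the sum of the rewards you receive.
I will give you an observation, reward, termination flag, truncation flag, and the return so far, formatted as:

Observation: <observation>
Reward: <reward>
Termination: <termination>
Truncation: <truncation>
Return: <sum_of_rewards>

You will respond with an action, formatted as:

Action: <action>

where you replace <action> with your actual action.
Do nothing else but return the action.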
"""
        self.action_parser = RegexParser(
            regex=r"Action: (.*)", output_keys=["action"], default_output_key="action"
        )

        self.message_history = []
        self.ret = 0

    def random_action(self):
        action = self.env.action_space.sample()
        return action

    def reset(self):
        self.message_history = [
            SystemMessage(content=self.docs),
            SystemMessage(content=self.instructions),
        ]

    def observe(self, obs, rew=0, term=False, trunc=False, info=None):
        self.ret += rew

        obs_message = f"""
Observation: {obs}
Reward: {rew}
Termination: {term}
Truncation: {trunc}
Return: {self.ret}
        """
        self.message_history.append(HumanMessage(content=obs_message))
        return obs_message

    def _act(self):
        act_message = self.model(self.message_history)
        self.message_history.append(act_message)
        action = int(self.action_parser.parse(act_message.content)["action"])
        return action

    def act(self):
        try:
            for attempt in tenacity.Retrying(
                stop=tenacity.stop_after_attempt(2),
                wait=tenacity.wait_none(),  # No waiting time between retries
                retry=tenacity.retry_if_exception_type(ValueError),
                before_sleep=lambda retry_state: print(
                    f"ValueError occurred: {retry_state.outcome.exception()}, retrying..."
                ),
            ):
                with attempt:
                    action = self._act()
        except tenacity.RetryError as e:
            action = self.random_action()
        return action
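
To make the parse-then-fall-back behavior easier to follow in isolation, here is a minimal sketch that uses only the standard library. It is a simplified stand-in, not the `RegexParser` or `tenacity` code itself: `parse_action` and `act_with_fallback` are hypothetical helper names, and the model replies are passed in as plain strings.

```python
import random
import re

# Same pattern the agent hands to RegexParser.
ACTION_PATTERN = re.compile(r"Action: (.*)")


def parse_action(reply: str) -> int:
    """Extract the integer action from a model reply, or raise ValueError."""
    match = ACTION_PATTERN.search(reply)
    if match is None:
        raise ValueError(f"no 'Action: ...' line in reply: {reply!r}")
    # int() raises ValueError when the captured text is not a number.
    return int(match.group(1).strip())


def act_with_fallback(replies, n_actions, max_attempts=2, rng=random):
    """Try up to max_attempts replies, then fall back to a random action."""
    for reply in replies[:max_attempts]:
        try:
            return parse_action(reply)
        except ValueError:
            continue  # malformed reply: move on to the next attempt
    return rng.randrange(n_actions)


if __name__ == "__main__":
    print(act_with_fallback(["I choose rock", "Action: 1"], n_actions=3))
```

The sketch mirrors the idea in `act`: only `ValueError` counts as a recoverable failure, and the random action is a last resort once the attempts are used up.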

Main loop

def main(agents, env):
    env.reset()

    for name, agent in agents.items():
        agent.reset()

    for agent_name in env.agent_iter():
        observation, reward, termination, truncation, info = env.last()
        obs_message = agents[agent_name].observe(
            observation, reward, termination, truncation, info
        )
        print(obs_message)
        if termination or truncation:
            action = None
        else:
            action = agents[agent_name].act()
        print(f"Action: {action}")
        env.step(action)
    env.close()
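
The loop can be dry-run without PettingZoo or an LLM. The sketch below assumes a toy environment, `ToyTurnEnv`, and a `ScriptedAgent`; both are hypothetical stand-ins that only mimic the handful of methods the loop calls (`reset`, `agent_iter`, `last`, `step`) and are not PettingZoo's implementation. `run_loop` follows the same steps as `main` but records the actions instead of printing them.

```python
class ToyTurnEnv:
    """Hypothetical turn-based environment: agents alternate for max_turns turns."""

    def __init__(self, names, max_turns):
        self.possible_agents = list(names)
        self.max_turns = max_turns

    def reset(self):
        self.agents = list(self.possible_agents)  # agents still in the game
        self.turns = {name: 0 for name in self.agents}
        self.current = 0

    def _current_agent(self):
        return self.agents[self.current % len(self.agents)]

    def agent_iter(self):
        # Keep yielding the agent whose turn it is until all are removed.
        while self.agents:
            yield self._current_agent()

    def last(self):
        name = self._current_agent()
        truncated = self.turns[name] >= self.max_turns
        # (observation, reward, termination, truncation, info)
        return self.turns[name], 0, False, truncated, {}

    def step(self, action):
        name = self._current_agent()
        if action is None:  # a finished agent is stepped with None and dropped
            self.agents.remove(name)
        else:
            self.turns[name] += 1
            self.current += 1


class ScriptedAgent:
    """Hypothetical agent with the same interface, always playing one action."""

    def __init__(self, action):
        self.action = action
        self.seen = []

    def reset(self):
        self.seen = []

    def observe(self, obs, rew=0, term=False, trunc=False, info=None):
        self.seen.append((obs, trunc))

    def act(self):
        return self.action


def run_loop(agents, env):
    """Same control flow as main(), returning a log of (agent, action) pairs."""
    env.reset()
    for agent in agents.values():
        agent.reset()
    log = []
    for agent_name in env.agent_iter():
        observation, reward, termination, truncation, info = env.last()
        agents[agent_name].observe(observation, reward, termination, truncation, info)
        action = None if termination or truncation else agents[agent_name].act()
        log.append((agent_name, action))
        env.step(action)
    return log


if __name__ == "__main__":
    toy_env = ToyTurnEnv(["player_0", "player_1"], max_turns=2)
    toy_agents = {"player_0": ScriptedAgent(0), "player_1": ScriptedAgent(1)}
    for entry in run_loop(toy_agents, toy_env):
        print(entry)
```

The point to notice is the final pass: once an agent is terminated or truncated, the loop still visits it one more time and steps the environment with `None`.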

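`PettingZooAgent`

The `PettingZooAgent` extends the `GymnasiumAgent` to the multi-agent setting. The main differences are:

`PettingZooAgent` takes a `name` argument that identifies it among multiple agents
`get_docs` is implemented differently, because the PettingZoo repo is structured differently from the Gymnasium repo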

class PettingZooAgent(GymnasiumAgent):
    @classmethod
    def get_docs(cls, env):
        return inspect.getmodule(env.unwrapped).__doc__

    def __init__(self, name, model, env):
        super().__init__(model, env)
        self.name = name

    def random_action(self):
        action = self.env.action_space(self.name).sample()
        return action
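
The difference between the two `get_docs` variants comes down to which docstring is read: the one on the object's class, or the one at the top of the module that defines the class. The sketch below illustrates this with a standard-library class, `json.JSONDecoder`, chosen only because it needs no extra installs; it says nothing about what the PettingZoo docstrings contain.

```python
import inspect
import json

decoder = json.JSONDecoder()

# What GymnasiumAgent.get_docs reads: the docstring found on the object,
# which resolves to the class docstring.
class_doc = decoder.__doc__

# What PettingZooAgent.get_docs reads: the docstring of the module that
# defines the object's class.
defining_module = inspect.getmodule(decoder)
module_doc = defining_module.__doc__

if __name__ == "__main__":
    print(defining_module.__name__)
    print(class_doc != module_doc)
```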

Rock, Paper, Scissors

We can now run a simulation of a multi-agent rock, paper, scissors game using the `PettingZooAgent`.

from pettingzoo.classic import rps_v2

env = rps_v2.env(max_cycles=3, render_mode="human")
agents = {
    name: PettingZooAgent(name=name, model=ChatOpenAI(temperature=1), env=env)
    for name in env.possible_agents
}
main(agents, env)

    
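    Observation: 3
    Reward: 0
    Termination: False
    Truncation: False
    Return: 0
            
    Action: 1
    
    Observation: 3
    Reward: 0
    Termination: False
    Truncation: False
    Return: 0
            
    Action: 1
    
    Observation: 1
    Reward: 0
    Termination: False
    Truncation: False
    Return: 0
            
    Action: 2
    
    Observation: 1
    Reward: 0
    Termination: False
    Truncation: False
    Return: 0
            
    Action: 1
    
    Observation: 1
    Reward: 1
    Termination: False
    Truncation: False
    Return: 1
            
    Action: 0
    
    Observation: 2
    Reward: -1
    Termination: False
    Truncation: False
    Return: -1
            
    Action: 0
    
    Observation: 0
    Reward: 0
    Termination: False
    Truncation: True
    Return: 1
            
    Action: None
    
    Observation: 0
    Reward: 0
    Termination: False
    Truncation: True
    Return: -1
            
    Action: None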
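
The rewards in the log can be checked by hand. The sketch below assumes the integer encoding 0 = rock, 1 = paper, 2 = scissors; this page does not state the encoding, but the log is consistent with it (the second round, 2 against 1, gives rewards 1 and -1). `rps_rewards` is a hypothetical helper, not part of PettingZoo.

```python
ROCK, PAPER, SCISSORS = 0, 1, 2  # assumed encoding of the integer actions


def rps_rewards(action_0: int, action_1: int) -> tuple:
    """Return (reward for the first player, reward for the second player)."""
    if action_0 == action_1:
        return 0, 0
    # In the cycle rock -> paper -> scissors -> rock, each move beats the
    # one immediately before it, i.e. the difference is 1 modulo 3.
    if (action_0 - action_1) % 3 == 1:
        return 1, -1
    return -1, 1


if __name__ == "__main__":
    # The three rounds played in the log above.
    for pair in [(1, 1), (2, 1), (0, 0)]:
        print(pair, rps_rewards(*pair))
```

Summing the per-round rewards for each player reproduces the final returns of 1 and -1 shown in the last two observations.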

`ActionMaskAgent`

Some PettingZoo environments provide an `action_mask` that tells the agent which actions are valid. The `ActionMaskAgent` subclasses `PettingZooAgent` and uses the information in the `action_mask` to select actions.

class ActionMaskAgent(PettingZooAgent):
    def __init__(self, name, model, env):
        super().__init__(name, model, env)
        self.obs_buffer = collections.deque(maxlen=1)

    def random_action(self):
        obs = self.obs_buffer[-1]
        action = self.env.action_space(self.name).sample(obs["action_mask"])
        return action

    def reset(self):
        self.message_history = [
            SystemMessage(content=self.docs),
            SystemMessage(content=self.instructions),
        ]

    def observe(self, obs, rew=0, term=False, trunc=False, info=None):
        self.obs_buffer.append(obs)
        return super().observe(obs, rew, term, trunc, info)

    def _act(self):
        valid_action_instruction = "Generate a valid action given by the indices of the `action_mask` that are not 0, according to the action formatting rules."
        self.message_history.append(HumanMessage(content=valid_action_instruction))
        return super()._act()
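
The masked random fallback boils down to picking uniformly among the indices whose mask entry is non-zero. Here is a minimal pure-Python sketch of that idea; `sample_masked` is a hypothetical helper used for illustration and is not the sampling code of the environment's action space.

```python
import random


def sample_masked(action_mask, rng=random):
    """Pick a random action index among those the mask marks as legal."""
    legal = [index for index, flag in enumerate(action_mask) if flag != 0]
    if not legal:
        raise ValueError("action_mask marks no action as legal")
    return rng.choice(legal)


if __name__ == "__main__":
    # With this mask only indices 0 and 1 can be drawn.
    print(sample_masked([1, 1, 0, 0, 0]))
```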

Tic-Tac-Toe

Here is an example of a Tic-Tac-Toe game that uses the `ActionMaskAgent`.

from pettingzoo.classic import tictactoe_v3

env = tictactoe_v3.env(render_mode="human")
agents = {
    name: ActionMaskAgent(name=name, model=ChatOpenAI(temperature=0.2), env=env)
    for name in env.possible_agents
}
main(agents, env)

    
    Observation: {'observation': array([[[0, 0],
            [0, 0],
            [0, 0]],
    
           [[0, 0],
            [0, 0],
            [0, 0]],
    
           [[0, 0],
            [0, 0],
            [0, 0]]], dtype=int8), 'action_mask': array([1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int8)}
    Reward: 0
    Termination: False
    Truncation: False
    Return: 0
            
    Action: 0
         |     |     
      X  |  -  |  -  
    _____|_____|_____
         |     |     
      -  |  -  |  -  
    _____|_____|_____
         |     |     
      -  |  -  |  -  
         |     |     
    
    Observation: {'observation': array([[[0, 1],
            [0, 0],
            [0, 0]],
    
           [[0, 0],
            [0, 0],
            [0, 0]],
    
           [[0, 0],
            [0, 0],
            [0, 0]]], dtype=int8), 'action_mask': array([0, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int8)}
    Reward: 0
    Termination: False
    Truncation: False
    Return: 0
            
    Action: 1
         |     |     
      X  |  -  |  -  
    _____|_____|_____
         |     |     
      O  |  -  |  -  
    _____|_____|_____
         |     |     
      -  |  -  |  -  
         |     |     
    
    Observation: {'observation': array([[[1, 0],
            [0, 1],
            [0, 0]],
    
           [[0, 0],
            [0, 0],
            [0, 0]],
    
           [[0, 0],
            [0, 0],
            [0, 0]]], dtype=int8), 'action_mask': array([0, 0, 1, 1, 1, 1, 1, 1, 1], dtype=int8)}
    Reward: 0
    Termination: False
    Truncation: False
    Return: 0
            
    Action: 2
         |     |     
      X  |  -  |  -  
    _____|_____|_____
         |     |     
      O  |  -  |  -  
    _____|_____|_____
         |     |     
      X  |  -  |  -  
         |     |     
    
    Observation: {'observation': array([[[0, 1],
            [1, 0],
            [0, 1]],
    
           [[0, 0],
            [0, 0],
            [0, 0]],
    
           [[0, 0],
            [0, 0],
            [0, 0]]], dtype=int8), 'action_mask': array([0, 0, 0, 1, 1, 1, 1, 1, 1], dtype=int8)}
    Reward: 0
    Termination: False
    Truncation: False
    Return: 0
            
    Action: 3
         |     |     
      X  |  O  |  -  
    _____|_____|_____
         |     |     
      O  |  -  |  -  
    _____|_____|_____
         |     |     
      X  |  -  |  -  
         |     |     
    
    Observation: {'observation': array([[[1, 0],
            [0, 1],
            [1, 0]],
    
           [[0, 1],
            [0, 0],
            [0, 0]],
    
           [[0, 0],
            [0, 0],
            [0, 0]]], dtype=int8), 'action_mask': array([0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=int8)}
    Reward: 0
    Termination: False
    Truncation: False
    Return: 0
            
    Action: 4
         |     |     
      X  |  O  |  -  
    _____|_____|_____
         |     |     
      O  |  X  |  -  
    _____|_____|_____
         |     |     
      X  |  -  |  -  
         |     |     
    
    Observation: {'observation': array([[[0, 1],
            [1, 0],
            [0, 1]],
    
           [[1, 0],
            [0, 1],
            [0, 0]],
    
           [[0, 0],
            [0, 0],
            [0, 0]]], dtype=int8), 'action_mask': array([0, 0, 0, 0, 0, 1, 1, 1, 1], dtype=int8)}
    Reward: 0
    Termination: False
    Truncation: False
    Return: 0
            
    Action: 5
         |     |     
      X  |  O  |  -  
    _____|_____|_____
         |     |     
      O  |  X  |  -  
    _____|_____|_____
         |     |     
      X  |  O  |  -  
         |     |     
    
    Observation: {'observation': array([[[1, 0],
            [0, 1],
            [1, 0]],
    
           [[0, 1],
            [1, 0],
            [0, 1]],
    
           [[0, 0],
            [0, 0],
            [0, 0]]], dtype=int8), 'action_mask': array([0, 0, 0, 0, 0, 0, 1, 1, 1], dtype=int8)}
    Reward: 0
    Termination: False
    Truncation: False
    Return: 0
            
    Action: 6
         |     |     
      X  |  O  |  X  
    _____|_____|_____
         |     |     
      O  |  X  |  -  
    _____|_____|_____
         |     |     
      X  |  O  |  -  
         |     |     
    
    Observation: {'observation': array([[[0, 1],
            [1, 0],
            [0, 1]],
    
           [[1, 0],
            [0, 1],
            [1, 0]],
    
           [[0, 1],
            [0, 0],
            [0, 0]]], dtype=int8), 'action_mask': array([0, 0, 0, 0, 0, 0, 0, 1, 1], dtype=int8)}
    Reward: -1
    Termination: True
    Truncation: False
    Return: -1
            
    Action: None
    
    Observation: {'observation': array([[[1, 0],
            [0, 1],
            [1, 0]],
    
           [[0, 1],
            [1, 0],
            [0, 1]],
    
           [[1, 0],
            [0, 0],
            [0, 0]]], dtype=int8), 'action_mask': array([0, 0, 0, 0, 0, 0, 0, 1, 1], dtype=int8)}
    Reward: 1
    Termination: True
    Truncation: False
    Return: 1
            
    Action: None
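
In the log above, the `action_mask` can be recomputed from the `observation` itself: a cell is free when neither of its two entries is set. The sketch below assumes that action `i` refers to `observation[i // 3][i % 3]`; this layout is inferred from the printed output, not taken from PettingZoo's documentation, and `legal_moves` is a hypothetical helper.

```python
def legal_moves(observation):
    """Return a 9-entry mask with 1 for every cell that holds no mark.

    observation is a 3x3x2 nested sequence; the two innermost entries flag
    the current player's mark and the opponent's mark for that cell.
    """
    return [
        0 if any(observation[i // 3][i % 3]) else 1
        for i in range(9)
    ]


if __name__ == "__main__":
    # Third observation in the log: one mark from each player.
    board = [
        [[1, 0], [0, 1], [0, 0]],
        [[0, 0], [0, 0], [0, 0]],
        [[0, 0], [0, 0], [0, 0]],
    ]
    print(legal_moves(board))
```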

Texas Hold'em No Limit

Here is an example of a Texas Hold'em No Limit game that uses the `ActionMaskAgent`.

from pettingzoo.classic import texas_holdem_no_limit_v6

env = texas_holdem_no_limit_v6.env(num_players=4, render_mode="human")
agents = {
    name: ActionMaskAgent(name=name, model=ChatOpenAI(temperature=0.2), env=env)
    for name in env.possible_agents
}
main(agents, env)

    
    Observation: {'observation': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0.,
           0., 0., 2.], dtype=float32), 'action_mask': array([1, 1, 0, 1, 1], dtype=int8)}
    Reward: 0
    Termination: False
    Truncation: False
    Return: 0
            
    Action: 1
    
    Observation: {'observation': array([0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
           0., 0., 2.], dtype=float32), 'action_mask': array([1, 1, 0, 1, 1], dtype=int8)}
    Reward: 0
    Termination: False
    Truncation: False
    Return: 0
            
    Action: 1
    
    Observation: {'observation': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 1., 2.], dtype=float32), 'action_mask': array([1, 1, 1, 1, 1], dtype=int8)}
    Reward: 0
    Termination: False
    Truncation: False
    Return: 0
            
    Action: 1
    
    Observation: {'observation': array([0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 2., 2.], dtype=float32), 'action_mask': array([1, 1, 1, 1, 1], dtype=int8)}
    Reward: 0
    Termination: False
    Truncation: False
    Return: 0
            
    Action: 0
    
    Observation: {'observation': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1.,
           0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 2., 2.], dtype=float32), 'action_mask': array([1, 1, 1, 1, 1], dtype=int8)}
    Reward: 0
    Termination: False
    Truncation: False
    Return: 0
            
    Action: 2
    
    Observation: {'observation': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 1., 0., 0., 1., 1., 0., 0., 1., 0., 0., 0., 0.,
           0., 2., 6.], dtype=float32), 'action_mask': array([1, 1, 1, 1, 1], dtype=int8)}
    Reward: 0
    Termination: False
    Truncation: False
    Return: 0
            
    Action: 2
    
    Observation: {'observation': array([0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0.,
           0., 2., 8.], dtype=float32), 'action_mask': array([1, 1, 1, 1, 1], dtype=int8)}
    Reward: 0
    Termination: False
    Truncation: False
    Return: 0
            
    Action: 3
    
    Observation: {'observation': array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
            0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,
            0.,  0.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,
            1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
            6., 20.], dtype=float32), 'action_mask': array([1, 1, 1, 1, 1], dtype=int8)}
    Reward: 0
    Termination: False
    Truncation: False
    Return: 0
            
    Action: 4
    
    Observation: {'observation': array([  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
             0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   1.,   0.,   0.,
             0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
             0.,   0.,   0.,   0.,   0.,   0.,   1.,   0.,   0.,   1.,   1.,
             0.,   0.,   1.,   0.,   0.,   0.,   0.,   0.,   8., 100.],
          dtype=float32), 'action_mask': array([1, 1, 0, 0, 0], dtype=int8)}
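    Reward: 0
    Termination: False
    Truncation: False
    Return: 0
            
    Action: 4
    [WARNING]: Illegal move made, game terminating with current player losing. 
    obs['action_mask'] contains a mask of all legal moves that can be chosen.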
    
    Observation: {'observation': array([  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
             0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   1.,   0.,   0.,
             0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
             0.,   0.,   0.,   0.,   0.,   0.,   1.,   0.,   0.,   1.,   1.,
             0.,   0.,   1.,   0.,   0.,   0.,   0.,   0.,   8., 100.],
          dtype=float32), 'action_mask': array([1, 1, 0, 0, 0], dtype=int8)}
    Reward: -1.0
    Termination: True
    Truncation: True
    Return: -1.0
            
    Action: None
    
    Observation: {'observation': array([  0.,   0.,   1.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
             0.,   0.,   0.,   0.,   0.,   0.,   0.,   1.,   1.,   0.,   0.,
             0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
             0.,   0.,   0.,   0.,   0.,   0.,   1.,   0.,   0.,   1.,   0.,
             0.,   0.,   0.,   0.,   1.,   0.,   0.,   0.,  20., 100.],
          dtype=float32), 'action_mask': array([1, 1, 0, 0, 0], dtype=int8)}
    Reward: 0
    Termination: True
    Truncation: True
    Return: 0
            
    Action: None
    
    Observation: {'observation': array([  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
             0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   1.,   0.,   0.,
             0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   1.,   0.,   0.,
             1.,   0.,   0.,   0.,   0.,   0.,   1.,   0.,   0.,   1.,   0.,
             0.,   0.,   0.,   0.,   0.,   0.,   0.,   0., 100., 100.],
          dtype=float32), 'action_mask': array([1, 1, 0, 0, 0], dtype=int8)}
    Reward: 0
    Termination: True
    Truncation: True
    Return: 0
            
    Action: None
    
    Observation: {'observation': array([  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   1.,   0.,   0.,
             0.,   0.,   0.,   0.,   0.,   0.,   0.,   1.,   1.,   0.,   0.,
             0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
             0.,   0.,   0.,   0.,   0.,   0.,   1.,   0.,   0.,   1.,   0.,
             0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   2., 100.],
          dtype=float32), 'action_mask': array([1, 1, 0, 0, 0], dtype=int8)}
    Reward: 0
    Termination: True
    Truncation: True
    Return: 0
            
    Action: None
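
In the log above, the model chose action 4 while the mask was `[1, 1, 0, 0, 0]`, which ended the game with an illegal-move warning. One possible guard, which is not part of the agent shown on this page, is to check the parsed action against the mask and raise `ValueError`, the exception type that the retry loop in `act` already handles. `ensure_legal` is a hypothetical helper sketched under that assumption.

```python
def ensure_legal(action, action_mask):
    """Return action unchanged, or raise ValueError if the mask forbids it."""
    in_range = 0 <= action < len(action_mask)
    if not in_range or action_mask[action] == 0:
        raise ValueError(
            f"action {action} is not allowed by mask {list(action_mask)}"
        )
    return action


if __name__ == "__main__":
    mask = [1, 1, 0, 0, 0]  # mask from the step that triggered the warning
    print(ensure_legal(1, mask))
    try:
        ensure_legal(4, mask)
    except ValueError as error:
        print(f"rejected: {error}")
```

If such a check were called at the end of `_act`, an illegal choice would count as a failed attempt, so the agent would retry and, once attempts run out, fall back to `random_action`, which samples only from the legal actions.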
