algos package

Submodules

algos.cfr module

An implementation of the counterfactual regret minimization (CFR) algorithm (Zinkevich et al., "Regret minimization in games with incomplete information", 2008).

CFR is a self-play algorithm that assumes that the joint strategy of all players in the game is public knowledge. The algorithm iteratively updates the regret of each player’s action at each information set. The average strategy of each player in CFR provably converges to a Nash equilibrium.

Counterfactual regrets are calculated by searching down the game tree, taking into account the reach probabilities.

Reach probabilities are factored in because the nodes (histories) in one player's infostate may belong to different infostates of their opponent. Not every history in the infostate is therefore equally likely to be reached, so the reach probabilities are used to weight the regret and strategy updates at that infostate appropriately.

The CFR formulation resembles the advantage function in Q-learning from reinforcement learning.
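
A minimal sketch of how these pieces fit together at a single infostate. The function and variable names below (accumulate_regret, cf_values, node_util) are illustrative, not this module's exact API; it assumes numpy regret vectors stored in a dict keyed by infostate:

    import numpy as np

    def accumulate_regret(cum_regrets, infostate, cf_values, node_util,
                          reach_probs, player_id):
        """Accumulate counterfactual regret at one infostate for one player.

        cf_values[a] plays the role of Q(s, a), node_util the role of V(s),
        and their difference the role of the advantage.
        """
        # Weight by the probability that everyone *except* the acting player
        # reaches this node; the acting player's own reach probability is excluded.
        opponent_reach = np.prod(np.delete(reach_probs, player_id))
        regret = cf_values - node_util  # shape (n_actions,)
        default = np.zeros_like(regret)
        cum_regrets[infostate] = cum_regrets.get(infostate, default) + opponent_reach * regret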

class algos.cfr.CounterFactualRegretMinimizerPlayer(name, player_id)[source]

Bases: imperfecto.algos.player.Player

A player whose strategy is updated by counterfactual regret minimization.

Parameters
  • name (str) – name of the player.

  • player_id (int) – id of the player.

player_id

the player id of this player in the game.

Type

int

cum_regrets

map from infostate to a regret vector of size n_actions.

Type

dict

strategy_sum

map from infostate to the cumulative strategy at that infostate so far. Useful for computing the average strategy over many iterations.

Type

dict

strategy

map from infostate to np.array of shape (n_actions,).

Type

dict

update_strategy(history, player_id)[source]

Update the strategy of the player at the end of the game.

Parameters
  • history (Sequence[IntEnum]) – The history of the game, which must be a terminal node.

  • player_id (int) – The id of the player to update (i.e., my id).

get_avg_strategies()[source]

Returns the average strategy of the player at each infostate.

Return type

dict

Returns

map from infostate to the average strategy at that infostate.
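
Because the average strategy (not the current one) is what converges to a Nash equilibrium, it is typically recovered by normalizing strategy_sum. A sketch of that computation, assuming strategy_sum maps each infostate to an accumulated numpy strategy vector (not necessarily this class's exact implementation):

    import numpy as np

    def average_strategy(strategy_sum: dict) -> dict:
        """Normalize each accumulated strategy vector into a probability distribution."""
        avg = {}
        for infostate, sigma_sum in strategy_sum.items():
            total = sigma_sum.sum()
            if total > 0:
                avg[infostate] = sigma_sum / total
            else:
                # No accumulated weight yet: fall back to a uniform distribution.
                avg[infostate] = np.ones_like(sigma_sum) / len(sigma_sum)
        return avg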

class algos.cfr.counterfactualRegretMinimizerTrainer(Game, players, n_iters=100)[source]

Bases: object

CFR with chance-sampling.

CFR (Zinkevich et al. “Regret minimization in games with incomplete information” 2008) is a self-play algorithm that assumes that the joint strategy of all players in the game is public knowledge. The algorithm iteratively updates the regret of each player’s action at each information set. The average strategy of each player in CFR provably converges to a Nash equilibrium.

Parameters
  • Game – the game to train on.

  • players – the players in the game.

  • n_iters (int) – the number of iterations to run CFR for (default: 100).

game

the game to train on.

Type

ExtensiveFormGame

n_iters

the number of iterations to run CFR for.

Type

int

cfr(history, reach_probs)[source]

Counterfactual regret minimization update step at the current node.

CFR is quite similar to the advantage function in classic Q-learning. For example:
  • node_utils ~ V(s)

  • counterfactual_values ~ Q(s, a)

  • regret ~ advantage function: Q(s, a) - V(s)

Parameters
  • history (List[IntEnum]) – list of actions taken so far (current node)

  • reach_probs (ndarray) – array of shape (game.n_players,): each player's probability of reaching the current node (infostate) under the joint strategy.

Return type

ndarray

Returns

the utility (expected payoff) of the current node for each player.
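
To make the analogy above concrete, here is a hedged sketch of what happens at a single decision node once the per-action counterfactual values have been computed by recursing on the children (counterfactual_values and strategy are illustrative names, not the method's exact locals):

    import numpy as np

    def regrets_at_node(counterfactual_values: np.ndarray, strategy: np.ndarray):
        """Compute node utility (~ V(s)) and instantaneous regrets (~ advantage)."""
        node_util = strategy @ counterfactual_values  # V(s) = sum_a pi(a|s) * Q(s, a)
        regret = counterfactual_values - node_util    # A(s, a) = Q(s, a) - V(s)
        return node_util, regret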

train()[source]

Train the game using CFR for n_iters iterations.

Return type

None
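
A usage sketch for the trainer, assuming a two-player game and the top-level imperfecto package path used in the base-class references above. The game class is left as a parameter because this module does not fix a particular ExtensiveFormGame subclass; substitute whichever game class the package provides:

    from imperfecto.algos.cfr import (
        CounterFactualRegretMinimizerPlayer,
        counterfactualRegretMinimizerTrainer,
    )

    def train_cfr(GameClass, n_iters=10000):
        """Train two CFR players on `GameClass` and return their average strategies.

        `GameClass` is a placeholder for an ExtensiveFormGame subclass from this package.
        """
        players = [
            CounterFactualRegretMinimizerPlayer(name="P0", player_id=0),
            CounterFactualRegretMinimizerPlayer(name="P1", player_id=1),
        ]
        trainer = counterfactualRegretMinimizerTrainer(GameClass, players, n_iters=n_iters)
        trainer.train()
        return [player.get_avg_strategies() for player in players]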

algos.player module

A collection of player classes.

Includes:
  • Player

  • NormalFormPlayer

  • FixedPolicyPlayer

class algos.player.Player(name='')[source]

Bases: abc.ABC

An abstract player class.

name

The name of the player.

Type

str

strategy

map from infostate to np.array of shape (n_actions,).

Type

dict

act(infostate)[source]

Returns the action to take given an infostate.

Parameters

infostate (str) – The infostate to take the action in.

Return type

int

Returns

The action to take.

abstract update_strategy(history, player_id)[source]

Update the strategy of the player at the end of the game.

Parameters
  • history (Sequence[Enum]) – The history of the game, which must be a terminal node.

  • player_id (int) – The id of the player to update (i.e., my id).

Return type

None

property game
property strategy

A dict with key = infostate and value = np.array of shape (n_actions,).
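
A concrete subclass must at least implement update_strategy. A minimal, hypothetical subclass that never adapts its strategy might look like the sketch below; exactly which other members need overriding is not shown in this listing, so treat it as illustrative only:

    from enum import Enum
    from typing import Sequence

    from imperfecto.algos.player import Player

    class NoLearningPlayer(Player):
        """Hypothetical Player subclass that never changes its strategy."""

        def update_strategy(self, history: Sequence[Enum], player_id: int) -> None:
            # Called at the end of each game with a terminal history; a learning
            # player would update its strategy dict here. This one does nothing.
            pass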

class algos.player.NormalFormPlayer(name='')[source]

Bases: algos.player.Player, abc.ABC

property strategy

A np.array of shape (n_actions,).

act(infostate)[source]

Returns the action to take given an infostate.

Parameters

infostate (str) – The infostate to take the action in.

Return type

int

Returns

The action to take.

class algos.player.FixedPolicyPlayer(name, strategy)[source]

Bases: algos.player.Player

A player with a given fixed strategy.

Parameters
  • name (str) – The name of the player.

  • strategy (dict) – The fixed strategy of the player.

name

The name of the player.

Type

str

strategy

The fixed strategy of the player.

Type

dict

update_strategy(history, player_id)[source]

Update the strategy of the player at the end of the game.

Parameters
  • history (Sequence[Enum]) – The history of the game, which must be a terminal node.

  • player_id (int) – The id of the player to update (i.e., my id).

Return type

None
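
A usage sketch for FixedPolicyPlayer. The infostate keys and the number of actions depend on the game, so the string key and the two-action array below are illustrative only:

    import numpy as np

    from imperfecto.algos.player import FixedPolicyPlayer

    # A player that always plays action 0 at the (single, illustrative) empty infostate.
    always_first_action = FixedPolicyPlayer(
        name="always_first",
        strategy={"": np.array([1.0, 0.0])},  # key: infostate string, value: (n_actions,) distribution
    )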

algos.regret_matching module

A regret-matching player class for N-player normal-form games. See the paper "A Simple Adaptive Procedure Leading to Correlated Equilibrium" (2000) by Sergiu Hart and Andreu Mas-Colell.

The algorithm works by maintaining a cumulative regret vector for each action. The current regret is the difference between the counterfactual reward and the reward of the actual action taken.

The strategy is an action distribution proportional to the positive entries of the cumulative regret vector.
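
A minimal sketch of the regret-matching rule described above, assuming a numpy cumulative-regret vector (not necessarily this module's exact code):

    import numpy as np

    def regret_matching(cum_regrets: np.ndarray) -> np.ndarray:
        """Action distribution proportional to positive cumulative regrets;
        uniform when no action has positive regret."""
        positive = np.maximum(cum_regrets, 0.0)
        total = positive.sum()
        if total > 0:
            return positive / total
        return np.ones_like(cum_regrets) / len(cum_regrets)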

class algos.regret_matching.RegretMatchingPlayer(name, n_actions)[source]

Bases: imperfecto.algos.player.NormalFormPlayer

A Regret Matching player (Hart and Mas-Colell, 2000) works by maintaining a cumulative regret vector over actions.

The current regret is the difference between the counterfactual reward and the reward of the actual action taken.

The policy is an action distribution proportional to the positive entries of the cumulative regret vector.

Parameters
  • name (str) – name of the player.

  • n_actions (int) – number of actions.

name

name of the player.

Type

str

n_actions

number of actions.

Type

int

cum_regrets

cumulative regret vector of shape (n_actions,).

Type

np.array

static regret_matching_strategy(regrets)[source]

Return the regret matching policy.

Parameters

regrets (ndarray) – cumulative regrets vector.

Return type

ndarray

Returns

action distribution according to regret-matching policy.
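
A small worked example of the documented behavior, assuming the top-level imperfecto package path (the tie-breaking when all regrets are non-positive is an implementation detail not shown here):

    import numpy as np

    from imperfecto.algos.regret_matching import RegretMatchingPlayer

    # Cumulative regrets: +2 for action 0, -1 for action 1, +1 for action 2.
    # Only the positive entries matter, so the policy should be proportional to [2, 0, 1].
    strategy = RegretMatchingPlayer.regret_matching_strategy(np.array([2.0, -1.0, 1.0]))
    print(strategy)  # expected: [0.666..., 0.0, 0.333...]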

update_strategy(history, player_id)[source]

Update the cumulative regret vector. This also updates the player's strategy, since the strategy is computed from the cumulative regret vector.

Parameters
  • history (List[IntEnum]) – history of the game.

  • player_id (int) – player id.

Return type

None

Module contents