algos package¶
Submodules¶
algos.cfr module¶
An implementation of the Counterfactual Regret Minimization (CFR) algorithm (Zinkevich et al., “Regret minimization in games with incomplete information”, 2008).
CFR is a self-play algorithm that assumes that the joint strategy of all players in the game is public knowledge. The algorithm iteratively updates the regret of each player’s action at each information set. The average strategy of each player in CFR provably converges to a Nash equilibrium.
Counterfactual regrets are calculated by searching down the game tree, taking into account the reach probabilities.
Reach probabilities are factored in because the nodes (histories) in one player’s infostate may belong to different infostates of their opponent. Therefore, not every history in an infostate is equally likely to be reached, and the reach probabilities are used to weight the action distribution at that infostate appropriately.
The CFR formulation looks similar to the advantage function in Q-learning from RL.
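A minimal sketch of the bookkeeping this suggests at a single infostate. This is not the module’s actual implementation; the names regret_sum, strategy_sum, and the reach-probability arguments are illustrative.

```python
import numpy as np

def regret_matching_strategy(regrets: np.ndarray) -> np.ndarray:
    """Turn cumulative regrets into a strategy: positive entries, normalized."""
    positive = np.maximum(regrets, 0)
    total = positive.sum()
    if total > 0:
        return positive / total
    return np.ones_like(regrets) / len(regrets)  # no positive regret: play uniformly

def cfr_bookkeeping(regret_sum, strategy_sum, infostate, counterfactual_values,
                    my_reach, opponent_reach):
    """One CFR bookkeeping step at an infostate (illustrative names only).

    counterfactual_values[a] is the expected utility of taking action a here.
    Regrets are weighted by the opponents' reach probability; the running
    strategy sum is weighted by the player's own reach probability.
    """
    strategy = regret_matching_strategy(regret_sum[infostate])
    node_util = strategy @ counterfactual_values                                    # ~ V(s)
    regret_sum[infostate] += opponent_reach * (counterfactual_values - node_util)   # ~ advantage
    strategy_sum[infostate] += my_reach * strategy
    return node_util
```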
- class algos.cfr.CounterFactualRegretMinimizerPlayer(name, player_id)[source]¶
Bases:
imperfecto.algos.player.Player
A player that updates its strategy via counterfactual regret minimization.
- strategy_sum¶
Maps each infostate to the cumulative strategy played at that infostate so far. Used to compute the average strategy over many iterations.
- Type
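A minimal sketch, assuming strategy_sum is a dict of np.arrays as described above, of how the average strategy could be recovered from it (the helper name is illustrative, not a method of this class):

```python
import numpy as np

def average_strategy(strategy_sum: dict) -> dict:
    """Normalize each infostate's cumulative strategy into a probability vector."""
    avg = {}
    for infostate, cum_strategy in strategy_sum.items():
        total = cum_strategy.sum()
        if total > 0:
            avg[infostate] = cum_strategy / total
        else:
            avg[infostate] = np.ones_like(cum_strategy) / len(cum_strategy)
    return avg
```

It is this averaged strategy that provably converges to a Nash equilibrium.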
- class algos.cfr.counterfactualRegretMinimizerTrainer(Game, players, n_iters=100)[source]¶
Bases:
object
CFR with chance-sampling.
CFR (Zinkevich et al. “Regret minimization in games with incomplete information” 2008) is a self-play algorithm that assumes that the joint strategy of all players in the game is public knowledge. The algorithm iteratively updates the regret of each player’s action at each information set. The average strategy of each player in CFR provably converges to a Nash equilibrium.
- Parameters
  - Game (Type[ExtensiveFormGame]) – the game class to be trained.
  - players (Sequence[CounterFactualRegretMinimizerPlayer]) – the players in the game, each of type CounterFactualRegretMinimizerPlayer.
  - n_iters (int) – number of iterations to run the algorithm.
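A hypothetical usage sketch, assuming the constructor signature above; the import path and SomeExtensiveFormGame are assumptions, not names confirmed by this documentation:

```python
from imperfecto.algos.cfr import (
    CounterFactualRegretMinimizerPlayer,
    counterfactualRegretMinimizerTrainer,
)

players = [
    CounterFactualRegretMinimizerPlayer(name="P0", player_id=0),
    CounterFactualRegretMinimizerPlayer(name="P1", player_id=1),
]
# SomeExtensiveFormGame is a placeholder for a concrete ExtensiveFormGame subclass
# provided elsewhere; it is not defined in this snippet.
trainer = counterfactualRegretMinimizerTrainer(SomeExtensiveFormGame, players, n_iters=10000)
```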
- game¶
the game to train on.
- Type
- cfr(history, reach_probs)[source]¶
Counterfactual regret minimization update step at the current node.
- CFR is quite similar to the advantage function in classic Q-learning. For example:
  - node_utils ~ V(s)
  - counterfactual_values ~ Q(s, a)
  - regret ~ advantage function: Q(s, a) - V(s)
- Parameters
- Return type
- Returns
the utility (expected payoff) of the current node for each player.
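A small numerical illustration of the V(s)/Q(s, a) analogy above; the values are made up:

```python
import numpy as np

strategy = np.array([0.2, 0.5, 0.3])                 # current policy at some infostate
counterfactual_values = np.array([1.0, -0.5, 2.0])   # ~ Q(s, a), one value per action
node_util = strategy @ counterfactual_values         # ~ V(s); here 0.55
regret = counterfactual_values - node_util           # ~ advantage Q(s, a) - V(s)
```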
algos.player module¶
A collection of player classes.
- Includes:
  - Player
  - NormalFormPlayer
  - FixedPolicyPlayer
- class algos.player.Player(name='')[source]¶
Bases:
abc.ABC
An abstract player class.
- abstract update_strategy(history, player_id)[source]¶
Update the strategy of the player at the end of the game.
- property game¶
- property strategy¶
A dict with key = infostate and value = np.array of shape (n_actions,).
- class algos.player.NormalFormPlayer(name='')[source]¶
Bases:
algos.player.Player, abc.ABC
- property strategy¶
A np.array of shape (n_actions,).
- class algos.player.FixedPolicyPlayer(name, strategy)[source]¶
Bases:
algos.player.Player
A player with a given fixed strategy.
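A hypothetical instantiation sketch, using the strategy layout described for Player.strategy above; the import path is inferred from the base-class path shown earlier, and the infostate key, action count, and probabilities are made up:

```python
import numpy as np
from imperfecto.algos.player import FixedPolicyPlayer

# strategy maps each infostate to a probability vector over that infostate's actions
fixed_strategy = {
    "some_infostate": np.array([0.7, 0.3]),
}
player = FixedPolicyPlayer("alice", fixed_strategy)
```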
algos.regret_matching module¶
A regret matching player class for N-player normal-form games. See the paper “A Simple Adaptive Procedure Leading to Correlated Equilibrium” (2000) by Sergiu Hart and Andreu Mas-Colell.
The algorithm maintains a cumulative regret vector with one entry per action. The current regret of an action is the difference between its counterfactual reward and the reward of the action actually taken.
The strategy is the action distribution proportional to the positive entries of the cumulative regret vector.
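A minimal sketch of one regret-matching update under these definitions; the function and argument names are illustrative, not the class’s API:

```python
import numpy as np

def regret_matching_step(cum_regrets: np.ndarray, rewards: np.ndarray, action: int) -> np.ndarray:
    """Update cumulative regrets after observing per-action rewards, then derive the next strategy.

    rewards[a] is the counterfactual reward of action a; `action` is the action actually taken.
    """
    cum_regrets += rewards - rewards[action]              # accumulate current regrets
    positive = np.maximum(cum_regrets, 0)                 # keep only positive entries
    total = positive.sum()
    if total > 0:
        return positive / total                           # proportional to positive regrets
    return np.ones_like(cum_regrets) / len(cum_regrets)   # otherwise play uniformly
```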
- class algos.regret_matching.RegretMatchingPlayer(name, n_actions)[source]¶
Bases:
imperfecto.algos.player.NormalFormPlayer
A Regret Matching player (Hart and Mas-Colell, 2000) maintains a cumulative regret vector over its actions.
The current regret of an action is the difference between its counterfactual reward and the reward of the action actually taken.
The policy is the action distribution proportional to the positive entries of the cumulative regret vector.
- cum_regrets¶
cumulative regret vector of shape (n_actions,).
- Type
np.array