Designing algorithms for Multi-Agent Reinforcement Learning (MARL) in imperfect-information games — scenarios where players act sequentially and cannot see each other’s private information, like poker — has historically relied on manual iteration. Researchers identify weighting schemes, discounting rules, and equilibrium solvers through intuition and trial-and-error. Google DeepMind researchers propose AlphaEvolve, an LLM-powered evolutionary coding agent that replaces that manual process with automated search.
The research team applies this framework to two established paradigms: Counterfactual Regret Minimization (CFR) and Policy Space Response Oracles (PSRO). In both cases, the system discovers new algorithm variants that perform competitively with, or better than, existing hand-designed state-of-the-art baselines. All experiments were run using the OpenSpiel framework.
Background: CFR and PSRO
CFR is an iterative algorithm that decomposes regret minimization across information sets. At each iteration it accumulates ‘counterfactual regret’ — how much a player would have gained by playing differently — and derives a new policy proportional to positive accumulated regret. Over many iterations, the time-averaged strategy converges to a Nash Equilibrium (NE). Variants like DCFR (Discounted CFR) and PCFR+ (Predictive CFR+) improve convergence by applying specific discounting or predictive update rules, all developed through manual design.
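The core update CFR applies at each information set — playing each action in proportion to its positive accumulated regret — can be sketched in a few lines. This is a minimal illustration of the regret-matching step, not code from the paper:

```python
import numpy as np

def regret_matching(cumulative_regret):
    """Derive the current policy from accumulated counterfactual regret:
    play each action with probability proportional to its positive regret,
    falling back to uniform when no action has positive regret."""
    positive = np.maximum(cumulative_regret, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    return np.full_like(cumulative_regret, 1.0 / len(cumulative_regret))

# One CFR-style policy derivation at a single information set (toy numbers):
regret = np.array([2.0, -1.0, 0.5])  # accumulated counterfactual regret per action
policy = regret_matching(regret)     # proportional to the positive part [2.0, 0, 0.5]
```

Variants like DCFR and PCFR+ differ mainly in how `cumulative_regret` is discounted or predicted between iterations, which is exactly the part of the update rule AlphaEvolve is allowed to rewrite.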
PSRO operates at a higher level of abstraction. It maintains a population of policies for each player, builds a payoff tensor (the meta-game) by computing expected utilities for every combination of population policies, and then uses a meta-strategy solver to produce a probability distribution over the population. Best responses are trained against that distribution and added to the population iteratively. The meta-strategy solver — how the population distribution is computed — is the central design choice that the paper targets for automated discovery. All experiments use an exact best response oracle (computed via value iteration) and exact payoff values for all meta-game entries, removing Monte Carlo sampling noise from the results.
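One PSRO iteration — build the meta-game, solve it, train a best response, grow the population — can be sketched on a toy matrix game. Rock-paper-scissors and the helper names below are illustrative stand-ins, not the paper's setup; the uniform solver shown is the seed AlphaEvolve starts from:

```python
import numpy as np

# Row player's payoff matrix for rock-paper-scissors; pure strategies
# stand in for trained policies in this toy sketch.
A = np.array([[0, -1, 1],
              [1, 0, -1],
              [-1, 1, 0]], dtype=float)

def uniform_solver(meta_game):
    """Simplest meta-strategy solver: uniform over the population."""
    n = meta_game.shape[0]
    return np.full(n, 1.0 / n)

def exact_best_response(opponent_mix, pop_cols):
    """Exact best response for the row player against the opponent's
    meta-strategy mixture over its population."""
    mixed = A[:, pop_cols] @ opponent_mix  # expected payoff per row action
    return int(np.argmax(mixed))

# One PSRO iteration: build meta-game, solve it, expand the population.
pop_row, pop_col = [0], [0]                   # both populations start with "rock"
meta = A[np.ix_(pop_row, pop_col)]            # payoff tensor over the populations
sigma_col = uniform_solver(meta)              # column player's meta-strategy
br = exact_best_response(sigma_col, pop_col)  # row best response -> "paper"
pop_row.append(br)
</<!---->```

The meta-strategy solver (`uniform_solver` here) is the single component the paper evolves; everything else in the loop stays fixed.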
The AlphaEvolve Framework
AlphaEvolve is a distributed evolutionary system that uses LLMs to mutate source code rather than numeric parameters. The process: a population is initialized with a standard implementation (CFR+ as the seed for CFR experiments; Uniform as the seed for both PSRO solver classes). At each generation, a parent algorithm is selected based on fitness; its source code is passed to an LLM (Gemini 2.5 Pro) with a prompt to modify it; the resulting candidate is evaluated on proxy games; valid candidates are added to the population. AlphaEvolve supports multi-objective optimization — if multiple fitness metrics are defined, one is randomly selected per generation to guide parent sampling.
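The generation loop described above can be sketched abstractly. The structure below is illustrative: `mutate` stands in for the LLM call that rewrites a parent's source code, and `evaluate` for the proxy-game fitness (negative exploitability); the stub search at the bottom just exercises the loop:

```python
import random

def evolve(seed_source, mutate, evaluate, generations=10, rng=None):
    """Schematic AlphaEvolve-style loop (names and structure are illustrative)."""
    rng = rng or random.Random(0)
    population = [(seed_source, evaluate(seed_source))]
    for _ in range(generations):
        # Fitness-biased parent selection: sample among the top candidates.
        population.sort(key=lambda p: p[1], reverse=True)
        parent, _ = rng.choice(population[: max(1, len(population) // 2)])
        child = mutate(parent, rng)   # LLM proposes a code modification
        fitness = evaluate(child)     # score on proxy training games
        if fitness is not None:       # discard invalid candidates
            population.append((child, fitness))
    return max(population, key=lambda p: p[1])

# Stub "search" over an integer parameter, standing in for source code:
best, score = evolve(
    seed_source=0,
    mutate=lambda parent, rng: parent + rng.choice([-1, 1]),
    evaluate=lambda cand: -abs(cand - 7),  # optimum at 7
    generations=200,
)
```

In the actual system the candidates are Python source files, the mutation operator is Gemini 2.5 Pro, and evaluation runs the candidate algorithm on the four proxy games.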
The fitness signal is negative exploitability after K iterations, evaluated on a fixed set of training games: 3-player Kuhn Poker, 2-player Leduc Poker, 4-card Goofspiel, and 5-sided Liars Dice. Final evaluation is done on a separate test set of larger, unseen games.
For CFR, the evolvable search space consists of three Python classes: RegretAccumulator, PolicyFromRegretAccumulator, and PolicyAccumulator. These govern regret accumulation, current policy derivation, and average policy accumulation respectively. The interface is expressive enough to represent all known CFR variants as special cases. For PSRO, the evolvable components are TrainMetaStrategySolver and EvalMetaStrategySolver, the meta-strategy solvers used during oracle training and during exploitability evaluation.
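To make the CFR search space concrete, here is a sketch of the three-class interface instantiated as CFR+ (the seed the search starts from). The class names follow the paper; the method signatures are assumptions for illustration:

```python
import numpy as np

class RegretAccumulator:
    """CFR+ regret accumulation: add instantaneous regret, floor at zero."""
    def update(self, cumulative, instantaneous, iteration):
        return np.maximum(cumulative + instantaneous, 0.0)

class PolicyFromRegretAccumulator:
    """Current policy via regret matching over accumulated regret."""
    def policy(self, cumulative):
        positive = np.maximum(cumulative, 0.0)
        total = positive.sum()
        if total > 0:
            return positive / total
        return np.full_like(cumulative, 1.0 / len(cumulative))

class PolicyAccumulator:
    """CFR+-style average policy: weight iteration t's policy by t."""
    def update(self, avg_policy_sum, current_policy, iteration):
        return avg_policy_sum + iteration * current_policy
```

Swapping in different bodies for these three methods recovers DCFR, PCFR+, and the other known variants — which is what makes the interface a suitable substrate for the evolutionary search.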
Discovered Algorithm 1: VAD-CFR
The evolved CFR variant is Volatility-Adaptive Discounted CFR (VAD-CFR). Rather than the linear averaging and static discounting used in the CFR family, the search produced three distinct mechanisms:
- Volatility-adaptive discounting. Instead of fixed discount factors α and β applied to cumulative regrets (as in DCFR), VAD-CFR tracks the volatility of the learning process using an Exponential Weighted Moving Average (EWMA) of the instantaneous regret magnitude. When volatility is high, discounting increases so the algorithm forgets unstable history faster; when volatility drops it retains more history. The EWMA decay factor is 0.1, with base α = 1.5 and base β = −0.1.
- Asymmetric instantaneous boosting. Positive instantaneous regrets are multiplied by a factor of 1.1 before being added to cumulative regrets. This asymmetry is applied to the instantaneous update, not the accumulated history, making the algorithm more reactive to currently good actions.
- Hard warm-start with regret-magnitude weighting. Policy averaging is postponed entirely until iteration 500. The regret accumulation process continues normally during this phase. Once accumulation begins, policies are weighted by a combination of temporal weight and instantaneous regret magnitude — prioritizing high-information iterations when constructing the average strategy. The 500-iteration threshold was generated by the LLM without knowledge of the 1000-iteration evaluation horizon.
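The volatility-adaptive mechanism can be sketched as follows. The EWMA decay of 0.1 and the base exponents α = 1.5, β = −0.1 come from the article; the exact functional form by which volatility modulates the DCFR-style discount weights is an assumption here:

```python
import numpy as np

class VolatilityTracker:
    """Sketch of VAD-CFR's volatility signal: an EWMA of instantaneous
    regret magnitude. The mapping from volatility to discount weights
    below is illustrative, not the paper's exact rule."""
    def __init__(self, decay=0.1, base_alpha=1.5, base_beta=-0.1):
        self.decay = decay
        self.base_alpha = base_alpha
        self.base_beta = base_beta
        self.ewma = 0.0

    def update(self, instantaneous_regret):
        magnitude = float(np.abs(instantaneous_regret).mean())
        self.ewma = (1 - self.decay) * self.ewma + self.decay * magnitude
        return self.ewma

    def discounts(self, iteration):
        # Higher volatility -> stronger discounting (forget unstable history).
        t = float(iteration)
        pos = (t ** self.base_alpha) / (t ** self.base_alpha + 1) / (1.0 + self.ewma)
        neg = (t ** self.base_beta) / (t ** self.base_beta + 1) / (1.0 + self.ewma)
        return pos, neg
```

The contrast with DCFR is that `pos` and `neg` there depend only on the iteration count; here they also shrink as the EWMA of regret magnitude grows.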
VAD-CFR is benchmarked against standard CFR, CFR+, Linear CFR (LCFR), DCFR, PCFR+, DPCFR+, and HS-PCFR+(30) over K = 1000 iterations, with exploitability computed exactly. On the full 11-game evaluation, VAD-CFR matches or surpasses state-of-the-art performance in 10 of the 11 games, with 4-player Kuhn Poker as the sole exception.
Also Discovered: AOD-CFR
An earlier trial on a different training set (2-player Kuhn Poker, 2-player Leduc Poker, 4-card Goofspiel, 4-sided Liars Dice) produced a second variant, Asymmetric Optimistic Discounted CFR (AOD-CFR). It uses a linear schedule for discounting cumulative regrets (α transitions from 1.0 → 2.5 over 500 iterations, β from 0.5 → 0.0), sign-dependent scaling of instantaneous regret, trend-based policy optimism via an Exponential Moving Average of cumulative regrets, and polynomial policy averaging with an exponent γ scaling from 1.0 → 5.0. The research team reports that it achieves competitive performance using more conventional mechanisms than VAD-CFR.
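AOD-CFR's schedules are plain linear ramps, which makes them easy to write down. The endpoints come from the article; the clamping behavior past the horizon, and the 500-iteration horizon for γ, are assumptions:

```python
def linear_schedule(start, end, horizon, t):
    """Linear ramp from `start` to `end` over `horizon` iterations,
    clamped after the horizon."""
    frac = min(max(t / horizon, 0.0), 1.0)
    return start + (end - start) * frac

# AOD-CFR's reported schedules (horizon of 500 iterations):
alpha = lambda t: linear_schedule(1.0, 2.5, 500, t)  # positive-regret discount exponent
beta = lambda t: linear_schedule(0.5, 0.0, 500, t)   # negative-regret discount exponent
gamma = lambda t: linear_schedule(1.0, 5.0, 500, t)  # polynomial averaging exponent
```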
Discovered Algorithm 2: SHOR-PSRO
The evolved PSRO variant is Smoothed Hybrid Optimistic Regret PSRO (SHOR-PSRO). The search produced a hybrid meta-solver that constructs a meta-strategy by linearly blending two components at every internal solver iteration:
- σ_ORM (Optimistic Regret Matching): Provides regret-minimization stability. Gains are computed, optionally normalized and diversity-adjusted, then used to update cumulative regrets via regret matching. A momentum term is applied to payoff gains.
- σ_Softmax (Smoothed Best Pure Strategy): A Boltzmann distribution over pure strategies biased toward high-payoff modes. A temperature parameter controls concentration — lower temperature means the distribution is more concentrated on the best pure strategy.
σ_hybrid = (1 − λ) · σ_ORM + λ · σ_Softmax
The training-time solver uses a dynamic annealing schedule over the outer PSRO iterations. The blending factor λ anneals from 0.3 → 0.05 (shifting from greedy exploitation toward equilibrium finding), the diversity bonus decays from 0.05 → 0.001 (enabling early population exploration then late-stage refinement), and the softmax temperature drops from 0.5 → 0.01. The number of internal solver iterations also scales with population size. The training solver returns the time-averaged strategy across internal iterations for stability.
The evaluation-time solver uses fixed parameters: λ = 0.01, diversity bonus = 0.0, temperature = 0.001. It runs more internal iterations (base 8000, scaling with population size) and returns the last-iterate strategy rather than the average, for a reactive, low-noise exploitability estimate. This training/evaluation asymmetry was itself a product of the search, not a human design choice.
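The blend and the annealing schedule can be sketched together. The blending formula and the training-time endpoints (λ: 0.3 → 0.05, temperature: 0.5 → 0.01) are from the article; the linear form of the anneal, and the toy σ_ORM and payoff values below, are assumptions for illustration:

```python
import numpy as np

def softmax_best_pure(payoffs, temperature):
    """σ_Softmax: Boltzmann distribution over pure strategies; lower
    temperature concentrates mass on the best pure strategy."""
    z = payoffs / temperature
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def shor_blend(sigma_orm, payoffs, lam, temperature):
    """σ_hybrid = (1 − λ)·σ_ORM + λ·σ_Softmax, as in the article."""
    return (1 - lam) * sigma_orm + lam * softmax_best_pure(payoffs, temperature)

def anneal(start, end, step, total_steps):
    """Schedule for λ, diversity bonus, and temperature over the outer
    PSRO iterations (linear form assumed; the article gives endpoints)."""
    frac = min(step / max(total_steps - 1, 1), 1.0)
    return start + (end - start) * frac

# Early in training (step 0 of 100): λ starts at 0.3, favoring exploitation.
lam = anneal(0.3, 0.05, step=0, total_steps=100)
payoffs = np.array([1.0, 0.5, 0.2])       # toy expected payoffs per population member
sigma_orm = np.array([0.4, 0.4, 0.2])     # toy regret-matching component
sigma = shor_blend(sigma_orm, payoffs, lam, temperature=0.5)
```

The evaluation-time solver corresponds to calling `shor_blend` with the fixed values λ = 0.01 and temperature = 0.001, so the softmax term collapses to a near-pure best strategy and the blend is dominated by the regret-matching component.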
SHOR-PSRO is benchmarked against Uniform, Nash (via linear program for 2-player games), AlphaRank, Projected Replicator Dynamics (PRD), and Regret Matching (RM), using K = 100 PSRO iterations. On the full 11-game evaluation, SHOR-PSRO matches or surpasses state-of-the-art performance in 8 of the 11 games.
Experimental Setup
The evaluation protocol separates training and test games to assess generalization. The training set for both CFR and PSRO experiments consists of 3-player Kuhn Poker, 2-player Leduc Poker, 4-card Goofspiel, and 5-sided Liars Dice. The test set used in the main body of the paper consists of 4-player Kuhn Poker, 3-player Leduc Poker, 5-card Goofspiel, and 6-sided Liars Dice — larger and more complex variants not seen during evolution. A full sweep across 11 games is included in the appendix. Algorithms are fixed after training-phase discovery before test evaluation begins.
Key Takeaways
- AlphaEvolve automates algorithm design — instead of tuning hyperparameters, it evolves the actual Python source code of MARL algorithms using Gemini 2.5 Pro as the mutation operator, discovering entirely new update rules rather than variations of existing ones.
- VAD-CFR replaces static discounting with volatility-awareness — it tracks instantaneous regret magnitude via EWMA and adjusts its discount factors dynamically, plus delays policy averaging entirely until iteration 500, a threshold the LLM found without being told the evaluation horizon was 1000 iterations.
- SHOR-PSRO automates the exploration-to-exploitation transition — by annealing a blending factor between Optimistic Regret Matching and a Softmax best-pure-strategy component over training, it removes the need to manually tune when a PSRO meta-solver should shift from population diversity to equilibrium refinement.
- Generalization is tested, not assumed — both algorithms are developed on one set of four games and evaluated on a separate set of larger, unseen games. VAD-CFR holds up in 10 of 11 games; SHOR-PSRO in 8 of 11, with no re-tuning between training and test.
- The discovered mechanisms are non-intuitive by design — things like a hard warm-start at iteration 500, asymmetric boosting of positive regrets by exactly 1.1, and separate training/evaluation solver configurations are not the kind of choices human researchers typically arrive at, which is this research’s core argument for automated search over this design space.
Michal Sutter
Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.

