GAMEBoT: Transparent Assessment of LLM Reasoning in Games

Wenye Lin¹, Jonathan Roberts², Yunhan Yang¹, Samuel Albanie, Zongqing Lu³, Kai Han¹

¹The University of Hong Kong, ²University of Cambridge, ³Peking University

ACL 2025 Main Conference

🚀 Latest News & Updates

📊 New Evaluation Results Available!

We've recently evaluated several cutting-edge models on our benchmark, including the latest iterations of GPT, Claude, and Gemini models. Check out the detailed battle replays and performance analysis below:

Watch Connect4 Competitions

GPT-5 vs Gemini 2.5 Pro

Witness an intense battle of strategic thinking in Connect4. See how these advanced models plan multiple moves ahead and adapt their strategies.

Watch Battle Replay

Watch Checkers Competitions

GPT-5 vs Gemini 2.5 Pro

Experience the tactical depth of Checkers as these models demonstrate advanced piece positioning and long-term strategic planning.

Watch Battle Replay

🔥 Latest Evaluation Results

🎯 Connect4 Matches

Model A	Score	Model B
gemini-2.0-flash-thinking	6 : 4	gpt-4o-0513
gemini-2.0-pro-exp	6 : 4	gemini-2.0-flash-thinking
deepseek-r1	7 : 3	gemini-2.0-pro-exp
deepseek-r1	8 : 8	o1-preview
o3-mini-high	10 : 6	deepseek-r1
o3-mini-high	8 : 8	claude-3.7-sonnet
o3-mini-high	12 : 4	gpt-4.5
gpt-5	11 : 8	Gemini 2.5 Pro

♟️ Checkers Matches

Model A	Score	Model B
gemini-2.0-pro-exp	5 : 4	gemini-2.0-flash-thinking
deepseek-r1	9 : 1	gemini-2.0-pro-exp
o3-mini-high	9 : 0	deepseek-r1
o3-mini-high	9 : 0	claude-3.7-sonnet
gpt-5	20 : 0	Gemini 2.5 Pro

🏆 Key Finding

GPT-5 emerges as the strongest reasoning model in our latest evaluation, beating Gemini 2.5 Pro 11 : 8 in Connect4 and 20 : 0 in Checkers.

Introduction

Introducing GAMEBoT: a benchmark for evaluating LLM reasoning in competitive gaming environments. It decomposes complex reasoning in games into modular subproblems, targeting abilities such as rule understanding and following strategy instructions. We develop Chain-of-Thought (CoT) prompts that leverage domain knowledge to guide LLMs, and we automatically validate their intermediate reasoning steps against ground truth. GAMEBoT has the following properties:

Interpretability: Our benchmark assesses not only the quality of final decisions but also the intermediate reasoning steps, giving insight into how to improve the training or inference of LLMs.
Difficulty: The games are challenging enough to differentiate between top-performing models. Even GPT-4o scores only 0.52 (out of 1) on intermediate results.
Alleviated Data Contamination: Rather than evaluating on a predefined dataset, we evaluate LLMs in interactive gaming environments, where game states span a wide spectrum depending on the specific actions received and on randomness.
Stronger Baselines: The prompts presented in this work can serve as valuable CoT baselines for future research exploring advanced prompting techniques like auto-prompting and reflection.
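
To make the interpretability property concrete, the following is a minimal sketch of how one intermediate answer could be checked against a rule-derived ground truth. Everything in it is an assumption made for illustration: the subproblem (listing the Connect4 columns that are not yet full), the bracketed tag format, the function names, and the choice of F1 as the metric are not taken from GAMEBoT's actual prompts or code.

```python
import re

def legal_columns(board):
    """Ground truth from the rules: columns whose top cell is still empty."""
    return {col for col, cell in enumerate(board[0]) if cell == "."}

def parse_answer(response, tag):
    """Read integers from a tagged span such as '[legal_columns: 1, 3]'.

    The bracketed tag format is hypothetical. Returns None if the tag is missing.
    """
    match = re.search(rf"\[{re.escape(tag)}:([^\]]*)\]", response)
    if match is None:
        return None
    return {int(token) for token in re.findall(r"\d+", match.group(1))}

def f1_score(predicted, truth):
    """F1 between the predicted set and the ground-truth set."""
    if predicted is None:
        return 0.0
    if not predicted and not truth:
        return 1.0
    hits = len(predicted & truth)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(truth)
    return 2 * precision * recall / (precision + recall)

# Rows are listed top to bottom; columns 0, 2 and 4 are full.
board = ["X.O.X..", "O.X.O..", "X.O.X..", "O.X.O..", "X.O.X..", "O.X.O.."]
response = "Column 0 is full. [legal_columns: 1, 3, 5] I will play column 3."

truth = legal_columns(board)
predicted = parse_answer(response, "legal_columns")
score = f1_score(predicted, truth)
```

In this example the response misses column 6, so precision is 1 and recall is 3/4. The final move can still be legal while the intermediate score exposes the incomplete board reading.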

Involved Tasks

Task 1: Othello

Othello (Reversi) is a board game played on an 8x8 board. Two players take turns placing discs, attempting to capture their opponent's discs by sandwiching them between their own. Captured discs are flipped to the capturing player's color. The player with the majority of discs at the end of the game wins. The game emphasizes strategic placement and tactical maneuvering to control the board.

Required Abilities: Spatial Reasoning; Positional Evaluation
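
The sandwiching rule described above can be sketched as follows. This is an illustrative implementation of the standard Othello capture rule, not GAMEBoT's environment code. The board encoding ("B", "W", "." in a list of lists) and the function names are assumptions.

```python
# The eight directions in which a line of opponent discs can be sandwiched.
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def start_board():
    """Standard opening position on an 8x8 board."""
    board = [["."] * 8 for _ in range(8)]
    board[3][3], board[4][4] = "W", "W"
    board[3][4], board[4][3] = "B", "B"
    return board

def discs_to_flip(board, row, col, player):
    """Return the opponent discs captured if `player` places a disc at (row, col)."""
    if board[row][col] != ".":
        return []
    opponent = "W" if player == "B" else "B"
    flipped = []
    for dr, dc in DIRECTIONS:
        line = []
        r, c = row + dr, col + dc
        # Walk over a contiguous run of opponent discs.
        while 0 <= r < 8 and 0 <= c < 8 and board[r][c] == opponent:
            line.append((r, c))
            r, c = r + dr, c + dc
        # The run counts only if it is closed off by one of the player's own discs.
        if line and 0 <= r < 8 and 0 <= c < 8 and board[r][c] == player:
            flipped.extend(line)
    return flipped

board = start_board()
capture = discs_to_flip(board, 2, 3, "B")
no_capture = discs_to_flip(board, 2, 2, "B")
```

Under the standard rules a move is legal only if it flips at least one disc, so an empty result marks an illegal placement.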

gpt-4o (black) vs llama3.1-405b-instruct (white)

gpt-4o-mini (left) vs llama3.1-405b (right)

Task 2: Pong

Pong is a classic two-player arcade game simulating table tennis. Players control paddles to hit a ball back and forth, aiming to score points by making the opponent miss. It represents a simplified environment with continuous action spaces.

Required Abilities: Mathematical Reasoning

Task 3: Surround

Surround is a two-player game where players control a continuously moving line. The goal is to force the opponent to collide with a wall or with the growing trail of either player. It highlights spatial reasoning and strategic blocking.

Required Abilities: Information Extraction; Spatial Reasoning; Long-Term Path Planning
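
The collision rule can be illustrated with a small grid sketch. The grid size, the coordinate convention (row, column), and the function names are assumptions made for this example and do not describe the benchmark's actual environment.

```python
# Candidate moves as (row delta, column delta).
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def safe_moves(head, occupied, height, width):
    """List the moves that hit neither a wall nor any trail cell."""
    safe = []
    for name, (dr, dc) in MOVES.items():
        r, c = head[0] + dr, head[1] + dc
        inside = 0 <= r < height and 0 <= c < width
        if inside and (r, c) not in occupied:
            safe.append(name)
    return safe

def step(head, direction, occupied, height, width):
    """Advance one cell; return (new_head, crashed) and extend the trail if safe."""
    dr, dc = MOVES[direction]
    new_head = (head[0] + dr, head[1] + dc)
    inside = 0 <= new_head[0] < height and 0 <= new_head[1] < width
    crashed = (not inside) or new_head in occupied
    if not crashed:
        occupied.add(new_head)
    return new_head, crashed

# A 3x3 grid where the player has already moved from (0, 0) to (0, 1).
occupied = {(0, 0), (0, 1)}
options = safe_moves((0, 1), occupied, 3, 3)
new_head, crashed = step((0, 1), "left", set(occupied), 3, 3)
```

Because both players' trails share the same `occupied` set, a move into either trail counts as a crash.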

gpt-4o (left) vs claude-3-5-sonnet (right)

gemini-1.5-pro-preview (white) vs jamba-1.5-large (black)

Task 4: Checkers

Checkers is a board game where players move their pieces diagonally, capturing opponent pieces by jumping over them. Regular pieces can only move forward, while "kings," earned by reaching the opponent's back rank, can move and capture both forward and backward. The game ends when one player has captured all of the opponent's pieces or has blocked them so that no legal move remains. It involves strategic planning and tactical piece advancement.

Required Abilities: Spatial Reasoning; Game Board Understanding

Task 5: TicTacToe

Tic-Tac-Toe is a simple two-player game played on a 3x3 grid. Players take turns marking a square with their respective symbol, aiming to create a line of three symbols horizontally, vertically, or diagonally. Its simplicity makes it useful for lightweight evaluation of LLMs.

Required Abilities: Pattern Recognition; Game Board Understanding
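
As a small illustration of the pattern recognition involved, the sketch below checks for a completed line and lists the squares that would win immediately. The 9-character string encoding and the function names are assumptions chosen for brevity.

```python
# All eight winning lines, as indices into a row-major 3x3 board.
LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]

def winner(board):
    """Return 'X' or 'O' if that player has three in a line, else None."""
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

def winning_moves(board, player):
    """Indices of empty squares that complete a line for `player`."""
    return [
        i for i, cell in enumerate(board)
        if cell == "." and winner(board[:i] + player + board[i + 1:]) == player
    ]

board = "XX.OO...."  # X holds squares 0 and 1, O holds squares 3 and 4
x_wins = winning_moves(board, "X")
o_wins = winning_moves(board, "O")
```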

gemini-1.5-pro-preview (X) vs llama3.1-70b-instruct (O)

claude-3-5-sonnet (yellow) vs gpt-4o-mini (red)

Task 6: Connect4

Connect Four is a two-player connection game played on a vertically suspended 6x7 grid. Players drop colored discs into columns, aiming to connect four of their own discs horizontally, vertically, or diagonally. It involves strategic thinking and anticipating opponent moves.

Required Abilities: Pattern Recognition; Game Board Understanding
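
The drop-and-connect mechanics can be sketched as below. The board encoding (6 rows by 7 columns, row 0 at the top, "." for empty) and the function names are assumptions for illustration only.

```python
ROWS, COLS = 6, 7

def drop(board, col, player):
    """Place a disc in the lowest empty cell of `col`; return its row, or None if full."""
    for row in range(ROWS - 1, -1, -1):
        if board[row][col] == ".":
            board[row][col] = player
            return row
    return None

def has_four(board, row, col):
    """Check whether the disc at (row, col) is part of a line of four or more."""
    player = board[row][col]
    if player == ".":
        return False
    # Horizontal, vertical, and the two diagonals.
    for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
        count = 1
        for sign in (1, -1):
            r, c = row + sign * dr, col + sign * dc
            while 0 <= r < ROWS and 0 <= c < COLS and board[r][c] == player:
                count += 1
                r, c = r + sign * dr, c + sign * dc
        if count >= 4:
            return True
    return False

board = [["."] * COLS for _ in range(ROWS)]
for col in range(3):
    drop(board, col, "R")
three_in_row = has_four(board, ROWS - 1, 2)
last_row = drop(board, 3, "R")
four_in_row = has_four(board, last_row, 3)
```

Checking only the lines through the most recent disc is enough, since a new four-in-a-row must include the disc that was just dropped.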

Task 7: Texas Hold'em

Texas Hold'em involves betting, bluffing, and incomplete information. Players receive two private cards and share five community cards, forming the best possible five-card hand. Multiple betting rounds occur throughout the hand, allowing players to bet strategically based on the strength of their hand and their assessment of their opponents' hands. The player with the best hand at the showdown, or the last remaining player after all others have folded, wins the pot.

Required Abilities: Risk Management; Bluffing; Hand Analysis
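
Forming "the best possible five-card hand" from two private and five community cards can be sketched by scoring all 21 five-card subsets. This is a generic poker hand ranker written for illustration, not GAMEBoT's evaluator. The two-character card encoding (rank then suit, e.g. "AH") and the function names are assumptions.

```python
from collections import Counter
from itertools import combinations

RANKS = "23456789TJQKA"

def rank_five(cards):
    """Score a five-card hand as a tuple; larger tuples beat smaller ones."""
    values = sorted((RANKS.index(card[0]) for card in cards), reverse=True)
    counts = Counter(values)
    # Order ranks by multiplicity, then by rank (a full house gives trips, pair).
    ordered = tuple(sorted(counts, key=lambda v: (counts[v], v), reverse=True))
    shape = sorted(counts.values(), reverse=True)
    flush = len({card[1] for card in cards}) == 1
    straight = len(counts) == 5 and values[0] - values[4] == 4
    if values == [12, 3, 2, 1, 0]:  # A-2-3-4-5: the ace plays low
        straight, ordered = True, (3, 2, 1, 0, -1)
    if straight and flush:
        category = 8
    elif shape == [4, 1]:
        category = 7
    elif shape == [3, 2]:
        category = 6
    elif flush:
        category = 5
    elif straight:
        category = 4
    elif shape == [3, 1, 1]:
        category = 3
    elif shape == [2, 2, 1]:
        category = 2
    elif shape == [2, 1, 1, 1]:
        category = 1
    else:
        category = 0
    return (category, ordered)

def best_hand(hole, community):
    """Pick the strongest five cards out of the seven available."""
    return max(combinations(hole + community, 5), key=rank_five)

royal = best_hand(["AH", "KH"], ["QH", "JH", "TH", "2C", "2D"])
full_house = best_hand(["AS", "AD"], ["AC", "KD", "KS", "2H", "7C"])
```

Comparing the returned tuples first orders hands by category and then breaks ties by the ranks involved, which is how a showdown between two players could be resolved.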

gpt4-1106 vs claude-3-sonnet

Task 8: Negotiation v2

Negotiation v2 is a game where two players negotiate to divide a set of items, each holding a private valuation for each item. Each player aims to maximize the total value of the items they acquire. After 8 rounds of negotiation, the game has a 20% chance of ending in each subsequent round. If no agreement is reached before the game's forced termination, both players receive a reward of 0.

Required Abilities: Collaboration in Competition; Opponent Modeling; Mathematical Reasoning
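
The payoff and termination rules can be made concrete with a short sketch. It rests on one interpretation that the text does not fully pin down: each round after the 8th is assumed to be played only with probability 0.8, independently of earlier rounds. The item names, valuations, and function names are hypothetical.

```python
def total_value(valuations, allocation):
    """Private value of the items a player receives under a proposed split."""
    return sum(valuations[item] * count for item, count in allocation.items())

def prob_round_is_played(round_number, guaranteed_rounds=8, end_prob=0.2):
    """Probability that `round_number` is reached, under the assumption above."""
    extra_rounds = max(0, round_number - guaranteed_rounds)
    return (1 - end_prob) ** extra_rounds

# Hypothetical private valuations and a proposed share for one player.
valuations = {"book": 3, "hat": 2, "ball": 1}
allocation = {"book": 1, "ball": 2}

value = total_value(valuations, allocation)
p_round_8 = prob_round_is_played(8)
p_round_10 = prob_round_is_played(10)
```

Under this reading, the chance of reaching round 8 + k is 0.8^k, so if neither side ever agrees the expected number of rounds is 8 + 0.8 / 0.2 = 12. A player weighing a late offer can discount its value by the probability that the next round is played at all.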

Leaderboard


Model	Rank	Average Score	Othello	Pong	Surround	Checkers	TicTacToe	Connect4	Texas Hold'em	Negotiation v2
gpt-4o-2024-05-13	1	0.470	0.395	0.685	0.525	0.270	0.475	0.315	0.675	0.395
claude-3-5-sonnet@20240620	2	0.390	0.280	0.545	0.620	0.050	0.395	0.220	0.535	0.475
gpt-4-2024-03-16	3	0.355	0.135	0.475	0.545	0.090	0.405	0.275	0.510	0.380
llama-3.1-405b-instruct	4	0.305	0.215	0.640	0.520	0.000	0.325	0.260	0.245	0.255
llama-3.1-70b-instruct	5	0.250	0.135	0.575	0.300	0.050	0.495	0.175	0.120	0.130
gpt-4o-mini-2024-07-18	6	0.205	-0.175	0.430	0.335	-0.015	0.170	-0.045	0.395	0.495
gemini-1.5-pro-preview-0514	7	0.195	0.195	0.585	-0.060	0.200	0.065	-0.045	0.385	0.185
claude-3-sonnet@20240229	8	0.155	0.100	0.645	-0.140	0.010	0.165	0.140	0.305	0.010
gemini-1.5-flash-preview-0514	9	0.125	-0.060	0.465	0.465	0.070	-0.120	0.045	0.015	0.115
jamba-1.5-large	10	0.090	0.070	0.165	0.035	0.115	0.085	0.020	0.095	0.120
claude-3-haiku@20240412	11	0.020	0.080	0.240	0.055	-0.180	-0.050	-0.170	0.155	0.025
reka-core-20240415	12	0.005	-0.045	0.325	-0.200	-0.250	-0.045	0.135	0.140	-0.005
mistral-nemo-2407	13	0.000	0.085	0.195	-0.255	-0.025	-0.055	-0.105	0.240	-0.040
gemini-1.0-pro-002	14	-0.030	-0.010	0.115	-0.130	-0.250	-0.030	-0.195	0.250	-0.050
llama-3.1-8b-instruct	15	-0.045	0.010	0.240	-0.200	-0.250	0.025	-0.045	-0.065	-0.100
reka-flash-20240904	16	-0.080	-0.175	0.225	-0.170	-0.250	-0.115	-0.060	-0.070	-0.010
jamba-1.5-mini	17	-0.100	0.065	0.070	-0.145	-0.250	-0.115	-0.180	-0.140	-0.080

gpt-4o (P1) vs gemini-1.5-pro-preview (P2)
