Introduction

Introducing GAMEBOT: a benchmark evaluating LLM reasoning in competitive gaming environments. It decomposes complex reasoning in games into modular subproblems, targeting abilities such as rule understanding and following strategy instructions. We develop Chain-of-Thought (CoT) prompts that leverage domain knowledge to guide LLMs, and we automatically validate their intermediate reasoning steps against ground truth. The benchmark has the following properties:

  • Interpretability: Our benchmark assesses not only the quality of final decisions but also the intermediate reasoning steps, giving insights for improving the training or inference of LLMs.
  • Difficulty: The games are challenging enough to differentiate between top-performing models. Even GPT-4o scores only 0.52 (out of 1) on intermediate results.
  • Alleviating Data Contamination: Rather than evaluating on a predefined dataset, we evaluate LLMs in interactive gaming environments, where game states span a wide spectrum depending on the specific actions received and on randomness.
  • Stronger Baselines: The prompts presented in this work can serve as valuable CoT baselines for future research exploring advanced prompting techniques like auto-prompting and reflection.

Involved Tasks

Task 1: Othello

Othello (Reversi) is a board game played on an 8x8 board. Two players take turns placing discs, attempting to capture their opponent's discs by sandwiching them between their own. Captured discs are flipped to the capturing player's color. The player with the majority of discs at the end of the game wins. The game emphasizes strategic placement and tactical maneuvering to control the board.

Required Abilities: Spatial Reasoning; Positional Evaluation
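
The sandwiching rule described above can be illustrated with a short sketch. The board encoding used here ("." for empty, "B" and "W" for the two colors, row 0 at the top) and the function names are hypothetical choices made for illustration; they are not taken from the GAMEBOT environment.

```python
# Minimal sketch of the Othello capture rule (hypothetical board encoding).
# "." = empty, "B" = black disc, "W" = white disc; board is an 8x8 list of lists.

DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def initial_board():
    """Return the standard Othello starting position."""
    board = [["."] * 8 for _ in range(8)]
    board[3][3], board[4][4] = "W", "W"
    board[3][4], board[4][3] = "B", "B"
    return board


def discs_to_flip(board, row, col, player):
    """List the opponent discs that would be flipped if `player` plays (row, col)."""
    if board[row][col] != ".":
        return []
    opponent = "W" if player == "B" else "B"
    flips = []
    for dr, dc in DIRECTIONS:
        r, c = row + dr, col + dc
        line = []
        # Walk over a contiguous run of opponent discs in this direction.
        while 0 <= r < 8 and 0 <= c < 8 and board[r][c] == opponent:
            line.append((r, c))
            r, c = r + dr, c + dc
        # The run is captured only if it ends on one of the player's own discs.
        if line and 0 <= r < 8 and 0 <= c < 8 and board[r][c] == player:
            flips.extend(line)
    return flips
```

Under this encoding, a move is legal exactly when `discs_to_flip` returns a non-empty list; for example, Black playing `(2, 3)` on the starting position flips the white disc at `(3, 3)`.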

gpt-4o (black) vs llama3.1-405b-instruct (white)

gpt-4o-mini (left) vs llama3.1-405b (right)

Task 2: Pong

Pong is a classic two-player arcade game simulating table tennis. Players control paddles to hit a ball back and forth, aiming to score points by making the opponent miss. It represents a simplified environment with continuous action spaces.

Required Abilities: Mathematical Reasoning
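
As an illustration of the kind of calculation this task can involve, the sketch below predicts where a ball will cross a paddle's column. It assumes straight-line motion, mirror-like bounces off horizontal walls at y = 0 and y = field_height, and a simple (x, y) coordinate system; these assumptions, and all names in the code, are hypothetical and are not taken from the GAMEBOT environment.

```python
# Minimal sketch: predict the ball's y-coordinate when it reaches the paddle.
# Assumes constant velocity and perfect reflection off the top and bottom walls.


def predict_intercept_y(ball_x, ball_y, vel_x, vel_y, paddle_x, field_height):
    """Return the predicted y at x == paddle_x, or None if the ball moves away."""
    if vel_x == 0:
        return None
    time_to_paddle = (paddle_x - ball_x) / vel_x
    if time_to_paddle < 0:
        return None  # the ball is travelling away from this paddle
    # Position if there were no walls ("unfolded" trajectory).
    unfolded_y = ball_y + vel_y * time_to_paddle
    # Fold the trajectory back into [0, field_height] by mirroring at each wall.
    period = 2 * field_height
    folded = unfolded_y % period
    return folded if folded <= field_height else period - folded
```

For instance, with a field height of 100, a ball at (0, 50) moving with velocity (10, 20) bounces off both walls and arrives at x = 100 with y = 50.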

Task 3: Surround

Surround is a two-player game where each player controls a continuously moving line. The goal is to force the opponent to collide with a wall or with the growing trail of either player. It highlights spatial reasoning and strategic blocking.

Required Abilities: Information Extraction; Spatial Reasoning; Long-Term Path Planning

gpt-4o (left) vs claude-3-5-sonnet (right)

gemini-1.5-pro-preview (white) vs jamba-1.5-large (black)

Task 4: Checkers

Checkers is a board game where players move their pieces diagonally, capturing opponent pieces by jumping over them. Regular pieces can only move forward, while "kings," earned by reaching the opponent's back rank, can move and capture both forward and backward. The game ends when one player has captured all of their opponent's pieces or has left the opponent with no legal move. It involves strategic planning and tactical piece advancement.

Required Abilities: Spatial Reasoning; Game Board Understanding

Task 5: TicTacToe

Tic-Tac-Toe is a simple two-player game played on a 3x3 grid. Players take turns marking a square with their respective symbol, aiming to create a line of three symbols horizontally, vertically, or diagonally. Its simplicity makes it useful as a lightweight evaluation of LLMs.

Required Abilities: Pattern Recognition; Game Board Understanding
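
The winning condition described above (three matching symbols in a row, column, or diagonal) can be checked with a few lines of code. The flat nine-cell board encoding and the function name below are hypothetical choices for illustration, not the representation used by GAMEBOT.

```python
# Minimal sketch of a Tic-Tac-Toe win check (hypothetical board encoding).
# The board is a sequence of 9 cells in row-major order: "X", "O", or " ".

LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def winner(cells):
    """Return "X" or "O" if that symbol owns a full line, otherwise None."""
    for a, b, c in LINES:
        if cells[a] != " " and cells[a] == cells[b] == cells[c]:
            return cells[a]
    return None
```

For example, `winner("XXXOO    ")` returns `"X"`, while the full board `"XOXXOOOXX"` has no completed line and returns `None`.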

gemini-1.5-pro-preview (X) vs llama3.1-70b-instruct (O)

claude-3-5-sonnet (yellow) vs gpt-4o-mini (red)

Task 6: Connect4

Connect Four is a two-player connection game played on a vertically suspended 6x7 grid. Players drop colored discs into columns, aiming to connect four of their own discs horizontally, vertically, or diagonally. It involves strategic thinking and anticipating opponent moves.

Required Abilities: Pattern Recognition; Game Board Understanding
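
A minimal sketch of the Connect Four mechanics described above, dropping a disc into a column and checking for four in a line, is given below. It assumes a grid of 6 rows by 7 columns with row 0 at the top; the encoding ("." for empty cells) and the function names are hypothetical and not taken from the GAMEBOT environment.

```python
# Minimal sketch of Connect Four mechanics (hypothetical board encoding).
# The board has 6 rows and 7 columns; "." marks an empty cell, row 0 is the top.

ROWS, COLS = 6, 7


def empty_board():
    return [["."] * COLS for _ in range(ROWS)]


def drop(board, col, player):
    """Place a disc in the lowest empty cell of `col`; return its row or None if full."""
    for row in range(ROWS - 1, -1, -1):
        if board[row][col] == ".":
            board[row][col] = player
            return row
    return None


def has_four(board, player):
    """Return True if `player` has four discs in a horizontal, vertical, or diagonal line."""
    for r in range(ROWS):
        for c in range(COLS):
            # Right, down, down-right, down-left cover every line orientation once.
            for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
                end_r, end_c = r + 3 * dr, c + 3 * dc
                if 0 <= end_r < ROWS and 0 <= end_c < COLS:
                    if all(board[r + i * dr][c + i * dc] == player for i in range(4)):
                        return True
    return False
```

Dropping four discs of the same color into columns 0 through 3 of an empty board fills the bottom row and makes `has_four` return `True` for that color.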

Task 7: Texas Hold'em

Texas Hold'em involves betting, bluffing, and incomplete information. Players receive two private cards and share five community cards, forming the best possible five-card hand. Multiple betting rounds occur throughout the hand, allowing players to bet strategically based on the strength of their hand and their assessment of their opponents' hands. The player with the best hand at the showdown, or the last remaining player after all others have folded, wins the pot.

Required Abilities: Risk Management; Bluffing; Hand Analysis

gpt4-1106 vs claude-3-sonnet

Task 8: Negotiation v2

Negotiation v2 is a game where two players negotiate to divide a set of items, each holding a private valuation for each item. Each player aims to maximize the total value of the items they acquire. After 8 rounds of negotiation, the game has a 20% chance of ending in each subsequent round. If no agreement is reached before the game's forced termination, both players receive a reward of 0.

Required Abilities: Collaboration in Competition; Opponent Modeling; Mathematical Reasoning
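
The payoff and termination rules above can be made concrete with a small sketch. It reads the termination rule as follows: the first 8 rounds are always played, and each round after that independently ends the game with probability 0.2. This is one possible reading of the description, and the item names, valuations, and function names are hypothetical.

```python
# Minimal sketch of the Negotiation v2 payoff and termination arithmetic.
# Item names and valuations are made up for illustration.


def total_value(valuations, allocation):
    """Sum a player's private per-item valuations over the items they receive."""
    return sum(valuations[item] * count for item, count in allocation.items())


def survival_probability(round_number, guaranteed_rounds=8, end_prob=0.2):
    """Probability that the game has not been force-terminated by the end of a round.

    Assumes rounds 1..guaranteed_rounds never terminate the game, and each later
    round independently ends it with probability `end_prob`.
    """
    extra_rounds = max(0, round_number - guaranteed_rounds)
    return (1.0 - end_prob) ** extra_rounds


valuations = {"book": 3, "hat": 5, "ball": 1}   # hypothetical private valuations
allocation = {"book": 2, "ball": 1}             # hypothetical agreed split
print(total_value(valuations, allocation))      # 3*2 + 1*1 = 7
print(survival_probability(10))                 # 0.8 ** 2, about 0.64
```

Under this reading, the chance that the game is still running shrinks geometrically after round 8, so stalling for a better deal carries a growing risk that both players end with a reward of 0.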

Leaderboard

| Model | Rank | Average Score | Othello | Pong | Surround | Checkers | TicTacToe | Connect4 | Texas Hold'em | Negotiation v2 |
|---|---|---|---|---|---|---|---|---|---|---|
| gpt-4o-2024-05-13 | 1 | 0.470 | 0.395 | 0.685 | 0.525 | 0.270 | 0.475 | 0.315 | 0.675 | 0.395 |
| claude-3-5-sonnet@20240620 | 2 | 0.390 | 0.280 | 0.545 | 0.620 | 0.050 | 0.395 | 0.220 | 0.535 | 0.475 |
| gpt-4-2024-03-16 | 3 | 0.355 | 0.135 | 0.475 | 0.545 | 0.090 | 0.405 | 0.275 | 0.510 | 0.380 |
| llama-3.1-405b-instruct | 4 | 0.305 | 0.215 | 0.640 | 0.520 | 0.000 | 0.325 | 0.260 | 0.245 | 0.255 |
| llama-3.1-70b-instruct | 5 | 0.250 | 0.135 | 0.575 | 0.300 | 0.050 | 0.495 | 0.175 | 0.120 | 0.130 |
| gpt-4o-mini-2024-07-18 | 6 | 0.205 | -0.175 | 0.430 | 0.335 | -0.015 | 0.170 | -0.045 | 0.395 | 0.495 |
| gemini-1.5-pro-preview-0514 | 7 | 0.195 | 0.195 | 0.585 | -0.060 | 0.200 | 0.065 | -0.045 | 0.385 | 0.185 |
| claude-3-sonnet@20240229 | 8 | 0.155 | 0.100 | 0.645 | -0.140 | 0.010 | 0.165 | 0.140 | 0.305 | 0.010 |
| gemini-1.5-flash-preview-0514 | 9 | 0.125 | -0.060 | 0.465 | 0.465 | 0.070 | -0.120 | 0.045 | 0.015 | 0.115 |
| jamba-1.5-large | 10 | 0.090 | 0.070 | 0.165 | 0.035 | 0.115 | 0.085 | 0.020 | 0.095 | 0.120 |
| claude-3-haiku@20240412 | 11 | 0.020 | 0.080 | 0.240 | 0.055 | -0.180 | -0.050 | -0.170 | 0.155 | 0.025 |
| reka-core-20240415 | 12 | 0.005 | -0.045 | 0.325 | -0.200 | -0.250 | -0.045 | 0.135 | 0.140 | -0.005 |
| mistral-nemo-2407 | 13 | 0.000 | 0.085 | 0.195 | -0.255 | -0.025 | -0.055 | -0.105 | 0.240 | -0.040 |
| gemini-1.0-pro-002 | 14 | -0.030 | -0.010 | 0.115 | -0.130 | -0.250 | -0.030 | -0.195 | 0.250 | -0.050 |
| llama-3.1-8b-instruct | 15 | -0.045 | 0.010 | 0.240 | -0.200 | -0.250 | 0.025 | -0.045 | -0.065 | -0.100 |
| reka-flash-20240904 | 16 | -0.080 | -0.175 | 0.225 | -0.170 | -0.250 | -0.115 | -0.060 | -0.070 | -0.010 |
| jamba-1.5-mini | 17 | -0.100 | 0.065 | 0.070 | -0.145 | -0.250 | -0.115 | -0.180 | -0.140 | -0.080 |
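
The listed averages appear consistent with a simple mean of the eight per-game scores, up to rounding; the page does not state the aggregation rule, so treat that as an assumption. Under that assumption, the ranking of the top three rows can be reproduced as follows, with the per-game scores copied from the leaderboard above.

```python
# Minimal sketch: rank models by the unweighted mean of their per-game scores.
# Assumes the "Average Score" column is a simple mean (not stated by the source).
from statistics import mean

# Per-game scores in leaderboard column order:
# Othello, Pong, Surround, Checkers, TicTacToe, Connect4, Texas Hold'em, Negotiation v2
scores = {
    "gpt-4-2024-03-16": [0.135, 0.475, 0.545, 0.090, 0.405, 0.275, 0.510, 0.380],
    "gpt-4o-2024-05-13": [0.395, 0.685, 0.525, 0.270, 0.475, 0.315, 0.675, 0.395],
    "claude-3-5-sonnet@20240620": [0.280, 0.545, 0.620, 0.050, 0.395, 0.220, 0.535, 0.475],
}


def rank_models(per_game_scores):
    """Return (model, average) pairs sorted from highest to lowest average."""
    averages = {model: mean(values) for model, values in per_game_scores.items()}
    return sorted(averages.items(), key=lambda pair: pair[1], reverse=True)


for rank, (model, average) in enumerate(rank_models(scores), start=1):
    print(rank, model, round(average, 3))
```

The resulting order matches the first three leaderboard rows; small differences in the last digit of an average are expected if the published values were rounded.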