Introducing GAMEBOT: a benchmark for evaluating LLM reasoning in competitive gaming environments. GAMEBOT decomposes complex in-game reasoning into modular subproblems that target specific abilities, such as rule understanding and strategy instruction following. We develop Chain-of-Thought (CoT) prompts that leverage domain knowledge to guide LLMs, and we automatically validate their intermediate reasoning steps against ground truth. The leaderboard below reports each model's score on eight games together with its average.
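The intermediate-step validation described above can be sketched in a few lines. The function names, the Tic-Tac-Toe setting, and the partial-credit scheme below are illustrative assumptions, not the benchmark's actual API: the idea is simply that the model's claimed intermediate result is compared against ground truth derived from the game rules.

```python
# Hypothetical sketch of intermediate-step checking: ask the model for an
# intermediate result (here, the set of legal moves in a Tic-Tac-Toe
# position) and compare it against ground truth computed from the rules.
# All names here are illustrative, not GAMEBOT's real interface.

def legal_moves(board):
    """Ground-truth legal moves: indices of empty cells.
    `board` is a 9-element list of 'X', 'O', or None."""
    return {i for i, cell in enumerate(board) if cell is None}

def score_intermediate_step(board, model_answer):
    """Return 1.0 for an exact match with ground truth, otherwise the
    Jaccard overlap between claimed and true move sets as partial credit."""
    truth = legal_moves(board)
    claimed = set(model_answer)
    if not truth and not claimed:
        return 1.0
    return len(truth & claimed) / len(truth | claimed)

board = ['X', 'O', None, None, 'X', None, None, None, 'O']
print(score_intermediate_step(board, [2, 3, 5, 6, 7]))  # exact match -> 1.0
print(score_intermediate_step(board, [2, 3]))           # partial credit
```

Validating intermediate steps this way, rather than only final game outcomes, is what lets the benchmark attribute a loss to a specific failed ability (e.g., misreading the board vs. choosing a poor strategy).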
Model | Rank | Average Score | Othello | Pong | Surround | Checkers | TicTacToe | Connect4 | Texas hold'em | Negotiation v2 |
---|---|---|---|---|---|---|---|---|---|---|
gpt-4o-2024-05-13 | 1 | 0.470 | 0.395 | 0.685 | 0.525 | 0.270 | 0.475 | 0.315 | 0.675 | 0.395 |
claude-3-5-sonnet@20240620 | 2 | 0.390 | 0.280 | 0.545 | 0.620 | 0.050 | 0.395 | 0.220 | 0.535 | 0.475 |
gpt-4-2024-03-16 | 3 | 0.355 | 0.135 | 0.475 | 0.545 | 0.090 | 0.405 | 0.275 | 0.510 | 0.380 |
llama-3.1-405b-instruct | 4 | 0.305 | 0.215 | 0.640 | 0.520 | 0.000 | 0.325 | 0.260 | 0.245 | 0.255 |
llama-3.1-70b-instruct | 5 | 0.250 | 0.135 | 0.575 | 0.300 | 0.050 | 0.495 | 0.175 | 0.120 | 0.130 |
gpt-4o-mini-2024-07-18 | 6 | 0.205 | -0.175 | 0.430 | 0.335 | -0.015 | 0.170 | -0.045 | 0.395 | 0.495 |
gemini-1.5-pro-preview-0514 | 7 | 0.195 | 0.195 | 0.585 | -0.060 | 0.200 | 0.065 | -0.045 | 0.385 | 0.185 |
claude-3-sonnet@20240229 | 8 | 0.155 | 0.100 | 0.645 | -0.140 | 0.010 | 0.165 | 0.140 | 0.305 | 0.010 |
gemini-1.5-flash-preview-0514 | 9 | 0.125 | -0.060 | 0.465 | 0.465 | 0.070 | -0.120 | 0.045 | 0.015 | 0.115 |
jamba-1.5-large | 10 | 0.090 | 0.070 | 0.165 | 0.035 | 0.115 | 0.085 | 0.020 | 0.095 | 0.120 |
claude-3-haiku@20240412 | 11 | 0.020 | 0.080 | 0.240 | 0.055 | -0.180 | -0.050 | -0.170 | 0.155 | 0.025 |
reka-core-20240415 | 12 | 0.005 | -0.045 | 0.325 | -0.200 | -0.250 | -0.045 | 0.135 | 0.140 | -0.005 |
mistral-nemo-2407 | 13 | 0.000 | 0.085 | 0.195 | -0.255 | -0.025 | -0.055 | -0.105 | 0.240 | -0.040 |
gemini-1.0-pro-002 | 14 | -0.030 | -0.010 | 0.115 | -0.130 | -0.250 | -0.030 | -0.195 | 0.250 | -0.050 |
llama-3.1-8b-instruct | 15 | -0.045 | 0.010 | 0.240 | -0.200 | -0.250 | 0.025 | -0.045 | -0.065 | -0.100 |
reka-flash-20240904 | 16 | -0.080 | -0.175 | 0.225 | -0.170 | -0.250 | -0.115 | -0.060 | -0.070 | -0.010 |
jamba-1.5-mini | 17 | -0.100 | 0.065 | 0.070 | -0.145 | -0.250 | -0.115 | -0.180 | -0.140 | -0.080 |
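As a quick sanity check on the leaderboard, each model's Average Score should be the mean of its eight per-game scores. The snippet below verifies this for the claude-3-5-sonnet row; note that the per-game scores appear rounded, so some rows' averages may differ from the recomputed mean in the last digit.

```python
# Recompute the Average Score for one leaderboard row (claude-3-5-sonnet,
# copied from the table above) and compare it to the reported value.
row = ("claude-3-5-sonnet@20240620 | 2 | 0.390 | 0.280 | 0.545 | 0.620 | "
       "0.050 | 0.395 | 0.220 | 0.535 | 0.475 |")

cells = [c.strip() for c in row.strip(" |").split("|")]
reported_avg = float(cells[2])             # the "Average Score" column
game_scores = [float(c) for c in cells[3:]]  # the eight per-game scores

computed_avg = sum(game_scores) / len(game_scores)
print(round(computed_avg, 3))  # 0.39, matching the reported average
```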