📊 New Evaluation Results Available!
We've recently evaluated several cutting-edge models on our benchmark, including the latest iterations of GPT, Claude, and Gemini models. Check out the detailed battle replays and performance analysis below:
Watch Connect4 Competitions
GPT-5 vs Gemini 2.5 Pro
Witness an intense battle of strategic thinking in Connect4. See how these advanced models plan multiple moves ahead and adapt their strategies.
Watch Battle Replay
Watch Checkers Competitions
GPT-5 vs Gemini 2.5 Pro
Experience the tactical depth of Checkers as these models demonstrate advanced piece positioning and long-term strategic planning.
Watch Battle Replay
🔥 Latest Evaluation Results
🎯 Connect4 Matches
Model A | Score (A : B) | Model B |
---|---|---|
gemini-2.0-flash-thinking | 6 : 4 | gpt-4o-0513 |
gemini-2.0-pro-exp | 6 : 4 | gemini-2.0-flash-thinking |
deepseek-r1 | 7 : 3 | gemini-2.0-pro-exp |
deepseek-r1 | 8 : 8 | o1-preview |
o3-mini-high | 10 : 6 | deepseek-r1 |
o3-mini-high | 8 : 8 | claude-3.7-sonnet |
o3-mini-high | 12 : 4 | gpt-4.5 |
gpt-5 | 11 : 8 | Gemini 2.5 Pro |
♟️ Checkers Matches
Model A | Score (A : B) | Model B |
---|---|---|
gemini-2.0-pro-exp | 5 : 4 | gemini-2.0-flash-thinking |
deepseek-r1 | 9 : 1 | gemini-2.0-pro-exp |
o3-mini-high | 9 : 0 | deepseek-r1 |
o3-mini-high | 9 : 0 | claude-3.7-sonnet |
gpt-5 | 20 : 0 | Gemini 2.5 Pro |
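Because the matchups above were played over different numbers of games, it can help to convert each score into Model A's share of the total. The snippet below is a minimal sketch of that arithmetic, not part of the benchmark's own tooling. It assumes each side of a score counts the games credited to that model, and the helper name `score_share` is hypothetical.

```python
def score_share(score: str) -> float:
    """Return Model A's share of the total for a score string like '11 : 8'.

    Assumption: both numbers count games credited to each model, so the
    share is a / (a + b).
    """
    a_text, b_text = score.split(":")
    a, b = int(a_text), int(b_text)  # int() tolerates surrounding spaces
    total = a + b
    if total == 0:
        raise ValueError("score contains no games")
    return a / total


# A few rows copied from the tables above.
matches = [
    ("gpt-5", "11 : 8", "Gemini 2.5 Pro"),            # Connect4
    ("gpt-5", "20 : 0", "Gemini 2.5 Pro"),            # Checkers
    ("o3-mini-high", "8 : 8", "claude-3.7-sonnet"),   # Connect4
]

for model_a, score, model_b in matches:
    print(f"{model_a} vs {model_b}: {score_share(score):.2f}")
```

A share of 0.50 indicates an even match, and values closer to 1.00 indicate a more one-sided result for Model A.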
🏆 Key Finding
GPT-5 emerges as the strongest reasoning model in our latest evaluation: it beat Gemini 2.5 Pro 11 : 8 in Connect4 and 20 : 0 in Checkers, demonstrating superior strategic thinking and decision-making in both game environments.