Can LLMs Play Chess? I've Tested 13 Models (GPT-4o, Claude 3.5, Gemini 1.5 etc.)

UPD September 15, 2025: Reasoning models opened a new chapter in Chess performance, the most recent models, such as GPT-5, can play reasonable chess, even beating an average chess.com player. Recently LLM Chess eval has been extended to include games against different skill levels of Komodo Dragon 1 engine which is Elo rated at chess.com. All models from the leaderboard received their Elo ratings based on chess.com pool of players by extrapolating their performance against Dragon chess engine.

UPD January 25, 2025: Deepseek R1 is another model that broke the ceiling of zero wins showing meaningful chess performance -> https://dev.to/maximsaplin/deepseek-r1-vs-openai-o1-1ijm

UPD January 24, 2025: I have increased the number of simulations and a few of the non-reasoning models scored occasional wins (Claude 3.5, GPT-4o), yet it doesn't seem they actually play chess, maintaining negative material diff (except Grok-2, it had positive diff yet struggled with instruction following).

UPD January 16, 2025: o1-mini and o1-preview are for sure special models, they are the first ones to demonstrate meaningful chess games scoring a wins -> https://maxim-saplin.github.io/llm_chess/

No, they can not play chess if you expect them to win rather than merely move pieces on the board. I ran multiple simulations putting LLMs against a random player, and there was no single win for LLM.

Here's the full LLM Chess Leaderboard. A shorter version follows...

Player Draws ▼ Wins

gpt-4-turbo-2024-04-09 93.33% 0.00%

gpt-4o-2024-08-06 90.00% 0.00%

gpt-4o-2024-05-13 83.33% 0.00%

anthropic.claude-v3-5-sonnet 73.33% 0.00%

gpt-4o-mini-2024-07-18 60.00% 0.00%

llama-3-70b-instruct-awq 50.00% 0.00%

gemini-1.5-pro-preview-0409 36.67% 0.00%

gemini-1.5-flash-001 33.33% 0.00%

gemma-2-27b-it@q6_k_l 26.67% 0.00%

gemma-2-9b-it-8bit 13.33% 0.00%

anthropic.claude-v3-haiku 0.00% 0.00%

gpt-35-turbo-0125 0.00% 0.00%

llama-3.1-8b-instruct-fp16 0.00% 0.00%

Player	Draws ▼	Wins
gpt-4-turbo-2024-04-09	93.33%	0.00%
gpt-4o-2024-08-06	90.00%	0.00%
gpt-4o-2024-05-13	83.33%	0.00%
anthropic.claude-v3-5-sonnet	73.33%	0.00%
gpt-4o-mini-2024-07-18	60.00%	0.00%
llama-3-70b-instruct-awq	50.00%	0.00%
gemini-1.5-pro-preview-0409	36.67%	0.00%
gemini-1.5-flash-001	33.33%	0.00%
gemma-2-27b-it@q6_k_l	26.67%	0.00%
gemma-2-9b-it-8bit	13.33%	0.00%
anthropic.claude-v3-haiku	0.00%	0.00%
gpt-35-turbo-0125	0.00%	0.00%
llama-3.1-8b-instruct-fp16	0.00%	0.00%

While doing the practical exercises from Deeplearning's "AI Agentic Design Patterns with AutoGen" course, Lesson #4 (Tool Use) caught my attention. Two AI agents played chess and interacted via tools with a chess board moving pieces on it.

An idea popped in my head: It would be great to host a chess competition between different models, run multiple games, calculate win rates and Elo scores, and create a leaderboard saying that "Model A ranks #1 with Elo 3560, Model B is #2..."

After bumping up the turn limit to 200 (the exercises had it set to 3) I was surprised that the game kept going without any of the players winning. Trying to put one side at a disadvantage (adding to the system prompt "You are a beginner chess player...") didn't help much... The stronger player was still not able to make a checkmate.

That's when I decided to switch to testing an LLM against what seemed to be an even weaker opponent - a Random Player. I.e. it is a bot that interacts with the chess board in a similar fashion as LLM agent, yet it doesn't put any effort into calculating the best move. It merely asks from the board the list of legal moves (it can take given the current position) and randomly picks one.

I ended up creating LLM Chess project and simulated hundreds of games between Random Player and LLM. I evaluated chess and instruction following proficiency of 13 chat models. The results are present in the aforementioned Leaderboard.

Gen AI Skepticism

The results were surprising to me. Yet they come at a time of LLM skepticism at its highest since ChatGPT rocked in 2022.

People call out limited value from chatbots, companies struggle to find practical applications, and the impact on the global economy is not consequential. LLMs keep hallucinating and reliability is a key concern for wider adoption across the use cases. Imagine a world where database engine returned correct results for 7 out of 10 SELECT queries...

Ex-OpenAI Ilya Sutskever had recently said that LLM scaling has plateaued. LLM performance convergence can be seen as a trend of 2024.

LLM Memorize, They Do not Reason

The creator of the Arc Challenge, a benchmark that is "easy for humans, but hard for AI" argues that LLMs can memorize answers but they can not truly reason.

A recent study by Apple researchers (GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models) has recently been discussed a lot - it supports the above claim.

What they have shown was that even with the smallest changes to the original problems in GSM8K (a popular LLM benchmark used to evaluate their math capabilities) the scores drop significantly.

Imagine that GSM8K had a problem like this:

Alice had 5 apples and Bob had 3. What was the total number of apples they had?

🍏🍏🍏🍏🍏 + 🍏🍏🍏 = ?

When the authors changed the numbers (e.g. Alice had 4 apples and Bob had 2) OR when some irrelevant info was added to the problems (e.g. Alice is Bob's sister) they saw a significant drop in score.

Yann LeCun has compared LLM to giant look-up tables and the tech works as long as there's a huge corpus of data covering all the cases (which is not feasible):

The problem is that there is a long tail, this is an issue that a lot of people have realized in social networks and stuff like that, which is there’s a very, very long tail of things that people will ask and you can fine tune the system for the 80% or whatever of the things that most people will ask. And then this long tail is so large that you’re not going to be able to fine tune the system for all the conditions. And in the end, the system ends up being a giant lookup table essentially, which is not really what you want, you want systems that can reason, certainly that can plan.

To wrap up this rant about LLMs not being capable of true reasoning, here's a study from 2023 - The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A" - the one that made headlines discussing how ChatGPT happened to know the name of Tom Cruise's mom yet not knowing the famous son of her.

Thoughts

While working on the project and running those multiple simulations I think I have built my own intuition of LLMs being "just text generators". I have reconsidered the notion of "in-distribution", i.e. if the model has seen certain data - it might surprise you, if not - it might fail miserably. That was an exercise that allowed me to sense the limits of the technology.

Seeing in logs how a model could get into a failure mode no able to pick the right move (while being in the middle of the game successfully lasting until move #74) out of a given list of legal moves (e.g. 10 comma-separated codes like 'e2e4'), cause it (likely) met some unique combination of tokens and LLMs could not resist but pick a familiar, in distribution code which resembled one of the legal moves (and skipping the move code that was likely a very rare token not met in pre-training).

When I tried better prompts, e.g. asking LLMs to reflect before making a move, the results did not improve, yet now LLMs covered their decision process with some fancy verbosity.

--------------------------------------------------------------------------------
Proxy (to Player_Black):

Before deciding on the next move you can reflect on your current situation, write down notes, and evaluate.
    Here are a few recommendations that you can follow to make a better move decision:
    - Shortlist the most valuable next moves
    - Consider how they affect the situation
    - What could be the next moves from your opponent in each case
    - Is there any strategy fitting the situation and your choice of moves
    - Rerank the shortlisted moves based on the previous steps


--------------------------------------------------------------------------------
Player_Black (to Proxy):

**Reflection:**

1. **Shortlist Valuable Moves:**
   - **g8f6**: Develops the knight, controls the center, and targets e4.
   - **h7h6**: Prepares to fianchetto the bishop or prevents a knight pin on g5.
   - **c7c5**: Challenges the center and prepares to expand on the queenside.

2. **Immediate Effects:**
   - **g8f6**: Develops a piece, brings the knight nearer to the center.
   - **h7h6**: Solidifies kingside but is more of a precautionary move.
   - **c7c5**: Direct action against the center, preparing for future breaks.

3. **Opponent's Responses:**
   - **g8f6**: White might continue developing with Nc3 or push e4.
   - **h7h6**: White might ignore it and proceed with normal development.
   - **c7c5**: White could opt for dxc5 or develop with Nc3.

4. **Strategy Considerations:**
   - Aim to control the center and set up a solid defensive position.
   - Balance between development and maintaining structural integrity.

5. **Rerank Shortlisted Moves:**
   - **1. g8f6**: Best for development and central control.
   - **2. c7c5**: Strong challenge to the center.
   - **3. h7h6**: Defensive but adds limited value immediately.

With that in mind, I’ll proceed with the development strategy.

Action: `make_move g8f6`

Just like very often you leave a chatbot with a sense of emptiness, having read a lot of BS and not getting concrete answers to your questions.

Somehow the analogy from last year, trying to characterize an LLM chatbot in simple terms, still holds.

Prompting LLM to Make a Move

Proxy (a bot controlling the chess board) used hard-coded prompts and arranged a short dialog (max 10 turns) with an LLM asking it to make a move.

Proxy (to Player_Black):
You are a professional chess player and you play as black. Now is your turn to make a move. Before making a move you can pick one of the following actions:
- 'get_current_board' to get the schema and current status of the board
- 'get_legal_moves' to get a UCI formatted list of available moves
- 'make_move ' when you are ready to complete your turn (e.g., 'make_move e2e4')
Respond with the action.

Player_Black (to Proxy):
get_current_board

Proxy (to Player_Black):
♞ ♝ ♛ ♚ ♝ ♞ ♜ ♜
♟ ♟ ♟ ♟ ♟ ♟ ♟ ♟
· · · · · · · ·
· · · · · · · ·
· · · · · · · ·
· · · · · · · ·
♙ ♙ ♙ ♙ ♙ ♙ ♙ ♙
♖ ♘ ♗ ♕ ♔ ♗ ♘ ♖


Player_Black (to Proxy):
get_legal_moves

Proxy (to Player_Black):
a7a6, b7b6, c7c6, d7d6, e7e6, f7f6, g7g6, h7h6, a7a5, b7b5, c7c5, d7d5, e7e5, f7f5, g7g5, h7h5

Player_Black (to Proxy):
make_move e7e5

Proxy (to Player_Black):
Move made, switching player

Can the LLM Chess be Yet Another "Hard for LLMs" benchmark?

I am curious to see if any of the chat LLMs can score wins against a random player while preserving good performance across different tasks. I suspect that might be a challenge given the enormous combination of moves and positions. If LLM is just a look-up table, as noted by LeCun, saturating LLM Chess might be a challenge.

Maxim Saplin @maximsaplin