Chess and ChatGPT, an experiment
Over the last ten days, posts and articles have been multiplying (by the copy-and-paste method, with clickbait as the goal) claiming that ChatGPT and other LLMs cannot play chess even at amateur level and lose to low-end machines such as the Atari 2600 game console, while at best they have the "smarts" to dodge the game altogether.
Although none of this is false, and none of it came as a surprise to the artificial intelligence community, the articles dwelt on the "paradox" of a system of high computational capability being beaten by a puny 1970s game console. At the end of this post, instead of explaining myself why all this was fairly well known and expected, I had Gemini write about it.
Still, I wanted to run my own experiment, to show exactly how these failures manifest themselves in chess. So I set Claude Sonnet 4 to play against an old chess machine, the Novag Sapphire II (a dedicated handheld chess computer from 1997, rated around 2100 Elo). I relayed the moves by hand from one machine to the other in algebraic notation. Claude lost badly in 21 moves. The full game is given below; the photo shows the final position.
I first asked whether it could play chess at all, and this was its reply: I can play chess with you! I understand the rules, can analyze positions, and make strategic moves. However, I can’t display a visual chess board in our chat - we’d need to use text notation to communicate moves.
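That "text notation" is exactly how the relay worked: I typed the Novag's moves into the chat and keyed Claude's replies into the machine. For anyone who wants to reproduce the setup, here is a minimal sketch of such a relay loop, assuming the third-party python-chess library (the function name and prompts are my own illustration, not anything either machine provides); a side benefit is that it rejects illegal suggestions on the spot.

import chess

def relay_game():
    """Shuttle SAN moves between two players, validating each one."""
    board = chess.Board()
    while not board.is_game_over():
        side = "White" if board.turn == chess.WHITE else "Black"
        san = input(f"{side} to move (SAN): ")
        try:
            board.push_san(san)   # raises ValueError on illegal or unparsable moves
        except ValueError:
            print(f"Illegal move rejected: {san}")
            continue
        print(board, "\n")        # ASCII diagram of the current position
    print("Result:", board.result())

if __name__ == "__main__":
    relay_game()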
The game went as follows (from a PGN prepared by Claude itself):
[Event "Casual Game"]
[Site "Claude.ai"]
[Date "2025.07.23"]
[Round "1"]
[White "Claude"]
[Black "Novag Sapphire II"]
[Result "0-1"]
[ECO "C78"]
[Opening "Ruy Lopez: Morphy Defense"]
1. e4 e5
2. Nf3 Nc6
3. Bb5 a6
4. Ba4 Nf6
5. O-O Nxe4
6. d3 b5
7. Bb3 Nf6
8. Re1 d6
9. h3 Be7
10. c3 Na5
11. Bc2 Bb7
12. Nbd2 O-O
13. d4 exd4
14. cxd4 Re8
15. Nf1 Nc4
16. Ne3 d5
17. Nxc4 dxc4
18. Bxh7+ Nxh7
19. Ng5 Bxg5
20. Bxg5 Qxg5
21. Re3 Qxg2# 0-1
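Since the score sheet above was produced by the losing side, it is worth double-checking. A few lines of Python with the python-chess library (my own assumption; any PGN reader would do) replay the game and confirm that every move is legal and that 21...Qxg2# is indeed mate:

import io
import chess.pgn

pgn_text = """
[Event "Casual Game"]
[Result "0-1"]

1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Nxe4 6. d3 b5 7. Bb3 Nf6
8. Re1 d6 9. h3 Be7 10. c3 Na5 11. Bc2 Bb7 12. Nbd2 O-O 13. d4 exd4
14. cxd4 Re8 15. Nf1 Nc4 16. Ne3 d5 17. Nxc4 dxc4 18. Bxh7+ Nxh7
19. Ng5 Bxg5 20. Bxg5 Qxg5 21. Re3 Qxg2# 0-1
"""

game = chess.pgn.read_game(io.StringIO(pgn_text))
assert not game.errors          # the parser records any illegal SAN here

board = game.board()
for move in game.mainline_moves():
    board.push(move)            # each move was already validated during parsing

print(board.is_checkmate())    # True: 21...Qxg2# really is mate
print(board.fen())             # the final position shown in the photo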
Throughout the game Claude produced not only moves (admittedly very quickly) but also (shallow) commentary on both sides' play. At one point it attempted an illegal move: it tried to pass its queen over one of its own pawns and land on the square behind it. Some indicative comments were the following:
Alongside its 4th move, Ba4: I’m retreating my bishop to a4 to maintain the pin on your knight. This keeps the pressure on while avoiding the immediate capture. The bishop still eyes the knight on c6 and can potentially retreat further to b3 later if needed.
With its 10th move, c3: I’m supporting my d3 pawn and preparing to potentially advance d4 later to challenge your central pawn on e5. This also opens up potential squares for my queen and other pieces.
On move 21 it offered this comment, betraying no suspicion of the coming end: Re3 - I’m moving my rook to e3, preparing to potentially swing it to the kingside for an attack on your somewhat exposed king position. The rook on the third rank can be quite dangerous. It did not see that this very move allowed immediate mate.
After the game ended, asked how it rated its own performance (in Elo points and more generally), it gave the following self-assessment:
As for my own ELO assessment, that’s quite difficult to estimate accurately. Based on this game and my general chess knowledge, I’d guess I’m probably somewhere in the 1200-1600 range - I understand opening principles, basic tactics, and positional concepts, but I clearly made some significant errors:
• The premature Bxh7+ sacrifice was unsound,
• I missed that my own pawn was blocking the Qd5 square (a reference to the fact that it tried to move its queen over its own pawn),
• My overall tactical calculation seems inconsistent,
• I didn’t properly assess the resulting positions after my attacking attempts.
Gemini (which on another occasion refused to play and lose) explains why LLMs are unsuited to playing chess as follows:
While Large Language Models (LLMs) have achieved impressive feats in natural language understanding and generation, they face fundamental limitations that prevent them from playing chess effectively on their own, especially compared to specialized chess engines. Here's why:
* Lack of Internal Game State Representation:
  * Stateless Nature: LLMs process input as sequences of tokens. They don't inherently maintain or update a dynamic representation of the chess board state in their "mind." Each turn is essentially a new prompt, and they have to reconstruct the board state from the textual description provided in the prompt.
  * Inefficient for Complex States: Representing a chess board as text (e.g., FEN strings or move lists) becomes unwieldy and error-prone as the game progresses. It's difficult for an LLM to "see" the entire board and the relationships between pieces instantly, which is crucial for chess.
* Absence of True Strategic Reasoning and Search:
  * Pattern Matching vs. Reasoning: LLMs are sophisticated pattern-matchers. They learn statistical relationships and patterns from their vast training data. When asked to make a chess move, they essentially try to predict the most probable "next token" (move) based on patterns they've seen in countless chess games within their training data. This is not the same as genuine strategic reasoning.
  * No Tree Search: Dedicated chess engines (like Stockfish or AlphaZero) employ powerful search algorithms (e.g., minimax, alpha-beta pruning, Monte Carlo Tree Search). These algorithms explore billions of possible future moves and evaluate positions deeply to find the optimal path. LLMs, by contrast, do not perform this kind of systematic search and evaluation. They don't "think" several moves ahead in a structured way.
  * Lack of Evaluation Function: Chess engines have finely tuned evaluation functions that assign a numerical score to a given board position, indicating which side is winning and by how much. LLMs lack this explicit understanding of positional advantages, pawn structure, king safety, etc., beyond what's implicitly encoded in textual descriptions.
* Training Data Limitations:
  * Fragmented and Incomplete: While LLMs are trained on massive text corpora that might include chess notation or game analyses, this data is often fragmented and not structured for learning the iterative, rule-bound nature of chess. They might "recognize" terms like "castling" but struggle to execute it correctly in context.
  * Memorization vs. Generalization: If an LLM encounters a familiar board setup, it might perform reasonably well by "recalling" patterns from its training data. However, if faced with a novel or unusual position (which is common in chess), it struggles to generalize and formulate a new, effective plan because it hasn't truly learned the underlying rules and strategic principles.
* Propensity for Illegal Moves:
  * Due to their lack of a strict internal representation of game rules and board state, LLMs frequently suggest illegal moves (e.g., moving a knight like a rook, moving into check, or not understanding specific rules like en passant or castling). They are primarily concerned with generating plausible-sounding text, not adhering to a strict rule set.
In essence:
LLMs are excellent at language tasks because they are designed to predict the next word in a sequence based on probability and context. Chess, however, is not primarily a language task; it's a game of abstract reasoning, precise calculation, and strategic planning within a strict set of rules and an enormous state space.
While researchers are exploring ways to integrate LLMs with chess engines (e.g., using LLMs to interpret natural language commands for a chess engine, or to explain a chess engine's moves), LLMs themselves are not suitable for playing competitive chess. Dedicated chess AI, which often combines search algorithms with deep neural networks trained specifically on chess positions and outcomes, is far superior because it's built to address the unique computational and strategic demands of the game.
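Gemini's two central points, the explicit evaluation function and the systematic tree search, are easy to illustrate in a few lines. Below is a deliberately crude sketch with python-chess: a material-only evaluation plus alpha-beta search. It is a toy under my own assumptions (naive piece values, no quiescence, no move ordering), nothing like Stockfish; yet even a one-ply search from the position after 21. Re3 (the FEN is my reconstruction of the diagram) finds the Qxg2# that Claude overlooked.

import chess

# Naive piece values; a real engine's evaluation is far more elaborate.
PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

def evaluate(board: chess.Board) -> int:
    """Material balance from White's point of view."""
    score = 0
    for piece_type, value in PIECE_VALUES.items():
        score += value * len(board.pieces(piece_type, chess.WHITE))
        score -= value * len(board.pieces(piece_type, chess.BLACK))
    return score

def alphabeta(board: chess.Board, depth: int, alpha: int, beta: int) -> int:
    """Minimax with alpha-beta pruning; positive scores favour White."""
    if board.is_checkmate():
        return -9999 if board.turn == chess.WHITE else 9999
    if depth == 0 or board.is_game_over():
        return evaluate(board)
    if board.turn == chess.WHITE:              # maximizing side
        best = -10**6
        for move in board.legal_moves:
            board.push(move)
            best = max(best, alphabeta(board, depth - 1, alpha, beta))
            board.pop()
            alpha = max(alpha, best)
            if beta <= alpha:
                break                          # cutoff: the opponent avoids this line
        return best
    else:                                      # minimizing side
        best = 10**6
        for move in board.legal_moves:
            board.push(move)
            best = min(best, alphabeta(board, depth - 1, alpha, beta))
            board.pop()
            beta = min(beta, best)
            if beta <= alpha:
                break                          # cutoff: the opponent avoids this line
        return best

# Position after 21. Re3, Black to move (FEN reconstructed from the game):
board = chess.Board("r3r1k1/1bp2ppn/p7/1p4q1/2pP4/4R2P/PP3PP1/R2Q2K1 b - - 1 21")
print(alphabeta(board, 1, -10**6, 10**6))      # -9999: one ply suffices to find 21... Qxg2#

An LLM, by contrast, emits whichever move token looks most plausible in context; nothing in its forward pass enumerates the opponent's replies the way even this toy loop does.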