Did Niemann Cheat?
In the wake of the current chess controversy, a number of grandmasters publicly revealed that Hans Niemann’s recent rise into the 2700 club has been the topic of rumors for a while. Hans has gained over 200 Elo points since 2020, which is unprecedented at this level. Is Hans just an incredible genius, or could it be that he is cheating somehow? Defenders of Niemann point to the fact that he has played more over-the-board games than most of his peers, which may account for part of this incredible ratings gain. Detractors, however, note that Niemann has admitted to cheating at least twice online. In fact, Chess.com banned Niemann and stated publicly that the frequency of his misdeeds far exceeds his claims, casting doubt on the veracity of the prodigy’s other proclamations.
Many players and enthusiasts have taken to analyzing Niemann’s play with engines, or looking into the details of his rating gain. Data scientist Caleb Wetherell from Pawnalyze.com investigated the suspect’s chess record based on the Elo ratings of the players in each individual match. In chess, a player’s skill is measured by their Elo rating, which increases when the player wins and decreases when they lose. The expected probabilities of winning and losing can be calculated from the Elo formula or, as in Wetherell’s case, from a custom machine learning model. By simulating Niemann’s matchups thousands of times, Monte Carlo-style, we obtain a distribution of probable outcomes, given the ratings of the players at the time of play.
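To make the mechanics concrete, here is a minimal sketch of such a simulation. It uses the raw Elo expectation formula rather than Wetherell’s trained model, a made-up four-game schedule, and an assumed K-factor of 10; every name and number here is illustrative, not taken from the actual analysis.

```python
import random

def expected_score(player_elo, opponent_elo):
    """Standard Elo expected score for the player (1 = win, 0.5 = draw, 0 = loss)."""
    return 1.0 / (1.0 + 10 ** ((opponent_elo - player_elo) / 400))

def simulate_gains(games, k=10, runs=10_000, seed=1):
    """Monte Carlo: resample the outcome of every game many times and
    record the total rating gain of each simulated run.

    `games` is a list of (player_elo, opponent_elo) pairs at the time of
    play. For simplicity each game is resolved as a draw-free win/loss
    trial with probability equal to the expected score; the real
    analysis plugs in win/draw/loss probabilities from a trained model.
    """
    rng = random.Random(seed)
    gains = []
    for _ in range(runs):
        total = 0.0
        for player_elo, opponent_elo in games:
            e = expected_score(player_elo, opponent_elo)
            score = 1.0 if rng.random() < e else 0.0
            total += k * (score - e)  # standard Elo update rule
        gains.append(total)
    return gains

# Hypothetical mini-schedule for a 2550-rated player.
games = [(2550, 2600), (2550, 2500), (2550, 2650), (2550, 2550)]
gains = simulate_gains(games)
mean_gain = sum(gains) / len(gains)
```

Because the simulated win probability here equals the Elo expectation itself, the mean of `gains` hovers around zero by construction; the interesting part of the real analysis is substituting better outcome probabilities and a real game list.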
Wetherell’s article on the Niemann case is interesting, and GM Hikaru Nakamura briefly showed the results on his stream. This is also where I first saw it, and immediately checked out the research. It seems to show that Niemann’s performance is almost impossibly good, given the expected results of his games.
Of course, Wetherell is careful to point out the caveats of these results, the biggest of which is the implicit assumption that Elo is in fact an accurate measure of skill. In reality, Elo is only an accurate measure of skill if two conditions are met: the sample size is adequate, and the player’s Elo has stabilized around a value. As long as a player’s Elo is increasing, it is by definition an inaccurate measure of skill. For young and rising chess players, the rating is often a weak indication of potential, and this is especially true during the Covid-19 pandemic, when over-the-board tournaments significantly decreased in frequency. Many observers have commented on the wave of “post-covid youngsters”, who are smashing expectations and playing far above their rating level.
If we are to draw any conclusions from these simulations, then we can’t just look at Niemann in isolation, but must compare him to his peers. And that is what I’ve attempted to do, building on Wetherell’s foundation. In order to preserve the integrity of the experiment (and because I’m lazy) I’ve taken the liberty of using Pawnalyze’s predictive modelling, with a few modifications. I shall only be looking at classical chess games played since January 2020, the data for which I got from the Chessable Mega 2022 Database (so blame them if games are missing!).
Niemann’s Expected Rating Gains
First, I recreated Wetherell’s results, to check that everything was working correctly. Note that my distribution is slightly different, as I’m looking at all games starting in 2020, while the previous analysis started in 2021. Based on this simulation, we can see that the expected mean rating gain for Niemann is 214 points, with a standard deviation of 6.49. This makes his actual gain of 228 points in this period quite astounding. In fact, the probability of him achieving the result he did is approximately 1.5%.
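That 1.5% figure follows from a normal approximation to the simulated distribution; the arithmetic can be sketched like this, using the numbers above:

```python
from math import erf, sqrt

def normal_tail(x, mean, sd):
    """P(X >= x) for X ~ Normal(mean, sd), via the error function."""
    z = (x - mean) / sd
    return 0.5 * (1 - erf(z / sqrt(2)))

# Simulated mean 214, standard deviation 6.49; actual gain 228.
p = normal_tail(228, 214, 6.49)  # about 0.015, i.e. roughly 1.5%
```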
In isolation, this may look very fishy indeed, but before we jump to the beckoning, anal-bead-computer-shaped implication of this bell curve, let’s look at his peers.
While Niemann tops the list of overall gainers in the last three years, the competition is not far behind, and impressive strides have been made by other world-champion hopefuls. It should, however, be noted that Niemann’s rise is even more impressive given that he plays in a generally higher rating range than his peers.
As we can see, the expected and real Elo gains vary wildly amongst this group of upcoming players, and Niemann’s result is, by comparison, less dramatic. The gap between what we should expect and what actually happens shows that the Elo of these players is generally not a good measure of their chess ability. On average, these players scored 33 points, or a whopping 6.6 standard deviations, outside of the expected result.
Subtracting the expected results from the actual results gives us a distribution of differences, in which Niemann is no longer an outlier, but a comfortably average player.
Given this statistic, and ignoring for a moment the small sample size upon which this distribution is based, we can calculate the probability of Niemann’s rating gain in relation to the expected outcome of his matches. The chance of Niemann scoring above his expected Elo in the way he did is 58%. That is, it is more likely than not, given the general behaviour of Elo-expectation-to-reality of his peer group, that Niemann’s outcome is unremarkable.
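For readers who want to replicate this step, the calculation amounts to asking where Niemann’s over-performance (actual minus expected gain) falls within the peer group’s distribution. Here is a sketch with entirely made-up peer numbers, not the data behind the 58% figure:

```python
from math import erf, sqrt

# Hypothetical actual-minus-expected Elo gains for a ten-player peer
# group; illustrative values only, not the data used in this article.
peer_over_performance = [52, 41, 28, 60, 35, 19, 47, -5, 33, 20]
niemann_over_performance = 228 - 214  # 14 points, from the simulation above

n = len(peer_over_performance)
mean = sum(peer_over_performance) / n
sd = sqrt(sum((x - mean) ** 2 for x in peer_over_performance) / (n - 1))

# Probability of over-performing by at least this much, assuming
# over-performance is roughly normal across the peer group.
z = (niemann_over_performance - mean) / sd
p = 0.5 * (1 - erf(z / sqrt(2)))
```

With these invented numbers, `p` lands well above 50%, mirroring the qualitative conclusion: relative to a group of peers who collectively outrun their Elo expectation, Niemann’s over-performance is unremarkable.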
Whether or not Niemann cheated over the board is an open question that cannot be definitively answered by this type of analysis, but what we can say is that his results are not as crazy as a simple Elo simulation might suggest, given his peer group. Others have already pointed out that nothing in Niemann’s games is particularly suspicious either, when analyzed with computers. I think at this point, the preponderance of the evidence suggests that Niemann did not achieve his impressive ratings rise by cheating.
From a moral point of view, I should also reaffirm an important principle of justice: we must always presume innocence until guilt can be definitively established. More importantly perhaps, I would suggest that a world in which the guilty sometimes go unpunished is preferable to the world in which the innocent are crucified. The recent chess world drama shows that public judgement can swing wildly before all facts are known, and I suggest we all calm down and wait a bit before potentially destroying someone’s career.
The above statistics rest on a number of assumptions, and I should mention the biggest ones. First, I cannot guarantee the quality of the data. While Chessbase is considered the gold standard of chess databases, it may be that some games are missing, thereby skewing the simulated results of some players. Second, I rely on the accuracy of Wetherell’s LightGBM model, which I trained using the same specification he uses (Caissabase data). Perhaps someone would like to expand on this research and replicate the results to test their validity.