Queens Gambit Chalange

Share This Post

Hi,

This is #queensgambitchallenge, organized by Maven Analytics.

The dataset provided is data for 20,000+ chess games played on Lichess, including moves, victor, rating, opening details and more: https://www.mavenanalytics.io/data-playground

So far, the response to this challenge is great, and we have seen a lot of great dashboards and phenomenal visualizations that inspired me.

Analyzing the data and looking at the reports of the other participants, I noticed a couple of illogicalities (problems with data, calculated KPIs and etc) that I wanted to point out in my report.

The text would be too long for the LinkedIn post, so I opted for the LinkedIn article option. And this article could be my contribution to this post from Maven’s Aaron Parry: https://www.linkedin.com/posts/amp83_queensgambitchallenge-data-analytics-activity-6756212502528315392-PF66

This article will be divided into two parts:

Data cleaning and ad-hoc analysis
Report/dashboard

DISCLAIMER: It’s important to know that I am a chess amateur. I do play it occasionally. I do know some basic stuff but I am not familiar with all chess theory concepts, regulations and notation. If you notice mistakes, please, let me know. 🙂

1. DATA CLEANING and AD-HOC ANALYSIS

Because of familiarity with the data provided and because of the lack of time, will do the shortened procedure – after loading it to excel / power query, I immediately start exploring data by categories (columns) and asking myself two sets of questions:

What do you see here? What kind of data/data types? What are the ranges? Are there unique values, are there duplicates and should they be there?
What do you expect to see here, based on the knowledge of the topic?

The structure of the table consists of 17 columns: game_id, rated, turns, victory_status, winner, time_increment, white_id, white_rating, black_id, black_rating, moves, opening_code, opening_moves, opening_fullname, opening_shortname, opening_response, opening_variation

COLUMN: “game_id” – When I get columns like “IDs”, first on my mind is to check if this could be some kind of unique key for each row and to check if there are some duplicates in this column.

20.058 unique IDs provided, no duplicates, everything looks good. Possible use in the report: unique key

2. COLUMN: “rated” – Only two unique values that are repeated: TRUE and FALSE. Indicates if the game is rated or not, furthermore, player ratings are changed bases on that. So far so good. Possible use in the report: filtering

3. COLUMN: “turns” – This represents a cumulative number of moves for the white and black player. In some reports, we can see that number of moves is stated as a number of turns and this is not correct.

The first move is when white moves a figure, then the black player responds – that is the first move for both players, but it has 2 turns.

Let see “official” chess notation:

In this match, there are 20 moves by the white player and 20 moves by the black player.

In sum, they make 40 turns.

I found this important, because of a few more columns whose analysis “depends” on these values.

Like column: time_increment, moves, opening_moves

It’s very important that column heading has a meaningful and correct name for each data type, that describes them the best.

If we start calling something “turn” then it should be called “turn” in every other column and not “move”.

So, to conclude, when in the column “turns” find value: “349” it means that white player made 175 moves, and black player 174 moves. Possible use in the report: filtering, calculations and segmentation of match length.

4. COLUMN: “victory_status” – Only four unique values that are repeated: Out of Time, Resign, Mate, Draw. Describes match ending. Possible use in the report: filtering

5. COLUMN: “winner” – Only three unique values that are repeated: White, Black, and Draw. Describes match ending. Possible use in the report: filtering

6. COLUMN: “time_increment” – I saw a lot of different analysis of this attribute, and really think that they are incorrect. (Ouups :)) KPIs like “average match time“, or “average added time” are not correct, and they are hard/impossible to calculate from this dataset.

As far as I know, the meaning is this: “15+2″ means that white and black player get 15 minutes each to play all of their moves and that they get 2 seconds after each move (two seconds are added to players after every move).

15+15 = 15 minutes for each player to finish all moves, and 15 seconds are added to each player clock after each move.

If we create a match with a computer on https://lichess.org/, Minutes per side:5, Increment in seconds: 5

So, for this match, time_increment value would be “5+5”. Max match time would be around 10 minutes – “5 minutes for all moves for the white player” + “5 minutes for all moves for the black player” + “number of turns” x “5 seconds”.

From the provided data set, we can’t say “average match time/duration” is around 13 minutes (not correct) or “average added time/duration” is around 5 seconds (also not correct).

We can say “average max match length“. We need to transform column “time_increment”, use delimiter “+”, and create five additional columns (or do calculations in one):

1. minutes_for_all_moves_per_player – each player has XY minutes for the match – all moves

2. added_seconds_per_move_per_player – each player has XY second added to the clock, after each move – these two new columns are the product of transformed “time_increment” column

3. max_minutes – Maximum match length in minutes, “minutes_for_all_moves_per_player” * 2 players

4. actualy_added_seconds – Number of white moves + a number of black moves gives us the number of turns, and we have that in column “turns“. So, the value from column “turns” * “added_seconds_per_move_per_player” gives us added time in seconds.

5. max_match_length – Here we can calculate the max time of the match in minutes. max_minutes + (actualy_added_seconds / 60)

And we can only calculate MAXIMUM match length not ACTUAL match length because we don’t know the time difference between every turn of a white and black player.

Doing this transformation and calculation in raw data – gives us an AVERAGE MAX LENGTH per GAME = 32,46 minutes (not actual match length) and AVERAGE ADDED SECONDS per GAME = 288,63 seconds.

!!!THIS IS DONE AS EXAMPLE IN RAW DATA, BUT I THINK THAT DATASET SHOULD BE REDUCED AND SOME OF the DATA DELETED. So, in the end, these calculations will be different. I do hope that my chess knowledge is good so far and that I am making no mistakes. Please, feel free to correct me. 🙂

I will jump through the next few columns as there is nothing special there.

7. COLUMN: “white_id” – Player ID playing as white

8. COLUMN: “white_rating” – Actual player’s rating playing as white

9. COLUMN: “black_id” – Player ID playing as black

10. COLUMN: “black_rating” – Actual player’s rating playing as black

12. COLUMN: “opening_code” – The ECO Codes is a classification system for the chess openings moves

14. COLUMN: “opening_fullname” – Nothing special here.

15. COLUMN: “opening_shortname” – Nothing special here.

16. COLUMN: “opening_response” – Nothing special here.

17. COLUMN: “opening_variation” – Nothing special here.

13. COLUMN: “opening_moves” – This one is tricky, is it “moves” or “turns”?

In dataset game_id=1, opening_code=D10, opening_moves=5

Are there 5 moves by white and 5 moves by black, so 10 turns? Or “5” indicates 3 moves by white and 2 moves by black?

If we lookup ECO Codes on the web:

Info from the web is identical from our dataset, code D10 is recognized as Slav defence, exchange variation, but the highlighted text shows us that there are actually 5 turns not 5 moves. (3 moves by white + 2 moves by black)

Why is this important? Because we have values for column “turns” and with correct values for column “opening_moves” we can look up some interesting ratios regarding match length and openings.

11. COLUMN: “moves” – The column that got me all suspicious, and made me dig deeper and deeper. 🙂 Hate this column. 🙂

Okay, when you see column like this, full of chess notation/players moves there is the well-known road. We can transform data/column, use delimiter space ” “, try to analyse every move, give it some attribute, like was that white’s move or black’s move, is it special move, is it capturing another piece, which figure does it involve and so on…maybe to transpose data (there comes excel’s limit of the number of columns – cant put 20.000 raws in transpose position as columns)

But while I was on raw data, I wanted to check exact duplicates in column “moves”.

When we have data set with more than 20.000 raws (games) and it’s about chess… you have to expect that some matches will have exact same moves, from the beginning to the end. Especially when it comes to the shorter matches, up to, maybe, 20 moves (40 turns). Openings and responses are well known, especially among high rated players, and many matches are played automatically.

So, exact matches in “moves” columns are not so strange to see. Let’s sort it by column “turns“, from smallest to largest:

Again, nothing strange, we have one or two moves and as a result “Out of time”, “Resign”, “Draw”. Column “moves” is actually a string, and for every duplicated value it’s conditionally formatted to be red.

But, as my mouse start scrolling through this red column, I am starting to dislike this column. 🙂 There are identical moves with more than 50 turns, and even more…

Working with conditional formatting in excel with so many rows is a really messy thing. It’s not shading all duplicated values, it is very laggy, and usually crashes excel, so everything is double-checked in power query 🙂

Let’s sort column “turns“, from largest to smallest, and watch out for values in column “moves“:

game_id 11555 = game_id 13860 (349 turns = 175 white + 174 black) values in column “moves” are identical.

game_id 5423 = game_id 18327 (210 turns = 105 white + 105 black) values in column “moves” are identical.

game_id 13140 = game_id 14758 (198 turns = 99 white + 99 black) values in column “moves” are identical.

Conclusion – there is nothing special when you have an identical chess game while you have a small number of moves, but when you have more than 100 turns (more than 50 moves by each player) then it is worth digging deeper.

It is still possible to have identical moves in matches, and we have to check values in other columns. For example, when we find duplicated values in column “moves” it’s worth checking are those players the same players or not, with the same rating or not etc…

So, if we transform raw data a little bit more, and add another column “control” where we can CONCATENATE all fields from the table except column “game_id” we will get a unique string which will be key for finding exact duplicates through all cells.

There are 946 duplicate values in this dataset. These are identical values from 16 columns. So we have 946 duplicated rows. Values in 16 columns are the same, except game_id value.

What comes from this? Well, I have some theories:

Maven’s Enrique Ruiz did this on purpose. 🙂 Copied data, changed game_id for that raws to see what we will do with that and how detailed can we get in our analysis. (gotcha!) 🙂 EPILOGUE: Remove duplicated values.
There was some problem extracting data from lichess.org. EPILOGUE: Remove duplicated values.
Some players do not play by FAIR PLAY. I found this the biggest problem. It’s possible to have a chess friend, to schedule with him a chess game and to repeat the same moves and to play (rated, unrated) new games over and over again. Building a player’s rating with that methodology is “problematic” and it’s giving fake stats on every column. And even if this is is the case, we can analyse game_id 11555 = game_id 13860 (349 turns = 175 white + 174 black). We have two rated games, played at a different time (from 11555 to 13860) and still the same ratings from both players. EPILOGUE: Remove duplicated values.

So far I removed 946 duplicated values from the dataset.

But problems just keep coming. 🙂

Player “drakon1” and “chiggen” are playing RATED games, they have a couple moves variations e3 d6 / e3 e6 / e4 e6 / and after that they RESIGN, and we can notice how drakon1’s rating has changed from 856 to 926.

Just look at “thebestofthebad” and “projetoxadrez“.

There are so many matches that were played this way, and I do not know is it legible to call it cheating? 🙂

Every “Resign” and “Out of time” within the first 10 moves / 20 turns, should be inspected and removed if similar behaviour is noticed, at least in ranked matches.

But, due to the lack of time, I will leave it this way. I am satisfied with the already removed 946 games. 🙂

IMPORTANT – WHAT IF THIS WAS A REAL JOB and not a fun challenge? 🙂

I would sit down with the client and present the findings to him. The client must define the scope and purpose of the analysis.

If the purpose of the analysis is to test chess theory: openings and responses, frequency, moves analysis etc… then I would strongly recommend deleting this data.
If the purpose of the analysis is something else like total time played / time spent on the website, or some other metrics or just the way how to fake rating 🙂 … then everything must be as is, nothing to delete in the dataset.

That’s all for Data cleaning and ad-hoc analysis. 🙂

2. REPORT / DASHBOARD DESIGN

From my point of view, the starting point always should be playing with the data and finding interesting facts. After exploring the dataset, I start putting ideas and points of interest to the board.

Will be using standard graphs for simplified work. All of these facts can be easily calculated and presented, so I will only show here a little bit different approach for some visualizations.

Using a standard column chart with Pareto line is always good to distinguish data series with the greatest impact.

There is a significant rating gap between players and it would be nice to somehow visualize it. Here, I will try to use the scatter chart, with dots, to show five games, with bottom dots as a player with a lower rating, and upper dots as a player with the higher rating. Rating difference can be shown as an absolute value here, as a dot, but formatted as a line.

It’s not chess analysis if you don’t have a chessboard shown in a report. 🙂

I find chessboard perfect for creating maps/heatmaps because it could be easily transformed as a heatmap base. For these custom maps, Excel 3D map does a perfect job.

First, I need to draw perfect 8×8=64 squares and colour them – that is easy with excel.

Centar of the chessboard will be 0, and X, Y axes will divide the board into 4 quadrants.

I want to “print” values at the centre of each of 64 squares, so I have to determine the right coordinates for it.

For example, from the 0, the increment would be 1, and the centre of the square E5 is (0,5; 0,5)

64 squares must be mapped this way: they will have Y coordinate from -3,5 to 3,5 and X coordinate from -3,5 to 3,5.

With data populated in table model like this one, Excel 3D maps can shine now. (Works great with a data model)

For data to display properly and to be shown in the middle of each square (64) it’s important to set X and Y values to 0,5 higher/lower than in the preparation table.

There are a few things that should be done before we populate the prep table with data. If I want to show data about games that ended in MATE, I am filtering, searching and analysing the last move in the game, in chess notation marked with “#“.

MATE moves here are strings with up to 7 characters and I only need to extract chessboard square position from that (A1 or B7 or etc). In chess notation, the last number in string is number dedicated to the square, usually, there is only one number in move notation, but there are some cases when we can see two numbers. So If I find the position of the last number in a string, I can easily extract square coordinates. Example, MATE move = Nb6#, Night to B6, mate. The number is always one position higher than a letter of the chessboard in the string.

Colouring Maven’s colours, so everything must be themed now, and the next move is collecting all elements together.

Here is the final result:

Link to report in pdf: https://drive.google.com/file/d/12RmeT0H81Pxnh7abc8WaZdIJRKn6_Duy/view

Link to report in jpg: https://drive.google.com/file/d/12V5devCnaOJ9ypKt9IDge0A5_kvT-a9n/view?usp=sharing

Full Queens Gambit Challenge details on Maven site: https://www.mavenanalytics.io/blog/queens-gambit-challenge

So, that would be all, I hope that someone will find this interesting and helpful. Please, leave your comments and critics. 🙂

Feel like playing chess game? See you on Lichess. 🙂

One Response

Adrien SLR says:

January 10, 2023 at 2:39 pm

Hey!
Sorry to bother but as an aspiring data analyst your work looks amazing to me ! You are the first to go the extra mile on this challenge but I would like to know more about how you could sort by “turn” ?

Reply

Leave a Reply Cancel reply

Uros

Uros is an analyst in making. Worked in the construction and waste management industry. Project management, sales, public procurement... regardless of the type of the problem, the solution was always in the data... and in the Excel of course.

More To Explore

Excel

Excel: Jump to the first blank cell

When we work with a table that has many rows, there is a need to quickly jump to the first blank cell. Let’s imagine that

Uros February 18, 2021

Excel

Queens Gambit Chalange

Hi, This is #queensgambitchallenge, organized by Maven Analytics. The dataset provided is data for 20,000+ chess games played on Lichess, including moves, victor, rating, opening details and

Uros February 5, 2021

CATEGORIES

Excel (2)

POSTS

Subscribe To Newsletter.

Lets stay connected and learn together.

uros@analyst2be.com

© All Rights Reserved 2021