A Data Science Journey with Lasso Regression to Predict Chess Game Winner.

Did you know that there are more possible variations of chess games than there are atoms in the observable universe?

Introduction

Chess is a noble game with a long association with royalty. It is often referred to as ‘the game of kings’ because the King, or Shah, is the most important piece on the board. The game ends in a checkmate, a word derived from the Persian ‘Shah Mat’, which roughly means ‘the king cannot escape’.

It is a timeless battle of strategy, skill, and foresight: a complex mix of simple moves that makes every game unique. Playing chess well requires a high level of concentration and intelligence.

What if we could train a model to understand the moves of a chess game and predict its outcome? Brilliant, right? That is what this article is all about.

I know what you are thinking: why does predicting the outcome of this game matter in data science? Chess is a strategic and complex game that requires constant decision-making, so training a model to correctly predict its outcome is a stepping stone towards understanding and designing AI systems capable of strategic planning and decision-making, which are fundamental challenges in data science and AI.

Dataset Overview

For this project, I used over 20,000 chess games from Lichess.org. The collection provides detailed information about each game, making it perfect for analysis. The dataset includes:

  • Game ID: A unique identifier for each game.

  • Rated: Whether the game was rated or not (True/False).

  • Start and End Times: When the game began and ended.

  • Number of Turns: Total moves made during the game.

  • Game Status: The final state of the game (e.g., checkmate, draw, etc.).

  • Winner: The outcome—White, Black, or Draw.

  • Time Increment: Time added per move in seconds.

  • Player Details: IDs and ratings for both White and Black players.

  • Moves: A full list of all moves played, in standard chess notation.

  • Opening Eco: The ECO code (a standardized opening identifier).

  • Opening Name: The name of the opening.

  • Opening Ply: The number of moves in the opening phase.

There are a lot of things you can do with this dataset, like exploring patterns, analyzing strategies, and building predictive models.

The target variable in this dataset is the winner column, which records the outcome of each game: White, Black, or Draw.

The goal of this project is to uncover factors that influence game outcomes and develop a model that can accurately predict winners.

Challenges Encountered and How I Solved Them

While analyzing the data and preparing it for modeling, I encountered a few challenges which threw me off course for a while.

Class Imbalance: The three classes in the target column, winner, are unevenly distributed, which could hurt generalization and leave the model performing poorly on the rarer classes. To tackle this problem, I resampled the training data with SMOTE (Synthetic Minority Over-sampling Technique), which is available in the imbalanced-learn (imblearn) library. SMOTE oversamples the minority classes by generating synthetic examples until each class has as many occurrences as the majority class.

Distribution of the winner column in the dataset.

from imblearn.over_sampling import SMOTE

#oversampling for class imbalance
smote = SMOTE( random_state=42)
X_res, y_res = smote.fit_resample(X_train_encoded, y_train_encoded)

#count the num of unique values in y_res
import numpy as np
y_i, count = np.unique(y_res, return_counts=True)
print(np.asarray((y_i, count)).T)

The output of the oversampling shows an equal count for every class in winner.
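To make the idea behind SMOTE more concrete, here is a minimal sketch of how one synthetic point can be created: pick a minority sample, find a nearby minority sample, and place a new point somewhere on the line between them. This is a simplification for illustration only. The function name and the toy data are made up, and the real algorithm picks at random among several nearest neighbours rather than always using the single closest one.

```python
import random

def smote_like_sample(minority, rng):
    """Create one synthetic point between a minority sample and its nearest minority neighbour."""
    base = rng.choice(minority)
    # Nearest other minority point by squared Euclidean distance
    neighbour = min(
        (p for p in minority if p is not base),
        key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
    )
    gap = rng.random()  # random position along the segment, in [0, 1)
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]

# Toy minority class with two made-up numeric features per game (not real data)
minority_points = [[1500.0, 40.0], [1520.0, 44.0], [1490.0, 38.0]]
rng = random.Random(42)
synthetic = [smote_like_sample(minority_points, rng) for _ in range(5)]
print(synthetic)
```

Every synthetic point falls between two existing minority points, so the new examples stay inside the region the minority class already occupies.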

  • Feature engineering and extraction: Understanding each feature in the dataset and extracting meaningful features that would improve the model’s efficiency took extra effort. For instance, extracting meaningful features from the moves column required a lot of understanding of how chess games are documented.

      #extracting features from the moves column of the data
      import pandas as pd

      moves = data['moves']
    
      #function to extract features for each color from the moves column
      def extract_features(moves):
          moves_list = moves.split() #split the elements of the moves column into a list
          white_moves = moves_list[::2] #even indexed moves list to get white moves
          black_moves = moves_list[1::2] #odd indexed moves
    
          features = {
              'white_captures' : sum('x' in move for move in white_moves),
              'black_captures' : sum('x' in move for move in black_moves),
              'white_checks' : sum('+' in move for move in white_moves),
              'black_checks' : sum('+' in move for move in black_moves),
              'white_promotions' : sum('=' in move for move in white_moves),
              'black_promotions' : sum('=' in move for move in black_moves)
          }
    
          return pd.Series(features)
    
      #apply to the moves column
      moves_extracted = moves.apply(extract_features)
    
      #view the first few rows of the extracted df
      moves_extracted.head()
    

  • Feature encoding: Finding an efficient method to encode categorical variables with high cardinality (variables with too many unique values) also posed a problem. I researched several encoding techniques and picked the one that best suited my data (target encoding).
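As a rough illustration of what target encoding does, the sketch below replaces each category with a smoothed average of the target for that category. It is not the exact encoder used in the project: the function name, the smoothing value, and the toy data are assumptions, and the target is simplified to a binary "white won" flag instead of the three-class winner column.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets, smoothing=10.0):
    """Map each category to a smoothed mean of the target (hypothetical helper)."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    # Rare categories are pulled toward the global mean by the smoothing term
    mapping = {
        category: (sums[category] + smoothing * global_mean) / (counts[category] + smoothing)
        for category in counts
    }
    return mapping, global_mean

# Toy training data (made up): opening name and whether White won (1) or not (0)
openings = ["Sicilian Defense", "Sicilian Defense", "French Defense", "Sicilian Defense"]
white_won = [1, 0, 1, 1]

mapping, fallback = fit_target_encoding(openings, white_won)
encoded = [mapping.get(name, fallback) for name in openings + ["Unseen Opening"]]
print(encoded)
```

The mapping should be fitted on the training split only and then applied to the test split; otherwise information about the target leaks into the features. Categories that never appeared in training fall back to the global mean.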

Exploratory Data Analysis

Here are some interesting things I found while studying the dataset:

  • The rating difference between players has little effect on the outcome of the game. Across the dataset, I noticed that a player being rated significantly higher than the opponent has very little effect on who wins the game.

    Correlation Heatmap Interpretation
    Black rating is correlated with white rating, which means that players with similar ratings are often paired together. White wins are only weakly correlated with white rating, which implies that a white player's rating does not strongly determine the outcome of the game; the same goes for black wins and black rating. Black wins and white wins are strongly negatively correlated, which simply means that when black wins, white doesn't, and vice versa. Draw is not strongly correlated with any of the columns in the visual.
  • The best opening for each color: for white players, the Philidor Defense and English Opening more often than not seem to be the best openings, while for black players, the Sicilian Defense and the King’s and Queen’s Pawn openings gave more of an advantage. I also discovered that the Caro-Kann Defense and Ruy Lopez are the most likely to end in a draw.

Side note: the most preferred opening in the whole dataset is the Sicilian Defense.

Relationship between popular openings and win rate

This explains that popular openings are popular for a reason.
Popular openings almost always work in your favour if played right.
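For readers who want to reproduce this kind of comparison, the sketch below shows one way to compute the share of white wins, black wins, and draws per opening. The games listed are toy values invented for the example, not rows from the dataset, and the function name is hypothetical.

```python
from collections import Counter

# Toy (opening, winner) pairs, made up for illustration
games = [
    ("Sicilian Defense", "black"),
    ("Sicilian Defense", "black"),
    ("Sicilian Defense", "white"),
    ("Sicilian Defense", "draw"),
    ("Ruy Lopez", "draw"),
    ("Ruy Lopez", "white"),
]

def win_rates(games):
    """Return, for each opening, the fraction of games ending in each result."""
    totals = Counter(opening for opening, _ in games)
    by_result = Counter(games)
    return {
        opening: {
            result: by_result[(opening, result)] / total
            for result in ("white", "black", "draw")
        }
        for opening, total in totals.items()
    }

rates = win_rates(games)
for opening, shares in rates.items():
    print(opening, shares)
```

With real data, it is worth filtering out openings that appear only a handful of times, since their rates are too noisy to compare.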

Feature Engineering

Feature engineering is an essential part of building a predictive model, as it helps to improve performance by transforming the feature space. In this project, I engineered several features to extract important information from them and discarded features that seemed redundant. One example is the time increment column, which stores the time control as a string combining the base time in minutes and the increment in seconds, separated by '+'.

#split the time increment column into two columns
data[['game_time', 'time_increment_per_move']] = data['time_increment'].str.split('+', expand=True)

#change the type of time increment per move and game time
data['time_increment_per_move'] = data['time_increment_per_move'].astype(int)
data['game_time'] = data['game_time'].astype(int)

#get the estimated game duration based on the game time and the time increment
data['estimated_game_duration'] = (data['game_time'] * 60)  + (data['time_increment_per_move'] * data['turns'])

#classify time control into game type
def classify_time_control(row):
    if row['game_time'] <= 2 and row['time_increment_per_move'] <= 1:
        return 'bullet'
    elif row['game_time'] <= 10 and row['time_increment_per_move'] <= 5:
        return 'blitz'
    elif row['game_time'] <= 15 and row['time_increment_per_move'] <= 10:
        return 'rapid'
    else:
        return 'classical'

data['time_control_type'] = data.apply(classify_time_control, axis=1)
redundant_cols = ['start_time', 'end_time', 'time_increment', 'white_wins', 'black_wins', 'draw', 'black_id', 'white_id']
data.drop(columns=redundant_cols, inplace=True)

#check the columns left in the data
data.columns
💡
Index(['rated', 'turns', 'victory_status', 'winner', 'white_rating', 'black_rating', 'moves', 'opening_eco', 'opening_name', 'opening_ply', 'rating_difference', 'game_time', 'time_increment_per_move', 'estimated_game_duration', 'time_control_type'], dtype='object')

Is data preprocessing really that important?

Yes, data preprocessing is very important for model building and, ultimately, for the model’s performance. Here is a list of activities carried out in the data preprocessing stage:

  • Data Cleaning: Here, invalid or missing data points are handled by being removed from the dataset, corrected, or imputed, as the case may be.

  • Feature Engineering: In this stage, each feature is examined to understand its relation to the target variable and then studied to see whether it can be split, or merged with others, into a feature that makes more sense or helps the final model.

  • Data Encoding: Most ML models cannot work with non-numeric data directly. In this step, all non-numeric features are converted to numbers so that the model can use them.

  • Data Scaling: It is important to scale the data so that features measured on very different ranges end up on comparable scales. This matters especially for regularized models, where the penalty treats all coefficients alike, and it helps to improve the model’s performance even more.

The type of encoder or scaler to be used depends heavily on the particular dataset that you are working with.
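As a small worked example of the scaling step, the sketch below standardizes two made-up features by subtracting the mean and dividing by the standard deviation, which is the same transformation StandardScaler applies column by column. The helper name and the numbers are assumptions chosen for illustration.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]

# Two toy features on very different scales (made-up numbers)
ratings = [1200, 1500, 1800]
turns = [20, 60, 100]

z_ratings = standardize(ratings)
z_turns = standardize(turns)
print(z_ratings)
print(z_turns)
```

After standardizing, both features occupy the same range even though the raw ratings were far larger than the raw turn counts. A column where every value is identical has a standard deviation of zero and would need special handling before dividing.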

Modeling Approach

For this project, I decided to stick to simple models, so I delved deep into Linear Regression with regularization.

Plain linear models are often considered too simple, yet with many features they can still overfit by assigning weight to features that carry little signal; regularization was introduced to address this.

In this article, we will focus mainly on Lasso Regression, that is, Linear Regression with L1 regularization.

You might be wondering, how does LASSO Regression work?

LASSO stands for Least Absolute Shrinkage and Selection Operator. LASSO Regression enhances Linear Regression by adding a regularization term to the standard regression objective.

I know you want to know what this ‘regularization’ is, so let me explain.

Regularization is the addition of a penalty term to a model to prevent it from overfitting.
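For LASSO specifically, one common way to write the penalized objective is sketched below. The symbols are assumptions introduced for illustration: n is the number of games, p the number of features, x_i the feature vector of game i, y_i its target, β the vector of coefficients, and α the penalty strength. The intercept is omitted, and the 1/(2n) scaling of the first term varies between texts.

```latex
\hat{\beta} = \arg\min_{\beta} \; \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - x_i^{\top} \beta \right)^2 + \alpha \sum_{j=1}^{p} \lvert \beta_j \rvert
```

The first term is the usual least-squares fit, and the second is the L1 penalty, which grows with the absolute size of the coefficients.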

The strength of the penalty is controlled by alpha. In LASSO, the penalty lowers the variance of the model by shrinking the coefficients (feature importance) of less significant features, often all the way to zero.
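A simple way to see why coefficients reach exactly zero is the soft-thresholding rule, which gives the LASSO solution in the special case of standardized, uncorrelated features. The sketch below applies it to made-up coefficients for a few values of alpha; it illustrates the shrinkage effect and is not the solver used in this project.

```python
def soft_threshold(coef, alpha):
    """Shrink a coefficient toward zero by alpha; anything within [-alpha, alpha] becomes 0."""
    if coef > alpha:
        return coef - alpha
    if coef < -alpha:
        return coef + alpha
    return 0.0

# Made-up unpenalized coefficients for four hypothetical features
raw_coefs = [0.9, -0.4, 0.05, -0.02]

for alpha in (0.0, 0.1, 0.5):
    shrunk = [soft_threshold(c, alpha) for c in raw_coefs]
    n_zero = sum(1 for c in shrunk if c == 0.0)
    print(f"alpha={alpha}: {n_zero} of {len(shrunk)} coefficients are exactly zero")
```

Small coefficients are removed first, and larger ones survive with a reduced magnitude, which is what makes LASSO usable for feature selection.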

I know you are probably thinking that increasing alpha will do the trick. Sorry to burst your bubble, it will not. Increasing alpha too much increases the bias of the model, which causes it to lose sensitivity to significant patterns and leads to underfitting.

The solution to this bias-variance tradeoff is cross-validation: before fitting the final model, we cross-validate to pick the value of alpha that best guards against both overfitting and underfitting.

When to use LASSO

LASSO regression is useful when you:

  • are working with a high-dimensional dataset.

  • need feature selection.

  • have multicollinear features (features that are strongly correlated to one another).

  • want regularization to reduce overfitting.

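In our case, we will be using LASSO because we want feature selection and because our dataset has high dimensions (a lot of columns and features). Since the target is categorical, the code below uses regularized Logistic Regression rather than a regressor; its parameter C is the inverse of the regularization strength, so a smaller C means a stronger penalty.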

from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Create a column transformer for preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),  # Scale numerical features
    ]
)

#define a range of parameters for C (inverse of the regularization strength: smaller C = stronger penalty)
param_grid = {
    "feature_selection__estimator__C": [0.0001, 0.001, 0.01, 0.1, 1, 10, 100], #l1 regularization for feature selection
    "classification__C": [0.0001, 0.001, 0.01, 0.1, 1, 10, 100], #l1 regularization for classification
}

#number of folds
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Define the pipeline
pipeline = Pipeline(steps=[
    ("preprocessing", preprocessor),  # Preprocessing step
    ("feature_selection", SelectFromModel(LogisticRegression(
        penalty="l1", solver="saga", max_iter=1000, random_state=42))),  # L1 (Lasso) penalty for feature selection
    ("classification", LogisticRegression(
        penalty="l1",  # L1 penalty, matching the grid above and the retraining step
        solver="saga",  # saga fits a multinomial model for multiclass targets by default
        max_iter=1000,
        random_state=42))  # Logistic Regression for classification
])

# Perform Grid Search
grid_search = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    scoring="neg_log_loss",  # built-in scorer: negated log loss, so higher is better
    cv=kf,                # Cross-validation folds defined above
    n_jobs=-1,            # Use all available cores
    verbose=1
)

# Fit the grid search on your training data
grid_search.fit(X_train_encoded, y_train_encoded)

# Best parameters and performance
print("Best Parameters:", grid_search.best_params_)
print("Best Cross-Validated Score:", grid_search.best_score_)

Implementing LASSO and performing cross-validation to find the optimal parameters

#get the best parameters
best_params = grid_search.best_params_

#set the best paramters for the pipeline
pipeline.set_params(**best_params)

# Fit the model
pipeline.fit(X_train_encoded, y_train_encoded)

from sklearn.metrics import classification_report, log_loss
# Predict and evaluate
y_pred = pipeline.predict(X_test_encoded)
y_pred_proba = pipeline.predict_proba(X_test_encoded)

#evaluate the model
report = classification_report(y_test_encoded, y_pred)
log_loss_value = log_loss(y_test_encoded, y_pred_proba)

#print the output
print(f"Classification Report: \n{report}")
print(f"Log loss: {log_loss_value}")

Output of the model's performance

The classification report shows the model performs well for categories 0 and 2, with high precision, recall, and F1-scores, but struggles significantly with category 1, achieving only 1% recall due to the low number of examples (class imbalance). Overall, the model has an accuracy of 80%, with a weighted F1-score of 0.78, indicating decent performance for the majority classes. However, the macro averages (precision: 0.60, recall: 0.56, F1: 0.55) highlight the poor balance across all categories. The log loss of 0.542 suggests moderate confidence in predictions but leaves room for improvement, particularly in handling the underrepresented category.

Feature Selection

As discussed earlier, one of the main advantages of LASSO is that it selects the features that matter most to the model, which helps with tuning the model to perform better.

# Feature importance
selected_model = pipeline.named_steps["feature_selection"].estimator_
feature_names = X_train_encoded.columns.tolist()
coefficients = np.sum(np.abs(selected_model.coef_), axis=0)
important_feature_indices = np.where(coefficients != 0)[0]
important_features = [feature_names[i] for i in important_feature_indices]
print(f"Important Features: {important_features}")
💡
Important Features: ['rated', 'turns', 'victory_status', 'white_rating', 'black_rating', 'opening_eco', 'opening_name', 'opening_ply', 'rating_difference', 'game_time', 'time_increment_per_move', 'estimated_game_duration', 'time_control_type', 'white_captures', 'black_captures', 'white_checks'
import matplotlib.pyplot as plt

#visualizing the important features
plt.bar(important_features, coefficients[important_feature_indices])
plt.xticks(rotation=90)
plt.grid()
plt.title("Feature Selection Based on Lasso")
plt.xlabel("Features")
plt.ylabel("Importance")
plt.ylim(0, 0.15)
plt.show()

A visual of the Important features, selected by LASSO

These features are then used to train the model once again to improve its performance.

# Extract selected features for training and testing
X_train_selected = X_train_encoded[important_features]
X_test_selected = X_test_encoded[important_features]

#retraining the lasso model on the best parameters and the important features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), important_features)  # Apply scaling to selected features
    ]
)
# Create the pipeline with Lasso Logistic Regression using the best parameters
pipeline = Pipeline(steps=[
    ("preprocessing", preprocessor),  # Scaler restricted to the selected features (defined above)
    ("classification", LogisticRegression(
        penalty="l1",
        solver="saga",  # saga fits a multinomial model for multiclass targets by default
        C=best_params.get("classification__C", 1.0),  # Use best C from GridSearch
        max_iter=1000
    ))
])

# Retrain the pipeline with the selected important features
pipeline.fit(X_train_selected, y_train_encoded)

# Make predictions on the test data
y_pred = pipeline.predict(X_test_selected)

y_pred_proba = pipeline.predict_proba(X_test_selected)

#calculate log loss
log_loss_value = log_loss(y_test_encoded, y_pred_proba)

# Evaluate the model
print("Classification Report:\n", classification_report(y_test_encoded, y_pred))
print(f"Log loss: {log_loss_value}")

Performance of the model trained with selected features

A significant improvement in the model is seen here as it now generalizes better on all the classes and the overall accuracy has also improved.

Evaluation

Just as the name implies, evaluation means testing the model to determine its performance and efficiency.

For the sake of this article, we will focus on log loss as the evaluation metric for our model.

What is Log Loss and How does it work?

Log loss, short for logarithmic loss, measures the uncertainty of the predictions made by the model. A perfect model has a log loss of 0. In simple terms, log loss tests a model’s confidence in its predictions: it penalizes the model more heavily when it assigns a low probability to the true class, even if the predicted class is correct.
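To see this penalty in action, the sketch below computes multiclass log loss by hand for two sets of made-up predictions. Both sets pick the correct class for every game, but one is confident and the other hesitant. The function name and the probabilities are assumptions for illustration, with classes indexed 0, 1, and 2 as in the encoded target.

```python
import math

def multiclass_log_loss(y_true, y_proba, eps=1e-15):
    """Average negative log of the probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(y_true, y_proba):
        p = min(max(probs[label], eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(y_true)

y_true = [0, 2]  # true class index for two toy games

# Both sets of probabilities rank the true class highest
confident = [[0.90, 0.05, 0.05], [0.10, 0.10, 0.80]]
hesitant = [[0.40, 0.30, 0.30], [0.30, 0.30, 0.40]]

loss_confident = multiclass_log_loss(y_true, confident)
loss_hesitant = multiclass_log_loss(y_true, hesitant)
print(f"confident: {loss_confident:.3f}, hesitant: {loss_hesitant:.3f}")
```

Accuracy would score both sets of predictions identically, while log loss rewards the confident one, which is why it is a useful complement to the classification report.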

In short, a lower log loss means a better model, and as we have seen above, our model improved significantly as its log loss dropped.

Conclusion

While working on this project, I learnt a lot and finally understood many theoretical concepts I had studied before.

My next project will be on Natural Language Processing: Text Analysis, which I have always found interesting.

Check out the complete project here: https://github.com/i-am-christy/The-Chess-Project

If you love my style of writing, follow me here and on LinkedIn via: www.linkedin.com/in/christianah-adekunle