Tuning hyperparameters, performing the right kind of feature engineering, feature selection, etc. are all part of the data science workflow for building a machine learning model. Hours are spent tweaking and modifying each part of the process to improve the model's outcome.
However, there is one argument nested within the most popular functions in data science that can be altered to change your machine learning results.
…and it has nothing to do with domain knowledge or any of the engineering you have done on your data.
It is a seemingly harmless argument that can change your results, yet barely any article teaches you how to handle it. With some manipulation of the random permutation of the training data and the model seed, anyone can artificially improve their results.
In this article, I would like to gently highlight an often overlooked component of most data science projects — random state and how it affects our model outputs in machine learning.
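The argument in question is `random_state`, which appears throughout scikit-learn's splitters and estimators (and in xgboost as well). As a minimal illustration, and not the notebook's actual code, here is how two different values of `random_state` produce two different train/test splits of the same data:

```python
# Minimal sketch (synthetic data, not the Titanic dataset): the same
# train_test_split call with two different random_state values yields
# two different shufflings of the rows.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, random_state=0)

split_a = train_test_split(X, y, test_size=0.2, random_state=1)
split_b = train_test_split(X, y, test_size=0.2, random_state=2)

# split_a[0] and split_b[0] are both X_train of shape (80, 20), but they
# contain different rows, so a model trained on each will generally
# score differently.
```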
So how does random state affect the classifier output?
To show how this affects the prediction result, I will be using Kaggle’s famous Titanic dataset to predict the survival of the passengers.
Using the train dataset, I have applied some bare-minimum data cleaning and feature engineering, just enough to get the data ready for training. I will be using a typical grid search cross-validation with the xgboost classifier for this example.
Training data that I will be using:
Using grid search to find the optimal xgboost hyperparameters, I got the best parameters for the model.
Based on the cross validation result, my best performance achieved is 82.49% and the best parameters are:
{'colsample_bytree': 1.0, 'gamma': 0.5, 'max_depth': 4, 'min_child_weight': 1, 'subsample': 0.8}
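The grid search above can be sketched as follows. The cleaned Titanic features and `XGBClassifier` from the notebook are not reproduced here, so a synthetic dataset and scikit-learn's `GradientBoostingClassifier` stand in; the `random_state` mechanics are the same.

```python
# Hedged sketch of a grid search with a fixed model seed. The parameter
# grid here is illustrative, not the notebook's actual grid.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=400, n_features=8, random_state=0)

param_grid = {
    "max_depth": [3, 4],
    "subsample": [0.8, 1.0],
}

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=50, random_state=0),  # fixed model seed
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_score_, search.best_params_)
```

`best_score_` is the best averaged cross-validation accuracy and `best_params_` the winning combination, mirroring the 82.49% result reported above.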
This process is a staple of many machine learning projects: search through a range of hyperparameters to get the best averaged cross-validation result. At this point, the work is considered done.
After all, cross validation should be robust to randomness… right?
For data science tutorials or showcases of results in Kaggle kernels, the notebook would have ended right there. However, I would like to revisit the previous workflow to show how the result differs with different random states.
This time, let's run the code with 5 different random states on the classifier:
Let's change the cross validation random state as well:
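The double loop over seeds can be sketched like this. Again, a synthetic dataset and scikit-learn's `GradientBoostingClassifier` stand in for the Titanic features and `XGBClassifier`, and the grid and fold counts are shrunk to keep the sketch quick; the structure of the experiment is the point.

```python
# Vary both the classifier's random_state and the CV split's random_state,
# recording the best grid-search score for each of the 25 combinations.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
param_grid = {"max_depth": [2, 3]}  # illustrative grid, not the notebook's

results = {}
for clf_state in range(5):        # seed for the model itself
    for cv_state in range(5):     # seed for the cross-validation split
        cv = StratifiedShuffleSplit(n_splits=3, test_size=0.2,
                                    random_state=cv_state)
        search = GridSearchCV(
            GradientBoostingClassifier(n_estimators=25,
                                       random_state=clf_state),
            param_grid, cv=cv, scoring="accuracy",
        )
        search.fit(X, y)
        results[(clf_state, cv_state)] = search.best_score_

# results now holds 25 "best" scores, one per (model seed, split seed) pair
```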
All the results returned are different. With 5 different random states for the xgboost classifier and the cross validation split, the grid search run produces 25 different best performance results.
Having multiple results stems from the fact that the data and the algorithm we use have a random component that can affect the output.
However, this casts serious doubt on the data science process, as we make changes to our models all the time.
For each of the changes that I make, I would compare the results across different runs to validate the improvements. E.g. changing 'a' improves the model by 2%, adding 'b' to the features improves it by a further 3%.
With the variation of results shown above, it makes me wonder whether my feature engineering actually contributed to a better result, or whether the improvement was all down to chance.
Maybe a different random state would make my results worse than before.
My initial result was 82.49%, but 84.84% is higher.
Notice that with classifier random state 4 and stratified shuffle random state 2, my results are substantially higher at 84.84% compared to my initial run.
Which result do I present then?
It is tempting to present the best of the model results, since the random seed is fixed and the results are reproducible.
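One way to resist that temptation is to report the mean and standard deviation of the score across several random states, rather than the score from a single lucky seed. A minimal sketch, again using synthetic data and scikit-learn's `GradientBoostingClassifier` as stand-ins:

```python
# Average the cross-validation score over several seeds and report the
# spread, so a single fortunate random_state cannot dominate the result.
import statistics

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

scores = []
for seed in range(5):
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    clf = GradientBoostingClassifier(n_estimators=30, random_state=seed)
    scores.append(cross_val_score(clf, X, y, cv=cv).mean())

print(f"accuracy: {statistics.mean(scores):.4f} "
      f"+/- {statistics.stdev(scores):.4f}")
```

Reporting the spread alongside the mean makes it clear how much of any "improvement" falls within seed-to-seed noise.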