Cricket has always been an integral part of my childhood. It has united me with various people, and it helped me fit in after moving to India from the United States. I continue to explore the intricacies of the sport, and now I wish to harness the mechanisms of data science to find quantitative and objective answers to some subjective questions regarding the sport. These findings can potentially help a U-19 player tweak his game in order to maximize his chances of getting into the main team.
Cricket is a team sport in which players have primarily two roles: batting and bowling (analogous to batters and pitchers in baseball). A batsman’s task is to hit the ball and score runs, while the bowler’s task is to prevent the batsman from scoring runs and to pick up wickets (a wicket is analogous to three strikes in baseball). A batsman scores a boundary by hitting the ball beyond the boundary rope (analogous to a home run in baseball).
In the image below, a batsman is denoted by the yellow dot, and a bowler is denoted by a green dot. A run is completed as the batsman runs across the pitch (represented by the brown rectangle). The red dots represent fielders, who try to prevent the batsmen from scoring runs and assist the bowler in taking wickets.
An important thing to note is that a player can both bat and bowl during his team’s respective innings. The number of runs scored and the number of wickets taken can be considered the fundamental statistics for a cricketer. However, there are more nuanced statistics as well, such as the following:
As in every sport, a cricketer has to play on several lower platforms before playing for his country. One such platform is the U-19 World Cup, where players aged 19 or under represent their national side and compete with each other. Becoming a part of the U-19 team is a huge achievement, and several U-19 players go on to represent their countries at the highest level. The U-19 World Cup takes place every two years, and a player can participate in the tournament once or twice in his career. This project develops a machine learning model that estimates the probability of an Indian U-19 batsman getting into the main team based on his performance in the U-19 World Cup. It also analyzes which of the features described above play a more important role in determining whether a U-19 Indian batsman gets into the main team, and whether a particular feature contributes positively or negatively to his chances.
To perform my analysis, I needed a dataset containing the World Cup statistics of every U-19 Indian batsman over the last 20 years, along with whether each player was selected for the main Indian cricket team. Generating this dataset was a long task, as I was unable to find a ready-made one online. The data I needed was available on a website called cricinfo.com, which is a subsidiary of the ESPN network. This website contained the statistics of each Indian batsman for each World Cup in separate tables. To obtain this information for each year in a usable format, I built a web scraping function using a Python web scraping library called BeautifulSoup. The function takes an espncricinfo URL as input, creates a new Pandas dataframe (Pandas is a data science library in Python, and a dataframe is the structure it uses to store data tables), extracts the data from the page, and loads it into the newly created dataframe.
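A minimal sketch of such a scraping function is shown here. It is not the original code: the function names `table_to_dataframe` and `scrape_stats`, the assumption that the statistics sit in the first HTML table with a header row, and the sample HTML (with made-up players and numbers) are all illustrative. The real page layout may differ.

```python
from urllib.request import Request, urlopen

import pandas as pd
from bs4 import BeautifulSoup


def table_to_dataframe(html: str) -> pd.DataFrame:
    """Convert the first <table> found in an HTML string into a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        raise ValueError("no <table> element found")

    # Collect the text of every header/data cell, row by row.
    rows = []
    for tr in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    if not rows:
        raise ValueError("table has no rows")

    # Assumption: the first row holds the column names.
    df = pd.DataFrame(rows[1:], columns=rows[0])

    # Scraped cells are strings; convert a column only if every value is numeric.
    for col in df.columns:
        converted = pd.to_numeric(df[col], errors="coerce")
        if converted.notna().all():
            df[col] = converted
    return df


def scrape_stats(url: str) -> pd.DataFrame:
    """Download a page (hypothetical helper) and parse its first table."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request) as response:
        html = response.read().decode("utf-8", errors="replace")
    return table_to_dataframe(html)


# Made-up HTML standing in for a statistics page, so the sketch runs offline.
SAMPLE_HTML = """
<table>
  <tr><th>Player</th><th>Runs</th><th>SR</th></tr>
  <tr><td>Player A</td><td>120</td><td>85.5</td></tr>
  <tr><td>Player B</td><td>45</td><td>60.0</td></tr>
</table>
"""
sample_df = table_to_dataframe(SAMPLE_HTML)
print(sample_df)
```

Keeping the parsing step separate from the download makes it possible to check the parser against a saved page before running it on every World Cup.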
Using this function, I obtained 10 different dataframes: one for each World Cup in the years 2000 to 2020. I then concatenated these dataframes into a single dataset. However, there was one problem. The dataframe I generated had all the features I needed except for two: Captain (whether the player captained the Indian side in that particular year) and Played for India (whether the player went on to play for the Indian team or not). To add these two features, I manually researched them for each player and entered them into the dataframe. This task was tedious, but after completing it, I finally had a complete dataset containing the World Cup statistics of 74 batsmen over the last 20 years.
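The concatenation and manual labelling steps could look roughly like the sketch below. The per-year frames, player names, and label values are placeholders invented for illustration, and keeping the hand-researched labels in a separate table that is merged in is one possible approach, not necessarily the one used here.

```python
import pandas as pd

# Placeholder per-tournament frames; in practice these come from the scraper.
frames_by_year = {
    2018: pd.DataFrame({"Player": ["Player A", "Player B"], "Runs": [120, 45]}),
    2020: pd.DataFrame({"Player": ["Player C"], "Runs": [80]}),
}

# Tag each frame with its tournament year, then stack them into one dataset.
tagged = [frame.assign(Year=year) for year, frame in frames_by_year.items()]
dataset = pd.concat(tagged, ignore_index=True)

# Hand-researched labels (made-up values), keyed by player and year.
manual_labels = pd.DataFrame(
    {
        "Player": ["Player A", "Player B", "Player C"],
        "Year": [2018, 2018, 2020],
        "Captain": [True, False, False],
        "Played for India": [True, False, True],
    }
)

# validate="one_to_one" raises if a player/year pair is duplicated on either side.
dataset = dataset.merge(
    manual_labels, on=["Player", "Year"], how="left", validate="one_to_one"
)
print(dataset)
```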
Although I had a dataset, I had to tackle two issues with it before I could implement my model and analyze the statistics. The first issue was that the data was not in an appropriate format to be processed. The second was that I had too many features, and I wanted to cut down on them.
To tackle the first issue, I first had to carry out the following tasks:
Next, I had to find the features that would let me create a model that was as efficient and accurate as possible. Using the matplotlib and seaborn libraries, I created visualizations that helped me see whether there was a correlation between each feature and whether the player went on to play for India. Two such visualizations can be seen below. They show no trend in Played for India with respect to Economy or Bowling Average, and hence, I did not include these two features in my model. Through my analysis, I deduced that the following features were most relevant to my project: Runs, Balls Faced, Strike Rate, 100s, 4s, Wickets Taken, Captain.
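As a rough illustration of this kind of check, the sketch below draws a seaborn box plot of one feature split by the outcome and prints the group means. The numbers are invented for the example (chosen so that the two groups look alike), and a box plot is only one of several plot types that could be used; the original figures may have been different.

```python
import matplotlib

matplotlib.use("Agg")  # draw off-screen so the sketch also runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up values for illustration only.
df = pd.DataFrame(
    {
        "Economy": [4.2, 5.1, 5.8, 4.4, 5.0, 5.5],
        "Played for India": [True, False, True, False, True, False],
    }
)

# Use string labels on the x-axis so the outcome is treated as a category.
plot_df = df.assign(
    **{"Played for India": df["Played for India"].map({True: "Yes", False: "No"})}
)

fig, ax = plt.subplots(figsize=(5, 4))
sns.boxplot(data=plot_df, x="Played for India", y="Economy", ax=ax)
ax.set_title("Economy by selection outcome (illustrative data)")

# A numeric companion to the plot: similar group means suggest no clear trend.
group_means = plot_df.groupby("Played for India")["Economy"].mean()
print(group_means)
```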
To begin implementing my model, I first divided my dataset into two parts: a training set (the data used to train the machine learning model) and a test set (the data used to measure the accuracy of the model). Using a machine learning library in Python called scikit-learn, I randomly assigned 70% of the data to the training set and 30% to the test set.
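A 70/30 split of this kind can be sketched with scikit-learn's `train_test_split`. The data below is randomly generated as a stand-in for the real dataset, and the `random_state` value is an assumption added only to make the split repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

FEATURES = ["Runs", "Balls Faced", "Strike Rate", "100s", "4s", "Wickets Taken", "Captain"]
TARGET = "Played for India"

# Random stand-in data with the same column names as the project's dataset.
rng = np.random.default_rng(0)
n_players = 20
dataset = pd.DataFrame(
    {
        "Runs": rng.integers(0, 400, n_players),
        "Balls Faced": rng.integers(1, 500, n_players),
        "Strike Rate": rng.uniform(40, 120, n_players),
        "100s": rng.integers(0, 3, n_players),
        "4s": rng.integers(0, 40, n_players),
        "Wickets Taken": rng.integers(0, 10, n_players),
        "Captain": rng.integers(0, 2, n_players),
        TARGET: rng.integers(0, 2, n_players),
    }
)

# test_size=0.3 holds out 30% of the rows; the rest are used for training.
X_train, X_test, y_train, y_test = train_test_split(
    dataset[FEATURES], dataset[TARGET], test_size=0.3, random_state=42
)
print(len(X_train), "training rows,", len(X_test), "test rows")
```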
I then used a machine learning technique called logistic regression to create a prediction model that calculates the probability of any U-19 batsman getting into the main team based on his World Cup statistics. Logistic regression is a statistical model that, in its basic form, uses a logistic function to model a binary dependent variable. A binary variable is one that can have only two possible outcomes; in this project, the criterion feature, Played for India, is a binary variable, as it can have only two possible values: True or False. The logistic function used to model this binary dependent variable is the following:
\[p=\frac{1}{1 + e^{-(\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_nX_n)}}\]
where \(p\) is the probability of the predicted value being True, \(\{X_1, X_2, ..., X_n\}\) is the set of predictor features used to predict the outcome, and \(\{\beta_0, \beta_1, \beta_2, ..., \beta_n\}\) is the set of coefficients fitted by the machine learning model: an intercept \(\beta_0\) and one coefficient per predictor feature. This logistic function is also called the sigmoid function. I implemented my logistic regression model using the LogisticRegression class in the scikit-learn library. This class also helped me prevent overfitting. One risk of implementing machine learning models is that the fitted coefficients could reflect quirks of the training set rather than the data in general. Hence, I used a technique called ridge regularization to reduce this risk. Ridge regularization can be thought of as a penalty against complexity: increasing the regularization strength penalizes "large" weight coefficients. It uses the following equation:
\[\hat{\theta} = \underset{\theta}{\operatorname{argmin}} \sum_{i=1}^{n} (y_i - f_{\theta}(x_i))^2 + \lambda R(\theta) \]
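The first term is the least-squares data-fit term being minimized, and the second is a regularization term weighted by the parameter \(\lambda\); for ridge regularization, \(R(\theta)\) is the sum of the squared coefficients. (In logistic regression, the data-fit term is the log loss rather than the squared error, but the penalty works in the same way.) Hence, using these techniques, I developed a logistic regression model.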
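The sketch below illustrates both ideas on synthetic data: it fits scikit-learn's `LogisticRegression` with a weak and a strong ridge penalty, and then recomputes the predicted probability by hand from the sigmoid formula above. The data, the two values of `C`, and the helper `sigmoid_probability` are assumptions made for illustration. Note that in scikit-learn `C` is the inverse of the regularization strength, so a smaller `C` corresponds to a larger \(\lambda\).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: three features, with the outcome driven by the first two.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_logits = 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.random(200) < 1 / (1 + np.exp(-true_logits))).astype(int)

# The default penalty is L2 (ridge). Smaller C means stronger regularization.
weak_penalty = LogisticRegression(C=100.0, max_iter=1000).fit(X, y)
strong_penalty = LogisticRegression(C=0.01, max_iter=1000).fit(X, y)

print("coefficients, weak penalty:  ", weak_penalty.coef_[0])
print("coefficients, strong penalty:", strong_penalty.coef_[0])


def sigmoid_probability(model, features):
    """p = 1 / (1 + exp(-(beta_0 + beta_1*X_1 + ... + beta_n*X_n)))."""
    z = model.intercept_[0] + features @ model.coef_[0]
    return 1.0 / (1.0 + np.exp(-z))


# The hand-computed sigmoid should agree with the library's probabilities.
manual_p = sigmoid_probability(strong_penalty, X)
library_p = strong_penalty.predict_proba(X)[:, 1]
print("max difference:", np.abs(manual_p - library_p).max())
```

The strongly penalized model ends up with smaller coefficients, which is the shrinkage that the penalty term is meant to produce.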
Testing the accuracy of this model was rather difficult, as my model predicted probabilities and not discrete outcomes. Therefore, I devised my own measure of accuracy: I classified every probability value \(\ge 0.5\) as True and every probability value \(\lt 0.5\) as False. I used my model to predict the probability for every player in the test set, and then calculated the model accuracy. My model obtained an accuracy of 82%, and upon further analysis, I discovered that it was more accurate in predicting which players get into the team than in predicting which players do not.
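The thresholding rule and the per-class comparison can be sketched as follows. The probabilities and outcomes are made-up numbers, not results from this project.

```python
import numpy as np

# Made-up predicted probabilities and true outcomes for six players.
probabilities = np.array([0.9, 0.2, 0.5, 0.4, 0.7, 0.6])
played_for_india = np.array([True, False, False, False, True, True])

# Probabilities >= 0.5 are classified as True, the rest as False.
predictions = probabilities >= 0.5

# Overall accuracy: share of players whose predicted label matches the outcome.
accuracy = (predictions == played_for_india).mean()

# Accuracy within each group: players who were selected vs. those who were not.
selected_accuracy = predictions[played_for_india].mean()
not_selected_accuracy = (~predictions[~played_for_india]).mean()

print(f"overall: {accuracy:.2f}")
print(f"selected players: {selected_accuracy:.2f}")
print(f"non-selected players: {not_selected_accuracy:.2f}")
```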
Using the statsmodels.api library, I analyzed the coefficients my model assigned to each feature. One issue, however, was that the features varied in range. A batsman’s runs will always be greater than or equal to his batting average, and hence, I could not tell whether a larger coefficient assigned to the Runs feature was due to the feature being more significant or due to this difference in scale. Hence, I first wrote a normalizing function that rescaled the data so that every feature took values between 0 and 1. Using this function, I obtained a dataframe with normalized data, which I could use to analyze the coefficients. Using the statsmodels.api library, I obtained the following table:
Each row in the table of results immediately above corresponds to a feature (a column of the earlier dataset) as follows:
Row # | Feature |
---|---|
0 | Runs |
1 | Balls Faced |
2 | Strike Rate |
3 | 100s |
4 | 4s |
5 | Wickets Taken |
6 | Captain |
Positive coefficients have been assigned to: Runs, Captain.
Negative coefficients have been assigned to: Balls Faced, Strike Rate, 100s, 4s, Wickets Taken.
The most significant features, with respect to magnitude, are: Runs, Balls Faced, Strike Rate, 100s, Wickets Taken, Captain.
Through my analysis, I went on to make the following inferences:
Thus, by the end of this project, I had developed a model that predicts the probability of a U-19 Indian batsman getting into the main cricket team. I want to make this model more widely available, and hence, I hope to develop a web application that allows a user to input their statistics and receive this probability as output. Furthermore, I would also like to develop a similar model for bowlers, which would allow me to generalize my model to all Indian U-19 cricketers.
Spring 2020
Pulkit Bhasin