We studied the impact that January transfers have on a team’s performance. We used three metrics: xG (the expected goals in a match), xGDIFF (the change in that value following the transfer), and the transfer fee. For each transfer, we considered the individual player’s past performance statistics, the team’s performance before and after the transfer, and the change in the team’s expected goals.
We picked this study because soccer is the most lucrative sport on the planet, and the English Premier League (EPL) in particular is very wealthy. EPL clubs spend heavily on recruiting players and regularly outspend their counterparts in other European leagues, so we wanted to see whether that investment bore fruit.
We found a lot of data online, but most of it was in formats that weren’t readily usable.
We therefore restricted our analysis to the EPL and to attacking players only. Historic stats for defensive players were sparse, and a single model could not be used if attackers and defenders were measured with different inputs.
Ultimately, we settled on three data sources, each contributing one data table.
From the Understat data, we built our own table to measure the success or failure of each transfer window. First, we calculated xG and xPTS per game, because teams had played different numbers of games at that point in the season. We did this for each season, separately for the periods before and after January. Next, we created a table of the difference in performance between the two periods. A window was labeled a success (binary) if the change was positive and a failure otherwise.
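The per-game normalization and binary labeling described above could look roughly like the following sketch. The record layout, the function names, and the numbers are illustrative assumptions, and the sketch assumes that success is judged on the change in xPTS per game, which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class HalfSeason:
    """Team totals for one part of a season (hypothetical record layout)."""
    games: int
    xg: float    # total expected goals over those games
    xpts: float  # total expected points over those games


def per_game(half: HalfSeason) -> tuple[float, float]:
    """Return (xG per game, xPTS per game) so unequal game counts are comparable."""
    if half.games <= 0:
        raise ValueError("games must be positive")
    return half.xg / half.games, half.xpts / half.games


def window_outcome(before: HalfSeason, after: HalfSeason) -> dict:
    """Compare per-game values before and after January and attach a binary label."""
    xg_before, xpts_before = per_game(before)
    xg_after, xpts_after = per_game(after)
    xpts_diff = xpts_after - xpts_before
    return {
        "xg_diff": xg_after - xg_before,
        "xpts_diff": xpts_diff,
        # Assumption: the label depends on xPTS per game only.
        "success": 1 if xpts_diff > 0 else 0,
    }


# Made-up numbers for one team in one season.
before = HalfSeason(games=20, xg=30.0, xpts=28.0)
after = HalfSeason(games=18, xg=30.6, xpts=27.0)
result = window_outcome(before, after)
```

Here the team collected fewer total expected points after January, yet its per-game rate rose, which is why the normalization matters before labeling.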
Finally, we merged all three tables. We intended to merge on the club name column in each table, but the names did not match: for example, one table might use Arsenal FC while another simply uses Arsenal. We cleaned the name column and then merged the tables.
Our final data set contained EPL January transfers for the 2017/18, 2018/19 and 2019/20 seasons, the historic stats of the players, and the change in team performance over the rest of the season after signing each player.
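As a minimal sketch of the kind of name cleaning described above, club names can be reduced to a common key before joining. The alias map, the column name `club`, and the sample rows are hypothetical; the actual cleaning rules used in the project are not given.

```python
import re

# Hypothetical alias map for names that differ by more than a suffix.
ALIASES = {"man utd": "manchester united", "spurs": "tottenham hotspur"}


def normalize_club(name: str) -> str:
    """Lowercase, drop standalone 'FC'/'AFC' tokens, collapse spaces, apply aliases."""
    key = name.strip().lower()
    key = re.sub(r"\b(afc|fc)\b", " ", key)
    key = re.sub(r"\s+", " ", key).strip()
    return ALIASES.get(key, key)


def merge_on_club(left, right):
    """Inner-join two lists of dict rows on the normalized 'club' value.

    Assumes at most one row per club in `right`.
    """
    index = {normalize_club(row["club"]): row for row in right}
    merged = []
    for row in left:
        key = normalize_club(row["club"])
        if key in index:
            merged.append({**index[key], **row, "club": key})
    return merged


# Illustrative rows only.
transfers = [{"club": "Arsenal FC", "player": "Player A"}]
team_stats = [{"club": "Arsenal", "xpts_diff": 0.1}]
merged = merge_on_club(transfers, team_stats)
```

Rows whose normalized names still do not match are silently dropped by the inner join, so in practice it is worth listing the unmatched keys and extending the alias map.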
In the three tables that we used, we not only rounded decimals and converted some strings to integers, but also removed most of the textual data about the teams and transferred players.
For example, we removed the columns 'Unnamed: 0', 'age', 'position', 'club_involved_name', 'fee', 'transfer_movement', 'transfer_period', 'league_name', 'season', 'Player', 'Nation', 'Pos', 'Comp', 'Age', 'Born', 'FK', 'PK', 'PKatt', 'npxG', 'npxG/Sh', and 'np:G-xG'.
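The cleaning step can be illustrated with a small sketch that drops unwanted columns and converts numeric strings. Only a subset of the column list is used here, the `Gls` column and the sample values are hypothetical, and rounding to two decimal places is an assumption.

```python
# A subset of the removed columns, for illustration.
DROP = {"Unnamed: 0", "Nation", "Born", "FK", "PK", "PKatt"}


def clean_row(row: dict) -> dict:
    """Drop unwanted columns, turn integer-like strings into int, round other numbers."""
    cleaned = {}
    for key, value in row.items():
        if key in DROP:
            continue
        if isinstance(value, str):
            text = value.strip()
            try:
                value = int(text)
            except ValueError:
                try:
                    value = round(float(text), 2)  # assumption: two decimal places
                except ValueError:
                    value = text  # leave non-numeric text unchanged
        elif isinstance(value, float):
            value = round(value, 2)
        cleaned[key] = value
    return cleaned


# Hypothetical raw row as it might come out of a CSV reader.
raw = {"Unnamed: 0": "3", "Nation": "eng ENG", "Gls": "7", "xG": "6.4812"}
cleaned = clean_row(raw)
```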
The visualizations show the relationships between transfer fee and player performance, between fee and team performance, and so on. The first table is a sample of the final table for the 2017/18 season, covering only a few players. The second and third graphics show the precision/recall of our logistic regression model and the PCA of the data; using PCA, we reduced the dimensionality of the data from eleven to two. The fourth graphic is another plot for analyzing precision/recall in more detail, since a good model keeps precision roughly constant as recall increases. The last visualization plots player xG against its correlation with the team's points.
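For readers unfamiliar with precision/recall plots, the points on such a curve can be computed as in the sketch below: each distinct predicted score is used in turn as the decision threshold. The labels and scores are made up for illustration and are not the project's results.

```python
def precision_recall_points(y_true, scores):
    """Return (threshold, precision, recall) for each distinct score, highest first."""
    total_pos = sum(y_true)
    points = []
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for p, y in zip(predicted, y_true) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, y_true) if p and y == 0)
        # tp + fp >= 1 here, because at least one score equals the threshold.
        precision = tp / (tp + fp)
        recall = tp / total_pos if total_pos else 0.0
        points.append((threshold, precision, recall))
    return points


# Made-up labels (1 = successful window) and model scores.
y_true = [1, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
points = precision_recall_points(y_true, scores)
```

Lowering the threshold can only raise recall, while precision may move in either direction; a flat precision line across the recall range is the behavior the text describes as desirable.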
Variables:
Player position (Defender, Midfielder, Attacker) – Categorical
Transfer fee – Numerical
xG+A for player (Attackers, Midfielders) – Numerical
Defensive actions per 90 (Defenders) – Numerical
xG per 90 for Team before and after January – Numerical
xG against per 90 for Team before and after January – Numerical
xP per 90 for Team before and after January – Numerical
We created a logistic regression model that predicts whether a club’s transfer window will be a success, given the incoming players’ stats, the club’s own performance so far in that season, and the total transfer fee paid by the club. We also hope to incorporate defensive stats for players, which are hard to extract and are mostly available only for purchase. The model can be developed further to cover other leagues and more data, which should also help in obtaining higher accuracy and a better metric for success or failure.
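To make the modeling step concrete, here is a minimal, dependency-free sketch of logistic regression fitted by batch gradient descent. It is not the project's implementation: the three features, their standardized values, the labels, and the hyperparameters are all made up for illustration, and a library implementation would normally be used instead.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(w, b, x):
    """Predicted probability that the window is a success."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Made-up, standardized rows (assumed feature order):
# [incoming players' xG+A per 90, team xPTS per game before January, total fee]
X = [
    [0.9, -0.5, 0.1], [0.6, -0.8, -0.4], [-0.7, 0.6, 0.8],
    [-1.0, 0.9, 0.3], [1.1, -1.0, -0.2], [-0.9, 0.8, -0.6],
]
y = [1, 1, 0, 0, 1, 0]  # 1 = successful window

w, b = fit_logistic(X, y)
preds = [1 if predict_proba(w, b, x) >= 0.5 else 0 for x in X]
```

The toy data are cleanly separable, so the fitted model classifies every training row correctly; real transfer data would need a held-out set to estimate accuracy honestly.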
Fall 2020
Arjun Vats, Utkarsh Nath