Many factors go into the performance of an NBA player on a given night. NBA players on the road have a lot of free time, and the sights and attractions a city offers can be very distracting ahead of an upcoming game. For my project this semester, I set out to analyze the correlation between a player’s performance and a city’s activity level. To classify a player’s performance, I went through the box score data and categorized each game in each city as average, above average, subpar, or an anomaly. From there, cities were ranked and given an activity level value based on various factors of the city. Finally, I performed a linear regression of player performance against city activity level for each player position in the NBA.
To collect the data for this project, I split the task into two parts:
1) Player Data Collection
2) City Activity Data Collection
For the player data, I used both box score data, which has the box score for every player in every game from 2003 to 2019,
and player average data, which shows every player’s average for each season from 2003 to 2019. Both of these datasets
were CSV files provided by the NBA.
For the city activity levels, the data was collected through the Yelp API and through web scraping tripadvisor.com. With
the Yelp API, I took the average rating of all restaurants, bars, and clubs in all NBA cities. When web scraping
tripadvisor.com, I collected all the attractions listed for a given city in the attractions section of the website. Both
of these factors were used when calculating each city’s activity level.
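As a rough illustration of the rating step, the sketch below averages the ratings over a list of business records. It assumes each record is a dictionary with a `rating` field, which is how Yelp's business search results are generally shaped. The function name `average_rating` and the sample records are hypothetical and are not taken from the project's code or data.

```python
from statistics import mean


def average_rating(businesses):
    """Return the mean `rating` over business records, or None if none have one."""
    ratings = [b["rating"] for b in businesses if "rating" in b]
    return mean(ratings) if ratings else None


# Made-up records for illustration only; these are not real Yelp results.
sample = [
    {"name": "Example Bar", "rating": 4.0},
    {"name": "Example Diner", "rating": 3.5},
    {"name": "Example Club", "rating": 4.5},
]

print(average_rating(sample))
```

Records without a rating are skipped rather than counted as zero, so a missing value does not drag a city's average down.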
1) Player Performance
To identify a player's performance in a given game, the metrics used were determined by the position the player plays.
There are five positions in basketball: Point Guard, Shooting Guard, Small Forward, Power Forward, and Center. The
statistics that indicate a good game differ from one position to another. For example, assists are a very important
statistic for identifying a point guard’s performance, while they matter much less for a center. Therefore, I assigned each
player relevant statistics based on their position.
\[P=\frac{1}{n}\sum_{i=1}^n\frac{RS_i-\overline{RS_i}}{\sigma_{RS_i}},\quad\text{where }RS_i\text{ is the }i^{\text{th}}\text{ relevant stat}\]
Then, I categorized each player’s performance in a game as average, above average, subpar, or an anomaly (positive or negative).
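The score P above is an average of standardized deviations, which can be sketched in a few lines. This is a minimal sketch under two assumptions that the write-up does not state: that the mean and standard deviation of each stat are taken over the player's season, and that categories are cut at fixed thresholds on P. The threshold values and the names `performance_score` and `categorize` are hypothetical.

```python
def performance_score(game_stats, season_means, season_stds):
    """Average z-score of a player's relevant stats for one game."""
    z_scores = [
        (game_stats[stat] - season_means[stat]) / season_stds[stat]
        for stat in game_stats
    ]
    return sum(z_scores) / len(z_scores)


def categorize(score, band=0.5, anomaly=2.0):
    """Map a score to a category. The cutoffs are illustrative assumptions."""
    if score >= anomaly:
        return "positive anomaly"
    if score <= -anomaly:
        return "negative anomaly"
    if score > band:
        return "above average"
    if score < -band:
        return "subpar"
    return "average"


# Made-up point guard game: points and assists as the relevant stats.
game = {"points": 25, "assists": 10}
means = {"points": 20, "assists": 8}
stds = {"points": 5, "assists": 2}

score = performance_score(game, means, stds)
print(score, categorize(score))
```

Standardizing each stat before averaging keeps a high-volume stat such as points from dominating a low-volume one such as assists.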
2) City Activity Level
I split the city activity level into two parts: the average Yelp rating of the city’s restaurants, bars, and clubs, and the number of attractions listed for the city on tripadvisor.com.
Once I had gathered each city’s bar, club, and restaurant ratings as well as its number of attractions, I calculated a rating for each city and ranked the cities by it. Los Angeles was the highest with a rating of 1.89 and Oklahoma City was the lowest with a rating of 0.69.
City | Rating |
---|---|
Los Angeles | 1.89 |
New York | 1.86 |
Miami | 1.82 |
Washington D.C. | 1.77 |
Chicago | 1.73 |
Brooklyn | 1.72 |
Toronto | 1.64 |
San Francisco | 1.61 |
Houston | 1.52 |
Boston | 1.48 |
Philadelphia | 1.43 |
Dallas | 1.41 |
Orlando | 1.38 |
Atlanta | 1.32 |
Detroit | 1.21 |
Phoenix | 1.16 |
New Orleans | 1.13 |
Charlotte | 1.05 |
Indianapolis | 1.03 |
Denver | 0.98 |
Minneapolis | 0.91 |
Cleveland | 0.88 |
Salt Lake City | 0.85 |
Sacramento | 0.83 |
Portland | 0.82 |
Milwaukee | 0.75 |
San Antonio | 0.73 |
Memphis | 0.71 |
Oklahoma City | 0.69 |
Once the different cities were ranked by city activity ratings, and the data for each player position’s performance in each city was calculated, I created 20 different linear regressions where the dependent variable was percentage of games of a certain performance category for a specific position in each city, and the independent variable was the city rating.
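For readers who want to see what one such regression involves, here is a minimal ordinary least squares sketch in plain Python. It returns the slope, the intercept, and the standard error of the slope, which is the quantity a t-test on the slope needs. The function name `fit_line` is hypothetical, and the percentages in the example are made up for illustration. Only the city ratings come from the table above.

```python
import math


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept, plus the slope's standard error."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    residual_ss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    slope_se = math.sqrt(residual_ss / (n - 2) / sxx)
    return slope, intercept, slope_se


# City ratings from the table; the percentages are invented placeholders.
city_ratings = [1.89, 1.52, 1.21, 0.98, 0.69]
pct_above_average = [0.31, 0.38, 0.40, 0.45, 0.47]

slope, intercept, slope_se = fit_line(city_ratings, pct_above_average)
print(slope, intercept, slope_se)
```

With 29 cities and two fitted parameters, the same `n - 2` term gives 27 degrees of freedom.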
Of the 20 different linear regressions that were run, the two that showed the strongest correlation were point guard above average games against city rating and center anomaly games against city rating.
To see if there is a true linear correlation in these two sets of data, I performed a hypothesis test on both of them.
x = City Rating
y = Percentage of Above Average Games for a Point Guard
Let \(y=\beta x+\alpha\) be the linear regression equation
\(\text{Null Hypothesis }(H_0):\beta=0\)
\(\text{Alternate Hypothesis }(H_a):\beta\not=0\)
\(\text{Significance Level}:0.05\)
\(\text{Degrees of Freedom}:27\)
\(\text{Regression Equation}:y=-0.13x+0.566\)
\(\text{Standard Error of the Slope}:s_b=0.038\)
\(\text{Test Statistic}:t=\frac{b-\beta}{s_b}=-3.39\)
Since the critical t-value for a two-sided t-test at the 0.05 significance level with 27 degrees of freedom is \(\pm 2.05\), and \(t<-2.05\),
we can reject the null hypothesis, meaning that there is a significant linear correlation between point guard above
average games and city rating.
x = City Rating
y = Percentage of Anomaly Games for a Center
Let \(y=\beta x+\alpha\) be the linear regression equation
\(\text{Null Hypothesis }(H_0):\beta=0\)
\(\text{Alternate Hypothesis }(H_a):\beta\not=0\)
\(\text{Significance Level}:0.05\)
\(\text{Degrees of Freedom}:27\)
\(\text{Regression Equation}:y=0.043x-0.0203\)
\(\text{Standard Error of the Slope}:s_b=0.0131\)
\(\text{Test Statistic}:t=\frac{b-\beta}{s_b}=3.359\)
Since the critical t-value for a two-sided t-test at the 0.05 significance level with 27 degrees of freedom is \(\pm 2.05\), and \(t>2.05\), we can reject the null hypothesis, meaning that there is a significant linear correlation between center anomaly games
and city rating.
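The decision rule in both tests can be checked with a short sketch that recomputes the test statistic from the rounded slope and \(s_b\) values quoted above and compares it with the quoted critical value of 2.05. The function names are hypothetical. Because the inputs are rounded, the recomputed statistics (about -3.42 and 3.28) differ slightly from the reported -3.39 and 3.359, but the conclusion is the same in both cases.

```python
def slope_t_statistic(slope, slope_se, null_slope=0.0):
    """t = (b - beta) / s_b, with beta = 0 under the null hypothesis."""
    return (slope - null_slope) / slope_se


def rejects_null(t_stat, critical_value):
    """Two-sided test: reject when |t| exceeds the critical value."""
    return abs(t_stat) > critical_value


# Critical value quoted in the text for 27 degrees of freedom at the 0.05 level.
CRITICAL_T = 2.05

# (slope, s_b) pairs as reported above, rounded.
reported = {
    "point guard above average games": (-0.13, 0.038),
    "center anomaly games": (0.043, 0.0131),
}

for label, (b, s_b) in reported.items():
    t = slope_t_statistic(b, s_b)
    print(label, round(t, 2), rejects_null(t, CRITICAL_T))
```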
From these hypothesis tests, we can conclude that there is a significant positive correlation between center anomaly games and city rating, as well as a significant negative correlation between point guard above average games and city rating.
Although certain trends were seen between city activity levels and player performance, parts of the methodology could be changed to give a more accurate result. For example, the opposing team in each city was not accounted for when measuring a player’s performance rating. This has a great impact on the results, because visiting players will naturally have lower performance ratings in cities with strong home teams. In the future, I plan to account for this by including the home team’s record as an additional independent variable. Another future application of this project would be to use the data already collected to train a model that predicts a player’s performance based on the city they are playing in and the team they are playing against.
Spring 2020
Jai Sankar