This project uses basic survival analysis techniques to determine factors influencing the career length of NFL running backs, employing Kaplan-Meier estimates and Cox Proportional Hazards models with the aid of RStudio and its survival library. We extract data from pro-football-reference.com using .csv files for career statistics and Beautiful Soup for physical measurements, and maintain our dataset using pandas. We present our results in a visually appealing and easily comprehensible manner through ggplot2, in addition to other R libraries employed for analytical purposes.
- Brian Luu
- Kevin Wang
- John Randazzo
- Start by going here (http://www.pro-football-reference.com/draft/) and selecting RB in the drop-down menu for Position. We are deeply grateful to be able to scrape their data without much consequence.
- Next to the "Drafted Players" heading, there is an option labeled "Share & more"; clicking it yields an option to generate a .csv file suitable for Microsoft Excel. This is the compressed data of 300 NFL running backs. You can copy and paste this whole file into your text editor of choice. (We use TextWrangler.)
- To get more observations from earlier generations of NFL players, go to the bottom of the table on the website, click "Next Page", and repeat Step 2 with one caveat: when pasting the raw data, omit the first line containing the column names.
- With Step 3 in mind, repeat Step 2 four more times. You should have 1,500 observations in total.
- Save your file as a .csv and read it into Python. We save it as nflrb_data.csv, and that is the name we use in the Python portion. Yay! We are ready to plug this baby into Python.
We use Python to obtain further measures of interest for the RBs in our dataset. Namely, we wish to extract each player's height and weight in order to compute their Body Mass Index (BMI). To accomplish this, we write a function that accesses each player's Pro-Football-Reference page and extracts the height and weight by parsing the page's HTML for a familiar expression. We then store the height and weight as integers in two new columns in our dataset.
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
nfl = pd.read_csv("nflrb_data.csv")
nfl.head()
 | Rk | Year | Rnd | Pick | Unnamed: 4 | Pos | DrAge | Tm | From | To | ... | G | GS | Att | Yds | TD | Rec | Yds.1 | TD.1 | College/Univ | Unnamed: 23
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1 | 2016 | 1 | 4 | Ezekiel Elliott\ElliEz00 | RB | 21.0 | DAL | 2016.0 | 2016.0 | ... | 15.0 | 15.0 | 322.0 | 1631.0 | 15.0 | 32.0 | 363.0 | 1.0 | Ohio St. | College Stats |
1 | 2 | 2016 | 2 | 45 | Derrick Henry\HenrDe00 | RB | 22.0 | TEN | 2016.0 | 2016.0 | ... | 15.0 | 1.0 | 110.0 | 490.0 | 5.0 | 13.0 | 137.0 | 0.0 | Alabama | College Stats |
2 | 3 | 2016 | 3 | 73 | Kenyan Drake\DrakKe00 | RB | 22.0 | MIA | 2016.0 | 2016.0 | ... | 16.0 | 1.0 | 33.0 | 179.0 | 2.0 | 9.0 | 46.0 | 0.0 | Alabama | College Stats |
3 | 4 | 2016 | 3 | 90 | C.J. Prosise\ProsC.00 | RB | 22.0 | SEA | 2016.0 | 2016.0 | ... | 6.0 | 2.0 | 30.0 | 172.0 | 1.0 | 17.0 | 208.0 | 0.0 | Notre Dame | College Stats |
4 | 5 | 2016 | 4 | 119 | Tyler Ervin\ErviTy00 | RB | 22.0 | HOU | 2016.0 | 2016.0 | ... | 12.0 | 0.0 | 1.0 | 3.0 | 0.0 | 3.0 | 18.0 | 0.0 | San Jose St. | College Stats |
5 rows x 24 columns
## dropping the columns that will never be used
nfl.drop(["Rk","Unnamed: 23"], axis=1, inplace=True)
nfl.head()
 | Year | Rnd | Pick | Unnamed: 4 | Pos | DrAge | Tm | From | To | AP1 | ... | CarAV | G | GS | Att | Yds | TD | Rec | Yds.1 | TD.1 | College/Univ
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 2016 | 1 | 4 | Ezekiel Elliott\ElliEz00 | RB | 21.0 | DAL | 2016.0 | 2016.0 | 1 | ... | 16.0 | 15.0 | 15.0 | 322.0 | 1631.0 | 15.0 | 32.0 | 363.0 | 1.0 | Ohio St. |
1 | 2016 | 2 | 45 | Derrick Henry\HenrDe00 | RB | 22.0 | TEN | 2016.0 | 2016.0 | 0 | ... | 4.0 | 15.0 | 1.0 | 110.0 | 490.0 | 5.0 | 13.0 | 137.0 | 0.0 | Alabama |
2 | 2016 | 3 | 73 | Kenyan Drake\DrakKe00 | RB | 22.0 | MIA | 2016.0 | 2016.0 | 0 | ... | 2.0 | 16.0 | 1.0 | 33.0 | 179.0 | 2.0 | 9.0 | 46.0 | 0.0 | Alabama |
3 | 2016 | 3 | 90 | C.J. Prosise\ProsC.00 | RB | 22.0 | SEA | 2016.0 | 2016.0 | 0 | ... | 3.0 | 6.0 | 2.0 | 30.0 | 172.0 | 1.0 | 17.0 | 208.0 | 0.0 | Notre Dame |
4 | 2016 | 4 | 119 | Tyler Ervin\ErviTy00 | RB | 22.0 | HOU | 2016.0 | 2016.0 | 0 | ... | 0.0 | 12.0 | 0.0 | 1.0 | 3.0 | 0.0 | 3.0 | 18.0 | 0.0 | San Jose St. |
5 rows x 22 columns
nfl.columns
Index(['Year', 'Rnd', 'Pick', 'Unnamed: 4', 'Pos', 'DrAge', 'Tm', 'From', 'To',
'AP1', 'PB', 'St', 'CarAV', 'G', 'GS', 'Att', 'Yds', 'TD', 'Rec',
'Yds.1', 'TD.1', 'College/Univ'],
dtype='object')
nfl.columns = ['Year', 'Rnd', 'Pick', 'Player', 'Pos', 'DrAge', 'Tm', 'From', 'To',
'AP1', 'PB', 'St', 'CarAV', 'G', 'GS', 'Att', 'Yds', 'TD', 'Rec',
'Yds.1', 'TD.1', 'College/Univ']
nfl.isnull().sum()
Year 0
Rnd 0
Pick 0
Player 0
Pos 0
DrAge 344
Tm 0
From 361
To 361
AP1 0
PB 0
St 0
CarAV 361
G 361
GS 361
Att 450
Yds 450
TD 450
Rec 488
Yds.1 488
TD.1 488
College/Univ 2
dtype: int64
print("Number of Observations:", nfl.shape[0])
Number of Observations: 1500
## dropping the observations where not enough info could be found (no recorded NFL seasons)
nfl = nfl[nfl["From"].isnull() == False]
print("Number of Observations:", nfl.shape[0])
Number of Observations: 1139
## dropping the players who were still active in 2016 (i.e., had not yet retired)
nfl = nfl[nfl["To"]!=2016]
print("Number of Observations:", nfl.shape[0])
Number of Observations: 1037
nfl.isnull().sum()
Year 0
Rnd 0
Pick 0
Player 0
Pos 0
DrAge 0
Tm 0
From 0
To 0
AP1 0
PB 0
St 0
CarAV 0
G 0
GS 0
Att 88
Yds 88
TD 88
Rec 127
Yds.1 127
TD.1 127
College/Univ 0
dtype: int64
The original .csv file contained a column holding each player's name along with a snippet of the URL for their Pro-Football-Reference page. This function returns a list containing the player's name and their full Pro-Football-Reference URL. We apply this function to each row and then append the output to our existing dataset.
baseurl = "http://www.pro-football-reference.com/players/"
def split_player(row):
    split_list = row["Player"].split("\\")
    player_name = split_list[0]
    player_url_code = split_list[1]
    first_letter = player_url_code[0]
    full_url = baseurl + first_letter + "/" + player_url_code + ".htm"
    return [player_name, full_url]
a = nfl.apply(split_player,axis=1)
# convert the returned lists into numpy arrays and add them to the dataframe
nfl["Player"] = np.array([row[0] for row in a])
nfl["PFR_URL"] = np.array([row[1] for row in a])
nfl.head()
 | Year | Rnd | Pick | Player | Pos | DrAge | Tm | From | To | AP1 | ... | G | GS | Att | Yds | TD | Rec | Yds.1 | TD.1 | College/Univ | PFR_URL
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
35 | 2015 | 5 | 138 | David Cobb | RB | 22.0 | TEN | 2015.0 | 2015.0 | 0 | ... | 7.0 | 1.0 | 52.0 | 146.0 | 1.0 | 1.0 | -2.0 | 0.0 | Minnesota | http://www.pro-football-reference.com/players/... |
37 | 2015 | 5 | 155 | Karlos Williams | RB | 22.0 | BUF | 2015.0 | 2015.0 | 0 | ... | 11.0 | 3.0 | 93.0 | 517.0 | 7.0 | 11.0 | 96.0 | 2.0 | Florida St. | http://www.pro-football-reference.com/players/... |
40 | 2015 | 6 | 205 | Josh Robinson | RB | 23.0 | IND | 2015.0 | 2015.0 | 0 | ... | 5.0 | 0.0 | 17.0 | 39.0 | 0.0 | 6.0 | 33.0 | 0.0 | Mississippi St. | http://www.pro-football-reference.com/players/... |
43 | 2015 | 7 | 231 | Joey Iosefa | FB | 24.0 | TAM | 2015.0 | 2015.0 | 0 | ... | 2.0 | 0.0 | 15.0 | 51.0 | 0.0 | NaN | NaN | NaN | Hawaii | http://www.pro-football-reference.com/players/... |
45 | 2014 | 2 | 54 | Bishop Sankey | RB | 21.0 | TEN | 2014.0 | 2015.0 | 0 | ... | 29.0 | 12.0 | 199.0 | 762.0 | 3.0 | 32.0 | 272.0 | 1.0 | Washington | http://www.pro-football-reference.com/players/... |
5 rows x 23 columns
Since the original .csv file did not contain the players' heights and weights, this function takes a player's URL and parses the page with BeautifulSoup4 to find their height and weight. If the information cannot be found, it is assigned a missing value via exception handling.
def player_info(row):
    response = requests.get(row["PFR_URL"])
    content = response.content
    parser = BeautifulSoup(content, 'html.parser')
    try:
        height = parser.find_all(itemprop="height")[0].text
        weight = parser.find_all(itemprop="weight")[0].text
    except IndexError:
        height = weight = None
    return height, weight
a = nfl.apply(player_info, axis=1)
print(a.head())
35 (5-11, 229lb)
37 (6-1, 225lb)
40 (5-9, 215lb)
43 (6-0, 245lb)
45 (5-10, 209lb)
dtype: object
nfl["Height"] = np.array([row[0] for row in a])
nfl["Weight"] = np.array([row[1] for row in a])
## deleting the observations where no height or weight could be parsed
nfl = nfl[nfl["Height"].isnull() == False]
nfl = nfl[nfl["Weight"].isnull() == False]
## converting the height from a string (e.g. "5-11") to total inches
def convert_height(row):
    height = row["Height"].split("-")
    converted_height = 12*int(height[0]) + int(height[1])
    return converted_height
nfl["Height"] = nfl.apply(convert_height,axis=1)
## converting the weight from a string (e.g. "229lb") to an integer in pounds
def convert_weight(row):
    weight = int(row["Weight"][:3])
    return weight
nfl["Weight"] = nfl.apply(convert_weight, axis=1)
nfl.head()
 | Year | Rnd | Pick | Player | Pos | DrAge | Tm | From | To | AP1 | ... | Att | Yds | TD | Rec | Yds.1 | TD.1 | College/Univ | PFR_URL | Height | Weight
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
35 | 2015 | 5 | 138 | David Cobb | RB | 22.0 | TEN | 2015.0 | 2015.0 | 0 | ... | 52.0 | 146.0 | 1.0 | 1.0 | -2.0 | 0.0 | Minnesota | http://www.pro-football-reference.com/players/... | 71 | 229 |
37 | 2015 | 5 | 155 | Karlos Williams | RB | 22.0 | BUF | 2015.0 | 2015.0 | 0 | ... | 93.0 | 517.0 | 7.0 | 11.0 | 96.0 | 2.0 | Florida St. | http://www.pro-football-reference.com/players/... | 73 | 225 |
40 | 2015 | 6 | 205 | Josh Robinson | RB | 23.0 | IND | 2015.0 | 2015.0 | 0 | ... | 17.0 | 39.0 | 0.0 | 6.0 | 33.0 | 0.0 | Mississippi St. | http://www.pro-football-reference.com/players/... | 69 | 215 |
43 | 2015 | 7 | 231 | Joey Iosefa | FB | 24.0 | TAM | 2015.0 | 2015.0 | 0 | ... | 15.0 | 51.0 | 0.0 | NaN | NaN | NaN | Hawaii | http://www.pro-football-reference.com/players/... | 72 | 245 |
45 | 2014 | 2 | 54 | Bishop Sankey | RB | 21.0 | TEN | 2014.0 | 2015.0 | 0 | ... | 199.0 | 762.0 | 3.0 | 32.0 | 272.0 | 1.0 | Washington | http://www.pro-football-reference.com/players/... | 70 | 209 |
5 rows x 25 columns
## Output the new dataframe into a new .csv file
nfl.to_csv("nfl.csv")
Our analysis employs the theory of survival analysis, which measures the survival probability and instantaneous rate of hazard for an event of interest over a given time period.
We are interested in the number of games (our time variable) it takes for an NFL running back's professional career to end (our event of interest). Thanks to our web scraping process, we now have a large, informative and (somewhat) tidy dataset. It is now time to read our finished product (nfl.csv) from our Python program into R.
nfl <- read.csv("filepath/nfl.csv")
To install necessary packages, use:
install.packages("package")
To access the installed package:
library(package)
Here are the packages we used for our analysis:
- survival
- ggplot2
- KMsurv
- flexsurv
- survminer
We also make use of the ggsurv function, documented here: https://www.r-statistics.com/2013/07/creating-good-looking-survival-curves-the-ggsurv-function/
As it turns out, our dataset is still a bit on the messy side. We have some measures associated with each player that are of no use in survival analysis, and we do not yet have a variable representing our event of interest, retirement. We need to make a few adjustments before we can start our analysis.
Tidying up:
A few averages, and other stats:
nfl$YPC <- nfl$Yds / nfl$Att
nfl$Years <- nfl$To - nfl$From
nfl$PB.1 <- ifelse(nfl$PB >= 1, 1, 0) #binary predictor
nfl$AP1.1 <- ifelse(nfl$AP1 >= 1, 1, 0)
nfl$BMI <- (nfl$Weight / (nfl$Height * nfl$Height)) * 703
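The models below are fit to a data frame called nfl.ret with a Retired event indicator; neither is constructed in the code shown above, so here is a minimal sketch of how they could be built, assuming (as in our earlier filtering step) that every remaining player has already retired:
## every remaining player retired before the 2016 season, so the event
## indicator is 1 for all observations (no censoring) -- a sketch
nfl$Retired <- 1
nfl.ret <- nfl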
We are now ready to begin performing survival analysis on the career lengths of NFL running backs.
- Two primary variables of interest for building models:
  - Duration of time until the event or censoring
  - Binary indicator of the event for each observation:
    - 0: censored (left the study or did not experience the event during the study)
    - 1: experienced the event
- Let T = failure time (in the context of our study, T = games played until retirement)
- Let t_i denote a given time
- Then we can define our HAZARD FUNCTION as: h(t_i) = Pr(T = t_i | T ≥ t_i)
- Our SURVIVAL FUNCTION is thus defined as: S(t) = ∏_{t_i ≤ t} (1 − h(t_i))
We estimate the survival function using the Kaplan-Meier estimator, which, at each observed event time, multiplies together the estimated conditional probabilities of surviving past that time: Ŝ(t) = ∏_{t_i ≤ t} (1 − d_i / n_i), where d_i is the number of careers ending at time t_i and n_i is the number of players still at risk just before t_i.
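In R, this estimate comes from survfit in the survival package; a minimal sketch of the overall curve (plotted with the ggsurv function linked above) might look like:
library(survival)
## overall Kaplan-Meier estimate of career survival, with time measured in games
km.fit <- survfit(Surv(G, Retired) ~ 1, data = nfl.ret)
summary(km.fit, times = c(16, 48, 80))  # estimated survival after roughly 1, 3 and 5 seasons
ggsurv(km.fit)  # the ggsurv function documented above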
The KM estimator is limited because it only considers a single homogeneous population at a time. We use the Cox Proportional Hazards model to measure the effects of particular covariates on career survival, or more specifically, on the instantaneous rates of hazard. With this very handy tool, we are able to judge the degree to which covariates such as BMI influence a player's career length. Some theory of note:
- Let X = (X_1, ..., X_p) be the vector of covariate values for a given observation
- The Cox model is given by: h(t, X) = h_0(t) * exp(β_1 X_1 + ... + β_p X_p)
- h_0(t) is the baseline hazard rate as a function of time (if all X_i = 0 for a given observation, then h_0(t) is its actual hazard rate)
cox <- coxph(Surv(G,Retired)~BMI+YPC+DrAge, data = nfl.ret)
summary(cox)
## Call:
## coxph(formula = Surv(G, Retired) ~ BMI + YPC + DrAge, data = nfl.ret)
##
## n= 932, number of events= 932
## (82 observations deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## BMI -0.07703 0.92586 0.01439 -5.352 8.71e-08 ***
## YPC -0.20421 0.81529 0.02986 -6.838 8.02e-12 ***
## DrAge 0.17489 1.19111 0.04337 4.033 5.52e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## BMI 0.9259 1.0801 0.9001 0.9524
## YPC 0.8153 1.2266 0.7689 0.8644
## DrAge 1.1911 0.8396 1.0941 1.2968
##
## Concordance= 0.591 (se = 0.011 )
## Rsquare= 0.078 (max possible= 1 )
## Likelihood ratio test= 75.55 on 3 df, p=2.22e-16
## Wald test = 92.89 on 3 df, p=0
## Score (logrank) test = 87.74 on 3 df, p=0
Using AIC/BIC as criteria for model selection, we find that our three most significant covariates are BMI, yards per carry, and draft age. Our model is given by:
h(t, X) = h_0(t) * exp(−0.0770 * BMI − 0.2042 * YPC + 0.1749 * DraftAge)
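The comparison itself can be run directly on coxph fits with AIC(); a sketch of that kind of check (the competing covariate sets shown here are illustrative, not the exact ones we screened):
## compare candidate Cox models by AIC (lower is better)
cox.full  <- coxph(Surv(G, Retired) ~ BMI + YPC + DrAge + Rnd, data = nfl.ret)
cox.three <- coxph(Surv(G, Retired) ~ BMI + YPC + DrAge, data = nfl.ret)
AIC(cox.full, cox.three)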
The key assumption of the Cox model is that covariate effects on survival are independent of time. We can test this using Schoenfeld residuals: we are looking for a mean of 0 over the entire time span, which suggests that the errors are evenly distributed over time. There is also a p-value associated with each covariate. The hypotheses behind each p-value are:
H0: The covariate's effect is independent of time
H1: The covariate's effect exhibits time-dependency
A low p-value indicates that we should consider omitting the associated covariate.
We are thrilled with these results. Our model very much aligns with the Proportional Hazards assumption.
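This check is implemented as cox.zph in the survival package; a minimal sketch of how it is run:
## scaled Schoenfeld residual test of the proportional hazards assumption
## (one p-value per covariate plus a global test)
ph.check <- cox.zph(cox)
ph.check
plot(ph.check)  # residuals vs. time; a roughly flat trend around 0 is what we want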
Now that we have a legitimate model in our hands, we can visualize the effects of different covariate levels on career survival:
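One way to produce such a visualization is to predict survival curves from the fitted Cox model at a few covariate levels; the BMI values below are illustrative, with YPC and draft age held at typical values:
## predicted career survival at three BMI levels (chosen levels are illustrative)
levels.bmi <- data.frame(BMI = c(28, 31, 34), YPC = 4.0, DrAge = 22)
fit.bmi <- survfit(cox, newdata = levels.bmi)
plot(fit.bmi, col = c("red", "black", "blue"),
     xlab = "Games played", ylab = "Estimated survival probability")
legend("topright", legend = paste("BMI =", levels.bmi$BMI),
       col = c("red", "black", "blue"), lty = 1)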
We can use our model to estimate real-life career survival probability. Here is an estimated career survival curve for SD Chargers legend LaDainian Tomlinson, generated from our Cox model:
The black vertical line denotes the actual number of games LT played in his NFL career.
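A sketch of how such a curve can be generated for one player; the covariate values and game count for Tomlinson below are our own approximate figures, not taken from the scraped data:
## estimated survival curve for a single covariate profile
lt <- data.frame(BMI = 30.8, YPC = 4.3, DrAge = 21)  # approximate values for LT
fit.lt <- survfit(cox, newdata = lt)
plot(fit.lt, xlab = "Games played", ylab = "Estimated survival probability")
abline(v = 170)  # roughly the number of games LT actually played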
For Ezekiel Elliott, we can estimate the probability that he is still in the league after a given number of games. He has only played one season (16 games), and given his measurements it is no surprise that he has made it this far:
Here is the estimated career survival probability for Bo Jackson, arguably the greatest pure athlete in American history:
Bo Jackson only played three seasons, although he also moonlighted as a star MLB player. His career was tragically ended by a catastrophic hip injury; the allegation is that the injury was worsened by Bo's prior steroid use. It is important to note here that statistical models such as KM estimates and Cox PH models are mere approximations of reality: they cannot reliably predict real life as it unfolds, but they can be very useful for elucidating interesting relationships between two or more phenomena.
We figured that the Pro-Bowl and All-Pro covariates were likely highly significant in explaining career lengths among NFL players. However, these did not meet the Proportional Hazards assumption. Still wanting to make use of these covariates, we instead ask which accolade is more strongly associated with a lengthy professional career. It should be noted that the presence of both accolades is the best indicator of a long playing career.
The second plot is incomplete. We find that All-Pro status seems to imply being a Pro-Bowler as well. Generally, an All-Pro will not go long without being selected to a Pro-Bowl.
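The curves behind this comparison can be drawn as stratified Kaplan-Meier estimates using the binary indicators created earlier; a sketch:
## Kaplan-Meier curves stratified by each accolade indicator
km.pb  <- survfit(Surv(G, Retired) ~ PB.1, data = nfl.ret)
km.ap1 <- survfit(Surv(G, Retired) ~ AP1.1, data = nfl.ret)
ggsurv(km.pb)   # Pro-Bowlers vs. non-Pro-Bowlers
ggsurv(km.ap1)  # All-Pros vs. non-All-Pros
## log-rank tests for a difference between groups
survdiff(Surv(G, Retired) ~ PB.1, data = nfl.ret)
survdiff(Surv(G, Retired) ~ AP1.1, data = nfl.ret)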
We can fit a parametric distribution to our overall survival curve. Observe:
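The fit below comes from flexsurvreg in the flexsurv package (the call is echoed in the output); plotting the fitted object overlays the parametric curve on the Kaplan-Meier estimate:
library(flexsurv)
## fit a generalized gamma distribution to the overall career survival curve
gg.fit <- flexsurvreg(Surv(G, Retired) ~ 1, data = nfl.ret, dist = "gengamma")
gg.fit
plot(gg.fit, xlab = "Games played", ylab = "Survival probability")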
## Call:
## flexsurvreg(formula = Surv(G, Retired) ~ 1, data = nfl.ret, dist = "gengamma")
##
## Estimates:
## est L95% U95% se
## mu 4.3805 4.2774 4.4837 0.0526
## sigma 0.7168 0.6541 0.7856 0.0335
## Q 1.6940 1.4332 1.9548 0.1331
##
## N = 1014, Events: 1014, Censored: 0
## Total time at risk: 58734
## Log-likelihood = -5101.664, df = 3
## AIC = 10209.33
We are impressed by the goodness of this fit. The parameters of the distribution were estimated to be mu = 4.3805, sigma = 0.7168, Q = 1.694.
More on the Generalized Gamma distribution can be found here: https://en.wikipedia.org/wiki/Generalized_gamma_distribution
For this project, our goal was to examine the statistical effects of career statistics, accolades, and physical measurements on the career lengths of running backs in the NFL. After obtaining the dataset via Python, we switched to R and applied tools of survival analysis such as the Kaplan-Meier estimator and the Cox Proportional Hazards model. We found three highly significant, time-independent covariates that tell us a great deal about an NFL running back's potential career length: the age at which a player was drafted, the player's BMI, and the player's yards-per-carry statistic.
- Our estimated career survival curve for NFL RBs fits a Generalized Gamma distribution with parameters (mu = 4.3805, sigma = 0.7168, Q = 1.694).
- BMI is highly significant in predicting career length, having the largest magnitude of the three predictors in our model. Players with a higher computed BMI tend to last longer in the league. Since BMI loses its power to indicate obesity for extremely muscular and athletic individuals (such as NFL RBs), the measure becomes one of body density: the stockier a RB's build, the longer we can expect them to last in the league.
- Draft age is highly significant in predicting career length. On average, the earlier a player enters the NFL, the longer it takes for their athletic ability to decline to the point of retirement from the league.
- Yards per carry is also very significant. As an average measure, it is a prime indicator of how good an NFL RB is over their career. Naturally, players with higher YPC can be expected to last longer in the league.
- All-Pro and Pro-Bowl status are indicators of a player's quality of play over their career. Perhaps trivially, players with these accolades lasted far longer than those who never achieved them. We found that All-Pro was a significantly better predictor of career longevity than Pro-Bowl.
https://www.r-statistics.com/2013/07/creating-good-looking-survival-curves-the-ggsurv-function/
https://www.stat.ubc.ca/~rollin/teach/643w04/lec/node69.html