Abstract

This project focuses on using basic survival analysis techniques to determine factors influencing career length of NFL running backs, employing Kaplan-Meier esimates and Cox Proportional Hazards modeling procedures with the aid of RStudio and its survival library. We extract data from pro-football-reference.com using .csv files for career statistics and Beautiful Soup for physical measurements and maintain our dataset using pandas. We present our results in a visually appealing and easily comprehensible manner through the use of ggplot2, in addition to other R libraries employed for analytical purposes.

Contributors

Brian Luu
Kevin Wang
John Randazzo

Motivation

Runningbacks don't last very long in the NFL

Methodology- Web Scraping

Getting started with the .csv file

Start by going here (http://www.pro-football-reference.com/draft/) and selecting RB in the drop-down menu for Position. We are deeply grateful to be able to scrape their data without much consequence.
Next to the "Drafted Players" heading, there is an option labeled "Share & more" which we will click, yielding an option to generate a .csv file that is suitable for Microsoft Excel. This is the compressed data of 300 NFL running backs. You can literally cut and paste this whole file into your text processor of choice. (We use TextWrangler)
To get more observations from previous generations of NFL players, we go to the bottom of the table on the website and click "Next Page" and then repeat step 2 with one caveat: when cutting and pasting the raw data file, omit the first line with all of the columns' names.
With Step 3 in mind, repeat Step 2, 4 more times. You should have 1500 observations total.
Save your file as a .csv file and read into Python. We save it as nflrb_data.csv and that is the name we use in the Python part. Yay! We are ready to plug this baby into Python.

Getting Started in Python

We are using Python to obtain further measures of interest for the RBs in our dataset. Namely, we wish to extract each player's height and weight in order to compute their Body Mass Index. To accomplish this, we need to write a function that accesses each player's Pro-Football-Reference page, then extracts the player's height and weight by parsing the page's .html code and finding a familiar expression. We also store the player's height and weight in integer form in two new columns in our dataset.

Importing the Required Libraries

import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests

Cleaning the Dataset

nfl = pd.read_csv("nflrb_data.csv")
nfl.head()

	Rk	Year	Rnd	Pick	Unnamed: 4	Pos	DrAge	Tm	From	To	...	G	GS	Att	Yds	TD	Rec	Yds.1	TD.1	College/Univ	Unnamed: 23
0	1	2016	1	4	Ezekiel Elliott\ElliEz00	RB	21.0	DAL	2016.0	2016.0	...	15.0	15.0	322.0	1631.0	15.0	32.0	363.0	1.0	Ohio St.	College Stats
1	2	2016	2	45	Derrick Henry\HenrDe00	RB	22.0	TEN	2016.0	2016.0	...	15.0	1.0	110.0	490.0	5.0	13.0	137.0	0.0	Alabama	College Stats
2	3	2016	3	73	Kenyan Drake\DrakKe00	RB	22.0	MIA	2016.0	2016.0	...	16.0	1.0	33.0	179.0	2.0	9.0	46.0	0.0	Alabama	College Stats
3	4	2016	3	90	C.J. Prosise\ProsC.00	RB	22.0	SEA	2016.0	2016.0	...	6.0	2.0	30.0	172.0	1.0	17.0	208.0	0.0	Notre Dame	College Stats
4	5	2016	4	119	Tyler Ervin\ErviTy00	RB	22.0	HOU	2016.0	2016.0	...	12.0	0.0	1.0	3.0	0.0	3.0	18.0	0.0	San Jose St.	College Stats

5 rows x 24 columns

## getting rid of the useless column that is never going to be used
nfl.drop(["Rk","Unnamed: 23"], axis=1, inplace=True)
nfl.head()

	Year	Rnd	Pick	Unnamed: 4	Pos	DrAge	Tm	From	To	AP1	...	CarAV	G	GS	Att	Yds	TD	Rec	Yds.1	TD.1	College/Univ
0	2016	1	4	Ezekiel Elliott\ElliEz00	RB	21.0	DAL	2016.0	2016.0	1	...	16.0	15.0	15.0	322.0	1631.0	15.0	32.0	363.0	1.0	Ohio St.
1	2016	2	45	Derrick Henry\HenrDe00	RB	22.0	TEN	2016.0	2016.0	0	...	4.0	15.0	1.0	110.0	490.0	5.0	13.0	137.0	0.0	Alabama
2	2016	3	73	Kenyan Drake\DrakKe00	RB	22.0	MIA	2016.0	2016.0	0	...	2.0	16.0	1.0	33.0	179.0	2.0	9.0	46.0	0.0	Alabama
3	2016	3	90	C.J. Prosise\ProsC.00	RB	22.0	SEA	2016.0	2016.0	0	...	3.0	6.0	2.0	30.0	172.0	1.0	17.0	208.0	0.0	Notre Dame
4	2016	4	119	Tyler Ervin\ErviTy00	RB	22.0	HOU	2016.0	2016.0	0	...	0.0	12.0	0.0	1.0	3.0	0.0	3.0	18.0	0.0	San Jose St.

5 rows x 22 columns

Renaming the Columns

nfl.columns

Index(['Year', 'Rnd', 'Pick', 'Unnamed: 4', 'Pos', 'DrAge', 'Tm', 'From', 'To',
       'AP1', 'PB', 'St', 'CarAV', 'G', 'GS', 'Att', 'Yds', 'TD', 'Rec',
       'Yds.1', 'TD.1', 'College/Univ'],
      dtype='object')

nfl.columns = ['Year', 'Rnd', 'Pick', 'Player', 'Pos', 'DrAge', 'Tm', 'From', 'To',
       'AP1', 'PB', 'St', 'CarAV', 'G', 'GS', 'Att', 'Yds', 'TD', 'Rec',
       'Yds.1', 'TD.1', 'College/Univ']

Getting Rid of Some Missing Data

nfl.isnull().sum()

Year              0
Rnd               0
Pick              0
Player            0
Pos               0
DrAge           344
Tm                0
From            361
To              361
AP1               0
PB                0
St                0
CarAV           361
G               361
GS              361
Att             450
Yds             450
TD              450
Rec             488
Yds.1           488
TD.1            488
College/Univ      2
dtype: int64

print("Number of Observations:", nfl.shape[0])

Number of Observations: 1500

## getting rid of the observations where not enough info could be found
nfl = nfl[nfl["From"].isnull() == False]
print("Number of Observations:", nfl.shape[0])

Number of Observations: 1139

## getting rid of the players that did not retire by 2016
nfl = nfl[nfl["To"]!=2016]
print("Number of Observations:", nfl.shape[0])

Number of Observations: 1037

nfl.isnull().sum()

Year              0
Rnd               0
Pick              0
Player            0
Pos               0
DrAge             0
Tm                0
From              0
To                0
AP1               0
PB                0
St                0
CarAV             0
G                 0
GS                0
Att              88
Yds              88
TD               88
Rec             127
Yds.1           127
TD.1            127
College/Univ      0
dtype: int64

Finding New Data

When the original .csv file was extracted, it contained a column that had the player's name along with a snippet of the URL that was part of their ProFootballReference page. This function returns a list containing the player's name and respective ProFootballReference URL. We run this function and then append the output sequentially to our existing dataset.

baseurl = "http://www.pro-football-reference.com/players/"
def split_player(row):
    split_list = row["Player"].split("\\")
    player_name = split_list[0]
    player_url_code = split_list[1]
    first_letter = player_url_code[0]
    full_url = baseurl + first_letter + "/" + player_url_code + ".htm"
    return [player_name, full_url]
a = nfl.apply(split_player,axis=1)

# converted the lists into numpy arrays and then added them into the dataframe
nfl["Player"] = np.array([row[0] for row in a])
nfl["PFR_URL"] = np.array([row[1] for row in a])

nfl.head()

	Year	Rnd	Pick	Player	Pos	DrAge	Tm	From	To	...	G	GS	Att	Yds	TD	Rec	Yds.1	TD.1	College/Univ	PFR_URL
35	2015	5	138	David Cobb	RB	22.0	TEN	2015.0	2015.0	...	7.0	1.0	52.0	146.0	1.0	1.0	-2.0	0.0	Minnesota	http://www.pro-football-reference.com/players/...
37	2015	5	155	Karlos Williams	RB	22.0	BUF	2015.0	2015.0	...	11.0	3.0	93.0	517.0	7.0	11.0	96.0	2.0	Florida St.	http://www.pro-football-reference.com/players/...
40	2015	6	205	Josh Robinson	RB	23.0	IND	2015.0	2015.0	...	5.0	0.0	17.0	39.0	0.0	6.0	33.0	0.0	Mississippi St.	http://www.pro-football-reference.com/players/...
43	2015	7	231	Joey Iosefa	FB	24.0	TAM	2015.0	2015.0	...	2.0	0.0	15.0	51.0	0.0	NaN	NaN	NaN	Hawaii	http://www.pro-football-reference.com/players/...
45	2014	2	54	Bishop Sankey	RB	21.0	TEN	2014.0	2015.0	...	29.0	12.0	199.0	762.0	3.0	32.0	272.0	1.0	Washington	http://www.pro-football-reference.com/players/...

5 rows x 23 columns

Since the original .csv file did not contain the players' height and weight, this function was created to take in a player's respective URL and parse the website to find their height and weight using BeautifulSoup4. If the info could not be found, it would be assigned a missing data value using an error exception.

def player_info(row):
    response = requests.get(row["PFR_URL"])
    content = response.content
    parser = BeautifulSoup(content, 'html.parser')
    try:
        height = parser.find_all(itemprop="height")[0].text
        weight = parser.find_all(itemprop="weight")[0].text
    except IndexError:
        height=weight=None
    return height, weight

a = nfl.apply(player_info, axis=1)
print(a.head())

35    (5-11, 229lb)
37     (6-1, 225lb)
40     (5-9, 215lb)
43     (6-0, 245lb)
45    (5-10, 209lb)
dtype: object

nfl["Height"] = np.array([row[0] for row in a])
nfl["Weight"] = np.array([row[1] for row in a])

## deleting the observations where no height or weight could be parsed
nfl = nfl[nfl["Height"].isnull() == False]
nfl = nfl[nfl["Weight"].isnull() == False]

## converting the height from character to integer 
def convert_height(row):
    height = row["Height"].split("-")
    converted_height = 12*int(height[0]) + int(height[1])
    return converted_height
nfl["Height"] = nfl.apply(convert_height,axis=1)

## converting the weight from character to integer
def convert_weight(row):
    weight = int(row["Weight"][:3])
    return weight
nfl["Weight"] = nfl.apply(convert_weight, axis=1)

nfl.head()

	Year	Rnd	Pick	Player	Pos	DrAge	Tm	From	To	...	Att	Yds	TD	Rec	Yds.1	TD.1	College/Univ	PFR_URL	Height	Weight
35	2015	5	138	David Cobb	RB	22.0	TEN	2015.0	2015.0	...	52.0	146.0	1.0	1.0	-2.0	0.0	Minnesota	http://www.pro-football-reference.com/players/...	71	229
37	2015	5	155	Karlos Williams	RB	22.0	BUF	2015.0	2015.0	...	93.0	517.0	7.0	11.0	96.0	2.0	Florida St.	http://www.pro-football-reference.com/players/...	73	225
40	2015	6	205	Josh Robinson	RB	23.0	IND	2015.0	2015.0	...	17.0	39.0	0.0	6.0	33.0	0.0	Mississippi St.	http://www.pro-football-reference.com/players/...	69	215
43	2015	7	231	Joey Iosefa	FB	24.0	TAM	2015.0	2015.0	...	15.0	51.0	0.0	NaN	NaN	NaN	Hawaii	http://www.pro-football-reference.com/players/...	72	245
45	2014	2	54	Bishop Sankey	RB	21.0	TEN	2014.0	2015.0	...	199.0	762.0	3.0	32.0	272.0	1.0	Washington	http://www.pro-football-reference.com/players/...	70	209

5 rows x 25 columns

## Output the new dataframe into a new .csv file 
nfl.to_csv("nfl.csv")

Methodology- Analyis

Our analysis will employ the theory of Survival Analysis, which measures survival probability and instantaneous rate of hazard for an event of interest over a given time period.

We are interested in the amount of games (our time variable) it takes for an NFL runningback's professional career to end (our event of interest). Thanks to our web scraping process, we now have a large, informative and (somewhat) tidy dataset. It is now time to read our finished product (nfl.csv) from our Python program into R.

nfl <- read.csv("filepath/nfl.csv")

To install necessary packages, use:

install.packages("package")

To access the installed package:

library(package)

Here are the packages we used for our analysis:

survival
ggplot2
KMSurv
flexsurv
survminer

We also make use of the ggsurv funtion, documented here: https://www.r-statistics.com/2013/07/creating-good-looking-survival-curves-the-ggsurv-function/

Further Tidying

As it turns out, our dataset is still a bit on the messy side. We have some measures associated to each player that cannot be of use in survival analysis. We also do not have a variable set to represent our event of interest, retirement. We need to make a few adjustments before we can start our analysis.

Tidying up:

A few averages, and other stats:

nfl$YPC <- nfl$Yds / nfl$Att
nfl$Years <- nfl$To - nfl$From
nfl$PB.1 <- ifelse(nfl$PB >= 1, 1, 0) #binary predictor
nfl$AP1.1 <- ifelse(nfl$AP1 >= 1, 1, 0)
nfl$BMI <- (nfl$Weight / (nfl$Height * nfl$Height)) * 703

We are now ready to begin performing survival analysis on the career lengths of NFL running backs.

A Brief Overview of Survival Analysis

Two primary variables of interest for building models:
- Duration of time until event or censoring
- Binary indicator of event for each observation
  - 0; censored (left the study or did not experience event during study)
  - 1; experienced the event
Let T = Failure time (in the context of our study, T = games played until retirement)
Let t_i denote a given time
Then we can define our HAZARD FUNCTION as:

h(t_i)=P**r(T = t_i|T ≥ t_i)

Our SURVIVAL FUNCTION is thus defined as:

S(T)=∏_{t_i ≤ T}(1 − h(t_i))

Kaplan-Meier Estimates:

We estimate the survival function using Kaplan-Meier Estimation, which computes $\hat{S}(T)$ as a function of each t_i ∈ [0, 1, ...239]. These are best used when considering the entire subject population's aggregate survival with no additional covariates, or with a few discrete valued covariate levels. We make use of the ggsurv function here to create aesthetically pleasing survival plots:

Cox Models:

The KM Estimator is limited because it only considers a single homogenous population at a time. We use the Cox Proportional Hazards model to measure the effects of particular covariates on career survival, or more specifically, the instantaneous rates of hazard. With this very handy tool, we are able to judge the level at which covariates such as BMI influence a player's career length. Some theory of note:

Let X be a vector of data values associated to a given observation
The Cox Model is given by:

$h(t,X) = h_0(t) * exp( \sum_{i = 1}^n \beta_i X_i )$

Baseline hazard rate as a function of time (if all X_i = 0 for a given observation, then h₀(t) is the actual hazard rate)

cox <- coxph(Surv(G,Retired)~BMI+YPC+DrAge, data = nfl.ret)
summary(cox)

## Call:
## coxph(formula = Surv(G, Retired) ~ BMI + YPC + DrAge, data = nfl.ret)
## 
##   n= 932, number of events= 932 
##    (82 observations deleted due to missingness)
## 
##           coef exp(coef) se(coef)      z Pr(>|z|)    
## BMI   -0.07703   0.92586  0.01439 -5.352 8.71e-08 ***
## YPC   -0.20421   0.81529  0.02986 -6.838 8.02e-12 ***
## DrAge  0.17489   1.19111  0.04337  4.033 5.52e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##       exp(coef) exp(-coef) lower .95 upper .95
## BMI      0.9259     1.0801    0.9001    0.9524
## YPC      0.8153     1.2266    0.7689    0.8644
## DrAge    1.1911     0.8396    1.0941    1.2968
## 
## Concordance= 0.591  (se = 0.011 )
## Rsquare= 0.078   (max possible= 1 )
## Likelihood ratio test= 75.55  on 3 df,   p=2.22e-16
## Wald test            = 92.89  on 3 df,   p=0
## Score (logrank) test = 87.74  on 3 df,   p=0

Using AIC/BIC as a criterion for our model we find that our three most significant covariates are BMI, Yards/Carry and Draft Age. Our model is given by:

h(t, X)=h₀(t)*exp(( − .0770 * BMI)+(−.2042 * YPC)+(.1749 * DraftAg**e))

The Proportional Hazards Assumption

The key assumption for the Cox Model is that covariate effects on survival are independent of time. We can test this using Schoenfeld Residuals. We are looking for a mean of 0 for the entire time duration, which suggests that errors are evenly distributed over time. Further, there is a p-value associated to each covariate. The hypotheses yielding each probability measure are:

H₀: The covariate's effect is independent of time H₁: The covariate's effect exhibits time-dependency

A low p-value indicates that we should consider omitting the associated covariate.

We are thrilled with these results. Our model very much aligns with the Proportional Hazards assumption.

Examining Our Model's Fit

Now that we have a legitimate model in our hands, we can visualize the effects of different covariate levels on career survival:

Fun With Our Model: A Tale of 3 RBs

We can use our model to estimate real-life career survival probability. Here is a Kaplan-Meier estimate for SD Chargers legend Ladainian Tomlinson made from our Cox model:

The black vertical line denotes the actual number of games LT played in his NFL career.

For Ezekiel Elliott, we can provide an estimate for the probability he is still in the league after a given amount of games. He has only played 1 season (16 games), and given his measurements it is no surprise that he has made it this far:

Here is the estimated career survival probability for Bo Jackson, arguably the greatest pure athlete in American history:

Bo Jackson only played in three seasons, although he moonlighted as a star MLB player as well. His career was tragically ended by a catastrophic hip injury; the allegation is that the injury was worsened by Bo's prior steroid abuse. It is important to note here that statistical models such as KM estimates and Cox PH models are mere approximations of reality, and cannot in any way reliably predict real-life as it unfolds, but can be very useful to elucidate interesting relationships between two or more phenomena.

Bonus: All-Pro vs. Pro-Bowl

We figured that the Pro-Bowl and All-Pro covariates were likely highly significant in explaining career lengths amongst NFL players. However, these did not meet the Proportional Hazards assumption. Still seeking to employ the power of these covariates, we seek to find which accolade is more associated to a lengthy professional career. It should be noted that the presence of both accolades is the best indicator of a long playing career.

The second plot is incomplete. We find that All-Pro status seems to imply being a Pro-Bowler as well. Generally, an All-Pro will not go long without being selected to a Pro-Bowl.

Fitting a Distribution to Our Estimate:

We can fit a parametric distribution to our overall survival curve. Observe:

## Call:
## flexsurvreg(formula = Surv(G, Retired) ~ 1, data = nfl.ret, dist = "gengamma")
## 
## Estimates: 
##        est     L95%    U95%    se    
## mu     4.3805  4.2774  4.4837  0.0526
## sigma  0.7168  0.6541  0.7856  0.0335
## Q      1.6940  1.4332  1.9548  0.1331
## 
## N = 1014,  Events: 1014,  Censored: 0
## Total time at risk: 58734
## Log-likelihood = -5101.664, df = 3
## AIC = 10209.33

We are impressed by the goodness of this fit. Our parameters for the distribution were estimated to be: (mu = 4.3805, sigma = .7168, Q = 1.694)

More on the Generalized Gamma distribution can be found here: https://en.wikipedia.org/wiki/Generalized_gamma_distribution

Summary and Resources

For this project, our goal was to examine the statistical effects of career statistics, accolades and physical measurements on the career lengths of runningbacks in the NFL. Upon obtaining the dataset via Python methods, we switched to R and implemented theoretical tools of survival analysis such as the Kaplan Meier estimator and the Cox Proportional Hazards model. We found that there are three highly significant and time-independent covariates which can tell us a great deal about an NFL running back's potential career length: the age at which a player was drafted, the player's BMI, and the player's Yards per Carry statistic.

Key Results

Our estimated career survival curve for NFL RBs fits a Generalized Gamma Distribution with parameters (mu = 4.3805, sigma = .7168, Q = 1.694).
BMI is highly significant in predicting career length, having the largest magnitude of the three predictors in our model. Players with a higher computed BMI will last longer in the league. Since BMI loses predictive power in terms of indicating obesity when considering extremely muscular and athletic individuals (such as NFL RBs) the measure becomes one of body density. Specifically, the stockier a RB's build is, the longer we can expect them to last in the league.
Draft Age is highly significant in predicting career length. On average, the earlier a player enters the NFL, the longer it will take for their athletic ability to decrease to the point of seeking retirement from the league.
Yards per carry is also very significant. As an average measure, it is a prime indicator of how good an NFL RB is in their career. Of course, players with higher YPC can be expected to have lasted longer in the league.
All-Pro and Pro-Bowl status are indicators of a player's quality of play through their career. Perhaps trivially, players with these accolades lasted far longer than those who never acheived them. We found that All-Pro was a significantly better predictor of career longevity than Pro-Bowl.

Acknowledgments

http://stackoverflow.com/questions/26296020/github-displays-all-code-chunks-from-readme-rmd-despite-include-false

https://www.r-statistics.com/2013/07/creating-good-looking-survival-curves-the-ggsurv-function/

https://www.stat.ubc.ca/~rollin/teach/643w04/lec/node69.html

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Abstract

Contributors

Motivation

Runningbacks don't last very long in the NFL

Methodology- Web Scraping

Getting started with the .csv file

Getting Started in Python

Importing the Required Libraries

Cleaning the Dataset

Renaming the Columns

Getting Rid of Some Missing Data

Finding New Data

Methodology- Analyis

Further Tidying

A Brief Overview of Survival Analysis

Kaplan-Meier Estimates:

Cox Models:

The Proportional Hazards Assumption

Examining Our Model's Fit

Fun With Our Model: A Tale of 3 RBs

Bonus: All-Pro vs. Pro-Bowl

Fitting a Distribution to Our Estimate:

Summary and Resources

Key Results

Acknowledgments

Files

README.md

Latest commit

History

README.md

File metadata and controls

Abstract

Contributors

Motivation

Runningbacks don't last very long in the NFL

Methodology- Web Scraping

Getting started with the .csv file

Getting Started in Python

Importing the Required Libraries

Cleaning the Dataset

Renaming the Columns

Getting Rid of Some Missing Data

Finding New Data

Methodology- Analyis

Further Tidying

A Brief Overview of Survival Analysis

Kaplan-Meier Estimates:

Cox Models:

The Proportional Hazards Assumption

Examining Our Model's Fit

Fun With Our Model: A Tale of 3 RBs

Bonus: All-Pro vs. Pro-Bowl

Fitting a Distribution to Our Estimate:

Summary and Resources

Key Results

Acknowledgments