# Linear Models
In a linear model, the expected value (mean) of a response variable,
$Y$, is modeled as a linear combination of one or more predictors. The
usual formulation of a linear model is

$$E[Y] = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p.$$

We can also write down a version of the model that includes a random
component,

$$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \epsilon,$$

where $\epsilon$ is an error term, usually assumed to have mean $0$.
**Examples of Linear Models**
- Simple Linear Regression -- only one predictor
- ANOVA -- the predictors are dummy variables representing group
membership
- Multiple Linear Regression -- the predictors can be any combination
of binary, nominal, ordinal, and continuous variables
Even an independent samples t-test is an example of a linear model. In
this case, there is only one predictor, $X$, which is a dummy variable
representing the group.
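To see the equivalence concretely, here is a minimal sketch on simulated
data (the group labels, means, and sample sizes are purely
illustrative): the slope reported by `lm` is the difference in group
means, and its t-statistic matches the equal-variance two-sample t-test.
`> set.seed(1)`
`> grp = factor(rep(c("A", "B"), each=20))        # dummy-coded group membership`
`> y = rnorm(40, mean=ifelse(grp == "A", 10, 12))  # simulated response`
`> summary(lm(y ~ grp))             # coefficient for grpB = difference in group means`
`> t.test(y ~ grp, var.equal=TRUE)  # same t statistic (up to sign) and p-value`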
### Another example of data entry
The *linmod.csv* data set contains the following measurements on 255
individuals.
- Systolic Blood Pressure
- Diastolic Blood Pressure
- Combined Blood Pressure
- Age
- Gender
- Smoking Status
We begin by reading in the data.
`> dat = read.csv("linmod.csv", header=TRUE)`
Before beginning any modeling, it's important to investigate the data to
make sure it is what you expect. If we just type `dat` at the
command line, **[R]{.sans-serif}** prints the entire data set. This is
not the most efficient way to explore the data, so here we look at
several ways of summarizing the data set more concisely.
The `dim` function returns the dimensions of the data set; the `head`
and `tail` functions enable us to see the first and last rows of the
data set.
`> class(dat)`
`> names(dat)`
`> dim(dat)`
`> head(dat)`
`> head(dat, 15)`
`> tail(dat)`
The `summary` function is a powerful method for summarizing variables in
the data set.
`> summary(dat)`
The output includes a numeric summary for the variables *sbp* and *dbp*.
The *bp* column is read as *character* (and converted to a factor)
because it contains the character "/". Instead of a numeric summary for
*bp*, `summary` returns a table of values for the factor levels.
We would expect *age* to be a continuous variable, but
**[R]{.sans-serif}** returns a table summary for *age* rather than a
numeric summary. This indicates that *age* is being read as a factor,
and we should check for strings in the age column. In the *smoke*
column, we see that one person has the value $888$ and two people have
the value $999$.
Another powerful function for investigating the structure of a data set
is the `str` command.
`> str(dat)`
The output lists *age* as a *factor* which confirms our suspicion that
*age* is being read as a character variable. For factors, we can
investigate the factor levels with the `levels` command.
`> levels(dat$age)`
Someone has recorded "Not Reported" in the age column for one person.
With a data set as small as ours, we could type `dat` and scroll down to
find out which person had a "Not Reported" for age. With large data
sets this is difficult. The `which` function is another option.
`> which(dat$age == "Not Reported")`
`> dat[which(dat$age == "Not Reported"), ]`
Now let's investigate the *smoke* variable.
`> levels(dat$smoke)`
`> which(dat$smoke == "888")`
`> which(dat$smoke == "999")`
**Fixing the Data** Suppose that 888, 999, and "Not Reported" all
indicate missing values. Let's update the *age* and *smoke* variables by
using `NA` to represent missingness.
`> dat$age[189] = NA`
`> dat$smoke[c(83,80,91)] = NA`
`> summary(dat$smoke)`
`> dat$smoke = factor(dat$smoke)`
`> summary(dat$smoke)`
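Hard-coding row numbers (taken from the `which` output above) works for
a one-off fix, but a sketch that targets the placeholder codes directly
is less fragile if the data change; this assumes the only such codes are
"Not Reported", 888, and 999.
`> dat$age[dat$age == "Not Reported"] = NA`
`> dat$smoke[dat$smoke %in% c("888", "999")] = NA`
`> dat$smoke = factor(dat$smoke)   # drop the now-unused factor levels`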
**Common Data Management Problem!**
We still need to change *age* to a numeric variable. Applying
`as.numeric` directly to a *factor* returns the underlying integer level
codes, not the original numbers.
`> as.numeric(dat$age)`
Instead, we must first convert it to character and then to numeric.
`> as.numeric(as.character(dat$age))`
`> dat$age = as.numeric(as.character(dat$age))`
### Solution
Another alternative is to read the data set using the `na.strings`
argument.
`> dat = read.csv("linmod.csv", na.strings=c("NA","Not Reported","888","999"))`
**Graphical Descriptions of Blood Pressure** Let's start by examining
the marginal distribution of Systolic Blood Pressure.
`> hist(dat$sbp, main="Histogram of Systolic Blood Pressure", xlab="Systolic BP")`
`> boxplot(dat$sbp, main="Boxplot of SBP", ylab="SBP")`
Now use the `plot` function to create a scatterplot of Systolic Blood
Pressure versus Diastolic Blood Pressure. Suppose we want Diastolic
Blood Pressure on the horizontal axis and Systolic Blood Pressure on the
vertical axis. With the formula interface used below, the variable on
the left of the `~` goes on the vertical axis and the variable on the
right goes on the horizontal axis.
`> plot(sbp ~ dbp, data=dat)`
`plot` has many arguments which will allow us to modify the graph. Let's
take a look at a few of them.
`> plot(sbp ~ dbp, data=dat, col="green")`
`> plot(sbp ~ dbp, data=dat, pch=2, col="green")`
`> plot(sbp ~ dbp, data=dat, pch=2, xlab="Diastolic",`
`+ ylab="Systolic", main="Blood Pressure", col="green")`
**Numerical Descriptions of Blood Pressure** Using the `sd`, `cov`, and
`cor` functions, we can investigate the marginal and joint variation of
Diastolic and Systolic Blood Pressure.
`> summary(dat$dbp)`
`> summary(dat$sbp)`
`> sd(dat$dbp)`
`> sd(dat$sbp)`
`> cov(dat$dbp, dat$sbp)`
`> cor(dat$dbp, dat$sbp)`
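Recall that the correlation is just the covariance rescaled by the two
standard deviations,
$$r = \frac{\mathrm{Cov}(dbp, sbp)}{s_{dbp}\, s_{sbp}},$$
which we can verify directly (assuming no missing values in either
column):
`> cov(dat$dbp, dat$sbp) / (sd(dat$dbp) * sd(dat$sbp))   # should equal cor(dat$dbp, dat$sbp)`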
## Simple Linear Regression using the `lm` function
In a simple linear regression, we propose the model:
$$Y = \beta_0 + \beta_1 X + \epsilon,$$ where $Y$ is the dependent
variable, $X$ is the sole independent variable, and $\epsilon$
represents a random component. One of the goals in a simple linear
regression is to find the estimates, $\hat{\beta_0}$ and
$\hat{\beta_1}$, that fit the data best.
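For a single predictor, the least squares estimates -- the values that
minimize the sum of squared residuals -- have the familiar closed forms
$$\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2},
\qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},$$
and these are exactly what `lm` computes for us.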
The **[R]{.sans-serif}** function used to fit regression models is the
`lm` function. Let's begin by doing a simple linear regression of
systolic blood pressure on diastolic blood pressure.
`> lm(sbp ~ dbp, data=dat)`
At first glance, **[R]{.sans-serif}** returns the estimated regression
parameters $\hat{\beta_0}$ and $\hat{\beta_1}$ but very little else.
What about the model $r^2$? How do we find the residuals? What about
confidence intervals, influential points, or the other diagnostics one
should consider when performing a regression analysis?
By storing the fitted model as an object, we are able to unleash all the
power in the `lm` function. Let's try again, but this time, store the
linear model as an object.
`> mymod = lm(sbp ~ dbp, data=dat)`
The variable `mymod` now stores the information from the regression of
`sbp` on `dbp`. `mymod` is a linear model object. Just as we earlier saw
examples of numeric objects (`x = 5`) and character objects
(`y = "Hi"`), we now have the object `mymod` which is a linear model
object. Let's verify that `mymod` is a linear model object.
`> class(mymod)`
Now that we have the "lm" object stored in `mymod`, let's do some more
investigation.
`> summary(mymod)`
The `summary` function returns the following:
- A 5-number summary of the residuals
- A table of regression coefficients, standard errors, t-statistics,
and p-values for testing the hypotheses that $\beta_i = 0$
- An estimate of the error standard deviation
- Unadjusted and adjusted model $r^2$
- An overall F-test of no model effect
We can use the `names` function to see everything that is stored in
`mymod`.
`> names(mymod)`
We can extract any single component using `$`.
`> mymod$coefficients`
`> mymod$fitted.values`
**[R]{.sans-serif}** has many "extractor" functions:
`> coef(mymod)`
`> fitted(mymod)`
**[R]{.sans-serif}** also has powerful graphing tools for checking model
assumptions. For a simple linear regression, we need to check for
- The nature of the relationship between $Y$ and $X$ (linear?)
- The error distribution
- Influential points
Using the `plot` function, we can cycle through diagnostic graphs to
test each of the above assumptions.
`> plot(predict(mymod), residuals(mymod), xlab="Fitted values", ylab="Residuals", main="Residual Plot")`
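The plot above is built by hand; calling `plot` directly on the fitted
`lm` object steps through **[R]{.sans-serif}**'s built-in diagnostic
plots (residuals vs. fitted, normal Q-Q, scale-location, and residuals
vs. leverage). A minimal sketch:
`> plot(mymod)                      # press Return to step through the plots`
`> par(mfrow=c(2,2)); plot(mymod)   # or view all four panels at once`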
Confidence intervals for the regression coefficients provide much more
information than p-values. Confidence intervals for $\beta_0$ and
$\beta_1$ can be generated using the `confint` function.
`> confint(mymod)`
`> confint(mymod, level=.90)`
We can also examine the ANOVA table associated with the regression
model.
`> anova(mymod)`
## Analysis of Variance
Suppose one wishes to fit a two-way ANOVA model, where diastolic blood
pressure is the response and gender and smoking status are the two
factors.
As always, it is important to begin by investigating the relationships
graphically.
`> boxplot(dbp ~ smoke, data=dat)`
`> boxplot(dbp ~ gen, data=dat)`
`> with(dat, interaction.plot(smoke, gen, dbp))`
Since an ANOVA model is simply a linear model where the only predictors
are dummy variables representing group membership, ANOVA models can be
fit using the `lm` function.
`> lm.mod = lm(dbp ~ gen + smoke, data=dat)`
`> summary(lm.mod)`
Many researchers prefer the output organized as a traditional ANOVA
table. We can also use the `aov` function to fit an ANOVA model.
`> aov.mod = aov(dbp ~ gen + smoke, data=dat)`
`> summary(aov.mod)`
`lm.mod` and `aov.mod` represent the same fit but the `summary` function
reports them differently. `summary` is an example of a generic function.
Different versions are used for different classes. Remember that we used
`summary` earlier to describe numeric vectors, factors, and data.frames.
Let's investigate the class of the two models.
`> class(lm.mod)`
`> class(aov.mod)`
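One way to see this dispatch in action is the `methods` function, which
lists the class-specific versions of a generic (or all the generics that
have methods for a given class):
`> methods(summary)`
`> methods(class="aov")`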
The `model.tables` function summarizes the fit for each factor level (by
default as effects; with `type="means"` it reports the grand mean and
the group means):
`> model.tables(aov.mod)`
`> ?model.tables`
We can use Tukey's HSD procedure to test the pairwise differences,
adjusting for multiple testing:
`> TukeyHSD(aov.mod)`
`> ?TukeyHSD`
## Multiple Linear Regression
A multiple linear regression assumes the following relationship:
$$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p + \epsilon$$
In this notation, $Y$ represents the response variable of interest, and
the $X_j$ correspond to predictor variables. When fitting a linear
regression model, one aims to estimate the $\beta$ parameters. The
fitted model is sometimes used to predict future responses (we return to
prediction after fitting a model below).
We will use the built-in LifeCycleSavings data set (which we saw in
passing earlier) to illustrate fitting a multiple linear regression. We
will use the popular `car` (Companion to Applied Regression) package for
some regression diagnostics. The `data` function lists the data sets
that come with **[R]{.sans-serif}** and any loaded packages.
`> data()`
Let's begin by learning about the data set.
`> help(LifeCycleSavings)`
Shorten the name.
`> Life=LifeCycleSavings`
Now let's make boxplots of the variables. For comparison, it is helpful
to view the plots together in a single graphics window. The `par`
function with its `mfrow` argument is useful for this: `par` sets
graphical parameters, and `mfrow` specifies the dimensions of a grid of
plots.
The following commands set up a two by two grid of plots.
`> par(mfrow=c(2,2))`
We can add boxplots one at a time.
`> boxplot(Life$sr, main="sr")`
`> boxplot(Life$pop15, main="pop15")`
`> boxplot(Life$pop75, main="pop75")`
`> boxplot(Life$dpi, main="dpi")`
After exploring the marginal distribution of each of the variables, it
is a good idea to investigate the bivariate relationships. We do this
both numerically and graphically.
`> cor(Life)`
`> pairs(Life, panel=panel.smooth)`
Now fit a multiple regression model using the `lm` function.
`> mr.mod = lm(sr ~ pop15 + pop75 + dpi + ddpi, data=Life)`
`> summary(mr.mod)`
`> confint(mr.mod)`
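As noted earlier, a fitted model can also be used to predict new
responses. A minimal sketch using `predict` (the predictor values below
are purely hypothetical):
`> new.country = data.frame(pop15=35, pop75=2.5, dpi=1500, ddpi=4)`
`> predict(mr.mod, newdata=new.country, interval="prediction")`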
We saw that there was a large negative correlation between pop75 and
pop15. One diagnostic for multicollinearity is the variance inflation
factor (VIF). The `vif` function is implemented in the `car` package; to
access it we must load the package:
`> library(car)`
`> ?vif`
`> vif(mr.mod)`
How does the model change if we leave out one of the variables? We fit a
different model removing pop75.
`> mr.mod2 = lm(sr ~ pop15 + dpi + ddpi, data=Life)`
We now perform an F test for nested models.
`> anova(mr.mod, mr.mod2, test="F")`
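For reference, the statistic that `anova` reports when comparing the
reduced model (without pop75) to the full model is
$$F = \frac{(RSS_{\text{reduced}} - RSS_{\text{full}})/(df_{\text{reduced}} - df_{\text{full}})}{RSS_{\text{full}}/df_{\text{full}}},$$
where $RSS$ is the residual sum of squares and $df$ the residual degrees
of freedom of each fit. A small p-value indicates that dropping pop75
significantly worsens the fit.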