MTH 412/512 - ch9_regression.Rmd

Example of using a linear model for (x,y) data

library(resampledata)    # package containing the Spruce data set
xdata <- Spruce$Ht.change
ydata <- Spruce$Di.change
head(Spruce,1)
##   Tree Competition Fertilizer Height0 Height5 Diameter0 Diameter5 Ht.change
## 1    1          NC          F      15      60  1.984375       7.4        45
##   Di.change
## 1  5.415625

correlation coefficient and least squares linear fit of data

rho <- cor(xdata,ydata)  # correlation coefficient
# least squares fit
xbar <- mean(xdata)
ybar <- mean(ydata)
sx <- sd(xdata)
b <- cov(xdata,ydata)/sx**2
a <- ybar - b*xbar

sample correlation coefficient is rho = 0.9020819

the least squares linear fit is y = a + bx with
intercept a = -0.518904, slope b = 0.1459492
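These formulas come from minimizing the sum of squared residuals; a brief sketch of the derivation:

```latex
\[
S(a,b) = \sum_{i=1}^{n} (y_i - a - b x_i)^2
\]
Setting both partial derivatives to zero gives the normal equations:
\[
\frac{\partial S}{\partial a} = -2\sum_{i=1}^{n}(y_i - a - b x_i) = 0
\quad\Rightarrow\quad a = \bar{y} - b\,\bar{x},
\]
\[
\frac{\partial S}{\partial b} = -2\sum_{i=1}^{n} x_i(y_i - a - b x_i) = 0
\quad\Rightarrow\quad
b = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sum_i (x_i-\bar{x})^2}
  = \frac{\operatorname{cov}(x,y)}{s_x^2}.
\]
```

The last equality holds because the sample covariance and sample variance both divide by \(n-1\), so the factors cancel in the ratio.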

check coefficients with the R linear model command lm()

data.lm <- lm(formula=ydata~xdata)
a <- data.lm$coefficients[1]
b <- data.lm$coefficients[2]

intercept a = -0.518904, slope b = 0.1459492
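The agreement between the closed-form coefficients and lm() can be verified on any data set. A self-contained sketch using synthetic data (the variable names and parameter values here are made up for illustration, not taken from the Spruce data):

```r
set.seed(42)
x <- runif(50, 0, 60)
y <- -0.5 + 0.15 * x + rnorm(50, sd = 0.5)

# closed-form least squares coefficients
b_manual <- cov(x, y) / var(x)
a_manual <- mean(y) - b_manual * mean(x)

# coefficients reported by lm()
fit <- lm(y ~ x)
a_lm <- unname(coef(fit)[1])   # intercept
b_lm <- unname(coef(fit)[2])   # slope

all.equal(c(a_manual, b_manual), c(a_lm, b_lm))   # agree to machine precision
```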

scatter plot

plot(xdata,ydata,col="blue")   # scatter plot of data
abline(data.lm,col='red')     # graph of linear model

residuals

y_predicted <- a+b*xdata      # predicted values
res <- ydata - y_predicted    # residuals
plot(xdata,res)               # plot of residuals
abline(h=0,col='red')         # horizontal line at zero

If the linear fit is valid, the residuals should appear random, with no apparent pattern as a function of x.
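Whatever the pattern looks like, least squares residuals always satisfy two algebraic properties: they sum to zero, and they are uncorrelated with the predictor. A self-contained sketch on synthetic data (names and values are illustrative, not the Spruce data):

```r
set.seed(1)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = 0.3)

fit <- lm(y ~ x)
res <- resid(fit)        # residuals from the fit

# Both quantities are zero up to floating point rounding,
# regardless of whether the linear model is actually appropriate.
sum(res)                 # essentially zero
cor(x, res)              # essentially zero
```

Because these hold automatically, they cannot be used to judge the fit; that is why the residual *plot* is the diagnostic tool.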

r-squared

r2 <- cor(xdata,ydata)**2     # r-squared

# r-squared as proportion of variance in y described by linear fit
n <- length(xdata)
sy <- sd(ydata)
factor <- n - 1
ssy <- factor*sy**2     # (n-1)*(variance in data) = total sum of squares
sfit <- sd(y_predicted)
ssfit <- factor*sfit**2 # (n-1)*(variance in fit)
r2_check <- ssfit/ssy   # r2 as ratio of var in fit to var in data

r2 = 0.8137518
ratio of variance fit/variance data = 0.8137518

The r-squared value indicates the strength of the linear fit, with \(0 \le r^2 \le 1\): \(r^2\) quantifies how much of the overall variance in the data is described by the variance of the predictions of the linear fit. If the data is perfectly fit, then \(r^2 = 1\).
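The identity behind the variance-ratio check above is the sum-of-squares decomposition, which holds for any least squares fit (using \(\hat{y}_i\) for the predicted values, and noting that the mean of the predictions equals \(\bar{y}\)):

```latex
\[
\underbrace{\sum_i (y_i - \bar{y})^2}_{\text{ssy}}
= \underbrace{\sum_i (\hat{y}_i - \bar{y})^2}_{\text{ssfit}}
+ \underbrace{\sum_i (y_i - \hat{y}_i)^2}_{\text{residual sum of squares}}
\]
so
\[
r^2 = \frac{\text{ssfit}}{\text{ssy}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\text{ssy}}.
\]
```

The cross term \(\sum_i (\hat{y}_i - \bar{y})(y_i - \hat{y}_i)\) vanishes because the residuals are uncorrelated with the fitted values, which is what makes the decomposition exact.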