Example of using a linear model for (x,y) data
library(resampledata) # Spruce data set
xdata <- Spruce$Ht.change
ydata <- Spruce$Di.change
head(Spruce,1)
## Tree Competition Fertilizer Height0 Height5 Diameter0 Diameter5 Ht.change
## 1 1 NC F 15 60 1.984375 7.4 45
## Di.change
## 1 5.415625
correlation coefficient and least squares linear fit of data
rho <- cor(xdata,ydata) # correlation coefficient
# least squares fit
xbar <- mean(xdata)
ybar <- mean(ydata)
sx <- sd(xdata)
b <- cov(xdata,ydata)/sx**2
a <- ybar - b*xbar
The sample correlation coefficient is rho = 0.9020819.
The least squares linear fit is y = a + bx with
intercept a = -0.518904, slope b = 0.1459492.
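As a minimal sketch on synthetic data (hypothetical x and y values, not the Spruce data), the closed-form slope b = cov(x,y)/var(x) and intercept a = ybar - b*xbar can be checked against R's `lm` coefficients:

```r
# Hypothetical data for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
b <- cov(x, y) / var(x)          # slope from the covariance formula
a <- mean(y) - b * mean(x)       # line passes through (xbar, ybar)
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(a, b))  # TRUE: both methods agree
```

Internally `lm` solves the same least squares problem, so the two coefficient vectors agree up to floating-point tolerance.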
check the coefficients with R's linear model `lm` command
data.lm <- lm(formula=ydata~xdata)
a <- data.lm$coefficients[1]
b <- data.lm$coefficients[2]
intercept a = -0.518904, slope b = 0.1459492
scatter plot
plot(xdata,ydata,col="blue") # scatter plot of data
abline(data.lm,col='red') # graph of linear model
residuals
y_predicted <- a+b*xdata # predicted values
res <- ydata - y_predicted # residuals
plot(xdata,res) # plot of residuals
abline(h=0,col='red') # horizontal line at zero
If the linear fit is valid, the residuals should appear random with no apparent pattern.
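To see what a failing residual check looks like, here is a minimal sketch (synthetic data, not the Spruce data) fitting a straight line to deliberately quadratic data; the residuals show a clear U-shaped pattern, signalling that a linear model is inadequate:

```r
# Hypothetical nonlinear data for illustration only
x <- seq(-3, 3, by = 0.5)
y <- x^2                          # deliberately nonlinear relationship
fit <- lm(y ~ x)
res <- residuals(fit)
plot(x, res)                      # residuals: positive at the ends, negative in the middle
abline(h = 0, col = "red")
```

The systematic curvature in the residual plot, rather than the size of any single residual, is what indicates the model is mis-specified.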
r-squared
r2 <- cor(xdata,ydata)**2 # r-squared
# r-squared as proportion of variance in y described by linear fit
n <-length(xdata)
sy <- sd(ydata)
factor <- (n-1)
ssy <- factor*sy**2 # (n-1)*(variance in data) = total sum of squares
sfit <- sd(y_predicted)
ssfit <- factor*sfit**2 # (n-1)*(variance in fit)
r2_check <- ssfit/ssy # r2 as ratio of var in fit to var in data
r2 = 0.8137518
ratio of variance fit/variance data = 0.8137518
The r-squared value indicates the strength of the linear fit, where \(0 \le r^2 \le 1\): \(r^2\) quantifies how much of the overall variance in the data is described by the variance of the predictions of the linear fit. If the data are fit perfectly, then \(r^2 = 1\).
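These two identities can be checked on a small synthetic data set (hypothetical values, not the Spruce data): \(r^2 = \mathrm{cor}(x,y)^2\) equals the ratio of the variance of the fitted values to the variance of the data, and exactly linear data give \(r^2 = 1\):

```r
# Hypothetical data: a line y = 3 + 2x plus small fixed perturbations
x <- 1:10
y <- 3 + 2 * x + c(0.2, -0.1, 0.3, -0.2, 0.1, -0.3, 0.2, -0.1, 0.1, -0.2)
fit <- lm(y ~ x)
r2 <- cor(x, y)^2
r2_ratio <- var(fitted(fit)) / var(y)  # proportion of variance explained
all.equal(r2, r2_ratio)                # TRUE: the two definitions agree
cor(x, 3 + 2 * x)^2                    # exactly 1 for perfectly linear data
```

The (n-1) factors in the sums of squares above cancel in the ratio, which is why the variance ratio and the squared correlation coincide.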