\(\text{Cov}(X, Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]\)
Covariance of a Sum:
\(\text{Cov}(X_1 + X_2, Y) = \text{Cov}(X_1, Y) + \text{Cov}(X_2, Y)\)
Variance of a Sum:
\(\text{Var}(X + Y) = \text{Var}(X) + 2\text{Cov}(X, Y) + \text{Var}(Y)\)
If \( X \perp Y \), then: \(\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)\)
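The variance-of-a-sum identity can be checked numerically. The following is a minimal sketch using made-up data (the lists `x` and `y` and the helper names are illustrative, not from the notes) and moments normalized by \( \frac{1}{n} \). The identity also holds exactly for sample moments, so the two sides should agree up to rounding.

```python
def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Covariance with 1/n normalization: average of (u - mean_u)(v - mean_v)."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

def var(v):
    """Variance is the covariance of a variable with itself."""
    return cov(v, v)

# Illustrative data, chosen arbitrarily.
x = [1.0, 2.0, 4.0, 7.0, 11.0]
y = [3.0, 1.0, 4.0, 1.0, 5.0]
s = [a + b for a, b in zip(x, y)]

lhs = var(s)                             # Var(X + Y)
rhs = var(x) + 2 * cov(x, y) + var(y)    # Var(X) + 2 Cov(X, Y) + Var(Y)
print(lhs, rhs)
```

Dropping the `2 * cov(x, y)` term is only justified when the covariance is zero, which is the independent case stated above.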
Example joint pdf: \( f(x, y) = \frac{3}{2}(x^2 + y^2), \quad 0 < x, y < 1 \)
Then: \(\text{Cov}(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y] = \frac{3}{8} - \frac{5}{8} \cdot \frac{5}{8} = -\frac{1}{64}\)
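The moments in this example can be reproduced exactly. The sketch below assumes only the joint pdf given above and uses \( \int_0^1 x^p \, dx = \frac{1}{p+1} \); the function name `moment` is a hypothetical helper introduced for illustration.

```python
from fractions import Fraction

def moment(a, b):
    """E[X^a Y^b] under f(x, y) = (3/2)(x^2 + y^2) on the unit square.

    Each term is a product of two one-dimensional integrals of the
    form integral of x^p over (0, 1) = 1 / (p + 1).
    """
    term_x2 = Fraction(1, (a + 3) * (b + 1))  # contribution of the x^2 part of f
    term_y2 = Fraction(1, (a + 1) * (b + 3))  # contribution of the y^2 part of f
    return Fraction(3, 2) * (term_x2 + term_y2)

total = moment(0, 0)   # total probability, should be 1
e_x = moment(1, 0)     # E[X]
e_y = moment(0, 1)     # E[Y]
e_xy = moment(1, 1)    # E[XY]
cov_xy = e_xy - e_x * e_y

print(total, e_x, e_y, e_xy, cov_xy)  # prints: 1 5/8 5/8 3/8 -1/64
```

Using `Fraction` keeps the arithmetic exact, so the result matches the hand computation without rounding.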
\(\rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X)} \cdot \sqrt{\text{Var}(Y)}}\)
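\(\rho\) is invariant to increasing linear transformations: \(\rho(a + bX, Y) = \rho(X, Y)\) for \( b > 0 \); for \( b < 0 \) the sign flips, \(\rho(a + bX, Y) = -\rho(X, Y)\)
Let \( X = \mu_X + \sigma_X Z_X, \quad Y = \mu_Y + \sigma_Y Z_Y \), where \( Z_X, Z_Y \) are standardized (mean 0, variance 1), then: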
\(\rho(X, Y) = \rho(Z_X, Z_Y) = \text{Cov}(Z_X, Z_Y)\)
\( |\rho| \leq 1 \), with equality if and only if \( Y = \pm\frac{\sigma_Y}{\sigma_X}X + a \) (\(\rho = 1\) for the plus sign, \(\rho = -1\) for the minus sign)
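These properties of \(\rho\) can be illustrated on data. The sketch below is an assumption-laden illustration, not part of the notes: the data and the helper `pearson` are made up, and the correlation is computed from sample moments with \( \frac{1}{n} \) normalization. It shows that \( a + bX \) leaves the correlation unchanged for positive \( b \), flips its sign for negative \( b \), and that an exact increasing linear relation gives correlation 1.

```python
import math

def pearson(u, v):
    """Correlation: covariance divided by the product of standard deviations."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / n
    su = math.sqrt(sum((a - mu) ** 2 for a in u) / n)
    sv = math.sqrt(sum((b - mv) ** 2 for b in v) / n)
    return cov / (su * sv)

# Illustrative data, chosen arbitrarily.
x = [1.0, 2.0, 4.0, 7.0, 11.0]
y = [3.0, 1.0, 4.0, 1.0, 5.0]

rho = pearson(x, y)
rho_pos = pearson([5 + 2 * a for a in x], y)    # b = 2 > 0: unchanged
rho_neg = pearson([5 - 2 * a for a in x], y)    # b = -2 < 0: sign flips
rho_line = pearson(x, [3 * a - 1 for a in x])   # exact line with positive slope

print(rho, rho_pos, rho_neg, rho_line)
```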
Sample correlation (an estimate of \(\rho\)): \( r = \frac{ \frac{1}{n} \sum (x_i - \bar{x})(y_i - \bar{y}) }{ \sqrt{\frac{1}{n} \sum (x_i - \bar{x})^2} \cdot \sqrt{\frac{1}{n} \sum (y_i - \bar{y})^2} } \approx \rho \)
Alternate form:
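\( r = \frac{ \left( \frac{1}{n} \sum x_i y_i \right) - \bar{x}\bar{y} }{ s_x s_y } \), where \( s_x, s_y \) are the standard deviations computed with the same \( \frac{1}{n} \) normalization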
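The two forms of \( r \) should give the same number. A minimal sketch on made-up data follows (the values of `x` and `y` are illustrative). It assumes that \( s_x \) and \( s_y \) use the \( \frac{1}{n} \) normalization, which is what makes the alternate form agree with the centered one.

```python
import math

# Illustrative data, chosen arbitrarily.
x = [1.0, 2.0, 4.0, 7.0, 11.0]
y = [3.0, 1.0, 4.0, 1.0, 5.0]
n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# Standard deviations with the 1/n normalization.
s_x = math.sqrt(sum((a - xbar) ** 2 for a in x) / n)
s_y = math.sqrt(sum((b - ybar) ** 2 for b in y) / n)

# Centered form: average product of deviations over s_x * s_y.
r_centered = (sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / n) / (s_x * s_y)

# Alternate form: mean of products minus product of means, over s_x * s_y.
r_alternate = (sum(a * b for a, b in zip(x, y)) / n - xbar * ybar) / (s_x * s_y)

print(r_centered, r_alternate)
```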
Interpretation of \( r \):
Model: \( \hat{y} = a + bx \)
Minimize squared error: \( S(a, b) = \sum (y_i - (a + b x_i))^2 \)
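Solution (setting \( \partial S / \partial a = 0 \) and \( \partial S / \partial b = 0 \)): \( a = \bar{y} - b\bar{x}, \quad b = \frac{ \sum x_i y_i - n\bar{x}\bar{y} }{ \sum x_i^2 - n\bar{x}^2 } \)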
Alternate form for slope:
\( b = \frac{ \sum (x_i - \bar{x})(y_i - \bar{y}) }{ \sum (x_i - \bar{x})^2 } = \frac{\text{Cov}(x, y)}{\text{Var}(x)} \)
\( b = r \cdot \frac{s_y}{s_x} \)
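The least-squares fit can be sketched in a few lines. The example below uses made-up data and hypothetical helper names (`fit_line`, `sse`); it takes the slope from the formula above and the intercept from the standard result \( a = \bar{y} - b\bar{x} \), then checks the relation \( b = r \cdot \frac{s_y}{s_x} \).

```python
import math

def fit_line(x, y):
    """Least-squares intercept a and slope b for the model y_hat = a + b x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sxy / sxx          # slope: Cov(x, y) / Var(x)
    a = ybar - b * xbar    # intercept: the line passes through (xbar, ybar)
    return a, b

def sse(a, b, x, y):
    """Squared error S(a, b) that the fit minimizes."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Illustrative data, chosen arbitrarily.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = fit_line(x, y)

# Cross-check the slope against b = r * s_y / s_x (1/n normalization).
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
s_x = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / n)
s_y = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / n)
r = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / n) / (s_x * s_y)
b_from_r = r * s_y / s_x

print(a, b, b_from_r, sse(a, b, x, y))
```

Perturbing `a` or `b` slightly and re-evaluating `sse` should never give a smaller value, which is a quick sanity check that the fit is the minimizer.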
\( r^2 = \frac{\text{variation in predicted values}}{\text{variation in observed values}} \)
\( r^2 \in [0, 1] \): indicates how well the regression explains the variation in \( y \)
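The reading of \( r^2 \) as a ratio of variations can be checked directly. In the sketch below (made-up data, illustrative variable names), "variation" is taken to mean the sum of squared deviations from \( \bar{y} \), which is an assumption about how the notes define it.

```python
# Illustrative data, chosen arbitrarily.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)

# Least-squares line and its predictions.
b = sxy / sxx
a = ybar - b * xbar
y_hat = [a + b * xi for xi in x]

# Variation in predicted values and in observed values, both about ybar.
variation_predicted = sum((yh - ybar) ** 2 for yh in y_hat)
variation_observed = syy

r = sxy / (sxx * syy) ** 0.5   # the 1/n factors cancel in r
ratio = variation_predicted / variation_observed

print(r ** 2, ratio)
```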
See the R code in ch9_regression.html for full derivations and examples.
Residual: \( \varepsilon_i = y_i - \hat{y}_i \)
Variance of residuals: \( s_e^2 = \frac{1}{n - 1} \sum (y_i - \hat{y}_i)^2 \) (many texts divide by \( n - 2 \) instead, since two parameters \( a, b \) are estimated)
Plot of residuals vs x should show random scatter without trend.
If residuals show a pattern (e.g., parabolic), consider nonlinear models.
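A small sketch of a residual check follows. It fits a line to deliberately parabolic made-up data (\( y = x^2 \), an illustrative choice) and computes the residuals and \( s_e^2 \) with the \( \frac{1}{n-1} \) normalization used above. The residuals come out positive at both ends and negative in the middle, which is the kind of pattern that suggests a nonlinear model.

```python
# Deliberately nonlinear (parabolic) data, for illustration only.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [xi ** 2 for xi in x]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Least-squares line y_hat = a + b x.
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
b = sxy / sxx
a = ybar - b * xbar

# Residuals: observed minus predicted.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Residual variance with the 1/(n - 1) normalization from the notes.
s_e2 = sum(e ** 2 for e in residuals) / (n - 1)

for xi, e in zip(x, residuals):
    print(xi, round(e, 6))
print("s_e^2 =", s_e2)
```

For a least-squares line with an intercept the residuals always sum to zero, so a zero mean says nothing about fit quality; the shape of the residuals against \( x \) is what matters.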
Outliers: Can affect small datasets; only remove if justified.
Section 9.4: Inference on regression coefficients \( a, b \), including error bars.