Chapter 10: Categorical Data

Section 10.1: Independence in Contingency Tables

A survey asked 200 students (100 in high school, HS, and 100 in college) whether they prefer Instagram or Snapchat:

Group	Insta	Snap	Total	% Insta
HS	34	66	100	34%
College	52	48	100	52%
Total	86	114	200	43%

Question: Are the differences in Instagram preference due to chance?

Hypotheses: \( H_0 \): group and preference are independent. The alternative is that they are associated.

If \( H_0 \) is true, then both groups should have the same Instagram preference rate, estimated by the pooled rate 86/200 = 43%.

Expected counts assuming independence (row total × column total / grand total):

Group	Insta	Snap	Total
HS	43	57	100
College	43	57	100
Total	86	114	200
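The expected counts can be reproduced from the table margins alone. The following is a minimal sketch in plain Python; the variable names are illustrative and not part of the original notes.

```python
# Observed counts from the survey table: each row is a group, columns are (Insta, Snap).
observed = {"HS": [34, 66], "College": [52, 48]}

row_totals = {group: sum(counts) for group, counts in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
grand_total = sum(col_totals)

# Expected count under independence: row total * column total / grand total.
expected = {
    group: [row_totals[group] * col / grand_total for col in col_totals]
    for group in observed
}

for group, counts in expected.items():
    print(group, counts)
```

Both groups get the same expected row because both have 100 students.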

If \( H_0 \) is true, then observed counts should be close to expected.

If group and preference are independent, then every permutation of the 200 preferences among the students is equally likely.

Steps:

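p-value: \( p = P(C \ge C_{\text{obs}}) \), computed assuming \( H_0 \), where \( C \) is the chi-squared statistic defined below and \( C_{\text{obs}} \) is its value for the observed table.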

If \( p \) is small, reject independence.
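One way to approximate this p-value by simulation is to shuffle the 200 preferences, hand the first 100 to HS and the rest to College, and record how often the shuffled table is at least as extreme as the observed one. The sketch below is an illustration under assumptions: it uses the chi-squared statistic from this section as \( C \), and the number of shuffles (5000), the random seed, and all names are arbitrary choices, not taken from the notes.

```python
import random

def chi_squared(table, expected):
    """Sum of (O - E)^2 / E over all cells of a table."""
    return sum(
        (o - e) ** 2 / e
        for row_o, row_e in zip(table, expected)
        for o, e in zip(row_o, row_e)
    )

observed = [[34, 66], [52, 48]]   # rows: HS, College; columns: Insta, Snap
expected = [[43, 57], [43, 57]]   # expected counts under independence
c_obs = chi_squared(observed, expected)

# Pool all 200 preferences; under H0 the group labels carry no information.
prefs = ["Insta"] * 86 + ["Snap"] * 114

rng = random.Random(0)            # fixed seed (assumption) for reproducibility
n_perm = 5000                     # number of shuffles (assumption)
n_extreme = 0
for _ in range(n_perm):
    rng.shuffle(prefs)
    hs, college = prefs[:100], prefs[100:]
    table = [
        [hs.count("Insta"), hs.count("Snap")],
        [college.count("Insta"), college.count("Snap")],
    ]
    # Small tolerance so tables tied with the observed one count as "at least as extreme".
    if chi_squared(table, expected) >= c_obs - 1e-9:
        n_extreme += 1

p_value = n_extreme / n_perm
print(p_value)
```

The simulated p-value varies with the seed and the number of shuffles, so it will only roughly agree with a p-value obtained from a reference distribution.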

Chi-squared statistic:

\[ C = \sum \frac{(O - E)^2}{E} \]

Example: for the observed and expected counts above:

\[ C = \frac{(34 - 43)^2}{43} + \frac{(66 - 57)^2}{57} + \frac{(52 - 43)^2}{43} + \frac{(48 - 57)^2}{57} \approx 6.6 \]
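The arithmetic can be checked in a few lines of Python. This is only a sketch; the cells are listed in the order HS-Insta, HS-Snap, College-Insta, College-Snap.

```python
# Observed and expected counts, cell by cell.
observed = [34, 66, 52, 48]
expected = [43, 57, 43, 57]

# Chi-squared statistic: sum of (O - E)^2 / E over all cells.
C = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(C, 2))  # 6.61
```

Every cell deviates from its expected count by exactly 9, so each numerator is 81.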

A large \( C \) means a large deviation between observed and expected counts, which is evidence that group matters.

Distribution under \( H_0 \): \( C \) is approximately chi-squared, with degrees of freedom determined by the number of rows \( r \) and columns \( c \):

\[ df = (r - 1)(c - 1) = (2 - 1)(2 - 1) = 1 \]

p-value:

\[ P(C \ge 6.6) \approx 0.01 \]
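For \( df = 1 \) the tail probability can be checked without a statistics library, because a chi-squared variable with one degree of freedom is the square of a standard normal, so \( P(C \ge x) = \operatorname{erfc}(\sqrt{x/2}) \). The sketch below relies on that identity and therefore applies only to the 1-degree-of-freedom case; the function name is illustrative.

```python
import math

def chi2_tail_df1(x):
    """Upper-tail probability P(C >= x) for a chi-squared variable with df = 1."""
    return math.erfc(math.sqrt(x / 2))

# Statistic recomputed from the observed and expected counts.
observed = [34, 66, 52, 48]
expected = [43, 57, 43, 57]
C = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

p_value = chi2_tail_df1(C)
print(round(p_value, 3))
```

For tables with more than one degree of freedom, a general chi-squared tail function from a statistics library would be needed instead.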

Conclusion: Reject \( H_0 \). The data provide evidence that preference is associated with group.