Chapter 10: Categorical Data

Section 10.1: Independence in Contingency Tables

A survey asked 200 students (HS and College) whether they prefer Instagram or Snapchat:

Group     Insta   Snap   Total   % Insta
HS           34     66     100       34%
College      52     48     100       52%
Total        86    114     200       43%

Question: Is the difference in Instagram preference between the two groups (34% vs. 52%) due to chance alone?

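Hypotheses:

  \( H_0 \): group and preference are independent.
  \( H_A \): group and preference are not independent.

If \( H_0 \) is true, then both groups should have the same Instagram preference rate, estimated by the pooled rate \( 86/200 = 43\% \).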

Expected Counts

Expected values assuming independence:

Group     Insta   Snap   Total
HS           43     57     100
College      43     57     100
Total        86    114     200

If \( H_0 \) is true, then observed counts should be close to expected.
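The expected counts above can be reproduced with a short sketch. It assumes the standard rule for independence, expected count = (row total × column total) / grand total, which the notes use implicitly. The names `observed` and `expected_counts` are illustrative and not part of the original notes.

```python
# Observed counts from the survey table: rows = groups, columns = (Insta, Snap)
observed = {
    "HS": [34, 66],
    "College": [52, 48],
}

def expected_counts(table):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = {group: sum(counts) for group, counts in table.items()}
    n_cols = len(next(iter(table.values())))
    col_totals = [sum(counts[j] for counts in table.values()) for j in range(n_cols)]
    grand_total = sum(row_totals.values())
    return {
        group: [row_totals[group] * col_totals[j] / grand_total for j in range(n_cols)]
        for group in table
    }

expected = expected_counts(observed)
print(expected)  # {'HS': [43.0, 57.0], 'College': [43.0, 57.0]}
```

Because both groups have 100 students, each group's expected Instagram count is simply 43% of 100.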

Section 10.2: Permutation Test of Independence

If group and preference are independent, then any permutation of preferences is equally likely.

Steps:

  1. Randomly permute preference labels among individuals
  2. For each permutation, compute the chi-squared statistic \( C_i \)
  3. Repeat to generate empirical distribution of \( C \)
  4. Compare observed \( C \) to this distribution

p-value: \( p = P(C \ge C_{\text{obs}}) \), estimated by the fraction of permutations with \( C_i \ge C_{\text{obs}} \).

If \( p \) is small, reject independence.
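The steps above can be sketched in Python using only the standard library. This is a minimal illustration, not a definitive implementation: the function names, the number of permutations (2000), and the fixed random seed are all assumptions chosen for this example, and the survey data are rebuilt as one label pair per student.

```python
import random

def chi_squared_stat(groups, prefs):
    """Chi-squared statistic for paired group/preference labels."""
    n = len(groups)
    group_levels = sorted(set(groups))
    pref_levels = sorted(set(prefs))
    # Count observations in each (group, preference) cell
    counts = {(g, p): 0 for g in group_levels for p in pref_levels}
    for g, p in zip(groups, prefs):
        counts[(g, p)] += 1
    row_totals = {g: sum(counts[(g, p)] for p in pref_levels) for g in group_levels}
    col_totals = {p: sum(counts[(g, p)] for g in group_levels) for p in pref_levels}
    stat = 0.0
    for g in group_levels:
        for p in pref_levels:
            e = row_totals[g] * col_totals[p] / n  # expected count under independence
            stat += (counts[(g, p)] - e) ** 2 / e
    return stat

def permutation_p_value(groups, prefs, n_permutations=2000, seed=0):
    """Estimate P(C >= C_obs) by shuffling preference labels among individuals."""
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    c_obs = chi_squared_stat(groups, prefs)
    shuffled = list(prefs)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # step 1: permute preference labels
        c_i = chi_squared_stat(groups, shuffled)  # step 2: statistic for this permutation
        if c_i >= c_obs - 1e-9:  # small tolerance for floating-point ties
            hits += 1
    return c_obs, hits / n_permutations  # steps 3-4: compare to empirical distribution

# One label pair per student, matching the survey table
groups = ["HS"] * 100 + ["College"] * 100
prefs = ["Insta"] * 34 + ["Snap"] * 66 + ["Insta"] * 52 + ["Snap"] * 48

c_obs, p_value = permutation_p_value(groups, prefs)
print(round(c_obs, 2), p_value)
```

The estimated p-value varies slightly with the seed and the number of permutations; using more permutations reduces that variability.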

Section 10.3: Chi-Square Test of Independence

Chi-squared statistic, summed over all cells of the table (\( O \) = observed count, \( E \) = expected count):

\[ C = \sum \frac{(O - E)^2}{E} \]

Example: using the observed and expected counts from Section 10.1:

\[ C = \frac{(34 - 43)^2}{43} + \frac{(66 - 57)^2}{57} + \frac{(52 - 43)^2}{43} + \frac{(48 - 57)^2}{57} \approx 6.6 \]

A large \( C \) means a large deviation between observed and expected counts, which is evidence that preference depends on group.
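As a quick check of the arithmetic, the statistic can be computed directly from the two tables. The variable names here are illustrative only.

```python
# Observed and expected counts, rows = (HS, College), columns = (Insta, Snap)
observed = [[34, 66], [52, 48]]
expected = [[43, 57], [43, 57]]

# Sum (O - E)^2 / E over every cell of the table
c_stat = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
print(round(c_stat, 2))  # 6.61
```

Every cell deviates from its expected count by exactly 9, so the four terms differ only in their denominators.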

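Distribution under \( H_0 \): \( C \) approximately follows a chi-squared distribution, with degrees of freedom determined by the number of rows \( r \) and columns \( c \):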

\[ df = (r - 1)(c - 1) = (2 - 1)(2 - 1) = 1 \]

p-value:

\[ P(C \ge 6.6) \approx 0.01 \]
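The tail probability can be checked without a statistics library, under one assumption that holds only for one degree of freedom: a chi-squared variable with \( df = 1 \) is the square of a standard normal, so \( P(C \ge x) = \operatorname{erfc}(\sqrt{x/2}) \). The function name below is hypothetical, and the sketch does not generalize to larger tables.

```python
import math

def chi2_sf_df1(x):
    """Upper-tail probability P(C >= x) for a chi-squared variable with df = 1.

    Uses the identity C = Z^2 for standard normal Z, so
    P(C >= x) = P(|Z| >= sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2))

# Statistic from the worked example (about 6.6)
c_stat = 2 * 81 / 43 + 2 * 81 / 57
p_value = chi2_sf_df1(c_stat)
print(round(p_value, 3))  # about 0.01
```

For tables with more than one degree of freedom, a library routine for the chi-squared distribution would be needed instead.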

Conclusion: The p-value is small, so reject \( H_0 \): Instagram preference is associated with group.