Statistics is driven by real applications:
Example: Flight delays (Sec. 1.1)
Statistical inference: Making conclusions about a population based on a sample.
Notation:
Population is infinite: a random sample consists of independent and identically distributed (i.i.d.) observations.
Example: Let \( X \in \{0, 1\} \); such a binary outcome follows a Bernoulli distribution:
\( X \sim \text{Bern}(p) \), where:
\[ X = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1 - p \end{cases} \]
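As a quick numerical illustration (not from the notes), a minimal NumPy sketch simulating Bernoulli draws; \( p = 0.3 \) and the seed are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3                                    # arbitrary success probability
x = (rng.random(100_000) < p).astype(int)  # Bern(p): 1 w.p. p, 0 w.p. 1 - p
print(x.mean())                            # sample proportion, close to p = 0.3
```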
Population is finite: the data may not come from a known random process.
Example: Population = \( \{1, 2, 3, 3, 7\} \) (size \( N = 5 \))
If \( N \gg n \), sampling with and without replacement yield similar results, as the sketch below suggests.
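A minimal simulation sketch of this point, assuming an arbitrary synthetic population of size \( N = 10{,}000 \) and samples of size \( n = 10 \); the sample-mean variability is nearly identical under both sampling schemes:

```python
import numpy as np

rng = np.random.default_rng(0)
pop = rng.normal(0, 1, 10_000)   # a large finite population, N = 10,000
n = 10                           # sample size with n << N

with_r    = [rng.choice(pop, n, replace=True).mean()  for _ in range(2_000)]
without_r = [rng.choice(pop, n, replace=False).mean() for _ in range(2_000)]
print(np.std(with_r), np.std(without_r))  # standard errors nearly identical
```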
Examples:
Survey: Ask people what they think or how they live
Sample survey: Use a sample from the population due to practicality
Example: General Social Survey (GSS)
Observational Study: Observe only, no intervention
Example: Beer and hot wings consumption (Sec. 1.9)
Experimental Study: Change conditions or give treatment
Example: Tree seedling growth under different fertilizer/competition (Sec. 1.10)
Caution: results from a non-random sample may not generalize to the population
Example: Tai Chi arthritis study (Sec. 1.11)
The sample space \( S \) is the set of all possible outcomes of a random experiment.
Example: Rolling a 6-sided die → \( S = \{1, 2, 3, 4, 5, 6\} \)
A discrete random variable maps outcomes to a countable set: \( X: S \rightarrow \{x_1, x_2, \ldots\} \)
A continuous random variable takes values in the real numbers: \( X: S \rightarrow \mathbb{R} \)
Probability is the relative frequency of an event if the experiment were repeated many times.
Example: For a fair die, \( P(1) = \frac{1}{6} \)
We can write \( P(A) = P(X \in A) \), where \( A \) is a set of outcomes.
\( P(A|B) = \frac{P(A \cap B)}{P(B)} \) for \( P(B) > 0 \): the probability of \( A \) given that \( B \) has occurred.
If \( \{B_1, B_2, \ldots, B_n\} \) is a partition of \( S \), then:
\( P(A) = \sum_{i=1}^n P(B_i) P(A|B_i) \)
Example (rolling a total of 3 with two dice): partition on the first die; only a first roll of 1 or 2 can be completed to a total of 3, each with conditional probability \( \frac{1}{6} \):
\( P(A) = \frac{1}{6} \cdot \frac{1}{6} + \frac{1}{6} \cdot \frac{1}{6} = \frac{2}{36} \)
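A sketch that just mirrors the total-probability sum over the first-die partition, using exact fractions:

```python
from fractions import Fraction

# Partition on the first die: B_i = {first roll is i}, P(B_i) = 1/6.
# Given B_i, a total of 3 needs the second die to show 3 - i.
p_A = sum(Fraction(1, 6) * (Fraction(1, 6) if 1 <= 3 - i <= 6 else Fraction(0))
          for i in range(1, 7))
print(p_A)   # 1/18, i.e. 2/36
```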
A discrete random variable has outcomes in a finite or countably infinite set.
The probability mass function (PMF): \( p(x) = P(X = x) \), and \( \sum p(x) = 1 \)
Example (sum of 2 dice): Distribution peaks at 7 with \( p(7) = \frac{6}{36} \)
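A short enumeration confirming the shape of this PMF (exact fractions over the 36 equally likely outcomes):

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Exact PMF of the sum of two fair dice.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
pmf = {s: Fraction(c, 36) for s, c in sorted(counts.items())}
print(pmf[7])             # 1/6, the peak at 7
print(sum(pmf.values()))  # 1, as a PMF must
```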
Binomial distribution: if \( X \sim \text{Bin}(n, p) \) counts the successes in \( n \) independent Bernoulli(\( p \)) trials, then
\( p(k) = \binom{n}{k} p^k (1 - p)^{n - k}, \quad k = 0, 1, \ldots, n \)
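A small sketch of this PMF using Python's math.comb; \( n = 10 \) and \( p = 0.3 \) are arbitrary illustrative values:

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

print(binom_pmf(3, 10, 0.3))                          # ~0.2668
print(sum(binom_pmf(k, 10, 0.3) for k in range(11)))  # 1.0 (up to rounding)
```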
\( \mathbb{E}[g(X)] = \sum_j g(x_j) p(x_j) \)
\( \mu = \mathbb{E}[X] \), \( \sigma^2 = \mathbb{E}[(X - \mu)^2] \)
PDF: \( f(x) \geq 0 \), \( \int_{-\infty}^{\infty} f(x) dx = 1 \)
\( P(a < X < b) = \int_a^b f(x) dx \)
\( F(x) = P(X \leq x) = \int_{-\infty}^{x} f(t) dt \)
Exponential distribution, \( X \sim \text{Exp}(\lambda) \) with rate \( \lambda > 0 \):
\( f(x) = \begin{cases} \lambda e^{-\lambda x}, & x \geq 0 \\ 0, & x < 0 \end{cases} \)
\( F(x) = 1 - e^{-\lambda x} \) for \( x \geq 0 \)
Mean: \( \mu = \frac{1}{\lambda} \), Variance: \( \sigma^2 = \frac{1}{\lambda^2} \)
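A simulation check of these facts, assuming an arbitrary rate \( \lambda = 2 \); note NumPy parameterizes the exponential by the scale \( 1/\lambda \):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0                                         # arbitrary rate
x = rng.exponential(scale=1 / lam, size=500_000)  # NumPy uses scale = 1/lambda
print(x.mean(), 1 / lam)       # sample mean vs mu = 1/lambda
print(x.var(), 1 / lam**2)     # sample variance vs sigma^2 = 1/lambda^2
print((x <= 1).mean(), 1 - np.exp(-lam * 1))  # empirical vs F(1)
```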
\( \text{Var}(X) = \mathbb{E}[X^2] - \mu^2 \)
\( \text{Var}(a + bX) = b^2 \cdot \text{Var}(X) \)
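Both identities are easy to confirm numerically; the distribution, shift \( a \), and scale \( b \) below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(3, 2, 500_000)   # any distribution illustrates the identities
a, b = 5.0, -1.5                # arbitrary shift and scale

print(np.mean(x**2) - np.mean(x)**2, np.var(x))  # E[X^2] - mu^2 == Var(X)
print(np.var(a + b * x), b**2 * np.var(x))       # Var(a + bX) == b^2 Var(X)
```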
Let \( X_1, X_2, \ldots, X_n \) be independent and identically distributed (i.i.d.) random variables with mean \( \mu \) and variance \( \sigma^2 \).
The sample mean is defined as \( \bar{X} = \frac{1}{n} \sum_{j=1}^{n} X_j \); it satisfies \( \mathbb{E}[\bar{X}] = \mu \) and \( \text{Var}(\bar{X}) = \frac{\sigma^2}{n} \).
Example (Bernoulli random variables): if \( X_i \sim \text{Bern}(p) \), then \( \mu = p \) and \( \sigma^2 = p(1 - p) \), so \( \mathbb{E}[\bar{X}] = p \) and \( \text{Var}(\bar{X}) = \frac{p(1 - p)}{n} \).
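A simulation sketch of this example, with arbitrary \( p = 0.3 \) and \( n = 25 \), comparing replicate sample means to the theoretical mean and variance:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 0.3, 25                                      # arbitrary choices
xbars = (rng.random((10_000, n)) < p).mean(axis=1)  # one sample mean per row
print(xbars.mean(), p)                # E[xbar] = p
print(xbars.var(), p * (1 - p) / n)   # Var(xbar) = p(1-p)/n
```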
Normal Distribution: \( X \sim \mathcal{N}(\mu, \sigma^2) \)
PDF: \( f(x) = \frac{1}{\sqrt{2\pi} \sigma} e^{- \frac{(x - \mu)^2}{2\sigma^2}} \)
Standard Normal Distribution: \( Z \sim \mathcal{N}(0, 1) \)
PDF: \( f(z) = \frac{1}{\sqrt{2\pi}} e^{- z^2 / 2} \)
Standardization: \( Z = \frac{X - \mu}{\sigma}, \quad X = \mu + \sigma Z \)
Cumulative Distribution Function (CDF): \( \Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} e^{-t^2 / 2} dt \)
Approximate probabilities (the 68-95-99.7 rule): \( P(|Z| \leq 1) \approx 0.68 \), \( P(|Z| \leq 2) \approx 0.95 \), \( P(|Z| \leq 3) \approx 0.997 \)
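These values can be read from a normal table or computed with SciPy; the \( X \sim \mathcal{N}(100, 15^2) \) standardization example below is a hypothetical illustration, not from the notes:

```python
from scipy.stats import norm

# Phi and the 68-95-99.7 rule:
for k in (1, 2, 3):
    print(norm.cdf(k) - norm.cdf(-k))   # ~0.683, ~0.954, ~0.997

# Standardization with hypothetical X ~ N(100, 15^2): P(X <= 110) two ways.
print(norm.cdf(110, loc=100, scale=15))  # directly
print(norm.cdf((110 - 100) / 15))        # via Z = (X - mu) / sigma
```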
Sums of Normal Random Variables: If \( X \sim \mathcal{N}(\mu_1, \sigma_1^2), Y \sim \mathcal{N}(\mu_2, \sigma_2^2) \), and they are independent:
\( X + Y \sim \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2) \)
Example: Weight of boys \( X \sim \mathcal{N}(100, 5^2) \), weight of girls \( Y \sim \mathcal{N}(90, 6^2) \), independent
Want: \( P(X - Y > 6) \). Since \( X - Y \sim \mathcal{N}(10, 5^2 + 6^2) = \mathcal{N}(10, 61) \),
\( P(X - Y > 6) = P\left(Z > \frac{6 - 10}{\sqrt{61}}\right) = P(Z > -0.51) \approx 0.70 \)
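The same answer via SciPy, as a one-line numerical check:

```python
from math import sqrt
from scipy.stats import norm

# X - Y ~ N(100 - 90, 5^2 + 6^2) = N(10, 61) by independence.
print(1 - norm.cdf(6, loc=10, scale=sqrt(61)))  # P(X - Y > 6) ~ 0.70
```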
Let \( X_1, \ldots, X_n \) be independent with \( X_i \sim \mathcal{N}(\mu_i, \sigma_i^2) \), and define \( X = \sum a_i X_i \). Then \( X \sim \mathcal{N}\left( \sum a_i \mu_i, \sum a_i^2 \sigma_i^2 \right) \).
Corollary: If \( X_i \sim \mathcal{N}(\mu_0, \sigma_0^2) \), then:
\( \bar{X} = \frac{1}{n} \sum X_i \sim \mathcal{N}(\mu_0, \frac{\sigma_0^2}{n}) \)
Example: Coffee volume \( \sim \mathcal{N}(8, 0.47) \), sample size \( n = 10 \): by the corollary, \( \bar{X} \sim \mathcal{N}(8, 0.47/10) \).
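A sketch of this example in SciPy, reading 0.47 as the variance (consistent with the \( \mathcal{N}(\mu, \sigma^2) \) notation used above); the 7.8 oz query is hypothetical, since the original question is not stated:

```python
from math import sqrt
from scipy.stats import norm

mu, var, n = 8, 0.47, 10   # 0.47 read as the variance, per N(mu, sigma^2)
se = sqrt(var / n)         # standard deviation of xbar
# Hypothetical query: P(average of 10 cups is below 7.8 oz)
print(norm.cdf(7.8, loc=mu, scale=se))
```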
Moment Generating Function: \( M(t) = \mathbb{E}[e^{tX}] \)
\( \frac{d^n}{dt^n} M(t) \bigg|_{t = 0} = \mathbb{E}[X^n] \)
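A symbolic sketch with SymPy, using the Bernoulli MGF \( M(t) = 1 - p + p e^t \) as a simple test case (the choice of distribution is an illustrative assumption):

```python
import sympy as sp

t, p = sp.symbols("t p", positive=True)
M = (1 - p) + p * sp.exp(t)           # MGF of X ~ Bern(p): E[e^{tX}]
print(sp.diff(M, t, 1).subs(t, 0))    # E[X]   = p
print(sp.diff(M, t, 2).subs(t, 0))    # E[X^2] = p
# Hence Var(X) = E[X^2] - (E[X])^2 = p - p^2 = p(1 - p).
```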