Chapter 1: Data and Case Studies

Section 1.1: Introduction to Data

Statistics is driven by real applications:

Example (Sec 1.1: Flight Delays):


Section 1.2: Types of Data

Statistical inference: Making conclusions about a population based on a sample.

Notation:


Section 1.3: Observations

Observations of a Random Process

Population is infinite. A random sample means independent and identically distributed (i.i.d.) observations.

Example: Let \( X \in \{0, 1\} \) → Bernoulli distribution:

\( X \sim \text{Bern}(p) \), where:

\[ X = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1 - p \end{cases} \]
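As a quick sanity check on the Bernoulli model, the following sketch (with an assumed value of \( p \) and a hypothetical helper name) simulates i.i.d. draws and recovers \( p \) as the sample proportion:

```python
import random

def bernoulli_sample(p, n, seed=0):
    """Draw n i.i.d. Bernoulli(p) observations (hypothetical helper)."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(n)]

draws = bernoulli_sample(p=0.3, n=100_000)
estimate = sum(draws) / len(draws)  # sample proportion, approximates p
```

With 100,000 draws the sample proportion should land within about \( \pm 0.005 \) of the true \( p \).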

Observations of Data

Population is finite. May not come from a known random process.

Example: Population = \{1, 2, 3, 3, 7\} (size \( N = 5 \))

If \( N \gg n \), sampling from the finite population behaves like i.i.d. sampling, so both approaches yield similar results.


Section 1.4: Statistics vs Parameters

Examples:


Section 1.5: Surveys

Survey: Ask people what they think or how they live

Sample survey: Use a sample from the population due to practicality

Example: General Social Survey (GSS)


Section 1.6: Observational vs Experimental Studies

Observational Study: Observe only, no intervention

Example: Beer and hot wings consumption (Sec. 1.9)

Experimental Study: Change conditions or give treatment

Example: Tree seedling growth under different fertilizer/competition (Sec. 1.10)

Caution: Non-random samples → results not generalizable

Example: Tai Chi arthritis study (Sec. 1.11)

Probability Review & Statistics

Fundamentals of Probability

The sample space \( S \) is the set of all possible outcomes of a random experiment.

Example: Rolling a 6-sided die → \( S = \{1, 2, 3, 4, 5, 6\} \)

A discrete random variable maps outcomes to a countable set: \( X: S \rightarrow \{x_1, x_2, \ldots\} \)

A continuous random variable takes values in the real numbers: \( X: S \rightarrow \mathbb{R} \)

Probability can be interpreted as the long-run relative frequency of an event if the experiment were repeated many times.

Example: For a fair die, \( P(1) = \frac{1}{6} \)

We can write \( P(A) = P(X \in A) \), where \( A \) is a set of outcomes.

Conditional Probability

\( P(A|B) = \frac{P(A \cap B)}{P(B)} \): the probability of \( A \) given \( B \) has occurred.

Law of Total Probability

If \( \{B_1, B_2, \ldots, B_n\} \) is a partition of \( S \), then:

\( P(A) = \sum_{i=1}^n P(B_i) P(A|B_i) \)

Example (rolling a total of 3 with two dice): partition on the value of the first die. Only first-die values 1 and 2 can be completed to a total of 3 (by a 2 and a 1, respectively), so:

\( P(A) = \frac{1}{6} \cdot \frac{1}{6} + \frac{1}{6} \cdot \frac{1}{6} = \frac{2}{36} \)

Discrete Random Variables and PMF

A discrete random variable has outcomes in a finite or countably infinite set.

The probability mass function (PMF): \( p(x) = P(X = x) \), and \( \sum p(x) = 1 \)

Example (sum of 2 dice): Distribution peaks at 7 with \( p(7) = \frac{6}{36} \)
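The full PMF of the sum of two dice can be recovered by enumerating all 36 equally likely outcomes; a minimal sketch:

```python
from collections import Counter
from fractions import Fraction

# Enumerate all 36 equally likely outcomes of two fair dice.
totals = Counter(a + b for a in range(1, 7) for b in range(1, 7))

# Exact PMF of the sum: p(s) = (# outcomes with total s) / 36
pmf = {s: Fraction(c, 36) for s, c in totals.items()}
```

This confirms the peak \( p(7) = \frac{6}{36} \), the earlier \( p(3) = \frac{2}{36} \), and that the PMF sums to 1.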

Binomial Distribution

\( p(k) = \binom{n}{k} p^k (1 - p)^{n - k}, \quad k = 0, 1, \ldots, n \)
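The binomial PMF above translates directly into code; this sketch evaluates it for a fair coin flipped 10 times (the parameters are illustrative):

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p): C(n, k) p^k (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Number of heads in 10 fair coin flips
probs = [binom_pmf(k, 10, 0.5) for k in range(11)]
```

The probabilities over \( k = 0, \ldots, n \) should sum to 1, and \( p(5) = \binom{10}{5}/2^{10} = 252/1024 \).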

Expected Value and Variance

\( \mathbb{E}[g(X)] = \sum_j g(x_j) p(x_j) \)

\( \mu = \mathbb{E}[X] \), \( \sigma^2 = \mathbb{E}[(X - \mu)^2] \)
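Applying these definitions to a fair six-sided die gives exact values; a small sketch:

```python
from fractions import Fraction

# PMF of a fair six-sided die: p(x) = 1/6 for x = 1, ..., 6
support = range(1, 7)
p = Fraction(1, 6)

mu = sum(x * p for x in support)               # E[X]
var = sum((x - mu) ** 2 * p for x in support)  # E[(X - mu)^2]
```

Here \( \mu = 7/2 \) and \( \sigma^2 = 35/12 \).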

Continuous Random Variables and PDF

PDF: \( f(x) \geq 0 \), \( \int_{-\infty}^{\infty} f(x) dx = 1 \)

\( P(a < X < b) = \int_a^b f(x) dx \)

Cumulative Distribution Function (CDF)

\( F(x) = P(X \leq x) = \int_{-\infty}^{x} f(t) dt \)

Exponential Distribution

\( f(x) = \begin{cases} \lambda e^{-\lambda x}, & x \geq 0 \\ 0, & x < 0 \end{cases} \)

\( F(x) = 1 - e^{-\lambda x}, \quad x \geq 0 \)

Mean: \( \mu = \frac{1}{\lambda} \), Variance: \( \sigma^2 = \frac{1}{\lambda^2} \)
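The stated mean and variance can be checked numerically by integrating against the density (a rough Riemann-sum sketch with an assumed rate \( \lambda = 2 \)):

```python
from math import exp

lam = 2.0
dx = 1e-4
# Integrate over [0, 50); the tail beyond 50 is negligible for lambda = 2.
xs = (i * dx for i in range(500_000))

mean = 0.0
second = 0.0
for x in xs:
    w = lam * exp(-lam * x) * dx  # f(x) dx
    mean += x * w                 # contributes to E[X]
    second += x * x * w           # contributes to E[X^2]

var = second - mean**2            # variance shortcut: E[X^2] - mu^2
```

The results should be close to \( 1/\lambda = 0.5 \) and \( 1/\lambda^2 = 0.25 \).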

Shortcut for Variance

\( \text{Var}(X) = \mathbb{E}[X^2] - \mu^2 \)

\( \text{Var}(a + bX) = b^2 \cdot \text{Var}(X) \)


Sample Mean of Random Variables

Let \( X_1, X_2, \ldots, X_n \) be independent and identically distributed (i.i.d.) random variables with mean \( \mu \) and variance \( \sigma^2 \).

The sample mean is defined as \( \bar{X} = \frac{1}{n} \sum_{j=1}^{n} X_j \).

Example: Bernoulli Random Variable

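The Bernoulli case is presumably the intended example here; a sketch of how it works out. By linearity and independence, \( \mathbb{E}[\bar{X}] = \mu \) and \( \text{Var}(\bar{X}) = \sigma^2 / n \). For i.i.d. \( X_j \sim \text{Bern}(p) \), we have \( \mu = p \) and \( \sigma^2 = p(1-p) \), so

\[ \mathbb{E}[\bar{X}] = p, \qquad \text{Var}(\bar{X}) = \frac{p(1-p)}{n}. \]

Here \( \bar{X} \) is the sample proportion of successes, and its variance shrinks at rate \( 1/n \).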

Normal Distribution

Normal Distribution: \( X \sim \mathcal{N}(\mu, \sigma^2) \)

PDF: \( f(x) = \frac{1}{\sqrt{2\pi} \sigma} e^{- \frac{(x - \mu)^2}{2\sigma^2}} \)

Standard Normal Distribution: \( Z \sim \mathcal{N}(0, 1) \)

PDF: \( f(z) = \frac{1}{\sqrt{2\pi}} e^{- z^2 / 2} \)

Standardization: \( Z = \frac{X - \mu}{\sigma}, \quad X = \mu + \sigma Z \)

Cumulative Distribution Function (CDF): \( \Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} e^{-t^2 / 2} dt \)
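Since \( \Phi \) has no closed form, it is typically evaluated numerically; the identity \( \Phi(z) = \frac{1}{2}\left(1 + \operatorname{erf}(z/\sqrt{2})\right) \) gives a minimal sketch:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Standardization: for X ~ N(mu, sigma^2), P(X <= x) = phi((x - mu) / sigma)
p_within_1sd = phi(1.0) - phi(-1.0)  # P(|Z| <= 1)
```

By symmetry \( \Phi(0) = 0.5 \), and \( P(|Z| \leq 1) \approx 0.683 \).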

Approximate Probabilities: \( P(|Z| \leq 1) \approx 0.68 \), \( P(|Z| \leq 2) \approx 0.95 \), \( P(|Z| \leq 3) \approx 0.997 \)

Sums of Normal Random Variables: If \( X \sim \mathcal{N}(\mu_1, \sigma_1^2), Y \sim \mathcal{N}(\mu_2, \sigma_2^2) \), and they are independent:

\( X + Y \sim \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2) \)


Normal Distribution Application Example

Example: Weight of boys \( X \sim \mathcal{N}(100, 5^2) \), weight of girls \( Y \sim \mathcal{N}(90, 6^2) \)

Want: \( P(X - Y > 6) \)
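Using the sum rule (with \( -Y \sim \mathcal{N}(-90, 6^2) \), since negation preserves variance), \( X - Y \sim \mathcal{N}(10, 61) \). A sketch of the computation:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# D = X - Y ~ N(100 - 90, 5**2 + 6**2) = N(10, 61) by independence
mu_d, var_d = 100 - 90, 5**2 + 6**2

# P(X - Y > 6) = P(Z > (6 - 10) / sqrt(61)) = 1 - Phi(-0.512...)
prob = 1.0 - phi((6 - mu_d) / sqrt(var_d))
```

This gives \( P(X - Y > 6) \approx 0.696 \).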


Sum of Independent Normal Variables & Moment Generating Functions

Let \( X_i \sim \mathcal{N}(\mu_i, \sigma_i^2) \) be independent, and define \( X = \sum a_i X_i \). Then \( X \sim \mathcal{N}\left(\sum a_i \mu_i, \sum a_i^2 \sigma_i^2\right) \).

Corollary: If \( X_i \sim \mathcal{N}(\mu_0, \sigma_0^2) \), then:

\( \bar{X} = \frac{1}{n} \sum X_i \sim \mathcal{N}(\mu_0, \frac{\sigma_0^2}{n}) \)

Example: Coffee volume \( \sim \mathcal{N}(8, 0.47) \), sample size \( n = 10 \): by the corollary, \( \bar{X} \sim \mathcal{N}(8, 0.47/10) \).
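A sketch of how a probability for the coffee example might be computed, reading 0.47 as the variance (per the \( \mathcal{N}(\mu, \sigma^2) \) convention) and posing a hypothetical question about \( \bar{X} \):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# By the corollary, Xbar ~ N(8, 0.47 / 10); 0.47 is assumed to be the variance.
mu, var_xbar = 8.0, 0.47 / 10
sd = sqrt(var_xbar)

# Hypothetical question: P(7.8 < Xbar < 8.2)
prob = phi((8.2 - mu) / sd) - phi((7.8 - mu) / sd)
```

Averaging 10 cups cuts the variance by a factor of 10, which is why \( \bar{X} \) concentrates this tightly around 8.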

Moment Generating Function: \( M(t) = \mathbb{E}[e^{tX}] \)

\( \frac{d^n}{dt^n} M(t) \bigg|_{t = 0} = \mathbb{E}[X^n] \)
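The MGF gives a clean proof of the sums-of-normals result above. For \( X \sim \mathcal{N}(\mu, \sigma^2) \),

\[ M_X(t) = \exp\left(\mu t + \tfrac{1}{2}\sigma^2 t^2\right), \]

and for independent \( X \) and \( Y \), \( M_{X+Y}(t) = M_X(t)\, M_Y(t) = \exp\left((\mu_1 + \mu_2) t + \tfrac{1}{2}(\sigma_1^2 + \sigma_2^2) t^2\right) \), which is exactly the MGF of \( \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2) \).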