Chapter 2: Summary Statistics & Exploratory Data Analysis

Section 2.1: Basic Statistics

Data: A sample \( \{x_1, x_2, \ldots, x_n\} \) of \( n \) observations, e.g., \( \{8, 3, 14, 1, 5, 7, 21, 4, 10, 3\} \) with \( n = 10 \).

Ordered Data: \( \{1, 3, 3, 4, 5, 7, 8, 10, 14, 21\} \)

See supporting material: ch02_basicstats.html

Section 2.2: Standard Deviation

Sample Standard Deviation:

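\( s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \)

For the example data, \( \bar{x} = 7.6 \) and \( \sum_{i=1}^{n} (x_i - \bar{x})^2 = 332.4 \), so \( s = \sqrt{\frac{1}{9}(332.4)} \approx 6.077 \).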

Population Standard Deviation:

\( s_{\text{pop}} = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2} \)

Since \( s_{\text{pop}} = s \sqrt{(n-1)/n} \), we have \( s_{\text{pop}} < s \) whenever the data are not all equal, and \( \lim_{n \to \infty} \frac{s_{\text{pop}}}{s} = 1 \).
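As a quick numerical check of the two formulas, the following sketch applies Python's standard `statistics` module to the example data from Section 2.1. The variable names are illustrative only.

```python
import math
import statistics

data = [8, 3, 14, 1, 5, 7, 21, 4, 10, 3]
n = len(data)

s = statistics.stdev(data)       # sample standard deviation (divides by n - 1)
s_pop = statistics.pstdev(data)  # population standard deviation (divides by n)

print(round(s, 3))      # 6.077
print(round(s_pop, 3))  # 5.765

# The two values differ only by the factor sqrt((n - 1) / n).
print(math.isclose(s_pop / s, math.sqrt((n - 1) / n)))  # True
```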

Section 2.3: Quartiles and Five-Number Summary

Section 2.4: Box Plot

Interquartile Range (IQR): \( Q_3 - Q_1 = 6.25 \)

Upper Fence: \( Q_3 + 1.5 \cdot \text{IQR} = 18.875 \)

Lower Fence: \( Q_1 - 1.5 \cdot \text{IQR} = -6.125 \)

Outliers are data points outside the fences.

The box shows \( Q_1 \), median, and \( Q_3 \); whiskers extend to the most extreme data points within the fences.
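The box-plot quantities above can be reproduced with a short sketch. It assumes quartiles are found by linear interpolation between order statistics (`method="inclusive"` in Python's `statistics.quantiles`); that convention is an assumption, chosen because it reproduces the IQR and fence values quoted above. Other quartile conventions give slightly different numbers.

```python
import statistics

data = [8, 3, 14, 1, 5, 7, 21, 4, 10, 3]

# Assumed convention: linear interpolation between order statistics.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

five_number_summary = (min(data), q1, median, q3, max(data))
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Whiskers end at the most extreme observations that lie inside the fences.
inside = [x for x in data if lower_fence <= x <= upper_fence]
whiskers = (min(inside), max(inside))

print(five_number_summary)            # (1, 3.25, 6.0, 9.5, 21)
print(iqr, lower_fence, upper_fence)  # 6.25 -6.125 18.875
print(outliers)                       # [21]
print(whiskers)                       # (1, 14)
```

Here the observation 21 lies above the upper fence, so the upper whisker stops at 14.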

Section 2.5: Histogram

Data are grouped into bins; the height of each bar is the number (or proportion) of observations that fall in that bin.
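A minimal sketch of the binning step, using the example data from Section 2.1. The bin width of 5 and the half-open intervals are assumptions made for illustration; the notes do not specify the bins.

```python
from collections import Counter

data = [8, 3, 14, 1, 5, 7, 21, 4, 10, 3]
bin_width = 5  # assumed for illustration; not specified in the notes

# Bin k covers the half-open interval [k * bin_width, (k + 1) * bin_width).
counts = Counter(x // bin_width for x in data)

# Print a text histogram, including empty bins.
for k in range(max(counts) + 1):
    left, right = k * bin_width, (k + 1) * bin_width
    print(f"[{left:2d}, {right:2d}) {'#' * counts[k]}")
```

With these assumed bins the counts are 4, 3, 2, 0 and 1, so the empty bin between 15 and 20 makes the isolated value 21 stand out.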

Section 2.6: Empirical CDF

Definition: \( \hat{F}(x) = \frac{1}{n} \cdot \#\{ i : x_i \leq x \} \), i.e., the fraction of observations that are \( \leq x \).

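For sorted data \( \{0, 3, 3, 5, 7\} \) (\( n = 5 \)), \( \hat{F} \) is a step function that jumps at each distinct data value:

\( \hat{F}(x) = 0 \) for \( x < 0 \); \( 1/5 \) for \( 0 \le x < 3 \); \( 3/5 \) for \( 3 \le x < 5 \); \( 4/5 \) for \( 5 \le x < 7 \); \( 1 \) for \( x \ge 7 \).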
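The definition of \( \hat{F} \) can be evaluated directly by counting. The sketch below is one possible implementation; the helper name `ecdf` is hypothetical.

```python
from bisect import bisect_right

def ecdf(sample):
    """Return a function x -> fraction of observations in sample that are <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def F_hat(x):
        # bisect_right gives the number of sorted values that are <= x.
        return bisect_right(ordered, x) / n

    return F_hat

F_hat = ecdf([0, 3, 3, 5, 7])
print([F_hat(x) for x in (-1, 0, 3, 4, 5, 7)])  # [0.0, 0.2, 0.6, 0.6, 0.8, 1.0]
```

The repeated value 3 produces a jump of 2/5 at \( x = 3 \), twice the size of the other jumps.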

See exploration demo: ch02_explore_data.html


Section 2.7: Quantiles and Distributions

Quantile Definition:

Let \( X \) be a continuous random variable with CDF \( F(x) = P(X \leq x) \). For \( 0 < p < 1 \), the \( p \)-th quantile \( q_p \) satisfies:

\( P(X \leq q_p) = p \), i.e., \( F(q_p) = p \)

Example (Standard Normal):

\( X \sim N(0, 1) \), then \( F(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-z^2/2} dz \)
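The normal CDF has no closed form, so quantiles are found numerically. As a sketch, Python's standard `statistics.NormalDist` provides both \( F \) (`cdf`) and its inverse (`inv_cdf`); the choice \( p = 0.975 \) is just an example.

```python
from statistics import NormalDist

Z = NormalDist(mu=0.0, sigma=1.0)  # standard normal

p = 0.975
q_p = Z.inv_cdf(p)           # quantile: the x for which F(x) = p
print(round(q_p, 3))         # 1.96
print(round(Z.cdf(q_p), 6))  # 0.975
```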

Section 2.8: Solving Quantiles from PDF

Given \( f(x) = e^{-x} \) for \( x \ge 0 \), find \( q_{0.75} \):

\( \int_0^{q_{0.75}} e^{-x} \, dx = 0.75 \Rightarrow 1 - e^{-q_{0.75}} = 0.75 \Rightarrow e^{-q_{0.75}} = 0.25 \Rightarrow q_{0.75} = \ln(4) \approx 1.386 \)
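The same steps work for any \( p \) in \( [0, 1) \): solving \( 1 - e^{-q} = p \) gives \( q = -\ln(1 - p) \). A small sketch to check the worked value numerically (the function names are hypothetical):

```python
import math

def exp_cdf(x):
    """CDF of the density f(x) = exp(-x) on x >= 0."""
    return 1.0 - math.exp(-x) if x >= 0 else 0.0

def exp_quantile(p):
    """Solve 1 - exp(-q) = p for q; valid for 0 <= p < 1."""
    return -math.log(1.0 - p)

q = exp_quantile(0.75)
print(round(q, 3))                     # 1.386
print(math.isclose(exp_cdf(q), 0.75))  # True
```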

Section 2.9: Scaling of Quantiles in Normal Distribution

Let \( Z \sim N(0,1) \), \( X \sim N(\mu, \sigma^2) \)

\( X = \mu + \sigma Z \Rightarrow q_p^{(X)} = \mu + \sigma q_p^{(Z)} \)

Thus, plotting \( q_p^{(X)} \) vs. \( q_p^{(Z)} \) gives a line with slope \( \sigma \) and intercept \( \mu \).
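The linear relation can be checked numerically. In this sketch the values \( \mu = 10 \) and \( \sigma = 2 \) are hypothetical, chosen only for illustration.

```python
from statistics import NormalDist

mu, sigma = 10.0, 2.0  # hypothetical parameters for illustration
Z = NormalDist()       # N(0, 1)
X = NormalDist(mu, sigma)

probs = [0.1, 0.25, 0.5, 0.75, 0.9]
for p in probs:
    q_z = Z.inv_cdf(p)
    q_x = X.inv_cdf(p)
    # The last two columns should agree: q_x = mu + sigma * q_z.
    print(p, round(q_z, 4), round(q_x, 4), round(mu + sigma * q_z, 4))
```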

Section 2.10: Quantile-Quantile (Q-Q) Plot

Purpose: Compare sample data to a theoretical distribution

Steps:
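One common way to build the points of a normal Q-Q plot is sketched below, using the example data from Section 2.1. The plotting positions \( (i - 0.5)/n \) are an assumption; several other conventions are in use, and the notes do not say which one is intended.

```python
from statistics import NormalDist

data = [8, 3, 14, 1, 5, 7, 21, 4, 10, 3]

# 1. Sort the sample; the order statistics are the sample quantiles.
sample_q = sorted(data)
n = len(sample_q)

# 2. Attach a probability to each order statistic.
#    (i - 0.5) / n is one common convention, assumed here.
probs = [(i - 0.5) / n for i in range(1, n + 1)]

# 3. Evaluate the reference (standard normal) quantiles at those probabilities.
theory_q = [NormalDist().inv_cdf(p) for p in probs]

# 4. The Q-Q plot is the scatter of (theory_q, sample_q) pairs.
for t, d in zip(theory_q, sample_q):
    print(f"{t:7.3f}  {d:3d}")
```

Passing the two lists to any plotting library then gives the Q-Q plot itself.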

Section 2.11: Interpreting the Q-Q Plot

If the points lie close to the line \( y = x \), the data are consistent with the reference distribution itself (same location and scale).

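For normal data \( N(\mu, \sigma^2) \) compared against \( N(0,1) \), the data quantiles \( q^{(d)} \) are a linear function of the theoretical quantiles \( q^{(t)} \):

\( q^{(d)} = \mu + \sigma q^{(t)} \)

Conclusion: You do not need to know \( \mu \) and \( \sigma \) to assess normality. Points that fall along any straight line indicate normal data; the slope of that line estimates \( \sigma \) and the intercept estimates \( \mu \).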
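To illustrate, the sketch below simulates normal data, builds the Q-Q points against \( N(0,1) \), and fits a least-squares line through them. The parameter values, sample size, seed, and plotting positions \( (i - 0.5)/n \) are all assumptions made for the example.

```python
import random
from statistics import NormalDist, mean

random.seed(0)
true_mu, true_sigma = 10.0, 2.0  # hypothetical values, used only to simulate data
sample = sorted(random.gauss(true_mu, true_sigma) for _ in range(2000))
n = len(sample)

# Standard normal quantiles at the assumed plotting positions (i - 0.5) / n.
theory = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# Ordinary least-squares line through the Q-Q points.
t_bar, d_bar = mean(theory), mean(sample)
slope = sum((t - t_bar) * (d - d_bar) for t, d in zip(theory, sample)) / sum(
    (t - t_bar) ** 2 for t in theory
)
intercept = d_bar - slope * t_bar

# Expect the slope to be near true_sigma and the intercept near true_mu.
print(round(slope, 2), round(intercept, 2))
```

The fitted line recovers the scale and location without either parameter being supplied to the Q-Q construction.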

See derivations and formulae: ch02_prob_dist.html