Chapter 2: Summary Statistics & Exploratory Data Analysis

Section 2.1: Basic Statistics

Data: A set \( \{x_1, x_2, \ldots, x_n\} \), e.g., \( \{8, 3, 14, 1, 5, 7, 21, 4, 10, 3\} \).

Ordered Data: \( \{1, 3, 3, 4, 5, 7, 8, 10, 14, 21\} \)

Count: \( n = 10 \)
Mean: \( \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{1}{10} \cdot 76 = 7.6 \)
Minimum: \( 1 \)
Maximum: \( 21 \)
Median: with the ordered values written \( y_1 \le y_2 \le \cdots \le y_{10} \), \( \frac{y_5 + y_6}{2} = \frac{5 + 7}{2} = 6 \)
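
As a quick check, the statistics above can be recomputed with Python's standard library. This is only a sketch; the variable names are illustrative and not taken from the supporting material.

```python
import statistics

data = [8, 3, 14, 1, 5, 7, 21, 4, 10, 3]

ordered = sorted(data)
n = len(data)
mean = sum(data) / n
minimum, maximum = ordered[0], ordered[-1]
# For even n, the median averages the two middle ordered values.
median = statistics.median(data)

print(ordered)
print(n, mean, minimum, maximum, median)
```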

See supporting material: ch02_basicstats.html

Section 2.2: Standard Deviation

Sample Standard Deviation:

\( s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} = \sqrt{\frac{1}{9}(332.4)} \approx 6.077 \)

Population Standard Deviation:

\( s_{\text{pop}} = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2} \)

Since \( s_{\text{pop}} = \sqrt{\frac{n-1}{n}} \, s \), we have \( s_{\text{pop}} < s \) whenever \( s > 0 \), and \( \lim_{n \to \infty} \frac{s_{\text{pop}}}{s} = 1 \)
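
A minimal sketch comparing the two definitions on the example data, assuming the standard-library functions `statistics.stdev` (divides by \( n-1 \)) and `statistics.pstdev` (divides by \( n \)):

```python
import math
import statistics

data = [8, 3, 14, 1, 5, 7, 21, 4, 10, 3]
n = len(data)

s = statistics.stdev(data)       # sample standard deviation, divides by n - 1
s_pop = statistics.pstdev(data)  # population standard deviation, divides by n

print(round(s, 3), round(s_pop, 3))
# The two differ only by the factor sqrt((n - 1) / n), which tends to 1.
print(math.isclose(s_pop, s * math.sqrt((n - 1) / n)))
```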

Section 2.3: Quartiles and Five-Number Summary

\( Q_1 = 3.25 \)
\( Q_3 = 9.5 \)
Five-number summary: \( (1, 3.25, 6, 9.5, 21) \)
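
The notes do not say which quartile convention is used. The quoted values are consistent with linear interpolation between order statistics, which is what `statistics.quantiles` does with `method="inclusive"`; other conventions can give slightly different quartiles. A sketch under that assumption:

```python
import statistics

data = [8, 3, 14, 1, 5, 7, 21, 4, 10, 3]

# method="inclusive" interpolates linearly between neighbouring ordered values
# (assumed convention; it reproduces the quartiles quoted in this section).
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
five_number = (min(data), q1, q2, q3, max(data))

print(five_number)
```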

Section 2.4: Box Plot

Interquartile Range (IQR): \( Q_3 - Q_1 = 6.25 \)

Upper Fence: \( Q_3 + 1.5 \cdot \text{IQR} = 18.875 \)

Lower Fence: \( Q_1 - 1.5 \cdot \text{IQR} = -6.125 \)

Outliers are data points outside the fences.

The box shows \( Q_1 \), median, and \( Q_3 \); whiskers extend to the most extreme data points within the fences.
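
The fence rule can be applied to the example data as follows. This sketch again assumes the linear-interpolation quartiles (`method="inclusive"`); the names `inside`, `outliers`, and `whiskers` are illustrative.

```python
import statistics

data = [8, 3, 14, 1, 5, 7, 21, 4, 10, 3]

q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are outliers; whiskers end at the most
# extreme points that are still inside the fences.
inside = [x for x in data if lower_fence <= x <= upper_fence]
outliers = sorted(x for x in data if x < lower_fence or x > upper_fence)
whiskers = (min(inside), max(inside))

print(iqr, lower_fence, upper_fence)
print(outliers, whiskers)
```

For this data set the only point beyond a fence is 21, so the upper whisker stops at 14.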

Section 2.5: Histogram

Data are grouped into bins:

(0, 5]: {1, 3, 3, 4, 5} → 5 counts
(5, 10]: {7, 8, 10} → 3 counts
(10, 15]: {14} → 1 count
(15, 20]: — → 0
(20, 25]: {21} → 1 count
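
The bin counts above can be reproduced with a short sketch. It assumes right-closed bins of width 5 starting at 0, as in the table; the `ceil` trick is one possible way to assign bins, not necessarily how the supporting material does it.

```python
import math
from collections import Counter

data = [8, 3, 14, 1, 5, 7, 21, 4, 10, 3]
width = 5

# Right-closed bins (a, a + width]: ceil(x / width) is the 1-based bin index,
# so x = 5 falls in (0, 5] and x = 10 falls in (5, 10].
counts = Counter(math.ceil(x / width) for x in data)

for k in range(1, max(counts) + 1):
    print(f"({(k - 1) * width}, {k * width}]: {counts[k]}")
```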

Section 2.6: Empirical CDF

Definition: \( \hat{F}(x) = \frac{1}{n} \cdot \#\{ i : x_i \leq x \} \), i.e., the fraction of observations that are \( \leq x \)

For sorted data \( \{0, 3, 3, 5, 7\} \):

\( x < 0 \): \( \hat{F}(x) = 0 \)
\( 0 \le x < 3 \): \( \hat{F}(x) = 0.2 \)
\( 3 \le x < 5 \): \( \hat{F}(x) = 0.6 \)
\( 5 \le x < 7 \): \( \hat{F}(x) = 0.8 \)
\( x \ge 7 \): \( \hat{F}(x) = 1 \)
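
A minimal sketch of the empirical CDF as a step function. The helper name `ecdf` is hypothetical, and `bisect_right` is used here simply as a convenient way to count observations \( \leq x \) in a sorted list.

```python
from bisect import bisect_right

def ecdf(sample):
    """Return the empirical CDF of `sample` as a function of x."""
    ordered = sorted(sample)
    n = len(ordered)
    # bisect_right gives the number of ordered values that are <= x.
    return lambda x: bisect_right(ordered, x) / n

F_hat = ecdf([0, 3, 3, 5, 7])
for x in (-1, 0, 2.9, 3, 5, 6.5, 7, 100):
    print(x, F_hat(x))
```

The jump at \( x = 3 \) has height 0.4 because the value 3 occurs twice.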

See exploration demo: ch02_explore_data.html

Section 2.7: Quantiles and Distributions

Quantile Definition:

Let \( X \) be a continuous random variable with CDF \( F \), and let \( 0 < p < 1 \). The \( p \)-th quantile \( q_p \) satisfies:

\( P(X \leq q_p) = p \), i.e., \( F(q_p) = p \)

Example (Standard Normal):

\( X \sim N(0, 1) \), then \( F(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-z^2/2} dz \)

\( q_{0.5} = 0 \)
\( q_{0.8} \approx 0.84 \)
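
The two quantiles can be checked numerically. This sketch assumes Python 3.8 or later, where `statistics.NormalDist` provides `cdf` and `inv_cdf`.

```python
from statistics import NormalDist

Z = NormalDist(mu=0.0, sigma=1.0)

q_50 = Z.inv_cdf(0.5)   # median of the standard normal
q_80 = Z.inv_cdf(0.8)

print(round(q_50, 2), round(q_80, 2))
# Round trip: F(q_p) should return p.
print(round(Z.cdf(q_80), 6))
```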

Section 2.8: Solving Quantiles from PDF

Given \( f(x) = e^{-x} \) for \( x \ge 0 \), find \( q_{0.75} \):

\( \int_0^{q_{0.75}} e^{-x} dx = 0.75 \Rightarrow -e^{-q_{0.75}} + 1 = 0.75 \Rightarrow q_{0.75} = \ln(4) \approx 1.386 \)
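
The same calculation works for any \( p \): solving \( 1 - e^{-q} = p \) gives \( q = -\ln(1 - p) \). A small sketch, with the hypothetical helper name `exp_quantile`:

```python
import math

def exp_quantile(p):
    """Quantile of the density f(x) = exp(-x), x >= 0: solve 1 - exp(-q) = p."""
    if not 0 <= p < 1:
        raise ValueError("p must lie in [0, 1)")
    return -math.log(1 - p)

q = exp_quantile(0.75)
print(round(q, 3))        # numerically equal to ln(4)
print(1 - math.exp(-q))   # recovers p up to rounding
```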

Section 2.9: Scaling of Quantiles in Normal Distribution

Let \( Z \sim N(0,1) \), \( X \sim N(\mu, \sigma^2) \)

\( X = \mu + \sigma Z \Rightarrow q_p^{(X)} = \mu + \sigma q_p^{(Z)} \)

Thus, plotting \( q_p^{(X)} \) vs. \( q_p^{(Z)} \) gives a line with slope \( \sigma \) and intercept \( \mu \).
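
The scaling relation can be verified numerically. The values of `mu` and `sigma` below are arbitrary illustrative choices, not taken from the notes.

```python
from statistics import NormalDist

mu, sigma = 10.0, 2.0   # illustrative values only
Z = NormalDist()
X = NormalDist(mu, sigma)

pairs = []
for p in (0.1, 0.25, 0.5, 0.8, 0.95):
    direct = X.inv_cdf(p)                # quantile of N(mu, sigma^2)
    scaled = mu + sigma * Z.inv_cdf(p)   # mu + sigma * quantile of N(0, 1)
    pairs.append((direct, scaled))
    print(p, round(direct, 4), round(scaled, 4))
```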

Section 2.10: Quantile-Quantile (Q-Q) Plot

Purpose: Compare sample data to a theoretical distribution

Steps:

Sort data \( x_1 \le x_2 \le \cdots \le x_n \)
Use probability nodes \( p_i = \frac{i}{n+1} \)
Compare data quantiles \( q_{p_i}^{(d)} = x_i \) to theoretical quantiles \( q_{p_i}^{(t)} \)
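
The steps above can be sketched as follows, using the standard normal as the theoretical distribution and the data from Section 2.1. The code only computes the point pairs; passing them to a plotting library is left out.

```python
from statistics import NormalDist

data = [8, 3, 14, 1, 5, 7, 21, 4, 10, 3]

ordered = sorted(data)
n = len(ordered)
Z = NormalDist()

# Probability nodes p_i = i / (n + 1) for i = 1..n, then pair each
# theoretical N(0, 1) quantile with the matching ordered observation.
nodes = [i / (n + 1) for i in range(1, n + 1)]
qq_points = [(Z.inv_cdf(p), x) for p, x in zip(nodes, ordered)]

for theoretical, observed in qq_points:
    print(f"{theoretical:+.3f}  {observed}")
```

Dividing by \( n + 1 \) rather than \( n \) keeps every node strictly between 0 and 1, so the theoretical quantiles are all finite.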

Section 2.11: Interpreting the Q-Q Plot

If the points lie along the line \( y = x \), the data are consistent with the reference distribution.

For normal data \( N(\mu, \sigma^2) \), quantiles relate linearly to \( N(0,1) \):

\( q^{(d)} = \mu + \sigma q^{(t)} \)

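Conclusion: You do not need to know \( \mu \) and \( \sigma \) to test normality. Plot the data quantiles against \( N(0,1) \) quantiles: points falling along any straight line indicate normal data, and the slope and intercept of that line estimate \( \sigma \) and \( \mu \).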
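
To illustrate, the sketch below builds idealized data quantiles from hypothetical parameters and recovers them from a least-squares line through the Q-Q points. Both the parameter values and the line fit are illustrative assumptions; real samples scatter around the line rather than lying exactly on it.

```python
from statistics import NormalDist

mu, sigma = 5.0, 3.0   # hypothetical parameters, treated as unknown by the fit
n = 20
Z = NormalDist()

# Theoretical N(0, 1) quantiles at the nodes i / (n + 1).
t = [Z.inv_cdf(i / (n + 1)) for i in range(1, n + 1)]
# Idealized data quantiles of N(mu, sigma^2), with no sampling noise.
d = [mu + sigma * z for z in t]

# Ordinary least-squares line d = intercept + slope * t.
t_bar, d_bar = sum(t) / n, sum(d) / n
sxy = sum((a - t_bar) * (b - d_bar) for a, b in zip(t, d))
sxx = sum((a - t_bar) ** 2 for a in t)
slope = sxy / sxx
intercept = d_bar - slope * t_bar

print(round(slope, 6), round(intercept, 6))
```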

See derivations and formulae: ch02_prob_dist.html