Example: Coin Flip
Observation: H, H, H, T
Let \( p \) be the probability of heads. Then:
\[ P(HHHT) = p^3 (1 - p) \]
Define the likelihood function:
\[ L(p) = p^3 (1 - p) \]
To find the MLE of \( p \), set \( L'(p) = 0 \) and solve for \( p \):
\[ L'(p) = 3p^2 - 4p^3 = p^2 (3 - 4p) \]
Solutions: \( p = 0 \) or \( p = \frac{3}{4} \). Since \( L(0) = 0 \), the MLE is:
\[ \hat{p}_{\text{MLE}} = \frac{3}{4} \]
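As a quick numerical check, a minimal R sketch (using the same optimize() approach as the Cauchy example later in this section) maximizes \( L(p) \) over \( [0, 1] \) and recovers the same answer:

# Likelihood of observing H, H, H, T as a function of p
L <- function(p) p^3 * (1 - p)
# Numerical maximization over [0, 1]; the maximum is at about p = 0.75
optimize(L, interval = c(0, 1), maximum = TRUE)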
Let \( X_1, ..., X_n \) be i.i.d. samples from a discrete distribution with PMF \( f(x; \theta) \).
The likelihood function is:
\[ L(\theta) = \prod_{i=1}^n f(x_i; \theta) \]
Example: Poisson Distribution
PMF: \( f(x; \lambda) = \frac{\lambda^x e^{-\lambda}}{x!} \)
Observed data: \( x_1 = 3, x_2 = 4, x_3 = 3, x_4 = 7 \)
\[ L(\lambda) = \frac{\lambda^{17} e^{-4\lambda}}{3! \cdot 4! \cdot 3! \cdot 7!} \]
Log-likelihood:
\[ \ln L(\lambda) = 17 \ln \lambda - 4\lambda - \ln(3! \cdot 4! \cdot 3! \cdot 7!) \]
Differentiating and solving:
\[ \frac{17}{\lambda} - 4 = 0 \Rightarrow \hat{\lambda}_{\text{MLE}} = \frac{17}{4} \]
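A minimal R sketch to check this numerically (the closed-form MLE is simply the sample mean, \( 17/4 = 4.25 \)):

x <- c(3, 4, 3, 7)
# Poisson log-likelihood as a function of lambda
logL <- function(lambda) sum(dpois(x, lambda, log = TRUE))
optimize(logL, interval = c(0.1, 20), maximum = TRUE)  # maximum near 4.25
mean(x)  # closed-form MLE: 17/4 = 4.25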
Let \( X_1, ..., X_n \) be i.i.d. samples from a continuous distribution with PDF \( f(x; \theta) \).
\[ L(\theta) = \prod_{i=1}^n f(x_i; \theta) \]
Example: Exponential Distribution
PDF: \( f(x; \lambda) = \lambda e^{-\lambda x}, \quad x \ge 0 \)
\[ L(\lambda) = \lambda^n e^{-\lambda \sum x_i} \]
Log-likelihood:
\[ \ln L(\lambda) = n \ln \lambda - \lambda \sum x_i \]
Solving:
\[ \frac{n}{\lambda} - \sum x_i = 0 \Rightarrow \hat{\lambda}_{\text{MLE}} = \frac{n}{\sum x_i} = \frac{1}{\bar{x}} \]
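A minimal R sketch with a hypothetical sample (illustrative values only) confirms that the numerical maximum agrees with \( 1/\bar{x} \):

# Hypothetical sample (illustration only)
x <- c(0.8, 1.9, 0.4, 2.6, 1.3)
# Exponential log-likelihood as a function of the rate lambda
logL <- function(lambda) sum(dexp(x, rate = lambda, log = TRUE))
optimize(logL, interval = c(0.01, 10), maximum = TRUE)  # maximum near 1 / mean(x)
1 / mean(x)  # closed-form MLE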
Distribution: \( X \sim \text{Cauchy}(\theta) \)
PDF: \( f(x; \theta) = \frac{1}{\pi(1 + (x - \theta)^2)} \)
Data: \( x = [1, 2, 3] \)
Log-likelihood:
\[ \ln L(\theta) = -3 \ln \pi - \sum_{i=1}^3 \ln(1 + (x_i - \theta)^2) \]
Score equation:
\[ \sum_{i=1}^3 \frac{2(x_i - \theta)}{1 + (x_i - \theta)^2} = 0 \]
This is nonlinear in \( \theta \), so solve numerically.
# Data and Cauchy location log-likelihood
x <- c(1, 2, 3)
logL <- function(theta) sum(dcauchy(x, location = theta, log = TRUE))
# Numerical maximization over a bracketing interval
optimize(logL, interval = c(0, 4), maximum = TRUE)
Output: maximum = 2, objective = -4.8205
Distribution: \( X \sim N(\mu, \sigma^2) \)
PDF: \( f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi} \sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) \)
Log-likelihood:
\[ \ln L(\mu, \sigma) = -\frac{n}{2} \ln(2\pi) - n \ln \sigma - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \]
MLE for \( \mu \):
\[ \frac{\partial}{\partial \mu} \ln L = 0 \Rightarrow \mu = \frac{1}{n} \sum x_i = \bar{x} \]
MLE for \( \sigma \):
\[ \frac{\partial}{\partial \sigma} \ln L = 0 \Rightarrow \sigma^2 = \frac{1}{n} \sum (x_i - \bar{x})^2 \]
Final MLE Estimates:
\[ \hat{\mu}_{\text{MLE}} = \bar{x}, \quad \hat{\sigma}_{\text{MLE}} = \sqrt{ \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2 } \]
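A minimal R sketch with a hypothetical sample (illustrative values only) computes both estimates; note that sd() divides by \( n - 1 \), so it does not return \( \hat{\sigma}_{\text{MLE}} \):

# Hypothetical sample (illustration only)
x <- c(4.2, 5.1, 3.8, 6.0, 4.9)
n <- length(x)
mu_hat <- mean(x)                            # MLE of mu
sigma_hat <- sqrt(sum((x - mu_hat)^2) / n)   # MLE of sigma (n in the denominator)
c(mu_hat, sigma_hat)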
Let \( X \) be a random variable with PDF \( f(x; \theta) \), and observed sample \( x_1, x_2, \dots, x_n \).
The r-th moment of the distribution is:
\[ \mathbb{E}[X^r] = \int_{-\infty}^{\infty} x^r f(x; \theta) dx \]
The corresponding sample moment is:
\[ \frac{1}{n} \sum_{i=1}^n x_i^r \]
Key idea: Match theoretical moments with sample moments to solve for parameters.
Match the first moment:
\[ \mathbb{E}[X] = \int x f(x; \theta)\, dx = \frac{1}{n} \sum_{i=1}^n x_i = \bar{x} \]
Solving this equation for \( \theta \) gives \( \hat{\theta}_{\text{MOM}} \).
\( f(x; \lambda) = \lambda e^{-\lambda x}, \quad x \ge 0, \lambda > 0 \)
Theoretical mean:
\[ \mathbb{E}[X] = \int_0^\infty x \lambda e^{-\lambda x} dx = \frac{1}{\lambda} \]
Set equal to sample mean:
\[ \frac{1}{\lambda} = \bar{x} \Rightarrow \hat{\lambda}_{\text{MOM}} = \frac{1}{\bar{x}} \]
\( X \sim N(\mu, \sigma^2) \), match moments for \( r = 1, 2 \):
Matching the first moment gives \( \hat{\mu}_{\text{MOM}} = \bar{x} \). Since \( \mathbb{E}[X^2] = \mu^2 + \sigma^2 \), matching the second moment gives:
\[ \frac{1}{n} \sum_{i=1}^n x_i^2 = \hat{\mu}^2 + \hat{\sigma}^2 \Rightarrow \hat{\sigma}_{\text{MOM}}^2 = \frac{1}{n} \sum x_i^2 - \bar{x}^2 \]
This is equivalent to the non-adjusted sample variance (denominator is \( n \), not \( n - 1 \)).
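A minimal R sketch (same hypothetical sample style as above) verifies the identity between the two expressions:

# Hypothetical sample (illustration only)
x <- c(4.2, 5.1, 3.8, 6.0, 4.9)
n <- length(x)
sigma2_mom <- mean(x^2) - mean(x)^2   # second sample moment minus squared first moment
sigma2_mom
var(x) * (n - 1) / n                  # identical: var() rescaled from n - 1 to n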
Let \( \hat{\theta} \) be an estimator of parameter \( \theta \) from a sample.
The bias of \( \hat{\theta} \) is defined as:
\[ \text{Bias}[\hat{\theta}] = \mathbb{E}[\hat{\theta}] - \theta \]
Ideally, we want the bias to be zero (i.e., the estimator is unbiased).
Suppose we use the sample mean to estimate \( \mu = \mathbb{E}[X] \).
Let \( \hat{\mu} = \bar{X} = \frac{1}{n} \sum_{i=1}^n X_i \). Then:
\[ \text{Bias}[\bar{X}] = \mathbb{E}[\bar{X}] - \mu = \mathbb{E}\left[\frac{1}{n} \sum_{i=1}^n X_i\right] - \mu = \frac{1}{n} \sum_{i=1}^n \mathbb{E}[X_i] - \mu = \mu - \mu = 0 \]
Conclusion: The sample mean \( \bar{X} \) is an unbiased estimator of \( \mu \).
This applies whether \( \bar{X} \) is derived via MLE or Method of Moments.
Theorem 6.3.2: Let \( X_1, \dots, X_n \) be independent random variables with common mean \( \mu \) and variance \( \text{Var}(X_i) = \sigma^2 \). Then an unbiased estimator of \( \sigma^2 \) is:
\[ S^2 = \frac{1}{n - 1} \sum_{i=1}^n (X_i - \bar{X})^2 \]
This is the sample variance. The factor \( \frac{1}{n - 1} \) is used to make it unbiased.
If we instead use:
\[ \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2 \]
Then the bias is:
\[ \text{Bias}[\hat{\sigma}^2] = \mathbb{E}[\hat{\sigma}^2] - \sigma^2 = -\frac{\sigma^2}{n} \ne 0 \]
We correct this by rescaling:
\[ S^2 = \frac{n}{n - 1} \hat{\sigma}^2 \Rightarrow \mathbb{E}[S^2] = \sigma^2 \]
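A small simulation sketch (assuming a normal population with \( \sigma^2 = 4 \) and \( n = 5 \)) illustrates the bias and its correction:

set.seed(1)
n <- 5; sigma2 <- 4
sims <- replicate(1e5, {
  x <- rnorm(n, mean = 0, sd = sqrt(sigma2))
  c(biased = sum((x - mean(x))^2) / n, unbiased = var(x))
})
rowMeans(sims)  # roughly 3.2 and 4: (n - 1) / n * sigma2 versus sigma2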
An estimator is asymptotically unbiased if:
\[ \lim_{n \to \infty} \text{Bias}[\hat{\theta}] = 0 \]
For example, since \( \text{Bias}[\hat{\sigma}^2] = -\frac{\sigma^2}{n} \):
\[ \lim_{n \to \infty} \text{Bias}[\hat{\sigma}^2] = 0 \Rightarrow \hat{\sigma}^2 \text{ is asymptotically unbiased} \]
Efficiency measures how "tight" the estimator is. Given two unbiased estimators \( \hat{\theta}_1, \hat{\theta}_2 \):
\[ \text{If } \text{Var}[\hat{\theta}_1] < \text{Var}[\hat{\theta}_2] \Rightarrow \hat{\theta}_1 \text{ is more efficient} \]
Example: compare, for instance, two unbiased estimators of \( \mu \) from an i.i.d. sample with variance \( \sigma^2 \): the sample mean \( \hat{\mu}_1 = \bar{X} \), with \( \text{Var}[\hat{\mu}_1] = \sigma^2 / n \), and a single observation \( \hat{\mu}_2 = X_1 \), with \( \text{Var}[\hat{\mu}_2] = \sigma^2 \).
Conclusion: for \( n > 1 \), \( \text{Var}[\hat{\mu}_1] < \text{Var}[\hat{\mu}_2] \), so \( \hat{\mu}_1 \) is more efficient.
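A small simulation sketch (assuming i.i.d. \( N(0, 1) \) data and \( n = 10 \)) compares the variance of the sample mean with that of a single observation:

set.seed(1)
n <- 10
est <- replicate(1e5, {
  x <- rnorm(n)
  c(sample_mean = mean(x), single_obs = x[1])
})
apply(est, 1, var)  # roughly 0.1 and 1: Var of the sample mean is 1/n, of a single observation is 1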
The mean squared error is:
\[ \text{MSE}[\hat{\theta}] = \mathbb{E}[(\hat{\theta} - \theta)^2] = \text{Var}[\hat{\theta}] + (\text{Bias}[\hat{\theta}])^2 \]
MSE combines both variance and bias, and is a useful metric when estimators are biased.
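A small simulation sketch (assuming a normal population with \( \sigma^2 = 4 \), \( n = 5 \), and the biased estimator \( \hat{\sigma}^2 \) from above) checks the decomposition numerically:

set.seed(1)
n <- 5; sigma2 <- 4
est <- replicate(1e5, { x <- rnorm(n, sd = sqrt(sigma2)); sum((x - mean(x))^2) / n })
mean((est - sigma2)^2)             # direct Monte Carlo estimate of the MSE
var(est) + (mean(est) - sigma2)^2  # variance plus squared bias: approximately equal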
Example: \( X \sim \text{Binomial}(n, p) \). Compare, for instance, the unbiased estimator \( \hat{p}_1 = X/n \) with a biased shrinkage estimator such as \( \hat{p}_2 = \frac{X + 1}{n + 2} \).
Depending on the values of \( n \) and \( p \), either estimator may have the lower MSE.
An estimator \( \hat{\theta} \) is consistent if:
\[ \lim_{n \to \infty} P(|\hat{\theta} - \theta| < \varepsilon) = 1 \quad \text{for any } \varepsilon > 0 \]
Example: Let \( X_1, ..., X_n \sim N(\mu, 1) \), then:
\[ \bar{X}_n = \frac{1}{n} \sum X_i \quad \Rightarrow \mathbb{E}[\bar{X}_n] = \mu, \quad \text{Var}[\bar{X}_n] = \frac{1}{n} \]
Since \( \bar{X}_n \) is unbiased and its variance goes to 0 as \( n \to \infty \), Chebyshev's inequality gives consistency: \( \bar{X}_n \) is a consistent estimator of \( \mu \).
A sufficient condition:
If \( \text{MSE}[\hat{\theta}_n] \to 0 \) as \( n \to \infty \), then \( \hat{\theta}_n \) is consistent (convergence in mean square implies convergence in probability).
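A small simulation sketch (assuming \( \mu = 2 \), \( \sigma = 1 \), and \( \varepsilon = 0.1 \)) shows the probability in the definition approaching 1 as \( n \) grows:

set.seed(1)
mu <- 2; eps <- 0.1
for (n in c(10, 100, 1000, 10000)) {
  xbar <- replicate(2000, mean(rnorm(n, mean = mu, sd = 1)))
  cat("n =", n, " P(|xbar - mu| < eps) approx.", mean(abs(xbar - mu) < eps), "\n")
}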