Example: Coin Flip
Observation: H, H, H, T
Let \( p \) be the probability of heads. Then:
\[ P(HHHT) = p^3 (1 - p) \]
Define the likelihood function:
\[ L(p) = p^3 (1 - p) \]
To find the MLE of \( p \), set \( L'(p) = 0 \) and solve for \( p \):
\[ L'(p) = 3p^2 - 4p^3 = p^2 (3 - 4p) \]
Solutions: \( p = 0 \) or \( p = \frac{3}{4} \). Since \( L(0) = 0 \), the MLE is:
\[ \hat{p}_{\text{MLE}} = \frac{3}{4} \]
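As a quick numerical check, a minimal R sketch (using the same optimize() approach as the Cauchy example later in this section) maximizes \( L(p) \) over \( [0, 1] \) and recovers the same answer:

# Likelihood of observing H, H, H, T as a function of p
L <- function(p) p^3 * (1 - p)
# Numerical maximization over [0, 1]; the maximum is at about p = 0.75
optimize(L, interval = c(0, 1), maximum = TRUE)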
Let \( X_1, ..., X_n \) be i.i.d. samples from a discrete distribution with PMF \( f(x; \theta) \).
The likelihood function is:
\[ L(\theta) = \prod_{i=1}^n f(x_i; \theta) \]
Example: Poisson Distribution
PMF: \( f(x; \lambda) = \frac{\lambda^x e^{-\lambda}}{x!} \)
Observed data: \( x_1 = 3, x_2 = 4, x_3 = 3, x_4 = 7 \)
\[ L(\lambda) = \frac{\lambda^{17} e^{-4\lambda}}{3! \cdot 4! \cdot 3! \cdot 7!} \]
Log-likelihood:
\[ \ln L(\lambda) = 17 \ln \lambda - 4\lambda - \ln(3! \cdot 4! \cdot 3! \cdot 7!) \]
Differentiating and solving:
\[ \frac{17}{\lambda} - 4 = 0 \Rightarrow \hat{\lambda}_{\text{MLE}} = \frac{17}{4} \]
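A minimal R sketch to check this numerically (the closed-form MLE is simply the sample mean, \( 17/4 = 4.25 \)):

x <- c(3, 4, 3, 7)
# Poisson log-likelihood as a function of lambda
logL <- function(lambda) sum(dpois(x, lambda, log = TRUE))
optimize(logL, interval = c(0.1, 20), maximum = TRUE)  # maximum near 4.25
mean(x)  # closed-form MLE: 17/4 = 4.25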
Let \( X_1, ..., X_n \) be i.i.d. samples from a continuous distribution with PDF \( f(x; \theta) \).
\[ L(\theta) = \prod_{i=1}^n f(x_i; \theta) \]
Example: Exponential Distribution
PDF: \( f(x; \lambda) = \lambda e^{-\lambda x}, \quad x \ge 0 \)
\[ L(\lambda) = \lambda^n e^{-\lambda \sum x_i} \]
Log-likelihood:
\[ \ln L(\lambda) = n \ln \lambda - \lambda \sum x_i \]
Solving:
\[ \frac{n}{\lambda} - \sum x_i = 0 \Rightarrow \hat{\lambda}_{\text{MLE}} = \frac{n}{\sum x_i} = \frac{1}{\bar{x}} \]
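A minimal R sketch with a hypothetical sample (illustrative values only) confirms that the numerical maximum agrees with \( 1/\bar{x} \):

# Hypothetical sample (illustration only)
x <- c(0.8, 1.9, 0.4, 2.6, 1.3)
# Exponential log-likelihood as a function of the rate lambda
logL <- function(lambda) sum(dexp(x, rate = lambda, log = TRUE))
optimize(logL, interval = c(0.01, 10), maximum = TRUE)  # maximum near 1 / mean(x)
1 / mean(x)  # closed-form MLE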
Distribution: \( X \sim \text{Cauchy}(\theta) \)
PDF: \( f(x; \theta) = \frac{1}{\pi(1 + (x - \theta)^2)} \)
Data: \( x = [1, 2, 3] \)
Log-likelihood:
\[ \ln L(\theta) = -3 \ln \pi - \sum_{i=1}^3 \ln(1 + (x_i - \theta)^2) \]
Score equation:
\[ \sum_{i=1}^3 \frac{2(x_i - \theta)}{1 + (x_i - \theta)^2} = 0 \]
This is nonlinear in \( \theta \), so solve numerically.
# Data and Cauchy location log-likelihood
x <- c(1, 2, 3)
logL <- function(theta) sum(dcauchy(x, location = theta, log = TRUE))
# Numerical maximization over a bracketing interval
optimize(logL, interval = c(0, 4), maximum = TRUE)
Output: maximum = 2, objective = -4.8205
Distribution: \( X \sim N(\mu, \sigma^2) \)
PDF: \( f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi} \sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) \)
Log-likelihood:
\[ \ln L(\mu, \sigma) = -\frac{n}{2} \ln(2\pi) - n \ln \sigma - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \]
MLE for \( \mu \):
\[ \frac{\partial}{\partial \mu} \ln L = 0 \Rightarrow \mu = \frac{1}{n} \sum x_i = \bar{x} \]
MLE for \( \sigma \):
\[ \frac{\partial}{\partial \sigma} \ln L = 0 \Rightarrow \sigma^2 = \frac{1}{n} \sum (x_i - \bar{x})^2 \]
Final MLE Estimates:
\[ \hat{\mu}_{\text{MLE}} = \bar{x}, \quad \hat{\sigma}_{\text{MLE}} = \sqrt{ \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2 } \]
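A minimal R sketch with a hypothetical sample (illustrative values only) computes both estimates; note that sd() divides by \( n - 1 \), so it does not return \( \hat{\sigma}_{\text{MLE}} \):

# Hypothetical sample (illustration only)
x <- c(4.2, 5.1, 3.8, 6.0, 4.9)
n <- length(x)
mu_hat <- mean(x)                            # MLE of mu
sigma_hat <- sqrt(sum((x - mu_hat)^2) / n)   # MLE of sigma (n in the denominator)
c(mu_hat, sigma_hat)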
Let \( X \) be a random variable with PDF \( f(x; \theta) \), and observed sample \( x_1, x_2, \dots, x_n \).
The r-th moment of the distribution is:
\[ \mathbb{E}[X^r] = \int_{-\infty}^{\infty} x^r f(x; \theta) dx \]
The corresponding sample moment is:
\[ \frac{1}{n} \sum_{i=1}^n x_i^r \]
Key idea: Match theoretical moments with sample moments to solve for parameters.
Match the first moment:
\[ \mathbb{E}[X] = \int x f(x; \theta)\, dx = \frac{1}{n} \sum_{i=1}^n x_i = \bar{x} \]
Solving this equation for \( \theta \) gives \( \hat{\theta}_{\text{MOM}} \).
\( f(x; \lambda) = \lambda e^{-\lambda x}, \quad x \ge 0, \lambda > 0 \)
Theoretical mean:
\[ \mathbb{E}[X] = \int_0^\infty x \lambda e^{-\lambda x} dx = \frac{1}{\lambda} \]
Set equal to sample mean:
\[ \frac{1}{\lambda} = \bar{x} \Rightarrow \hat{\lambda}_{\text{MOM}} = \frac{1}{\bar{x}} \]
\( X \sim N(\mu, \sigma^2) \), match moments for \( r = 1, 2 \):
Matching the first moment gives \( \hat{\mu}_{\text{MOM}} = \bar{x} \). Since \( \mathbb{E}[X^2] = \mu^2 + \sigma^2 \), matching the second moment gives:
\[ \frac{1}{n} \sum_{i=1}^n x_i^2 = \hat{\mu}^2 + \hat{\sigma}^2 \Rightarrow \hat{\sigma}_{\text{MOM}}^2 = \frac{1}{n} \sum x_i^2 - \bar{x}^2 \]
This is equivalent to the non-adjusted sample variance (denominator is \( n \), not \( n - 1 \)).
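A minimal R sketch (same hypothetical sample style as above) verifies the identity between the two expressions:

# Hypothetical sample (illustration only)
x <- c(4.2, 5.1, 3.8, 6.0, 4.9)
n <- length(x)
sigma2_mom <- mean(x^2) - mean(x)^2   # second sample moment minus squared first moment
sigma2_mom
var(x) * (n - 1) / n                  # identical: var() rescaled from n - 1 to n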
Let \( \hat{\theta} \) be an estimator of parameter \( \theta \) from a sample.
The bias of \( \hat{\theta} \) is defined as:
\[ \text{Bias}[\hat{\theta}] = \mathbb{E}[\hat{\theta}] - \theta \]
Ideally, we want the bias to be zero (i.e., the estimator is unbiased).
Suppose we use the sample mean to estimate \( \mu = \mathbb{E}[X] \).
Let \( \hat{\mu} = \bar{X} = \frac{1}{n} \sum_{i=1}^n X_i \). Then:
\[ \text{Bias}[\bar{X}] = \mathbb{E}[\bar{X}] - \mu = \mathbb{E}\left[\frac{1}{n} \sum_{i=1}^n X_i\right] - \mu = \frac{1}{n} \sum_{i=1}^n \mathbb{E}[X_i] - \mu = \mu - \mu = 0 \]
Conclusion: The sample mean \( \bar{X} \) is an unbiased estimator of \( \mu \).
This applies whether \( \bar{X} \) is derived via MLE or Method of Moments.
Theorem 6.3.2: Let \( X_1, \dots, X_n \) be independent random variables with common mean \( \mu \) and variance \( \text{Var}(X_i) = \sigma^2 \). Then an unbiased estimator of \( \sigma^2 \) is:
\[ S^2 = \frac{1}{n - 1} \sum_{i=1}^n (X_i - \bar{X})^2 \]
This is the sample variance. The factor \( \frac{1}{n - 1} \) is used to make it unbiased.
If we instead use:
\[ \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2 \]
Then the bias is:
\[ \text{Bias}[\hat{\sigma}^2] = \mathbb{E}[\hat{\sigma}^2] - \sigma^2 = -\frac{\sigma^2}{n} \ne 0 \]
We correct this by rescaling:
\[ S^2 = \frac{n}{n - 1} \hat{\sigma}^2 \Rightarrow \mathbb{E}[S^2] = \sigma^2 \]
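A small simulation sketch (assuming a normal population with \( \sigma^2 = 4 \) and \( n = 5 \)) illustrates the bias and its correction:

set.seed(1)
n <- 5; sigma2 <- 4
sims <- replicate(1e5, {
  x <- rnorm(n, mean = 0, sd = sqrt(sigma2))
  c(biased = sum((x - mean(x))^2) / n, unbiased = var(x))
})
rowMeans(sims)  # roughly 3.2 and 4: (n - 1) / n * sigma2 versus sigma2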
An estimator is asymptotically unbiased if:
\[ \lim_{n \to \infty} \text{Bias}[\hat{\theta}] = 0 \]
For example, since \( \text{Bias}[\hat{\sigma}^2] = -\frac{\sigma^2}{n} \):
\[ \lim_{n \to \infty} \text{Bias}[\hat{\sigma}^2] = 0 \Rightarrow \hat{\sigma}^2 \text{ is asymptotically unbiased} \]
Efficiency measures how "tight" the estimator is. Given two unbiased estimators \( \hat{\theta}_1, \hat{\theta}_2 \):
\[ \text{If } \text{Var}[\hat{\theta}_1] < \text{Var}[\hat{\theta}_2] \Rightarrow \hat{\theta}_1 \text{ is more efficient} \]
Example: compare, for instance, two unbiased estimators of \( \mu \) from an i.i.d. sample with variance \( \sigma^2 \): the sample mean \( \hat{\mu}_1 = \bar{X} \), with \( \text{Var}[\hat{\mu}_1] = \sigma^2 / n \), and a single observation \( \hat{\mu}_2 = X_1 \), with \( \text{Var}[\hat{\mu}_2] = \sigma^2 \).
Conclusion: for \( n > 1 \), \( \text{Var}[\hat{\mu}_1] < \text{Var}[\hat{\mu}_2] \), so \( \hat{\mu}_1 \) is more efficient.
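A small simulation sketch (assuming i.i.d. \( N(0, 1) \) data and \( n = 10 \)) compares the variance of the sample mean with that of a single observation:

set.seed(1)
n <- 10
est <- replicate(1e5, {
  x <- rnorm(n)
  c(sample_mean = mean(x), single_obs = x[1])
})
apply(est, 1, var)  # roughly 0.1 and 1: Var of the sample mean is 1/n, of a single observation is 1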
The mean squared error is:
\[ \text{MSE}[\hat{\theta}] = \mathbb{E}[(\hat{\theta} - \theta)^2] = \text{Var}[\hat{\theta}] + (\text{Bias}[\hat{\theta}])^2 \]
MSE combines both variance and bias, and is a useful metric when estimators are biased.
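A small simulation sketch (assuming a normal population with \( \sigma^2 = 4 \), \( n = 5 \), and the biased estimator \( \hat{\sigma}^2 \) from above) checks the decomposition numerically:

set.seed(1)
n <- 5; sigma2 <- 4
est <- replicate(1e5, { x <- rnorm(n, sd = sqrt(sigma2)); sum((x - mean(x))^2) / n })
mean((est - sigma2)^2)             # direct Monte Carlo estimate of the MSE
var(est) + (mean(est) - sigma2)^2  # variance plus squared bias: approximately equal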
Example: \( X \sim \text{Binomial}(n, p) \). Compare, for instance, the unbiased estimator \( \hat{p}_1 = X/n \) with a biased shrinkage estimator such as \( \hat{p}_2 = \frac{X + 1}{n + 2} \).
Depending on the values of \( n \) and \( p \), either estimator may have the lower MSE.
An estimator \( \hat{\theta} \) is consistent if:
\[ \lim_{n \to \infty} P(|\hat{\theta} - \theta| < \varepsilon) = 1 \quad \text{for any } \varepsilon > 0 \]
Example: Let \( X_1, ..., X_n \sim N(\mu, 1) \), then:
\[ \bar{X}_n = \frac{1}{n} \sum X_i \quad \Rightarrow \mathbb{E}[\bar{X}_n] = \mu, \quad \text{Var}[\bar{X}_n] = \frac{1}{n} \]
Since \( \bar{X}_n \) is unbiased and its variance goes to 0 as \( n \to \infty \), Chebyshev's inequality gives consistency: \( \bar{X}_n \) is a consistent estimator of \( \mu \).
A sufficient condition:
If \( \text{MSE}[\hat{\theta}_n] \to 0 \) as \( n \to \infty \), then \( \hat{\theta}_n \) is consistent (convergence in mean square implies convergence in probability).
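A small simulation sketch (assuming \( \mu = 2 \), \( \sigma = 1 \), and \( \varepsilon = 0.1 \)) shows the probability in the definition approaching 1 as \( n \) grows:

set.seed(1)
mu <- 2; eps <- 0.1
for (n in c(10, 100, 1000, 10000)) {
  xbar <- replicate(2000, mean(rnorm(n, mean = mu, sd = 1)))
  cat("n =", n, " P(|xbar - mu| < eps) approx.", mean(abs(xbar - mu) < eps), "\n")
}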