Maximum Likelihood Estimation (MLE) is a method to estimate unknown parameters of a probability distribution or statistical model.
The principle: choose the parameter values that make the observed data most likely.
Definition
Suppose we have independent observations $x_1, \dots, x_n$ from a distribution with parameter $\theta$ and probability density/mass function $f(x \mid \theta)$.
The likelihood function is $L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$. The MLE is the parameter value that maximizes $L(\theta)$: $\hat{\theta} = \arg\max_{\theta} L(\theta)$.
Often, we maximize the log-likelihood instead: $\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta)$.
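As a small illustration (not part of the original notes), the following Python sketch maximizes a log-likelihood numerically for a sample assumed to follow an Exponential(λ) model with rate parameterization; the data values and the use of `scipy.optimize.minimize_scalar` are illustrative choices, not the only way to do this.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical sample, assumed i.i.d. Exponential(rate = lam)
x = np.array([0.8, 1.3, 0.2, 2.1, 0.9, 1.7])

def neg_log_likelihood(lam):
    # log L(lam) = n*log(lam) - lam * sum(x); we minimize its negative
    return -(len(x) * np.log(lam) - lam * x.sum())

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 50), method="bounded")
print("numerical MLE of the rate:", res.x)
print("closed form 1 / mean:     ", 1 / x.mean())
```

The numerical optimum agrees with the closed-form MLE $1/\bar{x}$ that appears later in the table.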
Fisher Information
The Fisher Information measures how much information an observable random variable carries about an unknown parameter.
It is defined as the expected squared score: $I(\theta) = \mathbb{E}\!\left[\left(\frac{\partial}{\partial \theta} \log f(X \mid \theta)\right)^{2}\right]$.
Equivalently, under standard regularity conditions: $I(\theta) = -\,\mathbb{E}\!\left[\frac{\partial^{2}}{\partial \theta^{2}} \log f(X \mid \theta)\right]$.
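As a quick numerical check (added here, not in the original), both expressions can be approximated by Monte Carlo for a Bernoulli(p) model, where the closed form is $I(p) = \frac{1}{p(1-p)}$; the chosen $p$ and sample size below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3                                   # arbitrary true parameter
x = rng.binomial(1, p, size=200_000)      # Monte Carlo draws from Bernoulli(p)

# First definition: expected squared score, d/dp log f(x|p) = x/p - (1-x)/(1-p)
score = x / p - (1 - x) / (1 - p)
# Second definition: minus the expected second derivative of log f(x|p)
neg_second = x / p**2 + (1 - x) / (1 - p) ** 2

print("E[score^2]        ~", np.mean(score**2))
print("E[-d2/dp2 log f]  ~", np.mean(neg_second))
print("closed form 1/(p(1-p)):", 1 / (p * (1 - p)))
```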
Connection to MLE
- For large samples, the MLE is approximately normally distributed: $\hat{\theta} \;\dot\sim\; N\!\left(\theta, \tfrac{1}{n\,I(\theta)}\right)$ (a simulation sketch follows this list).
- The Fisher Information thus determines the variance of the MLE.
- A higher $I(\theta)$ means the data provide more information about $\theta$, leading to a more precise estimate.
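The following sketch (an illustrative addition, assuming a Bernoulli(p) model with arbitrary $p$, $n$, and replication count) simulates the sampling distribution of the MLE and compares its spread with $1/\sqrt{n\,I(\theta)}$.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, reps = 0.3, 500, 10_000             # arbitrary parameter, sample size, replications

# For Bernoulli data the MLE of p is the sample mean (see the coin example below)
p_hat = rng.binomial(1, p, size=(reps, n)).mean(axis=1)

fisher = 1 / (p * (1 - p))                # Fisher information of a single Bernoulli trial
print("empirical std of the MLE:", p_hat.std())
print("1 / sqrt(n * I(p)):      ", 1 / np.sqrt(n * fisher))
```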
Example
Suppose we flip a coin $n$ times and observe $k$ heads.
Let $p$ = probability of heads.
The likelihood is: $L(p) = \binom{n}{k} p^{k} (1-p)^{n-k}$
Log-likelihood: $\ell(p) = \log\binom{n}{k} + k \log p + (n - k)\log(1-p)$
Differentiate and solve: $\frac{d\ell}{dp} = \frac{k}{p} - \frac{n-k}{1-p} = 0 \;\Rightarrow\; \hat{p} = \frac{k}{n}$
So the MLE of $p$ is just the sample proportion of heads.
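A minimal sketch of this example (my addition; the counts $n = 100$, $k = 37$ are hypothetical) confirms that numerical maximization of the log-likelihood recovers $k/n$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

n, k = 100, 37   # hypothetical data: 37 heads in 100 flips

def neg_log_likelihood(p):
    # binomial coefficient omitted: it does not depend on p
    return -(k * np.log(p) + (n - k) * np.log(1 - p))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print("numerical MLE:", res.x)    # ~0.37
print("k / n:        ", k / n)
```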
Properties of MLE
- Consistency: $\hat{\theta}_n \xrightarrow{\;p\;} \theta$ as $n \to \infty$ (a short simulation sketch follows this list).
- Asymptotic normality: For large $n$, $\sqrt{n}\,(\hat{\theta}_n - \theta) \xrightarrow{\;d\;} N\!\left(0, \tfrac{1}{I(\theta)}\right)$, where $I(\theta)$ is the Fisher information.
- Efficiency: Achieves the lowest possible asymptotic variance (the Cramér–Rao lower bound).
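A short sketch illustrating consistency (an added example, assuming an Exponential model with an arbitrarily chosen true rate): the MLE $1/\bar{x}$ approaches the true value as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(2)
lam = 2.0                                 # arbitrary true rate of an Exponential model

for n in (10, 1_000, 100_000):
    x = rng.exponential(scale=1 / lam, size=n)
    print(f"n = {n:>7}   MLE 1/mean = {1 / x.mean():.4f}   (true rate {lam})")
```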
MLE Table for Various Distributions
All rows assume $n$ i.i.d. observations $x_1, \dots, x_n$ (for the Binomial, a single count $x$ of successes in $n$ trials); where a distribution has two parameters, derivatives are taken with respect to the parameter indicated, with the other held fixed. A few of the closed-form MLEs are spot-checked numerically in the sketch after the table.

Distribution | Likelihood | Log-Likelihood | First Derivative | Second Derivative | MLE |
---|---|---|---|---|---|
Bernoulli(p) | $p^{\sum x_i}(1-p)^{n-\sum x_i}$ | $\sum x_i \log p + (n-\sum x_i)\log(1-p)$ | $\frac{\sum x_i}{p} - \frac{n-\sum x_i}{1-p}$ | $-\frac{\sum x_i}{p^2} - \frac{n-\sum x_i}{(1-p)^2}$ | $\hat{p} = \bar{x}$ |
Binomial(n, p) | $\binom{n}{x} p^{x}(1-p)^{n-x}$ | $\log\binom{n}{x} + x\log p + (n-x)\log(1-p)$ | $\frac{x}{p} - \frac{n-x}{1-p}$ | $-\frac{x}{p^2} - \frac{n-x}{(1-p)^2}$ | $\hat{p} = x/n$ |
Poisson(λ) | $\frac{\lambda^{\sum x_i} e^{-n\lambda}}{\prod x_i!}$ | $\sum x_i \log\lambda - n\lambda - \sum\log(x_i!)$ | $\frac{\sum x_i}{\lambda} - n$ | $-\frac{\sum x_i}{\lambda^2}$ | $\hat{\lambda} = \bar{x}$ |
Uniform(a, b) | $(b-a)^{-n}$ for all $x_i \in [a,b]$ | $-n\log(b-a)$ | not zero at any interior point (maximum on the boundary) | — | $\hat{a} = \min_i x_i$, $\hat{b} = \max_i x_i$ |
Geometric(p) | $p^{n}(1-p)^{\sum x_i - n}$ | $n\log p + (\sum x_i - n)\log(1-p)$ | $\frac{n}{p} - \frac{\sum x_i - n}{1-p}$ | $-\frac{n}{p^2} - \frac{\sum x_i - n}{(1-p)^2}$ | $\hat{p} = 1/\bar{x}$ |
Normal(μ, σ²) | $(2\pi\sigma^2)^{-n/2} e^{-\frac{1}{2\sigma^2}\sum(x_i-\mu)^2}$ | $-\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum(x_i-\mu)^2$ | $\frac{1}{\sigma^2}\sum(x_i-\mu)$ (w.r.t. μ) | $-\frac{n}{\sigma^2}$ (w.r.t. μ) | $\hat{\mu} = \bar{x}$, $\hat{\sigma}^2 = \frac{1}{n}\sum(x_i-\bar{x})^2$ |
Exponential(λ) | $\lambda^{n} e^{-\lambda\sum x_i}$ | $n\log\lambda - \lambda\sum x_i$ | $\frac{n}{\lambda} - \sum x_i$ | $-\frac{n}{\lambda^2}$ | $\hat{\lambda} = 1/\bar{x}$ |
Gamma(α, β) | $\prod_i \frac{\beta^{\alpha}}{\Gamma(\alpha)} x_i^{\alpha-1} e^{-\beta x_i}$ | $n\alpha\log\beta - n\log\Gamma(\alpha) + (\alpha-1)\sum\log x_i - \beta\sum x_i$ | $\frac{n\alpha}{\beta} - \sum x_i$ (w.r.t. β) | $-\frac{n\alpha}{\beta^2}$ (w.r.t. β) | $\hat{\beta} = \alpha/\bar{x}$ (α known) |
Neg. Binomial(r, p) | $\prod_i \binom{x_i + r - 1}{x_i} p^{r}(1-p)^{x_i}$ | $\text{const} + nr\log p + \sum x_i\log(1-p)$ | $\frac{nr}{p} - \frac{\sum x_i}{1-p}$ | $-\frac{nr}{p^2} - \frac{\sum x_i}{(1-p)^2}$ | $\hat{p} = \frac{r}{r+\bar{x}}$ (r known) |
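As a rough spot-check (an added sketch; the distribution parameters and sample size are arbitrary), a few of the closed-form MLEs above can be verified by simulation.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 50_000                                # arbitrary sample size

# Poisson(lambda): table MLE is the sample mean
x_pois = rng.poisson(lam=4.2, size=N)
print("Poisson   MLE (sample mean):", x_pois.mean())

# Geometric(p): table MLE is 1 / sample mean
x_geom = rng.geometric(p=0.25, size=N)
print("Geometric MLE (1 / mean):   ", 1 / x_geom.mean())

# Uniform(a, b): table MLEs are the sample minimum and maximum
x_unif = rng.uniform(-1.0, 3.0, size=N)
print("Uniform   MLEs (min, max):  ", x_unif.min(), x_unif.max())
```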
Summary
- MLE finds parameter values that maximize the likelihood of the observed data.
- It is simple to apply to many models, though it sometimes requires numerical optimization.
- Forms the basis for many other statistical methods (Wald test, likelihood ratio test, confidence intervals).
- Fisher Information quantifies the precision of those estimates.