created 2023-05-01
lastmod 2025-01-15

Dan Dennett once said of Darwin's theory of evolution that it was the best idea that anyone has ever had. You could say the same about the maximum likelihood estimator (MLE) in the realm of [[statistical inference]]. It's simple, elegant, and sometimes optimal.

Given a parametric model $\{P_\theta:\theta\in\Theta\}$ ([[parametric versus nonparametric statistics]]) and data $X^n$ we solve $$ \sup_{\theta\in\Theta} L(\theta), \quad \text{ where }\quad L(\theta) = p_\theta(X^n) $$ is the likelihood: the density (or mass function) of $P_\theta$ evaluated at the observed data. Any maximizer $\wh{\theta}$ is the MLE. So, given the data, we just optimize over the parameters that could have generated that data. Badaboom-badabing, and we have a solution to [[parametric density estimation]].
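To make the optimization concrete, here is a minimal sketch for one particular model, the exponential distribution with unknown rate. The choice of model, the simulated data, the `true_rate` value, and the function names are all assumptions made for illustration. It maximizes the log-likelihood numerically and checks the answer against the closed form $1/\bar{X}$.

```python
import math
import random


def log_likelihood(rate, data):
    """Log-likelihood of i.i.d. Exponential(rate) data: n*log(rate) - rate*sum(x)."""
    return len(data) * math.log(rate) - rate * sum(data)


def mle_numeric(data, lo=1e-6, hi=100.0, tol=1e-9):
    """Maximize the (concave) log-likelihood over [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if log_likelihood(c, data) > log_likelihood(d, data):
            b = d  # the maximizer lies in [a, d]
        else:
            a = c  # the maximizer lies in [c, b]
    return (a + b) / 2


def mle_closed_form(data):
    """Setting the derivative n/rate - sum(x) to zero gives rate = 1 / mean."""
    return len(data) / sum(data)


rng = random.Random(0)
true_rate = 2.0  # hypothetical ground truth, used only to simulate data
data = [rng.expovariate(true_rate) for _ in range(5_000)]

rate_numeric = mle_numeric(data)
rate_exact = mle_closed_form(data)
print(rate_numeric, rate_exact)
```

The two estimates should agree up to numerical tolerance. In models without a closed form, only the numerical route is available, and the log-likelihood need not be concave, so a search like this one may only find a local maximum.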

Of course, the idea that we should just pick the parameters under which the observed data are most probable is not some bedrock philosophical principle that can't be debated. And, as you might imagine, people do debate it, [[Bayesian statistics|Bayesians]] in particular. MLE is [[frequentist statistics|frequentist]] by nature; parameters are fixed and there are no priors. It also doesn't provide natural [[uncertainty quantification]] on its own, since we just get a point estimate. This is where [[central limit theorems]] kick in: the asymptotic distribution of the MLE gives approximate confidence intervals.

With i.i.d. data, the MLE can be seen as [[empirical risk minimization]] with the loss $\log p_{\theta^*}(X_i) / p_{\theta}(X_i)$, where $\theta^*$ is the true parameter. The numerator doesn't depend on $\theta$, so minimizing the empirical risk is the same as maximizing the log-likelihood. The associated risk (see [[statistical decision theory]]) is the [[KL divergence]] between $P_{\theta^*}$ and $P_{\theta}$. This can be used to show that, under certain regularity conditions, the MLE is a consistent estimator. The conditions are strong identifiability and that the empirical risk obeys a uniform [[laws of large numbers|LLN]].
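Spelling out the ERM connection, as a sketch under the assumption that the data are i.i.d. from $P_{\theta^*}$ for some true parameter $\theta^*$ (notation introduced here), so that $L(\theta) = \prod_{i=1}^n p_\theta(X_i)$. The empirical risk of the log-ratio loss splits into a term that is constant in $\theta$ and the normalized log-likelihood:

$$
\frac{1}{n}\sum_{i=1}^n \log\frac{p_{\theta^*}(X_i)}{p_\theta(X_i)}
= \underbrace{\frac{1}{n}\sum_{i=1}^n \log p_{\theta^*}(X_i)}_{\text{constant in } \theta}
- \frac{1}{n}\log L(\theta),
\qquad
\mathbb{E}_{\theta^*}\left[\log\frac{p_{\theta^*}(X)}{p_\theta(X)}\right] = \kl(P_{\theta^*} \| P_\theta).
$$

So the minimizer of the empirical risk is exactly the maximizer of $L(\theta)$, and the population risk is the KL divergence, which is zero at $\theta = \theta^*$. Identifiability makes that minimizer unique, and the uniform LLN lets the empirical minimizer track the population one.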

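When our model is misspecified (i.e., the data are being generated by some distribution that's not in our model), we can use the connection between the KL divergence and the MLE to see that the MLE targets the parameter that minimizes the KL divergence from the true data-generating distribution to the model. That is, if the true data-generating distribution is $Q$, then $\wh{\theta}$ minimizes the empirical version of $\kl(Q \| P_\theta)$ (up to a term that doesn't depend on $\theta$), and under regularity conditions $$ \kl(Q \| P_{\wh{\theta}}) \to \inf_{\theta\in\Theta}\kl(Q \| P_\theta) \quad \text{in probability as } n\to\infty, $$ where $\wh{\theta}$ is the MLE. In finite samples we only have $\kl(Q \| P_{\wh{\theta}}) \geq \inf_{\theta\in\Theta}\kl(Q \| P_\theta)$, since $\wh{\theta}\in\Theta$.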
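A small simulation can illustrate the misspecified case. This is a sketch under assumptions chosen so that everything is computable by hand: the data come from $Q = \text{Uniform}(0,2)$, the model is the exponential family $\{\text{Exp}(\lambda)\}$, and then $\kl(Q \| P_\lambda) = -\log 2 - \log\lambda + \lambda$, which is minimized at $\lambda = 1$. The function and variable names are hypothetical.

```python
import math
import random


def kl_uniform_to_exponential(rate):
    """KL(Q || P_rate) for Q = Uniform(0, 2) and P_rate = Exponential(rate).

    E_Q[log q] = -log 2 and E_Q[log p_rate] = log(rate) - rate * E_Q[X],
    with E_Q[X] = 1.
    """
    return -math.log(2) - math.log(rate) + rate


best_rate = 1.0  # minimizer of the KL divergence over the exponential model
best_kl = kl_uniform_to_exponential(best_rate)  # equals 1 - log 2

rng = random.Random(1)
gaps = {}
for n in (100, 10_000, 200_000):
    sample = [rng.uniform(0.0, 2.0) for _ in range(n)]
    rate_hat = n / sum(sample)  # exponential MLE fitted to non-exponential data
    gaps[n] = kl_uniform_to_exponential(rate_hat) - best_kl
    print(n, rate_hat, gaps[n])
```

The gap `gaps[n]` is never negative (the MLE can't beat the best parameter in the model) and should be small for large `n`, even though no exponential distribution matches $Q$ and `best_kl` itself stays strictly positive.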

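Under enough regularity conditions (and a well-specified model with true parameter $\theta^*$), the MLE obeys a [[central limit theorems|CLT]], $\sqrt{n}(\wh{\theta} - \theta^*) \to N(0, I(\theta^*)^{-1})$ in distribution, with asymptotic variance equal to the inverse of the [[Fisher information]] $I(\theta^*)$, thus asymptotically matching the [[Cramer-Rao lower bound]].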
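As a rough sanity check of the CLT claim, here is a simulation sketch, again assuming an exponential model with rate $\lambda$, where the per-observation Fisher information is $I(\lambda) = 1/\lambda^2$. The values of `true_rate`, `n`, and `reps` are arbitrary choices for illustration. It compares the empirical variance of $\sqrt{n}(\wh{\lambda} - \lambda)$ across replications to $I(\lambda)^{-1} = \lambda^2$.

```python
import math
import random
import statistics

rng = random.Random(2)
true_rate = 1.5   # hypothetical ground truth for the simulation
n = 500           # sample size per replication
reps = 2_000      # number of independent replications

scaled_errors = []
for _ in range(reps):
    total = sum(rng.expovariate(true_rate) for _ in range(n))
    rate_hat = n / total  # exponential MLE, 1 / sample mean
    scaled_errors.append(math.sqrt(n) * (rate_hat - true_rate))

empirical_var = statistics.variance(scaled_errors)
inverse_fisher = true_rate ** 2  # I(rate) = 1 / rate^2 per observation
print(empirical_var, inverse_fisher)
```

The two numbers should be close but not identical: the match is asymptotic, and at finite `n` the exponential MLE is slightly biased and the variance estimate itself is noisy.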