Back when I first studied machine learning, I came across the following formula for computing the variance of a sampled distribution:
\[\sigma^2 = \frac{1}{N-1} \sum_{i=1}^N (x_i - \bar{x})^2\]where \(x_i\) is the \(i^{th}\) sample from the distribution over \(x\), \(N\) is the total number of samples, and \(\bar{x}\) is the sample mean. This was called the unbiased estimate of the variance.
Why? Intuitively, shouldn’t the denominator be \(N\), since \(\sigma^2 = \mathbb{E}[(x-\bar{x})^2]\)?
That’s where I was mistaken. And now that I’ve had time to think about it, it’s actually quite elegant.
Variance is defined as :
the expectation of the squared deviation of a random variable from its mean.
The ‘mean’ there is the true mean of the distribution of the random variable. \(\bar{x}\), on the other hand, is the sample mean. This means that every time we compute \((x_i - \bar{x})\), we’re off from the true deviation \((x_i-\mu)\) by \((\bar{x}-\mu)\), where \(\mu\) is the true mean of the random variable.
Alright, so something is off. But by how much? And how did \(N\) get replaced exactly by \(N-1\)? Time to dive into some math…
Let’s call the true (unbiased) variance \(\sigma_{true}^2\) and the biased variance \(\sigma_{biased}^2\). Now,
\[\begin{aligned} \sigma_{true}^2 &= \mathbb{E}[(x-\mu)^2] \\ &= \mathbb{E}[(x-\bar{x} + \bar{x} - \mu)^2] \\ &= \mathbb{E}[(x-\bar{x})^2 + (\bar{x} - \mu)^2 + 2(x-\bar{x})(\bar{x}-\mu)] \\ &= \mathbb{E}[(x-\bar{x})^2] + \mathbb{E}[(\bar{x} - \mu)^2] + 2\mathbb{E}[(x-\bar{x})(\bar{x}-\mu)]\\ &= \sigma_{biased}^2 + \mathbb{E}[(\bar{x} - \mu)^2] + 2(\bar{x}-\mu)\mathbb{E}[(x-\bar{x})]\\ \sigma_{true}^2 &= \sigma_{biased}^2 + \mathbb{E}[(\bar{x} - \mu)^2] \end{aligned}\]
The cross term drops out because, for any given set of samples, the deviations from the sample mean sum to zero: \(\sum_{i=1}^N (x_i - \bar{x}) = 0\).
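If you’d like to see that decomposition hold on actual numbers, here is a minimal sketch. The Gaussian distribution, the values of `mu` and `sigma`, the sample size and the seed are all my own assumptions for illustration; any distribution with a known mean would do. For a single batch of samples, the average squared deviation around the true mean should equal the average squared deviation around the sample mean plus \((\bar{x}-\mu)^2\).

```python
import random

random.seed(0)  # arbitrary seed, only for repeatability

mu, sigma = 5.0, 2.0  # assumed true mean and standard deviation
N = 10                # assumed sample size

xs = [random.gauss(mu, sigma) for _ in range(N)]
x_bar = sum(xs) / N   # sample mean

# Average squared deviation around the TRUE mean
around_true_mean = sum((x - mu) ** 2 for x in xs) / N

# Average squared deviation around the SAMPLE mean (the "biased" quantity)
around_sample_mean = sum((x - x_bar) ** 2 for x in xs) / N

# The leftover term from the derivation
gap = (x_bar - mu) ** 2

print("around true mean      :", around_true_mean)
print("around sample mean+gap:", around_sample_mean + gap)
```

The two printed numbers should agree up to floating point error, and since `gap` is a square, the deviation around the sample mean can never exceed the deviation around the true mean.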
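Alright now. We have established that our biased estimate is, on average, smaller than the true variance. (Why? Because the second term on the RHS is the expectation of a squared quantity, so it cannot be negative.) Let us compute that quantity.
\[\begin{aligned} \mathbb{E}[(\bar{x} - \mu)^2] &= \mathrm{Var}(\bar{x}) \\ &= \mathrm{Var}\left(\frac{1}{N}\sum_{i=1}^N x_i\right) \\ &= \frac{1}{N^2}\,\mathrm{Var}\left(\sum_{i=1}^N x_i\right) \\ &= \frac{1}{N^2}\sum_{i=1}^N \mathrm{Var}(x_i) \\ &= \frac{\sigma_{true}^2}{N} \end{aligned}\]
The first equality holds because the expected value of \(\bar{x}\) is \(\mu\). In the step that turns the variance of the sum into a sum of variances, we could use that property since all the \(x_i\)s are independently sampled, and hence uncorrelated. Thus, we have
\[\begin{aligned} \sigma_{true}^2 &= \sigma_{biased}^2 + \frac{\sigma_{true}^2}{N} \\ \sigma_{true}^2 &= \frac{N}{N-1}\,\sigma_{biased}^2 \\ &= \frac{N}{N-1} \cdot \frac{1}{N} \sum_{i=1}^N (x_i - \bar{x})^2 \\ &= \frac{1}{N-1} \sum_{i=1}^N (x_i - \bar{x})^2 \end{aligned}\]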
There we go, the familiar variance of a sampled distribution.
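The result can also be checked empirically. The sketch below is a small Monte Carlo experiment, not a proof: it repeatedly draws \(N\) samples from a standard normal (my choice, so the true variance is 1), computes both estimates, and averages them over many trials. The function names, the trial count and the seed are hypothetical and picked just for this illustration.

```python
import random


def biased_variance(xs):
    """Average squared deviation from the sample mean (divides by N)."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / n


def unbiased_variance(xs):
    """Same sum of squares, but divided by N - 1."""
    n = len(xs)
    return biased_variance(xs) * n / (n - 1)


def average_estimates(n, trials, mu=0.0, sigma=1.0, seed=0):
    """Average both estimators over many independent batches of n samples."""
    rng = random.Random(seed)
    total_biased = 0.0
    total_unbiased = 0.0
    for _ in range(trials):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        total_biased += biased_variance(xs)
        total_unbiased += unbiased_variance(xs)
    return total_biased / trials, total_unbiased / trials


# With N = 5 and a true variance of 1, the derivation predicts the biased
# average to sit near (N - 1) / N = 0.8 and the unbiased one near 1.0.
avg_biased, avg_unbiased = average_estimates(n=5, trials=100_000)
print("average biased estimate  :", avg_biased)
print("average unbiased estimate:", avg_unbiased)
```

A small \(N\) is used on purpose: the factor \((N-1)/N\) approaches 1 as \(N\) grows, so the bias is easiest to see with few samples per batch.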
Doesn’t it feel great to prove something to yourself rather than take someone’s word for it? :)
Parting notes
- The use of \(N-1\) instead of \(N\) in that formula is called Bessel’s correction.
- I’m not sure what or who to attribute the definition of sample variance to, so here’s the Wikipedia link.
- Check out this link for a crisp refresher on independent vs uncorrelated variables (TL;DR - independent variables are uncorrelated; uncorrelated variables may not be independent)
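As a practical footnote to the note on Bessel’s correction: Python’s standard library exposes both versions, so it is worth knowing which one you are calling. In this short sketch, `statistics.pvariance` divides by \(N\) and `statistics.variance` divides by \(N-1\); the `data` list is made up for the example.

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up example values
n = len(data)

population_var = statistics.pvariance(data)  # divides by N
sample_var = statistics.variance(data)       # divides by N - 1 (Bessel's correction)

print("divide by N    :", population_var)
print("divide by N - 1:", sample_var)
print("ratio          :", sample_var / population_var, "vs N/(N-1) =", n / (n - 1))
```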