Tutorial: Kernel Density Estimation Explained
Tutorial type: algorithm-oriented, written in Python
Requirements: Python >= 2.7.0, NumPy >= 1.6.1, Matplotlib >= 1.4.3 (to run the Python code; the output is shown in the article)
What is KDE?
Kernel density estimation (KDE) is a non-parametric way to estimate the probability density function (PDF) of a random variable. In other words, the aim of KDE is to find the PDF for a given dataset. How does it differ from the normalized histogram approach? It smooths the values of the PDF around the data points. Let's take a look at KDE with a Gaussian kernel in several examples.
First, let's recall what the Gaussian PDF looks like:
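For reference, the Gaussian PDF with mean μ and standard deviation σ is:

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)
```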
When selecting the kernel we must keep in mind the main property of a PDF: it integrates to one. So the kernel that defines the PDF must also integrate to one:
In KDE, the sigma of the Gaussian is called the bandwidth parameter, because it defines how spread out the function is. The kernel density estimator is defined by:
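In its standard form, the Gaussian kernel density estimator can be written as:

```latex
\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right),
\qquad
K(u) = \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}
```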
where h is the bandwidth and n is the number of data points. The final PDF is the average of the Gaussian PDFs centered at each data point.
One data point example
The easiest way to grasp the idea is to use a one- or two-point example. Consider the code and output:
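As a minimal sketch of such a one-point example (the helper `gaussian_kde`, the grid, and the bandwidth values below are our own illustrative choices, not the article's original listing):

```python
import numpy as np

def gaussian_kde(data, x, bandwidth):
    """Evaluate a Gaussian KDE for `data` at the grid points `x`."""
    data = np.asarray(data)
    # One Gaussian per data point, then average them.
    u = (x[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * u**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

x = np.linspace(-5, 5, 1001)
for h in (0.1, 0.5, 1.0):
    pdf = gaussian_kde([0.0], x, bandwidth=h)
    # The estimate integrates to ~1 regardless of the bandwidth,
    # even though its peak value can exceed 1 for small h.
    print(h, np.trapz(pdf, x))
```

A small bandwidth gives a narrow, tall spike around the single point; a large bandwidth spreads the same unit of probability mass over a wide interval.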
It's the fastest way to understand the main difference between kernel methods and normalized histograms. A PDF can have values higher than one, because only its integral matters. The first (top-left) image is NOT a normalized histogram: the width of the bin does not matter there. All of the other plots integrate to 1! The three lower graphs show the same Gaussian kernel with different bandwidth values. Of course, for a single point the discrete histogram with a narrow bin makes much more sense (the distribution is not natural here, obviously).
Two data points example
The example with two points visually explains how the probability densities sum up so that the integral stays equal to one. The final result is a simple average of all the individual probability densities.
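This averaging can be sketched as follows (the helper function and the two sample points are illustrative assumptions, not the article's original listing):

```python
import numpy as np

def gaussian_kde(data, x, bandwidth):
    """Average of Gaussian PDFs centred at every data point."""
    data = np.asarray(data)
    u = (x[:, None] - data[None, :]) / bandwidth
    k = np.exp(-0.5 * u**2) / (bandwidth * np.sqrt(2 * np.pi))
    return k.mean(axis=1)

x = np.linspace(-6, 6, 1201)
pts = [-1.0, 2.0]
h = 0.5

combined = gaussian_kde(pts, x, h)
# The combined estimate equals the average of the two
# single-point estimates, so the integral stays at one.
parts = [gaussian_kde([p], x, h) for p in pts]
print(np.allclose(combined, np.mean(parts, axis=0)))
print(round(np.trapz(combined, x), 3))
```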
The lower graphs nicely visualize how the Gaussians sum up and how the bandwidth value affects the shape of the final PDF.
Multiple data points example
The multiple-point example is the most realistic one, although the dataset is still made up of samples from two Gaussians.
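A sketch of such an example, assuming an invented two-Gaussian mixture (the mixture parameters, sample sizes, and bandwidth are our own choices, not the article's original code):

```python
import numpy as np

def gaussian_kde(data, x, bandwidth):
    """Average of Gaussian PDFs centred at every data point."""
    data = np.asarray(data)
    u = (x[:, None] - data[None, :]) / bandwidth
    k = np.exp(-0.5 * u**2) / (bandwidth * np.sqrt(2 * np.pi))
    return k.mean(axis=1)

np.random.seed(0)
# Made-up dataset: samples drawn from a mixture of two Gaussians.
data = np.concatenate([np.random.normal(-2.0, 0.7, 300),
                       np.random.normal(3.0, 1.0, 200)])

x = np.linspace(-8, 8, 1601)
pdf = gaussian_kde(data, x, bandwidth=0.4)
print(np.trapz(pdf, x))  # close to 1, as any PDF should be
```

With enough samples the estimate recovers the two bumps of the underlying mixture; the bandwidth again controls how smooth or spiky those bumps look.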
To gain a deeper understanding of KDE, we strongly recommend taking a look at a practical task that uses KDE.
Download the code from this tutorial from our GitHub repository.