Kernel Density Estimation: Predict KDE & Generate Data
Task: For a given dataset X, estimate its density function. Using the estimated density function, generate a new random dataset Y and test the hypothesis that X and Y follow the same distribution.
Method: Histogram data generation, Gaussian Kernel Density Estimation data generation, K-S testing of density integral.
Tutorial type: Algorithm oriented, written in Python
Requirements: Python >= 2.7.0, Numpy >= 1.6.0, Matplotlib 1.4.3
Here we apply Kernel Density Estimation (KDE) to a practical problem: generating random data with the same PDF as a given dataset. To achieve this, the following steps are taken:
- Load the dataset and analyze it with histograms
- Produce a PDF using different KDE techniques
- Check whether the generated data follows the same distribution as the given dataset
Data generation from Histogram
First we check how well data can be generated from a normalized histogram. The code below takes the following steps: compute the histogram of the given dataset, generate new data with that distribution, and plot the results.
Visually, the distributions are very similar. We will evaluate the results precisely later, comparing them with data generated from Kernel Density Estimation.
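The histogram approach can be sketched as follows. This is a minimal illustration, not the tutorial's original script: the function name `sample_from_histogram`, the bin count, and the synthetic stand-in dataset are assumptions, and `np.random.default_rng` needs a newer NumPy than the version listed in the requirements. Each new value is produced by picking a bin with probability proportional to its count and then drawing uniformly inside that bin.

```python
import numpy as np

def sample_from_histogram(x, n_samples, bins=30, rng=None):
    """Draw n_samples values whose distribution follows the histogram of x."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    counts, edges = np.histogram(x, bins=bins)
    # Probability mass of each bin
    probs = counts.astype(float) / counts.sum()
    # Choose a bin index for every sample according to the bin masses
    idx = rng.choice(len(counts), size=n_samples, p=probs)
    # Draw uniformly between the left and right edge of each chosen bin
    return rng.uniform(edges[idx], edges[idx + 1])

# Stand-in data (hypothetical; the tutorial uses its own dataset)
rng = np.random.default_rng(0)
x_demo = rng.normal(loc=0.0, scale=1.0, size=1000)
y_demo = sample_from_histogram(x_demo, n_samples=1000, bins=30, rng=rng)
```

The generated values can never fall outside the range of the original data, and the result is piecewise uniform, so the choice of bin count directly limits how faithfully the shape is reproduced.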
Data generation using Kernel Density Estimation
Now the task is to find the KDE first, then generate data from it. If you are not familiar with KDE, take a look at this post explaining it. The following code uses a Gaussian kernel and two methods to estimate the optimal bandwidth h: Silverman's rule and Least Squares Cross Validation (LSCV). The theory behind LSCV can be found in this paper.
The above picture displays the output of the script. The top panel shows the estimated PDF for the bandwidths found by the different techniques. The bottom panel shows histograms of the data generated from each estimated PDF.
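A compact sketch of these pieces is shown below. It is written under several assumptions and is not the tutorial's original script: all function names are hypothetical, the Silverman bandwidth uses the common rule-of-thumb form 0.9 · min(σ, IQR/1.34) · n^(-1/5), the LSCV minimum is found by a plain grid search over candidate bandwidths, and new data is drawn by picking a random data point and adding Gaussian noise with standard deviation h (which is equivalent to sampling from a Gaussian KDE).

```python
import numpy as np

def normal_pdf(u, s):
    """Density of N(0, s^2) evaluated at u."""
    return np.exp(-0.5 * (u / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def silverman_bandwidth(x):
    """Rule-of-thumb bandwidth: 0.9 * min(sigma, IQR / 1.34) * n^(-1/5)."""
    x = np.asarray(x, dtype=float)
    sigma = np.std(x, ddof=1)
    q75, q25 = np.percentile(x, [75, 25])
    return 0.9 * min(sigma, (q75 - q25) / 1.34) * len(x) ** (-0.2)

def lscv_score(x, h):
    """LSCV criterion for a Gaussian kernel: int f^2 - (2/n) sum_i f_{-i}(x_i)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    d = x[:, None] - x[None, :]  # all pairwise differences
    # The integral of the squared estimate has a closed form for Gaussians
    integral_sq = normal_pdf(d, np.sqrt(2.0) * h).sum() / n ** 2
    # Leave-one-out term: drop the n diagonal (i == j) contributions
    off_diag = normal_pdf(d, h).sum() - n * normal_pdf(0.0, h)
    return integral_sq - 2.0 * off_diag / (n * (n - 1))

def lscv_bandwidth(x, candidates):
    """Return the candidate bandwidth with the smallest LSCV score."""
    scores = [lscv_score(x, h) for h in candidates]
    return candidates[int(np.argmin(scores))]

def kde_pdf(x, grid, h):
    """Gaussian KDE of the sample x evaluated on the points in grid."""
    x = np.asarray(x, dtype=float)
    grid = np.asarray(grid, dtype=float)
    return normal_pdf(grid[:, None] - x[None, :], h).mean(axis=1)

def sample_from_kde(x, h, n_samples, rng):
    """Pick data points at random and perturb each with N(0, h^2) noise."""
    centers = rng.choice(np.asarray(x, dtype=float), size=n_samples, replace=True)
    return centers + rng.normal(0.0, h, size=n_samples)

# Stand-in data (hypothetical; the tutorial uses its own dataset)
rng = np.random.default_rng(0)
x_demo = rng.normal(size=300)
h_silverman = silverman_bandwidth(x_demo)
h_lscv = lscv_bandwidth(x_demo, np.linspace(0.05, 1.0, 40))
y_silverman = sample_from_kde(x_demo, h_silverman, 300, rng)
y_lscv = sample_from_kde(x_demo, h_lscv, 300, rng)
```

The pairwise-difference matrix makes `lscv_score` quadratic in the sample size, which is fine for a few thousand points but would need a different approach for much larger datasets.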
K-S Statistics To Evaluate If Generated Data Is The Same Distribution As Dataset
We will use the simple version of the K-S test described here, where we find the maximum difference between the cumulative distribution functions (CDFs) of two distributions, the original and the generated one in our case.
The resulting value D (the difference) is then compared against the table of critical values, as shown in the Wikipedia article mentioned above. Consider the following code, which shows how to compare every dataset we have generated so far. One more important term used in the theory is the empirical distribution function. It is the cumulative sum of histogram counts divided by the total number of elements. We use normalized histograms here, so we only need to multiply the cumulative sum by the bin width dx.
The above graphs show that the smallest difference D is obtained with the KDE method whose bandwidth is found by least squares cross validation. To test the hypothesis that the newly generated data has the same distribution as the original, we use the code below. There is a deeper theoretical explanation of why the values in the code define the limit for accepting the hypothesis, but we do not go into it in this article.
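The histogram-based computation of D described above might look like the following sketch. The function name `ks_distance_hist` and the number of bins are assumptions; both samples are binned on the same edges so that their empirical distribution functions can be compared point by point.

```python
import numpy as np

def ks_distance_hist(x, y, bins=100):
    """Maximum gap between two empirical CDFs built from normalized histograms."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Shared bin edges covering both samples
    lo = min(x.min(), y.min())
    hi = max(x.max(), y.max())
    edges = np.linspace(lo, hi, bins + 1)
    dx = edges[1] - edges[0]
    # Normalized histograms integrate to 1 over the shared range
    px, _ = np.histogram(x, bins=edges, density=True)
    py, _ = np.histogram(y, bins=edges, density=True)
    # Empirical distribution function: cumulative sum times the bin width
    cdf_x = np.cumsum(px) * dx
    cdf_y = np.cumsum(py) * dx
    return np.max(np.abs(cdf_x - cdf_y))

# Stand-in data (hypothetical): two samples from the same distribution
rng = np.random.default_rng(0)
a_demo = rng.normal(size=1000)
b_demo = rng.normal(size=1000)
d_demo = ks_distance_hist(a_demo, b_demo)
```

Because the CDFs are only evaluated at bin edges, this value is an approximation of the exact K-S statistic, and it gets closer as the number of bins grows.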
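As an illustration of the acceptance check, the sketch below uses the standard large-sample approximation for the two-sample K-S critical value, D_crit = c(α) · sqrt((n + m) / (n · m)) with c(α) = sqrt(-ln(α / 2) / 2), which gives c ≈ 1.358 at α = 0.05. The function names are hypothetical, and the constants used in the tutorial's own script may differ.

```python
import math

def ks_critical_value(n, m, alpha=0.05):
    """Large-sample critical value for the two-sample K-S statistic."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2.0))
    return c_alpha * math.sqrt((n + m) / float(n * m))

def same_distribution(d, n, m, alpha=0.05):
    """True if D does not exceed the critical value (hypothesis not rejected)."""
    return d <= ks_critical_value(n, m, alpha)

# Example with two samples of 1000 points each
d_crit = ks_critical_value(1000, 1000)
```

A value of D below the critical value means the hypothesis of identical distributions is not rejected at the chosen significance level; it does not prove that the distributions are the same.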
Download Scripts And Dataset
The code and dataset used in the tutorial are located on GitHub.