Overview
fast-opt implements multivariate density estimation via the Optional Pólya Tree (OPT) (Wong and Ma, 2010). Continuous and discrete variables are supported; categorical variables are not yet supported. There are also options for performing a copula transform via OPT and for multi-class classification.
Based on the MiniBooNE dataset (UCI link), a 50-dimensional particle identification problem: 12,000 of the 130,064 samples were used for testing and the remainder for training. fast-opt achieves a classification rate of 85.71%, compared to 84.52% for a Gaussian SVM. Download data.
On a standard workstation with 16 GB of memory, fast-opt can handle up to several million samples and around 50 dimensions.
Please cite:
For the Optional Pólya Tree:
Wing Hung Wong and Li Ma (2010), Optional Pólya Tree and Bayesian Inference.
Annals of Statistics, 38:1433-1459.
For the fast-opt package:
Hui Jiang, John C. Mu, Kun Yang, Chao Du, Luo Lu and Wing Hung Wong (2013),
Computational Aspects of Optional Pólya Tree (submitted).
http://arxiv.org/abs/1309.5489
Installation
Only Linux and Mac OS X are supported. No additional libraries are required.
To install, type
make
Then copy the binary fast-opt to a directory of your choice.
Examples
Density Estimation Example
Download the following small dataset, randomly generated from a 2-D normal distribution. Download data. The true density at a large number of points is also included.
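If you want to generate a similar toy dataset yourself, the following sketch draws samples from a 2-D normal and writes them as whitespace-separated rows. The file name, sample size, covariance, and exact text layout are my assumptions for illustration, so adjust them to match the downloaded files.

# Sketch: generate a toy 2-D normal sample in plain-text form.
# Assumed layout (not taken from the fast-opt docs): one sample per row,
# whitespace-separated values, no header.
import numpy as np

rng = np.random.default_rng(0)
n = 50000
mean = [0.0, 0.0]
cov = [[1.0, 0.0], [0.0, 1.0]]          # independent marginals, as in the example data
samples = rng.multivariate_normal(mean, cov, size=n)
np.savetxt("toy_2norm_50000.txt", samples, fmt="%.6f")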
Estimate without copula transform
Estimate the density with LLOPT
fast-opt llopt -np 0.001 3 2 dim3_2norm_50000.txt dim3_2norm
These parameters are fine for low-dimensional cases. For high-dimensional cases, I recommend looking ahead only 2 levels.
Now, compute the density at the original data points. Of course, you would normally use new data here.
fast-opt density dim3_2norm_50000.txt dim3_2norm.den > density.txt
We can also compute the Hellinger distance if we have samples drawn from the true density along with the true density values at those points. One million such points are included.
fast-opt hell_dist density_dim3_2norm_100000.txt dim3_2norm.den
The Hellinger distance should be 0.0671282.
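For reference, this kind of Hellinger distance can be estimated by Monte Carlo: with points x_i drawn from the true density p, H^2(p, q) = 1 - integral sqrt(p q) dx is approximated by 1 - mean_i sqrt(q(x_i)/p(x_i)). The sketch below assumes you already have the true and estimated densities at those points as plain arrays; it is only an illustration, not fast-opt's internal implementation, and fast-opt's normalization may differ.

# Sketch: Monte Carlo estimate of the Hellinger distance between the true
# density p and an estimate q, using points drawn from p.
#   H^2(p, q) = 1 - integral sqrt(p * q) dx
#             ~ 1 - mean_i sqrt(q(x_i) / p(x_i))   for x_i ~ p
# p_true and q_est are hypothetical arrays of density values at those points.
import numpy as np

def hellinger_from_samples(p_true, q_est):
    p_true = np.asarray(p_true, dtype=float)
    q_est = np.asarray(q_est, dtype=float)
    bc = np.mean(np.sqrt(q_est / p_true))   # Bhattacharyya coefficient estimate
    return np.sqrt(max(0.0, 1.0 - bc))      # guard against tiny negative noise

# Toy usage (not the packaged dataset):
print(hellinger_from_samples([0.20, 0.50, 0.30], [0.25, 0.45, 0.35]))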
Estimate with copula transform
Perform a copula transform
fast-opt copula 2 dim3_2norm_50000.txt dim3_2norm
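Conceptually, a copula transform maps each coordinate through its (empirical) marginal CDF, so the transformed marginals are roughly uniform on (0, 1) and the transformed density captures only the dependence between coordinates. The sketch below is a generic rank-based illustration of that idea, not fast-opt's exact implementation; the input file layout is assumed to be plain whitespace-separated rows.

# Sketch: a generic empirical copula (rank) transform, for illustration only.
# Each column is mapped through its empirical CDF, giving values in (0, 1).
import numpy as np

def empirical_copula_transform(x):
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    ranks = np.argsort(np.argsort(x, axis=0), axis=0) + 1   # ranks 1..n per column
    return ranks / (n + 1.0)                                 # strictly inside (0, 1)

data = np.loadtxt("dim3_2norm_50000.txt")    # assumed whitespace-separated layout
u = empirical_copula_transform(data)
np.savetxt("toy_copula.txt", u, fmt="%.6f")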
Estimate the density of the copula-transformed data with LLOPT
fast-opt llopt -np 0.001 3 2 dim3_2norm.copula.txt dim3_2norm.copula
Now, compute the density at the original data points. Of course, you would normally use new data here.
fast-opt density dim3_2norm_50000.txt dim3_2norm.copula.den dim3_2norm.marginal.den > density2.txt
We can also compute the Hellinger distance.
fast-opt hell_dist density_dim3_2norm_100000.txt dim3_2norm.copula.den dim3_2norm.marginal.den
The Hellinger distance should be 0.029494. Clearly, the copula transform improves the estimate, since the marginals are independent.
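The reason both the copula density and the marginal densities are passed to the density command above is Sklar's theorem: the joint density at a point x factors as f(x) = c(F_1(x_1), ..., F_d(x_d)) * prod_i f_i(x_i), where c is the copula density and F_i, f_i are the marginal CDFs and densities. The tiny sketch below shows that combination with hypothetical values; the density command presumably performs it internally.

# Sketch: combining a copula density with marginal densities (Sklar's theorem).
#   f(x) = c(F_1(x_1), ..., F_d(x_d)) * prod_i f_i(x_i)
# The numbers below are hypothetical values for a single 2-D point.
import numpy as np

copula_density_at_u = 1.8                         # c(u) at the transformed point
marginal_densities_at_x = np.array([0.35, 0.41])  # f_i(x_i) for each coordinate
joint_density = copula_density_at_u * np.prod(marginal_densities_at_x)
print(joint_density)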
Classification Example
I will only show the example with the copula transform, since the example without it is similar. Download data.
Perform a copula transform
fast-opt copula 2 train_0.txt train_0
fast-opt copula 2 train_1.txt train_1
Estimate the density of the copula-transformed data with LLOPT
fast-opt llopt -np 0.001 3 2 train_0.copula.txt train_0.copula
fast-opt llopt -np 0.001 3 2 train_1.copula.txt train_1.copula
Perform classification with the test data
fast-opt classify -c test.txt 1,1 train_0.copula.den,train_1.copula.den train_0.marginal.den,train_1.marginal.den > classify.out
The classification rate you should get is 0.6832, which is similar to the SVM rate of 0.6871. However, LLOPT is much faster.
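Density-based classification of this kind typically assigns a test point to the class whose (weighted) estimated class-conditional density is largest at that point. The sketch below illustrates only that decision rule with hypothetical density arrays and equal class weights; it is not the internals of the classify command.

# Sketch: the generic decision rule behind density-based classification.
# dens0 and dens1 are hypothetical estimated class-conditional densities at
# three test points; w0 and w1 are class weights.
import numpy as np

dens0 = np.array([0.12, 0.80, 0.33])   # estimated p(x | class 0)
dens1 = np.array([0.40, 0.25, 0.30])   # estimated p(x | class 1)
w0, w1 = 1.0, 1.0                       # equal class weights

labels = (w1 * dens1 > w0 * dens0).astype(int)
print(labels)                           # -> [1 0 0]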
Future Updates/Issues
- Matlab and/or R package
- More speed
- Use parameter parsing library
- Random sample feature
- Marginal distribution
- Conditional distribution
Contact
Please report any bugs to John Mu