Overview
fast-opt implements multivariate density estimation via the Optional Pólya Tree (OPT) (Wong and Ma, 2010). Continuous and discrete variables are supported; categorical variables are not yet supported. There are also options for performing a copula transform via OPT and for multi-class classification.
Based on the MiniBooNE dataset (UCI link), a 50-dimensional particle identification problem: 12,000 of the 130,064 samples were used for testing and the remainder for training. fast-opt achieves a classification rate of 85.71%, compared to 84.52% for a Gaussian SVM. Download data.
On a standard workstation with 16 GB of memory, fast-opt can handle up to several million samples and around 50 dimensions.
Please cite:
For the Optional Pólya Tree:
Wing Hung Wong and Li Ma (2010), Optional Pólya Tree and Bayesian Inference.
Annals of Statistics, 38:1433-1459.
For the fast-opt package:
Hui Jiang, John C. Mu, Kun Yang, Chao Du, Luo Lu and Wing Hung Wong (2013),
Computational Aspects of Optional Pólya Tree (submitted).
http://arxiv.org/abs/1309.5489
Installation
Only Linux and Mac OS X are supported. No additional libraries are required.
To install, type
make
Then copy the binary fast-opt to a directory of your choice.
Examples
Density Estimation Example
Download the following small dataset, randomly generated from a 2-D normal distribution. Download data. The true density at a large number of points is also included.
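If you want to generate a similar toy dataset yourself, the following sketch draws samples from a 2-D normal and writes them as whitespace-separated rows. The file name, sample size, covariance, and exact text layout are my assumptions for illustration, so adjust them to match the downloaded files.

# Sketch: generate a toy 2-D normal sample in plain-text form.
# Assumed layout (not taken from the fast-opt docs): one sample per row,
# whitespace-separated values, no header.
import numpy as np

rng = np.random.default_rng(0)
n = 50000
mean = [0.0, 0.0]
cov = [[1.0, 0.0], [0.0, 1.0]]          # independent marginals, as in the example data
samples = rng.multivariate_normal(mean, cov, size=n)
np.savetxt("toy_2norm_50000.txt", samples, fmt="%.6f")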
Estimate without copula transform
Estimate the density with LLOPT
fast-opt llopt -np 0.001 3 2 dim3_2norm_50000.txt dim3_2norm
These parameters are fine for low-dimensional cases. For high-dimensional cases, I recommend looking ahead only 2 levels.
Now, compute the density at the original data points. Of course, you would normally use new data here.
fast-opt density dim3_2norm_50000.txt dim3_2norm.den > density.txt
We can also compute the Hellinger distance if we have samples drawn from the true density along with the true density values at those points. One million such points are included.
fast-opt hell_dist density_dim3_2norm_100000.txt dim3_2norm.den
The Hellinger distance should be 0.0671282.
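For reference, this kind of Hellinger distance can be estimated by Monte Carlo: with points x_i drawn from the true density p, H^2(p, q) = 1 - integral sqrt(p q) dx is approximated by 1 - mean_i sqrt(q(x_i)/p(x_i)). The sketch below assumes you already have the true and estimated densities at those points as plain arrays; it is only an illustration, not fast-opt's internal implementation, and fast-opt's normalization may differ.

# Sketch: Monte Carlo estimate of the Hellinger distance between the true
# density p and an estimate q, using points drawn from p.
#   H^2(p, q) = 1 - integral sqrt(p * q) dx
#             ~ 1 - mean_i sqrt(q(x_i) / p(x_i))   for x_i ~ p
# p_true and q_est are hypothetical arrays of density values at those points.
import numpy as np

def hellinger_from_samples(p_true, q_est):
    p_true = np.asarray(p_true, dtype=float)
    q_est = np.asarray(q_est, dtype=float)
    bc = np.mean(np.sqrt(q_est / p_true))   # Bhattacharyya coefficient estimate
    return np.sqrt(max(0.0, 1.0 - bc))      # guard against tiny negative noise

# Toy usage (not the packaged dataset):
print(hellinger_from_samples([0.20, 0.50, 0.30], [0.25, 0.45, 0.35]))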
Estimate with copula transform
Perform a copula transform
fast-opt copula 2 dim3_2norm_50000.txt dim3_2norm
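Conceptually, a copula transform maps each coordinate through its (empirical) marginal CDF, so the transformed marginals are roughly uniform on (0, 1) and the transformed density captures only the dependence between coordinates. The sketch below is a generic rank-based illustration of that idea, not fast-opt's exact implementation; the input file layout is assumed to be plain whitespace-separated rows.

# Sketch: a generic empirical copula (rank) transform, for illustration only.
# Each column is mapped through its empirical CDF, giving values in (0, 1).
import numpy as np

def empirical_copula_transform(x):
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    ranks = np.argsort(np.argsort(x, axis=0), axis=0) + 1   # ranks 1..n per column
    return ranks / (n + 1.0)                                 # strictly inside (0, 1)

data = np.loadtxt("dim3_2norm_50000.txt")    # assumed whitespace-separated layout
u = empirical_copula_transform(data)
np.savetxt("toy_copula.txt", u, fmt="%.6f")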
Estimate the density of the copula-transformed data with LLOPT
fast-opt llopt -np 0.001 3 2 dim3_2norm.copula.txt dim3_2norm.copula
Now, compute the density at the original data points. Of course, you would normally use new data here.
fast-opt density dim3_2norm_50000.txt dim3_2norm.copula.den dim3_2norm.marginal.den > density2.txt
We can also compute the Hellinger distance.
fast-opt hell_dist density_dim3_2norm_100000.txt dim3_2norm.copula.den dim3_2norm.marginal.den
The Hellinger distance should be 0.029494. Clearly, the copula transform improves the estimate, since the marginals are independent.
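The reason both the copula density and the marginal densities are passed to the density command above is Sklar's theorem: the joint density at a point x factors as f(x) = c(F_1(x_1), ..., F_d(x_d)) * prod_i f_i(x_i), where c is the copula density and F_i, f_i are the marginal CDFs and densities. The tiny sketch below shows that combination with hypothetical values; the density command presumably performs it internally.

# Sketch: combining a copula density with marginal densities (Sklar's theorem).
#   f(x) = c(F_1(x_1), ..., F_d(x_d)) * prod_i f_i(x_i)
# The numbers below are hypothetical values for a single 2-D point.
import numpy as np

copula_density_at_u = 1.8                         # c(u) at the transformed point
marginal_densities_at_x = np.array([0.35, 0.41])  # f_i(x_i) for each coordinate
joint_density = copula_density_at_u * np.prod(marginal_densities_at_x)
print(joint_density)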
Classification Example
I will only show the example with the copula transform, since the example without it is similar. Download data.
Perform a copula transform
fast-opt copula 2 train_0.txt train_0
fast-opt copula 2 train_1.txt train_1
Estimate the density of the copula-transformed data with LLOPT
fast-opt llopt -np 0.001 3 2 train_0.copula.txt train_0.copula
fast-opt llopt -np 0.001 3 2 train_1.copula.txt train_1.copula
Perform classification with the test data
fast-opt classify -c test.txt 1,1 train_0.copula.den,train_1.copula.den train_0.marginal.den,train_1.marginal.den > classify.out
The classification rate you should get is 0.6832, which is similar to the SVM rate of 0.6871. However, LLOPT is much faster.
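Density-based classification of this kind typically assigns a test point to the class whose (weighted) estimated class-conditional density is largest at that point. The sketch below illustrates only that decision rule with hypothetical density arrays and equal class weights; it is not the internals of the classify command.

# Sketch: the generic decision rule behind density-based classification.
# dens0 and dens1 are hypothetical estimated class-conditional densities at
# three test points; w0 and w1 are class weights.
import numpy as np

dens0 = np.array([0.12, 0.80, 0.33])   # estimated p(x | class 0)
dens1 = np.array([0.40, 0.25, 0.30])   # estimated p(x | class 1)
w0, w1 = 1.0, 1.0                       # equal class weights

labels = (w1 * dens1 > w0 * dens0).astype(int)
print(labels)                           # -> [1 0 0]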
Future Updates/Issues
- Matlab and/or R package
- More speed
- Use parameter parsing library
- Random sample feature
- Marginal distribution
- Conditional distribution
Contact
Please report any bugs to John Mu