## Overview

fast-opt implements multivariate density estimation via the Optional Pólya Tree (OPT) (Wong and Ma, 2010). Continuous and discrete variables are supported; categorical variables are not yet supported. There are also options for performing a copula transform via OPT and for multi-class classification.

Based on the MiniBooNE dataset (UCI link), a 50-dimensional particle identification problem: of the 130,064 samples, 12,000 were used for testing and the remainder for training. fast-opt achieves a classification rate of 85.71%, compared to 84.52% for a Gaussian SVM. Download data.

On a standard workstation with 16 GB of memory, fast-opt can handle up to several million samples and up to around 50 dimensions.

Please cite:

For the Optional Pólya Tree link

```
Wing Hung Wong and Li Ma (2010), Optional Pólya Tree and Bayesian Inference.
Annals of Statistics, 38:1433-1459.
```

For the fast-opt package link

```
Hui Jiang, John C. Mu, Kun Yang, Chao Du, Luo Lu and Wing Hung Wong (2013)
Computational Aspects of Optional Pólya Tree (submitted)
http://arxiv.org/abs/1309.5489
```

## Installation

Only Linux and Mac OS X are supported. No additional libraries are required.

To install, type

```
make
```

Then copy the binary `fast-opt` to the directory of your choice.

## Examples

### Density Estimation Example

Download the following small dataset randomly generated from a 2-D normal distribution. Download data. The true density at a large number of points is also included.
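If you want to generate similar input yourself, here is a minimal sketch. It assumes fast-opt reads plain text with one sample per line and whitespace-separated coordinates; the exact input format and the file name used below are assumptions, not taken from the fast-opt documentation.

```python
import random

# Hypothetical helper: fast-opt's input format is ASSUMED here to be plain
# text with one sample per line and whitespace-separated coordinates.
def write_normal_samples(path, n, dim=2, seed=0):
    rng = random.Random(seed)  # fixed seed for reproducibility
    with open(path, "w") as f:
        for _ in range(n):
            row = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            f.write(" ".join("%.6f" % x for x in row) + "\n")

# Write 50,000 samples from a 2-D standard normal (file name is illustrative).
write_normal_samples("my_2norm_50000.txt", 50000)
```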

**Estimate without copula transform**

Estimate the density with LLOPT

```
fast-opt llopt -np 0.001 3 2 dim3_2norm_50000.txt dim3_2norm
```

These parameters are reasonable for low-dimensional cases. For high-dimensional cases, I recommend looking ahead only 2 levels.

Now, compute the density at the original data points. Of course, you would normally use new data here.

```
fast-opt density dim3_2norm_50000.txt dim3_2norm.den > density.txt
```

We can also compute the Hellinger distance if we have samples from the true density, along with the true density values at those points. One million such points are included.

```
fast-opt hell_dist density_dim3_2norm_100000.txt dim3_2norm.den
```

The Hellinger distance should be 0.0671282.
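For reference, the standard Monte Carlo form of this estimate, given points sampled from the true density g together with the values of f and g at those points, is H^2 = 1 - E_g[sqrt(f/g)]. A minimal sketch of that textbook estimator (not necessarily fast-opt's exact implementation):

```python
import math

def hellinger_mc(f_vals, g_vals):
    """Monte Carlo Hellinger distance: given points x_i sampled from g,
    H^2 = 1 - E_g[sqrt(f/g)] is estimated by averaging sqrt(f(x_i)/g(x_i))."""
    bc = sum(math.sqrt(f / g) for f, g in zip(f_vals, g_vals)) / len(g_vals)
    return math.sqrt(max(0.0, 1.0 - bc))  # clamp guards against Monte Carlo noise

# Identical densities are at distance 0:
print(hellinger_mc([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]))  # -> 0.0
```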

**Estimate with copula transform**

Perform a copula transform

```
fast-opt copula 2 dim3_2norm_50000.txt dim3_2norm
```
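Conceptually, a copula transform maps each coordinate through its empirical CDF so that every marginal becomes approximately uniform on (0, 1). A minimal sketch of the idea (fast-opt's exact conventions, such as the rank normalization used, are an assumption here):

```python
def copula_transform(data):
    """Map each coordinate to (0, 1) via its empirical CDF, using the
    common rank / (n + 1) normalization: the standard empirical copula
    transform.  This sketches the idea; fast-opt's details may differ."""
    n = len(data)
    dim = len(data[0])
    out = [[0.0] * dim for _ in range(n)]
    for j in range(dim):
        # Sort sample indices by their value in dimension j, then
        # assign each sample its rank scaled into (0, 1).
        order = sorted(range(n), key=lambda i: data[i][j])
        for rank, i in enumerate(order, start=1):
            out[i][j] = rank / (n + 1)
    return out

print(copula_transform([[3.0, 10.0], [1.0, 30.0], [2.0, 20.0]]))
# -> [[0.75, 0.25], [0.25, 0.75], [0.5, 0.5]]
```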

Estimate the density of the copula transformed data with LLOPT

```
fast-opt llopt -np 0.001 3 2 dim3_2norm.copula.txt dim3_2norm.copula
```

Now, compute the density at the original data points. Of course, you would normally use new data here.

```
fast-opt density dim3_2norm_50000.txt dim3_2norm.copula.den dim3_2norm.marginal.den > density2.txt
```
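This step recombines the copula density with the estimated marginals. By Sklar's theorem, the joint density factors as f(x) = c(F_1(x_1), ..., F_d(x_d)) * prod_i f_i(x_i). A hypothetical sketch of that recombination (all names are illustrative, not fast-opt internals):

```python
import math

def joint_density(x, copula_density, marginals):
    """Recombine a copula density with per-dimension marginals via
    Sklar's theorem: f(x) = c(F_1(x_1), ..., F_d(x_d)) * prod_i f_i(x_i).
    `marginals` is a list of (cdf, pdf) pairs; names are illustrative."""
    u = [cdf(xi) for xi, (cdf, _) in zip(x, marginals)]
    return copula_density(u) * math.prod(pdf(xi) for xi, (_, pdf) in zip(x, marginals))

# Example: two independent Uniform(0, 2) marginals with a uniform copula
# (c == 1 everywhere) give joint density 0.5 * 0.5 inside the square.
unif = (lambda x: x / 2.0, lambda x: 0.5)
print(joint_density([1.0, 1.0], lambda u: 1.0, [unif, unif]))  # -> 0.25
```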

We can also compute the Hellinger distance.

```
fast-opt hell_dist density_dim3_2norm_100000.txt dim3_2norm.copula.den dim3_2norm.marginal.den
```

The Hellinger distance should be 0.029494. Clearly, a copula transform improves the estimate, since the coordinates here are independent.

### Classification Example

I will only show the example with the copula transform, since the example without it is similar. Download data.

Perform a copula transform

```
fast-opt copula 2 train_0.txt train_0
fast-opt copula 2 train_1.txt train_1
```

Estimate the density of the copula transformed data with LLOPT

```
fast-opt llopt -np 0.001 3 2 train_0.copula.txt train_0.copula
fast-opt llopt -np 0.001 3 2 train_1.copula.txt train_1.copula
```

Perform classification with the test data

```
fast-opt classify -c test.txt 1,1 train_0.copula.den,train_1.copula.den train_0.marginal.den,train_1.marginal.den > classify.out
```

The classification rate you should get is 0.6832, similar to the SVM's 0.6871. However, LLOPT is much faster.
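Under the hood, density-based classification is just Bayes' rule: assign each sample to the class whose prior-weighted density is largest. A minimal sketch (array and function names are illustrative, not fast-opt internals):

```python
def classify(densities_per_class, priors):
    """Bayes classification: for each sample, pick the class whose
    prior-weighted density is largest.  `densities_per_class[k][i]` is
    the estimated density of sample i under class k's model, and
    `priors[k]` plays the role of the class weights (the `1,1` in the
    command above)."""
    n = len(densities_per_class[0])
    labels = []
    for i in range(n):
        scores = [p * d[i] for p, d in zip(priors, densities_per_class)]
        labels.append(scores.index(max(scores)))
    return labels

# Two classes, two samples, equal priors:
print(classify([[0.9, 0.2], [0.1, 0.8]], [1.0, 1.0]))  # -> [0, 1]
```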

## Future Updates/Issues

- Matlab and/or R package
- More speed
- Use parameter parsing library
- Random sample feature
- Marginal distribution
- Conditional distribution

## Contact

Please report any bugs to John Mu.