Fastest Random Forest: Sklearn?


I was tipped off by a GitHub thread that the development version of the Random Forest classifier in Sklearn (0.15-dev) had major speed improvements. I built a small benchmark using the MNIST handwritten digit dataset and compared the training and prediction speeds of Sklearn (0.14.1), Sklearn (0.15-dev), and WiseRF (1.1). At least in this small test, the development version of Sklearn is king, besting WiseRF in both training and prediction.

[Update: Check out a recent post confirming the Sklearn speedup across a wider range of datasets.]

Setup
Data: MNIST handwritten digit dataset
Classes: 10 Digits
Features: 784 grayscale pixel values
Training: 60k
Testing: 10k
Sklearn: first 0.14.1 as available from Conda, then 0.15-dev built from the GitHub repository against Anaconda with Accelerate
CPU: Intel(R) Xeon(R) CPU E5-2687W 0 @ 3.10GHz, 8 cores
100 Trees, default settings

Benchmark Timings (Lower is better)
Training
Sklearn (0.14.1) 25.47s
WiseRF (1.1) 7.95s
Sklearn (0.15-git) 7.16s

Prediction
Sklearn (0.14.1) 2.12s, accuracy 0.970 (97.0%)
WiseRF (1.1) 0.18s, accuracy 0.967 (96.7%)
Sklearn (0.15-git) 0.136s, accuracy 0.970 (97.0%)
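
To make the gaps easier to read, the timings above can be turned into ratios against the fastest entry. The snippet below is only a small sketch of that arithmetic: the numbers are copied from the tables above, and the `speedups` helper is a hypothetical name, not part of the benchmark script. By these numbers the development version trains roughly 3.6 times faster and predicts roughly 15.6 times faster than 0.14.1, while its lead over WiseRF is closer to 1.1 and 1.3 times.

```python
# Timings in seconds, copied from the benchmark tables above.
train = {"Sklearn (0.14.1)": 25.47, "WiseRF (1.1)": 7.95, "Sklearn (0.15-git)": 7.16}
predict = {"Sklearn (0.14.1)": 2.12, "WiseRF (1.1)": 0.18, "Sklearn (0.15-git)": 0.136}


def speedups(timings, baseline="Sklearn (0.15-git)"):
    """Return how many times slower each entry is than the baseline entry."""
    base = timings[baseline]
    return {name: seconds / base for name, seconds in timings.items()}


train_ratio = speedups(train)
predict_ratio = speedups(predict)

# Print each library's slowdown relative to the baseline, fastest first.
for phase, ratios in (("Training", train_ratio), ("Prediction", predict_ratio)):
    for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
        print("%-10s %-20s %.2fx" % (phase, name, ratio))
```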

Example of the MNIST digits, each 28 by 28 greyscale pixels

Caveats
I haven’t done a memory size comparison; memory use is an active area of development for the Random Forest package in Sklearn.
Versions of WiseRF past 1.1 have been developed but are no longer being released, since the company has moved to an analysis-as-a-service model.

Also, not for nothing, a 3% error rate on just the raw pixels isn’t bad. However, the current state of the art is as low as 0.23%, and it was already 0.8% by 1998 according to the MNIST page.
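
For a sense of scale, those error rates can be converted into counts of misclassified images on the 10k test set. This is just a sketch of the arithmetic: the `misclassified` helper is a made-up name for illustration, and the rates are the ones quoted above.

```python
TEST_SIZE = 10000  # number of MNIST test images, from the setup above


def misclassified(error_rate_percent, n=TEST_SIZE):
    """Convert an error rate in percent into a count of wrong test images."""
    return int(round(error_rate_percent / 100.0 * n))


# An accuracy of 0.970 corresponds to an error rate of about 3%.
forest_error = (1 - 0.970) * 100

counts = {
    "random forest on raw pixels": misclassified(forest_error),
    "0.8% (reported by 1998)": misclassified(0.8),
    "0.23% (state of the art quoted above)": misclassified(0.23),
}

for name, wrong in counts.items():
    print("%-40s %d of %d wrong" % (name, wrong, TEST_SIZE))
```

So the forest gets about 300 test digits wrong, against 80 for the 1998 figure and 23 for the best result quoted.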

The benchmark code


# MNIST data acquired from http://yann.lecun.com/exdb/mnist/
# Code to load and manipulate MNIST from http://g.sweyla.com/blog/2012/mnist-numpy/
# Which was originally adapted from: http://abel.ee.ucla.edu/cvxopt/_downloads/mnist.py

import os, struct
from array import array as pyarray
from numpy import array, int8, uint8, zeros

# `path` comes before `dataset` because a parameter without a default
# cannot follow one that has a default.
def read(digits, path, dataset="training"):
    """Load the MNIST images and labels whose label is in `digits`."""
    # Compare strings with ==; `is` tests object identity, not equality.
    if dataset == "training":
        fname_img = os.path.join(path, 'train-images.idx3-ubyte')
        fname_lbl = os.path.join(path, 'train-labels.idx1-ubyte')
    elif dataset == "testing":
        fname_img = os.path.join(path, 't10k-images.idx3-ubyte')
        fname_lbl = os.path.join(path, 't10k-labels.idx1-ubyte')
    else:
        raise ValueError("dataset must be 'testing' or 'training'")

    # Label file: 8-byte big-endian header (magic number, count), then one byte per label.
    with open(fname_lbl, 'rb') as flbl:
        magic_nr, size = struct.unpack(">II", flbl.read(8))
        lbl = pyarray("b", flbl.read())

    # Image file: 16-byte header (magic number, count, rows, cols), then one byte per pixel.
    with open(fname_img, 'rb') as fimg:
        magic_nr, size, rows, cols = struct.unpack(">IIII", fimg.read(16))
        img = pyarray("B", fimg.read())

    # Indices of the samples whose label is one of the requested digits.
    ind = [k for k in range(size) if lbl[k] in digits]
    N = len(ind)

    images = zeros((N, rows, cols), dtype=uint8)
    labels = zeros((N, 1), dtype=int8)
    for i in range(len(ind)):
        # Slice out one image's pixels and reshape them to rows x cols.
        images[i] = array(img[ind[i]*rows*cols : (ind[i]+1)*rows*cols]).reshape((rows, cols))
        labels[i] = lbl[ind[i]]

    return images, labels

directory = "."  # placeholder: set this to the folder where the MNIST files live
images, labels = read([0,1,2,3,4,5,6,7,8,9], dataset="training", path=directory)
images_test, labels_test = read([0,1,2,3,4,5,6,7,8,9], dataset="testing", path=directory)

if __name__ == "__main__":

    import time
    from sklearn.ensemble import RandomForestClassifier
    from PyWiseRF import WiseRF  # needs the wiseRF 1.1 Python bindings

    # Flatten each 28x28 image to a 784-feature row and the labels to 1-D vectors.
    X_train, y_train = images.reshape(len(images), -1), labels.ravel()
    X_test, y_test = images_test.reshape(len(images_test), -1), labels_test.ravel()

    rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, verbose=True)
    t = time.time()
    rf.fit(X_train, y_train)
    print("Sklearn Train Time %.2fs" % (time.time() - t))

    rf_wise = WiseRF(n_estimators=100)
    t = time.time()
    rf_wise.fit(X_train, y_train)
    print("WiseRF Train Time %.2fs" % (time.time() - t))

    t = time.time()
    predictions = rf.predict(X_test)
    print("Sklearn Predict Time %.3fs" % (time.time() - t))

    # Fraction of test digits classified correctly (0.970 means 97.0%).
    correct = predictions.ravel() == y_test
    print("Sklearn Fraction Correct %.3f" % correct.mean())

    t = time.time()
    predictions = rf_wise.predict(X_test)
    print("WiseRF Predict Time %.3fs" % (time.time() - t))

    correct = predictions.ravel() == y_test
    print("WiseRF Fraction Correct %.3f" % correct.mean())
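
The script above repeats the same start-the-clock, run, print pattern four times. One way to tidy that up, sketched here as a suggestion rather than as what the benchmark actually used, is a small context manager. The `timed` name and the `timings` dictionary are hypothetical, and the block uses `time.perf_counter`, which needs Python 3.3 or later. The `time.sleep` call is only a stand-in for a `fit` or `predict` call.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Time the enclosed block, print the result, and optionally store it in `results`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print("%s: %.3fs" % (label, elapsed))


timings = {}
with timed("Sleep 0.05s", timings):
    time.sleep(0.05)  # stand-in for rf.fit(...) or rf.predict(...)
```

With this in place, each benchmark step would read as `with timed("Sklearn Train Time", timings): rf.fit(...)`, and the collected `timings` could be compared afterwards.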
