I was tipped off by a GitHub thread that the development version of the Random Forest Classifier in Sklearn (0.15-dev) had major speed improvements. I built a small benchmark using the MNIST handwriting data set and compared the training and prediction speeds of Sklearn (0.14.1), Sklearn (0.15-dev), and WiseRF (1.1). At least in this small test, the development version of Sklearn is the king, besting WiseRF in both training and prediction.
[Update: Check out a recent post confirming the sklearn speedup across a wider range of data-sets.]
Data: MNIST handwritten numeral dataset
Classes: 10 Digits
Features: 784 grayscale pixel values
Sklearn tested first as 0.14.1, available from Conda, then as 0.15-dev built from the GitHub repository against Anaconda with Accelerate
CPU: Intel(R) Xeon(R) CPU E5-2687W 0 @ 3.10GHz, 8 cores
100 Trees, default settings
Benchmark Timings (Lower is better)

Training
Sklearn (0.14.1): 25.47 s
WiseRF (1.1): 7.95 s
Sklearn (0.15-git): 7.16 s

Prediction
Sklearn (0.14.1): 2.12 s, 97.0% correct
WiseRF (1.1): 0.18 s, 96.7% correct
Sklearn (0.15-git): 0.136 s, 97.0% correct
I haven’t done a memory-size comparison, which is an active area of development for the Random Forest code in Sklearn.
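As a rough stand-in until I do, one crude way to compare model memory footprints is to serialize the fitted forest and look at the pickle size. The sketch below assumes that approach; the `model_size_bytes` helper is my own name, and pickle length only approximates the true in-memory footprint.

```python
import pickle


def model_size_bytes(model):
    # The length of the pickled model is a rough proxy for its
    # in-memory footprint (the trees dominate for a random forest).
    return len(pickle.dumps(model))


# Stand-in object for illustration; in the benchmark you would pass
# the fitted RandomForestClassifier or WiseRF instance instead.
print(model_size_bytes(list(range(1000))))
```

The same helper works on either library's model object, so the two footprints can be compared with identical code.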
Versions of WiseRF past 1.1 have been developed but are no longer being released, since they’ve moved their business to an analysis-as-a-service model.
Also, not for nothing, a 3% error rate on just the raw pixels isn’t bad. However, the current state of the art is as low as 0.23%, and it was already 0.8% by 1998, according to the MNIST page.
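For reference, that 3% figure is just one minus the test-set accuracy reported in the table above:

```python
accuracy = 0.970  # Sklearn's fraction correct on the 10,000 test images
error_rate = 1.0 - accuracy
print("Error rate: %.1f%%" % (error_rate * 100))  # prints "Error rate: 3.0%"
```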
The benchmark code
# MNIST data acquired from http://yann.lecun.com/exdb/mnist/
# Code to load and manipulate MNIST from http://g.sweyla.com/blog/2012/mnist-numpy/
# which was originally adapted from: http://abel.ee.ucla.edu/cvxopt/_downloads/mnist.py
import os, struct
from array import array as pyarray
from numpy import array, int8, uint8, zeros


def read(digits, path, dataset="training"):
    if dataset == "training":
        fname_img = os.path.join(path, 'train-images.idx3-ubyte')
        fname_lbl = os.path.join(path, 'train-labels.idx1-ubyte')
    elif dataset == "testing":
        fname_img = os.path.join(path, 't10k-images.idx3-ubyte')
        fname_lbl = os.path.join(path, 't10k-labels.idx1-ubyte')
    else:
        raise ValueError("dataset must be 'testing' or 'training'")

    # Labels: 8-byte header (magic number, count), then one byte per label.
    flbl = open(fname_lbl, 'rb')
    magic_nr, size = struct.unpack(">II", flbl.read(8))
    lbl = pyarray("b", flbl.read())
    flbl.close()

    # Images: 16-byte header (magic number, count, rows, cols), then raw pixels.
    fimg = open(fname_img, 'rb')
    magic_nr, size, rows, cols = struct.unpack(">IIII", fimg.read(16))
    img = pyarray("B", fimg.read())
    fimg.close()

    # Keep only the requested digit classes.
    ind = [k for k in xrange(size) if lbl[k] in digits]
    N = len(ind)

    images = zeros((N, rows, cols), dtype=uint8)
    labels = zeros((N, 1), dtype=int8)
    for i in xrange(len(ind)):
        images[i] = array(img[ind[i]*rows*cols : (ind[i]+1)*rows*cols]).reshape((rows, cols))
        labels[i] = lbl[ind[i]]

    return images, labels


directory = # where the files live
images, labels = read([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dataset="training", path=directory)
images_test, labels_test = read([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dataset="testing", path=directory)

if __name__ == "__main__":
    from sklearn.ensemble import RandomForestClassifier
    from PyWiseRF import WiseRF
    import time

    rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, verbose=True)

    t = time.time()
    rf.fit(images.reshape(60000, -1), labels)
    print "Sklearn Train Time", time.time() - t

    rf_wise = WiseRF(n_estimators=100)
    t = time.time()
    rf_wise.fit(images.reshape(60000, -1), labels.flatten())
    print "WiseRF Train Time", time.time() - t

    t = time.time()
    predictions = rf.predict(images_test.reshape(10000, -1))
    print "Sklearn Predict Time", time.time() - t
    correct = predictions.flatten() == labels_test.flatten()
    print "Sklearn Percent Correct", 1.0 * sum(correct) / len(correct)

    t = time.time()
    predictions = rf_wise.predict(images_test.reshape(10000, -1))
    print "WiseRF Predict Time", time.time() - t
    correct = predictions.flatten() == labels_test.flatten()
    print "WiseRF Percent Correct", 1.0 * sum(correct) / len(correct)