Representative ground reference data at or sufficiently near the time of image acquisition are generally difficult and/or expensive to come by. In this regard, the simple 2:1 train:test split used in the previous example is rather wasteful of the available labeled pixels. Moreover, the variability due to the training data is not taken properly into account, since the data are sampled just once from their underlying distributions. In the case of neural networks, we also ignore the variability of the training procedure itself with respect to the random initialization of the synaptic weights. Different initializations may lead to different local minima in the cost function and correspondingly different misclassification rates.
An alternative approach, one which at least makes more efficient use of the training data, is to apply \(n\)-fold cross-validation: a small fraction (one \(n\)th of the labeled pixels) is held back for testing, and the remaining data are used to train the classifier. This is repeated \(n\) times for the \(n\) complementary test data subsets, and the results, e.g., the misclassification rates, are averaged. In this way a larger fraction of the labeled data, namely \((n-1)/n\), is used for training. Moreover, all of the data are used for both training and testing, and each observation is used for testing exactly once. For neural network classifiers, the effect of the synaptic weight initialization is also reflected in the variance of the test results.
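The bookkeeping behind \(n\)-fold cross-validation can be sketched in a few lines of Python. The following is only an illustration of the fold splitting and averaging, not the implementation in classify_cv.py: a simple nearest-mean classifier stands in for the neural network, and the names crossvalidate, Xs (the observations) and ls (the labels) are chosen here purely for convenience:

import numpy as np

def train(Xs, ls):
    # stand-in classifier: store the class means (nearest-mean classification)
    classes = np.unique(ls)
    means = np.array([Xs[ls == c].mean(axis=0) for c in classes])
    return classes, means

def classify(model, Xs):
    # assign each observation to the class with the nearest mean
    classes, means = model
    d = ((Xs[:, None, :] - means[None, :, :])**2).sum(axis=2)
    return classes[np.argmin(d, axis=1)]

def crossvalidate(Xs, ls, n=10):
    # shuffle the labeled pixels once, then split the indices into n folds
    idx = np.random.permutation(Xs.shape[0])
    folds = np.array_split(idx, n)
    rates = []
    for k in range(n):
        test = folds[k]                            # one nth held back for testing
        tr = np.hstack(folds[:k] + folds[k+1:])    # the remaining (n-1)/n for training
        model = train(Xs[tr], ls[tr])
        rates.append(np.mean(classify(model, Xs[test]) != ls[test]))
    return np.mean(rates), np.std(rates)           # mean and spread of the fold error rates

The returned standard deviation over the folds gives a rough indication of the variability discussed above.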
The drawback here, of course, is the necessity of repeating the train/test procedure \(n\) times rather than carrying it through only once. This is a problem especially for classifiers, such as neural networks, with computationally expensive training algorithms. The cross-validation steps can, however, be performed in parallel, given appropriate computer resources. Fortunately these are becoming increasingly available, in the form of multi-core processors, GPU hardware, and, of course, the cloud.
Here we make use of IPython's parallel computing capabilities to run the cross-validation procedure in parallel on the locally available processor cores. On my machine (Intel Core i5 CPU 760 @ 2.8 GHz x 4) there are four of them. The script classify_cv.py will use them all to run a 10-fold cross-validation on the training data. We will use a neural network classifier trained with the scaled conjugate gradient algorithm, as described in Appendix B of my book.
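The dispatching inside classify_cv.py is not reproduced here, but the idea can be sketched with the IPython parallel client (the ipyparallel package in more recent IPython versions). Each fold is evaluated by a function that is shipped to the engines together with its arguments; the stand-in nearest-mean classifier from the serial sketch again replaces the neural network, and the names fold_error and parallel_crossvalidate are illustrative only:

import numpy as np
from ipyparallel import Client    # in older IPython releases: from IPython.parallel import Client

def fold_error(args):
    # executed on an engine: train on all labeled pixels except the test fold,
    # then return the misclassification rate for that fold
    import numpy as np
    Xs, ls, test = args
    tr = np.ones(len(ls), dtype=bool)
    tr[test] = False
    classes = np.unique(ls[tr])
    means = np.array([Xs[tr][ls[tr] == c].mean(axis=0) for c in classes])
    d = ((Xs[test][:, None, :] - means[None, :, :])**2).sum(axis=2)
    return float(np.mean(classes[np.argmin(d, axis=1)] != ls[test]))

def parallel_crossvalidate(Xs, ls, n=10):
    rc = Client()                      # connect to whatever engines are running
    view = rc.load_balanced_view()     # distribute the folds over the engines
    idx = np.random.permutation(len(ls))
    folds = np.array_split(idx, n)
    rates = view.map_sync(fold_error, [(Xs, ls, f) for f in folds])
    return np.mean(rates), np.std(rates)

With four engines running, the ten folds are processed roughly four at a time; with a single engine the same call simply works through them sequentially, which is why the timings below scale with the number of engines started.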
After starting four IPython engines on the Notebook Homepage, we run the script with option -a 3 for the neural network:
run classify_cv -p [1,2,3,4] -a 3 imagery/may0107_pca.tif imagery/train.shp
The initial training phase took about 68 seconds, and the cross-validation required 226 seconds. If we now reduce the number of engines to 1, the computation time increases accordingly:
run classify_cv -p [1,2,3,4] -a 3 imagery/may0107_pca.tif imagery/train.shp