Multispectral Image Analysis in a Docker Container: Examples (continued)

Example 5: Supervised Land Cover Classification with Cross-Validation

Representative ground reference data acquired at, or sufficiently near, the time of image acquisition are generally difficult or expensive to come by. In this regard, the simple 2:1 train:test split used in the previous example is rather wasteful of the available labeled pixels. Moreover, the variability due to the training data is not properly taken into account, since the data are sampled only once from their underlying distributions. In the case of neural networks, we also ignore the variability of the training procedure itself with respect to the random initialization of the synaptic weights: different initializations may lead to different local minima of the cost function and, correspondingly, to different misclassification rates.

An alternative approach, one which at least makes more efficient use of the training data, is \(n\)-fold cross-validation: a small fraction (one \(n\)th) of the labeled pixels is held back for testing, and the remaining data are used to train the classifier. This is repeated \(n\) times over \(n\) complementary test subsets, and the results, e.g., the misclassification rates, are averaged. In this way a larger fraction of the labeled data, namely \((n-1)/n\), is used for training. Moreover, all of the data are used for both training and testing, and each observation is used for testing exactly once. For neural network classifiers, the effect of the synaptic weight initialization is also reflected in the variance of the test results.
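The scheme is easy to sketch in plain Python. The snippet below is not the code in classify_cv.py; it is a minimal illustration in which kfold_indices and cross_validate are hypothetical names, and train_and_test stands in for whatever classifier training/testing step one plugs in:

```python
import random

def kfold_indices(n_samples, n_folds, seed=0):
    """Shuffle the sample indices once, then partition them into
    n_folds disjoint test subsets (the complement of each is the
    corresponding training subset)."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]

def cross_validate(X, y, train_and_test, n_folds=10):
    """Average the misclassification rate over n_folds train/test splits.
    train_and_test(Xtr, ytr, Xte, yte) must return the misclassification
    rate on the held-out fold; returns (mean rate, standard deviation)."""
    folds = kfold_indices(len(X), n_folds)
    rates = []
    for test_idx in folds:
        test_set = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in test_set]
        rates.append(train_and_test(
            [X[i] for i in train_idx], [y[i] for i in train_idx],
            [X[i] for i in test_idx], [y[i] for i in test_idx]))
    mean = sum(rates) / n_folds
    var = sum((r - mean) ** 2 for r in rates) / (n_folds - 1)
    return mean, var ** 0.5
```

Each labeled pixel lands in exactly one test fold, so every observation is tested once and trained on \(n-1\) times, as described above.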

The drawback here, of course, is that the train/test procedure must be repeated \(n\) times rather than carried through only once. This is especially a problem for classifiers, like neural networks, with computationally expensive training algorithms. The cross-validation steps can, however, be performed in parallel, given appropriate computing resources. Fortunately these are becoming increasingly available, in the form of multi-core processors, GPU hardware and, of course, the cloud.
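Since the folds are mutually independent, parallelizing them is just a map over fold indices. As a stand-in for the IPython engines used below, here is a sketch with Python's standard multiprocessing module; evaluate_fold and its toy return value are placeholders for the actual train/test step:

```python
from multiprocessing import Pool

def evaluate_fold(fold_id, n_folds=10):
    """Placeholder for one cross-validation step: in the real script each
    call would train the classifier on (n_folds-1)/n_folds of the labeled
    pixels and return the misclassification rate on the held-out fold."""
    # toy 'misclassification rate' so the sketch runs end to end
    return 0.05 + 0.001 * (fold_id % 3)

if __name__ == '__main__':
    with Pool() as pool:  # one worker process per core by default
        rates = pool.map(evaluate_fold, range(10))
    print('misclassification rate: %f' % (sum(rates) / len(rates)))
```

Because the folds share no state, the wall-clock time ideally scales as the number of folds divided by the number of workers, minus scheduling and communication overhead.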

Here we make use of IPython's parallel computing capabilities to run the cross-validation procedure in parallel on the locally available processor cores. On my machine (Intel Core i5 CPU 760 @ 2.8 GHz, 4 cores) there are four of them. The script classify_cv.py will use them all to run a 10-fold cross-validation on the training data, with a neural network classifier trained by the scaled conjugate gradient algorithm described in Appendix B of my book.

After starting four IPython engines on the Notebook Homepage, we run the script with the option -a 3 to select the neural network classifier:

In [1]:
run classify_cv -p [1,2,3,4] -a 3 imagery/may0107_pca.tif imagery/train.shp
=========================
supervised classification
=========================
Tue Jan 13 15:37:14 2015
image:     imagery/may0107_pca.tif
training:  imagery/train.shp
algorithm: NNet(Congrad)
reading training data...
7173 training pixel vectors were read in
training on 7173 pixel vectors...
elapsed time 68.2563719749

classifying...
elapsed time 1.83002114296
thematic map written to: imagery/may0107_pca_class.tif
submitting cross-validation to 4 IPython engines
parallel execution time: 225.784959078
misclassification rate: 0.050885
standard deviation:     0.003856

The initial training phase took about 68 seconds, and the 10-fold cross-validation required about 226 seconds. If we now reduce the number of running engines to one (the command itself is unchanged, since the engines are managed from the Notebook Homepage), the cross-validation time increases accordingly:

In [2]:
run classify_cv -p [1,2,3,4] -a 3 imagery/may0107_pca.tif imagery/train.shp
=========================
supervised classification
=========================
Tue Jan 13 16:06:35 2015
image:     imagery/may0107_pca.tif
training:  imagery/train.shp
algorithm: NNet(Congrad)
reading training data...
7173 training pixel vectors were read in
training on 7173 pixel vectors...
elapsed time 60.4456489086

classifying...
elapsed time 1.79800701141
thematic map written to: imagery/may0107_pca_class.tif
submitting cross-validation to 1 IPython engines
parallel execution time: 608.415960073
misclassification rate: 0.051305
standard deviation:     0.007461
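From the two runs we can read off the parallel speedup directly; the timings below are copied from the outputs above:

```python
# Parallel speedup and efficiency from the two timed runs
t_serial   = 608.4   # cross-validation wall time with 1 engine (seconds)
t_parallel = 225.8   # cross-validation wall time with 4 engines (seconds)
engines = 4

speedup = t_serial / t_parallel
efficiency = speedup / engines
print('speedup:    %.2f' % speedup)     # about 2.69
print('efficiency: %.2f' % efficiency)  # about 0.67
```

The speedup is noticeably less than fourfold, which is to be expected: submitting the folds to the engines and collecting the results adds overhead that a single sequential run does not incur.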