Loren Hansen, Ernestine A. Lee, Kevin Hestir, Lewis T. Williams and David Farrelly Pages 514 - 519 ( 6 )
Feature selection is an important challenge in many classification problems, especially if the number of features greatly exceeds the number of examples available. We have developed a procedure - GenForest - which controls feature selection in random forests of decision trees by using a genetic algorithm. This approach was tested through our entry into the Comparative Evaluation of Prediction Algorithms 2006 (CoEPrA) competition (accessible online at: http://www.coepra.org). CoEPrA was a modeling competition organized to provide an objective testing for various classification and regression algorithms via the process of blind prediction. In the competition GenForest ranked 10/23, 5/16 and 9/16 on CoEPrA classification problems 1, 3 and 4, respectively, which involved the classification of type I MHC nonapeptides i.e. peptides containing nine amino acids. These problems each involved the classification of different sets of nonapeptides. Associated with each amino acid was a set of 643 features for a total of 5787 features per peptide. The method, its application to the CoEPrA datasets, and its performance in the competition are described.
Decision trees, random forests, feature selection, genetic algorithms, evolutionary computation
National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA.