Barry K. Lavine, Charles E. Davidson and William S. Rayens Pages 115 - 131 ( 17 )
Motivation: Microarrays have allowed the expression level of thousands of genes or proteins to be measured simultaneously. Data sets generated by these arrays consist of a small number of observations (e.g., 20-100 samples) on a very large number of variables (e.g., 10,000 genes or proteins). The observations in these data sets often have other attributes associated with them such as a class label denoting the pathology of the subject. Finding the genes or proteins that are correlated to these attributes is often a difficult task since most of the variables do not contain information about the pathology and as such can mask the identity of the relevant features. We describe a genetic algorithm (GA) that employs both supervised and unsupervised learning to mine gene expression and proteomic data. The pattern recognition GA selects features that increase clustering, while simultaneously searching for features that optimize the separation of the classes in a plot of the two or three largest principal components of the data. Because the largest principal components capture the bulk of the variance in the data, the features chosen by the GA contain information primarily about differences between classes in the data set. The principal component analysis routine embedded in the fitness function of the GA acts as an information filter, significantly reducing the size of the search space since it restricts the search to feature sets whose principal component plots show clustering on the basis of class. The algorithm integrates aspects of artificial intelligence and evolutionary computations to yield a smart one pass procedure for feature selection, clustering, classification, and prediction.
genetic algorithms, pattern recognition
Department of Chemistry, Clarkson University, Potsdam, NY 13699-5810, USA.