PharmaDD Top News: Business, Technology, Strategic Briefings - Tracking leading techniques and approaches in therapeutic drug discovery and development

 

Sponsored Links:
Prescription Drug Addiction

 

 

Pharmaceutical Discovery, Oct 1, 2005 
Immobilization of Oligonucleotides on a Silicon Surface
By Patrizia Di Pietro , Enrico Alessi , Floriana San Biagio , Luigi La Magna , Gaetano Panvini , Gianfilippo Scicolone , Salvatore Oliveri , Salvo Coffa

Gene Expression Analysis Using Methods of Computational Intelligence
The author investigates the use of evolutionary computation and artificial neural networks for pattern recognition of microarray data. The combination of these methods holds great promise for automated feature selection and data analysis.
Gary B. Fogel
Pharmaceutical Discovery

 

Gary B. Fogel
The ever-increasing application of microarrays for gene expression analysis has led to similar expansion in the number of algorithms and methods used for data interpretation. Microarray techniques make it possible to monitor the expression of thousands of genes simultaneously from a given cell type. Differential expression between these cell types (e.g., benign and cancerous) can lead to the development of markers useful for classification, where each experiment monitors thousands of potential markers that differ over time. A central problem in the microarray community is the reduction of these potential markers into reduced sets that retain high predictive accuracy for cell-type classification. Accurate diagnosis of cancer type may be essential because the treatments, responses, and prognoses vary significantly depending on the diagnosis. In addition, methods that work well for cell-type classification should also work well to predict prognosis and diagnosis if given the appropriate input data.

The problem of cancer classification has been divided into two related but separate challenges: class prediction and class discovery (1). Class prediction refers the assignment of samples to one of several previously defined classes. Class discovery refers to defining a previously unrecognized tumor subtype(s) in expression data. Both of these tasks are challenging and require computational assistance. Class prediction via cluster analysis is typically used to infer the function of novel genes by grouping them with genes of well-known functionality in gene expression profiling. Genes that show similar activity patterns (said to be coexpressed) are often related functionally and are controlled by the same mechanisms of regulation (said to be coregulated). A major obstacle to the eventual utility of microarrays is the lack of efficient methods for cataloging the data into coexpressed groups. This article highlights recent advancements in the coupling of evolutionary computation and neural networks for microarray class prediction and discovery.

Neural Networks

Computational intelligence is commonly defined as the accumulation of artificial neural networks, evolutionary computation, and fuzzy logic (2) and their combination. Neural networks have been noted elsewhere in the literature as particularly useful for microarray data clustering and classification. For instance, Khan et al. (3) demonstrated the robustness of this technique, providing evidence to suggest that neural networks could accurately classify data on the expression profiles of 96 transcripts characterizing four distinct subtypes of small round blue cell tumors. Misclassification of cell type was minimized to zero when a set of 96 highest-ranked genes was used. Of the 96 genes, 41 were not identified previously as having any association with these 4 cancer types.

While it is clear that neural-network methods are well suited to microarray analysis, their proper training and optimization is a prerequisite for superior performance. A standard approach to neural network training is the use of backpropagation to optimize the weight assignments for a fixed neural network topology. This approach generally forces the user to choose the appropriate number of features to use and a fixed neural network topology. Backpropagation itself can also lead to suboptimal weight assignment if there are many local optima in the search space. Optimizing neural networks with stochastic optimization methods such as evolutionary computation, however, can outperform these classic methods by avoiding local optima and simultaneously identifying the most appropriate features to use for prediction.

Evolutionary Computation

Natural evolution is a population-based optimization process. The three main avenues of research in simulated evolution (evolutionary programming (EP), evolution strategies (ES), and genetic algorithms (GA)), are broadly similar in that each maintains a population of trial solutions, imposes random changes on those solutions, and incorporates selection to determine which solutions to maintain into future generations and which to remove from the pool of trials (4).

 

Figure 1. Evolution of neural networks for microarray analysis. Statistical feature assessment helps reduce the feature space to the set of features most relevant to the predictive outcome. Microarray data is then divided into training, testing, and validation sets for the purpose of evolving neural networks. Evolution of neural networks is performed over the training examples, with assay on the testing examples at repeated intervals during the evolution so as to monitor possible overtraining. The best evolved neural network from the iterated loop of selection and variation is used for processing on validation set data as an example of a real-world setting.
In the context of microarray analysis, the approach generally proceeds as follows (Figure 1). Standard statistical analysis methods are used to prune features from the microarray data that have least correlation to the desired class prediction. The typical user wishes to identify a model that has highest predictive accuracy in class prediction but also uses a small subset of the available features (gene expressions). A training set of examples is used where the true cell type is known. Neural networks are then evolved using this training data to minimize the mean squared error (MSE) relative to known examples relating input patterns of gene expression to an output decision of class prediction (or other decision such as patient diagnosis, small molecule toxicology, etc.). A random choice of model parameters over a feasible range is used to generate an initial population of neural network solutions. Each of these solutions can have a different topology, weight assignment for the connections within the neural network, features used as input, etc (Figure 2). Each neural network is then scored with respect to MSE (or other suitable scoring function). Neural networks of sufficient worth in the population are retained as "parents" and those of insufficient worth are culled from the population. "Offspring" neural networks are then derived from the parent neural networks using variation (mutation or recombination), reconstituting the entire population of neural networks. All solutions in this generation are scored once again with respect to the fitness function and the process is iterated until either a sufficient number of generations of evolution have been completed or until overtraining is observed on a held-out set of testing examples. Similar approaches have been applied to problems in breast cancer detection (both from histologic and radiographic data), pharmaceutical design, structure prediction, quantitative structure-activity analysis, gene recognition, and a variety of industrial settings. Fogel and Corne (5) provide additional information for the application of these algorithms to problems in bioinformatics and biochemistry.

Evolved Neural Networks for Class Prediction

 

Figure 2. Evolution of neural networks can simultaneously adjust weights on the connections, number and type of connections, number and type of nodes, processing elements, etc. to produce a superior neural network topology. Optimized feature sets can be generated with a reduction of the genes used as input to the network.
Optimization of neural networks with simulated evolution also allows simultaneous optimization of the neural network topology. This significantly increases the versatility of the approach, especially when the researcher would like to reduce the feature space to only the most important features for accurate prediction. After sufficient computation, the best-evolved neural network can be examined to determine which features were used for classification.

Keedwell and Narayanan (6) used evolutionary algorithms to identify gene expression combinations useful for classification and evaluated their predictive accuracy through an artificial neural network. Others have used a combination of evolutionary algorithms and neural networks simultaneously to optimize the structure and weights of neural networks for classification based on gene expression. Ooi and Tan (7) developed a similar hybridized approach for this problem, using an evolutionary algorithm to determine the optimal number and set of gene expressions that maximized classification using a maximum likelihood method. To develop such a classifier, they separated the data into training and testing sets and used a discriminant function (in this case maximum likelihood) to assign samples to the class with the highest conditional probability. The evolutionary algorithm then was used to search for subsets of genes that satisfied this discriminant function in the training set. The worth of the overall approach was examined by prediction on a test set corresponding to nine tumor types. The approach generated higher classification accuracies than other published predictive methods on the same multi-class test dataset.

Automated Class Discovery

Hierarchical clustering is the most commonly used approach to cluster microarray data in the hopes of identifying previously known or unknown classes of similar gene expression. Hierarchical clustering utilizes a binary tree, with similar expression patterns clustered in nested subsets. Several authors have noted, however, that the method of hierarchical clustering can depend on the order of the input data: some clusters appear to be the result of local rather than global clustering decisions.

It is desirable to have an algorithm that can determine the correct number of classes in the data on its own without a priori knowledge. One way to do this is through a self-organizing tree algorithm that can produce a binary tree output independent on the data order. K-means clustering is a completely unstructured approach that proceeds in an entirely local fashion and produces an unorganized collection of clusters but may not be conducive to easy user interpretation. Bayesian clustering is a highly structured approach that is appropriate in where a strong prior distribution of the data is available, but this is uncommon in microarray data, especially where class discovery is concerned.

Evolutionary computation (without neural networks) can be used to group similar gene expression patterns together in an attempt to determine the correct number of classes in the data. For this purpose, microarray data can be represented as a fully connected weighted graph with a similarity measure. The vertices of the graph correspond to the objects to be clustered (gene expression profiles) and the edges of the graph correspond to similarity between these objects. A suitable fitness function that maximizes inner-cluster similarity and simultaneously maximizes intra-cluster difference can be used to evolve and optimize the placement of incisional hyperplanes that separate the objects in the weighted graph. The resulting output represents a best clustering of the data. When tested relative to standard methods such as self-organizing maps for this purpose, this approach is quite competitive.

Conclusion

As the number of microarray experiments continues to increase drastically and as these techniques are becoming more and more a part of personalized healthcare, computational methods to support this expansion must also occur. Two central problems in gene expression analysis are class prediction (where the number of classes is already known a priori) and class discovery (where the number of classes is not known a priori). Given that the researcher likely has hundreds or thousands of genes that can be related with statistical confidence in a linear fashion to class prediction, a central problem is automatic reduction of that feature space while simultaneously maximizing predictive accuracy. The resulting model may include both linear and nonlinear interactions of gene expression with time for successful prediction. Neural networks are favored in this regard because of their ability to model both linear and nonlinear interactions and for their proven ability for pattern recognition in a variety of contexts. Proper optimization of the neural network can be a difficult task, however. And in the case of microarray analysis, proper optimization includes not only weight assignment, but also feature selection and reduction. Evolutionary algorithms have recently been used to optimize neural networks where the features are reduced simultaneously with weight assignment optimization on training examples. The best-evolved neural networks can then be assayed relative to held-out examples in testing or validation sets, and then, if sufficient, put to market with kits associated with class prediction. In addition, in cases where the precise number of classes may not be known, evolutionary computation can be used to provide an approximation. The best resulting clusters from this approach can be used as the output classes to be predicted using a neural network approach.

It is clear that the healthcare industry requires methods to rapidly transition microarray data into practical use. A key roadblock remains the discovery of reduced feature sets that still retain high accuracy in classification. Combinations of computational intelligence approaches as indicated above hold promise for rapid, automated, feature selection and pattern recognition for a wide assortment of data. Combinations of evolutionary computation, neural networks, and fuzzy logic will play an increasingly significant role in the areas of gene expression analysis and gene network reconstruction in the near future.

Gary B. Fogel is Vice President of Natural Selection, Inc. and can be reached at 3333 N. Torrey Pines Ct., Suite 200, La Jolla, CA 92037 USA,

References

1. T.R. Golub et al. Science 286 , 531-537 (1999).

2. See the IEEE Computational Intelligence Society website at www.ieee.org

3. J. Khan et al. Nature Med. 7, 673-679 (2001).

4. D.B. Fogel (2000), Evolutionary Computation: Toward a New Philosophy of Machine Intelligence, IEEE Press, Piscataway, NJ.

5. G.B. Fogel and D.W. Corne (eds.), (2003) Evolutionary Computation in Bioinformatics. Morgan Kauffman, San Francisco.

6. E. Keedwell and A. Narayanan, Applications of Evolutionary Computing: EvoWorkshops 2003, pp. 76-86 (2003).

7. C.H. Ooi and P. Tan, Bioinformatics 19 , 37-44 (2003).