
 


 

 

Pharmaceutical Discovery, Aug 1, 2005 

In Silico Methods and Predictive Tools Along the Drug Discovery Value Chain
Pharmaceutical companies are beginning to make extensive use of informatics solutions aimed at handling the challenges of today's data-rich environment. The original goal of informatics was merely to improve decision making in the early stages of the drug discovery value chain; yet efficiency improvements due to effective use of informatics tools are now evident throughout the discovery pipeline. To illustrate current trends in drug discovery, an integrated in silico platform is presented.
Dragan A. Cirovic, Vijay Bhargava, Frank Harrison, Roy Vaz, Abdel Laoui
Pharmaceutical Discovery

 

Figure 1. The impact of the Predictive Foundation System along the value chain. (HEP = hit exploration program; COP = chemical optimization program; EDC = early development candidate).
The emerging drug discovery paradigm emphasizes the concept of in silico methods for predicting in vitro and in vivo absorption, distribution, metabolism, excretion and toxicological (ADMET) parameters. These methods are being applied as early in the drug discovery value chain as possible to achieve early attrition of unpromising candidate compounds (Figure 1). Performing lead optimization early and parallelizing lead optimization and lead generation are the key goals and, as such, research organizations are eager to boost innovation and reduce costs by using new approaches to drug discovery and development. While drug potency and specificity continue to be considered the primary means of compound selection, the emphasis increasingly is being placed upon assessing druggability of potential drug leads as early as possible.

Recent technological advances (1) are making it possible to enhance the value of information derived from in vitro assays for lead compound series characterization. For example, high-throughput in vitro cell culture assays are being used to predict ADMET properties in humans. As many otherwise promising compounds fail to pass the metabolic stability and toxicity tests, in vitro metabolic screens have a key role to play in early compound attrition (2). In addition, assays to reveal drugs that act by many pathways, as well as drug–drug interactions, are gaining in popularity. Finally, toxicity prediction (3) is an especially complex problem, but one that is receiving intense attention.

Ensuring data currency, validity and distinctness from other possibly conflicting data is one of the challenges facing researchers in the evaluation of large amounts of data. Any modeling system selected must be able to accommodate multiple heterogeneous sources of data and should have the ability to focus on one or more of the various tools available to analyze the data. An integrated chemical and biological informatics framework is needed to efficiently utilize these tools (4–6). Such a framework usually contains databases of chemical structures and the software to build and manage them. It also contains a collection of software tools that can be used to select and compute properties from these databases, as well as modeling applications that can visualize, model and predict activity from these structures.

Predictive Foundation System

The Predictive Foundation System (PFS) is an integral component of our corporate information technology infrastructure. It addresses the need for enterprise-wide access to computed and predicted data. These data are in silico complements to experimental chemical compound profiles. The solution substantially enhances the capabilities of enterprise tools that are well positioned to access experimental data. Together they enhance the quality and speed of decision making and are consistent with the current industry trend to utilize in silico methodologies, which have proven to be less expensive and more scalable for improving efficiency in the drug discovery process.

 

Figure 2. The corporate information system framework. Drug discovery involves testing a large number of molecules against biological targets. Managing the resulting data involves registering compounds, tracking experimental measurements and managing in silico chemical profiles. The chemical structures are registered into chemical repositories, where they are associated with compound availability, quality control information and a large array of experimental properties that are direct outcomes of the drug discovery process. The computed properties are maintained as guidance and/or supplement to experimental properties, since the experimental data often are sparse. The data integration layer unifies disparate data sets.
The PFS comprises a set of predictive tools, databases and application servers that collectively track and manage computed results and the associated metadata required for their meaningful interpretation. From a data integration perspective, there are three major data types (Figure 2):

Chemical structures. The chemical structures are registered in a form amenable to substructure and similarity searching. The structures are associated with information on the origin of their synthetic batches, their purity, and their current availability.

Experimental properties. The experimental properties range from physicochemical structure profile (solubility, logP, logD, molar refractivity, fluorescence) to activity against a wide range of biological targets (single point measurements, IC50, EC50, Ki) to pharmacokinetic, pharmacodynamic, safety and toxicity profiles.

Computed and predicted properties. The computed properties are derived directly from chemical structures (for example, the number of hydrogen bond donors and hydrogen bond acceptors). The predicted properties are estimates of experimental properties. They are produced by statistical models that define the relationship between chemical structures and experimental measurements. For example, there are models for prediction of solubility, logD, logP, pKa, etc.

The data visualization and analysis component of the system combines these three data types and delivers the combined dataset to the project teams through a uniform user interface. An example of such an interface would be a page that displays chemical structure, and lists its availability, purity, a range of experimental measurements, along with its in silico profile.
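
The combined view described above can be pictured as a single record per compound. The following is only a minimal sketch under stated assumptions: the class name, field names, identifier and numeric values are all hypothetical and are not taken from the PFS. It illustrates how a predicted value can stand in when no measurement exists, which is the role the article assigns to in silico data.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class CompoundProfile:
    """One row of the combined view: structure, logistics, measured and in silico data."""
    compound_id: str
    smiles: str                       # structure in a searchable line notation
    purity_pct: Optional[float] = None
    available: bool = False
    experimental: Dict[str, float] = field(default_factory=dict)
    in_silico: Dict[str, float] = field(default_factory=dict)

    def value(self, prop: str) -> Optional[Tuple[float, str]]:
        """Prefer the measurement; fall back to the prediction when data are sparse."""
        if prop in self.experimental:
            return (self.experimental[prop], "experimental")
        if prop in self.in_silico:
            return (self.in_silico[prop], "predicted")
        return None


# Hypothetical record; every value below is made up for illustration.
profile = CompoundProfile(
    compound_id="CMPD-0001",
    smiles="CC(=O)Oc1ccccc1C(=O)O",
    purity_pct=98.5,
    available=True,
    experimental={"logP": 1.2},
    in_silico={"logP": 1.3, "logS": -1.7},
)
print(profile.value("logP"))  # measured value wins
print(profile.value("logS"))  # only a prediction exists
```

Keeping the source label ("experimental" or "predicted") next to each value lets a user interface show clearly which numbers are estimates.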

Another equally important aspect of the system is the workflow that links the generation and management of chemical and biological data with the generation and management of in silico data. From a systems integration perspective, the former set of software modules operates independently of the latter ones. The common denominator is a set of predefined interfaces to the corresponding data repositories.

 

Figure 3. The architecture of the Predictive Foundation System. The PFS consists of three layers: in silico tools, a compute server and a data warehouse. The in silico tools are commercial or in-house software modules designed for calculation of molecular properties. Commercial tools are interfaced to the rest of the system via generic Web service extensions, which expose the SOAP interface to the PFS compute server. When the Web service receives a list of chemical structures and processing instructions, it buffers the input into a file on a local machine. Then it starts the tool, which in turn reads structures from the input file, performs calculations and deposits results into an output file. When the computation is finished the Web service returns results to the compute server. The compute server coordinates calculation of in silico properties, retrieves structures from the compound repository, calculates their in silico properties using a predefined set of tools and deposits results into the data warehouse. The compute server performs a range of preprocessing operations on the structures, as well as numerous post-processing operations on the results. The in silico molecular properties are buffered in a local database. From that database, a separate compute server module exports the calculated result in XML format to the warehouse subsystem.
From a software architectural perspective, the core of the PFS consists of three layers: 1) in silico tools, 2) a compute server and 3) a data warehouse (Figure 3). The in silico tools are software modules that calculate molecular properties. The compute server coordinates and drives the calculation process. It retrieves structural information from the compound repositories, submits that information to a series of in silico tools, collects the computed molecular properties and deposits them into the data warehouse. The process of structure submission includes customized preprocessing protocols. Depending upon the property at hand, preprocessing could include a selection of operations (e.g., defragmentation, desalting, charge removal, adding hydrogens and filtering). Furthermore, property predictors have certain restrictions in terms of the molecular structures that can be processed; therefore, unsupported structures need to be filtered out. These applicability restrictions range from scientifically driven ones (e.g., a given model is applicable only to neutral molecules) to more technical, software implementation-driven constraints (e.g., molecules cannot contain certain substructures, must contain fewer than seven condensed rings, or must not contain spiro constructs).
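
The preprocessing and applicability filtering described above can be sketched as a chain of small functions. This is an illustration only, not the PFS implementation: the function names are hypothetical, and the string-level SMILES handling is a deliberately crude stand-in for what a cheminformatics toolkit would do on real molecular graphs.

```python
from typing import Callable, List, Optional, Tuple

Preprocessor = Callable[[str], str]
ApplicabilityCheck = Callable[[str], Optional[str]]  # rejection reason or None


def keep_largest_fragment(smiles: str) -> str:
    """Crude desalting: keep the longest dot-separated SMILES component."""
    return max(smiles.split("."), key=len)


def reject_charged(smiles: str) -> Optional[str]:
    """Reject structures with a formal charge (a + or - inside a bracket atom)."""
    in_bracket = False
    for ch in smiles:
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif in_bracket and ch in "+-":
            return "model is applicable only to neutral molecules"
    return None


def prepare(
    structures: List[str],
    preprocessors: List[Preprocessor],
    checks: List[ApplicabilityCheck],
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Return (accepted structures, [(original structure, rejection reason)])."""
    accepted, rejected = [], []
    for original in structures:
        current = original
        for step in preprocessors:
            current = step(current)
        reason = None
        for check in checks:
            reason = check(current)
            if reason is not None:
                break
        if reason is None:
            accepted.append(current)
        else:
            rejected.append((original, reason))
    return accepted, rejected


accepted, rejected = prepare(
    ["CCO", "CC(=O)[O-].[Na+]", "C[N+](C)(C)C"],
    preprocessors=[keep_largest_fragment],
    checks=[reject_charged],
)
print(accepted)
print(rejected)
```

In this sketch the desalted acetate ion is still charged and is therefore rejected; a fuller protocol would add a charge removal step before the neutrality check. Recording the rejection reason alongside the original structure supports the error reporting that the system performs for filtered compounds.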

Following the evaluation of predicted properties and prior to the deposition of the results into a transactional database, the calculation outputs are run through post-processing protocols. These protocols ensure data consistency and perform data conversions, if necessary. The results are reported along with their corresponding confidence intervals and error levels. A record is kept of the tool, method and model that are used for producing predictions. Where applicable, a series of additional attachments and annotations are associated with the results. For example, DEREK (deductive estimation of risk from existing knowledge) toxicity predictions are expressed in terms of alerts, which are structural fragments that are known to be toxic in a given chemical environment. Each alert is associated with a document that outlines the alert definition and lists the literature references from which the reasoning behind the alert was derived. An appropriate error message is recorded in cases where property predictions were not made, either because structures were noncompliant and filtered out up front or because a calculation failure was reported.
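
One way to picture the outcome of post-processing is a result record that carries the value, its confidence interval, the provenance (tool, method, model) and, on failure, an error message. The sketch below is hypothetical throughout: the class names, field names, tool name and the symmetric interval are assumptions made for illustration, not details of the PFS.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Provenance:
    tool: str            # software that produced the number
    method: str          # e.g. the statistical method behind the model
    model_version: str


@dataclass
class PredictionRecord:
    compound_id: str
    prop: str
    provenance: Provenance
    value: Optional[float] = None
    ci: Optional[Tuple[float, float]] = None   # confidence interval, if any
    error: Optional[str] = None                # set when no prediction was made
    annotations: List[str] = field(default_factory=list)


def post_process(
    compound_id: str,
    prop: str,
    provenance: Provenance,
    raw: Optional[float],
    half_width: float,
    failure: Optional[str] = None,
) -> PredictionRecord:
    """Normalize a raw tool output into a consistent record."""
    if failure is not None or raw is None:
        return PredictionRecord(
            compound_id, prop, provenance,
            error=failure or "calculation failure reported",
        )
    value = round(raw, 2)   # a simple data-consistency rule (assumed)
    ci = (round(value - half_width, 2), round(value + half_width, 2))
    return PredictionRecord(compound_id, prop, provenance, value=value, ci=ci)


prov = Provenance(tool="example-tool", method="PLS", model_version="v1")
ok = post_process("CMPD-0001", "logS", prov, raw=-3.456, half_width=0.5)
failed = post_process("CMPD-0002", "logS", prov, raw=None, half_width=0.5,
                      failure="structure filtered: unsupported substructure")
print(ok)
print(failed)
```

Storing the error in the same record type as a successful result means downstream consumers can distinguish "not predicted" from "not yet calculated".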

The compute server contains a component that keeps track of the status of the original compound depositories. Whenever a new compound appears in the original structure source, it is retrieved and validated for structural integrity. Only those compounds that satisfy a predefined set of structural integrity rules are added to the PFS compound domain. Those compounds are scheduled for the property calculations, while the structures that fail the structural integrity test are marked for review. Because this component of the PFS also monitors changes in the structures of existing compounds in the primary depositories, corrected structures are reprocessed by the PFS as soon as the corrections are made.
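
The monitoring logic above amounts to comparing the current state of the source repository against what the system has already processed. A minimal sketch follows, assuming structures are available as strings keyed by compound identifier; the hashing approach, the function names and the placeholder integrity rule are assumptions, not the actual PFS rules.

```python
import hashlib
from typing import Dict, List, Tuple


def fingerprint(structure: str) -> str:
    """Stable digest used to detect that a stored structure has changed."""
    return hashlib.sha256(structure.encode("utf-8")).hexdigest()


def passes_integrity_rules(structure: str) -> bool:
    """Placeholder rule set: non-empty, with balanced () and [] pairs."""
    if not structure:
        return False
    closers = {")": "(", "]": "["}
    stack = []
    for ch in structure:
        if ch in "([":
            stack.append(ch)
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                return False
    return not stack


def sync(source: Dict[str, str], known: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Compare the source repository with known fingerprints.

    Returns (ids scheduled for calculation, ids marked for review) and
    updates `known` in place for every compound that was accepted.
    """
    to_calculate, to_review = [], []
    for compound_id, structure in source.items():
        fp = fingerprint(structure)
        if known.get(compound_id) == fp:
            continue                      # unchanged since the last pass
        if passes_integrity_rules(structure):
            known[compound_id] = fp       # new or corrected structure
            to_calculate.append(compound_id)
        else:
            to_review.append(compound_id)
    return to_calculate, to_review


known = {"A": fingerprint("CCO"), "B": fingerprint("c1ccccc1")}
source = {"A": "CCO", "B": "c1ccccc1O", "C": "CC(=O)N", "D": "CC(C"}
first_pass = sync(source, known)
second_pass = sync(source, known)
print(first_pass)
print(second_pass)
```

Here "B" is a corrected structure, "C" is new and "D" fails the integrity check; on the second pass only "D" is still reported, because it never entered the compound domain.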

The compute server database (Figure 3) is a relational database that has a dual purpose: it stores system configuration and static information, and it buffers computed results prior to their export to the warehouse. The former part of the database content defines relationships between predicted properties and their corresponding tools, models and methods. The latter part is a form of a transient staging area where results are annotated and validated.

The warehouse is a dedicated data depository that enforces overall data integrity. This database is optimized for fast retrieval of data. The staging area is a database that relieves the main data depository of computationally intensive data insertion, update and deletion operations during peak hours of use. This is particularly significant considering that during the full calculation cycle of a given property (i.e., calculation for the whole compound domain), millions of data insertions take place in a single day. The staging area also takes on the burden of ensuring data uniqueness and completeness. The software modules that perform these operations are referred to collectively as extract, transform and load (ETL). The extract part, which extracts data from a transactional database, is located in the compute server. The loader moves data from the staging area into the main warehouse.
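
The division of labor between staging area and warehouse can be illustrated with an in-memory SQLite database. This is a sketch under assumptions: the table names, column names and the "latest row wins" policy for duplicates are hypothetical, and a production warehouse would use a different database engine and bulk-loading mechanism.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE staging (
    compound_id TEXT, prop TEXT, model_version TEXT, value REAL
);
CREATE TABLE warehouse (
    compound_id   TEXT NOT NULL,
    prop          TEXT NOT NULL,
    model_version TEXT NOT NULL,
    value         REAL NOT NULL,
    PRIMARY KEY (compound_id, prop, model_version)
);
""")


def stage(rows):
    """Cheap, unconstrained inserts; no integrity checks at this point."""
    with conn:
        conn.executemany("INSERT INTO staging VALUES (?, ?, ?, ?)", rows)


def load():
    """Move complete rows into the warehouse; for duplicates the latest row wins."""
    with conn:
        complete = conn.execute("""
            SELECT compound_id, prop, model_version, value
            FROM staging
            WHERE compound_id IS NOT NULL AND prop IS NOT NULL
              AND model_version IS NOT NULL AND value IS NOT NULL
            ORDER BY rowid
        """).fetchall()
        conn.executemany(
            "INSERT OR REPLACE INTO warehouse VALUES (?, ?, ?, ?)", complete
        )
        conn.execute("DELETE FROM staging")


stage([
    ("C1", "logP", "v1", 1.0),
    ("C1", "logP", "v1", 1.5),    # duplicate key: replaces the row above
    ("C2", "logP", "v1", None),   # incomplete: never reaches the warehouse
])
load()
warehouse_rows = conn.execute("SELECT * FROM warehouse").fetchall()
staging_count = conn.execute("SELECT COUNT(*) FROM staging").fetchone()[0]
print(warehouse_rows, staging_count)
```

Uniqueness and completeness are enforced once, at load time, so that the read-optimized warehouse only ever receives clean rows.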

The in silico ADMET tools currently in use originate from a variety of commercial and in-house sources. These tools often overlap in terms of supported molecular properties. The preferred tool is selected prior to inclusion of the predictions into the system. There are two notable aspects of the tool selection process: scientific validation and statistical profiling of predictions with respect to experimental measurements. While scientific validation mainly is concerned with interpretability of a predictive model, the statistical profiling defines model applicability, scope and accuracy.

 

Figure 4. The grid architecture of the subsystem for making toxicological predictions. DEREK toxicity prediction is an in silico property calculation that requires an above-average volume of computing power. To perform these calculations on a large scale, an array of CPUs that work in parallel or a large number of independent personal computers could be used. Each computer would operate as a stand-alone computation node on which DEREK calculations are performed. The system operates in asynchronous mode, with a central node that receives job requests and queues them up for individual processing. Each job, which is a list of chemical structures, is partitioned by the central node, and the partial structure lists are assigned to the worker nodes. The worker nodes perform calculations and place results into their local output boxes. The central node collects results from all worker nodes and repeats the cycle until all structures within a given job are processed.
Another aspect of predictive ADMET model integration into the corporate infrastructure is computational complexity. The demand for computational power greatly varies from model to model. Custom solutions often are needed in order to provide support for models requiring excessive computational resources. An example of such a solution is a subsystem with homegrown grid architecture for performing toxicological predictions (Figure 4).
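
The partition-and-collect cycle of such a grid can be sketched on a single machine, with a thread pool standing in for the independent worker nodes. Everything here is an assumption made for illustration: the function names are hypothetical, and the `fake_alert_count` predictor is a trivial placeholder, not a toxicity calculation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def partition(structures: List, n_parts: int) -> List[List]:
    """Split a job into at most n_parts near-equal chunks, preserving order."""
    n_parts = max(1, min(n_parts, len(structures)))
    size, extra = divmod(len(structures), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(structures[start:end])
        start = end
    return chunks


def run_job(
    structures: List[str],
    predict: Callable[[str], int],
    n_workers: int = 4,
) -> Dict[str, int]:
    """Central node: partition the job, hand chunks to workers, merge results."""

    def worker(chunk: List[str]) -> Dict[str, int]:
        # Each worker fills its own "output box" for the chunk it was given.
        return {s: predict(s) for s in chunk}

    results: Dict[str, int] = {}
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for output_box in pool.map(worker, partition(structures, n_workers)):
            results.update(output_box)
    return results


def fake_alert_count(smiles: str) -> int:
    """Placeholder predictor: counts 'N' characters. Not a toxicity model."""
    return smiles.count("N")


job_results = run_job(["CCN", "CCO", "NCCN"], fake_alert_count, n_workers=2)
print(job_results)
```

In a real grid the workers would be separate machines and the central node would poll their output locations asynchronously; the partitioning and merging steps would look much the same.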

In terms of result certainty, the in silico properties could be divided into deterministic and estimated categories. Examples of deterministic properties include counts of hydrogen bond acceptors and donors, and molecular weight. Estimated properties are derived by means of multivariate statistical methods, with regression and classification methods being the most common. Examples of such properties are solubility (logS), octanol/water partition coefficient (logP), pKa, and topological surface area (TSA). The latter model category more closely matches a commonly encountered narrower perception of predictive models.

The nature of predictive methods is such that the models capture experimentally obtainable chemical or biological information during their training phase. The models correlate experimental information to a set of molecular descriptors that are derived from chemical structures. Therefore, during the prediction phase, the primary source of variability is the chemical domain.

The PFS subsystem for in silico data generation is designed to monitor the changes in compound repositories, to produce predicted data for newly added chemical entities and to update predictions for the altered chemicals. The enrichment of the chemical and biological experimental domain is mirrored in the in silico realm on a discontinuous basis. Therefore, the predictive models periodically need to be retrained with datasets that include up-to-date experimental data, usually on an annual schedule. The model updates are followed by property recalculation cycles that produce sets of predictions using the new model versions.

From the drug discovery perspective, the real value of the predicted ADMET properties comes into play only when they are combined to form compound profiles. In its simplest form, compound profiling consists of listing key descriptors, such as logS, logP, logBB (blood/brain partitioning), Caco2 cell permeability, log IC50 for hERG K+ channel blockage, etc. The next step is derivation of summary statistics from the ADMET properties. Perhaps the best-known example of a druggability descriptor is Lipinski's rule of five, although there are a number of others. All descriptors attempt to give an estimate of bioavailability, which translates into the suitability of a chemical entity as a drug candidate. In silico platforms such as the PFS facilitate the compound profiling effort by characterizing the pre-computed ADMET properties for the entirety of the corporate chemical collection. Thus, it becomes possible to perform rapid searches on large compound domains by specifying desired ADMET profiles in combination with a variety of other criteria.
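
As an illustration of profile-based searching over pre-computed properties, the sketch below counts rule-of-five violations (molecular weight above 500, logP above 5, more than 5 hydrogen bond donors, more than 10 hydrogen bond acceptors) and combines that with user-supplied property ranges. The function names, dictionary keys and compound values are hypothetical, and allowing at most one violation is the common reading of the rule rather than anything specified in this article.

```python
from typing import Dict, List, Tuple

Profile = Dict[str, float]


def rule_of_five_violations(p: Profile) -> int:
    """Count how many of Lipinski's four criteria a compound violates."""
    return sum([
        p["mw"] > 500,     # molecular weight
        p["logP"] > 5,     # octanol/water partition coefficient
        p["hbd"] > 5,      # hydrogen bond donors
        p["hba"] > 10,     # hydrogen bond acceptors
    ])


def search(
    collection: Dict[str, Profile],
    max_violations: int = 1,
    **ranges: Tuple[float, float],
) -> List[str]:
    """Return ids whose profile meets the rule and every (low, high) range."""
    hits = []
    for compound_id, p in collection.items():
        if rule_of_five_violations(p) > max_violations:
            continue
        if all(
            prop in p and low <= p[prop] <= high
            for prop, (low, high) in ranges.items()
        ):
            hits.append(compound_id)
    return hits


# Made-up profiles for three hypothetical compounds.
collection = {
    "A": {"mw": 350, "logP": 2.1, "hbd": 2, "hba": 5, "logS": -3.0},
    "B": {"mw": 620, "logP": 6.3, "hbd": 1, "hba": 8, "logS": -6.5},
    "C": {"mw": 480, "logP": 4.8, "hbd": 3, "hba": 9, "logS": -5.5},
}
druglike = search(collection)
soluble_druglike = search(collection, logS=(-4.0, 0.0))
print(druglike, soluble_druglike)
```

Because every property is pre-computed for the whole collection, a query of this kind is a simple filter rather than a fresh calculation.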

Solubility Predictions

 

Figure 5. A list of the most important descriptors for the PFS solubility model. Some of the numerous models for solubility predictions are available in commercial software solutions. However, their common weaknesses are that the experimental errors in their training and test datasets often are unacceptably high, and that the chemical space covered by those models only partially overlaps with the chemical space that is of primary interest to drug discovery project teams. To compensate, a solubility prediction study was performed using our in-house data, and a partial least squares (PLS)-based model was chosen. That model used 26 in silico predictors, which were regressed against experimental solubility measurements. A descriptor set was determined to identify the most significant properties; the selected descriptors fell into the following categories: lipophilicity, flexibility, hetero atom counts and polar surface descriptors.
The aqueous solubility of drugs is one of the key factors affecting their bioavailability and, thus, the quality of solubility predictions has a direct impact on bioavailability assessments. Numerous computational methods have been developed for aqueous solubility predictions (7, 8). A closer assessment of these models reveals that the significance of the descriptors used in various methods is dependent on the composition of the chemical space and the model used. For the PFS, a partial least squares (PLS) model trained on a custom data set using four classes of descriptors (Figure 5) was found to give the most statistically significant predictions. The four descriptor classes are lipophilicity, heteroatom content, flexibility and polar surface descriptors.
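
To make the PLS idea concrete, the following is a minimal one-component PLS1 regression in plain Python. It is a teaching sketch, not the PFS solubility model: the real model used 26 descriptors and, presumably, more than one latent component, whereas the two descriptors and all numbers below are invented for illustration.

```python
from typing import List, Sequence, Tuple

Model = Tuple[List[float], float, List[float]]  # (x means, y mean, coefficients)


def fit_pls1(X: Sequence[Sequence[float]], y: Sequence[float]) -> Model:
    """Fit a single-component PLS1 regression of y on the columns of X."""
    n, p = len(X), len(X[0])
    x_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_means[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]

    # Weight vector: the descriptor-space direction of maximal covariance with y.
    # (Assumes y is not constant; otherwise the norm below would be zero.)
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]

    # Scores: projection of each compound onto that direction.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]

    # Regress y on the scores, then map back to descriptor coefficients.
    q = sum(t[i] * yc[i] for i in range(n)) / sum(v * v for v in t)
    coef = [q * wj for wj in w]
    return x_means, y_mean, coef


def predict(model: Model, x: Sequence[float]) -> float:
    x_means, y_mean, coef = model
    return y_mean + sum((xi - m) * c for xi, m, c in zip(x, x_means, coef))


# Invented training data: columns are two hypothetical descriptors
# (a lipophilicity term and a polar surface term); y is a made-up logS.
X_train = [[1.0, 60.0], [2.5, 45.0], [3.0, 80.0], [4.2, 30.0], [0.5, 95.0]]
y_train = [-1.8, -3.1, -3.0, -4.6, -1.0]
model = fit_pls1(X_train, y_train)
print(predict(model, [2.0, 50.0]))
```

With a single descriptor this procedure reduces to ordinary least squares, which is a convenient sanity check. In practice descriptors on very different scales would be standardized first, and further components would be extracted from the deflated data.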

The majority of publicly available models are based upon relatively small training data sets containing compounds from various literature sources with little information on underlying experimental noise. The development of more accurate and reliable predictive models necessitates compilation of an experimental database with highly accurate solubility data for a large, diverse collection of drug-like molecules.

Toxicity Predictions

 

Figure 6. The DEREK report for 1-hydroxyoxaunomycin. The DEREK-Alerts field lists toxicity end points (e.g., mutagenicity, skin sensitization), followed by alert ID and alert name (e.g., quinone). The Location field displays the position of structural fragments corresponding to the individual alerts. The Ranking/Category field marks the highest ranking DEREK alert for this compound. This field also contains an explanation of a given rank level. The remaining two fields contain a description of the alerts (Description) and the recommended course of action (Recommendation).
Toxicological predictions fall into a category of models that are the most difficult to formulate (2, 9–12). This modeling challenge is to a great extent met by knowledge-based expert systems. A notable example of such a system is DEREK. DEREK predictions are expressed in terms of alerts, which correspond to predefined substructure fragments. An alert is raised if its corresponding substructure fragment is found and a set of custom rules is satisfied. The alerts are grouped further into toxicological categories, such as mutagenicity and carcinogenicity, and annotated with supplemental information (Figure 6).

Due to the complexity of toxicological phenomena, the accuracy of predictions generated by general-purpose models leaves considerable room for improvement. Prediction quality is enhanced by incorporating proprietary experimental toxicity measurements into the knowledge base, which involves defining new alerts, customizing the rule base and adding extended in-house annotations.

Conclusion

In silico methods continue to make profound contributions to drug discovery. The products of in silico methods — the estimations of ADMET properties — supplement traditional experimental measurements. The availability of in silico properties facilitates more efficient design of screening libraries and serves as a foundation for numerous ADMET compound profiling and druggability assessments along the drug discovery value chain.

Acknowledgements

Data and ideas presented in this article are the result of many productive collaborations and discussions with our colleagues at Sanofi Aventis, including: Alexander Amberg (Frankfurt, Germany), Alexander Sukharevsky (Boston, Massachusetts, USA), Barbara Butler (Bridgewater, New Jersey, USA), Christine Rudolph (Frankfurt, Germany), Claude Luttmann (Paris, France), Dieter Kreusel (Frankfurt, Germany), Elie Giraud (Bridgewater, New Jersey, USA), Friedemann Schmidt (Frankfurt, Germany), Hans-Peter Spirkl (Frankfurt, Germany), Helene Bourdon (Paris, France) and Paul Brown, Paul Whitehead, Stephan Reiling, Sundararajan Vijayakumar, Valery Polyakov and Xin-Hua Song (Bridgewater, New Jersey, USA).

Dragan A. Cirovic is a senior scientist in the informatics department, Vijay Bhargava is a vice president and global head of the drug metabolism and pharmacokinetics department, Frank Harrison is a senior director of the information services department, Roy Vaz is a head of the investigative product optimization group and Abdel Laoui is a head of the chemoinformatics group at Sanofi Aventis. Dragan Cirovic can be reached at Sanofi Aventis, 1041 Route 202-206 N, BRJR1-002A, Bridgewater, New Jersey 08807-0800.

References

1. N. Fay and D. Ullmann, Drug Disc. Today 7, 181–186 (2002).

2. W.J. Egan, K.M. Merz Jr., and J.J. Baldwin, J. Med. Chem. 43, 3867–3877 (2000).

3. D.E. Johnson et al., Curr. Opin. Drug Disc. Dev. 4, 92–101 (2001).

4. W.L. Jorgensen, Science 303, 1813–1818 (2004).

5. J. Bajorath, Nature Rev. Drug Disc. 1, 882–894 (2002).

6. T. Kenakin, Nature Rev. Drug Disc. 2, 429–437 (2003).

7. W.L. Jorgensen and E.M. Duffy, Adv. Drug Deliv. Rev. 54, 355–366 (2002).

8. G.W. Caldwell, Curr. Opin. Drug Disc. Dev. 3, 30–41 (2000).

9. A. Bugrim, T. Nikolskaya and Y. Nikolsky, Drug Disc. Today 9, 127–135 (2004).

10. H. van de Waterbeemd and E. Gifford, Nature Rev. Drug Disc. 2, 192–204 (2003).

11. D. Zmuidinavicius, P. Japertas, A. Petrauskas and R. Didziapetris, Curr. Topics Med. Chem. 3, 1301–1314 (2003).

12. G.M. Pearl, S. Livingston-Carr and S.K. Durham, Curr. Topics. Med. Chem. 1, 247–255 (2001).