Pharmaceutical companies are beginning to
make extensive use of informatics solutions aimed at handling the challenges
of today's data-rich environment. The initial goal of informatics was merely to
improve decision making in the early stages of the drug discovery value
chain, yet efficiency improvements due to effective use of informatics
tools are now evident throughout the discovery pipeline. To illustrate current
trends in drug discovery, an integrated in silico platform is presented.
Aug 1, 2005
By: Dragan A. Cirovic, Vijay Bhargava, Frank Harrison, Roy Vaz, Abdel Laoui
Pharmaceutical Discovery

Figure 1. The impact of the
Predictive Foundation System along the value chain. (HEP = hit
exploration program; COP = chemical optimization program; EDC =
early development candidate).
The emerging drug discovery paradigm emphasizes the concept of in
silico methods for predicting in vitro and in vivo
absorption, distribution, metabolism, excretion and toxicological (ADMET)
parameters. These methods are being applied as early in the drug discovery
value chain as possible to achieve early attrition of unpromising
candidate compounds (Figure 1). Performing lead optimization early and
parallelizing lead optimization and lead generation are the key goals and,
as such, research organizations are eager to boost innovation and reduce
costs by using new approaches to drug discovery and development. While
drug potency and specificity continue to be considered the primary means
of compound selection, the emphasis increasingly is being placed upon
assessing druggability of potential drug leads as early as possible.
Recent technological advances (1) are making it possible to enhance the
value of information derived from in vitro assays for lead compound
series characterization. For example, high-throughput in vitro cell
culture assays are being used to predict ADMET properties in humans. As
many otherwise promising compounds fail to pass the metabolic stability
and toxicity tests, in vitro metabolic screens have a key role to
play in early compound attrition (2). In addition, assays to reveal drugs
that act by many pathways, as well as drug–drug interactions, are
gaining in popularity. Finally, toxicity prediction (3) is an especially
complex problem, but one that is receiving intense attention.
Ensuring data currency, validity and distinctness from other possibly
conflicting data is one of the challenges facing researchers in the
evaluation of large amounts of data. Any modeling system selected must be
able to accommodate multiple heterogeneous sources of data and should have
the ability to focus on one or more of the various tools available to
analyze the data. An integrated chemical and biological informatics
framework is needed to efficiently utilize these tools (4–6). Such a
framework usually contains databases of chemical structures and the
software to build and manage them. It also contains a collection of
software tools that can be used to select and compute properties from
these databases, as well as modeling applications that can visualize,
model and predict activity from these structures.
Predictive Foundation System
The Predictive Foundation System (PFS) is an integral component of our
corporate information technology infrastructure. It addresses the need for
enterprise-wide access to computed and predicted data. These data are in
silico complements to experimental chemical compound profiles. The
solution substantially enhances capabilities of enterprise tools that are
well positioned to access experimental data. Together they enhance the
quality and speed of decision making and are consistent with the current
industry trend to utilize in silico methodologies, which have proven
to be less expensive and more scalable for improving efficiency in the
drug discovery process.

Figure 2. The corporate information
system framework. Drug discovery involves testing a large number
of molecules against biological targets. Managing the resulting
data involves registering compounds, tracking experimental
measurements and managing in silico chemical profiles. The
chemical structures are registered into chemical repositories,
where they are associated with compound availability, quality
control information and a large array of experimental properties
that are direct outcomes of the drug discovery process. The
computed properties are maintained as guidance and/or supplement
to experimental properties, since the experimental data often are
sparse. The data integration layer unifies disparate data sets.
The PFS comprises a set of predictive
tools, databases and application servers that collectively track and
manage computed results and associated metadata required for meaningful
interpretation. From a data integration perspective, there are three major
data types (Figure 2):
Chemical structures. The
chemical structures are registered in a form amenable to substructure and
similarity searching. The structures are associated with information on
the origin of their synthetic batches, their purity, and their current
availability.
Experimental properties. The
experimental properties range from physicochemical structure profile
(solubility, logP, logD, molar refractivity, fluorescence) to activity
against a wide range of biological targets (single point measurements,
IC50, EC50, Ki) to pharmacokinetic, pharmacodynamic, safety and toxicity
profiles.
Computed and predicted properties.
The computed properties are derived from chemical structures — for
example, the number of hydrogen bond donors and hydrogen bond acceptors.
The predicted properties are estimates of experimental properties. They
are produced by statistical models that define the relationship between
chemical structures and experimental measurements. For example, there are
models for prediction of solubility, logD, logP, pKa, etc.
The data visualization and analysis
component of the system combines these three data types and delivers the
combined dataset to the project teams through a uniform user interface. An
example of such an interface would be a page that displays chemical
structure, and lists its availability, purity, a range of experimental
measurements, along with its in silico profile.
Another equally important aspect of
the system is the workflow that links the generation and management of
chemical and biological data with the generation and management of in
silico data. From a systems integration perspective, the former set of
software modules operates independently of the latter ones. The common
denominator is a set of predefined interfaces to the corresponding data
repositories.

Figure 3. The architecture of the
Predictive Foundation System. The PFS consists of three layers: in
silico tools, a compute server and a data warehouse. The in silico
tools are commercial or in-house software modules designed for
calculation of molecular properties. Commercial tools are
interfaced to the rest of the system via generic Web service
extensions, which expose the SOAP interface to the PFS compute
server. When the Web service receives a list of chemical
structures and processing instructions, it buffers the input into
a file on a local machine. Then it starts the tool, which in turn
reads structures from the input file, performs calculations and
deposits results into an output file. When the computation is
finished the Web service returns results to the compute server.
The compute server coordinates calculation of in silico
properties, retrieves structures from the compound repository,
calculates their in silico properties using a predefined set of
tools and deposits results into the data warehouse. The compute
server performs a range of preprocessing operations on the
structures, as well as numerous post-processing operations on the
results. The in silico molecular properties are buffered in a
local database. From that database, a separate compute server
module exports the calculated result in XML format to the
warehouse subsystem.
From a software architectural
perspective, the core of the PFS consists of three layers: 1) in silico
tools, 2) a compute server and 3) a data warehouse (Figure 3). The in
silico tools are software modules that calculate molecular properties.
The compute server coordinates and drives the calculation process. It
retrieves structural information from the compound repositories, submits
that information to a series of in silico tools, collects the
computed molecular properties and deposits them into the data warehouse.
The process of structure submission includes customized pre-processing
protocols. Depending upon the property at hand, preprocessing could
include a selection of operations (e.g., defragmentation,
desalting, charge removal, adding hydrogens, filtering, etc.).
Furthermore, property predictors have certain restrictions in terms of
the molecular structures that can be processed; therefore, unsupported
structures need to be filtered out. These applicability restrictions range
from scientifically driven ones (e.g., a given model is applicable
only to neutral molecules) to more technical, implementation-driven
constraints (e.g., molecules cannot contain certain
substructures, or they need to contain fewer than seven condensed rings, or
spiro constructs are not supported).
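The preprocessing and applicability filtering described above can be sketched as follows. This is a minimal illustration, not the PFS implementation: the `Structure` fields, the crude string-based `desalt` step and the rule names are all hypothetical, and a production system would use a cheminformatics toolkit rather than string handling.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Structure:
    compound_id: str
    smiles: str          # structure as a SMILES string
    formal_charge: int   # net charge of the parent structure (assumed supplied)
    ring_count: int      # number of condensed rings (assumed precomputed)


def desalt(s: Structure) -> Structure:
    """Crude stand-in for desalting: keep the longest '.'-separated fragment."""
    largest = max(s.smiles.split("."), key=len)
    return Structure(s.compound_id, largest, s.formal_charge, s.ring_count)


# An applicability rule returns None when satisfied, else a rejection reason.
Rule = Callable[[Structure], Optional[str]]


def neutral_only(s: Structure) -> Optional[str]:
    return None if s.formal_charge == 0 else "model applies to neutral molecules only"


def max_condensed_rings(limit: int) -> Rule:
    def rule(s: Structure) -> Optional[str]:
        if s.ring_count < limit:
            return None
        return f"needs fewer than {limit} condensed rings"
    return rule


def prepare(
    structures: List[Structure],
    preprocessors: List[Callable[[Structure], Structure]],
    rules: List[Rule],
) -> Tuple[List[Structure], List[Tuple[str, List[str]]]]:
    """Apply preprocessing steps in order, then split into accepted/rejected."""
    accepted, rejected = [], []
    for s in structures:
        for step in preprocessors:
            s = step(s)
        reasons = [r for r in (rule(s) for rule in rules) if r is not None]
        if reasons:
            rejected.append((s.compound_id, reasons))
        else:
            accepted.append(s)
    return accepted, rejected
```

Keeping the rejection reasons alongside the compound identifier lets a later stage record why no prediction was produced for a filtered structure.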
Following the evaluation of predicted
properties and prior to the deposition of the results into a transactional
database, the calculation outputs are run through post-processing
protocols. These protocols ensure data consistency and perform data
conversions, if necessary. The results are reported along with their
corresponding confidence intervals and error levels. A record is kept of
the tool, method and model that are used for producing predictions. Where
applicable, a series of additional attachments and annotations are
associated with the results. For example, deductive estimation of risk
from existing knowledge (DEREK) toxicity predictions are expressed in
terms of alerts, which are structural fragments that are known to be toxic
in a given chemical surrounding. Each alert is associated with a document
that outlines the alert definition and lists the literature references from
which the reasoning behind the alert was derived. An appropriate error
message is recorded in cases where property predictions were not made
either because structures were noncompliant and filtered up front or
because a calculation failure was reported.
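One way to picture the outcome of post-processing is a single record per prediction that carries the value, its confidence interval, the provenance (tool, method, model) and any error message. The sketch below is hypothetical: the field names, the symmetric interval and the rounding to two decimals are assumptions made for illustration, not details taken from the PFS.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class PredictionRecord:
    compound_id: str
    property_name: str
    tool: str                 # software that produced the value
    method: str               # e.g. the statistical method used
    model_version: str
    value: Optional[float] = None
    confidence_interval: Optional[Tuple[float, float]] = None
    annotations: List[str] = field(default_factory=list)
    error: Optional[str] = None


def postprocess(
    compound_id: str,
    property_name: str,
    raw_value: Optional[float],
    half_width: float,
    *,
    tool: str,
    method: str,
    model_version: str,
    filtered_reason: Optional[str] = None,
) -> PredictionRecord:
    """Turn a raw tool output into a consistent, annotated record."""
    base = dict(
        compound_id=compound_id,
        property_name=property_name,
        tool=tool,
        method=method,
        model_version=model_version,
    )
    if filtered_reason is not None:
        # Structure was rejected up front; no calculation was attempted.
        return PredictionRecord(**base, error=f"filtered: {filtered_reason}")
    if raw_value is None:
        return PredictionRecord(**base, error="calculation failure")
    value = round(float(raw_value), 2)  # assumed reporting precision
    interval = (round(value - half_width, 2), round(value + half_width, 2))
    return PredictionRecord(**base, value=value, confidence_interval=interval)
```

Recording failures in the same shape as successful predictions means downstream consumers can tell "no value because filtered" apart from "no value because the tool failed".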
The compute server contains a
component that keeps track of the status of the original compound
depositories. Whenever a new compound appears in the original structure
source, it is retrieved and validated for structural integrity. Only those
compounds that satisfy a predefined set of structural integrity rules are
added to the PFS compound domain. Those compounds are scheduled for the
property calculations, while the structures that fail the structural
integrity test are marked for review. Because this component of the PFS
also monitors changes in the structures of existing compounds in the
primary depositories, corrected structures are reprocessed by the PFS as
soon as the corrections are made.
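The monitoring behavior described above amounts to change detection against the source repository. A minimal sketch follows, assuming structures are available as strings keyed by compound identifier; the hashing scheme, the `sync` function and the validity check passed to it are hypothetical stand-ins for the real structural integrity rules.

```python
import hashlib
from typing import Callable, Dict, List, Tuple


def fingerprint(structure: str) -> str:
    """Stable digest used to notice that a stored structure has changed."""
    return hashlib.sha256(structure.encode("utf-8")).hexdigest()


def sync(
    repository: Dict[str, str],
    known: Dict[str, str],
    is_valid: Callable[[str], bool],
) -> Tuple[List[str], List[str]]:
    """Compare the source repository with the digests already processed.

    Returns (ids scheduled for calculation, ids marked for review).
    `known` is updated in place for every structure that passes validation.
    """
    to_calculate, to_review = [], []
    for compound_id, structure in repository.items():
        digest = fingerprint(structure)
        if known.get(compound_id) == digest:
            continue  # unchanged since the last pass
        if is_valid(structure):
            known[compound_id] = digest
            to_calculate.append(compound_id)
        else:
            to_review.append(compound_id)
    return to_calculate, to_review
```

Because a failed structure is never written to `known`, it is picked up again on the next pass, so a corrected structure is scheduled as soon as it validates.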
The compute server database (Figure
3) is a relational database that has a dual purpose: it stores system
configuration and static information, and it buffers computed results
prior to their export to the warehouse. The former part of the database
content defines relationships between predicted properties and their
corresponding tools, models and methods. The latter part is a form of a
transient staging area where results are annotated and validated.
The warehouse is a dedicated data
depository that enforces overall data integrity. This database is
optimized for fast retrieval of data. The staging area is a database that
relieves the main data depository from computationally intensive data
insertion, update and deletion operations during peak hours of use. This
is particularly significant considering that during the full calculation
cycle of a given property (i.e., calculation for the whole compound
domain), millions of data insertions take place in a single day. The
staging area also takes on the burden of ensuring data uniqueness and
completeness. The software modules that perform these operations are
referred to collectively as extract, transform and load (ETL). The extract
part, which extracts data from a transactional database, is located in the
compute server. The loader moves data from the staging area into the main
warehouse.
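The division of labor between staging area and warehouse can be illustrated with a small in-memory SQLite example. The table layout, column names and the "latest non-null row wins" policy are assumptions chosen for the sketch; the article does not describe the actual schema or loader logic.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE staging (
            compound_id TEXT, property TEXT, model_version TEXT, value REAL
        );
        CREATE TABLE warehouse (
            compound_id TEXT, property TEXT, model_version TEXT,
            value REAL NOT NULL,
            PRIMARY KEY (compound_id, property, model_version)
        );
    """)
    return db


def stage(db: sqlite3.Connection, rows) -> None:
    """Bulk-insert raw results; no integrity checks happen at this point."""
    db.executemany("INSERT INTO staging VALUES (?, ?, ?, ?)", rows)


def load(db: sqlite3.Connection) -> None:
    """Move complete, de-duplicated rows into the warehouse, then clear staging."""
    with db:  # one transaction: either everything loads or nothing does
        db.execute("""
            INSERT OR REPLACE INTO warehouse
                (compound_id, property, model_version, value)
            SELECT s.compound_id, s.property, s.model_version, s.value
            FROM staging AS s
            WHERE s.value IS NOT NULL
              AND s.rowid = (
                  SELECT MAX(t.rowid) FROM staging AS t
                  WHERE t.compound_id = s.compound_id
                    AND t.property = s.property
                    AND t.model_version = s.model_version
                    AND t.value IS NOT NULL
              )
        """)
        db.execute("DELETE FROM staging")
```

Heavy inserts hit only the staging table, while the warehouse receives one clean row per compound, property and model version.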
The in silico ADMET tools
currently in use originate from a variety of commercial and in-house
sources. These tools often overlap in terms of supported molecular
properties. The preferred tool is selected prior to inclusion of the
predictions into the system. There are two notable aspects of the tool
selection process: scientific validation and statistical profiling of
predictions with respect to experimental measurements. While scientific
validation mainly is concerned with interpretability of a predictive
model, the statistical profiling defines model applicability, scope and
accuracy.

Figure 4. The grid architecture of
the subsystem for making toxicological predictions. DEREK toxicity
prediction is an in silico property calculation that requires an
above-average amount of computing power. To perform these
calculations on a large scale, an array of CPUs that work in
parallel or a large number of independent personal computers could
be used. Each computer would operate as a stand-alone computation
node on which DEREK calculations are performed. The system
operates in asynchronous mode, with a central node that receives
job requests and queues them up for individual processing. Each
job, which is a list of chemical structures, is partitioned by the
central node, and the partial structure lists are assigned to the
worker nodes. The worker nodes perform calculations and place
results into their local output boxes. The central node collects
results from all worker nodes and repeats the cycle until all structures
within a given job are processed.
Another aspect of predictive ADMET model
integration into the corporate infrastructure is computational complexity.
The demand for computational power greatly varies from model to model.
Custom solutions often are needed in order to provide support for models
requiring excessive computational resources. An example of such a solution
is a subsystem with homegrown grid architecture for performing
toxicological predictions (Figure 4).
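The partition-and-collect pattern behind such a grid can be sketched on a single machine. In the example below, threads stand in for the independent worker computers, and `predict` is a placeholder for whatever per-structure calculation a worker runs; neither reflects the actual grid software, which is not described in detail here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def partition(structures: List[str], n_parts: int) -> List[List[str]]:
    """Split a job into at most n_parts roughly equal partial lists."""
    n_parts = max(1, min(n_parts, len(structures)))
    return [structures[i::n_parts] for i in range(n_parts)]


def run_job(
    structures: List[str],
    predict: Callable[[str], List[str]],
    n_workers: int = 4,
) -> Dict[str, List[str]]:
    """Central-node logic: partition the job, dispatch, collect the results."""

    def worker(chunk: List[str]) -> Dict[str, List[str]]:
        # Each worker processes its partial list independently.
        return {s: predict(s) for s in chunk}

    results: Dict[str, List[str]] = {}
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for partial in pool.map(worker, partition(structures, n_workers)):
            results.update(partial)
    return results
```

For example, `run_job(job, predict, n_workers=8)` returns one entry per submitted structure regardless of which worker handled it.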
In terms of result certainty, the in
silico properties could be divided into deterministic and estimated
categories. Examples of deterministic properties include counts of
hydrogen bond acceptors and donors, and molecular weight. Estimated
properties are derived by means of multivariate statistical methods, with
regression and classification methods being the most common. Examples of
such properties are solubility (logS), octanol/water partition coefficient
(logP), pKa, and topological surface area (TSA). The latter category
corresponds more closely to what is commonly understood by the term
predictive model.
The nature of predictive methods is
such that the models capture experimentally obtainable chemical or
biological information during their training phase. The models correlate
experimental information to a set of molecular descriptors that are
derived from chemical structures. Therefore, during the prediction phase,
the primary source of variability is the chemical domain.
The PFS subsystem for in silico
data generation is designed to monitor the changes in compound
repositories, to produce predicted data for newly added chemical entities
and to update predictions for the altered chemicals. The enrichment of the
chemical and biological experimental domain is mirrored in the in
silico realm on a discontinuous basis. Therefore, the predictive
models periodically need to be retrained with datasets that include
up-to-date experimental data, usually on an annual schedule. The model
updates are followed by property recalculation cycles that produce sets of
predictions using the new model versions.
From the drug discovery perspective,
the real value of the predicted ADMET properties comes into play only when
they are combined to form compound profiles. In its simplest form,
compound profiling consists of listing key descriptors, such as logS, logP,
logBB (blood/brain partitioning), Caco2 cell permeability, log IC50 for
hERG K+ channel blockage, etc. The next step is derivation of summary
statistics from the ADMET properties. Perhaps the best-known example of a
druggability descriptor is Lipinski's rule of five, although there are a
number of others. All descriptors attempt to give an estimate of
bioavailability, which translates into the suitability of a chemical
entity as a drug candidate. In silico platforms such as the PFS
facilitate the compound profiling effort by providing
pre-computed ADMET properties for the entirety of the corporate chemical
collection. Thus, it becomes possible to perform rapid searches on large
compound domains by specifying desired ADMET profiles in combination with
a variety of other criteria.
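As an illustration of profile-based searching, the sketch below counts rule-of-five violations (molecular weight above 500, logP above 5, more than 5 hydrogen bond donors, more than 10 hydrogen bond acceptors) and filters a collection by arbitrary criteria. The `Profile` fields, the in-memory list and the example thresholds in the usage note are assumptions; a real system would query the warehouse instead.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Profile:
    compound_id: str
    mol_weight: float
    logp: float
    h_donors: int
    h_acceptors: int
    logs: float  # predicted aqueous solubility


def rule_of_five_violations(p: Profile) -> int:
    """Number of Lipinski criteria the compound violates (0 to 4)."""
    return sum([
        p.mol_weight > 500,
        p.logp > 5,
        p.h_donors > 5,
        p.h_acceptors > 10,
    ])


def search(
    profiles: Iterable[Profile],
    *criteria: Callable[[Profile], bool],
) -> List[str]:
    """Return the ids of compounds that satisfy every criterion."""
    return [p.compound_id for p in profiles if all(c(p) for c in criteria)]
```

A query such as `search(collection, lambda p: rule_of_five_violations(p) <= 1, lambda p: p.logs > -4)` combines a druggability summary with a single-property cutoff.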
Solubility Predictions

Figure 5. A list of the most
important descriptors for the PFS solubility model. Some of the
numerous models for solubility predictions are available in
commercial software solutions. However, their common weaknesses
are that the experimental errors in their training and test
datasets often are unacceptably high, and that the chemical space
covered by those models only partially overlaps with the chemical
space that is of primary interest to drug discovery project teams.
To compensate, a solubility prediction study was performed using
our in-house data, and a partial least squares (PLS)-based model
was chosen. That model used 26 in silico predictors, and it
regressed them against experimental solubility measurements. A
descriptor set was determined to identify the most significant
properties; the selected descriptors fell into the following
categories: lipophilicity, flexibility, hetero atom counts and
polar surface descriptors.
The aqueous solubility of drugs is one
of the key factors affecting their bioavailability and, thus, the quality
of solubility predictions has direct impact on bioavailability
assessments. Numerous computational methods have been developed for
aqueous solubility predictions (7, 8). A closer assessment of these models
reveals that the significance of the descriptors used in various methods
is dependent on the composition of the chemical space and the model used.
For the PFS, a partial least squares (PLS) model trained on a custom
training data set of four classes of descriptors (Figure 5) was found to
give the most statistically significant predictions. The four descriptor
classes are lipophilicity, hetero-atoms content, flexibility and polar
surface descriptors.
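To make the PLS idea concrete, the following is a small pure-Python sketch of single-response PLS (PLS1) fitted with the NIPALS procedure. It is a textbook formulation shown for illustration only, not the PFS model: the toy data in the usage note are invented, and a production model would use an established statistics library together with proper validation.

```python
from typing import List, Sequence, Tuple

Component = Tuple[List[float], List[float], float]  # (weights w, loadings p, q)
Model = Tuple[List[float], float, List[Component]]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def fit_pls1(X: List[List[float]], y: List[float], n_components: int) -> Model:
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(m)] for row in X]  # centered X
    yc = [v - y_mean for v in y]                                # centered y
    components: List[Component] = []
    for _ in range(n_components):
        # Weight vector: direction in descriptor space most covariant with y.
        w = [dot([row[j] for row in Xc], yc) for j in range(m)]
        norm = dot(w, w) ** 0.5
        if norm < 1e-12:
            break  # nothing left to explain
        w = [v / norm for v in w]
        t = [dot(row, w) for row in Xc]  # scores
        tt = dot(t, t)
        if tt < 1e-12:
            break
        p = [dot([row[j] for row in Xc], t) / tt for j in range(m)]
        q = dot(yc, t) / tt
        # Deflate X and y before extracting the next component.
        Xc = [[row[j] - t[i] * p[j] for j in range(m)] for i, row in enumerate(Xc)]
        yc = [yc[i] - q * t[i] for i in range(n)]
        components.append((w, p, q))
    return x_mean, y_mean, components


def predict_pls1(model: Model, x: Sequence[float]) -> float:
    x_mean, y_mean, components = model
    r = [x[j] - x_mean[j] for j in range(len(x))]
    y_hat = y_mean
    for w, p, q in components:
        t = dot(r, w)
        y_hat += q * t
        r = [r[j] - t * p[j] for j in range(len(r))]  # same deflation as in fit
    return y_hat
```

With descriptors in the columns of `X` and measured solubility in `y`, `fit_pls1(X, y, n_components=k)` followed by `predict_pls1(model, x_new)` returns the estimate for a new compound; the number of components is normally chosen by cross-validation.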
The majority of publicly available
models are based upon relatively small training data sets containing
compounds from various literature sources with little information on
underlying experimental noise. The development of more accurate and
reliable predictive models necessitates compilation of an experimental
database with highly accurate solubility data for a large, diverse
collection of drug-like molecules.
Toxicity Predictions

Figure 6. The DEREK report for
1-hydroxyoxaunomycin. The DEREK-Alerts field lists toxicity end
points (e.g., mutagenicity, skin sensitization), followed by alert
ID and alert name (e.g., quinone). The Location field displays the
position of structural fragments corresponding to the individual
alerts. The Ranking/Category field marks the highest ranking DEREK
alert for this compound. This field also contains an explanation
of a given rank level. The remaining two fields contain a
description of the alerts (Description) and the recommended course of
action (Recommendation).
Toxicological predictions fall into a
category of models that are the most difficult to formulate (2, 9–12).
This modeling challenge is to a great extent met by knowledge-based expert
systems. A notable example of such a system is DEREK. DEREK predictions
are expressed in terms of alerts, which correspond to predefined
substructure fragments. An alert is raised if its corresponding
substructure fragment is found and a set of custom rules is satisfied. The
alerts are grouped further into toxicological categories, such as
mutagenicity and carcinogenicity, and annotated with supplemental
information (Figure 6).
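The general shape of an alert-based expert system (fragment match plus additional rules, grouped by toxicological end point) can be sketched as below. This is a generic illustration and not DEREK's rule language or API: the fragment set is assumed to come from a separate substructure search, and the alert identifiers and the logP condition in the usage note are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Compound:
    compound_id: str
    fragments: Set[str]  # fragment names found by a substructure search (assumed)
    properties: Dict[str, float] = field(default_factory=dict)


@dataclass
class Alert:
    alert_id: int        # hypothetical identifier
    name: str            # name of the structural fragment, e.g. "quinone"
    endpoint: str        # toxicological category the alert belongs to
    rules: List[Callable[[Compound], bool]] = field(default_factory=list)

    def fires(self, c: Compound) -> bool:
        """Raised only if the fragment is present and every rule holds."""
        return self.name in c.fragments and all(rule(c) for rule in self.rules)


def evaluate(c: Compound, alerts: List[Alert]) -> Dict[str, List[str]]:
    """Group the raised alerts by toxicological end point."""
    report: Dict[str, List[str]] = {}
    for alert in alerts:
        if alert.fires(c):
            report.setdefault(alert.endpoint, []).append(alert.name)
    return report
```

Defining an alert such as `Alert(2, "quinone", "skin sensitization", [lambda c: c.properties.get("logP", 0.0) > 1.0])` shows how a custom rule can restrict when a fragment match is reported.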
Due to the complexity of
toxicological phenomena, the accuracy of predictions generated by
general-purpose models leaves considerable room for
improvement. Prediction quality enhancements are gained through
incorporation of proprietary experimental toxicity measurements into the
knowledge base, which involves definition of new alerts, customization of
the rule base and definition of extended in-house annotation.
Conclusion
In silico methods continue to
make profound contributions to drug discovery. The products of in
silico methods — the estimations of ADMET properties — supplement
traditional experimental measurements. The availability of in silico
properties facilitates more efficient design of screening libraries and
serves as a foundation for numerous ADMET compound profiling and
druggability assessments along the drug discovery value chain.
Acknowledgements
Data and ideas presented in this
article are the result of many productive collaborations and discussions
with our colleagues at Sanofi Aventis, including: Alexander Amberg
(Frankfurt, Germany), Alexander Sukharevsky (Boston, Massachusetts, USA),
Barbara Butler (Bridgewater, New Jersey, USA), Christine Rudolph
(Frankfurt, Germany), Claude Luttmann (Paris, France), Dieter Kreusel
(Frankfurt, Germany), Elie Giraud (Bridgewater, New Jersey, USA),
Friedemann Schmidt (Frankfurt, Germany), Hans-Peter Spirkl (Frankfurt,
Germany), Helene Bourdon (Paris, France) and Paul Brown, Paul Whitehead,
Stephan Reiling, Sundararajan Vijayakumar, Valery Polyakov and Xin-Hua
Song (Bridgewater, New Jersey, USA).
Dragan A. Cirovic is a senior
scientist in the informatics department, Vijay Bhargava is a vice
president and global head of the drug metabolism and pharmacokinetics
department, Frank Harrison is a senior director of the information
services department, Roy Vaz is head of the investigative product
optimization group and Abdel Laoui is head of the
chemoinformatics group at Sanofi Aventis. Dragan Cirovic can be reached at
Sanofi Aventis, 1041 Route 202-206 N, BRJR1-002A, Bridgewater, New Jersey
08807-0800.
References
1. N. Fay and D. Ullmann, Drug
Disc. Today 7, 181–186 (2002).
2. W.J. Egan, K.M. Merz Jr., and J.J.
Baldwin, J. Med. Chem. 43, 3867–3877 (2000).
3. D.E. Johnson et al., Curr.
Opin. Drug Disc. Dev. 4, 92–101 (2001).
4. W.L. Jorgensen, Science 303,
1813–1818 (2004).
5. J. Bajorath, Nature Rev. Drug
Disc. 1, 882–894 (2002).
6. T. Kenakin, Nature Rev. Drug
Disc. 2, 429–437 (2003).
7. W.L. Jorgensen and E.M. Duffy, Adv.
Drug Deliv. Rev. 54, 355–366 (2002).
8. G.W. Caldwell, Curr. Opin. Drug
Disc. Dev. 3, 30–41 (2000).
9. A. Bugrim, T. Nikolskaya and Y.
Nikolsky, Drug Disc. Today 9, 127–135 (2004).
10. H. van de Waterbeemd and E. Gifford, Nature Rev. Drug
Disc. 2, 192–204 (2003).
11. D. Zmuidinavicius, P. Japertas,
A. Petrauskas and R. Didziapetris, Curr. Topics Med. Chem. 3,
1301–1314 (2003).
12. G.M. Pearl, S. Livingston-Carr
and S.K. Durham, Curr. Topics Med. Chem. 1, 247–255
(2001).