ISB Informatics:
Systems Biology in Action
By Nat Goodman
The Institute for
Systems Biology (ISB), where I work, is an independent
non-profit research institution located in Seattle. It
is a multidisciplinary place and includes
biologists, physicians, computer folks, engineers,
physicists, mathematicians, education specialists, and
others. Most projects involve a combination of
large-scale data production, small-scale bench
biology, and computational analysis. Software is
developed or acquired to meet the needs of specific
projects or research groups. Software is routinely
shared among groups, but there’s no official ISB
informatics system that everyone uses.
In this review, I
describe a sample of the software used at ISB,
focusing on elements that have moved beyond
proof-of-principle and may be relevant to other groups
engaged in systems biology.
The array facility
provides processing pipelines for standard Affymetrix
expression arrays and certain types of two-color
spotted arrays. The Affymetrix pipeline uses
Affymetrix software (GCOS,
formerly MAS 5) for image acquisition and basic data
analysis, and Bioconductor
and Tibshirani’s SAM
for more advanced analysis. The spotted array pipeline
uses Buhler’s Dapple
for image analysis and Ideker’s VERA
and SAM for data analysis. Most users do
additional analysis beyond what the pipeline gives
them. Bioconductor
and TIGR
Multiexperiment Viewer (MeV) are widely used for
this purpose, Nieselt’s Mayday
is used by some people, and at least one group uses GeneSpring.
The proteomics group
provides the Trans-Proteomic
Pipeline for processing mass spec proteomics data
generated using a variety of instruments. The pipeline
includes tools for processing of raw mass spectra,
database search to identify peptides and proteins,
assessment of confidence in these identifications,
identification of proteins in cross-linked complexes,
and protein quantitation. The pipeline is based on the
mzXML standard and can accommodate additional tools
that conform to the standard.
Data from these
pipelines and others end up in the SBEAMS
database. SBEAMS stores many kinds of data but does
not attempt to integrate this data in any deep sense.
To integrate or analyze data, users export it into
files or databases constructed specifically for this
purpose. Caveat: Unlike most ISB software, which runs
on Linux and MySQL or Postgres, SBEAMS requires
Windows and Microsoft SQL Server.
Cytoscape
is a program for visualizing and analyzing biological
networks, such as protein-protein, protein-gene, and
gene-gene interactions. Cytoscape provides add-on
tools, called plug-ins, for doing integrated analysis
of interaction data with other kinds of data. Examples
include analysis of expression data to identify
sub-networks with highly correlated expression, and
annotation data, such as GO, to associate sub-networks
with biological functions. Originally developed by
Trey Ideker when he was at ISB, Cytoscape is now
produced by a consortium he leads that includes ISB,
the University of California San Diego, Sloan-Kettering,
Institut Pasteur, and Agilent.
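To make the idea of "sub-networks with highly correlated expression" concrete, here is a minimal Python sketch. It is not Cytoscape's plug-in code or algorithm. It assumes a simple approach of my own choosing: keep only those interactions whose two endpoints have a Pearson correlation above a threshold, then group the surviving edges into connected components. The gene names, expression values, and the 0.9 threshold are all made up for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a flat profile carries no correlation signal
    return cov / (sx * sy)

def correlated_subnetworks(edges, expression, threshold=0.9):
    """Keep interactions whose endpoints are co-expressed, then group
    the surviving edges into connected components."""
    kept = [(a, b) for a, b in edges
            if a in expression and b in expression
            and pearson(expression[a], expression[b]) >= threshold]
    neighbors = {}
    for a, b in kept:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen, components = set(), []
    for start in sorted(neighbors):
        if start in seen:
            continue
        # Depth-first walk over the filtered network.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        components.append(sorted(component))
    return components

# Toy data: hypothetical gene names and invented expression values.
edges = [("geneA", "geneB"), ("geneB", "geneC"),
         ("geneC", "geneD"), ("geneD", "geneE")]
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 6.0, 8.2],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.1, 4.0, 2.1],
    "geneE": [1.0, 5.0, 1.0, 5.0],
}
subnetworks = correlated_subnetworks(edges, expression)
print(subnetworks)
```

On this toy input the chain of five genes splits into two co-expressed pairs, because the geneB-geneC and geneD-geneE interactions fall below the threshold. A real analysis would need a sounder statistical treatment than a fixed cutoff.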
Similar capabilities
are provided by Ingenuity
Pathways Analysis (IPA), a commercial product that
the company provides to ISB through a collaboration. A
key strength of IPA is that it operates on
Ingenuity’s database of interactions, which is more
comprehensive and probably more accurate than
available public databases. A drawback is that ISB is
not allowed to integrate IPA or the Ingenuity database
with other software or databases, making it harder to
incorporate IPA into the main data-analysis flow.
Discussions are underway with GeneGo for access to
their MetaCore
product.
Gaggle
is a framework for integrating interactive software
tools to support data exploration. A controller
manages communication among the interactive tools
(collectively, geese),
which run as separate programs on the user’s desktop
computer. Components communicate with each other,
generally in response to user requests, by passing
simple messages via Java’s Remote Method Invocation.
Geese are constructed by modifying existing programs
to implement the Gaggle communication protocol, a
process which is generally straightforward for
well-written Java programs. Existing geese include Cytoscape,
TIGR
MeV, Data Matrix Viewer (for viewing and graphing
tabular data), R
command console (for statistical and mathematical
programming), Web interfaces to KEGG
and STRING,
and a Firefox extension that adds Gaggle communication
to any Web page. Gaggle was developed by the Baliga
laboratory at ISB and continues as a collaboration
with Bonneau at New York University.
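The controller-and-geese arrangement can be sketched in a few lines of Python. This is only an illustration of the message-passing pattern, not Gaggle's actual protocol: the real system passes messages between separate programs over Java's Remote Method Invocation, whereas this sketch uses plain method calls inside one process. The class names, method names, and message layout are all hypothetical.

```python
class Controller:
    """Stand-in for the Gaggle controller: tracks tools and relays messages."""

    def __init__(self):
        self.geese = {}

    def register(self, goose):
        self.geese[goose.name] = goose

    def broadcast(self, sender, message):
        """Deliver a message to every registered goose except the sender."""
        for name, goose in self.geese.items():
            if name != sender:
                goose.handle(message)


class Goose:
    """Stand-in for an interactive tool adapted to talk to the controller."""

    def __init__(self, name, controller):
        self.name = name
        self.controller = controller
        self.received = []
        controller.register(self)

    def send_gene_list(self, genes):
        # Typically triggered by a user request, such as selecting nodes.
        message = {"type": "gene_list", "from": self.name, "genes": list(genes)}
        self.controller.broadcast(self.name, message)

    def handle(self, message):
        # A real tool would update its display; here we only record it.
        self.received.append(message)


controller = Controller()
network_viewer = Goose("network_viewer", controller)
matrix_viewer = Goose("matrix_viewer", controller)
console = Goose("console", controller)

# The user selects two (hypothetical) genes in the network viewer.
network_viewer.send_gene_list(["geneA", "geneB"])
print(matrix_viewer.received)
```

The point of the design is that each tool needs to know only about the controller, not about the other tools, which is why adapting an existing program is usually a modest job.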
GDxBase is a framework
for disease-oriented Web sites developed by my group
at ISB in collaboration with Smink in Todd’s
laboratory at the University of Cambridge. The software
integrates disease-specific and general
biological data, presenting the information in a form
suitable for disease researchers who are not experts
in the underlying data types. Tools are provided for
viewing data for lists of genes across the integrated
datasets. We use GDxBase for a large type 1 diabetes
Web site, T1DBase,
funded by the Juvenile
Diabetes Research Foundation, and a smaller
Huntington’s Disease Web site, HDBase,
funded by the Hereditary
Disease Foundation and High
Q Foundation. Other groups are using GDxBase for
type 2 diabetes, prion disease, bloodomics, and
diseases of energy metabolism.
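As a rough picture of what "viewing data for lists of genes across the integrated datasets" means, the Python sketch below builds one row per gene with one column per dataset. It is not GDxBase code. The dataset names, gene identifiers, and values are invented, and each dataset is assumed to be a simple mapping from gene identifier to a single value.

```python
def gene_list_report(genes, datasets):
    """Return one row per gene with a column per dataset.

    A value of None marks a gene that a dataset has no entry for.
    """
    names = sorted(datasets)
    rows = []
    for gene in genes:
        row = {"gene": gene}
        for name in names:
            row[name] = datasets[name].get(gene)
        rows.append(row)
    return rows


# Hypothetical datasets keyed by hypothetical gene identifiers.
datasets = {
    "expression_fold_change": {"gene1": 2.4, "gene2": 0.7},
    "linkage_region": {"gene1": "region_x", "gene3": "region_y"},
}
report = gene_list_report(["gene1", "gene2", "gene3"], datasets)
for row in report:
    print(row)
```

Showing the gaps explicitly, rather than dropping genes that lack data in some dataset, matters for readers who are not experts in the underlying data types.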
Systems biology needs a
ton of software to digest the data upon which it
relies. This review gives a taste of the software used
at one leading systems biology institution. The most
important message, I think, is that diversity rules.
While computer folks always want a coherent
architecture to tie the software together, a less
organized approach is probably better for this dynamic
field at this point in time.