Databanks are repositories that store raw biological data, whereas databases provide additional annotation and functionality. Examples of databanks are GenBank (Benson et al. 2003) and EMBL (Stoesser et al. 2003) for primary DNA sequences and PDB (Berman et al. 2000) for protein structures. SwissProt (Boeckmann et al. 2003) and FlyBase (FlyBase Consortium 2003) are well-known databases which provide genetic and functional annotation. Examples of higher-level databases are PFAM (Bateman et al. 2002), SCOP (Murzin et al. 1995) and KEGG (Kanehisa et al. 2002). All of them have one feature in common: their fast growth. The aforementioned explosion of data can be quantified: DNA databases are currently doubling every 9 months. The PDB is expected to grow faster due to structural genomics efforts (3298 structures were deposited in 2001 and 3381 structures in 2002, giving a total of 19,623 entries at the end of 2002).
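The growth figures quoted above can be related to one another by simple arithmetic. The following Python sketch assumes constant exponential growth (an idealisation that is not stated in the text) and uses hypothetical function names; it converts a doubling time into a growth factor and estimates the doubling time implied by one year of PDB depositions.

```python
import math


def growth_factor(months, doubling_time_months=9.0):
    """Factor by which a collection grows in `months`, assuming a constant doubling time."""
    return 2.0 ** (months / doubling_time_months)


def implied_doubling_time(total_at_year_end, added_during_year):
    """Doubling time in months implied by one year's growth, assuming exponential growth."""
    total_at_year_start = total_at_year_end - added_during_year
    annual_factor = total_at_year_end / total_at_year_start
    return 12.0 * math.log(2.0) / math.log(annual_factor)


# DNA databases: doubling every 9 months (figure from the text).
dna_annual_factor = growth_factor(12)

# PDB: 3381 structures deposited in 2002, 19,623 entries at the end of 2002.
pdb_annual_increase = 3381 / (19623 - 3381)
pdb_doubling_months = implied_doubling_time(19623, 3381)

print(f"DNA databases grow by a factor of {dna_annual_factor:.2f} per year")
print(f"PDB grew by {pdb_annual_increase:.1%} in 2002")
print(f"Implied PDB doubling time: {pdb_doubling_months:.0f} months")
```

Under this assumption, a 9-month doubling time corresponds to roughly a 2.5-fold increase per year, while the 2002 PDB figures correspond to an increase of about 21% over that year.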
An unsolved problem is the integration of biological databases. Since each database contains only a subset of biological knowledge, databases have to be combined to gather all of the available information. Several methods to integrate biological databases exist, but the technical challenges are enormous (cf. review by Stein 2003). Link integration is the most common integration method so far, as employed in the sequence retrieval system (SRS) (Zdobnov et al. 2002) and Entrez (Schuler et al. 1996). Severe problems include naming clashes (e.g. genes and gene products using the same name) and stale hyperlinks to outdated database entries. When trying to combine information from several resources, scientists have to access several web sites (often using "copy & paste" between different browser windows). Obviously, this approach is tedious and does not scale.
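The naming-clash problem can be made concrete with a small Python sketch. It is an illustration only: the record layout, the database names and the example entries are hypothetical, and comparing names case-insensitively is an assumption made for this sketch rather than a convention of any of the resources mentioned above.

```python
from collections import defaultdict


def find_naming_clashes(entries):
    """Return names that refer to more than one kind of entity.

    entries: iterable of (database, name, entity_type) tuples.
    Names are compared case-insensitively (an assumption of this sketch).
    """
    kinds_by_name = defaultdict(set)
    for _database, name, entity_type in entries:
        kinds_by_name[name.lower()].add(entity_type)
    return {name: kinds for name, kinds in kinds_by_name.items() if len(kinds) > 1}


# Hypothetical entries from two hypothetical databases.
entries = [
    ("database_a", "abc1", "gene"),
    ("database_b", "ABC1", "protein"),
    ("database_b", "xyz2", "protein"),
]

clashes = find_naming_clashes(entries)
print(clashes)
```

Here the name "abc1" is flagged because one resource uses it for a gene and another for a gene product, which is exactly the situation in which a hyperlink between the two entries becomes ambiguous.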
The underlying data models of the databases are changing quickly in order to account for new technological developments and to describe the data in more detail. Unfortunately, this creates additional problems when accessing their content (software has to be rewritten, etc.). Furthermore, each database uses its own vocabulary to describe molecular function or cellular localisation. Even the meaning of attributes such as protein function may differ: one database may annotate the function of the human Titin protein as muscle protein, whereas another database may describe its function as kinase.
Ontologies give hope of overcoming these problems. In information science, an ontology is an explicit formal specification of how to represent the objects, concepts and other entities that are assumed to exist in some area of interest, and the relationships among them. Ontologies provide a sophisticated vocabulary to describe the key concepts. They do not integrate databases themselves, but serve as a basis for merging several databases.
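How a shared vocabulary helps in merging annotations can be sketched in Python using the Titin example. Everything in this sketch is hypothetical: the term identifiers (prefixed "EX:"), the term names, the database names and the label mapping are invented for illustration and are not taken from any real ontology.

```python
# Toy ontology: term id -> (name, ids of parent terms via "is_a").
ONTOLOGY = {
    "EX:0001": ("molecular function", []),
    "EX:0002": ("kinase activity", ["EX:0001"]),
    "EX:0003": ("structural constituent of muscle", ["EX:0001"]),
}

# Hypothetical mapping from each database's own label to an ontology term.
LABEL_TO_TERM = {
    ("database_a", "muscle protein"): "EX:0003",
    ("database_b", "kinase"): "EX:0002",
}


def ancestors(term_id):
    """Return all terms reachable from term_id via is_a, excluding term_id itself."""
    found = set()
    stack = list(ONTOLOGY[term_id][1])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(ONTOLOGY[current][1])
    return found


def merge_annotations(records):
    """Map (database, label) records for one protein onto ontology term ids."""
    terms = set()
    for database, label in records:
        key = (database, label)
        if key not in LABEL_TO_TERM:
            raise KeyError(f"no ontology mapping for {label!r} from {database}")
        terms.add(LABEL_TO_TERM[key])
    return terms


titin_terms = merge_annotations(
    [("database_a", "muscle protein"), ("database_b", "kinase")]
)
shared_parents = set.intersection(*(ancestors(t) for t in titin_terms))
print(sorted(titin_terms), sorted(shared_parents))
```

In this sketch the two differing labels become two distinct terms under a common parent, so the merged record can state that both annotations describe function rather than treating them as contradictory. The ontology supplies the vocabulary; the mapping from each database still has to be built separately.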
A major problem is error propagation in databanks and databases. DNA sequences may contain frame shifts, deletions, contaminations from cloning vectors, etc., and functional annotations may be unverified or outdated. PDB structures often use non-standard atom names, and NMR restraint files often follow a different atom-name nomenclature than their PDB structure counterparts. This compromises the overall quality and usefulness of the stored data. Without expert knowledge, a lot of time and money can be wasted.
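A mismatch in atom-name nomenclature between a restraint file and its structure can be detected with a simple consistency check. The Python sketch below is a minimal illustration under stated assumptions: atoms are represented as (residue number, atom name) pairs, no real file format is parsed, and the alias table is a hypothetical example (the amide proton written as "HN" in the restraints and "H" in the structure) rather than a complete translation table.

```python
def unmatched_restraint_atoms(structure_atoms, restraint_atoms, aliases=None):
    """Return restraint atoms that have no counterpart in the structure.

    structure_atoms: set of (residue_number, atom_name) from the coordinate file.
    restraint_atoms: iterable of (residue_number, atom_name) from the restraint file.
    aliases: optional dict translating restraint atom names to structure atom names.
    """
    aliases = aliases or {}
    missing = []
    for residue_number, atom_name in restraint_atoms:
        translated = aliases.get(atom_name, atom_name)
        if (residue_number, translated) not in structure_atoms:
            missing.append((residue_number, atom_name))
    return missing


# Hypothetical data for two residues.
structure_atoms = {(1, "H"), (1, "HA"), (2, "H"), (2, "HA")}
restraint_atoms = [(1, "HN"), (1, "HA"), (2, "HN"), (2, "HB2")]

without_aliases = unmatched_restraint_atoms(structure_atoms, restraint_atoms)
with_aliases = unmatched_restraint_atoms(
    structure_atoms, restraint_atoms, aliases={"HN": "H"}
)
print(without_aliases)
print(with_aliases)
```

Without the alias table three of the four restraint atoms fail to match; with it, only the atom that is genuinely absent from the structure remains. A report of this kind shows which discrepancies are purely a matter of nomenclature and which need expert inspection.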
6.1 BioMagResBank and PDB/RCSB
The principal databases for the storage of NMR experimental data and solved structures are the BioMagResBank (BMRB) and the Protein Data Bank (PDB), the latter curated by the Research Collaboratory for Structural Bioinformatics (RCSB). The BMRB stores all non-coordinate biomolecular NMR data (Doreleijers et al. 2003): chemical shifts, NOEs, coupling constants, residual dipolar couplings (RDCs), hydrogen exchange rates and protection factors, order parameters, atomic relaxation parameters, and molecular correlation times. The PDB is the central repository for all coordinates and also manages restraint files used for NMR structure calculation (Berman et al. 2000). Most journals require structures and NMR data to be deposited in the PDB and BMRB.
By exploiting these databases, several methods for the prediction of chemical shifts, dihedral angles, and secondary and tertiary structure have been developed. A well-known example is the TALOS programme (Cornilescu et al. 1999) for the empirical prediction of phi and psi backbone torsion angles. The method exploits a subset of high-resolution X-ray PDB structures for which accurate NMR chemical-shift data are available. Since the difference between a chemical shift and its corresponding random coil value (the secondary shift) is often correlated with protein secondary structure, TALOS is able to make quantitative predictions for phi and psi using only secondary shift and sequence information.
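The secondary shift described above is the observed chemical shift minus the random coil value for the same residue type and nucleus. The following Python sketch computes it for a short stretch of sequence. It is not TALOS and does not reproduce its database search; the function name is hypothetical, and both the random coil values and the observed shifts are illustrative placeholders that should be replaced by a published reference table and measured data.

```python
def secondary_shifts(observed, random_coil):
    """Return (residue_type, observed - random coil) in ppm for each residue.

    observed: list of (residue_type, shift_ppm) in sequence order, one nucleus type.
    random_coil: dict residue_type -> random coil shift_ppm for the same nucleus.
    """
    result = []
    for residue_type, shift in observed:
        if residue_type not in random_coil:
            raise KeyError(f"no random coil value for {residue_type}")
        result.append((residue_type, shift - random_coil[residue_type]))
    return result


# Placeholder C-alpha random coil values in ppm (illustrative, not a reference table).
RANDOM_COIL_CA = {"ALA": 52.5, "GLY": 45.1, "LEU": 55.1}

# Hypothetical observed C-alpha shifts for a three-residue stretch.
OBSERVED_CA = [("ALA", 54.9), ("LEU", 57.6), ("GLY", 44.6)]

for residue_type, delta in secondary_shifts(OBSERVED_CA, RANDOM_COIL_CA):
    print(f"{residue_type}: {delta:+.1f} ppm")
```

Values of this kind, together with the residue types, form the input from which a predictor can look for similar patterns in proteins of known structure.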