Comparative Modeling

2.1 Sequence Gathering and Alignment

Before comparative molecular modeling (i.e. three-dimensional structure building) can be initiated, the target sequence must be aligned with at least one template. However, the lower the sequence identity, the harder it is to detect similarity and to align sequences. While similarity is obvious at high sequence identity (above 30 %), its detection might not be straightforward at lower sequence identity. A prerequisite is generally to find and align close homologues of the target.

2.1.1 Sequence Database Searches

Sequence database searches were efficiently automated one decade ago through the development of BLAST and its derivatives (Altschul et al. 1990, 1997). Most recent methods, such as fold recognition (see Sect. 2.2.1), include such searches prior to sequence-structure comparison, and their efficiency relies heavily on the search output. The use of the template's homologues is also helpful, especially through profile-based methods (Rychlewski et al. 2000). Checking for the availability of a sufficient number of homologues in the sequence databases may be necessary to ascertain the quality of the outputs (alignment, fold recognition, secondary structure prediction). In some cases, this verification is highly recommended, especially for eukaryotic sequences belonging to small families with no prokaryotic equivalent (Ganem et al. 2003) or for particular proteins specific to a phylogenetic "niche" (Carret et al. 1999). The number of fully sequenced prokaryotic genomes usually warrants the construction of reasonable multiple sequence alignments for most proteins of bacterial or archaeal origin. However, for some sequence subfamilies, PSI-BLAST searches may converge too rapidly in the absence of "joining" intermediates between too distantly related subfamilies (Labesse et al. 2001). At the same time, the efficiency of the sequencing projects makes PSI-BLAST searches more and more successful. PSI-BLAST may detect true sequence similarity even at a very low level of sequence identity (~15% over 60-90% of the protein length; see CASP5 results). In these cases, a reliable alignment is more likely to be achieved using sequence-structure comparison methods and/or the manual editing of the sequence-structure alignment (hereafter named structural alignment) by experts.

2.1.2 Multiple Sequence Alignments

Once similar sequences have been gathered, various sequence alignment methods are available (e.g. CLUSTALW, DIALIGN, etc.) and can be directly connected to molecular modeling (Lambert et al. 2002). PSI-BLAST itself provides multiple sequence alignments. However, the latter correspond to similarity matches and do not always cover the full-length hit sequences. Compared to pairwise alignment, multiple alignments may reveal more meaningful sequence conservation (Labesse 1996). Computer programs such as MEME (Bailey et al. 1997) are available to pick out, among aligned sequences, common motifs that usually correspond to functionally or structurally important regions. However, fine functional assignment may require tracing subtle changes aside from the common motifs that may not be automatically detected (Labesse et al. 1994; Reid et al. 2003).

The overall quality of the alignment depends mainly on the mean pairwise sequence identity. The statistical significance of a multiple alignment can now be estimated (Pei et al. 2003). At a low level of sequence identity (below 25 %), structural information will be needed to improve the alignment quality (e.g. avoiding insertion or deletion inside secondary structure elements; Gracy et al. 1993).
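The dependence on mean pairwise identity can be made concrete with a short sketch. It assumes that identity is counted over the columns where both sequences carry a residue (other conventions exist, e.g. dividing by the shorter sequence length), and the function names and the 25 % cutoff parameter are illustrative choices rather than part of any cited program.

```python
from itertools import combinations
from typing import Sequence

GAP = "-"


def pairwise_identity(a: str, b: str) -> float:
    """Fraction of identical residues over columns where both rows have a residue.

    Assumption: gap-containing columns are ignored in the denominator.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    aligned = [(x, y) for x, y in zip(a, b) if x != GAP and y != GAP]
    if not aligned:
        return 0.0
    return sum(x == y for x, y in aligned) / len(aligned)


def mean_pairwise_identity(msa: Sequence[str]) -> float:
    """Average of pairwise_identity over all pairs of rows in the alignment."""
    pairs = list(combinations(msa, 2))
    if not pairs:
        raise ValueError("need at least two aligned sequences")
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


def needs_structural_information(msa: Sequence[str], threshold: float = 0.25) -> bool:
    """True when the mean pairwise identity falls below the chosen threshold."""
    return mean_pairwise_identity(msa) < threshold
```

Applied to the rows of a multiple alignment, `needs_structural_information` simply flags cases falling under the 25 % level discussed above; it does not by itself assess statistical significance.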

2.2 Structural Alignments

We wish to put strong emphasis here on the essential step of sequence-structure alignment, also called fold recognition. This requirement is reinforced by the growing use of sequence-structure comparison methods to derive alignments in the so-called twilight and midnight zones (for sequence identity levels of 15-25 % and 0-15 %, respectively). We shall illustrate here, with several examples, the need for careful refinement of structural alignment as well as the usefulness of the crude models one can derive from these alignments. Fold recognition is usually performed to search structure databases using a "frozen approximation" for speed. It allows rapid similarity detection. In contrast, true three-dimensional threading evaluates pairwise contacts (in between amino acids or atoms) instead of profile-profile matches. The enhanced sensitivity of pairwise contacts suggests that threading should be used after profile-profile comparison. This strategy has been implemented in PROSPECT (Xu et al. 2000) and PROSPECTOR (Skolnick and Kihara 2001) and is also made available on the server @TOME. Various factors may interfere with the achievement of a correct sequence-structure alignment and their identification may require going through all the following steps: alignment refinement (Sect. 2.2.2), model building (Sect. 2.3) and model evaluation (Sect. 2.4).

2.2.1 Fold Recognition

Fold-recognition programs usually produce sequence alignments that are more reliable than those derived from purely sequence-based methods. Furthermore, they can detect distant homologues with sequence identity as low as 10 % (Kinch and Grishin 2002). However, the current success rate of individual threaders reaches at best 40 % for distantly related structures (Bujnicki et al. 2001). This can be partially overcome by using consensus scoring schemes such as those provided by several web servers (http://BioInfo.PL/meta/meta.html: Bujnicki et al. 2001; http://GeneSilico.pl/meta/: Kurowski and Bujnicki 2003; @TOME). On the server @TOME, structural alignments are further evaluated through a common threading tool (T.I.T.O.; Labesse and Mornon 1998) using a potential of mean force, PKB (Bryant and Lawrence 1993). The use of a common scoring scheme helps to choose a better template and/or a better structural alignment. When distinct folds are proposed to be compatible for the same region of the query sequence, the proposed similarity is doubtful and extra care must be taken before going through the following steps of structure modeling.
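The consensus idea can be illustrated by a plain vote count over the top-ranked folds returned by several threaders. This is only a minimal sketch: it is not the scoring scheme used by the meta-servers cited above, and the function names, the `top_n` parameter and the 50 % support cutoff are assumptions introduced for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def consensus_folds(top_hits: Dict[str, List[str]], top_n: int = 3) -> List[Tuple[str, int]]:
    """Count how many threaders list each fold among their top_n hits.

    top_hits maps a threader name to its ranked list of fold identifiers.
    Returns (fold, number_of_supporting_threaders) sorted by decreasing support.
    """
    votes: Counter = Counter()
    for hits in top_hits.values():
        # A set is used so that one threader cannot vote twice for the same fold.
        votes.update(set(hits[:top_n]))
    return votes.most_common()


def is_doubtful(top_hits: Dict[str, List[str]], top_n: int = 3, min_support: float = 0.5) -> bool:
    """Flag a query when the best fold is supported by too few threaders.

    Assumption: support below min_support (fraction of threaders) is treated as
    the situation where distinct folds are proposed for the same region.
    """
    ranked = consensus_folds(top_hits, top_n)
    if not ranked:
        return True
    best_votes = ranked[0][1]
    return best_votes / len(top_hits) < min_support
```

A query flagged by `is_doubtful` would call for the extra care mentioned above before any model is built.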

Usually, different threaders will find similar compatible folds but their sequence-structure alignments may differ locally. In case of high sequence similarities (above 25%, over more than 100 residues), discrepancies occur mainly in the vicinity of indels. A few amino acids on each side might be improperly aligned, usually because the alignment follows spurious sequence identity rather than the geometrical likelihood of the indels. Below 25 % sequence identity, or in the case of small proteins, the significance of the alignment might be questioned (Sander and Schneider 1991). Below 10% sequence identity, it might be considered that a correct alignment cannot be achieved (except by chance). Difficulties in alignment refinement may arise from sequence divergence but also from structure changes and function variations.

2.2.2 Structural Alignment Refinement

Currently, few tools tackle the problem of automatic refinement of sequence alignments but promising approaches have been described recently (Deane et al. 2001; Pei et al. 2003). However, various internal controls may be used for the selection and refinement of structural alignments using available techniques including three-dimensional structure visualization.

One may evaluate the "stability" of a given alignment by adding new sequences significantly similar either to the template or to the target as well as experimentally solved structures that superpose well onto the template. Checking the agreement of secondary structure predictions (for the query sequence) with secondary structure assignment (for the template) is important for distantly related proteins (Errami et al. 2003; Callebaut et al. 1997). Other criteria may be taken into account (particular phi/psi angles, burial, hydrogen bonding capabilities, helix capping, etc.) and may be visualized on the structural alignment using the JOY format (Mizuguchi et al. 1998). However, at low levels of sequence conservation, structural alignment should also be evaluated more precisely at the three-dimensional level.
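The check of predicted against assigned secondary structure can be sketched as a simple agreement score over the aligned columns. The sketch assumes one-letter secondary structure states (e.g. H, E, C) given per residue for the ungapped query and template; the function name and the choice to ignore gapped columns are illustrative assumptions.

```python
def ss_agreement(query_aln: str, template_aln: str,
                 query_ss_pred: str, template_ss: str) -> float:
    """Fraction of aligned columns where predicted and assigned states agree.

    query_aln / template_aln: the two rows of the structural alignment ('-' = gap).
    query_ss_pred: predicted state for each residue of the ungapped query.
    template_ss: assigned state for each residue of the ungapped template.
    """
    if len(query_aln) != len(template_aln):
        raise ValueError("alignment rows must have equal length")
    if len(query_ss_pred) != len(query_aln.replace("-", "")):
        raise ValueError("query prediction does not match query length")
    if len(template_ss) != len(template_aln.replace("-", "")):
        raise ValueError("template assignment does not match template length")

    i = j = 0            # residue indices in the ungapped query and template
    compared = matches = 0
    for q, t in zip(query_aln, template_aln):
        if q != "-" and t != "-":
            compared += 1
            matches += query_ss_pred[i] == template_ss[j]
        if q != "-":
            i += 1
        if t != "-":
            j += 1
    return matches / compared if compared else 0.0
```

A low agreement for an otherwise plausible alignment would point to the regions worth inspecting at the three-dimensional level.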

One may rapidly build (or rather extract) a "crude model", e.g. using the program T.I.T.O. (Labesse and Mornon 1998). Such a partial structure includes only strictly conserved residues (including both backbone and side-chain atoms) and the backbone of distinct but aligned residues. Neither optimization nor loop building at the indels is performed, so that no error is added by more complex model building methods that could mask alignment errors. Clusters of strictly conserved residues (e.g. catalytic triad) and/or conservation of topohydrophobic residues (Poupon and Mornon 1998) would suggest functional conservation (e.g. catalytic mechanism) and/or indicate a lower global structure divergence, respectively. A related approach was implemented in THREADLIZE (Pazos et al. 1999). Visual evaluation of the quality of a structural alignment often suggests numerous local changes in the sequence alignment. These changes may be transposed into a new "crude model". A new round of alignment editing, common-core extraction and assessment is necessary for this trial-and-error optimization. Until recently, the various steps involved in this tedious and time-consuming process were performed by several programs, e.g. a multiple-alignment editor such as SEAVIEW (Galtier et al. 1996), T.I.T.O. (Labesse and Mornon 1998) and a macromolecular structure visualization tool such as XmMol (Tuffery 1995), Swiss-PDB viewer (Guex and Peitsch 1997) or Rasmol (Sayle and Milner-White 1995). Two programs gathering most of the previous properties (i.e. editing and visualization) are now available to help with this task (Modview: Ilyin et al. 2002; ViTO: Catherinot and Labesse, unpubl.).
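The extraction rule for a crude model (full residue for strictly conserved positions, backbone only for aligned but different residues, nothing at indels) can be written down as a small planning step. This is a sketch of the rule as stated in the text, not the T.I.T.O. implementation; the function name and the output labels are hypothetical.

```python
from typing import List, Tuple


def crude_model_plan(target_aln: str, template_aln: str) -> List[Tuple[int, str, str]]:
    """Decide which template atoms to copy for each aligned column.

    Returns (template_residue_index, template_residue, atoms_to_copy), where
    atoms_to_copy is 'backbone+sidechain' for identical residues and
    'backbone' for aligned but different residues. Indels are skipped:
    no loop building and no optimization is attempted.
    """
    if len(target_aln) != len(target_aln.strip()) or len(target_aln) != len(template_aln):
        raise ValueError("alignment rows must have equal length and no padding")
    plan = []
    pos = 0  # 0-based index into the ungapped template sequence
    for t, s in zip(target_aln, template_aln):
        if s == "-":
            continue  # insertion in the target: there is nothing to copy
        if t == "-":
            pass      # deletion in the target: the template residue is dropped
        elif t == s:
            plan.append((pos, s, "backbone+sidechain"))
        else:
            plan.append((pos, s, "backbone"))
        pos += 1
    return plan
```

Editing the alignment and re-running such a step corresponds to one round of the trial-and-error cycle described above.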

2.2.3 Active Site Recognition

Determination of the active site location and prediction of the protein function are essential steps in the "post-genomic era". This may soon become automated, based on both modeled structures and sequence conservation using "evolutionary traces" (Lichtarge et al. 1996; Aloy et al. 2001; Yao et al. 2003). Another methodology, based on sequence conservation and active site geometry analysis (Fetrow and Skolnick 1998), has recently been developed for comparative searches. The methods for recognition of active sites may also reveal loss-of-function evolution (Kniazeff et al. 2002). The significance of the conservation of a cluster of amino acids can also be used to identify subfamilies of related proteins. This can be performed using statistical tools such as PATTINPROT (Combet et al. 2000) or PHI-BLAST (Zhang et al. 1998) even at a low level of sequence conservation (~15%) to confirm fold recognition (Labesse et al. 2001) or to characterize the catalytic mechanism and/or ligand specificity (Carret et al. 1999; Ganem et al. 2003; Reid et al. 2003). Identification of the amino acids involved in the protein activity may also be useful at the model completion step by providing additional restraints (see Sect. 2.3).

2.2.4 A Biological Application

As an example, we have described the study of the human copper transporter Hah1, the crystal structure of which has been solved (Wernimont et al. 2000). Correct identification of the compatible folds may now be obtained using any sequence-structure comparison tool, even at a very low sequence identity (e.g. 12 %). A similar approach was previously applied to correctly model this protein at 20 % sequence identity (Hung et al. 1998). Perfect alignment could be achieved by restraining, as much as possible, the deletions to lie in between positions close in space to each other (measured as Cα(i)-Cα(i+1) distances in the resulting "crude model") and outside of secondary structure elements. The completion of the structure model highlighted additional features such as putative salt-bridges. Model-guided experiments (directed mutagenesis, DTNB labeling or UV-visible spectroscopy of the cobalt-Hah1 complex) quickly validated the proposed alignment (Hung et al. 1998).

Hah1(1FE4)  --MPKHEFSVD-MTCGGCAEAVSRVLNKLGGV-KYDIDL
1AFJ        -ATQTVTLAVPGMTCAACPITVKKALSKVEGVSKVDVGF
hah1_TITO   --MPKHEFSV-DMTCGGCAEAVSRVLNKLG-GVKYDIDL
1AFJ_TITO   -ATQTVTLAVPGMTCAACPITVKKALSKVEGVSKVDVGF
hah1_mGT    --MPKHEFSV-DMTCGGCAEAVSRVLNKLGGVK-YDIDL
1AFJ_mGT    -ATQTVTLAVPGMTCAACPITVKKALSKVEGVSKVDVGF
hah1_3DP    -MPKHE-FSV-DMTCGGCAEAVSRVLNKLGGV-KYDIDL
1AFJ_3DP    -ATQTVTLAVPGMTCAACPITVKKALSKVEGVSKVDVGF
hah1_T99    --MPKHEFSV-DMTCGGCAEAVSRVLNKLGGVK-YDIDL
1AFJ_T99    AT-QTVTLAVPGMTCAACPITVKKALSKVEGVSKVDVGF

Hah1(1FE4)  PNKKVCIESE---HSMDTLLATLKKTGKTVSYLGLE---
1AFJ        EKREAVVTFDDTKASVQKLTKATADAGYPSSVKQ-----
hah1_TITO   PNKKVCIESE---HSMDTLLATLKKTGKTVSYLGLE---
1AFJ_TITO   EKREAVVTFDDTKASVQKLTKATADAGYPSSVKQ-----
hah1_mGT    PNKKVCIESE---HSMDTLLATLKKTGKTVSYLGLE---
1AFJ_mGT    KREAVVTFDDTKASVQKLTKATADAGYPSSVKQ-----
hah1_3DP    PNKKVCIESEH---SMDTLLATLKKTGKTVSYLGLE---
1AFJ_3DP    EAVVTFDDTKASVQKLTKATADAGYPSSVKQ-----
hah1_T99    PNKKVCIESE---HSMDTLLATLKKTGKTVSYLGLE---
1AFJ_T99    EKREAVVTFDDTKASVQKLTKATADAGYP-----SSVKQ

Fig. 2. Sequence-structure alignments of Hah1 (PDB1FE4) and PDB1AFJ. Sequence-structure alignment produced by optimal superposition or as published before the determination of the crystal structure PDB1FE4 or as computed by the programs mGenThreader (Jones 1999), 3D-PSSM (Kelley et al. 2000) or SAM-T99 (Karplus et al. 1998). Discrepancies among alignments are indicated by the asterisk (*)

Fig. 3. Stereographic view of the superposition of Cα traces of PDB1FE4 and PDB1AFJ according to the sequence-structure alignment produced by the program SAM-T99. The crystal structure of PDB1FE4 (Wernimont et al. 2000) is drawn in thin black lines. PDB1AFJ (Steele and Opella 1997) is in grey, with thick (aligned) or thin ("indels") lines
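The criterion used for Hah1, namely that a deletion should sit between positions that are close in space in the crude model, can be checked numerically. The sketch below assumes that the template Cα coordinates are available as (x, y, z) tuples in template residue order; the function name is hypothetical, and the comparison with the usual distance of about 3.8 Å between consecutive Cα atoms is left to the user.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def deletion_flank_distances(target_aln: str, template_aln: str,
                             template_ca: Sequence[Coord]) -> List[Tuple[int, int, float]]:
    """Distance between the Calpha atoms that flank each deletion in the target.

    For every pair of consecutive target residues (i, i+1) copied from
    non-consecutive template residues, return (i, i+1, distance in the
    template frame). Large distances suggest a misplaced deletion.
    """
    if len(target_aln) != len(template_aln):
        raise ValueError("alignment rows must have equal length")
    results = []
    ti = si = -1   # current residue index in ungapped target / template
    prev = None    # (target index, template index) of the last aligned column
    for t, s in zip(target_aln, template_aln):
        if t != "-":
            ti += 1
        if s != "-":
            si += 1
        if t != "-" and s != "-":
            if prev is not None and prev[0] == ti - 1 and prev[1] != si - 1:
                d = math.dist(template_ca[prev[1]], template_ca[si])
                results.append((prev[0], ti, d))
            prev = (ti, si)
    return results
```

Shifting a gap by one or two columns and recomputing these distances gives a quick, if rough, way to compare alternative placements of the same deletion.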

2.3 Complete Model Achievement

The frequent need for manual refinement of sequence-structure alignments at a low level of sequence identity (<25 %; see Sect. 2.2) would suggest that no automatic modeling should currently be directly connected to sequence similarity searches. However, subsequent completion of the three-dimensional structure modeling may sometimes result in good models, implying that, in this case, a correct structural alignment was achieved. Automatic modeling using several unrefined structural alignments may be performed in parallel using a pipeline dedicated to protein structure modeling such as @TOME. Otherwise, alternative alignments (e.g. suboptimal alignments according to the scoring schemes of automatic procedures) are to be generated and tested. Recognizing the correct model out of numerous incorrect ones will then be the next important step (see Sect. 2.4) before one might consider that the resulting macromolecular models are relevant for drug design (see Chap. 3).

2.3.1 Global Structure Modeling

Once a structural alignment is available, a common core is deduced (corresponding to aligned residues; see Sect. 2.2.2) and amino acid changes and indels are delineated. A complete structure may be built from this starting point using various approaches. Completion of the model implies either adding missing parts, or fragments, onto the common core or building and folding the whole structure at once. These methodologies were inspired by the manner in which structures are modeled by X-ray crystallographers or the approach to folding structures using NMR constraints. In between these two approaches, a hybrid methodology is based on databases of protein structure fragments which are used to build missing parts and also to rebuild (or optimize) any parts including the common core. At CASP5, difficult targets (e.g. T0130) were modeled in a better way by mixing large fragments from different but related three-dimensional structures. Such chimeric structures might appear also at a finer level as illustrated by the mycobacterial TMP kinase (Munier-Lehmann et al. 2001). The extension of this technique is already available through the use of several templates by more popular modeling programs such as MODELER (Sali and Blundell 1993) and COMPOSER (Srinivasan and Blundell 1993). Other programs and web servers are also available (e.g. SWISS-MODEL: Guex and Peitsch 1997; Geno3D: Combet et al. 2003). The speed and efficiency of the current modeling software allow the building of models to improve gene detection in genomes (Gopal et al. 2001) or to set up databases such as ModBase (Sánchez et al. 1999) covering, so far, roughly 25 % of protein sequences. Twofold higher coverage can be obtained, but at the expense of significantly lower structural alignment and structure model quality.

At high levels of sequence identity (above 25 %) little difference in the quality of the modeled structures is observed regardless of which software is used. However, more precise or particular modeling studies will require taking advantage of some specific features of these tools (additional restraints in MODELER such as inter-atomic distance or secondary structure predictions). Otherwise, dedicated programs may be required for specific tasks such as side-chain conformational searches (Sect. 2.3.2), indel building (Sect. 2.3.3) and/or energy minimization (Sect. 2.3.5). We emphasize, here, their general use, their complementarity as well as their potential use for ligand docking and drug design.

2.3.2 Optimization of Side-Chains Conformation

Several tools such as SMD (Tuffery et al. 1997) or SCWRL (Dunbrack and Karplus 1993) are available to build side-chains onto a fixed backbone. They use dedicated rotamer libraries and optimized space search procedures. SCWRL is one of the most popular and is currently made available on the server @TOME. Predicted orientations of side-chains are up to 80 % correct (percentage of chi1 dihedral angles within 40° of the actual value) for models built by homology. Current improvement now comes from the use of a huge number of conformers for each amino acid, to overcome potentially misleading small van der Waals clashes (but at the expense of the CPU time required). Optimized scoring functions are another route to improvement (Liang and Grishin 2002).

Fig. 4. View of the active site of the protein kinase AKT. The active site structure was modeled using as a template PKA (Engh et al. 1996). The Cα trace (thin) and the ligand H8 (thick) are indicated by grey lines. Side chains of three residues, threonine T141, aspartate D142 and methionine M131 (T183, D184 and L173 in PKA: PDB1YDS), are indicated by thick black lines. Their orientations were computed by SCWRL (Dunbrack and Karplus 1993) using, in the absence of H8, either no restraint or constraining the strictly conserved side-chains (e.g. T141 and D142). For clarity, the ligand H8 is shown in its position in PKA
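The accuracy figure quoted for side-chain placement (chi1 within 40° of the actual value) is easy to compute once the periodicity of dihedral angles is handled. The following sketch assumes angles given in degrees for the same residues in the same order; the function names are illustrative.

```python
from typing import Sequence


def angular_difference(a: float, b: float) -> float:
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def chi1_accuracy(predicted: Sequence[float], observed: Sequence[float],
                  tolerance: float = 40.0) -> float:
    """Percentage of chi1 angles predicted within `tolerance` degrees of the observed value."""
    if len(predicted) != len(observed):
        raise ValueError("both lists must describe the same residues")
    if not predicted:
        raise ValueError("no angles to compare")
    correct = sum(angular_difference(p, o) <= tolerance
                  for p, o in zip(predicted, observed))
    return 100.0 * correct / len(predicted)
```

The wrap-around matters in practice: a predicted chi1 of 180° and an observed value of -175° differ by 5°, not 355°.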

At a low level of sequence identity, active site residues (even those strictly conserved) are usually not properly optimized (generally due to a particular environment and specific conformational constraints). In our experience, constraining the original side-chain orientations (to those observed in the template) is often more accurate. This approximation is valid only when similar ligands are expected to bind and/or similar conditions are modeled (e.g. similar allosteric conformations). The use of constraints on the strictly conserved residues has yet to be carefully evaluated (on a larger scale and ahead of ligand docking experiments). Similarly, maintaining a bound ligand while optimizing side-chain conformations may be important prior to virtual screening or docking of ligand analogs. This is illustrated by the catalytic aspartate in protein kinases (D184 in PKA) whose orientation is dramatically changed in the presence of the inhibitor H8 compared to other ligands (Engh et al. 1996). The stabilization of this particular conformation comes from a neighboring threonine (T183 in PKA) hydrogen bonded to the side chain of aspartate D184. Similarly, to maintain the active site pocket "open" enough to allow ligand docking, one may favor modeling a complex with a ligand kept bound.

Template choice (when possible) and specific constraints will depend on the conformation to target and/or the type of ligands to search. Setting up constraints should be carefully revised when significant structural rearrangements are expected in the vicinity (e.g. due to indels).

2.3.3 Insertions/Deletions Building

Different techniques are required according to the length of the "indels", which are generally considered to correspond to loop segments. However, this is no longer true at low levels of sequence identity (below 25 %) as secondary structure elements may vary in length and number among related structures. Modeling of substantial indels, taking into account local secondary structure predictions, is still in its infancy and mainly carried out manually (Aloy et al. 2000). Short indels (usually between three and eight amino acids in length) are modeled more accurately than longer ones. Modeling of indels may be based solely on their own sequences (Deane and Blundell 2001) or it may take into account the potential influence of the surrounding environment (Burke et al. 2001).

Short loops are mainly modeled by taking into account the flanking elements and the sequence of the loop itself. Families of short-loop structures have been defined showing some clear clusters (Kwasigroch et al. 1997; Wojcik et al. 1999) despite the known flexibility of these protein regions. This kind of loop is efficiently modeled using fragments sharing similar sequences and/or compatible geometries (fitting to flanking elements). The fragment-based approaches rely on protein structure databases that should be optimally set up due to high redundancies in the PDB (http://www.rcsb.org/pdb/). Criteria that are too stringent will remove closely related fragments from such pre-processed databases preventing a fine-grained search while ensuring higher speed.

For longer loops (above 12 amino acids in length), additional restraints are necessary to achieve convergence. Their construction may better rely on ab initio modeling (Bystroff and Baker 1998; De Pristo et al. 2003) rather than on comparative modeling despite the need to take into account the surrounding structural elements and the anchoring points. In some cases, very long indels correspond to subdomains that can be modeled independently and are fused later on (see CASP5 results).

The most promising improvement comes from conformation optimization using a specific force field including terms from a potential of mean force at the atomic level. This force field is too CPU-intensive to be used on the global structure. This new loop building approach significantly improved the likelihood of the conformation and it was shown to lower the RMSD (down to 2 Å) of most modeled loops (Fiser et al. 2000). Further improvements (Fiser et al. 2002; de Bakker et al. 2003) come from the use of the Generalized-Born solvation approximation to select and/or optimize loop conformations.
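For reference, the RMSD used to judge modeled loops can be sketched as follows. The sketch assumes that the modeled and reference loops contain the same atoms in the same order and are already expressed in a common frame (for instance after superposing the rest of the structure); no superposition is performed here, and the function name is an illustrative choice.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def loop_rmsd(model: Sequence[Coord], reference: Sequence[Coord]) -> float:
    """Root-mean-square deviation between two equally ordered sets of atoms.

    Assumption: both coordinate sets are already in the same reference frame.
    """
    if len(model) != len(reference):
        raise ValueError("both loops must contain the same number of atoms")
    if not model:
        raise ValueError("no atoms to compare")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(model, reference))
    return math.sqrt(squared / len(model))
```

Whether the deviation is computed on the loop alone or after superposing the flanking core changes the value noticeably, so the convention should be stated alongside any reported figure.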

2.3.4 Modeling Protein Quaternary Structures

Protein-protein associations play a major role in biology, notably in signaling cascades in eukaryotes or in complex biosynthetic pathways (ribosome, photosynthesis, etc.) and may represent therapeutic targets. The huge number of possible complexes, especially in eukaryotic cells, due to the large protein families involved (e.g. more than 200 human SH3 domains) also calls for the analysis of their specificity through quaternary structure modeling. Furthermore, active sites might be formed or stabilized through macromolecular interactions (e.g. the dimer of the target T0132 at CASP5). Predictions of the quaternary structures have long been too demanding in CPU time and are also dependent on the experimental determination of complexes. However, potentially rapid experimental evaluation of the quaternary structure (or interactions) makes these predictions more attractive. Such predictions may also be performed in conjunction with low-resolution structure determination (Beckmann et al. 2001). Furthermore, the recently developed macromolecular structure database (PQS; http://pqs.ebi.ac.uk/) facilitates the retrieval of most likely quaternary structures from crystal structures. Our server @TOME provides an easy way towards the modeling of the quaternary structure, using MODELER, when structural data are available in the PQS.

In some particular cases, analysis of the putative quaternary structure may confirm putative similarity. For example, modeling of a trimeric structure of the major porin from Campylobacter jejuni has confirmed weak sequence similarities (~15% over 400 residues) with better-known enterobacterial malto- and sucroporins (Labesse et al. 2001). The best conserved sequence motifs in these bacterial porins lie at the monomer-monomer interface, especially on the trimer axis. In contrast, the external loops as well as the strands facing the lipid membrane show little or no sequence conservation. Furthermore, a putative di-cation binding site at the interface in the model (each monomer providing an aspartate) could then be predicted (Labesse et al. 2001). MultiPROSPECTOR (Lu et al. 2002) represents an automation of this approach by taking advantage of the potential conservation of the quaternary structure to refine threading searches.

Modeling indels and positioning of side chains may be improved if performed in the correct macromolecular context. Furthermore, theoretical evaluation of a modeled structure (see Sect. 2.4) in an incorrect environment (exposing residues normally buried at the interface) might be misleading. The example of the CDK/cyclin complex (Davies et al. 2001) shows that the binding of a macromolecular partner can favorably influence the active site geometry.

All this would prompt us to predict and to build correctly the actual quaternary structure. At a high level of sequence identity, quaternary structure is likely conserved. It will be easily modeled using methods developed for monomeric structures. At lower sequence identity its conservation may be more questionable and model building will require additional skills.

Evolutionary trace analysis (see Sect. 2.2.3) of large protein families is a convenient tool to predict common interfaces based on structural alignments. Servers are now available to rapidly perform such analyses (Armon et al. 2001). A posteriori analysis might also be convenient to identify a potential interface. One way is to evaluate each monomer first separately and then embedded in the putative complex using tools for model quality evaluation such as Verify3D (see Sect. 2.4.1), which is made available on the server @TOME.

Another way to model quaternary structure is to build partners independently and then try to bring them in contact. This field has been reviewed recently (Smith and Sternberg 2002) and several docking programs are available (Katchalski-Katzir et al. 1992; Smith and Sternberg 2003; Nussinov and Wolfson 1999; Goodsell et al. 1996; Lorber et al. 2002). The use of different methods in parallel and consensus scoring are convenient ways to improve current performance. Low-resolution protein-protein docking (Vakser 1996) is a convenient tool for docking modeled structures (screening out small discrepancies in the monomeric models; Tovchigrechko et al. 2002). Some applications have been recently published such as the modeling of vitronectin, a multi-domain protein, using threading, modeling and docking (Xu et al. 2001). However, the results of the CAPRI experiment (http://capri.ebi.ac.uk) suggest that more developments are necessary before protein-protein docking can be used routinely (Janin et al. 2003).

2.3.5 Energy Minimization and Molecular Dynamics

Additional steps may be required to regularize the geometry of the modeled structure, especially in the vicinity of indels (see Sect. 2.3.3). Energy minimization may improve bond length and valence angle values as well as eliminate severe van der Waals clashes. It will not bring atoms closer to their actual position. Due to the roughness of the energy landscape, energy minimizations are easily trapped in local minima. These limitations explain why energy-minimized structures, generally, show slightly increased global deviation (as measured by atomic root-mean-square deviation versus the actual structure) compared to the un-minimized models (or the starting template).

Besides energy minimization, trajectory simulation (molecular dynamics) may also be performed with similar master equations. Molecular dynamics may be used to explore the conformational space. Snapshots in the trajectory may result in models as good as the starting ones (according to various structural criteria; Flohil et al. 2002). This may be used to show the precision (or error) of the models. In MODELER (Sali and Blundell 1993), energy minimization and molecular dynamics are used to optimize and generate distinct models of the same query sequence. Largely deviating regions generally correspond to long indels and may be considered to be incorrectly modeled. Further improvements in available CPU and force fields may lead, in the near future, to more suitable energy simulation for model optimization.
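Locating the largely deviating regions among several models of the same sequence can be sketched as a per-residue spread calculation. The sketch assumes that all models share one coordinate frame (e.g. they were built on the same template) and that each model is given as a list of Cα coordinates in residue order; the function names and the 2 Å cutoff are illustrative assumptions, not values taken from the cited programs.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def per_residue_spread(models: Sequence[Sequence[Coord]]) -> List[float]:
    """Largest Calpha-Calpha distance observed between any two models, per residue."""
    if len(models) < 2:
        raise ValueError("at least two models are required")
    n = len(models[0])
    if any(len(m) != n for m in models):
        raise ValueError("all models must have the same number of residues")
    return [max(math.dist(a[i], b[i]) for a, b in combinations(models, 2))
            for i in range(n)]


def deviating_residues(models: Sequence[Sequence[Coord]], cutoff: float = 2.0) -> List[int]:
    """0-based indices of residues whose spread exceeds `cutoff` (in angstroms)."""
    return [i for i, s in enumerate(per_residue_spread(models)) if s > cutoff]
```

Residues returned by `deviating_residues` would be natural candidates for the cautious interpretation suggested above, particularly when they coincide with long indels.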

2.4 Model Validation

2.4.1 Theoretical Model Validation

Several tools are now available to validate three-dimensional structures at different levels of accuracy. At a very high level of sequence identity (above 50 %), small deviations from actual coordinates may be achieved and programs dedicated to experimental structure evaluation are suitable (e.g. WHAT-CHECK; Hooft et al. 1996). At lower sequence identity (25-50 %), deviation from standard stereochemistry may not correlate with the overall quality of the model (especially after energy minimization; see Sect. 2.3.5). Evaluating non-bonded interatomic interactions with atomic statistical potentials such as ERRAT (Colovos and Yeates 1993), ANOLEA (Melo and Feytmans 1998) or SOESA (Wall et al. 1999) may be more suitable. Below 25% sequence identity, model evaluation should rather be performed at the residue level. PROSA II (Sippl 1993) and Verify3D (Eisenberg et al. 1997) are used to assess automatic modeling by MODELER on the server @TOME. In our experience, mainly at low levels of sequence identity (15-25 %), good models have a mean score between 0.3 and 0.4 using Verify3D and between -0.7 and -1.0 in PROSA.

Precise and local analysis may be required in particular cases. Simultaneous visualization of the score and the three-dimensional structure may be done using visualization programs (using the B-factor values to input scores). Specific features remain to be implemented to handle original configurations, which are mostly observed in the active sites (or binding sites). Residues contacting ions (especially those involved in metal coordination) and/or deeply buried ligands (especially co-factors) have a non-classical environment resulting in disturbed evaluation. Interactions with charged compounds may imply clustering of similarly charged residues (e.g. lysines and arginines for phosphate binding). Similarly, particularities may be observed in thermostable proteins, which may be stabilized by buried salt bridges (or even a buried ion binding site such as in α-amylases). When buried in the modeled structure, charged or highly hydrophilic residues are often considered to be incorrectly modeled. Attention must be paid to the conservation of these polar and buried residues and/or to counterbalancing residues (especially correlated substitutions) or chemical groups (backbone atoms, and substrate or co-factor). When such particular features are observed, evaluation of the model quality requires the assessment of the template structure as well.
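Loading per-residue scores into the B-factor field, so that a viewer can color the structure by score, can be sketched as below. The code relies on the fixed-column PDB layout (chain identifier in column 22, residue number in columns 23-26, temperature factor in columns 61-66); the function name and the dictionary keyed by (chain, residue number) are assumptions made for illustration, and insertion codes are ignored.

```python
from typing import Dict, Iterable, List, Tuple


def scores_to_bfactors(pdb_lines: Iterable[str],
                       scores: Dict[Tuple[str, int], float]) -> List[str]:
    """Return PDB lines with the B-factor field replaced by per-residue scores.

    scores maps (chain_id, residue_number) to a score, e.g. a Verify3D or
    PROSA value. Atoms of residues without a score are left unchanged.
    """
    out = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")) and len(line) >= 66:
            key = (line[21], int(line[22:26]))
            if key in scores:
                # Columns 61-66 (0-based slice 60:66) hold the temperature factor.
                line = f"{line[:60]}{scores[key]:6.2f}{line[66:]}"
        out.append(line)
    return out
```

Most molecular viewers can then color by B-factor, which makes poorly scoring stretches stand out against the three-dimensional context.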

When a protein structure has been determined under various conditions and shows some rearrangements, models of homologues built using the various known forms might indicate some preferred conformations. To what extent this technique can be generalized remains an open question. However, application of this strategy to the eukaryotic cyclin-dependent kinase CDK7 suggested that it might not require cyclin binding for full activity due to subtle amino acid changes in the vicinity of the activation loop. Among these changes, one is a tyrosine to phenylalanine substitution (tyrosine Y15 in CDK2) in the glycine-rich loop and other changes occurred at the N-terminus of the activation loop. The predicted higher stability of the active form due to these correlated changes is in agreement with the observed behavior of this CDK (Martinez et al. 1997).

2.4.2 Ligand-Based Model Selection

Methods testing the complementarity with known ligands may rank protein models better than general structural criteria (e.g. sequence identity, intermolecular energies, etc.). This has been applied recently by Johnson et al. (2003) in the case of the anti-Shigella flexneri Y monoclonal antibody complexes. Virtual docking methods (described in Sect. 3.2) may be used on a limited set of experimentally characterized binders (or of binders inferred from clear protein homology).

The docking of a common substrate (e.g. TMP) in three TMP kinases (from Haemophilus influenzae, Yersinia pestis and Bacillus subtilis) modeled using the related TMP kinase from Escherichia coli (75, 75 and 30 % identical, respectively) was used to check the quality of the modeled active site structure (Pochet et al. 2002). Correct docking scores and positions were obtained for the enterobacteria, while a poor docking score was obtained for the enzyme from B. subtilis. This discrepancy is due to a van der Waals clash with a buried proline not present in the template structure, as shown by docking on a modeled mutant form (P104A) of the same TMP kinase. This suggested some difficulties in taking into account the structural constraints due to a substitution toward a proline in a buried helix. Remodeling this TMP kinase locally would be necessary prior to further ligand screening at high resolution.
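The ligand-based ranking described in this section can be sketched in a few lines of Python. The sketch assumes that docking scores have already been produced by some external docking program, that lower (more negative) scores indicate better complementarity, and that the mean over known binders is an acceptable summary; the function name, model names, ligand names and numerical values are all hypothetical placeholders, not results from the studies cited above.

```python
from statistics import mean
from typing import Dict, List, Tuple


def rank_models_by_docking(
    scores: Dict[str, Dict[str, float]]
) -> List[Tuple[str, float]]:
    """Rank models by their mean docking score over known binders.

    `scores[model][ligand]` is a docking score; lower is assumed better.
    Models with no docked ligand are skipped.
    """
    ranked = [
        (model, mean(per_ligand.values()))
        for model, per_ligand in scores.items()
        if per_ligand
    ]
    return sorted(ranked, key=lambda item: item[1])


if __name__ == "__main__":
    # Placeholder values for illustration only.
    example = {
        "model_a": {"ligand_1": -8.0, "ligand_2": -7.0},
        "model_b": {"ligand_1": -3.0, "ligand_2": -4.0},
    }
    for model, score in rank_models_by_docking(example):
        print(model, score)
```

A model that ranks poorly in such a comparison, like the B. subtilis enzyme above, is a candidate for local remodeling rather than immediate rejection.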

2.4.3 Experimental Evaluation of Models

Several biochemical and biophysical characterizations of protein structures are likely to provide restraints to evaluate a theoretical model at a very low cost in time and in material. However, one should make sure to use methods that eliminate alternate models (Hurle et al. 1987). For example, limited proteolysis can be extremely powerful, especially when the cleavage site lies in the protein active site (Bucurenci et al. 1996) or on one particular face of the protein (Labesse et al. 2001). Similarly, tryptophan fluorescence may help to monitor substrate orientation and/or a putative induced fit in the active site (Marrakchi et al. 2002; Cohen-Gonsaud et al. 2002). Mass spectrometry is currently the method of choice in conjunction with other techniques including specific labeling, cross-linking (Young et al. 2000), endo- and exo-proteolysis or, in the case of small proteins, oxidation/reduction (Hung et al. 1998). When quaternary structures are predicted, model evaluation might be easily performed using cross-linking or gel permeation. This, in turn, may highlight some instability or the importance of some conformational change (Marrakchi et al. 2002; Cohen-Gonsaud et al. 2002). Directed mutagenesis is an alternative way to check the functional role of particular residues (Labesse et al. 2001; Kniazeff et al. 2002; Ganem et al. 2003), but it is usually more demanding and carries the risk of pleiotropic effects that make the results difficult to analyze. Chimeras of closely related proteins with distinct ligand specificities are an elegant means of building new targets to assess precisely the predicted modes of binding (Malherbe et al. 2003). The most precise and most useful validation may be functional assessment through enzymology or affinity measurements, especially prior to drug design (Carret et al. 1998; Ganem et al. 2003). With a significantly larger amount of sample (~10 mg), SAXS and ultracentrifugation might be used to assess the overall structure of oligomers as well as the structure of monomers (Bada et al. 2000). Solving the protein structure experimentally, at atomic resolution, corresponds to a final assessment. Only good models are currently suitable to speed up X-ray crystallography using molecular replacement (Jones 2001). In the near future, models may also facilitate NMR spectroscopy. Experimental structures are usually more suitable for drug design and virtual screening (see Chap. 3) but are currently determined at a low throughput. Prior macromolecular modeling in connection with tuned ligand docking may lead to easier and faster experimental structure determination (e.g. by identifying or by providing a stabilizing ligand) which, in turn, will help further ligand optimization.
