A method for constructing thousands of compact protein conformations from fragments and then connecting these structures to
form a network of physically plausible folding pathways.
We sample the conformational space of small proteins and link these conformations into a graph of physically plausible folding pathways. Conformational sampling is performed using HMMSTR, a hidden Markov model for local sequence-structure correlations. Our method uses only the amino acid sequence to bias the conformational sampling; no knowledge of the native structure is used. This sampling strategy drastically reduces the size of the conformational space to be searched. We then build a probabilistic roadmap (PRM) graph and find paths that have the lowest energy climb. We find that favored folding pathways exist, corresponding to deep valleys in the energy landscape. In contrast to previous PRM methods that used knowledge of the native structure to sample conformational space, this is the first attempt to merge the previous successes in fragment assembly methods with PRM methods.
HMMSTR [1] is a hidden Markov model for local sequence-structure correlation. It uses knowledge of preferred orientations of amino acid sequences from data in the Protein Data Bank (PDB) to predict the φ, ψ torsion angles of local sequences. Given the amino acid sequence code of a protein and a window size, HMMSTR generates a set of likely φ, ψ angles for each overlapping window for the entire sequence. Over 150 motifs are modeled as sequence-structure correlations in HMMSTR. HMMSTR uses PSI-BLAST to generate a list of similar protein sequences. Once we have these local structures, we then proceed to build a complete configuration out of these local structures. We start off with a random anchor window in the protein, and then choose a random angle set (φ, ψ angles for each residue in the window) with probability proportional to its score. We then walk towards both the ends of the protein from the anchor window, assigning angle sets for all remaining windows.

Figure 1: Sample generation is done in three steps. First we use amino acid sequence
to get local structures using HMMSTR. Then these local structures are linked together
to form complete structures. Finally we energy minimize each sample.
Although HMMSTR predicts the possible local structure of an amino acid sequence, it does not give us any information about non-local interactions. Hence we perform Monte Carlo minimization on each samples to minimize hydrogen bond energy (HBE) and. As a result of this, we get samples that are compact and show non-local hydrogen bonding. This energy minimization step is however not useful for all type of proteins, as shown in Figure 2.

Figure 2: Distance (measured as distance matrix error) of samples from the native
configuration, before (in red) and after (in green) energy minimization. 16 residue
long β-hairpin - 2GB1(16) samples are much closer to native after the minimization
step. However 1VII(36), which is entirely made out of α-helices shows minimal
changes after energy minimization. This is because most of the HMMSTR samples already
have α-helices that are compact and have hydrogen bonds. 1E0L(28) has two
β-hairpins and shows intermediate changes.
The C-space of a protein is the set of all possible conformations. However we are interested in just the set of valid conformations (C-free). One way to calculate the size of C-free (|Cfree|) is to discretize the C-space and then count only those low energy configurations that are biologically and physically plausible. This might be difficult to do because even for small proteins this will be a very large number. We instead compare the |C-free| of different proteins by looking at the rate at which new configurations are found (captured) during the sample generation process. Figure 3 show this capture/recapture curve for three different proteins. We notice that HMMSTR samples have a much smaller C-free size when compared to randomly generated samples. Another interesting observation is that although 1VII(36) has more residues than 1E0L(28), it has a much smaller C-free. This suggests that the size of C-free does not depend entirely on the number of residues for a given sequence of amino acids, but also on its secondary structures. In the case of 1VII, the smaller C-free size might be due to helices in its structure, which restrict the range of torsion angles, and thus reduce the overall number of degrees of freedom when compared to a β-sheet protein.
Figure 3: These curves represent the rate atwhich a new sample is found using
HMMSTR sampling. For comparison,random sampling of 2GB1(16) is also
shown. The almost linear line means that every sample generated is a new one.
HMMSTR-sampled 2GB1(16) however flattens out very quickly, indicating a small
C-free. 1VII(36) seems to have a smaller C-free as compared to 1E0L(28) even
though it has more residues.
The PRM [2] graph is a nearest neighbor graph that represents the protein folding funnel landscape. Each node of this directed graph represents a valid low energy configuration (no vdW collisions) of the protein, and each edge (u, v) represents a possible transition from configuration u to v. We use Dijkstra’s shortest path algorithm to find the folding pathways on the PRM graph. The weight of each edge is set to the difference in energy of the two configurations All negative edge weights are set to zero. As a result, we get pathways with lowest energy climb.

Figure 4: Illustration of a PRM graph. Blue nodes represent valid vdW collision free configurations. Red nodes represent invalid configurations
with vdW collisions. The green node is the native configuration. An arrow
represents a valid transition between the two configurations. A sample folding pathway is
shown with edges highlighted.
We evaluate our approach with three different proteins: a 16-residue long β-hairpin from Protein G – 2GB1(16), a 28 residue long Fbp28Ww Domain from Mus Musculus – 1E0L(28), and a 36 residue long subdomain from Chicken Villin Headpiece – 1VII(36). However we were unable to generate pathways for 1VII(36) due to difficulty in connecting the native configuration to other configurations in the PRM graph. After generating pathways, we built a highway graph [3], which is a subgraph of the PRM graph, with only the 100 most popular configurations. Popularity is simply measured as the number of times a configuration appears in all the pathways.

Figure 5: Highways for 2GB1(16) and 1E0L(28). Bigger nodes correspond to high
energy configurations, darker nodes correspond to the most popular configurations
among all pathways. The native configuration is denoted by red.
An alternative representation of folding pathways is shown in Figure 6, which shows 100 random pathways. Each point on the curve represents a configuration on the path. The X axis represents the cumulative length of the edge (in DME), and the Y axis represents normalized energy.

Figure 6: 100 random folding pathways for 2GB1(16) and 1E0L(28). Each node represents
a configuration on the path. X-axis represents change in 3D structure
(measured as cumulative distance matrix error from the previous configuration on the
path), and Y-axis represents energy. The node on bottom left represents the native
configuration.
[1] C. Bystroff, V. Thorsson, and D. Baker. HMMSTR: a hidden Markov model for local sequencestructure
correlations in proteins. Journal of Molecular Biology, 301(1):173–190, 2000.
[2] L. E. Kavraki, P. Svestka, J.-C. Latombe, and M. Overmars. Probabilistic roadmaps for path planning in
high-dimensional configuration spaces. IEEE Transactions on Robotics and Automation, 12(4):566–580,
1996.
[3] Y. A. Girdhar. Efficient Sampling of Protein Folding Funnels using HMMSTR and Pathway Generation
using Probabilistic Roadmaps. Master’s Thesis, Dept. of Computer Science, Rensselaer Polytechnic Institute,
May 2005.
[4] G. Song, S. Thomas, K. A. Dill, J. M. Scholtz, N. M. Amato. A Path Planning-based Study of Protein Folding With a Case Study of Hairpin Formation in Protein G and L, In Proc. Pac. Symp. of Biocomputing (PSB), pp. 240-251, Lihue, HI, Jan 2003.
[5] M.S. Apaydin, D.L. Brutlag, C. Guestrin, D. Hsu. and J.C. Latombe. Stochastic Roadmap Simulation: An Efficient Representation and Algorithm for Analyzing Molecular Motion. Proc. RECOMB'02, Washington D.C., pp. 12-21, 2002.
Acknowledgment: This work was supported in part by NSF under CAREER Award No. IIS-0093233.