Predicting protein structures and simulating protein folding motions are two of the most important problems in computational biology today. Modern folding simulation methods rely on a scoring function, which attempts to distinguish the native structure (the most energetically stable 3D structure) from one or more non-native structures. Decoy databases are collections of non-native structures that are widely used to test and verify these scoring functions.
We evaluate and improve the quality of decoy databases by adding novel structures and/or removing redundant structures. We test our approach on decoy databases of varying size and type and show significant improvement across a variety of metrics. Most of the improvement comes from the addition of novel structures, indicating that our improved databases contain more informative structures that are more likely to fool scoring functions. This work can aid the development and testing of better scoring functions, which, in turn, will improve the quality of protein folding simulations.
Method Details and Results
There are two main phases in the improvement of decoy sets. First, samples are generated on the protein’s energy landscape. Second, in the decoy selection phase, some structures are chosen from the original set D to be removed and some are chosen from the sample set S to be added. The original decoy set D and the sample set S can be broken down into four subsets:
- redundant decoy structures DD from D,
- viable decoy structures DV from D,
- redundant sampled structures SD from S, and
- viable sampled structures SV from S.
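The selection phase above can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how redundancy is decided, so the sketch treats a structure as redundant when it lies within a hypothetical distance `threshold` of a structure already kept. The names `split_viable` and `improve_decoy_set`, the `dist` callback, and the threshold rule are all illustrative and not taken from the original method.

```python
import math


def split_viable(candidates, kept, dist, threshold):
    """Partition candidates into (viable, redundant).

    ASSUMPTION: a candidate is redundant if it is closer than `threshold`
    to any structure in `kept` or to any candidate already accepted as
    viable. The source does not specify its redundancy criterion.
    """
    viable, redundant = [], []
    for candidate in candidates:
        reference = kept + viable
        if any(dist(candidate, other) < threshold for other in reference):
            redundant.append(candidate)
        else:
            viable.append(candidate)
    return viable, redundant


def improve_decoy_set(D, S, dist, threshold):
    """Return the improved set DV U SV together with the four subsets."""
    # Split the original decoys into viable (DV) and redundant (DD).
    DV, DD = split_viable(D, [], dist, threshold)
    # Split the samples into viable (SV) and redundant (SD), judged
    # against the viable decoys that are being kept.
    SV, SD = split_viable(S, DV, dist, threshold)
    return DV + SV, {"DD": DD, "DV": DV, "SD": SD, "SV": SV}
```

Here structures are stand-in coordinate tuples compared with `math.dist`; a real decoy set would need a structural distance measure between conformations, which the text does not name.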
We compare decoy sets based on the following metrics:
- Z-Score – The z-score (or standard score) indicates the number of standard deviations between the native structure energy and the average energy of a decoy set.
- MinDist – The minimum distance metric measures the average minimum distance from each decoy structure to any other decoy structure in the set.
- Improvement – Given an original decoy set and an improved decoy set, the improvement score returns the change in z-score per sample between the two sets.
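The three metrics above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It makes several assumptions the text leaves open: the z-score sign convention (native minus mean), the use of the population standard deviation, a caller-supplied distance function `dist`, and the count `n_samples` used to normalize the improvement score. All function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def z_score(native_energy, decoy_energies):
    """Standard deviations between the native energy and the mean decoy energy.

    ASSUMPTION: computed as (native - mean) / std with the population
    standard deviation; the source gives no sign or estimator convention.
    """
    sigma = pstdev(decoy_energies)
    if sigma == 0:
        raise ValueError("decoy energies have zero spread")
    return (native_energy - mean(decoy_energies)) / sigma


def min_dist(structures, dist):
    """Average, over all decoys, of the distance to the nearest other decoy."""
    if len(structures) < 2:
        raise ValueError("need at least two structures")
    total = 0.0
    for i, a in enumerate(structures):
        total += min(dist(a, b) for j, b in enumerate(structures) if j != i)
    return total / len(structures)


def improvement(z_original, z_improved, n_samples):
    """Change in z-score per sample between two decoy sets.

    ASSUMPTION: `n_samples` is whichever sample count is used for
    normalization; the source does not say which count is meant.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return (z_improved - z_original) / n_samples
```

With this sign convention, a native structure whose energy lies well below the decoy average yields a strongly negative z-score, and a z-score moving toward zero means the decoy set has become harder to separate from the native structure.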
We apply our methods to existing decoy sets from the Decoys ‘R’ Us database:
We measure the z-score, improvement score, and minimum distance value for each protein database studied. For each metric, we show the contribution from each operation: removing redundant decoys (DV), adding new samples (D U SV), and their combination (DV U SV).
When the z-score approaches zero, the native structure energy is harder to distinguish from the energies of the other structures in the set. For every protein, the z-scores of D and DV are very similar. Thus, simply removing structures does not greatly impact the z-score. However, once we add new structures from our sampling approach (D U SV), the z-score drops drastically, reaching values comparable to those of the final set (DV U SV). Therefore, the main contributors to z-score improvement are the structures generated by our sampling approach.
The improvement score shows the change in z-score per sample between two sets. A higher value indicates that the change (structure addition, removal, or both) has a greater impact on the z-score. We again see that adding structures provides a decoy set of better quality than simply removing redundant structures. Proteins 1ash and 1gdm, which have the smallest original sets, show the largest improvement scores.
The minimum distance between neighboring structures indicates how varied the structures are. A larger distance signifies greater structural diversity and implies a greater ability to fool different scoring functions. As expected, when decoys are removed (DV), the minimum distance increases, and when decoys are added (D U SV), the minimum distance decreases. For all proteins studied, the final decoy set (DV U SV) has a larger minimum distance than the original set (D), yielding a set with greater diversity.
In conclusion, our algorithms are able to generate sets with lower energies and more diverse structures that are more likely to fool scoring functions of protein folding algorithms.