
Introduction
Proteins are large molecules, which occur in abundance in every single living organism. They carry
This blog post is about a software package I wrote, MR-REX, which is part of a pipeline to determine protein structures using experimental data. In a benchmark of 1303 proteins, MR-REX succeeded in 699 cases, while the state of the art software Phaser succeeded in 632 cases. The paper describing MR-REX can be found at https://zhanglab.ccmb.med.umich.edu/papers/2018_9.pdf.
Determining Protein Structures
The most common way in which a protein’s structure is determined is X-ray crystallography. In X-ray crystallography the protein is first crystallized. A crystal is a solid in which a molecule forms a repeating pattern. The crystal can be divided into identical parallelpipeds called unit cells. Figure 1 gives an example of a unit cell and how the crystal is built from unit cells.

Figure 1. One unit cell is shown at the top left. The bottom right shows the unit cell extended in all directions. The protein is shown as a ribbon diagram, which makes it easy to see the main features of the protein. A typical crystal has thousands of unit cells in each direction. Almost all crystals have multiple proteins in the unit cell.
Once the protein has been crystallized a beam of X-rays is shined on it. The X-rays scatter and the intensity as a function of scattering direction is measured by an X-ray camera. Figure 2 displays an example of a raw X-ray image from an X-ray crystallographic experiment and the equation that can be used to calculate the X-ray scattering intensities, given the coordinates of the atoms in the unit cell. The X-ray light does not scatter in all directions, but only in certain discrete directions, which are specified by three indices h, k, and l, called Miller indices. The photographic image is converted to a table, in which each point of X-ray light is specified by its Miller indices and the intensity of the X-ray light. It is possible to compute the intensities of the scattered X-ray light given the protein structure, by doing a Fourier transform of the electron density of a unit cell. More specifically the Fourier transform of the electron density of the unit cell is a complex number F, the structure factor, and the intensity is the structure factor multiplied by its complex conjugate. Unfortunately, there is no way to calculate the structure a protein, from the intensities of the scattered X-ray light, because there is needed information missing from the experimental data. Only the intensity of the scattered X-ray light can be measured, but the real and imaginary components of the structure factor must be known to directly determine the structure of the protein.

So, how is X-ray crystallography used to determine protein structures? The most common method is MR (Molecular Replacement). In MR an initial guess for the protein structure is used. This guess is usually an evolutionarily related protein, with a similar amino acid sequence and therefore likely a similar structure. The idea is that the initial guess or model can be refined to match the correct structure. However, while the structure of the protein of interest may be very close to the initial guess, the unit cell, the orientation and position of the protein of interest may be different from that of the protein used for the initial guess. Thus the position and orientation of the protein in the unit cell must be determined before it can be refined.
The Problems
There are two problems here. One problem is what algorithm will be used to search for the correct placement. The other problem is how to evaluate different placement of the model.
Scoring Function
Lets look at this problem first. MR-REX needs a method to evaluate how well the calculated X-ray intensities match the experimental X-ray intensities. There are many options for this and MR-REX uses a linear combination of terms, called XScore.

The first four terms quantify how well the calculated intensities match the experimental intensities, while the last three terms quantify how much the proteins clash with other copies. The subscript Z denotes that the Z-score of the raw metrics are used. ML stands for maximum likelihood. The concept of maximum likelihood can be applied to other things, is very interesting and may become the subject of a future blog post. The optimization of these weights will be covered later in this blog post.
MR(Molecular Replacement) Search Algorithm
Next, we consider the MR search algorithm. In traditional MR the search is split into two phases. In the first phase the orientation of the protein is determined using a grid-based search. To determine the orientation of the protein without regard to the location, the fit to the experimental data is evaluated using a method which I will not go into here for the sake of brevity. In the second phase the location of the protein is determined. Splitting the 6-Dimensional search into two 3-Dimensional searches significantly speeds up MR, but reduces the chance that it will succeed. An alternative is to do a systematic 6-Dimensional grid-based search. However, a systematic search of 6-Dimensional space is slow. An alternative is to do a stochastic search. REMC simulations (Replica Exchange Monte Carlo) are perhaps the best stochastic search method in many cases. In a Monte Carlo simulation random changes are made to the orientation and position of the model. If the change causes an improvement to the scoring function the change is accepted. If the change causes the scoring function to become worse it is accepted or rejected with a probability given by the following equation. This enables the model to escape local minima and find the global minimum. The problem with Monte Carlo simulations is that the temperature used may be too low to get out of a local minimum or the temperature may be too high to find low energy states. Indeed there may not be any temperature that works well. One solution to this is to run many Monte Carlo simulations at different temperatures. Then if a high temperature replica finds a better solution than a low temperature replica, the replicas will switch temperatures. There are two major advantages that REMC simulations have over systematic search in MR. First we can search more than just orientation and position. The structure of the model can be modified during the search as well. Portions of the model can be randomly deleted or reinserted. The purpose of this is that portions of the model which are incorrect are more deleterious to the success of MR than deletions. Before the MR search segments of the model which are likely to be inaccurate are predicted. Then during the REMC search these segments are deleted and reinserted randomly in addition to randomly rotating and translating the protein model. The second advantage is that the clash score can be used the MR search not just as a filter at the end. The clash score is significantly slower than the other terms in the XScore, and MR is frequently successful without the clash score, so it is not used in the first round. In the second round the clash score is used. Deleting and reinserting segments of the protein is also computationally expensive and is not used in the first two rounds, but is used in the third round. The maximum likelihood score is the most expensive part of MR-REX and is only used in the last round. The algorithm used by MR-REX is displayed in Figure 3.

Figure 3. The flowchart of the MR-REX algorithm.
Training and Test Sets
To develop and evaluate this algorithm we needed to develop training and test sets. The training set consisted of 1387 protein structures. First we took 40 proteins randomly from the PDB (Protein Data Bank), a public repository of protein structures. For each of these proteins we generated protein structures of varying fidelity to the original protein structure. Some of these structures are very similar to the original structure, some are dissimilar from the original structure, and some are in between. We also developed a test set of 1303 protein structures base on 38 proteins. Figure 4 displays some of the protein structures in the training set for two proteins.

Figure 4. Ten protein structures generated for two proteins in the training set. The original protein structures are in green and the generated structures are in cyan. RMSD is a measure of similarity between protein structures. The higher the RMSD the more dissimilar the structures. TM-score is another measure of the similarity of two proteins. Two identical proteins have a TM-score of 1 and the minimum TM-score is 0.
Optimizing Weights
How are the weights used in the XScore determined? The output of each run of MR-REX is a set of placed models. From among these placements the best placement must be chosen. To evaluate how good the chosen model is, we use cRMSD. We choose the linear combination of weights which gives the lowest average cRMSD, when choosing the model placement with the lowest XScore, with cRMSDs greater than 8 angstrom set to 8 angstrom, since all cRMSDs greater than 8 angstrom are considered to be equally bad. More interesting uses of machine learning in this project may be the subject of a future blog post.
Running Jobs
MR-REX was written in C++. A typical MR-REX job takes 20 hours. It is not parallelized. During its development it was run on computer clusters with hundreds of nodes. Scripts to run the training and test sets, collect the results and analyze them were written in Perl and Python.
Results
We ran MR-REX and Phaser on these decoy structures and compared the results. We consider MR to be successful if the cRMSD is smaller than 2 angstroms. MR-REX succeeded in 699 out of the 1303 test cases while Phaser succeeded in 632 cases. It is interesting to examine which features of MR-REX contributed most to its success. After the first round in which only the first three terms of the XScore are used MR-REX succeeded in 536 out of 1303 cases. After the second round MR-REX succeeded in 639 cases. After optimizing potentially incorrect portions of the proteins MR-REX succeeded in 681 cases. After the final round in which maximum likelihood is used MR-REX succeeded in 699 cases.
About Author
Prior to Knoldus, Jouko was a postdoctoral fellow in the Department of Computational Medicine and Bioinformation at the University of Michigan. He has a