Maximum Likelihood

Reading Time: 7 minutes

Maximum likelihood is the procedure of finding the value of some parameters for a given statistic which makes the likelihood of the the known likelihood distribution a maximum. Maximum likelihood is a method with many uses. A classic example is linear regression. If it is assumed that the errors on the x variable follow a Gaussian distribution we can compute the probability density of the parameters used to fit the data. A Gaussian distribution is given by

\frac{1} {\sqrt{2\pi{\sigma}^2}} e^{-\frac{{(x-\mu)}^2}{2{\sigma}^2}}

where \mu is the center of the Gaussian distribution and \sigma is the standard deviation.

If we draw a line and ask what is the probability density that we see the observed data given that if we had an infinite amount of data and the mean of the Gaussian distributions for each y value is centered on a line, then we obtain the equation

p=\Pi  \frac{1} {\sqrt{2\pi{\sigma}^2}} e^{-\frac{{(\hat{y}_i-y_i)}^2}{2{\sigma}^2}}

where \hat{y} is the value of y given by the model and y is the observed value. We can simplify by getting rid of the normalization, since we are only interested in finding the location of the maximum.

p=\Pi e^{-\frac{{(\hat{y}_i-y_i)}^2}{2{\sigma}^2}}

It is more convenient to take the natural log of both sides, since that converts the product into a sum and does not change the location of the maximum. In that case we obtain

log(p)=\Sigma  {-\frac{{(\hat{y}_i-y_i)}^2}{2{\sigma}^2}}

Again we do not care about any constant factors so we can get rid of 2{\sigma}^2 . Thus, we obtain

log(p)=\Sigma  -{(\hat{y}_i-y_i)}^2

If we want to do a minimization instead of a maximization we can change the sign

-log(p)=\Sigma  {(\hat{y}_i-y_i)}^2

This is the purpose for doing least squares regression as opposed to say least quartic regression or minimizing the sum of the absolute values of the errors. We can now find the slope and y-intercept of the model.

Maximum Likelihood Applied to X-ray Crystallography

Let’s look at a more complicated use of the maximum likelihood technique. As a postdoctoral fellow I wrote a software package MR-REX, which helps determine protein crystal structures given X-ray crystallographic data and an initial guess for the structure of the protein, using a technique known as MR (Molecular Replacement). I described it in an earlier blog post here. In order for MR-REX to work it needs a quantitative measure of how well the calculated X-ray scattering amplitudes, of a hypothetical placement of a protein model, matches the observed scattering amplitudes. For this, MR-REX used a liner combination of terms, one of which is likelihood. I did not invent the use of maximum likelihood for X-ray crystallography, but I did spend a lot of time trying to understand it and deriving it by myself. You can find some interesting information about the use of maximum likelihood in X-ray crystallography here. If the maximum likelihood score was a true maximum likelihood score, it should be the only term used to assess how well the calculated and observed match, but there are some approximations used in its derivation in this case. In a test set of 320 runs of MR-REX, the maximum likelihood score alone picked out the correct placement of the protein model in 123 cases compared to 114 for the R factor. We start off with the equation for the intensity of the scattered X-ray light.

{F(h, k, l)} =  \Sigma {O_j}{f_j}(h,k,l)e^{2{\pi}i(hx_j+ky_j+lz_j)-\frac{q^2B_j}{24{\pi}^2}}

where F is the structure factor, j specifies the index of the atom, h, k, and l are the Miller indices, x, y, and z specify the coordinates of the atoms, f is the structure factor of the atom, O is the occupancy of the atoms, and B accounts for the thermal motions of the atoms. The intensity is calculated by multiplying the structure factor by its complex conjugate and the amplitude is the square root of the intensity. The structure factor is a complex number with real and imaginary components. Only the intensity of the scattered X-ray light can be measured. The real and imaginary components of the structure factor cannot be measured. If we assume that there is some error in the positions of the atoms with a Gaussian distribution, there will be an error in the real and imaginary components of the structure factors with a Gaussian distribution. The errors on the real and imaginary components are either completely uncorrelated (acentric reflection) or the real and imaginary components lie on a straight line when plotted (centric reflection). Centric reflections result from symmetry relations between different copies of the protein in the unit cell. Centric reflections are easier to analyze so let’s start with them.

Likelihood for Centric Reflections

Let’s say that the real component of a centric reflection is $latex{Fave}$ and the standard deviation of the real component of a centric reflection is \sigma and the imaginary component and its standard deviation are both 0. In this case the amplitude is just the absolute value of the real component of the structure factor. The probability density that the structure factor is F is given by

P(F)=\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(F-Fave)^2}{2\sigma^2}}

The probability density that the structure factor is -F is then given by

P(-F)=\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(-F-Fave)^2}{2\sigma^2}}=\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(F+Fave)^2}{2\sigma^2}}

Since P(\lvert F\rvert)=P(F)+P(-F)

P(\lvert{F}\rvert)=\frac{1}{\sqrt{2\pi\sigma^2}} (e^{-\frac{(F-Fave)^2}{2\sigma^2}} + e^{-\frac{(F+Fave)^2}{2\sigma^2}})

This is known as the Woolfson distribution.

On the left is P(F) where Fave is 1.0 and \sigma is 1.0. The amplitude |F| is always positive, so P(|F|) is obtained by flipping over P(F) for negative values of F and adding that to P(F) for positive values of F.
P(|F|) when Fave is 1.0 and \sigma is 0.3.

The overall likelihood of all of the centric reflections is

P_{centric}=\Pi_{centric=h,k,l} \frac{1}{\sqrt{2\pi\sigma(h,k,l)^2}} (e^{-\frac{(\lvert{Fobs(h,k,l)}\rvert-\lvert{Fcalc(h,k,l)}\rvert)^2}{2\sigma(h,k,l)^2}} + e^{-\frac{(\lvert{Fobs(h,k,l)}\rvert+\lvert{Fcalc(h,k,l)}\rvert)^2}{2\sigma(h,k,l)^2}})

This involves multiplying thousands of very small numbers together, resulting in floating point underflow. Thus, we take the log of both sides.

\log(P_{centric})=\Sigma_{centric=h,k,l} \log(\frac{1}{\sqrt{2\pi\sigma(h,k,l)^2}} (e^{-\frac{(\lvert{Fobs(h,k,l)}\rvert-\lvert{Fcalc(h,k,l)}\rvert)^2}{2\sigma(h,k,l)^2}} + e^{-\frac{(\lvert{Fobs(h,k,l)}\rvert+\lvert{Fcalc(h,k,l)}\rvert)^2}{2\sigma(h,k,l)^2}}))

Acentric Reflections

Let’s next look at acentric reflections. Let’s say that the expected real component of the structure factor is Fave, that its standard deviation is \sigma , the expected imaginary component is 0, and the standard deviation of the imaginary component is \sigma . If the expected imaginary component is not 0 the complex plain can be rotated so that it is 0. The following figure demonstrates the situation.

The x-axis is the real component of the structure factor. The y-axis is the imaginary component of the structure factor. The blue blur represents the probability density of the structure factor, given some error in the positions of the atoms. |F| is the amplitude of one of the reflections.

In the case under consideration

P(F)=\frac{1}{2{\pi}\sigma^2} e^{{-(Fave-\Re(F))^2}/2\sigma^2} e^{-\Im(F)^2/2\sigma^2}= \frac{1}{2{\pi}\sigma^2} e^{{-((Fave-\Re(F))^2+\Im(F)^2)}/2\sigma^2}

where \Re(F) is the real component of the structure factor and \Im(F) is the imaginary component of the structure factor. The amplitude \lvert F\rvert=\sqrt{\Re(F)^2+\Im(F)^2}.

\Re(F)=\lvert F\rvert \cos(\theta)

\Im(F)=\lvert F\rvert \sin(\theta)

To find P(\lvert F\rvert) we need to convert to polar coordinates and integrate over \theta

P(\lvert F\rvert)=\int^{2\pi}_{0} \frac{\lvert F\rvert}{2{\pi}\sigma^2} e^{{-((Fave-\lvert F\rvert\cos(\theta))^2+(\lvert F\rvert\sin(\theta))^2)}/2\sigma^2} d\theta

WolframAlpha tells me that the result is

P(\lvert F\rvert)=\frac{\lvert F\rvert}{\sigma^2}e^{-\frac{\lvert F\rvert^2+Fave^2}{2\sigma^2}} I_0(\frac{\lvert F\rvert Fave}{\sigma^2})

where I_n is the modified Bessel function of the first kind. This is known as the Rice distribution. You can find two examples of the Rice distribution below.

Two examples of the Rice distribution. For both plots Fave is 1.0. On the left \sigma is 1.0 and on the right \sigma is 0.3.

We can then find the overall log likelihood for the acentric reflections

\log(P_{acentric})=\Sigma_{acentric=h,k,l} \log(\frac{\lvert{Fcalc(h,k,l)}\rvert}{\sigma(h,k,l)^2}e^{-\frac{\lvert{Fobs(h,k,l)}\rvert^2+\lvert{Fcalc(h,k,l)}\rvert^2}{2\sigma(h,k,l)^2}} I_0(\frac{ \lvert{Fobs(h,k,l)}\rvert \lvert{Fcalc(h,k,l)}\rvert}{\sigma(h,k,l)^2}))

The overall log likelihood is then

\log(P)=\log(P_{centric})+\log(P_{acentric})

Calculating \bold{\sigma(h,k,l)}

We assume that there is some error with the x, y, and z coordinates of the atoms with standard deviations \sigma_x , \sigma_y , and \sigma_z. The probability density of finding an atom at X, Y, Z is then

P(X, Y, Z)=\frac{1}{(2\pi)^{3/2}\sigma_x\sigma_y\sigma_z}e^{-(X-x)^2/2\sigma^2_x-(Y-y)^2/2\sigma^2_y-(Z-z)^2/2\sigma^2_z}

where the atom is located at (x, y, z) in the model. The standard deviations themselves are estimates, which is typically based on the sequence identity of the model. The higher the sequence identity the smaller the estimated standard deviations.

Let’s return to the equation for the structure factor

{F(h, k, l)} =  \Sigma {O_j}{f_j}(h,k,l)e^{2{\pi}i(hx_j+ky_j+lz_j)-\frac{q^2B_j}{24{\pi}^2}}

Let’s rewrite this for just one atom

F(h, k, l)_j =  {O_j}{f_j}(h,k,l)e^{2{\pi}i(hx_j+ky_j+lz_j)-\frac{q^2B_j}{24{\pi}^2}}

We need to look at the standard deviations of the real and imaginary components of the structure factor, so we need to look at the real and imaginary components of the structure factor.

\Re(F(h,k,l)_j)= {O_j}{f_j}(h,k,l)\cos(2{\pi}i(hx_j+ky_j+lz_j)) e^{-\frac{q^2B_j}{24{\pi}^2}}

\Im(F(h,k,l)_j)= {O_j}{f_j}(h,k,l)\sin(2{\pi}i(hx_j+ky_j+lz_j)) e^{-\frac{q^2B_j}{24{\pi}^2}}

The standard deviation, \sigma, is the square root of the variance, \sigma^2. The variance is defined as the average of the squares minus the square of the averages. Therefore the variance of the real component of the structure factor is

\sigma(h,k,l)_j^2=\left\langle{\Re(F(h,k,l)_j)^2}\right\rangle- \left\langle{\Re(F(h,k,l)_j)}\right\rangle^2

We can find the average real component of the structure factor using

\left\langle{\Re(F(h,k,l)_j)}\right\rangle=\frac{\int\int\int{P(X,Y,Z)_jRe(F(h,k,l,X,Y,Z)_j)dXdYdZ}}{\int\int\int{P(X,Y,Z)dXdYdZ}}

Similarly

\left\langle{\Re(F(h,k,l)_j)^2}\right\rangle=\frac{\int\int\int{P(X,Y,Z)_jRe(F(h,k,l,X,Y,Z)_j)^2dXdYdZ}}{\int\int\int{P(X,Y,Z)dXdYdZ}}

Variances are additive, assuming that the errors on the positions of the atoms are uncorrelated. Therefore we can find the total variance by summing up the variances due to each individual atom.

\sigma(h,k,l)^2=\Sigma_j \sigma(h,k,l)^2_j

I leave it up to the interested reader to find the full equation for \sigma(h,k,l) as an exercise. Note that the analysis in this section is only valid for when there is one protein in the unit cell and the angles formed by the edges of the unit cells are all perpendicular. It is possible to derive the equations for when there are multiple copies of the protein in the unit cell and the edges of the unit cell do not form perpendicular angles, but for the sake of brevity and simplicity that analysis was not done here. It is also possible to account for the errors from the B-factors, errors from the effects of water, errors from missing atoms, etc. I hope this blog post inspires you to find uses for maximum likelihood in your own work.

Leave a Reply

Knoldus Pune Careers - Hiring Freshers

Get a head start on your career at Knoldus. Join us!