SLIDE 1

Computational Challenges in Computing Nearest Neighbor Estimates of Entropy for Large Molecules

E. James Harner, Harshinder Singh, Shengqiao Li, and Jun Tan

Research supported by: Biostatistics Branch, National Institute for Occupational Safety and Health, Morgantown, WV

September 19, 2003

SLIDE 2

Probabilistic Modelling of Molecular Vibrations

⋆ Modelling random vibrations in molecules is important for studying their properties and functions.
⋆ Entropy is a measure of the freedom of a system to explore its available configuration space.
⋆ Entropy evaluation is important in order to understand the factors involved in the stability of a conformation and the change from one conformation to another.

SLIDE 3

Entropy in Protein Folding

⋆ Proteins are biological molecules that are of primary importance to all living organisms.
⋆ Proteins are made up of many amino acids (called residues) linked together.
⋆ A human body contains over 30,000 different kinds of proteins.
⋆ Protein misfolding is the cause of protein-folding diseases: Alzheimer's disease, mad cow disease, cystic fibrosis, and some types of cancer.
⋆ It is important to study the stability of a protein, and the key is to find a small molecule (a drug) that can stabilize the normally folded structure.

SLIDE 4

Insulin Protein

SLIDE 5

Entropy

⋆ The entropy of a molecular conformation depends on the coordinates of the conformation. These are:
  – Bond lengths
  – Bond angles
  – Torsional angles (dihedral or rotational degrees of freedom)
⋆ Since bond lengths and bond angles are rather hard coordinates, entropy is mainly determined by fluctuations in the torsional angles.
⋆ Probability modeling of the torsional angles of a molecular system is therefore important for entropy evaluation.

SLIDE 6

Methanol Molecule

SLIDE 7

Probabilistic Modelling of Torsional Angles

⋆ In the molecular biology literature, torsional angles are assumed to have a multivariate Gaussian (normal) distribution (Karplus and Kushick (1981), Macromolecules; Levy et al. (1984), Macromolecules). The entropy is then given by
$$S_c = \frac{m k_B}{2} + \frac{k_B}{2} \ln\left[(2\pi)^m |\Sigma|\right]$$
⋆ S_c is estimated by using the maximum likelihood estimate of the determinant of the variance-covariance matrix Σ, computed from data on the torsional angles of the molecular system.
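A minimal R sketch of this plug-in estimate (function and variable names are mine, not from the talk): the sample covariance of an n × m matrix of torsional angles stands in for Σ, with k_B = 1 so the entropy is in nats.

# Sketch: Gaussian (quasi-harmonic) entropy estimate from an n x m matrix
# `theta` of torsional angles; kB = 1 gives the entropy in nats.
gaussian.entropy <- function(theta, kB = 1) {
  m <- ncol(theta)
  Sigma <- cov(theta)  # sample variance-covariance matrix of the angles
  logdet <- as.numeric(determinant(Sigma, logarithm = TRUE)$modulus)
  kB * (m / 2 + (m * log(2 * pi) + logdet) / 2)
}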
SLIDE 8

Probability Modeling of Torsional Angles

⋆ There are common situations where assuming a Gaussian distribution for torsional angles is not realistic, e.g.:
  – Modeling a torsional angle which has more than one peak.
  – Modeling a torsional angle where there is more free movement, e.g., in gases.
⋆ In Demchuk and Singh (2001, Molecular Physics):
  – We proposed a circular probability modeling approach for modeling torsional angles.
  – The torsional angle of the methanol molecule was modeled using a von Mises distribution (the most commonly used distribution on the circle).

SLIDE 9

Probability Modeling of Torsional Angles

⋆ A circular random variable Θ follows an l-mode von Mises distribution if its probability density function is given by
$$f(\theta) = \frac{1}{2\pi I_0(\kappa)}\, e^{\kappa \cos[l(\theta - \theta_0)]}, \quad -\pi \le \theta < \pi,$$
where
  – κ = concentration parameter,
  – l = number of modes,
  – I_0 = modified Bessel function of order 0,
  – θ_0 = position of the first mode.
For l ≥ 2, the modes are 2π/l radians apart.
⋆ For l = 1:
  – The mean angle is θ_0.
  – If κ = 0, it is the uniform distribution on the circle.
  – For large κ, it is approximately Gaussian.
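A direct R transcription of this density (a sketch; the function name is mine), using the base R Bessel function besselI for I_0:

# Sketch: density of the l-mode von Mises distribution defined above.
dvonmises <- function(theta, kappa, l = 1, theta0 = 0) {
  exp(kappa * cos(l * (theta - theta0))) / (2 * pi * besselI(kappa, nu = 0))
}

For example, curve(dvonmises(x, kappa = 2, l = 3), -pi, pi) displays three modes 2π/3 radians apart.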

SLIDE 10

von Mises Distribution

SLIDE 11

von Mises Distribution

SLIDE 12

Probability Modeling of Torsional Angles

We assumed independent von Mises distributions for the torsional angles. Let Θ_i have an l_i-mode von Mises distribution with concentration parameter κ_i, i = 1, 2, …, m. Then the entropy of the system is given by
$$S_c = k_B\left[m \ln 2\pi + \sum_{i=1}^m \ln I_0(\kappa_i) - \sum_{i=1}^m \kappa_i\, \frac{I_1(\kappa_i)}{I_0(\kappa_i)}\right],$$
where I_1 is the modified Bessel function of order 1. From the Boltzmann-Gibbs distribution, the potential energy of the system is given by
$$V(\theta_1, \theta_2, \ldots, \theta_m) = \frac{1}{\beta} \sum_{i=1}^m \kappa_i \left[1 - \cos\big(l_i(\theta_i - \theta_{i0})\big)\right], \quad \beta = \frac{1}{k_B T}.$$
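In R, the entropy formula above (in units of k_B) is a short function of the vector of concentration parameters; a sketch with naming of my choosing:

# Sketch: entropy (in units of kB) of m independent von Mises torsional
# angles with concentration parameters `kappa`, per the S_c formula above.
# (The number of modes l_i does not enter the entropy.)
vm.entropy <- function(kappa) {
  A <- besselI(kappa, 1) / besselI(kappa, 0)  # A(kappa) = I1/I0
  length(kappa) * log(2 * pi) + sum(log(besselI(kappa, 0))) - sum(kappa * A)
}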

SLIDE 13

Modeling Torsional Angle of Methanol

As a case study, we considered the torsional angle of a methanol molecule. We assumed a 3-mode von Mises distribution for its torsional angle Θ, i.e.,
$$f(\theta) = \frac{1}{2\pi I_0(\kappa)}\, e^{\kappa \cos[3(\theta - \theta_0)]}, \quad -\pi \le \theta < \pi.$$
The potential energy is
$$V(\theta) = \frac{\kappa}{\beta}\left[1 - \cos\big(3(\theta - \theta_0)\big)\right] = \frac{V_0}{2}\left[1 - \cos 3(\theta - \theta_0)\right],$$
where V_0 is the maximum potential energy.

SLIDE 14

A Bathtub Shaped Distribution for Potential Energy

For the methanol molecule, the potential energy is
$$V = \frac{V_0}{2}\left[1 - \cos 3(\theta - \theta_0)\right].$$
Assuming Θ has a 3-mode von Mises distribution, we derived the following p.d.f. for V:
$$g(v) = \frac{1}{\pi I_0(\kappa)}\, e^{\kappa(1 - 2v/V_0)}\, v^{-1/2} (V_0 - v)^{-1/2}, \quad 0 < v < V_0.$$
This is a bathtub-shaped probability distribution. For κ = 0, V/V_0 has a Beta(1/2, 1/2) distribution.
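A sketch of this density in R (my naming), writing v^{-1/2}(V_0 - v)^{-1/2} as 1/sqrt(v(V_0 - v)):

# Sketch: bathtub-shaped density g(v) of the potential energy, 0 < v < V0.
dbathtub <- function(v, kappa, V0) {
  exp(kappa * (1 - 2 * v / V0)) / (pi * besselI(kappa, 0) * sqrt(v * (V0 - v)))
}

Setting kappa = 0 recovers the scaled Beta(1/2, 1/2) density noted above.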

SLIDE 15

A Bath-tub Shaped Distribution

SLIDE 16

Histograms of Torsional Angle and Energy

SLIDE 17

Fitting von Mises and Bath-tub Shaped Distributions

SLIDE 18

A Bivariate Circular Model (Singh et al., 2002, Biometrika)

⋆ Let Θ_1 and Θ_2 be two circular random variables. We introduced a joint probability distribution for Θ_1 and Θ_2 with p.d.f.
$$f(\theta_1, \theta_2) = C\, e^{\kappa_1 \cos(\theta_1 - \mu_1) + \kappa_2 \cos(\theta_2 - \mu_2) + \lambda \sin(\theta_1 - \mu_1)\sin(\theta_2 - \mu_2)}, \quad -\pi \le \theta_1, \theta_2 < \pi,$$
where κ_1, κ_2 ≥ 0, −∞ < λ < ∞, −π ≤ μ_1, μ_2 < π, and C is the normalizing constant.
⋆ If the fluctuations in Θ_1 and Θ_2 are sufficiently small, then (Θ_1, Θ_2) approximately follows a bivariate normal distribution with
$$\sigma_1^2 = \frac{\kappa_2}{\kappa_1 \kappa_2 - \lambda^2}, \quad \sigma_2^2 = \frac{\kappa_1}{\kappa_1 \kappa_2 - \lambda^2}, \quad \rho = \frac{\lambda}{\sqrt{\kappa_1 \kappa_2}}.$$
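The density is easy to evaluate up to C; a sketch in R (names mine):

# Sketch: the bivariate circular density above, up to the normalizing
# constant C (a series expression for C is given on the next slide).
bvm.kernel <- function(th1, th2, mu1, mu2, kappa1, kappa2, lambda) {
  exp(kappa1 * cos(th1 - mu1) + kappa2 * cos(th2 - mu2) +
      lambda * sin(th1 - mu1) * sin(th2 - mu2))
}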

SLIDE 19

A Bivariate Circular Model

⋆ The normalizing constant C is given by
$$\frac{1}{C} = 4\pi^2 \sum_{m=0}^{\infty} \binom{2m}{m} \left(\frac{\lambda^2}{4\kappa_1\kappa_2}\right)^m I_m(\kappa_1)\, I_m(\kappa_2),$$
where I_m is the modified Bessel function of order m.
⋆ E[sin(Θ_i − μ_i)] = 0, i = 1, 2, which implies that μ_i is the circular mean of Θ_i.
⋆ The circular variance of Θ_1 is given by
$$1 - E[\cos(\Theta_1 - \mu_1)] = 1 - 4C\pi^2 \sum_{m=0}^{\infty} \binom{2m}{m} \left(\frac{\lambda^2}{4\kappa_1\kappa_2}\right)^m I_{m+1}(\kappa_1)\, I_m(\kappa_2).$$
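A truncated-series sketch of the normalizing constant in R (my naming; M is the truncation point, and the series converges quickly because I_m(κ) decays rapidly in m):

# Sketch: normalizing constant C via the series above, truncated at M terms.
bvm.C <- function(kappa1, kappa2, lambda, M = 50) {
  m <- 0:M
  s <- sum(choose(2 * m, m) * (lambda^2 / (4 * kappa1 * kappa2))^m *
             sapply(m, function(j) besselI(kappa1, j) * besselI(kappa2, j)))
  1 / (4 * pi^2 * s)
}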

SLIDE 20

A Bivariate Circular Model

⋆ The conditional distributions of Θ_1 and Θ_2 are von Mises.
⋆ The marginal distribution of Θ_1 is symmetric around θ_1 = μ_1 and unimodal (bimodal) when
$$A(\kappa_2) = \frac{I_1(\kappa_2)}{I_0(\kappa_2)} \le\ (\ge)\ \frac{\kappa_1 \kappa_2}{\lambda^2}.$$
⋆ A generalization which allows multiple peaks in the marginal distributions is
$$f(\theta_1, \theta_2) = C\, e^{\kappa_1 \cos(l_1(\theta_1 - \mu_1)) + \kappa_2 \cos(l_2(\theta_2 - \mu_2)) + \lambda \sin(l_1(\theta_1 - \mu_1))\sin(l_2(\theta_2 - \mu_2))}, \quad -\pi \le \theta_1, \theta_2 < \pi,$$
where l_1, l_2 are positive integers.

SLIDE 21

Nearest Neighbor Estimates of Entropy (Singh et al., 2002)

⋆ Let X_1, X_2, …, X_n be a random sample from a population with p.d.f. f(x).
⋆ Let R_{i,k} be the Euclidean distance from X_i to its kth closest neighbor.
⋆ Then a reasonable estimate of f(X_i) is obtained by equating the probability mass of the p-dimensional ball of radius R_{i,k} centered at X_i to k/n:
$$\hat f(X_i)\, \frac{R_{i,k}^p\, \pi^{p/2}}{\Gamma(p/2 + 1)} = \frac{k}{n}.$$
⋆ The above equation gives
$$\hat f(X_i) = \frac{k\, \Gamma(p/2 + 1)}{n\, R_{i,k}^p\, \pi^{p/2}}, \quad i = 1, 2, \ldots, n.$$

SLIDE 22

Nearest Neighbor Estimates

⋆ Thus a reasonable estimator of the entropy of a population with p.d.f. f(x) is given by
$$\hat G_k = -\frac{1}{n} \sum_{i=1}^n \ln \hat f(X_i) = -\frac{1}{n} \sum_{i=1}^n T_i,$$
where T_i = ln f̂(X_i) and
$$\hat f(X_i) = \frac{k\, \Gamma(p/2 + 1)}{n\, R_{i,k}^p\, \pi^{p/2}}, \quad i = 1, 2, \ldots, n.$$
⋆ The asymptotic mean of the estimator is given by
$$\lim_{n\to\infty} E[\hat G_k] = L_{k-1} - \gamma - \ln k + E[-\ln f(X)],$$
where $L_0 = 0$, $L_j = \sum_{i=1}^{j} \frac{1}{i}$ for j ≥ 1, and γ ≈ 0.5772 is Euler's constant.

SLIDE 23

Nearest Neighbor Estimates

⋆ Thus the estimator Ĝ_k is asymptotically biased.
⋆ We consider the modified estimator
$$\hat H_k = \hat G_k - L_{k-1} + \gamma + \ln k.$$
⋆ In terms of the kth nearest neighbor distances,
$$\hat H_k = \frac{p}{n} \sum_{i=1}^n \ln R_{i,k} + \ln \frac{\pi^{p/2}}{\Gamma(p/2 + 1)} + \gamma - L_{k-1} + \ln n.$$
⋆ Ĥ_k is asymptotically unbiased. For k = 1, it reduces to the estimator proposed by Kozachenko and Leonenko (1987).
⋆ Ĥ_k is a consistent estimator of the entropy E[− ln f(X)].
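A serial brute-force R sketch of Ĥ_k (my naming; the parallel Rmpi version used in the timing studies appears on the later slides):

# Sketch: serial O(n^2) computation of the estimator H_k above for an
# n x p sample matrix `x`.
nn.entropy <- function(x, k = 1) {
  x <- as.matrix(x)
  n <- nrow(x)
  p <- ncol(x)
  D <- as.matrix(dist(x))                       # pairwise Euclidean distances
  R <- apply(D, 1, function(d) sort(d)[k + 1])  # kth NN distance; entry 1 is the self-distance 0
  L <- if (k > 1) sum(1 / seq_len(k - 1)) else 0  # L_{k-1}
  p * mean(log(R)) + log(pi^(p / 2) / gamma(p / 2 + 1)) + 0.5772157 - L + log(n)
}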

SLIDE 24

Nearest Neighbor Estimates

⋆ For sample sizes of 10, 25, and 50, and for some standard distributions, we used simulations to study the performance of the estimators Ĥ_k for various values of k under the mean square error criterion.
⋆ The estimator based on k = 4 performs reasonably well.
⋆ We used these estimators to evaluate the entropy of the methanol molecule and of diethyl ether.

SLIDE 25

Nearest Neighbor Estimates

⋆ For methanol at room temperature, based on 15,000 observations, the values of Ĥ_k for k = 1, 2, 3, 4 are 1.840, 1.777, 1.770, and 1.756. Parametric fitting using a von Mises distribution yielded an entropy estimate of 1.744.
⋆ For diethyl ether, which has four torsional angles, the entropy estimates for k = 1, 2, 3, 4 are 3.216, 3.236, 3.196, and 3.199, respectively, based on 15,000 observations at room temperature.
SLIDE 26

Nearest Neighbor Estimates

Root Mean Square Error of Estimators

Distribution            n    k = 1   k = 4   k_b   k = k_b
U[0, 1]                25    0.292   0.166    4     0.166
t, df = 4              25    0.345   0.242   20     0.198
N(0, 1)                25    0.320   0.196   19     0.167
Biv. normal, ρ = 0.5   25    0.364   0.251    8     0.232

(k_b denotes the best-performing k for each distribution.)

SLIDE 27

Summary of Proposed Approaches for Modelling Torsional Angles

⋆ Circular probability approaches based on von Mises, bivariate, and multivariate circular models
⋆ A Fourier series expansion of the potential function
⋆ A nonparametric method of entropy estimation based on nearest neighbor distances
⋆ These methods will be used for efficient estimation of the entropy of proteins.

SLIDE 28

Computation of Nearest Neighbor Distances

⋆ Direct (brute-force) method: O(n²)
⋆ ANN method: O(n log n)

SLIDE 29

Direct Method

⋆ Simple (brute-force) R code
⋆ Uses Rmpi, an R wrapper for the MPI API
⋆ Provides interactive R slave functionality
⋆ Requires the LAM/MPI runtime environment

SLIDE 30

R code for the Master Processor

library(Rmpi)

# Spawn four R slaves: two on each of two dual-CPU hosts.
mpi.spawn.Rslaves(hosts = c(0, 0, 1, 1))

# Load the slave-side function knnb.part on every slave.
mpi.remote.exec(load, file = "Projects/comp/RNnbMpi.Rdata")

k <- 10

# Each slave returns c(n, p, partial sums of log kth-NN distances)
# as one column of G.
G <- mpi.remote.exec(knnb.part, "Projects/comp/dim14_2k.dat", c(2000, 14), k, 1)

n <- G[1, 1]
p <- G[2, 1]

# Combine the slaves' partial sums and add the constant terms, giving the
# vector of estimates G_1, ..., G_k (see the formula for G_k above).
G <- p * rowSums(G) / n
G <- G[3:length(G)] + p / 2 * log(pi) - log(gamma(p / 2 + 1)) + log(n) - log(1:k)

# Partial harmonic sum L_k = sum_{i=1}^k 1/i, with L_0 = 0.
lk <- function(k) {
  if (k == 0) {
    return(0)
  }
  sum(1 / (1:k))
}

EulerGamma <- 0.577216

# Bias-corrected estimates H_1, ..., H_k.
H <- G - apply(rbind(0:(k - 1)), 2, lk) + EulerGamma + log(1:k)

SLIDE 31

R code for the Slave

knnb.part <- function(data.file = "", dimension = NULL, k = 1, comm = 1) {
  # Read the sample and shape it into an n x p matrix.
  x <- scan(data.file)
  dim(x) <- dimension

  size <- mpi.comm.size(comm)  # total number of processes
  rank <- mpi.comm.rank(comm)  # this slave's rank

  if (is.matrix(x)) {
    n <- nrow(x)
    p <- ncol(x)
  } else {
    n <- length(x)
    p <- 1
  }

  # The k nearest neighbors are sorted entries 2..k2; entry 1 is the
  # self-distance 0.
  k2 <- k + 1
  R <- matrix(1, nrow = n, ncol = k)

SLIDE 32

R code for the Slave (cont.)

  # Round-robin over rows: slave `rank` handles rows rank, rank + (size - 1), ...
  i <- rank
  while (i <= n) {
    if (p == 1) {
      # Distances from x[i] to all points, dropping the self-distance.
      R[i, ] <- sort(abs(x[i] - x), method = "quick")[2:k2]
    } else {
      # Euclidean distances from row i to all rows (note the sqrt of the
      # sorted squared distances, matching the p == 1 branch).
      R[i, ] <- sqrt(sort(rowSums(sweep(x, 2, x[i, ])^2), method = "quick")[2:k2])
    }
    i <- i + size - 1
  }
  R <- colSums(log(R))  # this slave's partial sums of log kth-NN distances
  return(c(n, p, R))
}

SLIDE 33

ANN Method (Approximate Nearest Neighbors)

⋆ A C++ library for exact and approximate (k) nearest neighbor searching in p-dimensional spaces
⋆ Points are preprocessed into a data structure (e.g., a tree)
⋆ Data are stored in main memory
⋆ Running time is exponential in p
⋆ The distance can be any Minkowski metric

SLIDE 34

Parallelizing ANN

Suppose the sample size is n and m slaves are available, and let l = n/m. More specifically, the details are:
⋆ Build the tree on each slave concurrently
⋆ Slave i (1 ≤ i ≤ m) finds the nearest neighbors for data samples i + jm (0 ≤ j < l); for example, if i = 1 and m = 4, the nearest neighbors of data samples 1, 5, 9, 13, … are searched on slave 1
⋆ Send the search results back to the master for calculating the final result
The advantages of this solution are that it:
⋆ Fully utilizes the ability of each slave
⋆ Minimizes the communication between the master and slaves
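The per-slave search step is easy to express serially; a sketch using the RANN package, a modern R wrapper for the ANN library that this talk predates (the function name slave.search is mine):

# Sketch: slave i searches the ANN kd-tree (built over the full data) for
# the k nearest neighbors of its round-robin subset of rows.
library(RANN)
slave.search <- function(x, k, i, m) {
  idx <- seq(from = i, to = nrow(x), by = m)  # rows assigned to slave i
  # Column 1 of nn.dists is the self-distance 0; columns 2..k+1 are the k NNs.
  nn <- nn2(data = x, query = x[idx, , drop = FALSE], k = k + 1)
  colSums(log(nn$nn.dists[, -1, drop = FALSE]))  # partial sums of log distances
}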

SLIDE 35

Brute-force Algorithm without MPI on One G4 Processor

Sample Size   Dimension   Seconds
1000          1              1.22
2000          1              2.77
10000         1             42.46
1000          7              5.27
2000          7             17.94
10000         7            717.06
1000          14             8.75
2000          14            32.87
10000         14          1518.88

SLIDE 36

Brute-force Algorithm with MPI: 2 Slaves on One 2-CPU G4 Computer

Sample Size   Dimension   Seconds
1000          1              2.11
2000          1              2.99
10000         1             28.27
1000          7              4.65
2000          7             12.85
10000         7            464.11
1000          14             7.05
2000          14            22.35
10000         14           991.65

SLIDE 37

Brute-force Algorithm with MPI: 4 Slaves on Two 2-CPU G4 Computers

Sample Size   Dimension   Seconds
1000          1              2.04
2000          1              2.48
10000         1             16.78
1000          7              3.35
2000          7              7.39
10000         7            231.74
1000          14             4.64
2000          14            12.44
10000         14           499.46

SLIDE 38

ANN Algorithm with MPI: 2 Slaves on One 2-CPU Computer

Sample Size   Dimension   Seconds
1000          1              0.61
2000          1              0.64
10000         1              0.88
1000          7              0.73
2000          7              0.88
10000         7              2.22
1000          14             0.90
2000          14             1.31
10000         14             5.82

SLIDE 39

ANN Algorithm with MPI: 4 Slaves on Two 2-CPU Computers

Sample Size   Dimension   Seconds
1000          1              0.67
2000          1              0.74
10000         1              0.91
1000          7              0.79
2000          7              0.96
10000         7              2.11
1000          14             0.98
2000          14             1.30
10000         14             4.72

SLIDE 40

Conclusions

Computing entropy for moderately sized molecules, such as peptides, is feasible, depending on the number of MD simulations. The ANN algorithms greatly improve the run time for these macromolecules relative to the brute-force method.

However, it is less clear how much cluster computing improves performance for the ANN algorithms. Although the preliminary results above show little improvement, this is likely to change when the number of slaves is increased for large n and/or p. For example, on a data set with 1.44M observations, the ANN kd-tree algorithm took 727.52 seconds to run with 2 slaves and 435.12 seconds with 4 slaves (using the same computers as in the tests above). These results look promising, but more testing must be done before definitive answers can be given on the benefits of using MPI for large n and/or high dimensionality p.