Today. Two populations. Which population?

  1. Today. Two populations. Which population? Single Nucleotide Polymorphism. Modelling. An Analysis of the Power of PCA. A simpler calculation. Musing (rant?) about algorithms in the real world.

  Two populations. DNA data:
    Population 1: snp 843: Pr[A] = .4, Pr[T] = .6
    Population 2: snp 843: Pr[A] = .6, Pr[T] = .4
    human1: A ··· C ··· T ··· A
    human2: C ··· C ··· A ··· T
    human3: A ··· G ··· T ··· T

  Which population? Individual: x1, x2, x3, ..., xn.
    Population 1: snp i: Pr[x_i = 1] = p_i^(1)
    Population 2: snp i: Pr[x_i = 1] = p_i^(2)
    Same population?

  Modelling, a simpler calculation (model: the same population breeds):
    Population 1: Gaussian with mean µ1 ∈ R^d, std dev. σ in each dim.
    Population 2: Gaussian with mean µ2 ∈ R^d, std dev. σ in each dim.
    Individual: x1, x2, x3, ..., xn. Which population?
    Comment: snps could be movie preferences, populations could be types, e.g., republican/democrat, shopper/saver.

  Gaussians:
    Difference between humans: σ per snp.
    Difference between populations: ε per snp.
    How many snps (d) must we collect to determine the population of an individual x?

  Projection, knowing the means (first sketch below):
    Project x onto the unit vector v in the direction µ2 − µ1.
    E[((x − µ1)·v)^2] = σ^2 if x is from population 1.
    E[((x − µ1)·v)^2] ≥ (µ1 − µ2)^2 if x is from population 2.
    If (µ1 − µ2)^2 = dε^2 >> σ^2, the two cases separate → take d >> σ^2/ε^2.
    No loss in signal! The signal is the difference between expectations, roughly dε^2, and the std deviation of the estimator is σ^2 (versus √d σ^2 below). Signal >> Noise ↔ dε^2 >> σ^2 → d >> σ^2/ε^2.

  Don't know µ1 or µ2? Use distances (second sketch below):
    E[|x − µ1|^2] = dσ^2 if x is from population 1.
    E[|x − µ1|^2] ≥ (d − 1)σ^2 + (µ1 − µ2)^2 if x is from population 2.
    Signal: roughly dε^2. Noise: the variance of the estimator is roughly dσ^4, i.e., std deviation √d σ^2.
    Signal >> Noise ↔ dε^2 >> √d σ^2 → need d >> σ^4/ε^4.
    Versus d >> σ^2/ε^2 with known means: a quadratic difference in the amount of data!
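To make the known-means projection test concrete, here is a minimal NumPy sketch (not from the slides: the data is synthetic, and every name and parameter value is my own choice). It draws one individual from each population and thresholds the squared projection onto v.

```python
import numpy as np

rng = np.random.default_rng(0)

d, sigma, eps = 2000, 1.0, 0.2      # needs d >> sigma^2/eps^2 = 25
mu1 = np.zeros(d)
mu2 = mu1 + eps                     # gap eps per snp, so |mu1 - mu2|^2 = d*eps^2

v = (mu2 - mu1) / np.linalg.norm(mu2 - mu1)   # unit vector along mu2 - mu1

def which_population(x):
    # E[((x - mu1) . v)^2] is sigma^2 for population 1 and at least
    # d*eps^2 for population 2; threshold halfway between the two.
    return 1 if ((x - mu1) @ v) ** 2 < (sigma**2 + d * eps**2) / 2 else 2

x1 = mu1 + sigma * rng.standard_normal(d)   # a draw from population 1
x2 = mu2 + sigma * rng.standard_normal(d)   # a draw from population 2
print(which_population(x1), which_population(x2))   # typically prints: 1 2
```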

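A matching sketch for the unknown-means distance calculation (same caveats: synthetic data, my own names and values). It shows the 2dσ^2 baseline, the dε^2 signal on cross-population pairs, and why the √d σ^2 fluctuations force d >> σ^4/ε^4.

```python
import numpy as np

rng = np.random.default_rng(1)

d, sigma, eps = 5000, 1.0, 0.4      # needs d >> sigma^4/eps^4 ~ 39
mu1 = np.zeros(d)
mu2 = mu1 + eps

x1 = mu1 + sigma * rng.standard_normal(d)   # population 1
x2 = mu1 + sigma * rng.standard_normal(d)   # population 1 again
y1 = mu2 + sigma * rng.standard_normal(d)   # population 2

within = np.sum((x1 - x2) ** 2)   # ~ 2*d*sigma^2           = 10000
across = np.sum((x1 - y1) ** 2)   # ~ 2*d*sigma^2 + d*eps^2 = 10800
print(within, across)
# The signal d*eps^2 = 800 beats the ~ sqrt(d)*sigma^2 fluctuation of a
# squared distance (a couple hundred here), so same-type pairs are
# typically closer and threshold clustering works.
```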
  2. Without the means? Principal components analysis.

  Setting: a sample of n people, some (say half) from population 1, some from population 2. Don't know µ1 or µ2. Which are which?

  Near neighbors approach:
    Compute squared Euclidean distances (O(d) each); cluster using a threshold.
    Signal: E[d(x1, x2)] − E[d(x1, y1)], where the x's are from one population and the y's from the other, should be larger than the noise in d(x, y).
    Signal is proportional to dε^2; noise is proportional to √d σ^2.
    d >> σ^4/ε^4 → same-type people are closer to each other, so nearest neighbors work with very high probability.
    d >> (σ^4/ε^4) log n suffices for threshold clustering (the log n is from the union bound over pairs).

  Principal component analysis:
    Find the direction v of maximum variance: maximize Σ_x (x·v)^2 (zero-center the points).
    Recall: (x·v)^2 could determine the population.
    A typical direction has variance nσ^2; the direction along µ1 − µ2 has variance ∝ n(µ1 − µ2)^2 ∝ ndε^2.
    Signal (expected projection): ∝ ndε^2. Noise (std dev.): ∝ √n σ^2 per direction.
    When will PCA pick the correct direction with good probability? Union bound over directions. How many directions? Infinity and beyond! Effectively at most 2^(number of sample points). Best one can do?

  Nets:
    "δ-net": a set D of directions such that every direction v is close to some x ∈ D: x·v ≥ 1 − δ.
    Construction: coordinates iδ/√d for integers i ∈ [−√d/δ, √d/δ]; a total of N ∝ (√d/δ)^d vectors in the net.
    Signal >> Noise times a log N = O(d log(d/δ)) factor isolates the direction; the log N is due to the union bound over vectors in the net.
    nd >> (σ^4/ε^4) log d together with d >> σ^2/ε^2 works; d >> σ^2/ε^2 is needed at least.
    So PCA can reduce d to the "knowing the centers" case, with a reasonable √(log n) factor for the union bound over pairs.

  PCA calculation (computing eigenvalues):
    Matrix A whose rows are the (zero-centered) points. Av is the vector of projections onto v, so v^T B v = (Av)^T (Av) = Σ_x (x·v)^2, where B = A^T A.
    The first eigenvector of B is the maximum-variance direction: it maximizes v^T B v over unit v.
    Bv = λv for the maximum eigenvalue λ → v^T B v = λ for unit v.
    Eigenvectors form an orthonormal basis, so any other unit vector av + x with x·v = 0 is composed of eigenvectors with possibly smaller eigenvalues → v^T B v ≥ (av + x)^T B (av + x) for unit vectors v and av + x.

  Power method (first sketch below):
    Choose a random x. Repeat: x = Bx; scale x to a unit vector.
    Writing x = a1 v1 + a2 v2 + ···, we get x_t ∝ B^t x = a1 λ1^t v1 + a2 λ2^t v2 + ···, which is mostly v1 after a while, since λ1^t >> λ2^t.

  Cluster algorithm (second sketch below):
    Choose a random partition. Repeat: compute the means of the parts; project, cluster.
    The generic clustering algorithm is a rounded version of the power method: choose a random +1/−1 vector; multiply by A^T (the direction between the means); multiply by A (project the points); cluster (round back to a +1/−1 vector). That is, sort of repeatedly multiplying by A A^T: the power method.

  Sum up:
    Clustering a mixture of Gaussians.
    Near neighbors work with sufficient data.
    Projection onto the subspace of the means is better.
    Principal component analysis can find the subspace of the means.
    The power method computes the principal component.
    A generic clustering algorithm is a rounded version of the power method.
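Here is a minimal sketch of the power method described above (synthetic data; the names, parameter values, and fixed iteration count are my own). The rows of A are zero-centered points from the two-population model, and iterating x ← Bx with B = A^T A pulls x toward the maximum-variance direction, which aligns with µ2 − µ1.

```python
import numpy as np

rng = np.random.default_rng(2)

n, d, sigma, eps = 200, 500, 1.0, 0.4
labels = rng.integers(0, 2, n)                       # hidden population of each row
A = eps * labels[:, None] + sigma * rng.standard_normal((n, d))
A = A - A.mean(axis=0)                               # zero-center the points

def top_eigenvector(A, iters=100):
    # Power method on B = A^T A: repeat x = Bx, rescale to a unit vector.
    # Writing x = a1*v1 + a2*v2 + ..., B^t x = a1*l1^t*v1 + ... is mostly
    # v1 after a while, since l1^t >> l2^t.
    x = rng.standard_normal(A.shape[1])
    for _ in range(iters):
        x = A.T @ (A @ x)        # x = Bx, without ever forming B explicitly
        x /= np.linalg.norm(x)
    return x

v = top_eigenvector(A)
u = np.ones(d) / np.sqrt(d)      # true unit direction of mu2 - mu1
print(abs(v @ u))                # close to 1: PCA found the gap direction
```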

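And a sketch of the generic clustering algorithm as a rounded power method (same synthetic setup; my own names and values, and convergence to the hidden partition is typical for these parameters, not guaranteed). Each round multiplies by A^T (direction between the tentative means), then by A (project the points), then rounds back to a ±1 labeling.

```python
import numpy as np

rng = np.random.default_rng(3)

n, d, sigma, eps = 200, 500, 1.0, 0.4
truth = 2 * rng.integers(0, 2, n) - 1                # hidden +1/-1 populations
A = (eps / 2) * truth[:, None] + sigma * rng.standard_normal((n, d))
A = A - A.mean(axis=0)                               # zero-center the points

s = rng.choice([-1.0, 1.0], size=n)                  # random +1/-1 partition
for _ in range(30):
    direction = A.T @ s                              # ~ difference between the two means
    s = np.sign(A @ direction)                       # project each point, round to +/-1
    s[s == 0] = 1.0                                  # break ties

print(abs(np.mean(s * truth)))                       # 1.0 = perfect recovery (up to sign)
```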
  3. See you on Tuesday.
