
Welcome back. Two populations. Which population?



  1. Welcome back.
     - Projects comments available on Glookup! Turn in homework!
     - I am away April 15-20. Midterm out when I get back: a few days, takehome.
     - Have a handle on projects before that. Progress report due Monday.

     Two populations.
     - DNA data (snp = Single Nucleotide Polymorphism):
       Population 1: snp 843: Pr[A] = .4, Pr[T] = .6
       Population 2: snp 843: Pr[A] = .6, Pr[T] = .4
       human1: A ··· C ··· T ··· A
       human2: C ··· C ··· A ··· T
       human3: A ··· G ··· T ··· T
     - Same population? Model: same population breeds. Shiftable.

     Which population?
     - Individual: x_1, x_2, x_3, ..., x_n.
     - Population 1: snp i: Pr[x_i = 1] = p_i^(1). Population 2: snp i: Pr[x_i = 1] = p_i^(2).
     - Comment: snps could be movie preferences, and populations could be types, e.g., republican/democrat, shopper/saver.

     Gaussians.
     - Population 1: Gaussian with mean µ1 ∈ R^d, standard deviation σ in each dimension.
     - Population 2: Gaussian with mean µ2 ∈ R^d, standard deviation σ in each dimension.
     - Difference between humans: σ per snp. Difference between populations: ε per snp.
     - How many snps must we collect to determine the population of an individual x?

     Simpler calculation: distances to known means.
     - E[(x − µ1)^2] = dσ^2 if x is from population 1.
     - E[(x − µ2)^2] ≥ (d − 1)σ^2 + (µ1 − µ2)^2.
     - With (µ1 − µ2)^2 = dε^2, the signal is the difference between expectations: roughly dε^2.
     - Variance of the estimator? Roughly dσ^4, i.e., noise (std deviation) roughly √d σ^2.
     - Signal >> Noise ↔ dε^2 >> √d σ^2. Need d >> σ^4/ε^4.

     Projection.
     - Project x onto the unit vector v in the direction µ2 − µ1.
     - E[(x − µ1)·v] = 0 if x is in population 1.
     - E[((x − µ1)·v)^2] ≥ (µ1 − µ2)^2 if x is in population 2.
     - Std deviation is σ^2, versus √d σ^2, with no loss in signal!
     - dε^2 >> σ^2 → take d >> σ^2/ε^2. Versus d >> σ^4/ε^4: a quadratic difference in the amount of data! (Sketch below.)
     - Don't know much about... µ1 or µ2?
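The two rules differ quadratically in the data they need: d >> σ^4/ε^4 for comparing squared distances versus d >> σ^2/ε^2 after projecting. Here is a minimal numpy sketch of both rules; all parameter values and names are my own illustrative choices, picked to sit between the two thresholds.

```python
# Hypothetical simulation of the two decision rules; not from the lecture.
import numpy as np

rng = np.random.default_rng(0)
d, sigma, eps = 2000, 1.0, 0.1   # d >> sigma^2/eps^2 = 100,
                                 # but d < sigma^4/eps^4 = 10000
mu1 = np.zeros(d)
mu2 = mu1 + eps                  # so that (mu1 - mu2)^2 = d * eps^2

x = mu2 + sigma * rng.standard_normal(d)   # an individual from population 2

# Rule 1: threshold ||x - mu1||^2 halfway between d*sigma^2 and
# d*sigma^2 + d*eps^2.  The gap d*eps^2 = 20 is below the noise
# ~ sqrt(2d)*sigma^2 ~ 63, so at this d the test is unreliable.
says_pop2_dist = np.sum((x - mu1) ** 2) > d * sigma**2 + d * eps**2 / 2

# Rule 2: project onto v = (mu2 - mu1)/||mu2 - mu1|| and threshold at the
# midpoint.  Along v the noise std is only sigma = 1, while the separation
# is sqrt(d)*eps ~ 4.5, so this is correct with good probability.
v = (mu2 - mu1) / np.linalg.norm(mu2 - mu1)
says_pop2_proj = (x - (mu1 + mu2) / 2) @ v > 0

print(says_pop2_dist, says_pop2_proj)  # rule 1 is near a coin flip; rule 2 works
```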

  2. Without the means?
     - Sample of n people: some (say half) from population 1, some from population 2.
     - Don't know µ1 or µ2. Which are which?

     Near neighbors approach.
     - Compute squared Euclidean distances: O(d) per pair. Cluster using a threshold. (First sketch after this slide.)
     - The signal E[d(x_1, x_2)] − E[d(x_1, y_1)] should be larger than the noise in d(x, y), where the x's are from one population and the y's from the other.
     - Signal is proportional to dε^2; noise is proportional to √d σ^2.
     - d >> σ^4/ε^4 → same-type people are closer to each other, so nearest neighbor works.
     - d >> (σ^4/ε^4) log n suffices for threshold clustering; the log n factor is a union bound over the (n choose 2) pairs.

     Principal components analysis.
     - Remember projection! (x·v)^2 could determine population. Don't know µ1 or µ2?
     - Principal component analysis: find the direction v of maximum variance, i.e., maximize ∑(x·v)^2 (zero-center the points).
     - A typical direction has variance nσ^2; the direction along µ1 − µ2 has variance ∝ n(µ1 − µ2)^2 = ndε^2.
     - Signal (expected projection): ∝ ndε^2. Noise (std deviation): ∝ √n σ^2.
     - When will PCA pick the correct direction with good probability? Union bound over directions. How many directions? Infinity and beyond!

     Nets.
     - "δ-net": a set D of directions such that every other direction v is close to some x ∈ D: x·v ≥ 1 − δ.
     - Construction: grid vectors [···, iδ/√d, ···] with integers i ∈ [−√d/δ, √d/δ]; a total of N ∝ (√d/δ)^d vectors in the net.
     - Signal >> Noise × log N, with log N = O(d log(d/δ)), isolates the direction; the log N is due to the union bound over vectors in the net.
     - nd >> (σ^4/ε^4) log d together with d >> σ^2/ε^2 works. Need d >> σ^2/ε^2 at least.
     - PCA can reduce to the "knowing the centers" case with a reasonable number of sample points. Best one can do?

     PCA calculation.
     - Matrix A whose rows are the points. The first eigenvector of B = A^T A is the maximum-variance direction.
     - Av is the vector of projections onto v, so v^T B v = (Av)^T (Av) = ∑_x (x·v)^2.
     - The first eigenvector v of B maximizes v^T B v: Bv = λv for the maximum λ → v^T B v = λ for unit v.
     - Eigenvectors form an orthonormal basis, so any other unit vector av + x with x·v = 0 is composed of possibly smaller-eigenvalue vectors → v^T B v ≥ (av + x)^T B (av + x) for unit vectors v and av + x.

     Computing eigenvalues: the power method. (Second sketch below.)
     - Choose a random x. Repeat: let x = Bx; scale x to a unit vector.
     - Write x = a_1 v_1 + a_2 v_2 + ···. Then x_t ∝ B^t x = a_1 λ_1^t v_1 + a_2 λ_2^t v_2 + ···
     - Mostly v_1 after a while, since λ_1^t >> λ_2^t.

     Cluster algorithm. (Third sketch below.)
     - Choose a random partition. Repeat: compute the means of the partition; project; cluster.
     - The generic clustering algorithm is a rounded version of the power method: choose a random +1/−1 vector; multiply by A^T (the direction between the means); multiply by A (project the points); cluster (round back to a +1/−1 vector).
     - Sort of repeatedly multiplying by A A^T: the power method.

     Sum up.
     - Clustering a mixture of Gaussians:
     - Near neighbor works with sufficient data.
     - Projection onto the subspace of the means is better.
     - Principal component analysis can find the subspace of the means.
     - The power method computes the principal component.
     - The generic clustering algorithm is a rounded version of the power method.
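A sketch of the near-neighbors/threshold idea on synthetic data. The sample layout, parameter values, and names are my own assumptions, chosen so that d >> (σ^4/ε^4) log n.

```python
# Hypothetical near-neighbor / threshold clustering sketch; the sample
# sizes and parameters below are illustrative, not from the lecture.
import numpy as np

rng = np.random.default_rng(1)
n, d, sigma, eps = 40, 10000, 1.0, 0.5   # d >> (sigma^4/eps^4) * log n
mu1 = np.zeros(d)
mu2 = mu1 + eps

# First half of the sample from population 1, second half from population 2.
X = np.vstack([mu1 + sigma * rng.standard_normal((n // 2, d)),
               mu2 + sigma * rng.standard_normal((n // 2, d))])

# Pairwise squared Euclidean distances via the Gram matrix: O(d) per pair.
norms = np.sum(X ** 2, axis=1)
sq = norms[:, None] + norms[None, :] - 2 * X @ X.T

# Same-population pairs concentrate near 2*d*sigma^2, cross-population
# pairs near 2*d*sigma^2 + d*eps^2; threshold at the midpoint.
threshold = 2 * d * sigma**2 + d * eps**2 / 2
same_as_0 = sq[0] < threshold            # everyone close to point 0
print(same_as_0[:n // 2].all(), (~same_as_0[n // 2:]).all())  # True True w.h.p.
```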
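The PCA-calculation and power-method slides, rendered as one short sketch. The function name and defaults are assumptions of mine.

```python
# Power method for the top eigenvector of B = A^T A, i.e. the direction
# of maximum variance of the points in the rows of A (a sketch).
import numpy as np

def principal_direction(A, iters=100, seed=0):
    A = A - A.mean(axis=0)            # zero-center the points
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = A.T @ (A @ v)             # v <- B v, without forming B = A^T A
        v /= np.linalg.norm(v)        # scale back to a unit vector
    return v                          # mostly v_1 once lambda_1^t >> lambda_2^t
```

Used on a mixture sample (e.g. the matrix X from the previous sketch), the returned direction should line up with µ2 − µ1 when the separation conditions hold, reducing to the "knowing the centers" case.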
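Finally, the "generic clustering algorithm as a rounded power method", written out as a sketch. This is my own rendering of the slide's loop, not verified course code.

```python
# Rounded power method: alternate "direction between the partition means"
# and "project and re-round" steps (a sketch of the slide's pseudocode).
import numpy as np

def rounded_power_cluster(A, iters=20, seed=0):
    A = A - A.mean(axis=0)                        # zero-center the points
    rng = np.random.default_rng(seed)
    s = rng.choice([-1.0, 1.0], size=A.shape[0])  # random +1/-1 partition
    for _ in range(iters):
        v = A.T @ s      # multiply by A^T: direction between partition means
        p = A @ v        # multiply by A: project every point onto it
        s = np.sign(p)   # cluster: round back to a +1/-1 vector
    return s             # in effect, repeated multiplication by A A^T
```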

  3. See you on Thursday.
