PCA and Admixture proportions for low depth NGS data Anders - PowerPoint PPT Presentation

PCA and Admixture ’proportions’ for low depth NGS data Anders Albrechtsen

Admixture model NGSadmix Introduction to PCA PCA for NGS - genotype likelihood approach analysis based on individual allele frequencies Analysis of low depth sequencing data Admixture proportions Individual allele frequencies (PCA) PCA thHan(40,919) alHan(80,714) SouthHan(20,969)

Admixture model NGSadmix Introduction to PCA PCA for NGS - genotype likelihood approach analysis based on individual allele frequencies Admixture clustering /PCA - which is more informative?

Admixture model NGSadmix Introduction to PCA PCA for NGS - genotype likelihood approach analysis based on individual allele frequencies This morning Admixture model 1 Intro to the model likelihood based on called genotypes NGSadmix 2 ML inference based on genotype likelihoods Introduction to PCA 3 population structure and PCA Problems with PCA analysis NGS data PCA for NGS - genotype likelihood approach 4 The expectation of the covariance analysis based on individual allele frequencies 5 Admixture proportions vs. PCA Inbreeding

Admixture model NGSadmix Introduction to PCA PCA for NGS - genotype likelihood approach analysis based on individual allele frequencies Examples of known solutions and software • Several methods: • Bayesian: e.g. Structure (Pritchard et al. 2000) • Maximum Likelihood: e.g. ADMIXTURE (Alexander et al. 2009) • They all base their inference on called genotypes and infer Admixture proportions, Q 1 Allele frequencies for all loci for all K populations, F 2

Admixture model NGSadmix Introduction to PCA PCA for NGS - genotype likelihood approach analysis based on individual allele frequencies ML solution • To find an ML solution we have to • Define a model/likelihood function p ( G | Q , F ) • Find an efficient way to find argmax p ( G | Q , F ) ( Q , F ) • The latter is usually solved using EM which I will no focus on • I will spend time describing the model/likelihood function G the genotype data F the ancestral frequencies Q the admixture proportions

Admixture model NGSadmix Introduction to PCA PCA for NGS - genotype likelihood approach analysis based on individual allele frequencies Visualized - if we know everything known ancestry

Admixture model NGSadmix Introduction to PCA PCA for NGS - genotype likelihood approach analysis based on individual allele frequencies Likelihood function (1 individual i , 1 diallelic locus j ) Assume K source populations and let • Q i = ( q i 1 , q i 2 , ..., q i K ) be i ’s genomewide admixture proportions • G ij be the genotype of i in j (measured in counts of allele A) • F j = ( f j 1 , f j 2 , ..., f j K ) denote the allele frequencies of allele A Then • for one of i ’s alleles: p ( allele | Q i , F j ) = q i 1 f j 1 + q i 2 f j 2 + ... q i K f j K = π ij • π is also called the individual allele frequency • all individual allele frequencies Π = QF T • Assuming HWE the probability of a observing genotype is: ( π ij ) 2  if G ij = 2 ,  p ( G ij | Q i , F j ) = 2 π ij (1 − π ij ) if G ij = 1 , (1 − π ij ) 2 if G ij = 0 . 

Admixture model NGSadmix Introduction to PCA PCA for NGS - genotype likelihood approach analysis based on individual allele frequencies Likelihood function (N individuals, M diallelic loci) • If we assume: • the individuals are unrelated and thus independent • loci are independent we can write the (composite) likelihood as N M � � p ( G ij | Q i , F j ) p ( G | Q , F ) = i j • ML estimate (like ADMIXTURE): ( ˆ Q , ˆ F ) = argmax p ( G | Q , F ). ( Q , F ) Very large number of parameters M × K + N × (K-1)

Admixture model NGSadmix Introduction to PCA PCA for NGS - genotype likelihood approach analysis based on individual allele frequencies EM algorithm. A single site. New estimate of F

Admixture model NGSadmix Introduction to PCA PCA for NGS - genotype likelihood approach analysis based on individual allele frequencies Some problems (NGS data, variable depth)

Admixture model NGSadmix Introduction to PCA PCA for NGS - genotype likelihood approach analysis based on individual allele frequencies Genotype likelihoods The genotype likelihood p ( X | geno ) Summarise the data in 10 genotype likelihoods A C G T bases: A * * * * TTTCCTTTTTTTTTTTTT C * * * ֌ quality score: G * * BBGHSSBBTTTTGHRSBB T *

Admixture model NGSadmix Introduction to PCA PCA for NGS - genotype likelihood approach analysis based on individual allele frequencies Genotype likelihoods with inferred major and minor alleles The genotype likelihood p ( X | geno ) Summarise data for diallelic site bases: 0 p ( X | geno = 0) TTTCCTTTTTTTT 1 p ( X | geno = 1) ֌ quality score: 2 p ( X | geno = 2) BBGHSSBBTTTTG

Admixture model NGSadmix Introduction to PCA PCA for NGS - genotype likelihood approach analysis based on individual allele frequencies Solution for admixture for low depth : NGSadmix • Works on genotype likelihoods instead of called genotypes • I.e. input is p ( X ij | G ij ) for all 3 possible values of G ij , where X ij is NGS data for individual i at locus j • The previous likelihood is extended from N M � � p ( G ij | Q i , F j ) p ( G | Q , F ) = i j to N M N M � � � � � p ( X ij | Q i , F j ) = p ( X ij | G ij ) p ( G ij | Q i , F j ) p ( X | Q , F ) = i j i j G ij ∈{ 0 , 1 , 2 } • Note that for known genotypes the two are equivalent • A solution is found using an EM-algorithm

Admixture model NGSadmix Introduction to PCA PCA for NGS - genotype likelihood approach analysis based on individual allele frequencies Solution: NGSadmix • Does well even for low depth and variable depth data:

Admixture model NGSadmix Introduction to PCA PCA for NGS - genotype likelihood approach analysis based on individual allele frequencies Ultra low seq? - use reference data e.g. HGDP SNP chip FastNGSadmix, Jorsboe et al 2016 • same model as NGSadmix, but uses a allele frequencies from reference panel • similar to iAdmix (and ADMIXTURE projection) but takes reference size into account

Admixture model NGSadmix Introduction to PCA PCA for NGS - genotype likelihood approach analysis based on individual allele frequencies PCA for NGS: Ancient Eskimo a a Rasmussen et. al. , 2010 Figure: First principal components of selected populations.

Admixture model NGSadmix Introduction to PCA PCA for NGS - genotype likelihood approach analysis based on individual allele frequencies singular value decomposition SVD - singular value decomposition G = UDV T • G does not have to be symmetric PCA for a covariance matrix or pairwise distance √ DV T C = V • The first principal component/eigenvector accounts for as much of the variability in the data as possible • C is symmetric • Optimally the multidimensional data is identically distributed

Admixture model NGSadmix Introduction to PCA PCA for NGS - genotype likelihood approach analysis based on individual allele frequencies Genotype data 5 individuals, genotypes 5 individuals, allele counts SNP1 AG AG AG AA AA SNP1 1 1 1 0 0 SNP2 TT TA AA AT AA SNP2 0 1 2 1 2 SNP3 AA AC AC CC AC SNP3 2 1 1 0 1 SNP4 GG GG GC CC CC SNP4 0 0 1 2 2 SNP5 2 1 1 0 0 SNP5 TT TC TC CC CC SNP6 0 0 1 1 1 SNP6 AA AA AC AC AC SNP7 2 2 1 1 0 SNP7 TT TT TC TC CC

Admixture model NGSadmix Introduction to PCA PCA for NGS - genotype likelihood approach analysis based on individual allele frequencies IBS distances Total Distance Ind1 Ind2 Ind3 Ind4 Ind5 5 individuals Ind1 0 3 7 10 11 Ind2 3 0 4 7 8 SNP1 1 1 1 0 0 Ind3 7 4 0 5 4 SNP2 0 1 2 1 2 Ind4 10 7 5 0 3 SNP3 2 1 1 0 1 Ind5 11 8 4 3 0 SNP4 0 0 1 2 2 SNP5 2 1 1 0 0 SNP6 0 0 1 1 1 1 dimensional projection SNP7 2 2 1 1 0 Ind1 Ind2 Ind3 Ind4 Ind5 1st 0.65 0.36 -0.08 -0.4 -0.53

Admixture model NGSadmix Introduction to PCA PCA for NGS - genotype likelihood approach analysis based on individual allele frequencies Multidimensional scaling Goal Based on pairwise distances reduce the number of dimension by a transformation that preserves the pairwise distances as best as possible. go from dimension N x N to N x S, where S < N

Admixture model NGSadmix Introduction to PCA PCA for NGS - genotype likelihood approach analysis based on individual allele frequencies Principal component analysis for genetic data SVD - singular value decomposition ˜ G = UDV T PCA for a covariance matrix √ G T = C = V G ˜ ˜ DV T • The first principal component/eigenvector accounts for as much of the variability in the data as possible • Can be use to reduce the dimension of the data Goal Capture the population structure in a low dimensional space

PCA and Admixture proportions for low depth NGS data Anders - PowerPoint PPT Presentation

PCA and Admixture proportions for low depth NGS data Anders Albrechtsen Admixture model NGSadmix Introduction to PCA PCA for NGS - genotype likelihood approach analysis based on individual allele frequencies Analysis of low depth

PCA and admixture proportions for NGS data Anders Albrechtsen Admixture model NGSadmix

PCA and Admixture proportions for low depth NGS data Anders Albrechtsen Structured

ECS231 PCA, revisited May 28, 2019 1 / 18 Outline 1. PCA for lossy data compression 2. PCA for

Inferring sites with recent or ongoing selection for NGS data(+admixture/population structure)

MLCC 2015 Dimensionality Reduction and PCA Lorenzo Rosasco UNIGE-MIT-IIT June 25, 2015 Outline

Pathway Analysis Jenny Wu Outline Introduction to NGS data analysis in Cancer Genomics

Admixture of Poisson MRFs (APM): A Topic Model with Word Dependencies David Inouye*, Pradeep

for each dst in my.out_edges if dst.depth > my.depth+1 then dst.depth = my.depth+1

Estimating proportions of elements in finite symmetric and classical groups Alice Niemeyer UWA,

Small area estimation of proportions of Small area estimation of proportions of Arsenic affected

Lecture 22/Chapter 19 Part 4. Statistical Inference Ch. 19 Diversity of Sample Proportions

11/11/2014 Chapter 21 COMPARING TWO PROPORTIONS 1 THE STANDARD DEVIATION OF THE DIFFERENCE

Exploratory Factor Analysis PCA Analysis A Review Precipitation Temperature Ecosystems PCA

Ive Got You Under My Skin: A Comparison of IV and s/c PCA Nick Williamson Clinical Nurse

Lecture 25: Autoencoders Kernel PCA Aykut Erdem January 2017 Hacettepe University Today

Evolution of valley depth and width Evolution of valley depth and width Evolution of valley depth

Teaching Mathematical Biology in High School Joseph Malkevitch Department of Mathematics York

Genome 562 February 2015 Week 7 Genome 562 p.1/5 East and Jones, 1918 Edward M. East

Analysis of Energy Relations between Noise and Vibration Signals in the Scanning Area of an

Health pathways for refugee and asylum seeker children RCH Immigrant Health Service June 2018

to Empirical Economics Lessons from Past Debates for Current Practice James J. Heckman Econ 312,

600.325/425 Declarative Methods Assignment 5: Dynamic Programming Spring 2006 Prof. Jason

Mendelian Genetics & Inheritance Patterns Multiple Choice Review www.njctl.org Slide 3 /

Course Evaluations 1. More examples Worked examples on whiteboard? Concrete examples of

Sambuz

Useful Links

Newsletter

Mail Us

PCA and Admixture proportions for low depth NGS data Anders - PowerPoint PPT Presentation

PCA and Admixture proportions for low depth NGS data Anders Albrechtsen Admixture model NGSadmix Introduction to PCA PCA for NGS - genotype likelihood approach analysis based on individual allele frequencies Analysis of low depth

PCA and admixture proportions for NGS data Anders Albrechtsen Admixture model NGSadmix

PCA and Admixture proportions for low depth NGS data Anders Albrechtsen Structured

ECS231 PCA, revisited May 28, 2019 1 / 18 Outline 1. PCA for lossy data compression 2. PCA for

Inferring sites with recent or ongoing selection for NGS data(+admixture/population structure)

MLCC 2015 Dimensionality Reduction and PCA Lorenzo Rosasco UNIGE-MIT-IIT June 25, 2015 Outline

Pathway Analysis Jenny Wu Outline Introduction to NGS data analysis in Cancer Genomics

Admixture of Poisson MRFs (APM): A Topic Model with Word Dependencies David Inouye*, Pradeep

for each dst in my.out_edges if dst.depth &gt; my.depth+1 then dst.depth = my.depth+1

Estimating proportions of elements in finite symmetric and classical groups Alice Niemeyer UWA,

Small area estimation of proportions of Small area estimation of proportions of Arsenic affected

Lecture 22/Chapter 19 Part 4. Statistical Inference Ch. 19 Diversity of Sample Proportions

11/11/2014 Chapter 21 COMPARING TWO PROPORTIONS 1 THE STANDARD DEVIATION OF THE DIFFERENCE

Exploratory Factor Analysis PCA Analysis A Review Precipitation Temperature Ecosystems PCA

Ive Got You Under My Skin: A Comparison of IV and s/c PCA Nick Williamson Clinical Nurse

Lecture 25: Autoencoders Kernel PCA Aykut Erdem January 2017 Hacettepe University Today

Evolution of valley depth and width Evolution of valley depth and width Evolution of valley depth

Teaching Mathematical Biology in High School Joseph Malkevitch Department of Mathematics York

Genome 562 February 2015 Week 7 Genome 562 p.1/5 East and Jones, 1918 Edward M. East

Analysis of Energy Relations between Noise and Vibration Signals in the Scanning Area of an

Health pathways for refugee and asylum seeker children RCH Immigrant Health Service June 2018

to Empirical Economics Lessons from Past Debates for Current Practice James J. Heckman Econ 312,

600.325/425 Declarative Methods Assignment 5: Dynamic Programming Spring 2006 Prof. Jason

Mendelian Genetics &amp; Inheritance Patterns Multiple Choice Review www.njctl.org Slide 3 /

Course Evaluations 1. More examples Worked examples on whiteboard? Concrete examples of

Sambuz

Useful Links

Newsletter

Mail Us

for each dst in my.out_edges if dst.depth > my.depth+1 then dst.depth = my.depth+1

Mendelian Genetics & Inheritance Patterns Multiple Choice Review www.njctl.org Slide 3 /