PCA and admixture proportions for NGS data Anders Albrechtsen

Admixture model NGSadmix Introduction to PCA PCA for NGS Ancient Eskimo a a Rasmussen et. al. , Nature 2010 Figure: First principal components of selected populations.

Admixture model NGSadmix Introduction to PCA PCA for NGS Admixture: Inuit vs. European

Admixture model NGSadmix Introduction to PCA PCA for NGS 26.5% European admixture in Greenland

Admixture model NGSadmix Introduction to PCA PCA for NGS Admixture: PCA analysis

Admixture model NGSadmix Introduction to PCA PCA for NGS What hope you get out of this session Learning objectives • Be able to understand the underlying model for PCA and admixture analysis • Understand the problem with genotype uncertainty for NGS data • Learn some tricks and tools to overcome the problem • Be able to do PCA and to infer ancestry proportions from NGS data

Admixture model NGSadmix Introduction to PCA PCA for NGS Outline Admixture model 1 Intro to the problem Maximum likelihood (ML) solution based on called genotypes NGSadmix 2 ML inference based on genotype likelihoods Introduction to PCA 3 population structure and PCA Problems with PCA analysis NGS data PCA for NGS 4 The expectation of the covariance

Admixture model NGSadmix Introduction to PCA PCA for NGS Basic problem • Underlying assumption : All analyzed individuals have DNA that originates from one or more of K source populations. • Consequence : they will have some proportion of DNA from each of the K populations; these proportions are called admixture proportions • Examples with K=2: • A dataset consisting of samples from two different populations • Two populations that have mixed • Goal: to infer overall admixture proportions (+ identify clusters)

Admixture model NGSadmix Introduction to PCA PCA for NGS Overview of well known solutions • Several methods: • Bayesian: e.g. Structure (Pritchard et al. 2000, BABS - Mikko J Sillanp • Maximum Likelihood: e.g. Admixture (Alexander et al. 2009) • They all base their inference on called genotypes and infer Admixture proportions, Q 1 Allele frequencies for all loci for all K populations, F 2

Admixture model NGSadmix Introduction to PCA PCA for NGS ML solution • To find an ML solution we have to • Define a model/likelihood function p ( data | Q , F ) • Find an efficient way to find argmax p ( data | Q , F ) ( Q , F ) • The latter is usually solved using EM which I will no focus on • I will spend time describing the model/likelihood function

Admixture model NGSadmix Introduction to PCA PCA for NGS Likelihood function (1 individual i , 1 diallelic locus j ) Assume K source populations and let • Q i = ( q i 1 , q i 2 , ..., q i K ) be i ’s genomewide admixture proportions • G ij be the genotype of i in j (measured in counts of allele A) • F j = ( f j 1 , f j 2 , ..., f j K ) denote the allele frequencies of allele A Then • for one of i ’s alleles: p ( allele | Q i , F j ) = q i 1 f j 1 + q i 2 f j 2 + ... q i K f j K = h ij • therefore assuming HWE we get the likelihood function: ( h ij ) 2  if G ij = 2 ,  p ( G ij | Q i , F j ) = 2 h ij (1 − h ij ) if G ij = 1 , (1 − h ij ) 2 if G ij = 0 . 

Admixture model NGSadmix Introduction to PCA PCA for NGS Likelihood function (N individuals, M diallelic loci) • If we assume: • the individuals are unrelated and thus independent • loci are independent we can write the likelihood as N M � � p ( G ij | Q i , F j ) p ( G | Q , F ) = i j with G = ( G ij ), Q = ( Q 1 , Q 2 , ... Q N ) and F = ( F 1 , F 2 , ... F M ). • Based on this we can find ML estimates: ( ˆ Q , ˆ F ) = argmax p ( G | Q , F ). ( Q , F ) Very large number of parameters M × K + N × (K-1)

Admixture model NGSadmix Introduction to PCA PCA for NGS reformulation of the likelhood the likelihood - G ij = g 1 ij + g 2 ij N M � � p ( G ij | Q i , F j ) p ( G | Q , F ) = i j N M 2 � � � p ( g d ij | Q i , F j ) ∝ i j d 2 N M � � � � p ( g d ij | Q i , F j , A ) p ( A | Q i , F j ) = i j d A N M 2 � � � � p ( g d ij | F j , A ) p ( A | Q i ) = i j d A ij | F j , A ) = f j ij ) + (1 − f j p ( A | Q i ) = q i A and p ( g d A I 0 ( g d A ) I 1 ( g d ij )

Admixture model NGSadmix Introduction to PCA PCA for NGS Some problems (NGS data, variable depth)

Admixture model NGSadmix Introduction to PCA PCA for NGS Genotype likelihoods The genotype likelihood p ( data | geno ) Summarise the data in 10 genotype likelihoods A C G T bases: A * * * * TTTCCTTTTTTTTTTTTT C * * * ֌ quality score: G * * BBGHSSBBTTTTGHRSBB T *

Admixture model NGSadmix Introduction to PCA PCA for NGS Genotype likelihoods The genotype likelihood p ( data | geno ) Summarise data for diallelic site bases: 0 p ( data | geno = 0) TTTCCTTTTTTTTTTTTT 1 p ( data | geno = 1) ֌ quality score: 2 p ( data | geno = 2) BBGHSSBBTTTTGHRSBB

Admixture model NGSadmix Introduction to PCA PCA for NGS Solution: NGSadmix • Works on genotype likelihoods instead of called genotypes • I.e. input is p ( X ij | G ij ) for all 3 possible values of G ij , where X ij is NGS data for individual i at locus j • The previous likelihood is extended from N M p ( G ij | Q i , F j ) � � p ( G | Q , F ) = i j to N M N M � � p ( X ij | Q i , F j ) = � � � p ( X ij | G ij ) p ( G ij | Q i , F j ) p ( X | Q , F ) = i j i j Gij ∈{ 0 , 1 , 2 } • Note that for known genotypes the two are equivalent • A solution is found using an EM-algorithm

Admixture model NGSadmix Introduction to PCA PCA for NGS Solution: NGSadmix • Does well even for low depth and variable depth data:

Admixture model NGSadmix Introduction to PCA PCA for NGS Principal component analysis SVD X = UDV T PCA for a covariance matrix MM T = C = VDV T • The first principal component/eigenvector accounts for as much of the variability in the data as possible • Can be use to reduce the dimension of the data Goal Capture the population structure in a low dimensional space

Admixture model NGSadmix Introduction to PCA PCA for NGS Genotype data

Admixture model NGSadmix Introduction to PCA PCA for NGS The first principal component

Admixture model NGSadmix Introduction to PCA PCA for NGS Measure for pairwise differences Identical by descent (IBS) matrix - used in MDS The average genotype distance between individuals pros fast cons Ignores allele frequency cons Problems with some kinds of missingness Covariance / correlation matrix - used in PCA pros good weighting scheme for each site cons Slower and cannot easily deal with any kind of missing data

Admixture model NGSadmix Introduction to PCA PCA for NGS Approximation of the genotype covariance M number of sites G genotypes G j genotypes for individual j G j k genotypes for site k in individual j f k allele frequency for site k Known genotypes - covariance between individuals i and j a a Patterson N, Price AL, Reich D, plos genet. 2006 M k − 2 f k )( G j ( G i cov ( G i , G j ) = 1 k − 2 f k ) = 1 � G ˜ ˜ G T 2 f k (1 − f k ) M M k =1 G i k − 2 f k ˜ G i k = var ( G k ) = 2 f k (1 − f k ) � 2 f k (1 − f k )

Admixture model NGSadmix Introduction to PCA PCA for NGS IBS matrix genotypes Assuming the genotypes are coded as { 0 , 1 , 2 } Commonly used distance measure k = G j G i k ∧ G i  0 if k � = 1  k = G j  G i k ∧ G i 1 if k = 1  k , G j d ( G i k ) = k − G j | G i 1 if k | = 1   k − G j  | G i 2 if k | = 2 IBS distance between individual i and j M d ( G i , G j ) = 1 � k , G j d ( G i k ) M k =1

Admixture model NGSadmix Introduction to PCA PCA for NGS The two first principal component 0.1 Pop 1 0.0 ● PC 2 Pop 2 ● Pop 3 ● −0.1 −0.2 −0.20 −0.15 −0.10 −0.05 0.00 0.05 0.10 0.15 PC 1

Admixture model NGSadmix Introduction to PCA PCA for NGS Early use of PCA in genetics Shown 1 PC at 400 locations Science 1978 Menozzi P, Piazza A, Cavalli-Sforza L. Data 38 loci

Admixture model NGSadmix Introduction to PCA PCA for NGS PCA mania Eigenstrat a Genetic map from PCA a a Price et. al, nat genet. 2008 a Novembre et. al, nat genet. 2008

Admixture model NGSadmix Introduction to PCA PCA for NGS Sample size/information bias Uneven sample sizes will bias both the distance and the pattern a a McVean G PLoS Genet. (2009)

Admixture model NGSadmix Introduction to PCA PCA for NGS Dealing with Missingness Covariance matrix - Eigensoft a a Patterson N, Price AL, Reich D, plos genet. 2006 If a genotype is missing then ˜ G k i is set to zero • E [˜ G i k ] = 0 for a random individual • E [ cov ( G i , G j )] = 0 without relatedness or admixture. or a site is discarded • Not possible for large samples • Will likely cause ascertainment bias IBS matrix The site is skipped for the pair of individuals • Missingness must be random

PCA and admixture proportions for NGS data Anders Albrechtsen - PowerPoint PPT Presentation

PCA and admixture proportions for NGS data Anders Albrechtsen Admixture model NGSadmix Introduction to PCA PCA for NGS Ancient Eskimo a a Rasmussen et. al. , Nature 2010 Figure: First principal components of selected populations. Admixture

PCA and Admixture proportions for low depth NGS data Anders Albrechtsen Admixture model

PCA and Admixture proportions for low depth NGS data Anders Albrechtsen Structured

ECS231 PCA, revisited May 28, 2019 1 / 18 Outline 1. PCA for lossy data compression 2. PCA for

Inferring sites with recent or ongoing selection for NGS data(+admixture/population structure)

MLCC 2015 Dimensionality Reduction and PCA Lorenzo Rosasco UNIGE-MIT-IIT June 25, 2015 Outline

Pathway Analysis Jenny Wu Outline Introduction to NGS data analysis in Cancer Genomics

Admixture of Poisson MRFs (APM): A Topic Model with Word Dependencies David Inouye*, Pradeep

Estimating proportions of elements in finite symmetric and classical groups Alice Niemeyer UWA,

Small area estimation of proportions of Small area estimation of proportions of Arsenic affected

Lecture 22/Chapter 19 Part 4. Statistical Inference Ch. 19 Diversity of Sample Proportions

11/11/2014 Chapter 21 COMPARING TWO PROPORTIONS 1 THE STANDARD DEVIATION OF THE DIFFERENCE

Exploratory Factor Analysis PCA Analysis A Review Precipitation Temperature Ecosystems PCA

Ive Got You Under My Skin: A Comparison of IV and s/c PCA Nick Williamson Clinical Nurse

Lecture 25: Autoencoders Kernel PCA Aykut Erdem January 2017 Hacettepe University Today

Genomics infrastr Genomics infrastruc ucture f ure for NGS r NGS 2013 Winter School

The NGS WFS of MAORY Presented by Marco Bonaglia Adoni workshop Padova, 10th-12th April 2017

RADIO AND THE DEVELOPMENT OF MODERN BROADCASTING (THE SHORT VERSION) History of Information

Source: Standard and Chartered 2 Can instill confidence A decentralized database between actors

FROM HPC TO CLOUD AND BACK AGAIN? SAGE WEIL PDSW 2014.11.16 AGENDA A bit of history

The Emergent Church and Last Days Ecumenism By Dr. Andy Woods Adapted from Roger Oakland, Faith

Clinical Trials for Adaptive Intervention Designs: Workshop on the Design and Conduct of

S chool Rove r FINAL TERM FOR 2010 (above) Jawa Guligo preparing for the Term 4 battle The first

GUI Interaction Using GUI interfaces Benefits of using a GUI Direct Manipulation

Mo deling TCP Throughput: A Simple Mo del and its Empirical V alidation Jitendra

Sambuz

Useful Links

Newsletter

Mail Us