Genotype-Environment Effects Analysis Using Bayesian Networks Marco - PowerPoint PPT Presentation

Genotype-Environment Effects Analysis Using Bayesian Networks Marco Scutari 1 , Alison Bentley 2 and Ian Mackay 2 1 scutari@stats.ox.ac.uk Department of Statistics University of Oxford 2 National Institute for Agricultural Botany (NIAB) Cambridge, UK December 7, 2014

Integrative Analyses in Statistical Genetics Bayesian networks (BNs) represent a flexible tool for quantitative [6], qualitative and causal [9] reasoning, and are one of the building blocks used to specify complex models and Monte Carlo inference techniques in machine learning [8]. As such, they are well suited to integrative analyses in genetics and systems biology, that is, jointly modelling data from different sources: • various forms of sequence data (e.g. SNPs, full sequence data); • various qualitative and quantitative traits (e.g. disease scores, morphological characteristics); • epigenetic data (e.g. methylation); • products of gene transcriptions (e.g. RNA, proteins). Depending on the data at hand, such analyses are called GWAS, GS, eQTL, GxE GWAS, mQTL, etc. and make up the vast majority of literature in the field. Marco Scutari University of Oxford

Integrating Two Types of Data: GWAS and GS The baseline model for genome-wide association studies (GWAS) and genomic selection (GS) is the linear mixed model [3], rebranded as GBLUP (Genetic BLUP, [7]). It is typically fitted on a single trait X t at a time using a large number S of SNPs X S in the form of 0/1/2 allele counts from a genome-wide profile: u ∼ N ( 0 , K σ 2 X t = µ + Z S u + ε , u ) where µ is the population mean, Z S is the design matrix for the markers, u are random effects, ε is the error term and K is the kinship matrix encoding the relatedness between the individuals. When K can be expressed in the form T , GBLUP can be shown to be equivalent to the Bayesian linear X S X S regression S � � 0 , σ 2 g � X t = µ + X ∗ s i β i + ε with SNP effect prior β ∼ N , S I i =1 for some transformation of the X s i [10, 11]. Marco Scutari University of Oxford

Gaussian Bayesian Networks (GBNs) GBNs use a DAG G to represent the dependence structure of the multivariate distribution of X = { X 1 , . . . X p } under the following assumptions [6]: 1. X has a multivariate normal distribution; and 2. dependencies between the X i s are linear. Under these assumptions COV( X ) = Σ is a sufficient statistic for the GBN and: 1. if X i and X j are graphically separated in G (d-separation, [6]), then Ω ij = (Σ − 1 ) ij = 0 ; and 2. the local distribution associated with each X i is a linear regression on the parents Π X i of X i , i.e.: ε i ∼ N (0 , σ 2 X i = µ X i + X j β j + . . . + X k β k + ε i , i ) . Note that β j = − Ω ij / Ω ii in the above [2]. Marco Scutari University of Oxford

Assumptions for Genetic Data In the spirit of commonly used additive genetic models [5, 7], we make some further assumptions on the GBN to obtain a sensible causal model: 1. traits can depend on SNPs (i.e. X s i → X t j ) but not vice versa (i.e. not X t j → X s i ), and they can depend on other traits (i.e. X t i → X t j , i � = j ); 2. SNPs can depend on other SNPs (i.e. X s i → X s j , i � = j ); and 3. dependencies between traits follow the temporal order in which they are measured. Under these assumptions, the local distribution of each trait is X t i = µ t i + Π X ti β t i + ε t i ε t i ∼ N (0 , σ 2 = µ t i + X t j β t j + . . . + X t k β t k + X s l β s l + . . . + X s m β s m + ε t i , t i I ) � �� traits SNPs and the local distribution of each SNP is ε s i ∼ N (0 , σ 2 X s i = µ s i + X s l β s l + . . . + X s m β s m + ε s i , s i I ) . � �� SNPs Marco Scutari University of Oxford

Learning GBNs from Genetic Data We used the R packages bnlearn [12] and penalized [4] to implement the following hybrid approach to GBN learning [13]. 1. Structure Learning. 1.1 For each trait X t i , use the SI-HITON-PC algorithm [1] and the t -test for correlation to learn its parents and children; this is sufficient to identify the Markov blanket B ( X t i ) because of the assumptions on the GBN. The choice of SI-HITON-PC is motivated by its similarity to single-SNP analysis. 1.2 Drop all the markers which are not in any B ( X t i ) . 1.3 Learn the structure of the GBN from the nodes selected in the previous step, setting the directions of the arcs as discussed above. We identify the optimal structure as that which maximises BIC. 2. Parameter Learning. Learn the parameters of the local distributions using ordinary least squares or ridge regression. Marco Scutari University of Oxford

A GWAS Model from a Wheat Mapping Population G1033 G1789 G775 G1263 G1941 50 nodes ( 7 traits, 43 SNPs) from 600 obs. and 3 . 2 K SNPs. G847 G2416 G1276 G942 G2318 G1896 78 arcs, interpreted as putative G266 FUS G1984 causal effects. G1906 G599 HT FT Thickness represents arc G832 G2953 strength, computed as the G261 G1294 G1338 frequency of each arc in the G2835 G1853 G383 100 GBNs used in model YR.FIELD G1800 G200 YLD averaging. G418 G2570 G2208 G257 G311 G260 Scutari M, Howell P, Balding G800 G2920 G1217 DJ, Mackay I (2014). MIL G43 G1373 YR.GLASS Multiple Quantitative Trait G2588 Analysis Using Bayesian Net- works. G1945 G1750 Genetics, 198(1), 129–137. G524 G866 G877 G795 Marco Scutari University of Oxford

Adding Environmental Effects: GxE Interactions The BN model in the previous slide has quite a few limitations, especially when interpreted as a causal model: • It only uses SNPs to explain traits; there are multiple levels of unobserved biological processes in the middle acting as confounders. • It assumes all observations are collected under the same conditions (environmental and/or exogenous), which is rarely the case for large experiments, and are homogeneous overall (e.g. no stratification or individuals from different ethnicities/subspecies). • It assumes all variables are continuous, so that they can be meaningfully modelled with linear regression on their natural scale. A step forward in addressing these concerns is moving from GBNs to conditional Linear Gaussian Bayesian networks (CLGBNs) to include environmental effects as discrete variables and model genotype-by-environment interactions (GxE) and those with the traits. Marco Scutari University of Oxford

Conditional Linear Gaussian Bayesian networks (CLGBNs) CLGBNs extend traditional GBNs using mixture of Gaussians under the following assumptions [6, 8]: 1. discrete variables can only have discrete parents; 2. the local distribution for a discrete variable is a conditional probability table (CPT); and 3. the local distribution for a continuous variable is a set of linear regressions, one for each configuration δ of the discrete parents ∆ X i (if any), with the continuous parents Γ X i as explanatory variables: ε iδ ∼ N (0 , σ 2 X iδ = µ iδ + X jδ β jδ + . . . + X kδ β kδ + ε iδ , iδ ) . Note that, unlike most literature on mixture models, the δ does not arise from a latent variable but from an observed one. Marco Scutari University of Oxford

Learning CLGBNs from Genetic Data In addition to the assumptions used to learn GBNs, now we also assume that: • traits and genes can depend environmental effects and experimental variables but not vice versa. And the hybrid learning approach from [13] is modified as follows. 1. Structure Learning. 1.1 For each trait X t i , use the SI-HITON-PC algorithm [1] and the t -test for correlation to learn its parents and children among the genes; then do a second pass also considering the environmental effects using SI-HITON-PC and a log-likelihood ratio test. 1.2 Drop all the markers which are not in any B ( X t i ) . 1.3 Learn the structure of the CLGBN from the nodes selected in the previous step, setting the directions of the arcs as discussed above. We identify the optimal structure as that which maximises BIC. 2. Parameter Learning. Learn the parameters of the local distributions using empirical frequencies for the discrete variables and ordinary least squares or ridge regression for the continuous variables. Marco Scutari University of Oxford

Another Wheat Data Set, From Multiple Countries We prototyped this approach on the wheat population described in: Bentley AR, Scutari M, Gosman N et al. (2014). Applying Association Mapping and Genomic Selection to the Dissection of Key Traits in Elite European Wheat. Theoretical and Applied Genetics, 127(12), 2619–2633. This data set contains 376 wheat varieties from different countries ( 210 FRA, 90 DEU, 75 GBR) trialled in the same set of fields in GBR, DEU and FRA to produce a variety of gene-environment interactions. After preprocessing marker profiles include 2 . 1 K DaRTs and SNPs and 3 known genes: PpdD1 297 (flowering time) and Rht1 267 / Rht2 400 (dwarfing genes). Traits include: • Grain Protein Content (GPC, %) • Yield (YLD, t/ha ) • Thousand Grain Weight (TGW, weight/ hl ) • Flowering time (FT, days) • Specific Weight (SPWT, weight/ hl ) • Height (HT, cm ) • Earing (EAR, ears/ m 2 ) • Winter Kill (WK, 1 – 9 ) • Awns (AWNS, 0 – 1 ) Marco Scutari University of Oxford

Genotype-Environment Effects Analysis Using Bayesian Networks Marco - PowerPoint PPT Presentation

Genotype-Environment Effects Analysis Using Bayesian Networks Marco Scutari 1 , Alison Bentley 2 and Ian Mackay 2 1 scutari@stats.ox.ac.uk Department of Statistics University of Oxford 2 National Institute for Agricultural Botany (NIAB)

Detecting gene-gene interactions in high-throughput genotype data through a Bayesian clustering

A Bayesian clustering approach for detecting gene-gene interactions in high-dimensional genotype

Design of WHO Genotype Panels for HBsAg and HBV-DNA and of WHO anti-HBc Standard WHO Genotype

Application of the GGE biplot to Application of the GGE biplot to evaluate Genotype, Environment

Being Bayesian About Being Bayesian About Net work St ruct ure Net work St ruct ure A Bayesian

Outline Intro to RL and Bayesian Learning History of Bayesian RL Model-based Bayesian

Effects and State Liam OConnor CSE, UNSW (and Data61) Term 2 2019 1 Effects State IO

Meta-Bayesian Analysis A Bayesian decision-theoretic analysis of Bayesian inference under model

CS440/ECE448 Lecture 15: Bayesian Inference and Bayesian Learning Slides by Svetlana Lazebnik,

Bayesian Learning 1 Outline MLE, MAP vs. Bayesian Learning Bayesian Linear Regression

CS 331: Bayesian Networks 2 1 Bayesian Networks Youve heard about how Bayesian networks

Bayesian analysis using Stata Yulia Marchenko Executive Director of Statistics StataCorp LP

Bayesian hierarchical models Bruno Nicenboim / Shravan Vasishth 2020-03-14 1 Bayesian

Bayesian Analysis using Stata Bill Rising StataCorp LP 2016 Brazilian Stata Users Group Meeting

Genotype imputation accuracy with different reference panels Guan-Hua Huang and Yi-Chi Tseng

Lecture 3: Biology Basics Continued Spring 2020 January 28, 2020 Genotype/Phenotype Phenotype:

Photons and Schr odinger Cats: Quantum Optomechanics Lajos Di osi Wigner Center for

A short report on: joint thwg-cswg WP on Definition and selection of non- standard

Evaluating Complement-Modifier Distinctions in a Semantically Annotated Corpus Mark McConville

THE STANDARD C LIBRARY THE STANDARD C LIBRARY Common functions we dont need to write

Bayesian Networks in Reliability The Good, the Bad, and the Ugly Helge Langseth Department of

Understanding and Using Research Lakes Country Service Cooperative Dominique Bradley 11/ 8 / 2018

WHAT IS IS PROCESS TRACING AND WHEN CAN WE USE IT IT? PVM-Y302 Advanced research methods:

Explanation and Reisss Trilemma Philosophy of Economics University of Virginia Matthias