1 Analysis of SNP data with R The data we will be working with here - PDF document

Statistical methods in bioinformatics April 9, 2018 Exercises day 2 Claus Ekstrøm These exercises will use a combination of R and specific programmes. The exercises are build up over two types of problems: introductory text that you should run through to make sure you understand how to use and a few additional questions for you to explore. 1 Analysis of SNP data with R The data we will be working with here can be read into R using the following two lines (ensure that you use upper and lower case exactly as listed below). You will need internet access since the data are read down directly from a web-site. geno <- read.table(url("http://www.biostatistics.dk/Genotypes.txt")) pheno <- read.table(url("http://www.biostatistics.dk/Phenotypes.txt")) y <- pheno$V1 The rows in the two files match up such that the genotypes for row 1 matches the first phenotype. • How many genes are in our ”small” example? • How many individuals are in the dataset? • What type of output variable/phenotype do we have here? • Make an analysis of the association between the phenotype and the first SNP. Are they associated? [Hint: You can either use a logistic regression model or just analyze this as a Gaussian model here — it should probably be a logistic regression model.] • Make a full SNP analysis as shown at the lectures. • Create a Manhattan plot with your results. • Try to analyze the data using lasso. Compare the results to the previous analysis. 1

2 Find linkage disequilibrium (LD) blocks The following text explains how to make a heat map in R . Heat map plots are used to visualize data from a two-dimensional array using colors to indicate the values in the array. In R , the function heatmap plots a heat map, and it requires a numeric matrix as its first argument. There are several formulas for computing linkage disequilibrium between two genes. A simple measure that is fast to compute and easy to interpret is the correlation coefficient. cor computes the correlation matrix for all pairwise correlations. • Read in the hapmap1 and qt data from the web page (see instruction in the slides). • Compute the correlation matrix for the first 100 genes/SNPs. • Use heatmap(x, Rowv=NA, Colv=NA) ( x is the matrix of correlation values computed above). What do you see in the graph? • What happens if you do not include the arguments Rowv=NA, Colv=NA ? The color scheme of the heat map plot can be set with the col option. By default it uses heat colors obtained from the heat.colors function, but that can be changed to give other color schemes. If the options ColSideColors and RowSideColors are provided, they should contain a vector of color names for a horizontal/vertical side bar that annotates the columns and rows of the data matrix, respectively. This can be used to indicate known groupings by color code as shown below. 3 Population admixture The researchers are nervous that there is population admixture in the data. This essentially means that the full population in reality contains two or more ”hidden” sub-populations. • Why is population admixture a potentially big problem? We can use principal component analysis based on all SNPs to produce a combined measure (of the SNPs). If there is population substructure then the allele frequencies are likely to be different among different sub-populations and across many SNPs and this effect is likely to be captured by the principal component since they are computed regardless of the outcome. 2

• Read in the hapmap1 and qt data from the web page (see instruction in the slides). • Make a full analysis where you also take potential population structure into account. How does this change the results? [ Hint: compute a principal component or two and check if they con- tribute anything to the analyses. ] • Which markers should ideally be used for the principal components? 4 Rare variants We want to see if there is any reason to consider a burden/SKAT test in order to find association between rare variants and the outcome. • Compute the minor allele frequency for each of the genes. What is the distribution of allele frequencies? You can use the colMeans function to compute the mean of the columns in the genotype data. • For each of the rare SNPs: Identify the SNPs found to be in the same clusters using the heatmap from above and make a simple burden test for each of the clusters identified. An example of this can be analyzed in R as follows (needs to be modified): > library("lme4") > # Include, say, V6 and V7 as random variable > y <- pheno$V3 > fullmodel <- lmer(y ~ 1 + (1|V6) + (1|V7), data=geno2, REML=FALSE) > smallmodel <- lm(y ~ 1, data=geno2) > logLik(fullmodel) > logLik(smallmodel) • How should we compare the results? 5 ”Heritability” In this exercise we will try to estimate the proportion of variance explained. To do this we will use the lmekin function from the coxme package. • We should start by computing the allele frequencies for each of the SNPs. Use the colMeans function. 3

• The mean of each SNP is 2 p where p is its allele frequency, and it has a � standard deviation of 2 p (1 − p ). Standardize the values for each gene by subtracting the correct mean and dividing by its standard deviation. Call the result W in R . See the code below for inspiration. • Run the code MAF <- 1-colMeans(geno2)/2 # Minor allele frequency W <- t((t(geno2)/2 - MAF)/sqrt(2*MAF*(1-MAF))) N <- length(y) A <- tcrossprod(W) / ncol(geno2) library(coxme) id <- 1:N res <- lmekin(y ~ 1 + (1|id), varlist=list(id=A)) The first couple of lines compute the correlation matrix we should use for the calculations. The last line fits the model. • Explain what goes on in the code above. • Estimate the proportion of variance explained. 6 plink This exercise is to try the software program plink. You need to download and install that before you can run the exercise. plink is super versatile and has many more options than what we can consider here. Instead we will try to make small extensions to the analyses we made above. • Download the plink program from http://zzz.bwh.harvard.edu/plink/download.shtml and install it. The plink executable file should be placed in either the current working di- rectory or somewhere in the command path. All commands involve typing plink at the command prompt followed by a number of options (all starting with -option ) to specify the data files / methods to be used. plink needs two standard files: a ped-file and a map-file. • Check the files and make sure you understand their formats. The map file contains chromosome number, the SNP name, the position in mor- gans on the chromosome, and the position in base pairs on the chromosome. 4

• Type plink --noweb --file hapmap1 and look at the output. It takes two files (a ped file and a map file) as input. They can be down- loaded from the website. • Type plink --noweb --missing --file hapmap1 and look at the output. What information is contained in the missing files. • Type plink --noweb --freq --file hapmap1 and look at the output. What information is contained in the frequency files. How does the results compare to the results you computed manually earlier in R? • Type plink --noweb --assoc --file hapmap1 and look at the output. What information is contained in the association files. • Type plink --noweb --assoc --adjust --file hapmap1 and look at the output. What information is contained in the adjusted association files. 5

1 Analysis of SNP data with R The data we will be working with here - PDF document

Statistical methods in bioinformatics April 9, 2018 Exercises day 2 Claus Ekstrm These exercises will use a combination of R and specific programmes. The exercises are build up over two types of problems: introductory text that you should run

Quartet Inference from SNP Data Under the Coalescent Model Syed Shalan Naqvi Quartet Inference

SNP SNP ho hone ney be y bee di diagno gnostic pa c pane nel Tugrul Giray, Ph.D.

Today. Two populations. Which population? DNA data: Population 1: snp 843: Pr[A] = .4 , Pr[T] =

Welcome back. Two populations. Which population? DNA data: Population 1: snp 843: Pr[A] = .4 ,

MULTIPLE ACCESS COMMUNICATION PROTOCOL (proposal of a new SnP) SpaceWire networking Protocol

Genome Wide SNP Selection with Entropy Based Methods Zhenqiu Liu University of Maryland

Hybrid CPU/GPU Acceleration of Detection of 2-SNP Epistatic Interactions in GWAS Jorge

s Superimposition of virtual and real images TH10Z501 TH10Z501 TH10Z005 SNP NDA3 / Folie 1

Adaptive Monte Carlo Multiple Testing via Multi-Armed Bandits Martin Zhang joint work with:

Analysis of SNP Marker Data for Predictions Some remarks about various methods/software Fikret

SNP Chip Data Analysis for Whole Genome Association Studies I nforSense Limited Sha Liao , Life

Galaxy for SNP and Variant Data Analysis Plant and Animal Genome XXIV (PAG 2016) January 12,

East Asian mtDNA haplogroup determination in Koreans: Haplogroup-level coding region SNP analysis

Inferring sites with recent or ongoing selection for SNP chip or NGS data

Good practice SILA NATIONAL PARK ITALY Analysis of the present situation Biomass potentials

Quartet Inference from SNP Data Under the Coalescent Model

CS681: Advanced Topics in Computational Biology Week 8 Lecture 1 Can Alkan EA224

CPSC 121: Models of Computation Module 3: Representing Values in a Computer Module 3: Coming

Christine Pecci, MD University of California, San Francisco Annual Review in Family Medicine

Schema Theory David White Wesleyan University November 30, 2009 Building Block Hypothesis

Mining the semantics of genome super-blocks to infer ancestral architectures Macha Nikolski

Dynamic Programming Part 2 Algorithm Theory WS 2012/13 Fabian Kuhn Dynamic Programming

Garnett Wilson and Wolfgang Banzhaf Memorial University of Newfoundland, St. Johns, NL, Canada

Short Reads Alignment to a Reference Genome Joanna Krupka CRUK Summer School in Bioinformatics