Introduction to PLINK

SLIDE 1

Introduction

Introduction to PLINK

Scott Hazelhurst

Sydney Brenner Institute for Molecular Bioscience and School of Electrical & Information Engineering University of the Witwatersrand Johannesburg 2014

SLIDE 2

Data format

  • Standard tools for manipulating genotype data:
      • vcftools
      • PLINK/PSEQ
  • PLINK has multiple data formats
  • Other tools exist for converting to/from other formats
  • pngu.mgh.harvard.edu/~purcell/plink/
  • https://www.cog-genomics.org/plink2/

SLIDE 3

PLINK in transition to PLINK 2

  • Current version of PLINK: 1.90b2
  • Previous version: 1.07
  • New version:
      • Much faster
      • Has more features
      • Missing some features
      • Data compatible

SLIDE 4

  • PLINK is primarily aimed at genotype data:
      • SNPs
      • “short” indels
      • Some support for CNV
  • A leading tool for GWAS and structure analysis; many other tools support its format.
  • Not appropriate for many SVs, or when there is great variability

SLIDE 5

PED format

  • PED file with information on individuals, MAP file with SNP information
  • PED file: one row per individual. Columns are:
      • Family ID, Individual ID
      • Paternal ID, Maternal ID
      • Sex (1=male; 2=female; other=unknown)
      • Phenotype. Missing: -9 or 0; Control: 1; Case: 2 (or a quantitative trait)
      • Pair of columns per SNP: different encodings possible

SLIDE 6

    HCB181 1 0 0 1 1 2 2 2 2 2 2 1 2 2 2 2 2
    HCB182 1 0 0 1 1 2 2 1 2 2 2 1 2 1 2 2 2
    HCB183 1 0 0 1 2 2 2 1 2 2 2 1 2 1 1 2 2
    HCB184 1 0 0 1 1 2 2 1 2 2 2 1 1 2 2 2 2
    HCB185 1 0 0 1 1 2 2 1 2 2 2 2 2 2 2 2 2
    HCB186 1 0 0 1 1 2 2 2 2 2 2 1 1 2 2 2 2
    HCB187 1 0 0 1 1 2 2 2 2 2 2 1 2 1 2 2 2
    HCB188 1 0 0 1 1 2 2 1 2 2 2 1 1 2 2 2 2
    HCB189 1 0 0 1 1 2 2 2 2 2 2 2 2 2 2 2 2
    HCB190 1 0 0 1 1 2 2 2 2 2 2 2 2 2 2 2 2
    HCB191 1 0 0 1 2 1 2 2 2 2 2 1 2 1 2 2 2
    HCB192 1 0 0 1 1 2 2 2 2 2 2 1 1 2 2 2 2
    HCB193 1 0 0 1 1 1 2 2 2 2 2 2 2 2 2 2 2
    HCB194 1 0 0 1 1 2 2 2 2 2 2 1 2 2 2 2 2
    HCB195 1 0 0 1 1 2 2 2 2 2 2 2 2 2 2 2 2
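The PED layout (six leading columns, then a pair of allele columns per SNP) can be sketched in Python. This is only an illustration, not PLINK's own parser: the names `PedRecord` and `parse_ped_line` are hypothetical, and the sketch assumes whitespace-separated fields with no header line.

```python
from typing import List, NamedTuple, Tuple


class PedRecord(NamedTuple):
    family_id: str
    individual_id: str
    paternal_id: str
    maternal_id: str
    sex: str          # 1 = male, 2 = female, anything else = unknown
    phenotype: str    # -9 or 0 = missing, 1 = control, 2 = case (or a quantitative value)
    genotypes: List[Tuple[str, str]]  # one (allele, allele) pair per SNP


def parse_ped_line(line: str) -> PedRecord:
    """Split one PED row into its six fixed columns plus allele pairs."""
    fields = line.split()
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("expected 6 columns followed by an even number of alleles")
    alleles = fields[6:]
    pairs = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return PedRecord(*fields[:6], pairs)


# First row of the example listing: six fixed columns, then six SNPs.
record = parse_ped_line("HCB181 1 0 0 1 1 2 2 2 2 2 2 1 2 2 2 2 2")
print(record.family_id, record.sex, record.phenotype, len(record.genotypes))
```

Keeping every field as a string avoids guessing whether a phenotype is a case/control code or a quantitative value.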

SLIDE 7

Can be used to model family studies

    AFAM 1 0 0 . . .
    AFAM 2 0 0 . . .
    AFAM 3 1 2 . . .
    AFAM 4 1 2 . . .
    AFAM 5 0 0 . . .
    AFAM 6 1 5 . . .
    AFAM 7 0 0 . . .
    AFAM 8 3 0 . . .
    AFAM 9 0 4 . . .
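To see how the paternal and maternal ID columns encode a pedigree, the AFAM rows can be read into a small Python structure. This is a minimal sketch with hypothetical helper names (`founders`, `children_of`). It assumes that a parent ID of 0 means the parent is not in the data, and that a founder is an individual with both parents coded 0.

```python
from typing import Dict, List, Tuple

# (family ID, individual ID, paternal ID, maternal ID) from the AFAM example.
ROWS: List[Tuple[str, str, str, str]] = [
    ("AFAM", "1", "0", "0"),
    ("AFAM", "2", "0", "0"),
    ("AFAM", "3", "1", "2"),
    ("AFAM", "4", "1", "2"),
    ("AFAM", "5", "0", "0"),
    ("AFAM", "6", "1", "5"),
    ("AFAM", "7", "0", "0"),
    ("AFAM", "8", "3", "0"),
    ("AFAM", "9", "0", "4"),
]


def founders(rows: List[Tuple[str, str, str, str]]) -> List[str]:
    """Individuals with neither parent recorded."""
    return [iid for _, iid, pat, mat in rows if pat == "0" and mat == "0"]


def children_of(rows: List[Tuple[str, str, str, str]]) -> Dict[str, List[str]]:
    """Map each recorded parent to the list of their children."""
    kids: Dict[str, List[str]] = {}
    for _, iid, pat, mat in rows:
        for parent in (pat, mat):
            if parent != "0":
                kids.setdefault(parent, []).append(iid)
    return kids


print("founders:", founders(ROWS))
print("children:", children_of(ROWS))
```

Individuals 8 and 9 each have one known parent only, so they are neither founders nor fully specified offspring.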

SLIDE 8

NB: Some commands/tools:

  • Expect sex information by default (--allow-no-sex / --must-have-sex)
  • Want phenotype data

SLIDE 9

MAP file

  • MAP file has one row per SNP:
      • Chromosome: 1..26 (23=X, 24=Y, 25=XY, 26=MT)
      • SNP id
      • genetic distance (morgans)
      • base pair position (which build!)
  • Newer versions of PLINK have support for some non-human genomes

SLIDE 10

    1 rs3094315 742429
    1 rs3131972 742584
    1 rs12562034 758311
    1 rs12124819 766409
    1 rs11240777 788822
    1 rs6681049 789870
    1 rs4970383 828418
    1 rs4475691 836671
    1 rs7537756 844113
    1 rs13302982 851671
    1 rs1110052 863421
    1 rs2272756 871896
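The rows in this listing have three columns (chromosome, SNP id, base pair), whereas the full MAP layout also has a genetic distance column. A reader that accepts both layouts might look like the sketch below. The names `MapEntry` and `parse_map_line` are hypothetical, and filling in a missing genetic distance as 0 is an assumption of this sketch. PLINK itself has a `--map3` option for three-column MAP files.

```python
from typing import NamedTuple


class MapEntry(NamedTuple):
    chromosome: str
    snp_id: str
    genetic_distance: float  # morgans; 0.0 when the column is absent
    base_pair: int


def parse_map_line(line: str) -> MapEntry:
    """Parse a 3-column or 4-column MAP row."""
    fields = line.split()
    if len(fields) == 4:
        chrom, snp, dist, bp = fields
    elif len(fields) == 3:
        chrom, snp, bp = fields
        dist = "0"  # assumption: treat a missing distance as unknown (0)
    else:
        raise ValueError(f"expected 3 or 4 columns, got {len(fields)}")
    return MapEntry(chrom, snp, float(dist), int(bp))


print(parse_map_line("1 rs3094315 742429"))
print(parse_map_line("1 rs3131972 0 752721"))
```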

SLIDE 11

Binary PED format

  • Faster, more compact
  • FAM file: one row per individual; identification information (first 6 columns of the PED file). Human readable.
  • BIM file: one row per SNP. MAP file + the two alleles for that SNP. Human readable.
  • BED file: genotype information (the rest of the columns of the PED file), stored in binary and, by default, in SNP-major order (one block per SNP rather than one row per individual). Not human readable.
  • Don’t confuse with the UCSC BED format for genomic data; you can have both in a study
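To make "not human readable" concrete, here is a rough Python sketch of how a SNP-major BED byte stream can be decoded. It assumes the PLINK 1 layout: a three-byte header (0x6c, 0x1b, then 0x01 for SNP-major), followed by one block per SNP with two bits per individual, lowest-order bits first. The function name `decode_bed` and the labels in `CODES` are hypothetical, and the example bytes are hand-built rather than taken from a real data set.

```python
from typing import List

MAGIC = bytes([0x6C, 0x1B, 0x01])  # two magic bytes, then 0x01 = SNP-major
# Two-bit genotype codes; A1/A2 are the two alleles listed in the BIM file.
CODES = {0b00: "hom_a1", 0b01: "missing", 0b10: "het", 0b11: "hom_a2"}


def decode_bed(data: bytes, n_samples: int, n_snps: int) -> List[List[str]]:
    """Return one list of genotype labels per SNP."""
    if data[:3] != MAGIC:
        raise ValueError("not a SNP-major PLINK 1 .bed stream")
    bytes_per_snp = (n_samples + 3) // 4  # 4 individuals per byte, rounded up
    body = data[3:]
    if len(body) != bytes_per_snp * n_snps:
        raise ValueError("length does not match sample and SNP counts")
    result = []
    for s in range(n_snps):
        block = body[s * bytes_per_snp:(s + 1) * bytes_per_snp]
        calls = []
        for i in range(n_samples):
            code = (block[i // 4] >> (2 * (i % 4))) & 0b11
            calls.append(CODES[code])
        result.append(calls)
    return result


# Hand-built example: 1 SNP, 5 individuals, so 2 bytes for the SNP.
example = MAGIC + bytes([0b11100100, 0b00000000])
print(decode_bed(example, n_samples=5, n_snps=1))
```

The individual and SNP counts are not stored in the BED file itself; they come from the number of rows in the FAM and BIM files.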

SLIDE 12

BIM file

    1 rs2185539 0 566875 C C
    1 rs11510103 0 567753 A A
    1 rs11240767 0 728951 C C
    1 rs3131972 0 752721 G G
    1 rs3131969 0 754182 G G
    1 rs1048488 0 760912 T T
    1 rs12562034 0 768448 A G
    1 rs12124819 0 776546 A A
    1 rs4040617 0 779322 A A
    1 rs2905036 0 792480 T T
    1 rs4245756 0 799463 C C
    1 rs12086311 0 808769 G G

SLIDE 13

Other formats:

  • transposed
  • long format

Not commonly used; typically needed when you have to import from another format. It may be easy to write a script that does the conversion.

SLIDE 14

Transposed data

  • tped/tfam files
  • tped: one row per SNP, with SNP info followed by the genotype of each individual
  • tfam: info about individuals
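The relationship between the PED/MAP and tped/tfam layouts amounts to a transposition of the genotype columns. The Python sketch below illustrates this on made-up data. `ped_to_tped` is a hypothetical helper, not a PLINK function, and the rows are assumed to be already split into fields.

```python
from typing import List, Tuple


def ped_to_tped(
    map_rows: List[List[str]], ped_rows: List[List[str]]
) -> Tuple[List[List[str]], List[List[str]]]:
    """Return (tped_rows, tfam_rows) for already-split MAP and PED rows."""
    # tfam: the six identification columns of each individual.
    tfam = [row[:6] for row in ped_rows]
    tped = []
    for j, snp_info in enumerate(map_rows):
        line = list(snp_info)  # chromosome, SNP id, distance, base pair
        for row in ped_rows:
            # Allele pair for SNP j sits at columns 6+2j and 7+2j of the PED row.
            line.extend(row[6 + 2 * j: 8 + 2 * j])
        tped.append(line)
    return tped, tfam


# Made-up data: two SNPs, two individuals.
map_rows = [["1", "snp1", "0", "1000"], ["1", "snp2", "0", "2000"]]
ped_rows = [
    ["F1", "1", "0", "0", "1", "1", "A", "A", "C", "G"],
    ["F2", "1", "0", "0", "2", "2", "A", "T", "G", "G"],
]
tped, tfam = ped_to_tped(map_rows, ped_rows)
for row in tped:
    print(" ".join(row))
```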

SLIDE 15

Long format

  • Very inefficient, but may be useful in conversion
  • MAP file
  • FAM file
  • LGEN file containing genotypes
  • LGEN columns:
      • family ID, individual ID
      • SNP ID
      • two alleles

SLIDE 16

    A 1 rs123 A C
    A 2 rs28782 C G
    A 3 rs919878 T T
    A 2 rs123 A C
    B 7 rs123 A C
    B 8 rs123 A C
    B 9 rs28782 C T
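Because the long format has one genotype per row, converting it to a per-individual layout is mostly a grouping exercise. The Python sketch below groups the example rows by (family ID, individual ID) and prints one PED-style genotype row per person. `group_lgen` is a hypothetical name, and writing "0 0" for a SNP with no row assumes the usual PLINK missing-genotype code.

```python
from collections import defaultdict
from typing import Dict, Tuple

LGEN_TEXT = """\
A 1 rs123 A C
A 2 rs28782 C G
A 3 rs919878 T T
A 2 rs123 A C
B 7 rs123 A C
B 8 rs123 A C
B 9 rs28782 C T
"""


def group_lgen(text: str) -> Dict[Tuple[str, str], Dict[str, Tuple[str, str]]]:
    """Map (family ID, individual ID) to {SNP id: (allele, allele)}."""
    by_person: Dict[Tuple[str, str], Dict[str, Tuple[str, str]]] = defaultdict(dict)
    for line in text.splitlines():
        if not line.strip():
            continue
        fid, iid, snp, a1, a2 = line.split()
        by_person[(fid, iid)][snp] = (a1, a2)
    return dict(by_person)


genotypes = group_lgen(LGEN_TEXT)
snps = sorted({snp for calls in genotypes.values() for snp in calls})
print("SNP order:", snps)
for (fid, iid), calls in sorted(genotypes.items()):
    # SNPs with no LGEN row for this person are written as missing ("0 0").
    row = [" ".join(calls.get(snp, ("0", "0"))) for snp in snps]
    print(fid, iid, *row)
```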

SLIDE 17

Phenotype/Cluster file

    FID IID PHE
    FID IID PHE
    FID IID PHE
    FID IID PHE

SLIDE 18

Can have multiple phenotypes

    FID IID PHE1 PHE2 PHE3
    FID IID PHE1 PHE2 PHE3
    FID IID PHE1 PHE2 PHE3
    FID IID PHE1 PHE2 PHE3
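A phenotype file of this shape (family ID, individual ID, then one or more phenotype columns) is easy to load into a lookup table. The Python sketch below is an illustration only. `read_phenotypes` is a hypothetical helper, the sample values are made up, and it assumes that a first row beginning with `FID IID` is a header naming the phenotype columns.

```python
from typing import Dict, Tuple


def read_phenotypes(text: str) -> Dict[Tuple[str, str], Dict[str, str]]:
    """Map (FID, IID) to {phenotype name: value}."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    if not lines:
        return {}
    first = lines[0]
    if first[:2] == ["FID", "IID"]:
        names = first[2:]      # header row supplies the phenotype names
        rows = lines[1:]
    else:
        # No header: fall back to positional names PHE1, PHE2, ...
        names = [f"PHE{i}" for i in range(1, len(first) - 1)]
        rows = lines
    table = {}
    for fid, iid, *values in rows:
        table[(fid, iid)] = dict(zip(names, values))
    return table


# Made-up example with two phenotypes per individual.
sample = "FID IID BMI T2D\nF1 1 24.3 1\nF2 1 31.0 2\n"
print(read_phenotypes(sample))
```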

SLIDE 19

Tri-allelic SNPs

  • PLINK can represent tri-allelic SNPs
  • Only very limited ability to analyse them
  • Same SNP may appear several times in the MAP or BIM file
  • Usually filter out tri-allelic SNPs
  • Often an issue when merging data sets
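Since the same SNP may appear several times in the MAP or BIM file, one simple way to find candidates for filtering is to count SNP ids. This Python sketch uses a hypothetical helper, `repeated_snp_ids`, and made-up BIM rows. It only detects repeated ids and is not a complete test for tri-allelic sites.

```python
from collections import Counter
from typing import Iterable, List


def repeated_snp_ids(rows: Iterable[str]) -> List[str]:
    """Return SNP ids (second column) that occur on more than one row."""
    counts = Counter(line.split()[1] for line in rows if line.strip())
    return sorted(snp for snp, n in counts.items() if n > 1)


# Made-up BIM rows: rsX appears twice with different allele pairs.
bim_rows = [
    "1 rsX 0 1000 A C",
    "1 rsX 0 1000 A G",
    "1 rsY 0 2000 T C",
]
print(repeated_snp_ids(bim_rows))
```

The resulting ids could be written to a file, one per line, and passed to PLINK's `--exclude` option to drop those SNPs.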

SLIDE 20

Strandedness

  • Different chips or experiments may record a SNP using different strands (A/C on one strand is T/G on the other)
  • You see this when merging data: the SNP appears to be multi-allelic
  • PLINK reports apparently multi-allelic SNPs
  • You can flip them, creating a new data set
  • Try the merge again: if the SNP was not really multi-allelic, the merge should now work
  • Filter out the remaining ones
  • May incorrectly flip a few
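The flip-then-retry procedure can be sketched in Python for a single SNP seen in two data sets. This is an illustration under simplifying assumptions, not what PLINK does internally: `classify` is a hypothetical helper, alleles are restricted to A, C, G and T, and missing-allele codes are ignored.

```python
from typing import FrozenSet, Iterable

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flip(alleles: Iterable[str]) -> FrozenSet[str]:
    """Alleles as they would be read from the opposite strand."""
    return frozenset(COMPLEMENT[a] for a in alleles)


def classify(alleles_a: Iterable[str], alleles_b: Iterable[str]) -> str:
    """Decide how the alleles of one SNP in two data sets can be reconciled."""
    a, b = frozenset(alleles_a), frozenset(alleles_b)
    if len(a | b) <= 2:
        return "compatible"   # merges as a bi-allelic SNP as-is
    if len(a | flip(b)) <= 2:
        return "flip"         # bi-allelic once data set B is strand-flipped
    return "exclude"          # still looks multi-allelic after flipping


print(classify("AC", "TG"))
print(classify("AC", "AC"))
print(classify("AC", "AG"))
print(classify("AT", "TA"))
```

A/T and C/G SNPs look the same on both strands, so they always come out as "compatible" here even when the strands differ. Allele comparison alone cannot resolve these cases, which is one reason a few SNPs may end up flipped incorrectly.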
