In silico blood genotyping from exome sequencing data


SLIDE 1

In silico blood genotyping from exome sequencing data

Silvio Tosatto

BioComputing UP, Department of Biology, University of Padova, Italy URL: http://protein.bio.unipd.it/

SLIDE 2

Today

  • Personalized genetics has been upon us for some time
  • How good are we at actually identifying phenotype from a whole genome?
SLIDE 3

The CAGI Personal Genome Project (PGP) Challenge

  • Few goals are more fundamental to genome interpretation than predicting traits from raw sequence (or genotype) data
  • In this CAGI challenge, phenotypes/traits are predicted for real people with genetic data
  • Genetic information for 10 individuals from the Personal Genome Project is provided (PGP-10)

Dataset provided by George Church

SLIDE 4

Personal Genome Project (PGP) - Predict individuals' phenotype

Numerical traits

  • 33. Birth weight (in g)
  • 34. HDL level (in mg/dL) *
  • 35. LDL level (in mg/dL) *
  • 36. Triglyceride level (in mg/dL) *
  • 37. Fasting blood glucose level (in mg/dL)
  • 38. Warfarin dose (in mg)
  • 39. Age at menarche
  • 40. Annual income (in $)

SLIDE 6

Blood Groups

  • Clear genetic cause of phenotypes
  • Model system for phenotype prediction
  • Good description in literature
  • High relevance, especially for blood transfusions

(Blood. 2009;114: 248-256)

SLIDE 7

Example: ABO glycosyltransferase

| Blood group | Genes | Antigens |
| --- | --- | --- |
| ABO | ABO | A, B, O |

Amino acid residues differing between blood group A- and B-active transferases, respectively (Arg176Gly; Gly235Ser; Leu266Met; Gly268Ala) are shown with the single-letter code and their positions indicated.

SLIDE 8

Relevant Blood Types

| Blood group | Genes | Antigens |
| --- | --- | --- |
| ABO | ABO | A, B, O |
| RH | RHCE, RHD | D, E, C plus 50 minor |
| Duffy | DARC | FY(a), FY(b) |
| Kell | KEL | K1, K2 plus 23 minor |
| Diego | SLC4A1 | Dia, Dib, Wra, Wrb |
| Kidd | SLC14A1 | Jk(a), Jk(b) |
| Lewis | FUT3 | a, b |
| Lutheran | BCAM | Lu(a), Lu(b) plus 15 minor |
| MNS | GYPA, GYPB, GYPE | M, N, S plus 40 minor |
| Bombay | FUT1, FUT2 | H, secretor |

10 out of ca. 30 blood groups are relevant for transfusions

SLIDE 9

BOOGIE: BlOOd Group IdEntifier

  • A knowledge-based system to predict blood groups from sequencing data
  • All 10 groups relevant for blood transfusions are predicted
  • A specialized genotype-phenotype knowledge base is required
SLIDE 10

BOOGIE: Knowledge representation

  • Stored in tree-like structure
  • Rules expressed in “if <mutation(s)> then <phenotype(s)>” form
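
The rule form above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Rule` class, the nested-dictionary tree (blood group system, then gene, then rules), the "most specific rule wins" tie-break and the single example rule are hypothetical and are not taken from BOOGIE's actual knowledge base. The example rule uses the four A/B transferase substitutions listed on slide 7.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One 'if <mutation(s)> then <phenotype(s)>' rule (hypothetical layout)."""
    mutations: frozenset  # every mutation must be observed for the rule to fire
    phenotype: str


# Hypothetical tree: blood group system -> gene -> list of rules.
KNOWLEDGE_BASE = {
    "ABO": {
        "ABO": [
            Rule(
                frozenset({"Arg176Gly", "Gly235Ser", "Leu266Met", "Gly268Ala"}),
                "B",
            ),
        ],
    },
}


def predict(system, observed, default="unknown"):
    """Return the phenotype of the most specific matching rule.

    `observed` maps a gene name to the set of mutations found in it.
    A rule matches when all of its mutations are present; among the
    matching rules, the one with the most mutations is chosen
    (an assumed tie-break, not necessarily BOOGIE's).
    """
    best = None
    for gene, rules in KNOWLEDGE_BASE.get(system, {}).items():
        found = observed.get(gene, frozenset())
        for rule in rules:
            if rule.mutations <= found:
                if best is None or len(rule.mutations) > len(best.mutations):
                    best = rule
    return best.phenotype if best else default
```

A call such as `predict("ABO", {"ABO": {"Arg176Gly", "Gly235Ser", "Leu266Met", "Gly268Ala"}})` fires the example rule, while a partial set of mutations falls through to the default.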

SLIDE 11

BOOGIE: Knowledge collection

  • Manually curated
  • 580 rules derived


SLIDE 12

Relevant variants

Millions of SNVs → ANNOVAR → Few relevant SNVs

ANNOVAR (Wang et al., Nucleic Acids Research 2010) is used to reduce the SNVs to a manageable number:

  • Gene-based annotation of variants
  • Select conserved positions
  • Remove unrelated genes
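
The reduction from millions of SNVs to a few relevant ones can be sketched as a simple filter applied after annotation. This is only an illustration under stated assumptions: the record layout (dictionaries with hypothetical `gene` and `conserved` keys) is invented here and is not ANNOVAR's output format, and the gene set is taken from the blood group table on slide 8.

```python
# Genes from the blood group table on slide 8.
BLOOD_GROUP_GENES = frozenset({
    "ABO", "RHCE", "RHD", "DARC", "KEL", "SLC4A1", "SLC14A1",
    "FUT3", "BCAM", "GYPA", "GYPB", "GYPE", "FUT1", "FUT2",
})


def select_relevant(variants, genes=BLOOD_GROUP_GENES):
    """Keep variants in blood group genes that hit conserved positions.

    Each variant is a dict with hypothetical keys:
      "gene"      - gene symbol assigned by gene-based annotation
      "conserved" - True if the position is considered conserved
    """
    return [v for v in variants if v["gene"] in genes and v["conserved"]]


# Toy input: only the first record survives both filters.
example = [
    {"gene": "ABO", "conserved": True, "change": "Leu266Met"},
    {"gene": "TP53", "conserved": True, "change": "Arg72Pro"},
    {"gene": "KEL", "conserved": False, "change": "Thr193Met"},
]
relevant = select_relevant(example)
```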

SLIDE 13

BOOGIE Pipeline


SLIDE 14

Benchmarking

  • BOOGIE covers all known blood group variants
  • Difficulty in finding genome sequences with known blood phenotypes
  • Personal Genome Project (PGP) as annotated benchmark set
SLIDE 15

Personal Genome Project (PGP)

The mission of the PGP is to encourage the development of personal genomics

  • Genetic information for 10 individuals from the Personal Genome Project is provided (PGP-10)
  • A larger dataset (PGP-1K) aims to cover at least 1,000 genomes

Unfortunately, only ABO and Rh blood group information is available

SLIDE 16

PGP-10 Data

Back row (left to right): James Sherley, Misha Angrist, John Halamka, Keith Batchelder, Rosalynn Gill. Front row (left to right): Esther Dyson, George Church, Kirk Maxey. Not shown: Stan Lapidus and Steven Pinker.

SLIDE 17

PGP-10 Data

SLIDE 18

PGP-10 Results

| | PGP1 | PGP4 | PGP8 |
| --- | --- | --- | --- |
| Known | O+ | A- | B+ |
| ABO | O | A | B |
| Rh | c; e; weak D | c; e; weak D | c; e; weak D |
| DUFFY | FY(a+); FY(b-) | FY(a-); FY(b+) | FY(a-); FY(b+) |
| KELL | K2; K21+; K4-; K3-; K11; K17; K14; K24; K6+; K7- | K2; K21+; K4-; K3-; K11; K17; K14; K24; K6+; K7- | K2; K21+; K4-; K3-; K11; K17; K14; K24; K6+; K7- |
| Diego | Dib; Memph neg | Dib; Memph neg | Dib; Memph neg |
| KIDD | Jk(a-); Jk(b+) | Jk(a-); Jk(b+) | Jk(a+); Jk(b-) |
| Lewis | negative | negative | negative |
| Lutheran | Lu(a-); Lu(b+); Lu6+; Lu9-; Lu4; Lu8+; Aua+; Aub- | Lu(a-); Lu(b+); Lu6-; Lu9+; Lu4-; Lu8+; Aua-; Aub+ | Lu(a-); Lu(b+); Lu6+; Lu9-; Lu4-; Lu8+; Aua+; Aub- |
| MNS | M; S | M; s | M; s |
| Bombay | H+; secretor | H+; secretor | H+; secretor |

BOOGIE correctly predicts all ABO types and all Rh groups except one (PGP-4)
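
As a worked check of the statement above, the per-system accuracy for the three individuals shown on this slide can be computed as below. This is a sketch, not the authors' evaluation code. It assumes that a predicted "weak D" counts as Rh positive, which is what makes PGP-4 (known A-) the single Rh miss; the `accuracy` helper is hypothetical.

```python
def accuracy(predicted, known):
    """Fraction of individuals whose predicted type equals the known type."""
    shared = predicted.keys() & known.keys()
    if not shared:
        raise ValueError("no individuals in common")
    return sum(predicted[k] == known[k] for k in shared) / len(shared)


# Known types from the "Known" row of the PGP-10 results table.
known_abo = {"PGP1": "O", "PGP4": "A", "PGP8": "B"}
known_rh = {"PGP1": "+", "PGP4": "-", "PGP8": "+"}

# Predictions from the table; "weak D" is mapped to "+" (an assumption).
pred_abo = {"PGP1": "O", "PGP4": "A", "PGP8": "B"}
pred_rh = {"PGP1": "+", "PGP4": "+", "PGP8": "+"}

abo_accuracy = accuracy(pred_abo, known_abo)  # 3 of 3 correct
rh_accuracy = accuracy(pred_rh, known_rh)     # 2 of 3 correct
```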

SLIDE 19

PGP-1K Results

  • A second dataset was built from all PGP-1K participants with available blood group information, for a total of 22 individuals
  • This dataset contains microarray data (23andMe SNPs)

P = predicted, R = real

* = blood group relevant SNPs missing from dataset

SLIDE 20

Conclusions

  • We developed a method, called BOOGIE, to predict the ten blood groups relevant for transfusions from sequencing data
      – Specialized knowledge base with 580 genotype-to-phenotype rules
      – Novel variants can be easily considered
  • Benchmarking was (so far) only possible on PGP data for the ABO and Rh blood groups
      – The ABO and Rh systems are correctly predicted in 85-100% of cases
      – The Rh- type presents some additional difficulties

SLIDE 21

Acknowledgements

Manuel Giollo, Giovanni Minervini, Marta Scalzotto (not shown), Emanuela Leonardi, Carlo Ferrari

URL: http://protein.bio.unipd.it/

Funding

FIRB Futuro in Ricerca, Università di Padova, CARIPLO, AIRC