Large-scale machine learning for genotype / phenotype association


Aidan O’Brien | Health Data Analytics 2018 | HEALTH AND BIOSECURITY | @aydun1


SLIDE 1

HEALTH AND BIOSECURITY

Aidan O’Brien Health Data Analytics 2018 aydun1

Large-scale machine learning for genotype / phenotype association

SLIDE 2

By 2025 it is estimated that 50% of the world population will have been sequenced. (Frost & Sullivan)

20 EB storage / year

[Figure: data acquisition of BigData disciplines in 2025: Genomics, YouTube, Astronomy, Twitter. Source: Stephens et al., "Big Data: Astronomical or Genomical?" (2015)]

Large-scale Machine Learning for Gen-Phen Association | Aidan O’Brien | @aydun1 2 |

SLIDE 3

Understanding disease and finding biomarkers

https://www.projectmine.com/about/

SLIDE 4

Finding the disease gene(s)

[Figure: cases vs. controls compared across Gene1 and Gene2]

SLIDE 5

Complex diseases are driven by multiple genes

[Figure: cases vs. controls]

Need an approach to capture feature interactions
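Why single-gene comparisons can miss a multi-gene effect is easiest to see on a toy example. The sketch below is not from the talk: it assumes two hypothetical binary variants, `g1` and `g2`, and a phenotype that appears only when exactly one of them is carried (an XOR pattern chosen to make the point as stark as possible). Each variant on its own shows no difference in case rate, while the pair together separates cases from controls completely.

```python
from itertools import product

# Hypothetical toy cohort: every combination of two binary variants,
# repeated 25 times. The phenotype is the XOR of the two variants
# (an illustrative assumption, not data from the presentation).
cohort = [(g1, g2, g1 ^ g2) for g1, g2 in product((0, 1), repeat=2)] * 25


def case_rate(rows):
    """Fraction of individuals in `rows` that are cases."""
    return sum(is_case for _, _, is_case in rows) / len(rows)


def marginal_effect(index):
    """Case-rate difference between carriers and non-carriers of one variant."""
    carriers = [r for r in cohort if r[index] == 1]
    non_carriers = [r for r in cohort if r[index] == 0]
    return case_rate(carriers) - case_rate(non_carriers)


def joint_case_rates():
    """Case rate for each (g1, g2) combination."""
    return {
        (a, b): case_rate([r for r in cohort if r[0] == a and r[1] == b])
        for a, b in product((0, 1), repeat=2)
    }


effect_g1 = marginal_effect(0)   # no single-variant signal
effect_g2 = marginal_effect(1)   # no single-variant signal
joint = joint_case_rates()       # the pair of variants is fully informative
```

Real genotype data is far noisier than this, but the same logic explains why a test that looks at one locus at a time can come up empty when the signal lives in a combination.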

SLIDE 6

Machine learning on 1.7 trillion datapoints

  • 80 million features (genomic profile)
  • 22,500 samples (individuals)

[Figure: panels A, B, C: genomic profile and disease status of individuals, leading to disease genes]
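As a rough check of the scale quoted on this slide, the rounded figures can be multiplied out. With 80 million features and 22,500 samples the product is 1.8 trillion, the same order as the 1.7 trillion on the slide (which presumably uses unrounded counts). The one-byte-per-genotype storage figure is an assumption added here for illustration, not a number from the talk.

```python
# Rounded figures taken from the slide.
n_features = 80_000_000   # variants per genomic profile
n_samples = 22_500        # individuals

datapoints = n_features * n_samples

# Assumption (not from the slide): one byte per genotype call.
bytes_per_genotype = 1
terabytes = datapoints * bytes_per_genotype / 1e12

print(f"{datapoints:.2e} datapoints, about {terabytes:.1f} TB at 1 byte each")
```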

SLIDE 7

Machine learning can capture complex features

[Figure: two panels, each mapping individuals' genomic profiles to predictive variants]

  • Trad. GWAS (logistic regression)
  • Required solution

SLIDE 8

Random forest – a collection of decision trees
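To make "a collection of decision trees" concrete, here is a minimal pure-Python sketch of the idea: each tree is grown on a bootstrap sample, each split looks at a random subset of features, and the forest predicts by majority vote. This is an illustration only, not VariantSpark's implementation; the function names, the depth limit, and the square-root feature subset size are all choices made for this example.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(rows, labels, n_candidates, rng, depth=0, max_depth=3):
    """Grow one tree; every node considers a random subset of features."""
    if depth == max_depth or len(set(labels)) == 1:
        return Counter(labels).most_common(1)[0][0]      # leaf: majority class
    best = None
    for f in rng.sample(range(len(rows[0])), n_candidates):
        # Candidate thresholds: every observed value except the largest,
        # so both sides of a split are always non-empty.
        for threshold in sorted(set(r[f] for r in rows))[:-1]:
            left = [i for i, r in enumerate(rows) if r[f] <= threshold]
            right = [i for i, r in enumerate(rows) if r[f] > threshold]
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right])) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, threshold, left, right)
    if best is None:                 # no sampled feature varies: make a leaf
        return Counter(labels).most_common(1)[0][0]
    _, f, threshold, left, right = best
    return (f, threshold,
            build_tree([rows[i] for i in left], [labels[i] for i in left],
                       n_candidates, rng, depth + 1, max_depth),
            build_tree([rows[i] for i in right], [labels[i] for i in right],
                       n_candidates, rng, depth + 1, max_depth))


def predict_tree(node, row):
    """Walk from the root to a leaf; internal nodes are 4-tuples."""
    while isinstance(node, tuple):
        f, threshold, left, right = node
        node = left if row[f] <= threshold else right
    return node


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train `n_trees` trees, each on a bootstrap sample of the data."""
    rng = random.Random(seed)
    n_candidates = max(1, int(len(rows[0]) ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]   # sample with replacement
        forest.append(build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx], n_candidates, rng))
    return forest


def predict_forest(forest, row):
    """Majority vote across all trees."""
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]
```

Calling `fit_forest(rows, labels)` with `rows` as tuples of allele counts and `labels` as case/control flags returns a list of trees that `predict_forest` can query. Because each tree can split on several features in sequence, a forest can pick up combinations of variants that a one-locus-at-a-time test does not model.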

SLIDE 9

Population-scale genomic data analysis requires BigData solutions

                            High-performance compute cluster   Hadoop/Spark compute cluster
Focus                       Compute-intensive                  Data-intensive
Fault tolerant              No                                 Yes
Node-bound                  Yes                                No
Parallelization             100+ CPU                           1000+ CPU
Parallelization procedure   bespoke                            standardized

CSIRO solution

SLIDE 10

Solution: VariantSpark - “Wide” machine learning for population-scale cohorts

[Figure: software stack: Spark Core, SparkML / MLlib, VariantSpark]

[Figure: accuracy (low to high) and speed (low to high) scales]

“Analyzes 3000 individuals with 80M features in 30 minutes”

BMC Genomics 2015, 16:1052 PMID: 26651996 (citation=16)

SLIDE 11

VariantSpark – amplifies association in the signal

More accurate biomarker discovery

  • Bone Mineral Density (BMD) as the phenotype: 1,936 individuals with 7.2 million variants (imputed from array)
  • Replicate known BMD genes identified by traditional GWAS (single-locus regression).
  • Amplify signal over traditional methods so smaller cohorts give robust insights

SLIDE 12

Hipster Index: synthetic dataset

HipsterScore = (2 * B6) + (0.2 * B2) + (1.5 * R1) + (0.1 * C2) + (3 * B6 * B2) + (2.5 * R1 * C1) + noise

The first four terms are labelled "independent" on the slide; the two product terms are labelled "interacting".

[Figure: genome matrix with a "Hipster?" label (Y / N) for each individual]
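The HipsterScore formula can be written out directly as a function, which also makes it easy to generate a small synthetic cohort. This is a sketch under stated assumptions: the slide does not say how genotypes are encoded or how the noise is drawn, so the allele-count encoding (0, 1, 2), the Gaussian noise, the uniform genotype sampling, and the names `hipster_score` and `simulate_cohort` are all hypothetical choices made here.

```python
import random


def hipster_score(genotype, noise_sd=0.0, rng=random):
    """Evaluate the slide's formula; `genotype` maps variant name to a number.

    Assumption: values are allele counts (0, 1 or 2) and noise is Gaussian.
    """
    b6, b2, r1 = genotype["B6"], genotype["B2"], genotype["R1"]
    c1, c2 = genotype["C1"], genotype["C2"]
    independent = (2 * b6) + (0.2 * b2) + (1.5 * r1) + (0.1 * c2)
    interacting = (3 * b6 * b2) + (2.5 * r1 * c1)
    noise = rng.gauss(0.0, noise_sd) if noise_sd > 0 else 0.0
    return independent + interacting + noise


def simulate_cohort(n_individuals, noise_sd=0.5, seed=0):
    """Draw random genotypes (uniform over 0/1/2, an assumption) and score them."""
    rng = random.Random(seed)
    names = ("B6", "B2", "R1", "C1", "C2")
    cohort = []
    for _ in range(n_individuals):
        genotype = {name: rng.choice((0, 1, 2)) for name in names}
        cohort.append((genotype, hipster_score(genotype, noise_sd, rng)))
    return cohort
```

With noise switched off, an individual carrying one copy each of B6 and B2 scores 2 + 0.2 + 3 = 5.2, whereas the two variants carried separately contribute only 2 and 0.2. The extra 3 comes entirely from the interaction term, which is the kind of effect the synthetic dataset is designed to test for.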

SLIDE 13

Share research notebooks


  • Databricks
  • AWS EKS
  • Try it on your data

https://docs.databricks.com/applications/genomics/variant-spark.html

SLIDE 14

CSIRO’s cloud-based solutions

Understanding relationships can lead to clinical applications

Innovation In Digital Health - Open Floor Forum | Denis C. Bauer | @allPowerde 14 |

  • Finding Disease Genes
  • Correcting Genomes
  • Treating Individuals

SLIDE 15

Three things to remember

  • Complex diseases need software to detect gene interactions
  • VariantSpark detects gene interactions
  • Bringing findings into clinical practice requires new cloud technologies

SLIDE 16

Let’s build a healthier world together

Team and collaborators: Denis Bauer, PhD; Oscar Luo, PhD; Rob Dunne, PhD; Piotr Szul; Aidan O’Brien; Laurence Wilson, PhD; Arash Bayat; Lynn Langit; Natalie Twine, PhD; Brendan Hosking

News: Top 10 Australian IT stories of 2017

You? We are hiring… email Denis

Keynote Aidan O’Brien, CSIRO