GENETICS FOR
DUMMIES
APPLICATIONS OF NEXT GENERATION SEQUENCING IN HUMAN GENETICS
ANNEMIEKE VERKERK j.verkerk@erasmusmc.nl
HUMAN GENOMES
_in the Human Genome Project the full genome was sequenced from the DNA of a mix of male and female individuals
_this sequence served as the human genome reference sequence and does not represent anyone's personal genome
_it took 13 years and cost $3-5 billion
_with the new techniques of NGS, sequencing became much faster and people wanted to sequence personal genomes
_lots of data: stored in databases, visualised in genome browsers
_the first personal genome sequenced with the new method was that of James Watson
  generated within 4 months
  cost $1.5 million
  published in Nature in April 2008
_but... Craig Venter's personal genome was sequenced first, although with the previous-generation machines
  cost $100 million
  published in PLoS Biology, Sept 2007
despite all sequencing advances, we are still learning how to read all this data...
James Watson did not want to know the status of his ApoE gene, because the sequence within it could indicate his risk for Alzheimer's disease; this gene is therefore not shown for his genome in the genome browser
this also started the debate about privacy of human genome sequences and the "right" to know or not to know
[Figure: APOE gene locus on chr 19 with the SNPs rs429358 and rs7412. Allele E2 = T at both SNPs; allele E4 = C at both SNPs. E4 = increased risk for Alzheimer's disease]
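How the two SNPs in the figure translate into an APOE allele can be sketched in a few lines of Python. This is a minimal illustration, not a clinical tool: the function names are hypothetical, the alleles are assumed to be phased and reported on the forward strand, and the E3 combination (T at rs429358, C at rs7412) is not on the slide but added from general knowledge.

```python
# Lookup table: (allele at rs429358, allele at rs7412) -> APOE allele.
# E2 and E4 follow the slide; E3 is an addition (assumption, see lead-in).
APOE_HAPLOTYPES = {
    ("T", "T"): "E2",
    ("T", "C"): "E3",
    ("C", "C"): "E4",
}


def apoe_allele(rs429358, rs7412):
    """Return the APOE allele for one phased haplotype, or 'unknown'."""
    return APOE_HAPLOTYPES.get((rs429358.upper(), rs7412.upper()), "unknown")


def apoe_genotype(haplotype_1, haplotype_2):
    """Combine two phased haplotypes into a genotype string such as 'E3/E4'."""
    alleles = sorted(apoe_allele(*hap) for hap in (haplotype_1, haplotype_2))
    return "/".join(alleles)


def carries_e4(haplotype_1, haplotype_2):
    """True if at least one of the two haplotypes is the E4 risk allele."""
    return "E4" in apoe_genotype(haplotype_1, haplotype_2).split("/")
```

For example, `apoe_genotype(("T", "C"), ("C", "C"))` describes a person with one E3 and one E4 allele.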
STORE DATA IN
_sequencing produces a lot of data
_to share it with others, 3 genome browsers were developed to store it and make it visible:
  NCBI Map Viewer
  Ensembl
  UCSC
USE OF NGS DATA
_finding (rare) variants that can be used in Genome Wide Association studies
_solving Mendelian disorders that were until now impossible to solve, especially rare diseases and diseases in small families
_solving de novo diseases (newly arisen in the child, parents are healthy)
_finding somatic mutations in cancer
_learning and understanding the sequence/structure of DNA
solving Mendelian disorders
[Diagram labels: disease, function, gene, location on genome/chromosome; route shown: linkage analysis / positional cloning]
_you look for the chromosomal region that is shared by all patients (in a family) and segregates with the disease
_you follow polymorphic markers in the pedigree; these can be CA repeats or SNPs
_you need large families with (about 10) affected individuals to get a small region where you can search for the mutated gene
_you need a lot of meioses to end up with a small region
_this was very time consuming with the old techniques (using CA repeats)
_it is faster with SNPs as markers (SNP arrays)
_small families with rare diseases cannot be solved in this way
This chromosome segregates with the disease in this family and contains a gene mutation = principle of a linkage study
[Figure: pedigree with CA-repeat marker alleles (ca10 to ca18) on the chromosomes of each family member]
GERM CELL DIVISION
_during the production of germ cells (meiosis) the paired chromosomes exchange pieces of DNA = recombination or crossing over
_this creates genetic diversity
_germ cells end up with single chromatids from recombined chromosomes
[Figure: recombination and segregation, how it works in a family. Pedigree with A = affected and X = recombination]
_the region still needs to be sequenced
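The principle of narrowing a linkage region can be sketched as follows. This is a toy illustration under stated assumptions, not a linkage program: the marker names, alleles and the function `shared_region` are hypothetical, the disease haplotype is assumed to be known, and no LOD scores are calculated. Each recombination in an affected individual removes markers from the edge of the region that all patients share.

```python
def shared_region(markers, disease_haplotype, affected):
    """Return the longest run of consecutive markers at which every affected
    individual carries the allele of the disease haplotype.

    markers: marker names in chromosomal order
    disease_haplotype: dict marker -> allele on the disease chromosome
    affected: list of dicts, marker -> set of the two alleles a patient carries
    """
    consistent = [
        all(disease_haplotype[m] in person[m] for person in affected)
        for m in markers
    ]
    best_start, best_end = 0, 0
    run_start = None
    # The trailing False closes a run that reaches the last marker.
    for i, ok in enumerate(consistent + [False]):
        if ok and run_start is None:
            run_start = i
        elif not ok and run_start is not None:
            if i - run_start > best_end - best_start:
                best_start, best_end = run_start, i
            run_start = None
    return markers[best_start:best_end]


# Hypothetical CA-repeat data for three affected family members.
markers = ["M1", "M2", "M3", "M4", "M5"]
disease = {"M1": "ca10", "M2": "ca16", "M3": "ca14", "M4": "ca18", "M5": "ca15"}
patient_1 = {"M1": {"ca10", "ca12"}, "M2": {"ca16", "ca15"}, "M3": {"ca14", "ca16"},
             "M4": {"ca18", "ca14"}, "M5": {"ca15", "ca16"}}
patient_2 = {"M1": {"ca12", "ca14"}, "M2": {"ca16", "ca14"}, "M3": {"ca14", "ca15"},
             "M4": {"ca18", "ca16"}, "M5": {"ca15", "ca14"}}  # recombination before M2
patient_3 = {"M1": {"ca10", "ca15"}, "M2": {"ca16", "ca18"}, "M3": {"ca14", "ca10"},
             "M4": {"ca18", "ca15"}, "M5": {"ca16", "ca14"}}  # recombination after M4

region = shared_region(markers, disease, [patient_1, patient_2, patient_3])
```

With more affected individuals (more meioses) there are more recombinations, so the shared run of markers gets shorter. This is why large families are needed.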
NEW WAY OF GENE / MUTATION FINDING = EXOME SEQUENCING
_you can determine the complete DNA sequence of all exons in one individual in
_you only have to look in exons, because most Mendelian diseases have a mutation in the coding sequence
_costs keep going down and more extensive platforms are being developed = affordable
_especially suited for:
  small families and sporadic cases (without family history)
  de novo (dominant) cases (these were impossible to solve, because mapping with linkage is not possible)
  continuing positional cloning projects that are stuck with large linkage regions
FIRST PAPER ON EXOME SEQUENCING
_published in Nature Genetics in 2010 by Ng et al.
_4 patients from 3 families with Miller syndrome (facial and limb abnormalities)
_very rare disease, only 30 cases described in the literature
_linkage analysis was not an option, because the families are very small
3 families:
  family 1: 2 affected individuals
  family 2: 1 affected individual
  family 3: 1 affected individual
assumed mode of inheritance: autosomal recessive
Exome sequencing: 40-fold coverage of 37 Mb = all coding sequence
data analysis, interest in:
_variants that cause amino acid changes (NS = non-synonymous changes)
_splice site mutations (SS = splice site changes)
_short insertions and deletions in coding regions
_new variants not present in the dbSNP database
_variants predicted to be damaging
data analysis, filtered out all variants that were not relevant:
_mainly synonymous changes: change at the DNA level, amino acid does not change
_intronic changes
_known SNPs
first only family 1 was analyzed
_for the recessive model each sibling was required to have variants in the same gene: 2800 genes
_filtered out all known SNPs present in dbSNP129 and in 8 HapMap samples: problem reduced to 9 genes
_compared these 9 genes to the variants in the two other unrelated patients, from family 2 and family 3: problem reduced to 1 gene
_DHODH = dihydroorotate dehydrogenase
BUT: it was not a classical (homozygous) recessive mutation, but compound heterozygosity: the parents carried different mutations in the same gene!
Unexpected bonus:
_recessive mutations were also found in another gene, DNAH5 (= dynein, axonemal, heavy chain 5), but only in the 2 siblings from family 1, not in the patients from family 2 and 3
_the patients from family 1 had additional clinical problems with pulmonary infections (not part of Miller syndrome), and this second gene, DNAH5, was responsible for that
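The filtering logic of this study can be sketched in Python. This is a simplified illustration and not the authors' pipeline: the variant identifiers and the genes GENE_X and GENE_Y are invented toy data, and the function names are hypothetical. Under a recessive model a gene qualifies in one individual when it carries at least two novel variant alleles, either one homozygous variant or two different heterozygous variants (compound heterozygosity).

```python
from collections import Counter


def candidate_genes(variants, known_ids=frozenset()):
    """Genes with at least two novel variant alleles in one individual.

    variants: list of (gene, variant_id, allele_count) tuples, where
    allele_count is 1 for a heterozygous and 2 for a homozygous variant.
    known_ids: variant ids found in dbSNP or control samples (filtered out).
    """
    alleles_per_gene = Counter()
    for gene, variant_id, allele_count in variants:
        if variant_id in known_ids:
            continue
        alleles_per_gene[gene] += allele_count
    return {gene for gene, n in alleles_per_gene.items() if n >= 2}


def shared_candidates(individuals, known_ids=frozenset()):
    """Genes that qualify in every individual of a non-empty list."""
    gene_sets = [candidate_genes(v, known_ids) for v in individuals]
    return set.intersection(*gene_sets)


# Toy data (invented): two siblings of family 1 plus one patient each
# from family 2 and family 3.
KNOWN = {"rsK1"}
sibling_1 = [("DHODH", "v1", 1), ("DHODH", "v2", 1), ("DNAH5", "v3", 2), ("GENE_X", "rsK1", 2)]
sibling_2 = [("DHODH", "v1", 1), ("DHODH", "v2", 1), ("DNAH5", "v3", 2), ("GENE_X", "rsK1", 2)]
family_2 = [("DHODH", "v4", 1), ("DHODH", "v5", 1), ("GENE_Y", "v6", 1)]
family_3 = [("DHODH", "v1", 1), ("DHODH", "v7", 1)]

in_siblings = shared_candidates([sibling_1, sibling_2], KNOWN)
in_all_patients = shared_candidates([sibling_1, sibling_2, family_2, family_3], KNOWN)
```

In this toy data the siblings share two candidate genes, and only the comparison with the unrelated patients separates the gene shared by all patients from the gene that is private to family 1.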
[Figure: OMIM data, 1988-2017. Number of phenotypes with known gene function and without known gene function per year (y-axis 1000 to 6000). Annotations: first working draft of human genome; start of exome sequencing; 12 years: 1000; 10 years: 1800; 7 years: 2270]
A WHOLE EXOME IS NOT THAT WHOLE
it all seems very easy but...
_still not all genes or exons are recognized, and these are therefore not present in the sequence capture kit
_sometimes even known genes are not present in the capture kit, but this is improving
_not all targeted exons are captured well
_not all targeted sequence can be aligned back to the reference genome
_not all aligned sequences can be accurately called
_regulatory DNA regions are not (yet) present in the sequence capture kit (whole genome sequencing better?)
_copy number variation is (still) hard to recognize
_mitochondrial DNA is sometimes present, sometimes not
  mitochondrial probes are not present in the capture kit
  mitochondrial sequence is a "by-product"
_micro RNAs are not all recognized and are not in the capture kit
PROBLEM OF LOCUS HETEROGENEITY
= the disease in your group of patients is caused by more than 1 gene
for exome sequencing this is a problem:
_the patient population is not homogeneous
_you are searching for variants in more than 1 gene
_you need enough samples to find the gene mutation
_especially important for sporadic cases
ADVANTAGE OF ALLELIC HETEROGENEITY
= the disease in your group of patients is caused by more than 1 variant in the same gene
for exome sequencing this is an advantage:
_the variants are spread over the same gene in different patients
_the risk of not finding the variant is spread over different sequencing targets, which increases the chance of finding the causal gene
HOW TO FIND THE RIGHT VARIANT (MUTATION)
_after exome sequencing you are left with around 25,000 variants per person
_filtering steps are needed to filter out normal variants:
  many databases with DNA variants can be used to filter out common variants
  non-coding variants can be filtered out
  new variants that are synonymous (no change in amino acid) can be filtered out
_after that you hope you do not have that many left to select from: coding variants (stop, non-synonymous, splice site) and some indels
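The filtering steps above can be sketched as a single function. This is a minimal sketch under stated assumptions: the `Variant` record, the consequence labels and the 1% frequency cut-off are illustrative choices, not values from the slides, and real pipelines work on annotated variant call files.

```python
from dataclasses import dataclass

# Consequence classes that are kept (labels are an assumption of this sketch).
KEEP_CONSEQUENCES = {"stop_gained", "missense", "splice_site", "frameshift", "inframe_indel"}


@dataclass
class Variant:
    gene: str
    consequence: str        # e.g. "missense", "synonymous", "intronic"
    population_freq: float  # frequency in variant databases, 0.0 if absent


def filter_variants(variants, max_freq=0.01):
    """Drop common variants, then non-coding and synonymous variants."""
    return [
        v for v in variants
        if v.population_freq <= max_freq and v.consequence in KEEP_CONSEQUENCES
    ]


# Toy input (invented).
variants = [
    Variant("GENE_A", "missense", 0.0),      # kept: new and non-synonymous
    Variant("GENE_B", "synonymous", 0.0),    # dropped: no amino acid change
    Variant("GENE_C", "stop_gained", 0.25),  # dropped: common in the population
    Variant("GENE_D", "intronic", 0.0),      # dropped: non-coding
    Variant("GENE_E", "frameshift", 0.001),  # kept: rare indel
]
remaining = filter_variants(variants)
```

The frequency cut-off is the design choice that matters most: too strict and a recessive disease allele that occurs in healthy carriers is thrown away, too loose and too many variants are left.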
_you will end up with a list of variants that are the potential cause of the disease you are studying
_there will be false positives (false negatives?)
_every healthy person has around 100 LOF variants in their genome, including non-penetrant disease mutations: which one is it??
_you will need to investigate functional issues:
  are the variants conserved in other species?
  do the variants have an effect on the protein?
_a (linkage) region, even a large one, can help restrict the number of variants
dbSNP AGAIN
_use of dbSNP can be helpful but also dangerous
_variants associated with disease are also in there
_mutations of ("more common") recessive disorders are present, e.g. mutations in the CFTR gene
  very rare variants are probably not in there
_sequence data of cohorts is deposited in dbSNP, e.g. from the Exome Sequencing Project (ESP), Univ of Washington: exome sequencing data from people with heart, lung and blood disorders
_it would be good to have your own internal control set, which you know best yourself
_the ExAC database is used by many researchers (http://exac.broadinstitute.org/)
WHAT DO YOU GET
[Figure labels: variant, location, genotypes]
HOW TO FIND THE RIGHT VARIANT
[Figure labels: prediction programs, conservation, population frequency]
EXAMPLE OF RECESSIVE DISORDER WITH CONSANGUINITY SOLVED BY LINKAGE ANALYSIS AND EXOME SEQUENCING
Malpuech-Michels-Mingarelli-Carnevale syndrome
AJHG 87, 2010
_disorder with developmental delay, ocular and abdominal defects, skeletal anomalies, facial characteristics, normal intelligence to mild intellectual disability, hearing problems
_phenotypic variability but also overlap between the cases described by the 4 mentioned authors
_rare: 10-15 small families reported in the literature
_assumed autosomal recessive inheritance
strategy used:
_linkage analysis/homozygosity mapping in 2 families with related parents
_sequenced the exome of 1 patient
_linkage/homozygosity mapping identified a region of 1.8 Mb (chr3) in family 1:
  homozygous in the 2 patients, not present in homozygous state in the unaffected children
  heterozygous in the parents (and in one unaffected child)
_the same region, but larger (24 Mb), was present in family 2 (confirmation of the region)
_the exome of one patient, patient 101, was sequenced: smallest homozygous region
exome data of 1 patient, filtered according to a recessive model: mutation on both alleles
_only looked in the 1.8 Mb region, which contains 20 genes
_filtered on:
  sufficiently good quality
  not present in dbSNP
  not present in the data of the 1000 Genomes Project
_4 variants left:
  1 was intronic, non-conserved
  1 was intergenic, non-conserved
  1 was non-synonymous, non-conserved
  1 was non-synonymous, conserved in other species: in MASP1 = mannan-binding lectin serine protease 1, predicted to be damaging by the SIFT and PolyPhen programs
_Sanger sequencing identified a STOP mutation in MASP1 exon 6 in family 2
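Combining a mapped region with exome data can be sketched as below. This is an illustration only: the region coordinates are hypothetical placeholders (the slide gives only the size, 1.8 Mb on chr3), the quality threshold of 30 is an assumption, and the `Call` record and gene names are invented.

```python
from typing import NamedTuple


class Call(NamedTuple):
    chrom: str
    pos: int
    gene: str
    homozygous: bool
    quality: float
    in_dbsnp: bool
    in_1000g: bool


# Hypothetical coordinates for a 1.8 Mb homozygous region on chr3.
REGION = ("chr3", 10_000_000, 11_800_000)


def recessive_candidates(calls, region=REGION, min_quality=30.0):
    """Keep good-quality, novel, homozygous calls inside the mapped region."""
    chrom, start, end = region
    return [
        c for c in calls
        if c.chrom == chrom and start <= c.pos <= end
        and c.homozygous
        and c.quality >= min_quality
        and not c.in_dbsnp and not c.in_1000g
    ]


# Toy calls (invented).
calls = [
    Call("chr3", 10_500_000, "GENE_A", True, 60.0, False, False),   # kept
    Call("chr3", 10_600_000, "GENE_B", False, 60.0, False, False),  # heterozygous
    Call("chr3", 10_700_000, "GENE_C", True, 60.0, True, False),    # known in dbSNP
    Call("chr3", 50_000_000, "GENE_D", True, 60.0, False, False),   # outside the region
    Call("chr3", 11_000_000, "GENE_E", True, 10.0, False, False),   # low quality
]
candidates = recessive_candidates(calls)
```

Restricting the search to the mapped region first is what makes the exome of a single patient sufficient here: only the variants in about 20 genes have to pass the remaining filters.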
some functional evidence: specific modeling software shows that the NS variant has an effect on the 3D model of the protein
[Figure labels: stop codon; missense = amino acid change]
EXAMPLE OF AUTOSOMAL DOMINANT DISORDER SOLVED BY EXOME SEQUENCING
KABUKI SYNDROME
Nat Genet 42, 2010
_disorder with skeletal phenotypes and intellectual disability; phenotypic variability
_very rare: 1/30,000 to 1/50,000; 400 cases reported worldwide
_not solvable by positional cloning/linkage analysis
_most cases are de novo; in some instances transmission from parent to child suggested an autosomal dominant disorder
strategy used:
_sequenced the exomes of 10 unrelated patients
_they could initially NOT find the mutation
problems encountered:
_more than one gene was involved: locus heterogeneity
_the gene capture missed a number of exons
later resolved: MLL2 gene, 54 exons
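Why locus heterogeneity and missed exons defeat a strict filter can be shown with a small sketch. It is not the published analysis: the patient data and gene names other than MLL2 are invented, and the function names are hypothetical. Instead of requiring a candidate gene in every patient, genes are ranked by the number of unrelated patients in which they carry a novel damaging variant.

```python
from collections import Counter


def genes_in_all(patients):
    """Strict filter: genes with a candidate variant in every patient."""
    return set.intersection(*(set(genes) for genes in patients.values()))


def rank_genes(patients, min_patients=2):
    """Relaxed filter: genes hit in at least min_patients patients, most hits first."""
    counts = Counter()
    for genes in patients.values():
        counts.update(set(genes))  # count each gene once per patient
    return [(gene, n) for gene, n in counts.most_common() if n >= min_patients]


# Toy data (invented): gene lists with novel damaging variants per patient.
patients = {
    "P1": ["MLL2", "GENE_A"],
    "P2": ["MLL2"],
    "P3": ["GENE_B"],          # mutation in another gene, or exon missed by the capture
    "P4": ["MLL2", "GENE_B"],
    "P5": ["GENE_C"],
}
strict = genes_in_all(patients)
ranked = rank_genes(patients)
```

The strict intersection is empty in this toy data, while the ranking still puts the gene hit in most patients at the top.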
LOSS OF FUNCTION VARIANTS IN HUMAN PROTEIN-CODING GENES
Science 335, 2012
in normal people
_loss of function variants severely disrupt the function of protein-coding genes through mutations in the DNA
_normally expected to be present in genes causing genetic diseases
_BUT every healthy individual has around 100 LOF variants:
  ± 20 stop codons, 6 in homozygous state
  ± 80 deletions/insertions/frameshifts, up to 14 in homozygous state
[Figure: Classes of LOF variant affecting protein-coding regions. MacArthur D G, Tyler-Smith C. Hum. Mol. Genet. 2010;19:R125-R130]
WHAT IS IN THERE
_mutations in heterozygous state that would cause disease in homozygous state = recessive diseases
_LOF variants in genes that are poorly conserved in evolution: small or no effect
_LOF variants in genes that belong to multi-gene families, suggesting that proteins in the same family have similar functions
_a minority are associated with reduced expression of the corresponding gene
question: what effect do these variants have on human phenotypes?
this was a first report about LOF variants in normal individuals, in 2012
now it is found that Mendelian disease variants, which cause severe disorders, are also present in healthy individuals
normal individuals carry SEVERE MENDELIAN CHILDHOOD DISEASE MUTATIONS
Nat biotechnology 34, 2016
HYPOTHESIS
_analysed existing sequence data (WES and WGS)
_identified 13 individuals with mutations in 8 disease genes that normally cause severe Mendelian childhood disorders:
  5 in homozygous state for AR diseases
  3 in heterozygous state for AD diseases
hypothesis: these individuals carry a genetic modulator or suppressor that protects against Mendelian disease
SOME ISSUES
_for 6 individuals the health record was checked: no mention of disease
_for 5 of the 13 the genotypes could be confirmed; for the others no DNA was available any more
_because of the original consent these individuals could not be re-contacted
_the search for counterbalancing variants was hampered by this unavailability
_data sharing is important to investigate this further
_participants need to be linked to their medical records
_consent for re-contacting is needed
OTHER EXAMPLES
_>1000 rare homozygous LOF variants in 3222 inbred individuals of Pakistani descent from the UK: predicted to be pathogenic, but without clinical phenotype
_56 LOF mutations in ASXL1 in the ExAC database, in individuals without severe genetic childhood disorders
  de novo (dominant) mutations in ASXL1 cause severe Bohring-Opitz syndrome
LARGEST GENOMIC DATABASE
New initiative: Million Veteran Program USA
_participants donate blood for DNA extraction
  + give access to their electronic health records
  + agree to be contacted about participating in future research
http://www.va.gov/opa/pressrel/pressrelease.cfm?id=2806
http://www.research.va.gov/mvp/
2011: start
Aug 2016: 500,000 participants
July 2017: 580,000 participants
2020:
INTERESTED IN SEQUENCING? you are welcome to contact us
Annemieke: j.verkerk@erasmusmc.nl
André: a.g.uitterlinden@erasmusmc.nl
Robert: r.kraaij@erasmusmc.nl