 
GENETICS FOR DUMMIES
APPLICATIONS OF NEXT GENERATION SEQUENCING IN HUMAN GENETICS
ANNEMIEKE VERKERK
j.verkerk@erasmusmc.nl
www.woesvanhaaften.com/weblog/joost-van-haaften/
HUMAN GENOMES
_in the Human Genome Project a full genome was sequenced from the DNA of a mix of male and female individuals
_this sequence served as the human genome reference sequence and does not represent anyone's personal genome
_it took 13 years and cost $3 to 5 billion
_with the new techniques of NGS, sequencing became much faster and people wanted to sequence personal genomes
_a lot of data: stored in databases, visualised in genome browsers
CONTINUED RIVALRY
_the first personal genome sequenced with the new method was that of James Watson
  generated within 4 months
  cost $1.5 million
  published in Nature in April 2008
_but... Craig Venter's personal genome was sequenced first, although with the previous-generation machines
  cost $100 million
  published in PLoS Biology, Sept 2007
despite all sequencing advances, we are still learning how to read all this data
PRIVACY ISSUES
_this also started the debate about privacy issues around human genome sequences and the "right" to know or not to know
_James Watson did not want to know the status of his ApoE gene, because the sequence within it could indicate his risk for Alzheimer's disease
_that part of his genome is therefore not present in the genome browser
locus: APOE gene, chr 19
SNP         E2/E2    E4/E4
rs429358    T T      C C
rs7412      T T      C C
E4: increased risk for Alzheimer's disease
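The table above can be read as a lookup from the two SNP alleles on one chromosome to an APOE isoform. The sketch below is only an illustration of that lookup: the E2 and E4 rows come from the slide, the E3 row (T at rs429358, C at rs7412) is added here as the common allele, and the function names are hypothetical. It assumes the two haplotypes are already phased, which short-read data does not always give you.

```python
# (allele at rs429358, allele at rs7412) on one chromosome -> APOE isoform.
# E2 and E4 are taken from the slide; E3 is added here (assumption of this sketch).
APOE_HAPLOTYPES = {
    ("T", "T"): "E2",
    ("T", "C"): "E3",
    ("C", "C"): "E4",
}


def apoe_genotype(hap1, hap2):
    """Return the APOE genotype (e.g. 'E3/E4') for two phased haplotypes.

    Each haplotype is a tuple (rs429358 allele, rs7412 allele).
    Haplotypes that are not in the table are reported as 'unknown'.
    """
    isoforms = sorted(APOE_HAPLOTYPES.get(h, "unknown") for h in (hap1, hap2))
    return "/".join(isoforms)


def carries_e4(hap1, hap2):
    """True if at least one of the two haplotypes is the E4 risk allele."""
    return "E4" in apoe_genotype(hap1, hap2).split("/")


print(apoe_genotype(("T", "T"), ("T", "T")))  # E2/E2
print(apoe_genotype(("C", "C"), ("C", "C")))  # E4/E4
print(carries_e4(("T", "C"), ("C", "C")))     # True
```

Keeping the lookup separate from the yes/no risk question mirrors the privacy point on the previous slide: the genotype can be stored without ever calling `carries_e4`.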
STORE DATA IN GENOME BROWSERS
_sequencing produces a lot of data
_to share it with others, 3 genome browsers were developed to store it in and make it visible
  NCBI Map Viewer
  Ensembl
  UCSC
USE OF NGS DATA
_find (rare) variants that can be used in Genome Wide Association studies
_solving Mendelian disorders that were until now impossible to solve
  especially rare diseases and diseases in small families
_solving de novo diseases (newly arisen in the child, parents are healthy)
_somatic mutations in cancer
_learn and understand the sequence/structure of DNA
solving Mendelian disorders
linkage analysis / positional cloning
disease -> location on genome/chromosome -> gene -> function
LINKAGE ANALYSIS
_you look for the chromosomal region that is shared by all patients (in a family) and segregates with the disease
_you follow polymorphic markers in the pedigree; these can be CA repeats or SNPs
_you need large families (about 10 affected individuals) to get a small region where you can search for the mutated gene
_you need a lot of meioses to end up with a small region
_was very time consuming with the old techniques (using CA repeats)
_faster with SNPs as markers (SNP arrays)
_small families with rare diseases cannot be solved in this way
LINKAGE ANALYSIS
[Figure: pedigree with the CA-repeat marker alleles (ca 10, ca 12, ca 14, ca 15, ca 16, ca 18) carried by each family member]
This chromosome segregates with the disease in this family and contains a gene mutation
= principle of a linkage study
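The principle of following a marker through a pedigree can be sketched in a few lines. This is a toy illustration only: the family, the CA-repeat lengths and the function name are hypothetical, a dominant disease with full penetrance is assumed, and a real linkage study would calculate LOD scores over many markers instead of a simple set intersection.

```python
def segregating_alleles(genotypes, affected):
    """Return marker alleles carried by every affected and by no unaffected person.

    genotypes: dict person -> (allele1, allele2) for ONE marker (e.g. CA-repeat lengths)
    affected:  set of person names that have the disease
    """
    affected_sets = [set(g) for p, g in genotypes.items() if p in affected]
    if not affected_sets:
        return set()
    # Alleles shared by all affected individuals
    shared = set.intersection(*affected_sets)
    # Remove alleles that also occur in healthy relatives
    for person, alleles in genotypes.items():
        if person not in affected:
            shared -= set(alleles)
    return shared


# Hypothetical nuclear family, one CA-repeat marker (numbers = repeat lengths)
family = {
    "father": (10, 15),
    "mother": (14, 18),
    "child1": (10, 14),
    "child2": (15, 18),
    "child3": (10, 18),
}
affected = {"father", "child1", "child3"}

print(segregating_alleles(family, affected))  # {10}
```

In this made-up family the allele with 10 repeats travels with the disease, so the mutated gene is expected to lie near that marker.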
MEIOSIS
GERM CELL DIVISION
_during production of germ cells, paired homologous chromosomes exchange pieces of DNA (in prophase I of meiosis) = recombination or crossing over
_this creates genetic diversity
_germ cells end up with single chromatids from recombined chromosomes
recombination and segregation: how it works in a family
[Figure: pedigree showing recombination and segregation; A = affected]
the region still needs to be sequenced
NEW WAY OF GENE / MUTATION FINDING = EXOME SEQUENCING
_you can determine the complete DNA sequence of all exons in one individual in one go
_you only have to look in exons, because most Mendelian diseases have a mutation in the coding sequence
_costs keep going down and more extensive platforms are developed = affordable
_especially suited for
  small families and sporadic cases (without family history)
  de novo (dominant) cases (were impossible to solve, because mapping with linkage is not possible)
  continuing stuck positional cloning projects with large linkage areas
FIRST PAPER ON EXOME SEQUENCING
_published in Nature Genetics in 2010 by Ng et al.
_4 patients from 3 families with Miller syndrome (facial and limb abnormalities)
_very rare disease, only 30 cases described in the literature
_linkage analysis was not an option, because the families are very small
3 families
  family 1: 2 affected individuals
  family 2: 1 affected individual
  family 3: 1 affected individual
assumed mode of inheritance: autosomal recessive
Exome Sequencing: 40-fold coverage of 37 Mb = all coding sequence
in the data analysis, interest in:
_variants that cause amino acid changes (NS = non-synonymous changes)
_splice site mutations (SS = splice site changes)
_short insertions and deletions in coding regions
_new variants not present in the dbSNP database
_variants predicted to be damaging
in the data analysis, all variants that were not relevant were filtered out:
_mainly synonymous changes: change at the DNA level, but the amino acid does not change
_intronic changes
_known SNPs
first only family 1 was analyzed
_for the recessive model each sibling was required to have variants in the same gene: 2800 genes
_filtered out all known SNPs present in dbSNP129 and 8 HapMap samples: problem reduced to 9 genes
_compared these 9 genes to variants in the two other unrelated patients from families 2 and 3: problem reduced to 1 gene: DHODH = dihydroorotate dehydrogenase
BUT: it was not a classical homozygous recessive mutation, but a compound heterozygous one: the parents carried different mutations in the same gene!
Unexpected bonus:
_recessive mutations were also found in another gene, DNAH5, but only in the 2 siblings from family 1, not in the patients from families 2 and 3
_the patients from family 1 had additional clinical problems with pulmonary infections (not part of Miller syndrome), and the second gene DNAH5 (= dynein, axonemal, heavy chain 5) was responsible for that
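The gene-level reasoning of this study (drop known SNPs, keep genes with two hits per patient, intersect across patients) can be sketched as below. This is not the pipeline used by Ng et al.; it is a minimal illustration with made-up variant IDs and hypothetical function names. Only the gene names DHODH and DNAH5 come from the slides. The sketch also does not check that two heterozygous variants lie on different parental chromosomes, which a real compound heterozygous call requires.

```python
from collections import Counter


def candidate_genes(variants, known_ids):
    """Genes with at least two novel variant alleles in one patient.

    variants:  list of (gene, variant_id, allele_count); allele_count is
               1 for heterozygous and 2 for homozygous
    known_ids: set of variant IDs already present in public databases
    """
    alleles_per_gene = Counter()
    for gene, variant_id, allele_count in variants:
        if variant_id not in known_ids:  # filter out known SNPs
            alleles_per_gene[gene] += allele_count
    # Recessive model: homozygous (2) or two different heterozygous hits (1 + 1)
    return {gene for gene, n in alleles_per_gene.items() if n >= 2}


def shared_candidates(patients, known_ids):
    """Genes that fit the recessive model in every patient given."""
    gene_sets = [candidate_genes(v, known_ids) for v in patients.values()]
    return set.intersection(*gene_sets) if gene_sets else set()


# Hypothetical toy data; variant IDs are invented for illustration
known = {"rs_known_1"}
patients = {
    "fam1_sib1": [("DHODH", "v1", 1), ("DHODH", "v2", 1),
                  ("DNAH5", "v3", 1), ("DNAH5", "v4", 1),
                  ("GENE_A", "rs_known_1", 2)],
    "fam1_sib2": [("DHODH", "v1", 1), ("DHODH", "v2", 1),
                  ("DNAH5", "v3", 1), ("DNAH5", "v4", 1),
                  ("GENE_B", "v5", 1)],
    "fam2_pat": [("DHODH", "v6", 1), ("DHODH", "v7", 1)],
    "fam3_pat": [("DHODH", "v1", 1), ("DHODH", "v8", 1),
                 ("GENE_B", "v9", 1)],
}

siblings = {k: v for k, v in patients.items() if k.startswith("fam1")}
print(sorted(shared_candidates(siblings, known)))  # ['DHODH', 'DNAH5']
print(sorted(shared_candidates(patients, known)))  # ['DHODH']
```

With the siblings alone both genes remain; adding the unrelated patients leaves one gene, which is the pattern described on the two slides above.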
OMIM data
[Figure: number of OMIM phenotypes per year, 1988 to 2017 (y-axis 0 to 6000); series: phenotype with known gene function, phenotype without known gene function; marked events: first working draft of human genome, start of exome seq; annotations: 12 years: 1000, 10 years: 1800, 7 years: 2270]
PROBLEMS... IN TECHNIQUE I
it all seems very easy but...
_still not all genes or exons are recognized, and these are therefore not present in the sequence capture kit
_sometimes even known genes are not present in the capture kit, but this is improving
_not all targeted exons are well captured
_not all targeted sequence can be aligned back to the reference genome
_not all aligned sequences can be accurately called
PROBLEMS... IN TECHNIQUE II
_regulatory DNA regions are not (yet) present in the sequence capture kit (whole genome sequencing better?)
_copy number variation is (still) hard to recognize
_mitochondrial DNA is sometimes present, sometimes not
  mitochondrial probes are not present in the capture kit
  mitochondrial sequence is a "by-product"
_micro RNAs are not all recognized and are not in the capture kit
A WHOLE EXOME IS NOT THAT WHOLE
PROBLEM OF LOCUS HETEROGENEITY
= the disease in your group of patients is caused by more than 1 gene
for exome sequencing this is a problem:
_the patient population is not homogeneous
_you are searching for variants in more than 1 gene
_you need enough samples to find the gene mutation
_especially important for sporadic cases
ADVANTAGE OF ALLELIC HETEROGENEITY
= the disease in your group of patients is caused by more than 1 variant in the same gene
for exome sequencing this is an advantage:
_the variants are spread over the same gene in different patients
_the risk of missing a variant is spread over different sequencing targets, which increases the chance of finding the causal gene
PROBLEMS... IN ANALYZING
How to find the right variant (mutation)...
_after exome sequencing you are left with around 25,000 variants per person
_filtering steps are needed to filter out normal variants
_many databases with DNA variants can be used to filter out common variants
  dbSNP
  1000 Genomes Project
  NHLBI Exome Sequencing Project (ESP6500, University of Washington)
  ExAC (from 60,706 individuals combined from 17 different databases)
_non-coding variants can be filtered out
_new variants that are synonymous (no change in amino acid) can be filtered out
_after that you hope you do not have that many left to select from:
  coding: stop, nonsynonymous, splice site variants, some indels
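The filtering steps listed above can be sketched as a simple function. The field names (`pop_af`, `consequence`), the consequence labels and the 1% frequency cut-off are assumptions of this sketch, not values from the slides; in practice they depend on the annotation tool and on the disease model.

```python
# Consequence classes worth keeping (labels are hypothetical, not a tool's vocabulary)
KEEP_CONSEQUENCES = {"stop_gained", "nonsynonymous", "splice_site", "coding_indel"}


def filter_variants(variants, max_af=0.01):
    """Keep rare or novel variants that are likely to change the protein.

    variants: list of dicts with keys
        'id'          : variant identifier
        'pop_af'      : highest allele frequency found in the public databases,
                        or None if the variant is in none of them (novel)
        'consequence' : predicted effect on the gene
    max_af:   frequency above which a variant counts as common (assumed 1%)
    """
    kept = []
    for v in variants:
        af = v["pop_af"] if v["pop_af"] is not None else 0.0
        if af > max_af:
            continue  # common variant, filtered out
        if v["consequence"] not in KEEP_CONSEQUENCES:
            continue  # non-coding or synonymous, filtered out
        kept.append(v)
    return kept


# Hypothetical toy input
variants = [
    {"id": "v1", "pop_af": 0.20, "consequence": "nonsynonymous"},
    {"id": "v2", "pop_af": 0.001, "consequence": "synonymous"},
    {"id": "v3", "pop_af": None, "consequence": "intronic"},
    {"id": "v4", "pop_af": None, "consequence": "nonsynonymous"},
    {"id": "v5", "pop_af": 0.001, "consequence": "stop_gained"},
]

print([v["id"] for v in filter_variants(variants)])  # ['v4', 'v5']
```

The order of the two checks does not change the result; filtering on frequency first is usually cheaper, because most of the roughly 25,000 variants are common.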
MAIN ISSUE: TO FIND THE RIGHT VARIANT
_you will end up with a list of variants that are the potential cause of the disease you are studying
_there will be false positives (false negatives?)
_every healthy person has about 100 loss-of-function (LOF) variants in their genome, including non-penetrant disease mutations
  which one is it??
_you will need to investigate functional issues
  are the variant positions conserved in other species?
  do the variants have an effect on the protein?
_a linkage region, even though large, can help restrict the number of variants