The 1000 genomes project

SLIDE 1

The 1000 genomes project

SLIDE 2

The 1000 genomes project

  • Genetic variation > 1%
  • 1000 → 2500 individuals
  • China, Germany, the UK, the USA
  • 28 populations from Europe, East Asia, West Africa, America, South Asia

SLIDE 3

The 1000 genomes project

SLIDE 4

The 1000 genomes project

Pilot | Purpose | Coverage | Strategy | Status
1 - low coverage | Assess strategy of sharing data across samples | 2-4X | Whole-genome sequencing of 180 samples | Sequencing completed October 2008
2 - trios | Assess coverage and platforms and centers | 20-60X | Whole-genome sequencing of 2 mother-father-adult child trios | Sequencing completed October 2008
3 - gene regions | Assess methods for gene-region capture | 50X | 1000 gene regions in 900 samples | Sequencing completed June 2009

SLIDE 5

The 1001 Genomes Project

Arabidopsis thaliana

SLIDE 6

The 1001 Genomes Project

  • First plant with a known genome sequence
  • 125 – 150 Mb, 5 chromosomes, 30000 genes
  • Self-fertilizing
  • Big genetic and phenotypic diversity
  • Few known alleles responsible for phenotypic variations

SLIDE 7

The 1001 Genomes Project

  • 10 x 10 x 10 + 1 samples
  • The seeds are available in Arabidopsis stock centers
  • Includes morphological analysis

SLIDE 8

SHORE

  • Mapping and analysis pipeline
  • Short DNA sequences
  • Mapping to a reference sequence
  • Weighted and gapped alignments
  • SHOREmap
SLIDE 9

Sequencing Arabidopsis thaliana

  • Two naturally inbred accessions (Bur-0, Tsu-1)
  • Reference genome sequence (Col-0)
  • 120 – 173 million SBS reads
  • Aligned to Col-0 (up to 4 mismatches, indels up to 3 bp)
  • Minimum read support for base calls
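
The "minimum read support" rule can be illustrated with a small sketch. This is not the pipeline used in the study: the function names and the thresholds `min_support` and `min_fraction` are hypothetical, chosen only to show the idea that a base is called when enough aligned reads agree on it.

```python
from collections import Counter


def call_base(pileup, min_support=3, min_fraction=0.8):
    """Return the consensus base for one reference position, or 'N'.

    pileup: string of bases observed in the reads covering the position.
    min_support, min_fraction: illustrative thresholds (assumptions,
    not values taken from the slides).
    """
    if not pileup:
        return "N"
    base, count = Counter(pileup.upper()).most_common(1)[0]
    if count >= min_support and count / len(pileup) >= min_fraction:
        return base
    return "N"


def call_snps(reference, pileups, **thresholds):
    """List (position, ref_base, called_base) where a confident call differs."""
    snps = []
    for pos, (ref_base, pileup) in enumerate(zip(reference, pileups)):
        called = call_base(pileup, **thresholds)
        if called != "N" and called != ref_base:
            snps.append((pos, ref_base, called))
    return snps


reference = "ACGT"
pileups = ["AAAA", "CCTT", "GG", "CCCCC"]
print(call_snps(reference, pileups))  # [(3, 'T', 'C')]
```

Positions 1 and 2 yield no call here because no base reaches three supporting reads, so only position 3 is reported as a difference from the reference.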
SLIDE 10

Identifying polymorphic regions

  • 4.3 Mb of non-repetitive or moderately repetitive regions not covered
  • GC-poor regions
  • 8 non-repetitive or moderately repetitive positions
  • Col-0: 28kb
  • Bur-0: 3.25 Mb, Tsu-1: 3.13 Mb
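
One way to picture the search for uncovered regions is a scan over per-position read depth. The sketch below is an assumption-laden illustration: the function `uncovered_regions` and its `min_length` parameter are hypothetical, and the exact criterion used in the study is not stated on the slide.

```python
def uncovered_regions(coverage, min_length=1):
    """Return half-open (start, end) intervals with zero read depth.

    coverage: list of per-position read depths along the reference.
    min_length: shortest run of uncovered positions to report
                (hypothetical parameter for illustration).
    """
    regions = []
    start = None
    for pos, depth in enumerate(coverage):
        if depth == 0:
            if start is None:
                start = pos  # a new uncovered run begins
        elif start is not None:
            if pos - start >= min_length:
                regions.append((start, pos))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(coverage) - start >= min_length:
        regions.append((start, len(coverage)))
    return regions


coverage = [5, 0, 0, 0, 7, 0, 3, 0, 0]
regions = uncovered_regions(coverage, min_length=2)
print(regions)                                   # [(1, 4), (7, 9)]
print(sum(end - start for start, end in regions))  # 5
```

Summing the interval lengths gives the total uncovered sequence, which is the kind of figure reported per accession above.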
SLIDE 11

De novo assembly of dissimilar sequences

  • Unmapped reads of high quality
  • Retain high-confidence reads
  • Alignment to the homologous target in the reference genome

  • Bur-0: 7396 contigs
  • Tsu-1: 3525 contigs
  • Col-0: 20 contigs
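
The step of retaining high-confidence reads before assembly can be sketched as a quality filter. Everything specific here is an assumption made for illustration: the Phred+33 quality encoding, the `min_quality` threshold, and the function names are hypothetical and are not taken from the slides.

```python
def phred_scores(quality_string, offset=33):
    """Decode an ASCII quality string into Phred scores (offset is assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def keep_high_quality(reads, min_quality=20):
    """Keep sequences whose every base meets min_quality and has no 'N'.

    reads: iterable of (sequence, quality_string) pairs for unmapped reads.
    min_quality: illustrative threshold, not a value from the slides.
    """
    kept = []
    for seq, qual in reads:
        if not qual or "N" in seq.upper():
            continue  # skip reads without qualities or with ambiguous bases
        if min(phred_scores(qual)) >= min_quality:
            kept.append(seq)
    return kept


unmapped = [
    ("ACGT", "IIII"),  # all bases Phred 40
    ("ACGN", "IIII"),  # contains an ambiguous base
    ("TTGA", "II#I"),  # one base with Phred 2
    ("GGCA", "5555"),  # all bases Phred 20
]
print(keep_high_quality(unmapped))  # ['ACGT', 'GGCA']
```

The retained reads would then be passed to an assembler, and the resulting contigs aligned back to the reference.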
SLIDE 12

Detection of duplications

  • Higher than expected coverage
  • Several reads support more than one base
  • Segmentation into regions of 250bp
  • Search for “heterozygous” positions
  • Bur-0: 332 kb
  • Tsu-1: 364 kb
  • Col-0: 11 kb
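
The combination of criteria listed above (elevated coverage, 250 bp segments, "heterozygous" positions) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `fold` and `min_het` thresholds, the use of mean depth per window, and the function name are hypothetical and do not come from the slides.

```python
def candidate_duplications(coverage, het_positions, window=250,
                           expected=None, fold=1.5, min_het=1):
    """Flag fixed-size windows that look like collapsed duplications.

    coverage: per-position read depth along the reference.
    het_positions: positions where reads support more than one base.
    expected: expected depth; defaults to the mean depth (assumption).
    fold, min_het: illustrative thresholds, not values from the slides.
    """
    if not coverage:
        return []
    if expected is None:
        expected = sum(coverage) / len(coverage)
    het = set(het_positions)
    flagged = []
    for start in range(0, len(coverage), window):
        end = min(start + window, len(coverage))
        mean_depth = sum(coverage[start:end]) / (end - start)
        n_het = sum(1 for pos in range(start, end) if pos in het)
        if mean_depth > fold * expected and n_het >= min_het:
            flagged.append((start, end))
    return flagged


coverage = [10] * 250 + [25] * 250 + [10] * 250 + [25] * 250
het_positions = [100, 300, 310]
print(candidate_duplications(coverage, het_positions, expected=10))
# [(250, 500)]
```

The last window has high coverage but no "heterozygous" positions, so it is not flagged. In a naturally inbred accession, apparent heterozygosity is a hint that reads from two similar copies were mapped to a single reference locus.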