Let us know who you are . Background survey 1. Major 2. Primary - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Let us know who you are …

  • Background survey
  • 1. Major
  • 2. Primary background (biology, computation, other)
  • 3. Programming experience (how much, what language)
  • Registered/not-registered/waiting-list
slide-2
SLIDE 2

Genome 373: Genomic Informatics

Elhanan Borenstein

slide-3
SLIDE 3
  • This course is intended to introduce students to the breadth of problems and methods in computational analysis of genomes and biological systems, arguably the single most important new(?) area in biological research.

  • Specific subjects will include:
  • Sequence alignment
  • Phylogenetic tree reconstruction
  • Clustering gene expression, annotation and enrichment
  • Network analysis
  • Gene finding
  • Machine learning
  • DNA sequencing and assembly

Genome 373: “Mission Statement”

slide-4
SLIDE 4

Today

  • Course logistics
  • Why Bioinformatics
  • Introduction to sequence alignment
slide-5
SLIDE 5

Instructors

  • Elhanan Borenstein: Weeks 1-5
  • Doug Fowler: Weeks 6-10
  • Office hours: Monday 2:20-3:00
slide-6
SLIDE 6

Who am I?

  • Faculty at Genome Sciences & Computer Science
  • Training: CS, physics, hi-tech, biology
  • Interests: Metagenomics; Human Microbiome; Complex networks; Computational systems biology

Emphasis

  • Informatics (from sequence to systems)
  • Algorithms & methods !
  • Concepts !

http://elbo.gs.washington.edu

slide-7
SLIDE 7

Quiz Section

  • Hannah Pliner (TA) will review additional topics including programming and problem solving skills.
  • Material covered in section is required, and will be on the exams.

slide-8
SLIDE 8

Webpage

  • Web site: http://elbo.gs.washington.edu/courses/GS_373_18_sp/
  • Page has links to
    – Lecture notes (but please keep the class interactive)
    – Handouts
    – Many useful resources on:
      • Bioinformatics
      • Python

slide-9
SLIDE 9

Programming

  • Historically, this course required prior programming experience.
  • But … many students that are interested in understanding computational methods do not have any programming experience
  • But … understanding how programs work and how code is written is extremely beneficial for understanding algorithms (including bioinformatic algorithms)
  • But … learning how to program takes lots of time
slide-10
SLIDE 10

Final Comments about Programming

  • Some programming will be taught in quiz section
  • This is not a programming course !!!
  • Really, it’s not !!!
  • But programming is an important component of the course for promoting your understanding of computational methods
  • If you do not have any programming experience, that’s totally ok, but … you will need to catch up.

slide-11
SLIDE 11

Why Python?

  • Python is
    – easy to learn
    – fast enough
    – object-oriented
    – widely used
    – fairly portable
  • C is much faster but much harder to learn and use.
  • Java is somewhat faster but harder to learn and use.
  • Perl is a little slower and a little harder to learn.

slide-12
SLIDE 12

Grading

  • 50% homework
  • 20% midterm exam (in class)
  • 30% final exam
  • Final exam is cumulative
slide-13
SLIDE 13

Homework

  • Posted through Canvas Friday and due the following Friday.
  • Homework is a mix of (mostly) bioinformatics problems and (some) programming.
  • Homework assignments are to be submitted through Canvas.
  • Programming assignments should be implemented in Python.
  • More on homework submission in the quiz section.

slide-14
SLIDE 14

Textbooks (??)

slide-15
SLIDE 15

Why Bioinformatics?

slide-16
SLIDE 16
slide-17
SLIDE 17
slide-18
SLIDE 18

tgcaagcatgcacatgtaccaggagaaaatgaagacaattgtggaaacttttagacttttcatcaactttctagtgtcacttttttgccgctttcct atctgatagttgcgaagactccgaagaaaatgagaatggtgaaggctagcatgctgatgcttcatttctctggagcaattgtggatttctatctaag cttcatttcgatcccagtgctcactttgcccgtttgctcaggtatccattgggattctcgttggtgttaggaattccaacgtctgttcaagtttata tcggagtttcatgtatgggcggtgggtcgctctgttgcaggaggtcttgaatttcttttttgcagtaatcggtgtaactattcttatatttttcgaa aatcgttactttcaactaatcaatggatcttctggtggtagaagttggaagcgaaaactatatgttttgtgtaattacgcgttctctgtaactttta tagctccagcgtttttagacatttttagtgaagaacaaggaagagcgtgcacgtttgaagtaagttaggcaaaccaaactcgctagtgtgatgaaat tttccagaaaattccgagtatccctatcgacgtgccttctcgctcaggatattttgtcctattaattgataacccagtctacagcatttgcgtaagc ctcttggtaattaaagtgtgcccacaaattggtatagtcgttttgttcatattcccttatattgttcaaacgaaatcacattctcgagccacacttc gtttacttcttcacttttttatcgcgatgtgtatccagctgtctattccatttttggtcatcttcttgccggctgcttttatagtgtacgcaattca atatgactattataatcaaggtatgaatattaggccttccacgaaggcgctattctcgcccgcccgtaccacaccaacgctcttctcagttgcacgc ggctatagtagcgcgagggcccgcgtagcgtcggccgccttcatagaaggtctaatgaatatatagtattaagtataatttaaataaagtttcagca gcaaacaacttggcgatggcaacaatggcattccatggggtatgtactacactgaccatgatcatcgtgcatacaccgtatcgtaacgctactttga gcattttacatctgaaatcggaaaaatcggcaaaaacagtgactgattcgaagattgtgtggaaaagtaacaagggagtacagatgacataaactat gcccattgttaccctatattttatttttctctatggtgacaactttatcttaagaaaaacacgcatataaatcaagcagttcctggtcacaggacgt ttacttccacctgtttctaatttcttataaaaccctatatctttcaagttttttccacaagactctgccactctgacacttatgtgctcgactagcc tcagcttctttgcttccgagcaaacatatataaaacttctacatactcttaccatacttgaactttccactcactcttttggagcatacatcatcat tacaaaaacaccgaaaaagttggaatccgtgaaggccagcatgctctatctacaatttgttggagcatttgtcgatgtctatttcagttggttagct atgccgattctagtactacctttatgtgcaggacatgcgattggcttactttcattttttggggttccaagctcgttgcaagtttatgtaggtttct gttcactagcaggttggttcttaagaatgatggagagcgtcacatgtattgtgttgtacagatacaatttgaaagcaatccaatacagcgtgtaaaa gttttgcaattataaacatcattgcagttatggttatgacagtagtgatctttctggaagatcgtcgatatcggttggtgaacggtcaaaagtcaaa 
caaaatgagaaaattgtatcggttactgtttgtcacagctaattatgtttatgctacattgtaccctgctcccatatactttttgcttcccgaccaa gaatatggaagaattttatcgaaaagtgtacgtcttaaaaagtttgaaacatatacaatgaaatgtcttacttttaaagtttgcgtttcagaaaaat ccgtgtattccgaacgaatatttaaaccatcctaatttctttttgcttgatctcgatggaaagtatacttcaatttgtatcctgcttatgttgagtt ctctggtctctcaaatgttttggcaaattggactgattttccgtcagatgctcaaaaatccgtccgtttctcaaaatacgcaccgactacaatacca gtttttaattgcaatgagcttgcaaggcaccattccaatgattatcattgtttttccagcttttttctatgttgtctcaattatgttaaattatcat aatcaaggtattgtatctattcggaacaagacattaaacataattccaacttttcaggtgcaaataacttatcgtttcttatcatttccatgcatgg agttctatcaacgttgacaatgctcatggcacacagaccgtatagacaatcgattgtcaaaatgttgaatctgaatttcaataaggcaggtggtggt gttcaacgtatttggacgctttccagaagaaataattaatgatgaccttggaaaaggctaatcttcacaacaatcaaatcaaataatcataaaagtt tttattgaagaaaaataaactatctgtgcacagaaatccaatgaattgctctatctacaatttgttggagcatttgtcgatgtctatttcagttggt tagctatgccgattctagtactacctttatgtgcaggacatgcgattggcttactttcattttttggggttccaagctcgttgcaagtttatgtagg tttctgttcactagcaggttggttcttaagaatgatggagagcgtcacatgtattgtgttgtacagatacaatttgaaagcaatccaatacagcgtg taaaagttttgcaattataaacatcattgcagttatggttatgacagtagtgatctttctggaagatcgtcgatatcggttggtgaacggtcaaaag

Find the binding sequence: caattatgttaaa
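Finding an exact copy of a short motif in a long sequence is a task a few lines of Python can handle. The sketch below is only an illustration: the function name `find_all` is hypothetical, and `excerpt` is a short stretch copied from the sequence above (line breaks removed), not the full text.

```python
def find_all(sequence, motif):
    """Return the 0-based start index of every occurrence of motif.

    Overlapping occurrences are reported too, because the search
    resumes one position after each hit.
    """
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        start = sequence.find(motif, start + 1)
    return positions


motif = "caattatgttaaa"
# Short excerpt of the slide's sequence (an assumption made to keep the
# example small; the same call works on the full sequence).
excerpt = "ttttctatgttgtctcaattatgttaaattatcataatcaagg"

print(find_all(excerpt, motif))  # -> [15]
```

The same call runs unchanged on a sequence of millions of bases, which is the point of the slide: the search is tedious by eye and trivial for a computer.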

slide-19
SLIDE 19


slide-20
SLIDE 20

Well, computers would definitely help … but why bioinformatics?

slide-21
SLIDE 21

Moore’s law

Computer processing power doubles every ~2 years.

dotted line - 2 year doubling

slide-22
SLIDE 22

Sequencing cost decreasing much faster than computing cost

>2-fold drop per year? (Changing so fast that it is hard to be specific.)
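To see what the gap between the two rates means over time, the arithmetic can be compounded. This is only a back-of-the-envelope sketch: it assumes computing power doubles every 2 years exactly and treats the sequencing trend as exactly 2-fold per year (the slide says "more than", so this understates it). The function name `fold_change` is hypothetical.

```python
def fold_change(years, doubling_time_years):
    """Fold improvement after `years` if the quantity doubles
    (or the cost halves) once every `doubling_time_years`."""
    return 2 ** (years / doubling_time_years)


decade = 10
computing_gain = fold_change(decade, 2.0)   # 2-year doubling (Moore's law)
sequencing_gain = fold_change(decade, 1.0)  # assumed 2-fold drop per year

print(computing_gain)                    # -> 32.0
print(sequencing_gain)                   # -> 1024.0
print(sequencing_gain / computing_gain)  # -> 32.0
```

Under these assumptions, a decade widens the gap between data production and computing power by a factor of about 32.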

slide-23
SLIDE 23

Sequencing data acquisition is constantly accelerating

slide-24
SLIDE 24

~3100 bases (3.1 Kb)

  • Viruses: ~3-1200 Kb
  • Bacteria: ~1-5 Mb
  • Archaea: ~1-5 Mb
  • Fungi: ~10-50 Mb
  • Animals: ~100-5,000 Mb
  • Plants: ~100-10,000 Mb

slide-25
SLIDE 25


GROUP | TOTAL NUMBER OF SPECIES | NUMBER OF SPECIES IDENTIFIED | NUMBER OF SPECIES WITH SEQUENCED GENOMES
BACTERIA, ARCHAEA | 100,000 to 10 million | 12,000 (460 cultured Archaea) | 17,420 bacteria, 362 Archaea
FUNGI | 1.5 million | 100,000 | 356
INSECTS | 10 million | 1 million | 98
PLANTS | 435,000 (land plants and green algae) | 300,000 | 150
TERRESTRIAL VERTEBRATES, FISH | 80,500 (5,500 mammalian) | 62,345 (5,487 mammalian) | 235 (80 mammalian)
MARINE INVERTEBRATES | 6.5 million | 1.3 million | 60
OTHER INVERTEBRATES | 1 million nematode, several thousand Drosophila | 23,000 nematode, 1,300 Drosophila | 17 nematode, 21 Drosophila

The Scientist, April 2014

slide-26
SLIDE 26

“Environmental” Genomics

slide-27
SLIDE 27

A computational bottleneck

slide-28
SLIDE 28

Find the binding sequence: caattatgttaaa … allowing for one mutation and one insertion

slide-29
SLIDE 29

Find the binding sequence: caattatgttaaa … allowing for one mutation and one insertion

caattatgtta-aa
caatt-atgttaaa
catttatgttaaa
cagttatgttaa-a
caattatgt-taaa
caattatgttaaa
caaatatgttaaa
ca-attatggtaaa
caattatattaaa
cagttat-gttaaa
caattatgttaga
cagttatgttaaa
caattatgttaaa
c-aattatgttata
caat-tatgttaaa
caattatgttaat
gaattatgttaaa
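Once mutations and insertions are allowed, an exact string search no longer works, because every variant would need its own search. A common general-purpose alternative is edit-distance dynamic programming that lets a match start anywhere in the text. The sketch below uses that technique as a stand-in for the slide's rule: it accepts any combination of up to `max_edits` substitutions, insertions, or deletions rather than exactly "one mutation and one insertion". The function name and parameters are hypothetical.

```python
def approximate_matches(text, pattern, max_edits):
    """Return (end_position, edits) for every position in text where
    pattern matches some substring ending there with <= max_edits edits.

    end_position is 1-based (number of text characters consumed).
    """
    m = len(pattern)
    # prev[i]: fewest edits to match pattern[:i] against a substring of
    # the text ending at the previous position. Before any text is read,
    # matching pattern[:i] costs i deletions.
    prev = list(range(m + 1))
    hits = []
    for j, ch in enumerate(text, start=1):
        curr = [0] * (m + 1)  # curr[0] = 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            curr[i] = min(prev[i - 1] + cost,  # match or substitution
                          prev[i] + 1,         # extra character in the text
                          curr[i - 1] + 1)     # pattern character missing
        if curr[m] <= max_edits:
            hits.append((j, curr[m]))
        prev = curr
    return hits


pattern = "caattatgttaaa"
# Hypothetical text containing one of the slide's variants (one substitution).
text = "gg" + "catttatgttaaa" + "cc"

print(approximate_matches(text, pattern, 0))  # -> []
print(approximate_matches(text, pattern, 2))
```

Neighbouring end positions can be reported for the same underlying match, since shifting the end by one base costs only one more edit. A real tool would typically keep only the best hit in each cluster.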

slide-30
SLIDE 30

How well can the string GAATTCAGTTA match the string GGATCGA? (what is the best alignment between the two strings?)

G - A A T T C A G T T A
|     |   | |   |     |
G G - A - T C - G - - A
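One simple way to put a number on "how well" two strings match is to score a given alignment column by column. The sketch below assumes an illustrative scheme of +1 per match, -1 per mismatch, and -1 per gap. These values are assumptions, not taken from the slides, and the function name is hypothetical.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-1):
    """Score two equal-length gapped strings column by column.

    '-' marks a gap. The default scores are illustrative assumptions.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


# The alignment shown on the slide: 6 matching columns and 6 gap columns.
top = "G-AATTCAGTTA"
bottom = "GG-A-TC-G--A"
print(score_alignment(top, bottom))  # -> 0
```

Changing the three score parameters changes which alignment counts as "best", which is why the scoring scheme has to be chosen before searching.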

slide-31
SLIDE 31

Informatics Challenges: Examples

  • Sequence comparison:
    – Find the best match (alignment) of a given sequence in a large dataset of sequences
    – Find the best alignment of two sequences
    – Find the best alignment of multiple sequences
  • Motif and gene finding
  • Relationship between sequences
    – Phylogeny
  • Clustering and classification
  • Many many many more …
slide-32
SLIDE 32

Sequence Comparison

slide-33
SLIDE 33

slide-34
SLIDE 34

One of many commonly used tools that depend on sequence alignment.
slide-35
SLIDE 35


slide-36
SLIDE 36

Motivation

  • Why compare/align two protein or DNA sequences?
    – Determine whether they are descended from a common ancestor (homologous).
    – Infer a common function.
    – Locate functional elements (motifs or domains).
    – Infer protein or RNA structure, if the structure of one of the sequences is known.
    – Analyze sequence evolution.

slide-37
SLIDE 37

Sequence Alignment

slide-38
SLIDE 38

Mission: Find the best alignment between two sequences.

slide-39
SLIDE 39

Sequence Alignment

  • Find the best alignment of GAATC and CATAC.

GAATC    GAATC-   GAAT-C   GAAT-C   -GAAT-C   GA-ATC
CATAC    CA-TAC   C-ATAC   CA-TAC   C-A-TAC   CATA-C

(some of a very large number of possibilities)
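To get a feel for how large the number of possibilities is, the alignments can be counted. The sketch below assumes one particular counting convention: every column is either a letter over a letter, a letter over a gap, or a gap over a letter, and alignments that differ only in the order of adjacent gap columns are counted separately. Under that convention the count follows a simple three-way recurrence. The function name is hypothetical.

```python
def count_alignments(m, n):
    """Count the gapped alignments of sequences of lengths m and n.

    The last column of any alignment is letter/letter, letter/gap, or
    gap/letter, so the count for (m, n) is the sum of the counts for
    (m-1, n-1), (m-1, n), and (m, n-1).
    """
    # With one sequence empty there is exactly one alignment (all gaps).
    table = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = (table[i - 1][j - 1]
                           + table[i - 1][j]
                           + table[i][j - 1])
    return table[m][n]


print(count_alignments(5, 5))  # -> 1683 (e.g. GAATC vs CATAC)
print(count_alignments(20, 20))
```

Even for two 5-letter sequences there are already over a thousand candidates, and the count grows exponentially with length, so listing them all is not a workable search strategy.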
slide-40
SLIDE 40

Mission: Find the best alignment between two sequences.

This is an optimization problem! What do we need to solve this problem?

slide-41
SLIDE 41

Mission: Find the best alignment between two sequences.

  • A method for scoring alignments
    – Substitution matrix
    – Gap penalties
  • A “search” algorithm for finding the alignment with the best score
    – Dynamic programming
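The two ingredients can be combined in a short sketch. The code below uses a Needleman-Wunsch-style global alignment as one example of dynamic programming; the slides name only the general idea, so the specific algorithm is an assumption. The scoring is also an assumption: a flat +1/-1 match/mismatch score stands in for a substitution matrix, and a flat -1 per gap stands in for the gap penalties. The function name is hypothetical.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of a and b under simple scoring.

    score[i][j] holds the best score for aligning a[:i] with b[:j].
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per letter.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,  # align two letters
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    return score[-1][-1]


print(global_alignment_score("GAATC", "CATAC"))
print(global_alignment_score("ACGT", "AGT"))  # -> 2 (three matches, one gap)
```

The table has only (len(a) + 1) × (len(b) + 1) cells, so the best score is found without enumerating every possible alignment. Recovering the alignment itself would additionally require tracing back through the table, which is omitted here.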
slide-42
SLIDE 42