CSCI 490 Bioinformatics Part I: Introduction to Bioinformatics and - PowerPoint PPT Presentation

CSCI 490 Bioinformatics Part I: Introduction to Bioinformatics and Molecular Biology

Course description • A survey of algorithms and methods in bioinformatics, approached from a computational viewpoint. • Prerequisites: – Programming experience – Strong background in algorithms and data structures – Basic understanding of statistics and probability – Appetite to learn some biology • For other information, check the course website

Why bioinformatics • The advance of biomedical experimental technology has resulted in a huge amount of data – The human genome is “finished” – Even if it were, that’s only the beginning… • The bottleneck is how to integrate and analyze the data – Noisy – Diverse

Growth of GenBank vs Moore’s law

Genome annotations • The process of identifying the locations of genes and coding regions in a genome and determining what those genes do. • Finding the structural elements at each genome location and attaching their related functions.

Genome annotations • Gene structure prediction – Identifying elements (introns/exons, coding region, start codon, stop codon) in the genome • Gene function prediction – Attaching biological information to these elements, e.g. which protein an exon codes for

Genome annotations (figure; source: Meyer, Trends and Tools in Bioinfo and Compt Bio, 2006)

What is bioinformatics • National Institutes of Health (NIH): – Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

What is bioinformatics • National Center for Biotechnology Information (NCBI): – The field of science in which biology, computer science, and information technology merge to form a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned.

Bioinformatics and its related fields (diagram): Biology, Molecular Biology, Chemistry, Medicine, Mathematics, Physics, Statistics, Computer Science, Informatics

Computer Scientists vs Biologists (courtesy Serafim Batzoglou, Stanford)

Biologists vs computer scientists • (almost) Everything is true or false in computer science • (almost) Nothing is ever true or false in Biology

Biologists vs computer scientists • Biologists seek to understand the complicated, messy natural world • Computer scientists strive to build their own clean and organized virtual world

Biologists vs computer scientists • Computer scientists are obsessed with being the first to invent or prove something • Biologists are obsessed with being the first to discover something

Some examples of central role of CS in bioinformatics

1. Genome sequencing AGTAGCACAGA CTACGACGAGA CGATCGTGCGA GCGACGGCGTA GTGTGCTGTAC TGTCGTGTGTG TGTACTCTCCT (whole genome: 3 x 10^9 nucleotides; each sequenced fragment: ~500 nucleotides) Genome sequencing is figuring out the order of DNA nucleotides, or bases, in a genome, that is, the order of As, Cs, Gs, and Ts that make up an organism's DNA. The human genome is made up of over 3 billion of these genetic letters.

1. Genome sequencing AGTAGCACAGA CTACGACGAGA CGATCGTGCGA GCGACGGCGTA GTGTGCTGTAC TGTCGTGTGTG TGTACTCTCCT (3 x 10^9 nucleotides) • A big puzzle: ~60 million pieces • Computational fragment assembly – Introduced ~1980 – 1995: assemble up to 1,000,000 long DNA pieces – 2000: assemble whole human genome
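The "big puzzle" view of fragment assembly can be illustrated with a toy greedy overlap merge. This is only a minimal sketch under simplifying assumptions (error-free reads, a single strand, no repeats); the function names `overlap` and `greedy_assemble` and the `min_len` parameter are hypothetical, and real assemblers use far more sophisticated methods.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` characters exists.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the largest overlap.

    Stops when no pair overlaps by at least `min_len` characters and
    returns the remaining contigs.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs
        # Glue the two fragments together, writing the shared part once.
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


# Three overlapping pieces cut from the start of the slide's sequence.
print(greedy_assemble(["AGTAGCAC", "GCACAGAC", "AGACTACG"]))
```

The quadratic pair search is acceptable for a handful of reads but would not scale to the ~60 million pieces mentioned on the slide, which is part of why assembly is a hard computational problem.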

2. Gene Finding • Where are the genes? • In humans: ~22,000 genes, ~1.5% of human DNA

2. Gene Finding Even in a familiar language it is difficult to pick out the meaning of the passage: The quick brown fox jumped over the lazy dog. The dog lay quietly dreaming of dinner. And the genome is "written" in a far less familiar language, multiplying the difficulties involved in reading it.

2. Gene Finding • Gene structure (diagram, 5’ to 3’): Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, with splice sites at the exon/intron boundaries • Start codon: ATG • Stop codon: TAG/TGA/TAA • Method: Hidden Markov Models (well studied for many years in speech recognition)
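The start and stop codons listed above already allow a very naive scan for open reading frames (ORFs). The sketch below is not the Hidden Markov Model approach named on the slide: it ignores introns, splice sites, and the reverse strand, and the names `find_orfs` and `min_codons` are hypothetical. It is meant only to show how the codon signals translate into string processing.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> list[tuple[int, int]]:
    """Return (start, end) positions of ORFs on the forward strand.

    `start` is the index of the A in ATG; `end` is exclusive and includes
    the stop codon. `min_codons` counts the start and stop codons too.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):            # three possible reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i             # first start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None          # the stop codon closes the ORF
    return sorted(orfs)


seq = "CCATGAAATAGGG"
for start, end in find_orfs(seq):
    print(start, end, seq[start:end])
```

In eukaryotic genomes such a scan is far from sufficient, because coding regions are interrupted by introns; this is one reason probabilistic models are used instead.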

3. Protein Folding • The amino-acid sequence of a protein determines the 3D fold • The 3D fold of a protein determines its function • Can we predict the 3D fold of a protein given its amino-acid sequence? – Holy grail of computational biology: a 40-year-old problem – Molecular dynamics, computational geometry, machine learning

4. Sequence Comparison: Alignment
Input sequences:
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
Alignment (| = match, x = mismatch, - = gap):
seq 1: -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
match:  || |||||||  ||||x|  ||x|||  |||||
seq 2: TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
• Sequence alignment: introduced ~1970 • BLAST: 1990, one of the most cited papers in history • Still a very active area of research • Searching a query against a database (DB): efficient string matching algorithms, fast database index techniques
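Pairwise alignment is classically computed by dynamic programming. The sketch below computes a global alignment score in the style of the Needleman-Wunsch algorithm; it is not BLAST, which is a faster heuristic. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration, not values taken from the slides.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -1) -> int:
    """Best global alignment score of `a` and `b` with a linear gap penalty.

    Keeps only two rows of the dynamic-programming table, so memory use is
    proportional to len(b) while time is proportional to len(a) * len(b).
    """
    # Row 0: aligning the empty prefix of `a` against prefixes of `b`.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of `a` against empty string
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in `b`
            left = curr[j - 1] + gap  # gap in `a`
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGT", "AGT"))
```

Because the running time grows with the product of the two lengths, this exact method is too slow for searching large databases, which motivates the indexing and string matching techniques mentioned above.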

Lipman & Pearson, 1985: "…, comparison of a 200-amino-acid sequence to the 500,000 residues in the National Biomedical Research Foundation library would take less than 2 minutes on a minicomputer, and less than 10 minutes on a microcomputer (IBM PC)." Database size today (2007): 10^12 residues (a 2-million-fold increase). BLAST search: 1.5 minutes

5. Microarray data analysis Example: Clinical prediction of leukemia type • 2 types of leukemia – Acute lymphoid (ALL) – Acute myeloid (AML) • Different treatments & outcomes • Predict type before treatment? • Bone marrow samples: ALL vs AML • Measure the expression level of each gene

Some goals of biology for the next 50 years • List all molecular parts that build an organism – Genes, proteins, other functional parts • Understand the function of each part • Understand how parts interact physically and functionally • Study how function has evolved across all species • Find genetic defects that cause diseases • Design drugs rationally • Sequence the genome of every human, use it for personalized medicine • Bioinformatics is an essential component for all the goals above

A short introduction to molecular biology

Life • Two main categories: – Prokaryotes (e.g. bacteria) • Unicellular • No nucleus – Eukaryotes (e.g. fungi, plants, animals) • Unicellular or multicellular • Have a nucleus

Prokaryote vs Eukaryote • Eukaryotes have many membrane-bound compartments inside the cell – Different biological processes occur at different cellular locations

Organism, Organ, Cell (figure)

Chemical contents of cell • Water • Macromolecules (polymers): “strings” made by linking monomers from a specified set (alphabet) – Protein – DNA – RNA – … • Small molecules – Sugar – Ions (Na+, K+, Ca2+, Cl-, …) – Hormone – …

DNA • DNA: forms the genetic material of all living organisms – Can be replicated and passed to descendants – Contains information to produce proteins • To computer scientists, DNA is a string made from the alphabet {A, C, G, T} – e.g. ACAGAACGTAGTGCCGTGAGCG • Each letter is a nucleotide • Length varies from hundreds to billions

RNA • Historically thought to be mainly an information carrier – DNA => RNA => Protein – Very important new roles have been found recently • To computer scientists, RNA is a string made from alphabet {A, C, G, U} – e.g. ACAGAACGUAGUGCCGUGAGCG • Each letter is a nucleotide • Length varies from tens to thousands
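The string view of DNA and RNA can be made concrete with a few lines of code. The sketch below treats the DNA string as the coding strand, so that the corresponding RNA is obtained by replacing T with U; the function name `transcribe` is hypothetical. Applied to the DNA example above, it reproduces the RNA example.

```python
DNA_ALPHABET = set("ACGT")


def transcribe(dna: str) -> str:
    """Return the RNA string for a DNA coding-strand string (T becomes U).

    Raises ValueError if the input contains letters outside {A, C, G, T}.
    """
    dna = dna.upper()
    unknown = set(dna) - DNA_ALPHABET
    if unknown:
        raise ValueError(
            f"not a DNA string, unexpected letters: {sorted(unknown)}")
    return dna.replace("T", "U")


print(transcribe("ACAGAACGTAGTGCCGTGAGCG"))
```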

Protein • Protein: the actual “worker” for almost all processes in the cell – Enzymes: speed up reactions – Signaling: information transduction – Structural support – Production of other macromolecules – Transport • To computer scientists, protein is a string made from an alphabet of 20 letters – E.g. MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGP • Each letter is called an amino acid • Length varies from tens to thousands

DNA/RNA zoom-in • Commonly referred to as Nucleic Acid • DNA: Deoxyribonucleic acid • RNA: Ribonucleic acid • Found mainly in the nucleus of a cell (hence “nucleic”) • Contain phosphoric acid as a component (hence “acid”) • They are made up of a string of nucleotides

Nucleotides • A nucleotide has 3 components – Sugar ring (ribose in RNA, deoxyribose in DNA) – Phosphoric acid – Nitrogen base • Adenine (A) • Guanine (G) • Cytosine (C) • Thymine (T) in DNA and Uracil (U) in RNA

Units of RNA: ribo-nucleotide • A ribonucleotide has 3 components – Sugar - Ribose – Phosphate group – Nitrogen base • Adenine (A) • Guanine (G) • Cytosine (C) • Uracil (U)

Units of DNA: deoxy-ribo-nucleotide • A deoxyribonucleotide has 3 components – Sugar – Deoxy-ribose – Phosphate group – Nitrogen base • Adenine (A) • Guanine (G) • Cytosine (C) • Thymine (T)

Polymerization: Nucleotides => nucleic acids (diagram: a chain of repeating nucleotide units, each consisting of a phosphate, a sugar, and a nitrogen base)
