What is Bioinformatics? DNA determines function (?) DNA, PROTEIN - PDF document

Knowledge Mining from Biological Data
Jaak Vilo
Winter School of Theoretical Computer Science (Teoreetilise arvutiteaduse talvekool), Arula, 3 February 2003

Goals
• Introduce the problems of analysing (mining) biological data
• Does bioinformatics have anything to offer theoretical computer science?
• Examples of mining biological data

What is Bioinformatics?
• Bioinformatics is the use of information technology to store and analyze genetic information
• Bioinformatic researchers develop and apply computing tools to extract the secrets of the life and death of organisms from the genetic blueprints and molecular structure stored in digital collections

DNA determines function (?)
DNA → PROTEIN → STRUCTURE → Function?
• DNA: 4 nucleotides; GenBank / EMBL Bank
• Protein: 20+ amino acids (3 nt → 1 AA); SwissProt/TrEMBL
• Structure: PDB/Molecular Structure Database

A Simple Gene
• Upstream/promoter, Downstream
• DNA: ATCGAAAT
       TAGCTTTA

David S. Goodsell, http://www.scripps.edu/pub/goodsell/
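The two DNA strands on the "A Simple Gene" slide and the "3 nt, 1 AA" note can be illustrated with a minimal Python sketch. The function names `complement` and `codons` are hypothetical helpers introduced here for illustration; they are not part of the lecture material.

```python
# Watson-Crick base pairing: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Return the base-paired partner of each nucleotide, position by position."""
    return strand.translate(COMPLEMENT)


def codons(strand: str) -> list[str]:
    """Split a strand into consecutive 3-nucleotide codons, dropping an incomplete tail."""
    return [strand[i:i + 3] for i in range(0, len(strand) - 2, 3)]


if __name__ == "__main__":
    print(complement("ATCGAAAT"))  # the second strand shown on the slide
    print(codons("ATCGAAAT"))      # complete codons only
    print(4 ** 3)                  # number of distinct codons over 4 nucleotides
```

With 4 nucleotides there are 4^3 = 64 possible codons, which is why three nucleotides per amino acid are enough to encode 20+ amino acids.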

Research questions
• What is the function of each gene and gene product (protein, RNA)?
• How exactly do biological processes work?
• Transcription and translation, and their regulation
• Which proteins interact, and why
• Metabolic pathways and networks
• Signal transduction in the cell and in the organism

Where is computer science needed?
• Support for the data collection process
• Deriving generalized data from raw data
• Data management
• Data analysis
• Knowledge representation
• Modelling and simulation

DNA (shotgun) sequencing
• Cut randomly
• Sequence pieces, 5-10x coverage
• Assemble: shortest supersequence

Excerpts from curricula descriptions
• New experimental techniques generating mass data are being developed.
• Every new biological research idea requires a specifically tailored, algorithmic approach, raising an abundance of challenging questions in algorithm design, analysis and implementation.
• The demands of biosequence analysis require advances in many classical fields of algorithmics. Our recent work concentrates on suffix trees and related index structures, on efficient pattern matching in strings and trees, and on a new, algebraic style of dynamic programming.

Data Mining
• Data Mining and Knowledge Discovery from Databases (KDD) are new areas of computer science that have much in common with statistics
• The goal is the analysis of very large data collections
• Search for (simple) rules and relationships that hold (at least in part) in the data
• Statistics, machine learning, database theory, + domain knowledge

Mining biological data
1. Which biological data exist and how could they be used
2. Data collection, cleaning, preparation and integration
3. Analysis on the computer (the main algorithm)
4. Presentation of results, identification of rules, visualization
5. Interpretation of results
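The assembly step of shotgun sequencing, phrased on the slide as finding a shortest supersequence of the sequenced pieces, can be sketched with a simple greedy heuristic: repeatedly merge the two reads with the largest suffix/prefix overlap. This is an illustrative approximation under the assumption of error-free reads from a single strand, not the method used by any particular assembler, and it does not guarantee the shortest result. The names `overlap` and `greedy_assemble` are hypothetical.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str]) -> str:
    """Greedily merge reads into one string that contains every read."""
    # Drop duplicates and reads that are fully contained in another read.
    reads = list(dict.fromkeys(reads))
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while len(reads) > 1:
        best_k, best_i, best_j = -1, 0, 1
        # Find the ordered pair (i, j) with the largest overlap.
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads[0] if reads else ""


if __name__ == "__main__":
    pieces = ["ATGCAG", "CAGATT", "ATTACA"]
    print(greedy_assemble(pieces))
```

Real data adds sequencing errors, repeats and reads from both strands, which is where the 5-10x coverage mentioned on the slide becomes important.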

A Challenge Problem (P. Pevzner, 2000)
• Insert into every sequence a 15-mer where 4 positions have been randomly changed
• Challenge: discover what was the original sequence inserted
• Why is this hard? There are 4^15 possible 15-mers in total, and two inserted variants can differ from each other in as many as 8 positions out of 15

Sequences shown alongside the slide:
TGTTCTTTCTTCTTTCATACATCCTTTTCCTTTTTTTCC
TTCTCCTTTCATTTCCTGACTTTTAATATAGGCTTACCA
TCCTTCTTCTCTTCAATAACCTTCTTACATTGCTTCTTC
TTCGATTGCTTCAAAGTAGTTCGTGAATCATCCTTCAAT
GCCTCAGCACCTTCAGCACTTGCACTTCATTCTCTGGAA
GTGCTGCACCTGCGCTGTCTTGCTAATGGATTTGGAGTT
GGCGTGGCACTGATTTCTTCGACATGGGCGGCGTCTTCT
TCGAATTCCATCAGTCCTCATAGTTCTGTTGGTTCTTTT
CTCTGATGATCGTCATCTTTCACTGATCTGATGTTCCTG
TGCCCTATCTATATCATCTCAAAGTTCACCTTTGCCACT
TTCCAAGATCTCTCATTCATAATGGGCTTAAAGCCGTAC
TTTTTTCACTCGATGAGCTATAAGAGTTTTCCACTTTTA
GATCGTGGCTGGGCTTATATTACGGTGTGATGAGGGCGC
TTGAAAAGATTTTTTCATCTCACAAGCGACGAGGGCCCG
AGTGTTTGAAGCTAGATGCAGTAGGTGCAAGCGTAGAGT
CTTAGAAGATAAAGTAGTGAATTACAATAGATTCGATAC

Patterns: AT

Searching text with a pattern
• Does ATGCAGA occur in the text?
• Does ATGCAGA occur in the text approximately (ATCCAGA, ATGCGA)?
• Does A[TA].C[CG].{3,7} A[TA].C[CG] occur in the text?
• How long does it take to answer the queries above?
• How should the text be preprocessed and indexed?
• Suffix trees and suffix arrays

Pattern Discovery
1. Choose the language (formalism) to represent the patterns
2. Choose the rating for patterns, to tell that one pattern is "better" than another
3. Design an algorithm that finds the best patterns from the pattern class, fast.

Algorithms
• Technical: enumerate all patterns that occur in the given strings, together with their frequencies
• Biological: collect the promoters of similarly expressed genes and search them for patterns that could be favourable for transcription factor binding
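The three pattern-search questions on this slide (exact, approximate, and character-class queries) can be sketched in Python as follows. This is a naive illustration that scans the whole text for every query; it does not use the suffix trees or arrays the slide points to for indexing. The function names are hypothetical, and the space inside the slide's pattern `A[TA].C[CG].{3,7} A[TA].C[CG]` is assumed here to be a layout artifact and is left out of the regular expression.

```python
import re


def occurs_exactly(pattern: str, text: str) -> bool:
    """Exact query: does the pattern occur as a substring?"""
    return pattern in text


def min_edit_distance_in_text(pattern: str, text: str) -> int:
    """Smallest edit distance (substitutions, insertions, deletions)
    between the pattern and any substring of the text."""
    # prev[i] = cost of matching pattern[:i] against a substring ending
    # at the previous text position; before any text, that is i deletions.
    prev = list(range(len(pattern) + 1))
    best = prev[-1]
    for t in text:
        curr = [0]  # a match may start at any text position at no cost
        for i, p in enumerate(pattern, start=1):
            curr.append(min(prev[i] + 1,              # extra text character
                            curr[i - 1] + 1,          # missing text character
                            prev[i - 1] + (p != t)))  # match or substitution
        prev = curr
        best = min(best, prev[-1])
    return best


# Assumption: the space in the slide's pattern is not part of the pattern.
CLASS_PATTERN = re.compile(r"A[TA].C[CG].{3,7}A[TA].C[CG]")


def matches_class_pattern(text: str) -> bool:
    """Character-class query with a variable-length gap."""
    return CLASS_PATTERN.search(text) is not None


if __name__ == "__main__":
    print(occurs_exactly("ATGCAGA", "CCATGCAGATT"))
    print(min_edit_distance_in_text("ATGCAGA", "GGATCCAGATT"))
    print(min_edit_distance_in_text("ATGCAGA", "TTATGCGACC"))
    print(matches_class_pattern("ATGCCTTTTAAGCG"))
```

Each query here costs time proportional to the text length (times the pattern length for the approximate case), which is what motivates the slide's question about preprocessing and indexing.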

Patterns: AT

Patterns: WHAT ([AT][ACT]AT)

Bioinformatics algorithms
• Not only a question of computing resources
• Ways of combining data
• Is the result obtained biologically relevant?
• Does it advance our understanding of biological phenomena?

Example methods
• Gene expression analysis
• Prediction of transcription factor binding sites
• Analysis of protein-protein interactions
• Analysis of the relationships between G-protein-coupled receptors and G-proteins
• Analysis of scientific texts

Analysis of biological samples with microarrays
[Figure: culture 1 and culture 2 → mRNA → cDNA → hybridise → LASER, scanning]

From microarray images to gene expression data
• Raw data: array scans
• Intermediate data: image quantifications (spot/image quantitations per spot)
• Final data: gene expression levels (samples × genes), linked to the Gene DB
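The step from spot quantitations to per-gene expression levels can be illustrated with a small Python sketch. It assumes a two-sample design (culture 1 and culture 2 hybridised to the same array) and summarises each gene as the mean log2 ratio of its spots; the log-ratio summary, the constant background value, and the function name `expression_levels` are assumptions made for illustration and are not taken from the slides.

```python
import math
from collections import defaultdict


def expression_levels(spots, background=0.0):
    """Turn spot quantitations into one value per gene.

    spots: iterable of (gene, intensity_culture1, intensity_culture2).
    Returns {gene: mean log2(culture2 / culture1)} over the gene's usable spots.
    """
    per_gene = defaultdict(list)
    for gene, c1, c2 in spots:
        c1, c2 = c1 - background, c2 - background
        if c1 > 0 and c2 > 0:  # skip spots where the ratio is undefined
            per_gene[gene].append(math.log2(c2 / c1))
    return {gene: sum(vals) / len(vals) for gene, vals in per_gene.items()}


if __name__ == "__main__":
    quantitations = [
        ("g1", 100, 400),  # two replicate spots for the same gene
        ("g1", 100, 100),
        ("g2", 200, 100),
        ("g3", 0, 50),     # unusable spot: no signal in culture 1
    ]
    print(expression_levels(quantitations))
```

A positive value means the gene is more abundant in culture 2, a negative value means it is more abundant in culture 1.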

Eisen et al., PNAS 98; Spellman et al., Mol Biol Cell 98

Tumor classification
Golub et al., Science, Oct 15th 1999
• 38 samples of acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL)
• 6817 genes
• Classifier built on the 50 best-correlated genes
• Tested on 34 new samples, 29 of them predicted accurately

Simplified ArrayExpress model
[Figure labels: MIAMExpress, MAGE-ML, Expression Profiler, Internet]

Expression Profiler: EPCLUST

Getting data into EP
• URL: http://host/data.txt
• http://host/q.cgi?d=D&t=ratio
[Figure labels: DATA, SELECT/FILTER, FOLDER, ANALYZE, "CLUSTER"; EP, Internet; GeneOntology, Pathways, ArrayExpress, Databases, URLMAP, SPEXS, other tools]
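The idea of building a classifier from the genes best correlated with the class label can be sketched as below. This is a simplified illustration only: it ranks genes by absolute Pearson correlation with a 0/1 label and lets each selected gene cast one unweighted vote, which is an assumption of this sketch and not a reproduction of the scoring or voting scheme of the published study. All names and the toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no class information
    return sxy / math.sqrt(sxx * syy)


def top_correlated_genes(expression, labels, k):
    """expression: {gene: [value per sample]}; labels: [0 or 1 per sample]."""
    ranked = sorted(expression,
                    key=lambda g: abs(pearson(expression[g], labels)),
                    reverse=True)
    return ranked[:k]


def predict(expression, labels, genes, sample):
    """Each selected gene votes for the class whose mean is nearer to the sample."""
    votes = {0: 0, 1: 0}
    for g in genes:
        means = {}
        for c in (0, 1):
            vals = [v for v, l in zip(expression[g], labels) if l == c]
            means[c] = sum(vals) / len(vals)
        nearest = min((0, 1), key=lambda c: abs(sample[g] - means[c]))
        votes[nearest] += 1
    return max(votes, key=votes.get)


if __name__ == "__main__":
    labels = [0, 0, 1, 1]  # toy encoding: 0 = ALL, 1 = AML
    expression = {"g1": [1, 1, 5, 5], "g2": [3, 2, 3, 2], "g3": [4, 5, 1, 0]}
    chosen = top_correlated_genes(expression, labels, k=2)
    print(chosen)
    print(predict(expression, labels, chosen, {"g1": 4.5, "g2": 2, "g3": 0.5}))
```

As on the slide, such a classifier would be evaluated on samples that were not used for selecting the genes.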

Clustering – it's "easy" (for humans)

Unsupervised vs. Supervised
• Unsupervised: find groups inherent to data (clustering)
• Supervised: find a "classifier" for known classes

Clustering cont…
Distance measures: which two profiles are similar to each other?
• Euclidean, Manhattan, etc.
• Correlation, angle, etc.
• Rank correlation
• Time warping

Cluster of co-expressed genes, pattern discovery in regulatory regions
Genome Research 1998; ISMB (Intelligent Systems in Mol. Biol.) 2000

Model of RNA Polymerase II Transcription Initiation Machinery. The machinery depicted here encompasses over 85 polypeptides in ten (sub)complexes: core RNA polymerase II (RNAPII) consists of 12 subunits; TFIIH, 9 subunits; TFIIE, 2 subunits; TFIIF, 3 subunits; TFIIB, 1 subunit; TFIID, 14 subunits; core SRB/mediator, more than 16 subunits; Swi/Snf complex, 11 subunits; Srb10 kinase complex, 4 subunits; and SAGA, 13 subunits.

F.C.P. Holstege, E.G. Jennings, J.J. Wyrick, Tong Ihn Lee, C.J. Hengartner, M.R. Green, T.R. Golub, E.S. Lander, and R.A. Young. Dissecting the Regulatory Circuitry of a Eukaryotic Genome. Cell 95: 717-728 (1998)
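Several of the distance measures listed for comparing expression profiles can be written down in a few lines of Python. This is a minimal sketch using the standard textbook definitions; time warping is left out, and turning a correlation r into a distance as 1 - r is one common convention assumed here, not something the slides specify. All function names are hypothetical.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant profile: correlation is undefined, treat as 0
    return sxy / math.sqrt(sxx * syy)


def correlation_distance(x, y):
    return 1.0 - pearson(x, y)


def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def rank_correlation_distance(x, y):
    """1 minus the rank correlation (Pearson correlation of the ranks)."""
    return 1.0 - pearson(ranks(x), ranks(y))


def angle(x, y):
    """Angle in radians between two non-zero profiles seen as vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return math.acos(max(-1.0, min(1.0, dot / norm)))


if __name__ == "__main__":
    a, b = [1, 2, 3, 4], [10, 20, 30, 40]
    print(euclidean(a, b), manhattan(a, b))
    print(correlation_distance(a, b), angle(a, b))
```

In the usage example the two profiles have the same shape at different scales: they are far apart by Euclidean and Manhattan distance but essentially identical by correlation and angle, which is why the choice of measure matters for clustering.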
