Multiple Sequence Alignment for Morphology Induction
Tzvetan Tchoukalov, Christian Monson, Brian Roark

Multiple Sequence Alignment in Biology
prokaryote 16S rRNA, columns 1623-1703 out of 7683

    ---T--C---C-G--------------C----T-G---A-TA-G---AT---G-G-----G-CTC-GCG--T-CTG--A
    ------G---T-G--------------G----T-A---T-AA-G---AT---G-G-----A-CCC-GCG--T-TGG--A
    ------G---T-G--------------G----T-A---T-AG-G---AT---G-G-----A-CCC-GCG--T-CTG--A
    ------G--GC-G--------------G----T-G---A-AG-G---AT---G-A-----G-CCC-GCG--G-CCT--A
    ------C---C-G--------------G----T-A---G-AC-G---AT---G-G-----G-GAT-GCG--T-TCC--A
    ---T--C---C-G--------------C----T-T---T-GA-G---AT---G-G-----C-CTC-GCG--T-CCG--A

Sequences of symbols
Sequences are related, e.g. they serve the same function in different organisms
Why?
    To identify conserved regions
    To identify regions with similar physical structure

Multiple Sequence Alignment for Morphology
English verbs

    d - a n c - e s
    d - a n c - e d
    d - a n c - e -
    d - a n c i n g
    r - u n n i n g
    j - u m p i n g
    j - u m p - e d
    j - u m p - s -
    j - u m p - - -
    l a u g h i n g

Sequences of symbols
Sequences are related, e.g. they serve the same function in different words
Why?
    To learn morphological structure

Language vs. Biology

Differences
                # of Sequences to Align    Length of Sequences    Symbol = Meaning
    Language    Millions                   10's                   No
    Biology     10's                       Millions               Yes

Similarities
    Both involve sequences
    Size of alphabet (less than 100)

What We Did
    1. Progressive alignment, to build a profile
    2. Leave-one-out realignment
    3. Align words to the profile
    4. Segment words, based on the alignment

Step 1) Progressive Alignment
A profile

    1 2 3 4 5 6 7 8
    d - a n c - e s
    d - a n c - e d
    d - a n c - e -
    d - a n c i n g
    r - u n n i n g
    j - u m p i n g
    j - u m p - e d
    j - u m p - s -
    j - u m p - - -
    l a u g h i n g

Column distributions (counts per column, with Laplace smoothing: each count is one more than the number of occurrences)

            1   2   3   4   5   6   7   8
    a       1   2   5   1   1   1   1   1
    c       1   1   1   1   5   1   1   1
    d       5   1   1   1   1   1   1   3
    e       1   1   1   1   1   1   5   1
    g       1   1   1   2   1   1   1   5
    h       1   1   1   1   2   1   1   1
    i       1   1   1   1   1   5   1   1
    j       5   1   1   1   1   1   1   1
    l       2   1   1   1   1   1   1   1
    m       1   1   1   5   1   1   1   1
    n       1   1   1   6   2   1   5   1
    p       1   1   1   1   5   1   1   1
    r       2   1   1   1   1   1   1   1
    s       1   1   1   1   1   1   2   2
    u       1   1   7   1   1   1   1   1
    gap     1  10   1   1   1   7   2   4
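The smoothed column counts for the ten-word profile can be recomputed with a short script. This is a minimal sketch, not the authors' code: it assumes the alphabet is exactly the set of symbols observed in the profile (15 letters plus the gap), and the names `PROFILE`, `smoothed_counts`, and `column_probability` are hypothetical.

```python
GAP = "-"

# The ten aligned English verbs from the slide, one string per row.
PROFILE = [
    "d-anc-es",
    "d-anc-ed",
    "d-anc-e-",
    "d-ancing",
    "r-unning",
    "j-umping",
    "j-ump-ed",
    "j-ump-s-",
    "j-ump---",
    "laughing",
]

def smoothed_counts(rows):
    """Return {symbol: [count per column]} with add-one (Laplace) smoothing."""
    alphabet = sorted(set("".join(rows)))  # assumption: observed symbols only
    width = len(rows[0])
    return {
        sym: [1 + sum(row[col] == sym for row in rows) for col in range(width)]
        for sym in alphabet
    }

def column_probability(table, rows, sym, col):
    """Smoothed probability of `sym` in column `col` (0-based)."""
    total = len(rows) + len(table)  # raw observations plus one per symbol
    return table[sym][col] / total

counts = smoothed_counts(PROFILE)
print(counts["n"])
print(counts[GAP])
```

With this alphabet every column sums to 26 (10 words plus 16 symbols), so a count of 7 for `u` in column 3 corresponds to a probability of 7/26.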

Step 1) Progressive Alignment
    1. Sort words by frequency
    2. Using Levenshtein distance, in the first n=1000 words, find the most similar pair of words, W_1 and W_2
    3. Align W_1 and W_2 (using Levenshtein)
       This is our profile
    4. For i=3 to M=5000, 10000, …
       Find the word W_i most similar to some W_j, j<i
       Align W_i to the profile
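Step 2 of the procedure, finding the most similar pair of words by Levenshtein distance, can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, unit costs for insertion, deletion, and substitution are assumed, and the exhaustive pairwise search is quadratic in the number of words, which is presumably why the slide restricts it to the first n=1000 words.

```python
from itertools import combinations

def levenshtein(a, b):
    """Edit distance with unit insert, delete, and substitute costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]

def most_similar_pair(words):
    """Return the pair of words with the smallest edit distance."""
    return min(combinations(words, 2), key=lambda pair: levenshtein(*pair))

print(most_similar_pair(["dances", "jumping", "danced", "laughing"]))
```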

Step 1) Progressive Alignment
A profile

    1 2 3 4 5 6
    d a n c e s
    d a n c e d
    d a n c e -

New W_i

    d a n c i n g

Align W_i to Profile
Dynamic programming

Profile

    d a n c e d
    d a n c e s
    d a n c e -

The goal: fill a table with one row per character of the new word (d a n c i n g) and one column per profile column, starting from 0.0 in the top-left cell.

Each cell takes the cheapest of three moves. For the first cell (word character d, profile column 1):
    Match: cost = -log P(character), cell value 1.6
    Insert gap into new word: cost = -log P(gap), cell value 8.8
    Insert gap into alignment profile: cost = -log P(unattested), cell value 8.8
The match is cheapest, so the cell holds 1.6.

Completed table (rows: characters of the new word; columns: profile columns 1-6)

                  1      2      3      4      5      6
          0.0    4.4    5.9    7.4    8.9   10.4   11.9
    d     4.4    1.6    6.0    7.5    9.0   10.5   12.0
    a     5.9    3.1    3.1    7.6    9.1   10.6   12.1
    n     7.4    4.6    4.6    4.7    9.1   10.6   12.1
    c     8.8    6.1    6.1    6.2    6.2   10.7   12.2
    i    10.4    7.6    7.6    7.7    7.7    9.2   12.9
    n    11.9    9.1    9.1    9.2    9.2   10.7   12.1
    g    13.4   10.6   10.6   10.7   10.7   12.2   13.6

Resulting alignment

    1 2 3 4 5 6 7
    d a n c - e s
    d a n c - e d
    d a n c - e -
    d a n c i n g
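The dynamic program on these slides can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: column probabilities use add-one smoothing over a caller-supplied alphabet, P(unattested) is assumed to equal the smoothed probability of an unseen symbol, and the function name `align_to_profile` is hypothetical. Because the cost model is assumed, the numbers it produces will not match the table on the slide, and ties between equally cheap alignments are broken arbitrarily.

```python
import math

GAP = "-"

def align_to_profile(word, rows, alphabet):
    """Align `word` to a profile (equal-length strings in `rows`).

    Returns (new_rows, aligned_word); both may be wider than the input
    profile if a new column had to be opened.
    """
    symbols = set(alphabet) | {GAP}
    total = len(rows) + len(symbols)  # add-one smoothing denominator
    width = len(rows[0])

    def cost(col, sym):
        count = 1 + sum(row[col] == sym for row in rows)
        return -math.log(count / total)

    new_col_cost = -math.log(1 / total)  # assumed value of P(unattested)

    n = len(word)
    D = [[math.inf] * (width + 1) for _ in range(n + 1)]
    back = [[None] * (width + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(n + 1):
        for j in range(width + 1):
            moves = []
            if i > 0 and j > 0:  # match word character with profile column
                moves.append((D[i - 1][j - 1] + cost(j - 1, word[i - 1]), "match"))
            if j > 0:            # insert gap into the new word
                moves.append((D[i][j - 1] + cost(j - 1, GAP), "gap_in_word"))
            if i > 0:            # insert gap into the profile (new column)
                moves.append((D[i - 1][j] + new_col_cost, "new_column"))
            for value, move in moves:
                if value < D[i][j]:
                    D[i][j], back[i][j] = value, move

    # Trace back from the bottom-right cell to recover the alignment.
    i, j, aligned, columns = n, width, [], []
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            aligned.append(word[i - 1])
            columns.append([row[j - 1] for row in rows])
            i, j = i - 1, j - 1
        elif move == "gap_in_word":
            aligned.append(GAP)
            columns.append([row[j - 1] for row in rows])
            j -= 1
        else:
            aligned.append(word[i - 1])
            columns.append([GAP] * len(rows))
            i -= 1
    aligned.reverse()
    columns.reverse()
    new_rows = ["".join(col[k] for col in columns) for k in range(len(rows))]
    return new_rows, "".join(aligned)

profile = ["dances", "danced", "dance-"]
new_rows, aligned = align_to_profile("dancing", profile, "acdegins")
print(new_rows, aligned)
```

Aligning "dancing" forces one new column, since the word has seven characters and the profile has six columns. Which column is opened depends on the cost model and on tie-breaking.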

Steps 2 & 3
Step 2) Leave-one-out realignment
    Improves the greedy alignment
Step 3) Align remaining words
    Profile is frozen
    Gaps inserted in word only

Step 4) Segmentation
6 Hungarian words from a real alignment

    -----k----ö---z-----ö-------t-----------t-------
    -----k----ö---z-----ö-------t-----------t----i--
    -----k----ö---z-----ö-------t-----------t----i-t
    -----k----ö---z-----ö-------t-----------t----e--
    -----k----ö---z-----ö-------t-----------t----e-m
    -----k----ö---t-----ö-------t-----------t----e-m

Where are the morpheme boundaries?
    Gaps do not correspond to morpheme boundaries
    Biologists don't segment!!

Step 4) Segmentation
Mimic the ParaMor-Morfessor Union!
    Take ParaMor-Morfessor Union as THE TRUTH
    Greedy search
        For each column, c, in profile
            Segment all words at c
            Score against Union system
        Keep the best scoring segmentation column
        Repeat until no column improves score
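The greedy column search can be illustrated with a toy version. This is a sketch under several assumptions that the slides do not confirm: a cut "at column c" is taken to mean a boundary placed before column c, the score is boundary-level F1 against the reference segmentation, and the reference is supplied as a set of character offsets per word. All names are hypothetical.

```python
GAP = "-"

def boundaries(aligned, columns):
    """Character offsets in the ungapped word implied by cutting before each column."""
    length = sum(ch != GAP for ch in aligned)
    cuts = set()
    for c in columns:
        pos = sum(ch != GAP for ch in aligned[:c])
        if 0 < pos < length:  # ignore cuts at the very start or end of the word
            cuts.add(pos)
    return cuts

def boundary_f1(aligned_words, reference, columns):
    """Boundary F1 of the segmentation induced by `columns` against `reference`."""
    tp = fp = fn = 0
    for aligned, ref in zip(aligned_words, reference):
        pred = boundaries(aligned, columns)
        tp += len(pred & ref)
        fp += len(pred - ref)
        fn += len(ref - pred)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def greedy_columns(aligned_words, reference):
    """Repeatedly add the column that most improves the score; stop when none does."""
    width = len(aligned_words[0])
    chosen, best = set(), 0.0
    while True:
        scored = [
            (boundary_f1(aligned_words, reference, chosen | {c}), c)
            for c in range(1, width) if c not in chosen
        ]
        if not scored:
            break
        score, col = max(scored)
        if score <= best:
            break
        chosen.add(col)
        best = score
    return sorted(chosen), best

words = ["d-anc-es", "d-anc-ed", "d-ancing", "j-umping", "j-ump-ed", "j-ump-s-"]
# Toy reference: one boundary after the fourth character of every word.
reference = [{4}] * len(words)
print(greedy_columns(words, reference))
```

On this toy input the search keeps a single column (0-based index 5), which separates the stems danc and jump from their suffixes.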

Turkish Linguistic Competition Results

    AUTHOR                 METHOD                     PREC.     REC.      F1
    Monson et al.          ParaMor-Morfessor Mimic    48.07%    60.39%    53.53%
    Monson et al.          ParaMor-Morfessor Union    47.25%    60.01%    52.88%
    Monson et al.          ParaMor Mimic              49.54%    54.77%    52.02%
    Lavallée & Langlais    RALI-COF                   48.43%    44.54%    46.40%
    -                      Morfessor CatMAP           79.38%    31.88%    45.49%
    Spiegler et al.        PROMODES 2                 35.36%    58.70%    44.14%
    Spiegler et al.        PROMODES                   32.22%    66.42%    43.39%
    Bernhard               MorphoNet                  61.75%    30.90%    41.19%
    Can & Manandhar        2                          41.39%    38.13%    39.70%
    Spiegler et al.        PROMODES committee         55.30%    28.35%    37.48%
    Golénia et al.         UNGRADE                    46.67%    30.16%    36.64%
    Tchoukalov et al.      MetaMorph                  39.14%    29.45%    33.61%
    Virpioja & Kohonen     Allomorfessor              85.89%    19.53%    31.82%
    -                      Morfessor Baseline         89.68%    17.78%    29.67%
    Lavallée & Langlais    RALI-ANA                   69.52%    12.85%    21.69%
    -                      letters                    8.66%     99.13%    15.93%
    Can & Manandhar        1                          73.03%    8.89%     15.86%
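The F1 column is consistent with the harmonic mean of precision and recall. A quick check, assuming that standard definition (the function name is hypothetical, and small differences in the last digit can arise because the published precision and recall are already rounded):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# ParaMor-Morfessor Mimic and MetaMorph rows from the table.
print(round(f1(48.07, 60.39), 2))
print(round(f1(39.14, 29.45), 2))
```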

Performance Before Profile is Frozen
[Figure: F1 (axis from 0 to 60) of Multiple Sequence Alignment compared with the ParaMor-Morfessor Union, for M = 5K, 10K, and 20K words]
