Discovering Most Classificatory Patterns - PowerPoint PPT Presentation



SLIDE 1

Discovering Most Classificatory Patterns for Very Expressive Pattern Classes

Masayuki Takeda 1,2, Shunsuke Inenaga 3, Hideo Bannai 4, Ayumi Shinohara 1,2, and Setsuo Arikawa 1

1 Department of Informatics, Kyushu University
2 Japan Science and Technology Corporation
3 Department of Computer Science, University of Helsinki
4 Human Genome Center, University of Tokyo

SLIDE 2

Background and Motivation

  • Distinguish two given string datasets
    - to obtain a good rule and/or useful knowledge
  • Upgrade the BONSAI system
    - so that it can deal with more expressive pattern classes

SLIDE 3

Machine Discovery System BONSAI [Shimozono et al., 1994]

[Diagram: the BONSAI pipeline. Positive and negative example datasets (POS, NEG; pos, neg) are transformed by an Indexing step into binary-encoded sets I(POS), I(NEG) (e.g. the alphabet ABCDEFGHIJKLMNOPQRSTUVWXY indexed as 0011001010001110000011010); a Decision Tree Generator builds a tree over patterns such as x11y, x101y, x111y with P/N leaves; a Combinatorial Optimization Algorithm searches over indexings, guided by an Accuracy Evaluation of the resulting tree.]

SLIDE 4

Pattern Discovery from Datasets

Find a pattern string that occurs in all strings of A and in no strings of B.

A: AKEBONO MUSASHIMARU / CONTRIBUTIONS OF AI / BEYOND MESSY LEARNING / BASED ON LOCAL SEARCH ALGORITHMS / BOOLEAN CLASSIFICATION / SYMBOLIC TRANSFORMATION / BACON SANDWICH / PUBLICATION OF DISSERTATION

B: WAKANOHANA TAKANOHANA / CONTRIBUTIONS OF UN / TRADITIONAL APPROACHES / GENETIC ALGORITHMS / PROBABILISTIC RULE / NUMERIC TRANSFORMATION / PLAIN OMELETTE / TOY EXAMPLES

Answer: BONSAI

SLIDE 5

Optimization Problem

Input: two sets S, T of strings.
Output: a pattern p that maximizes the score function f(xp, yp, |S|, |T|).

xp: the number of strings in S that p matches.
yp: the number of strings in T that p matches.

The score function f expresses the goodness of p in terms of separating the two sets S and T.

SLIDE 6

Process of Computation

INPUT: the two sets S and T. Compute the "goodness" for all possible patterns. OUTPUT: the pattern of best score, as fast as possible!

SLIDE 7

Previous Work

  • BONSAI (discovering the best Substring pattern), Shimozono et al., 1994
  • Discovering the best Subsequence pattern, Hirao et al., 2000
  • Discovering the best Episode pattern, Hirao et al., 2001
  • Discovering the best VLDC pattern, Inenaga et al., 2002
  • Discovering the best Window Accumulated VLDC pattern, Inenaga et al., 2002

SLIDE 8

This Work

We present efficient algorithms to discover:

  • the best Fixed/Variable Length Don't Care (FVLDC) pattern
  • the best Approximate FVLDC pattern

The aim is to apply more expressive pattern classes to BONSAI.

  • the best Window Accumulated FVLDC pattern
  • the best Window Accumulated Approximate FVLDC pattern

The aim is to add more classificatory power to the pattern classes.

SLIDE 9

Score Function

The goodness of pattern p:

good(p, S, T) = f(xp, yp, |S|, |T|)

S, T: the two given sets of strings.
xp: the number of strings in S that p matches.
yp: the number of strings in T that p matches.

If the score function f is conic, then we can apply an efficient pruning technique to speed up the computation.

SLIDE 10

Score Function to be Conic

[Plots: the shape required of a conic score function f, shown as curves of f against x and against y.]

SLIDE 11

Conic Function Property

upperBound(x, y) = max{f(0, 0), f(x, 0), f(0, y), f(x, y)}

upperBound(x, y) is the maximum value of f on the rectangle with corners (0, 0), (x, 0), (0, y), (x, y); for any point (x', y') inside it,

f(x', y') ≤ upperBound(x, y)
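The corner property can be written down directly; this is a sketch with the score function passed in as f(x, y, |S|, |T|), and the accuracy-style f in the example is our own (linear, hence conic) choice:

```python
def upper_bound(f, x, y, ns, nt):
    # For a conic score function f, the maximum over the rectangle
    # [0, x] x [0, y] is attained at one of its four corners, so the
    # bound is simply the best corner value.
    return max(f(0, 0, ns, nt), f(x, 0, ns, nt),
               f(0, y, ns, nt), f(x, y, ns, nt))

# hypothetical conic score: classification accuracy
acc = lambda x, y, ns, nt: (x + (nt - y)) / (ns + nt)
print(upper_bound(acc, 6, 4, 10, 10))  # prints 0.8, from the corner f(6, 0)
```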

SLIDE 12

Pruning Technique

numOfMatchedStr(p, S) and numOfMatchedStr(p, T) can only shrink as a pattern is extended, so the goodness of an extension such as d∗scover is at most the upperBound of its prefix d∗sco. If the upperBound of d∗sco is below the current best score, every extension of d∗sco is pruned.
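Putting the bound and the pruning rule together, a minimal branch-and-bound sketch over plain substring patterns (a stand-in for the FVLDC class; all names here are hypothetical):

```python
def discover_best(S, T, alphabet, max_len, f):
    # Enumerate substring patterns depth-first.  Extending a pattern can
    # only shrink its match counts, so any extension of p scores at most
    # upperBound(x_p, y_p); branches that cannot beat the best are cut.
    ns, nt = len(S), len(T)
    best_score, best_pat = float("-inf"), None
    stack = [""]
    while stack:
        p = stack.pop()
        x = sum(1 for w in S if p in w)
        y = sum(1 for w in T if p in w)
        score = f(x, y, ns, nt)
        if score > best_score:
            best_score, best_pat = score, p
        ub = max(f(0, 0, ns, nt), f(x, 0, ns, nt),
                 f(0, y, ns, nt), f(x, y, ns, nt))
        if ub <= best_score or len(p) >= max_len:
            continue  # prune: no extension can improve on the best so far
        stack.extend(p + c for c in alphabet)
    return best_pat, best_score

acc = lambda x, y, ns, nt: (x + (nt - y)) / (ns + nt)
pat, score = discover_best(["abba", "aba"], ["bbb"], "ab", 2, acc)
print(score)  # prints 1.0: a pattern matching all of S and none of T exists
```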

SLIDE 13

FVLDC Pattern

A Fixed/Variable Length Don't Care pattern is an element of Π = (Σ∪{○, ★})∗, where ○ matches any single character and ★ matches any string. e.g. the FVLDC pattern ab○a○★b matches abbaabbb.

SLIDE 14

FVLDC Pattern Matching

We use an NFA that recognizes the language of a given FVLDC pattern p. The number of states is m+1, where m is the number of constants and ○'s in p; each ★ becomes a Σ self-loop. e.g. p = ★ab○★b.

Using the bit-parallel technique, we can do matching for p in O(m|Σ|) preprocessing time and O(n) running time.
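For clarity, here is a sketch that simulates that NFA with an explicit set of states instead of the word-packed bit-parallel version the slides describe (so it runs in O(nm) rather than O(n) time); the function names are ours:

```python
def parse_fvldc(p):
    # Split the pattern into its constant/'○' symbols; a '★' contributes
    # a Σ self-loop on the state reached just before the next symbol.
    syms, loops = [], set()
    for ch in p:
        if ch == "★":
            loops.add(len(syms))
        else:
            syms.append(ch)
    return syms, loops

def fvldc_matches(p, w):
    # NFA simulation: state i means "the first i symbols are matched";
    # m+1 states, m = number of constants and ○'s, as on the slide.
    syms, loops = parse_fvldc(p)
    m = len(syms)
    states = {0}
    for c in w:
        nxt = set()
        for s in states:
            if s in loops:                     # ★: consume c, stay put
                nxt.add(s)
            if s < m and syms[s] in ("○", c):  # ○ or a matching constant
                nxt.add(s + 1)
        states = nxt
    return m in states                         # all symbols consumed

print(fvldc_matches("ab○a○★b", "abbaabbb"))  # prints True (slide 13 example)
```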

SLIDE 15

Approximate FVLDC Pattern

An Approximate FVLDC pattern is an element of Π×Ν, where Ν is the set of non-negative integers.

An Approximate FVLDC pattern <p, k> is said to match a string w within distance k if the Hamming distance between p and w is at most k. e.g. the Approximate FVLDC pattern <ab○a○★b, 1> matches abbaabba.

SLIDE 16

Approximate FVLDC Pattern Matching

We use an NFA that recognizes the language of a given Approximate FVLDC pattern <p, k>. The NFA has (m+1)(k+1) states, but (m−k+1)(k+1) bits are actually enough. If (m−k+1)(k+1) is not larger than the computer word length, our bit-parallel algorithm runs in O(n) time after O(m|Σ|)-time preprocessing for p.
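Again as an explicit-state sketch rather than the bit-parallel automaton (names ours; mismatches are counted as substitutions at constant positions):

```python
def approx_fvldc_matches(p, k, w):
    # States are pairs (i, e): the first i pattern symbols matched with
    # e mismatches so far -- one NFA row per mismatch count, as on the
    # (m+1)(k+1)-state construction of slide 17.
    syms, loops = [], set()
    for ch in p:
        if ch == "★":
            loops.add(len(syms))    # Σ self-loop before the next symbol
        else:
            syms.append(ch)
    m = len(syms)
    states = {(0, 0)}
    for c in w:
        nxt = set()
        for i, e in states:
            if i in loops:                  # ★: consume c, stay put
                nxt.add((i, e))
            if i < m:
                if syms[i] in ("○", c):     # exact step, no cost
                    nxt.add((i + 1, e))
                elif e < k:                 # substitution, cost 1
                    nxt.add((i + 1, e + 1))
        states = nxt
    return any(i == m for i, _ in states)

# slide 15 example: <ab○a○★b, 1> matches abbaabba within distance 1
print(approx_fvldc_matches("ab○a○★b", 1, "abbaabba"))  # prints True
```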

SLIDE 17

Approximate FVLDC Pattern Matching

p = <★ab○★b, 2>, m = 4, k = 2.

[Diagram: the NFA has (m+1)(k+1) states, drawn as k+1 rows for Mismatches = 0, 1, 2; each row is a copy of the pattern automaton over a, b, ○ with Σ self-loops at the ★ positions, and mismatch edges lead to the next row.]

SLIDE 18

Approximate FVLDC Pattern Matching

p = <★ab○★b, 2>, m = 4, k = 2.

[Diagram: of the rows for Mismatches = 0, 1, 2, only (m−k+1)(k+1) states are actually necessary.]

SLIDE 19

More Classificatory Pattern Class

p = ★d○★sc○★very★
w = fhdihertlhglehglioogfrg xawpolmkhhjqirvnbotuhxxxxr ylnvhbtriscovbgneinmvgerig eooitrnrnvevroigreintnnvoi woireohirlneroiveryniritro eitruijnnbrymxbairive

Is there any pattern similar to "discovery" here?

SLIDE 20

Window Accumulation

p = ★d○★sc○★very★

Bound the length of an occurrence of p by a window size h. This way we can get rid of redundant matches and obtain better classification!

SLIDE 21

Window Accumulated Pattern Matching

We use two NFAs, each recognizing the language of either a given FVLDC pattern p or its reversal, e.g. prev = b★○ba★ for p = ★ab○★b.

Using the bit-parallel technique, we can do pattern matching for <p, h> in O(m|Σ|) preprocessing time and O(n²) running time. The same holds for Window Accumulated Approximate FVLDC patterns.
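Under our reading of the window-accumulated semantics (<p, h> matches w iff p matches some substring of w of length at most h), a brute-force sketch looks like this; the paper's two-NFA bit-parallel algorithm achieves the same check far more efficiently:

```python
def fvldc_match_whole(p, w):
    # Recursive whole-string matcher: ○ = any one character, ★ = any string.
    if not p:
        return not w
    if p[0] == "★":
        return any(fvldc_match_whole(p[1:], w[i:]) for i in range(len(w) + 1))
    return bool(w) and p[0] in ("○", w[0]) and fvldc_match_whole(p[1:], w[1:])

def window_matches(p, w, h):
    # <p, h>: is there an occurrence of p of length at most h in w?
    n = len(w)
    return any(fvldc_match_whole(p, w[i:j])
               for i in range(n + 1)
               for j in range(i, min(i + h, n) + 1))

print(window_matches("a★b", "axb", 3))    # prints True: "axb" fits the window
print(window_matches("a★b", "axxxb", 3))  # prints False: shortest occurrence is 5 long
```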

SLIDE 22

Experimental Environment

Machine: AlphaStation XP1000
CPU: 667 MHz Alpha 21264 processor
OS: Tru64 UNIX V4.0F
Datasets:
(1) completely random data
(2) VLDC pattern embedded data
(3) FVLDC pattern embedded data
(4) 2-approx. VLDC pattern embedded data
(5) window-accumulated 2-approx. VLDC pattern embedded data

SLIDE 23

Experimental Result 1

SLIDE 24

Experimental Result 2

SLIDE 25

Experimental Result 3

SLIDE 26

Experimental Result 4

Execution times (in seconds) for different pattern classes on datasets (1)-(5); the maximum pattern length was set to 7. The execution time of each window-accumulated version on dataset (5) is shown in parentheses.

pattern class            (1)    (2)    (3)    (4)    (5)
VLDC                     224    182    236    109    423   (554)
FVLDC                    623    514    645    331   1068  (1579)
approx. VLDC (kmax=1)   1026    853   1088    725   2203  (1820)
approx. VLDC (kmax=2)   2035   1790   2185   1660   4569  (3558)
approx. VLDC (kmax=3)   3146   2868   3324   2739   6973  (5679)
approx. VLDC (kmax=4)   4304   4008   4492   3880   9396  (8377)