Discovering Most Classificatory Patterns - PowerPoint PPT Presentation



SLIDE 1

Discovering Most Classificatory Patterns for Very Expressive Pattern Classes

Masayuki Takeda 1,2, Shunsuke Inenaga 3, Hideo Bannai 4, Ayumi Shinohara 1,2, and Setsuo Arikawa 1

1 Department of Informatics, Kyushu University
2 Japan Science and Technology Corporation
3 Department of Computer Science, University of Helsinki
4 Human Genome Center, University of Tokyo

SLIDE 2

Background and Motivation

  • Distinguish two given string datasets
    - to obtain a good rule and/or useful knowledge
  • Upgrade the BONSAI system
    - so that it can deal with more expressive pattern classes

SLIDE 3

Machine Discovery System BONSAI [Shimozono et al., 1994]

[Diagram: the BONSAI pipeline. Positive and negative example datasets (POS, NEG; pos, neg) are transformed by an Indexing step into binary-encoded sets I(POS), I(NEG) (e.g. the alphabet ABCDEFGHIJKLMNOPQRSTUVWXY indexed as 0011001010001110000011010); a Decision Tree Generator builds a tree over patterns such as x11y, x101y, x111y with P/N leaves; a Combinatorial Optimization Algorithm searches over indexings, guided by an Accuracy Evaluation of the resulting tree.]

SLIDE 4

Pattern Discovery from Datasets

Find a pattern string that occurs in all strings of A and in no strings of B.

A: AKEBONO MUSASHIMARU / CONTRIBUTIONS OF AI / BEYOND MESSY LEARNING / BASED ON LOCAL SEARCH ALGORITHMS / BOOLEAN CLASSIFICATION / SYMBOLIC TRANSFORMATION / BACON SANDWICH / PUBLICATION OF DISSERTATION

B: WAKANOHANA TAKANOHANA / CONTRIBUTIONS OF UN / TRADITIONAL APPROACHES / GENETIC ALGORITHMS / PROBABILISTIC RULE / NUMERIC TRANSFORMATION / PLAIN OMELETTE / TOY EXAMPLES

Answer: BONSAI

SLIDE 5

Optimization Problem

Input: two sets S, T of strings.
Output: a pattern p that maximizes the score function f(xp, yp, |S|, |T|).

xp: the number of strings in S that p matches.
yp: the number of strings in T that p matches.

The score function f expresses the goodness of p in terms of separating the two sets S and T.

SLIDE 6

Process of Computation

INPUT: the two sets S and T. Compute the "goodness" for all possible patterns. OUTPUT: the pattern of best score, as fast as possible!

SLIDE 7

Previous Work

  • BONSAI (discovering the best Substring pattern), Shimozono et al., 1994
  • Discovering the best Subsequence pattern, Hirao et al., 2000
  • Discovering the best Episode pattern, Hirao et al., 2001
  • Discovering the best VLDC pattern, Inenaga et al., 2002
  • Discovering the best Window Accumulated VLDC pattern, Inenaga et al., 2002

SLIDE 8

This Work

We present efficient algorithms to discover:

  • the best Fixed/Variable Length Don't Care (FVLDC) pattern
  • the best Approximate FVLDC pattern

The aim is to apply more expressive pattern classes to BONSAI.

  • the best Window Accumulated FVLDC pattern
  • the best Window Accumulated Approximate FVLDC pattern

The aim is to add more classificatory power to the pattern classes.

SLIDE 9

Score Function

The goodness of pattern p:

good(p, S, T) = f(xp, yp, |S|, |T|)

S, T: the two given sets of strings.
xp: the number of strings in S that p matches.
yp: the number of strings in T that p matches.

If the score function f is conic, then we can apply an efficient pruning technique to speed up the computation.

SLIDE 10

Score Function to be Conic

[Plots: the shape required of a conic score function f, shown as curves of f against x and against y.]

SLIDE 11

Conic Function Property

upperBound(x, y) = max{f(0, 0), f(x, 0), f(0, y), f(x, y)}

upperBound(x, y) is the maximum value of f on the rectangle with corners (0, 0), (x, 0), (0, y), (x, y); for any point (x', y') inside it,

f(x', y') ≤ upperBound(x, y)
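The corner property can be written down directly; this is a sketch with the score function passed in as f(x, y, |S|, |T|), and the accuracy-style f in the example is our own (linear, hence conic) choice:

```python
def upper_bound(f, x, y, ns, nt):
    # For a conic score function f, the maximum over the rectangle
    # [0, x] x [0, y] is attained at one of its four corners, so the
    # bound is simply the best corner value.
    return max(f(0, 0, ns, nt), f(x, 0, ns, nt),
               f(0, y, ns, nt), f(x, y, ns, nt))

# hypothetical conic score: classification accuracy
acc = lambda x, y, ns, nt: (x + (nt - y)) / (ns + nt)
print(upper_bound(acc, 6, 4, 10, 10))  # prints 0.8, from the corner f(6, 0)
```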

SLIDE 12

Pruning Technique

numOfMatchedStr(p, S) and numOfMatchedStr(p, T) can only shrink as a pattern is extended, so the goodness of an extension such as d∗scover is at most the upperBound of its prefix d∗sco. If the upperBound of d∗sco is below the current best score, every extension of d∗sco is pruned.
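Putting the bound and the pruning rule together, a minimal branch-and-bound sketch over plain substring patterns (a stand-in for the FVLDC class; all names here are hypothetical):

```python
def discover_best(S, T, alphabet, max_len, f):
    # Enumerate substring patterns depth-first.  Extending a pattern can
    # only shrink its match counts, so any extension of p scores at most
    # upperBound(x_p, y_p); branches that cannot beat the best are cut.
    ns, nt = len(S), len(T)
    best_score, best_pat = float("-inf"), None
    stack = [""]
    while stack:
        p = stack.pop()
        x = sum(1 for w in S if p in w)
        y = sum(1 for w in T if p in w)
        score = f(x, y, ns, nt)
        if score > best_score:
            best_score, best_pat = score, p
        ub = max(f(0, 0, ns, nt), f(x, 0, ns, nt),
                 f(0, y, ns, nt), f(x, y, ns, nt))
        if ub <= best_score or len(p) >= max_len:
            continue  # prune: no extension can improve on the best so far
        stack.extend(p + c for c in alphabet)
    return best_pat, best_score

acc = lambda x, y, ns, nt: (x + (nt - y)) / (ns + nt)
pat, score = discover_best(["abba", "aba"], ["bbb"], "ab", 2, acc)
print(score)  # prints 1.0: a pattern matching all of S and none of T exists
```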

SLIDE 13

FVLDC Pattern

A Fixed/Variable Length Don't Care pattern is an element of Π = (Σ∪{○, ★})∗, where ○ matches any single character and ★ matches any string. e.g. the FVLDC pattern ab○a○★b matches abbaabbb.

SLIDE 14

FVLDC Pattern Matching

We use an NFA that recognizes the language of a given FVLDC pattern p. The number of states is m+1, where m is the number of constants and ○'s in p; each ★ becomes a Σ self-loop. e.g. p = ★ab○★b.

Using the bit-parallel technique, we can do matching for p in O(m|Σ|) preprocessing time and O(n) running time.
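For clarity, here is a sketch that simulates that NFA with an explicit set of states instead of the word-packed bit-parallel version the slides describe (so it runs in O(nm) rather than O(n) time); the function names are ours:

```python
def parse_fvldc(p):
    # Split the pattern into its constant/'○' symbols; a '★' contributes
    # a Σ self-loop on the state reached just before the next symbol.
    syms, loops = [], set()
    for ch in p:
        if ch == "★":
            loops.add(len(syms))
        else:
            syms.append(ch)
    return syms, loops

def fvldc_matches(p, w):
    # NFA simulation: state i means "the first i symbols are matched";
    # m+1 states, m = number of constants and ○'s, as on the slide.
    syms, loops = parse_fvldc(p)
    m = len(syms)
    states = {0}
    for c in w:
        nxt = set()
        for s in states:
            if s in loops:                     # ★: consume c, stay put
                nxt.add(s)
            if s < m and syms[s] in ("○", c):  # ○ or a matching constant
                nxt.add(s + 1)
        states = nxt
    return m in states                         # all symbols consumed

print(fvldc_matches("ab○a○★b", "abbaabbb"))  # prints True (slide 13 example)
```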

SLIDE 15

Approximate FVLDC Pattern

An Approximate FVLDC pattern is an element of Π×Ν, where Ν is the set of non-negative integers.

An Approximate FVLDC pattern <p, k> is said to match a string w within distance k if the Hamming distance between p and w is at most k. e.g. the Approximate FVLDC pattern <ab○a○★b, 1> matches abbaabba.

SLIDE 16

Approximate FVLDC Pattern Matching

We use an NFA that recognizes the language of a given Approximate FVLDC pattern <p, k>. The NFA has (m+1)(k+1) states, but (m−k+1)(k+1) bits are actually enough. If (m−k+1)(k+1) is not larger than the computer word length, our bit-parallel algorithm runs in O(n) time after O(m|Σ|)-time preprocessing for p.
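Again as an explicit-state sketch rather than the bit-parallel automaton (names ours; mismatches are counted as substitutions at constant positions):

```python
def approx_fvldc_matches(p, k, w):
    # States are pairs (i, e): the first i pattern symbols matched with
    # e mismatches so far -- one NFA row per mismatch count, as on the
    # (m+1)(k+1)-state construction of slide 17.
    syms, loops = [], set()
    for ch in p:
        if ch == "★":
            loops.add(len(syms))    # Σ self-loop before the next symbol
        else:
            syms.append(ch)
    m = len(syms)
    states = {(0, 0)}
    for c in w:
        nxt = set()
        for i, e in states:
            if i in loops:                  # ★: consume c, stay put
                nxt.add((i, e))
            if i < m:
                if syms[i] in ("○", c):     # exact step, no cost
                    nxt.add((i + 1, e))
                elif e < k:                 # substitution, cost 1
                    nxt.add((i + 1, e + 1))
        states = nxt
    return any(i == m for i, _ in states)

# slide 15 example: <ab○a○★b, 1> matches abbaabba within distance 1
print(approx_fvldc_matches("ab○a○★b", 1, "abbaabba"))  # prints True
```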

SLIDE 17

Approximate FVLDC Pattern Matching

p = <★ab○★b, 2>, m = 4, k = 2.

[Diagram: the NFA has (m+1)(k+1) states, drawn as k+1 rows for Mismatches = 0, 1, 2; each row is a copy of the pattern automaton over a, b, ○ with Σ self-loops at the ★ positions, and mismatch edges lead to the next row.]

SLIDE 18

Approximate FVLDC Pattern Matching

p = <★ab○★b, 2>, m = 4, k = 2.

[Diagram: of the rows for Mismatches = 0, 1, 2, only (m−k+1)(k+1) states are actually necessary.]

SLIDE 19

More Classificatory Pattern Class

p = ★d○★sc○★very★
w = fhdihertlhglehglioogfrg xawpolmkhhjqirvnbotuhxxxxr ylnvhbtriscovbgneinmvgerig eooitrnrnvevroigreintnnvoi woireohirlneroiveryniritro eitruijnnbrymxbairive

Is there any pattern similar to "discovery" here?

SLIDE 20

Window Accumulation

p = ★d○★sc○★very★

Bound the length of an occurrence of p by a window size h. This way we can get rid of redundant matches and obtain better classification!

SLIDE 21

Window Accumulated Pattern Matching

We use two NFAs, each recognizing the language of either a given FVLDC pattern p or its reversal, e.g. prev = b★○ba★ for p = ★ab○★b.

Using the bit-parallel technique, we can do pattern matching for <p, h> in O(m|Σ|) preprocessing time and O(n²) running time. The same holds for Window Accumulated Approximate FVLDC patterns.
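Under our reading of the window-accumulated semantics (<p, h> matches w iff p matches some substring of w of length at most h), a brute-force sketch looks like this; the paper's two-NFA bit-parallel algorithm achieves the same check far more efficiently:

```python
def fvldc_match_whole(p, w):
    # Recursive whole-string matcher: ○ = any one character, ★ = any string.
    if not p:
        return not w
    if p[0] == "★":
        return any(fvldc_match_whole(p[1:], w[i:]) for i in range(len(w) + 1))
    return bool(w) and p[0] in ("○", w[0]) and fvldc_match_whole(p[1:], w[1:])

def window_matches(p, w, h):
    # <p, h>: is there an occurrence of p of length at most h in w?
    n = len(w)
    return any(fvldc_match_whole(p, w[i:j])
               for i in range(n + 1)
               for j in range(i, min(i + h, n) + 1))

print(window_matches("a★b", "axb", 3))    # prints True: "axb" fits the window
print(window_matches("a★b", "axxxb", 3))  # prints False: shortest occurrence is 5 long
```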

SLIDE 22

Experimental Environment

Machine: AlphaStation XP1000
CPU: 667 MHz Alpha 21264 processor
OS: Tru64 UNIX V4.0F
Datasets:
(1) completely random data
(2) VLDC pattern embedded data
(3) FVLDC pattern embedded data
(4) 2-approx. VLDC pattern embedded data
(5) window-accumulated 2-approx. VLDC pattern embedded data

SLIDE 23

Experimental Result 1

SLIDE 24

Experimental Result 2

SLIDE 25

Experimental Result 3

SLIDE 26

Experimental Result 4

Execution times (in seconds) for different pattern classes on datasets (1)-(5); the maximum pattern length was set to 7. The execution time of each window-accumulated version on dataset (5) is shown in parentheses.

pattern class            (1)    (2)    (3)    (4)    (5)
VLDC                     224    182    236    109    423   (554)
FVLDC                    623    514    645    331   1068  (1579)
approx. VLDC (kmax=1)   1026    853   1088    725   2203  (1820)
approx. VLDC (kmax=2)   2035   1790   2185   1660   4569  (3558)
approx. VLDC (kmax=3)   3146   2868   3324   2739   6973  (5679)
approx. VLDC (kmax=4)   4304   4008   4492   3880   9396  (8377)