Mining Frequent Episodes for Relating Financial Events and Stock Trends


Anny Ng and Ada Wai-chee Fu
Department of Computer Science and Engineering, The Chinese University of Hong Kong


Shatin, Hong Kong
Email: {ang, adafu}@cse.cuhk.edu.hk

Abstract. It is expected that stock prices can be affected by local and overseas political and economic events. We extract events from the financial news of Chinese local newspapers that are available on the web. The news items are matched against stock price databases, and a new method is proposed for mining frequent temporal patterns.
Introduction

In the stock market, share prices can be influenced by many factors, ranging from news releases of companies and local politics to news of superpower economies. We call these incidents events. We assume that each event is of a certain event type and that each event has a time of occurrence, typically given by the date on which the event occurs or is reported. Each event therefore corresponds to a time point. We expect that events like the Hong Kong government announcing a deficit and Washington deciding to increase the interest rate may lead to fluctuations in Hong Kong stock prices within a short period of time.

When a number of events occur within a short period of time, we assume that they possibly have some relationship. Such a period of time can be determined by the application experts; it is called a window and is usually limited to a few days. Roughly speaking, a set of events that occur within a window is called an episode instance. The set of event types in the instance is called an episode. For example, we may have the following statement in a financial report: "Telecommunications stocks pushed the Hang Seng Index higher following the Star TV-HK Telecom and Orange-Mannesmann deals." This can be an example of an episode in which all four events (telecommunications stocks rise, the Hang Seng Index surges, and the two deals of Star TV-HK Telecom and Orange-Mannesmann) happened within a period of a few days. If there are many instances of the same episode, it is called a frequent episode.

We are interested in finding frequent episodes related to stock movements. The stock movement need not be the last event occurring in the episode instance, because the movement of stocks may be caused by the investors' expectation that something would happen in the following days. For example, we can have a news report saying "Hong Kong shares slid yesterday in a market burdened by the fear of possible United States interest rate rises tomorrow." Therefore, we do not assume an ordering of the events in an episode.
From the frequent episodes we may discover the factors behind the fluctuation of stock prices. We are interested in a special type of episode that we call a stock-episode. It can be written as $\langle e_1, e_2, \ldots, e_n \; (t\ \text{days}) \rangle$, where $e_1, e_2, \ldots, e_n$ are event types and at least one of the events is an event of stock fluctuation. An instance of this stock-episode is an instance in which events of the event types $e_1, e_2, \ldots, e_n$ appear in a window of $t$ days. Since we are only concerned with stock-episodes, we shall simply refer to stock-episodes as episodes.
Definitions

Let $E = \{E_1, E_2, \ldots, E_m\}$ be a set of event types. Assume that we have a database that records events for days 1 to $n$. We call this an event database and represent it as $DB = \langle D_1, D_2, \ldots, D_n \rangle$, where $D_i$ is for day $i$ and $D_i = \{e_{i1}, e_{i2}, \ldots, e_{ik}\}$, where $e_{ij} \in E$ ($1 \le j \le k$). This means that the events that happen on day $i$ have event types $e_{i1}, e_{i2}, \ldots, e_{ik}$. Each $D_i$ is called a day record. The day records $D_i$ in the database are consecutive and arranged in chronological order, where $D_i$ is one day before $D_{i+1}$ for all $1 \le i \le n-1$.

$P = \{e_{p1}, e_{p2}, \ldots, e_{pb}\}$, where $e_{pi} \in E$ ($1 \le i \le b$), is an episode if $P$ has at least two elements and at least one $e_{pj}$ is a stock event type.

We assume that a window size of $x$ days is given; a window is a consecutive sequence of $x$ days. We are interested in events that occur within a short period, as defined by a window. If the database consists of $m$ days and the window size is $x$ days, there are $m$ windows in the database. The first window contains exactly the days $D_1, D_2, \ldots, D_x$. The $i$-th window contains $D_i, D_{i+1}, \ldots$, with up to $x$ days. The second-last window contains $D_{m-1}, D_m$, and the last window contains only $D_m$.
In some previous work, such as [6], the frequency of an episode is defined as the number of windows that contain the events in the episode. For our application we notice a problem with this definition. Suppose we have a window size of $x$. If an episode occurs in a single day $i$, then the windows starting from day $i-x+1$ up to the window starting from day $i$ all contain the episode, so the frequency of the episode will be $x$. However, the episode has actually occurred only once. Therefore we propose a different definition for the frequency of an episode.

Definition 1. Given a window size of $x$ days for $DB$ and an episode $P$, an episode instance of $P$ is an occurrence of all the event types in $P$ within a window $W$, where the record of the first day of the window $W$ contains at least one of the event types in $P$. Each window can be counted at most once as an episode instance for a given episode. The frequency of an event is the number of occurrences of the event in the database. The support, or the frequency, of an episode is the number of instances of the episode. Therefore the frequency of an episode $P$ is the number of windows $W$ such that $W$ contains all the event types in $P$ and the first day of $W$ contains at least one of the event types in $P$. An episode is a frequent episode if its frequency is at least a given minimum support threshold.
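The difference between the earlier window-counting definition and Definition 1 can be illustrated with a minimal Python sketch. The function names (`windows`, `containing_windows`, `episode_frequency`) and the five-day toy database are assumptions made for illustration only; they do not come from the paper.

```python
def windows(db, x):
    """Yield one window per day: window i covers days i .. i+x-1 (fewer near the end)."""
    for i in range(len(db)):
        yield db[i:i + x]


def containing_windows(db, x, episode):
    """Earlier definition: count every window that contains all event types."""
    episode = set(episode)
    return sum(1 for w in windows(db, x) if episode.issubset(set().union(*w)))


def episode_frequency(db, x, episode):
    """Definition 1: the first day of the window must also hold an episode event type."""
    episode = set(episode)
    return sum(
        1
        for w in windows(db, x)
        if episode.issubset(set().union(*w)) and episode & w[0]
    )


# Hypothetical 5-day database: the episode {a, b} occurs once, on day 3.
db = [set(), set(), {"a", "b"}, set(), set()]
old_count = containing_windows(db, 3, {"a", "b"})
new_count = episode_frequency(db, 3, {"a", "b"})
print(old_count, new_count)
```

With a window size of 3, the single occurrence is seen by three overlapping windows under the earlier definition, but only the window that starts on day 3 qualifies under Definition 1.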
Problem definition. Our problem is to find all the frequent episodes, given an event database and the parameters of window size and minimum support threshold.

Let us call the number of occurrences of an event type $a$ in $DB$ the database frequency of $a$. Let us call the number of windows that contain an event type $a$ the window frequency of $a$. The window frequency of $a$ is typically greater than the database frequency of $a$, since the same occurrence of $a$ can be contained in multiple windows. We have the following property.

Property 1. For any episode that contains an event $a$, its frequency must be less than or equal to the window frequency of $a$. That is, the upper limit of the frequency of an episode containing $a$ is the window frequency of $a$.

Lemma 1. For a frequent episode, a subset of that episode may not be frequent.

Proof. We prove this by giving a counterexample to the hypothesis that all subsets of a frequent episode are frequent. Suppose we have a database with 7 days and a window size of 2 days (a window size of 3 days gives the same counts), and the records $D_1$ to $D_7$ are $\{b\}, \{a, c\}, \{b\}, \{d\}, \{b\}, \{c, a\}, \{d\}$, respectively. If the threshold is 3, then $\{a, b, c\}$ has 3 occurrences and is a frequent episode, while $\{a, c\}$, which is a subset of $\{a, b, c\}$, has only 2 occurrences and is not a frequent episode.
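The counterexample in the proof can be checked mechanically. The sketch below is an illustrative assumption, not code from the paper: `episode_frequency` is a hypothetical helper that applies Definition 1 to the seven day records of the proof, and it is evaluated for window sizes of 2 and 3 days.

```python
def episode_frequency(db, x, episode):
    """Count windows that contain the whole episode and whose first day
    contains at least one event type of the episode (Definition 1)."""
    episode = set(episode)
    count = 0
    for i in range(len(db)):
        window = db[i:i + x]
        if episode.issubset(set().union(*window)) and episode & window[0]:
            count += 1
    return count


# Day records D1 .. D7 from the proof of Lemma 1.
db = [{"b"}, {"a", "c"}, {"b"}, {"d"}, {"b"}, {"c", "a"}, {"d"}]

# Map each window size to (frequency of {a,b,c}, frequency of {a,c}).
results = {
    x: (episode_frequency(db, x, "abc"), episode_frequency(db, x, "ac"))
    for x in (2, 3)
}
print(results)
```

For both window sizes the superset reaches a frequency of 3 while its subset only reaches 2, so the subset closure property used by Apriori-style methods does not hold here.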
Related Work

The mining of frequent temporal patterns has been considered for sales records, financial data, weather forecasts, and other applications. The definitions of the patterns vary across applications. In general, an episode is a number of events occurring within a specific short period of time. Whether the ordering of events in an episode is restricted depends on the application. Previous related research includes discovering sequential patterns [1], frequent episodes [6], temporal patterns [4], and frequent patterns [3]. In [6], an episode is defined as a set of partially ordered events appearing close together, which is different from our definition of a stock-episode. Some related work focuses on stock movement [5], but we would like to relate financial events with stock movement.

When we deal with events that last for a period of time, we may consider the starting time and ending time of the events, as well as their temporal relations, such as overlap and during. [4] discovers more kinds of temporal patterns of this sort. Other work discovers frequent sequential patterns by using a tree structure. [6] finds the frequent serial and parallel episodes in a sequence of point-based events. There, an episode is a partially ordered set of events occurring close together. An episode X is a subepisode of another episode Y if all events in X are also contained in Y and the order of events in X is the same as that in Y. The frequency of an episode is the number of windows containing the episode. Note that this definition is different from ours, since it allows the same episode instance to be counted multiple times when multiple windows happen to contain the instance.
Most of the algorithms introduced above are based on the Apriori algorithm. Our definition of frequent episodes does not give rise to the subset closure property utilized in these methods. [3] provides a fast alternative for finding frequent patterns with a frequent pattern tree (FP-tree), which is a kind of prefix tree. However, the FP-tree is not designed for temporal pattern mining. There is some related work on applying the technique to mine frequent subsequences in given sequences [8], but the problem is quite different from ours.
An Event Tree for the Database

The method we propose has some similarity to that in [3]. We use a tree structure to represent the sets of event types with paths and nodes. The process comprises two phases: (1) tree construction and (2) mining frequent episodes.

The tree structure for storing the event database is called the event tree. It has some similarity to the FP-tree. The root of the event tree is a null node. Each node is labeled by an event type. Each node also contains a count and a binary bit, which indicates the node type.

Before the event tree is built, we first gather the frequencies of each event type in the database DB. We sort the event types by descending frequencies. Next we consider the windows in the database. For each window:

1. Find the set F of event types in the first day and the set R of event types in the remaining days. F and R are each sorted in descending database frequency order.
2. Concatenate the sorted list from F and that from R into one list and insert it into the event tree. One tree node corresponds to each event type in each of F and R. If an event type is from F, the binary bit in the tree node is 0; if the event type is from R, the binary bit in the tree node is 1.

Windows with similar event types may share the same prefix path in the tree, with accumulated counts. Hence a path may correspond to multiple windows. If a new tree node is entered into the tree, its count is initialized to 1. When an event type is inserted into an existing node, the count in the node is incremented by 1.

In the event tree, each path from the root node to a leaf node is called a window path, or simply a path when no ambiguity can arise. The event tree differs from an FP-tree in that each window path of the tree is divided into two parts. There is a cut point in the path such that the nodes above the cut point have their binary bits set to 0. This is called the first-day part of the path. The second part of the path, with binary bits of 1, is called the remaining-days part of the path.

There is a header table that contains the event types sorted in descending order of their frequencies. Each entry in the header table is the header of a linked list of all the nodes in the event tree labeled with the same event type as the header entry. Each time a tree node x is created with a label of event type e, the node x is added to the linked list from the header table at entry e. The linked list therefore has a mixture of nodes with binary bits of 0 or 1.

The advantage of the event tree structure is that windows with common frequent event types are likely to share the same prefix nodes in the event tree. In each of the first-day part and the remaining-days part, the more frequent an event is, the higher the level of its node, so as to increase the chance of reusing existing nodes.

Before building the tree, we can do some pruning based on event type frequencies. Event types with window frequencies less than the minimum support threshold are excluded from the tree, because they cannot appear in any frequent episode, for the reason stated in Property 1. Once an event type is excluded, it is ignored whenever it appears in a window. This helps us to reduce the size of the tree and reduce the chance of including non-frequent episodes.

Strategy 1. We remove those events with window frequencies less than the minimum support threshold before constructing the tree.

Strategy 2. In counting the windows for the frequency of an episode, each window can be counted at most once. If an event type appears in the first day and also in the remaining days of a window, the effect on the counting is the same as if the event type appears only in the first day. Therefore, for such a window, only the occurrence of the event type in the first day is kept, and those in the remaining part of the window are removed.

Example 1.
Given an event database as shown in Figure 1(a), suppose the window size is set to 2 days and the minimum support is set to 5. The event database is first scanned to sum up the frequencies of each event type in the database, and also the window frequencies. The database frequencies are a:4, b:5, c:3, d:3, m:2, x:3, y:2, z:2; the window frequencies are a:6, b:7, c:5, d:6, x:6, with m, y, and z each below 5. Thus the frequent event types are a, b, c, d, x, since their window frequencies are at least the minimum support. The frequent event types are sorted in descending order of their database frequencies, and the ordered frequent event types are ⟨b, a, c, d, x⟩.
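The first scan of Example 1 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the day records follow Figure 1(a), but the exact days on which the infrequent event types m, y, and z fall are assumed here, and ties in database frequency are broken alphabetically, which is a choice of this sketch that happens to reproduce the ordering in the text.

```python
from collections import Counter

# Day records 1 .. 8; the placement of m, y and z is an assumption.
days = [
    {"a", "b", "c"},
    {"y", "m", "b", "d", "x"},
    {"a", "c"},
    {"a", "b", "z"},
    {"m", "y", "d"},
    {"b", "x", "c"},
    {"d", "a", "b"},
    {"x", "z"},
]
window_size = 2
min_support = 5

# Database frequency: number of occurrences of each event type.
db_freq = Counter(event for day in days for event in day)

# Window frequency: number of windows that contain each event type.
win_freq = Counter()
for i in range(len(days)):
    win_freq.update(set().union(*days[i:i + window_size]))

# Keep event types whose window frequency reaches the minimum support,
# ordered by descending database frequency (ties broken alphabetically).
frequent = sorted(
    (e for e in win_freq if win_freq[e] >= min_support),
    key=lambda e: (-db_freq[e], e),
)
print(frequent)
```

Pruning uses the window frequency because, by Property 1, it bounds the frequency of any episode containing the event type, while the ordering uses the database frequency.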
Day   Events
1     a, b, c
2     y, m, b, d, x
3     a, c
4     a, b, z
5     m, y, d
6     b, x, c
7     d, a, b
8     x, z

(a)

Window No.   Days included   Event-set pair
1            1, 2            ⟨(b, a, c), (d, x)⟩
2            2, 3            ⟨(b, d, x), (a, c)⟩
3            3, 4            ⟨(a, c), (b)⟩
4            4, 5            ⟨(b, a), (d)⟩
5            5, 6            ⟨(d), (b, c, x)⟩
6            6, 7            ⟨(b, c, x), (a, d)⟩
7            7, 8            ⟨(b, a, d), (x)⟩
8            8               ⟨(x), ()⟩

(b)

Fig. 1. An event database and the corresponding windows

Next, a null root node is created. The event database is then scanned for the second time, reading the event types of every 2 days in order to insert the windows' event types into the tree. Keeping only the frequent event types and excluding the duplicate event types, the first window can be represented by ⟨(b, a, c), (d, x)⟩, in which the first pair of round brackets holds the event types in the first day of the window, while the second pair holds the event types in the remaining days of the window (the second day of the window in this example). Both event lists are sorted in decreasing database frequency order. We call ⟨(b, a, c), (d, x)⟩ the event-set pair representation of the window.

A first new path is built for the first window, with all counts initialized to one. The nodes are created in the sorted order, and the types of the nodes b, a, and c are set to 0 while those of nodes d and x are set to 1. The window is then shifted down by one day, and one more day of event types in the database is read to get the second window. The event types are sorted, and the second window is ⟨(b, d, x), (a, c)⟩. The tree after inserting the path for the second window is shown in Figure 2(a). Each tree node has a label of the form ⟨E:C:B⟩, where E is an event type, C is the count, and B is the binary bit. In this figure the dotted lines indicate the linked lists originating from items in the header table to all nodes in the tree with the same event type.

[Fig. 2. (a) The tree after inserting the first two windows; (b) a rough structure of the final tree constructed in Example 1. Both panels show the null root, the header table with entries b, a, c, d, x, and nodes labeled E:C:B. In (a), the root has the single child b:2:0, which heads two paths: b:2:0, a:1:0, c:1:0, d:1:1, x:1:1 and b:2:0, d:1:0, x:1:0, a:1:1, c:1:1. In (b), the root child b:5:0 has the children a:3:0, d:1:0, and c:1:0, and the root also has the children a:1:0 and d:1:0.]

The remaining windows are inserted into the tree in a similar way. The rough structure of the final tree is shown in Figure 2(b). Note that some dotted lines are omitted from the figure for clarity.
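The tree construction of Example 1 can be sketched as follows. This is a minimal sketch, not the authors' implementation (which the paper says is written in C): the `Node` class, the `build_event_tree` function, and the use of a Python dictionary of lists in place of the header table's linked lists are all assumptions of this sketch. The input is the list of event-set pairs from Figure 1(b), written as strings of single-letter event types.

```python
class Node:
    """One event-tree node: event type, binary bit (0 = first day, 1 = remaining days), count."""

    def __init__(self, event, bit, parent=None):
        self.event = event
        self.bit = bit
        self.count = 0
        self.parent = parent
        self.children = {}  # keyed by (event, bit)


def build_event_tree(pairs):
    """Insert each event-set pair as one window path; return (root, header)."""
    root = Node(None, None)
    header = {}  # event type -> list of nodes (stands in for the linked list)
    for first, rest in pairs:
        node = root
        labelled = [(e, 0) for e in first] + [(e, 1) for e in rest]
        for event, bit in labelled:
            child = node.children.get((event, bit))
            if child is None:
                child = Node(event, bit, node)
                node.children[(event, bit)] = child
                header.setdefault(event, []).append(child)
            child.count += 1  # a new node ends at 1, an existing node is incremented
            node = child
    return root, header


# Event-set pairs of the eight windows in Figure 1(b).
pairs = [
    ("bac", "dx"), ("bdx", "ac"), ("ac", "b"), ("ba", "d"),
    ("d", "bcx"), ("bcx", "ad"), ("bad", "x"), ("x", ""),
]
root, header = build_event_tree(pairs)
b_node = root.children[("b", 0)]
print(b_node.count, b_node.children[("a", 0)].count)
```

Keying children by both event type and bit keeps a first-day node and a remaining-days node for the same event type distinct, which matches the two d nodes under a:3:0 in Figure 2(b).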
Mining frequent episodes with the event tree

Our mining process is a recursive procedure applied to each of the linked lists kept at the header table. Let the event types at the header table be $h_1, h_2, \ldots, h_H$, in the top-down ordering of the table. We start from the event type $h_H$ at the bottom of the header table and traverse up the header table. We have the following objective in this recursive process.
Objective A. Our aim is that when we have finished the processing of the linked list for $h_i$, we should have mined all the frequent episodes that contain any of the event types $h_i, h_{i+1}, \ldots, h_H$.

Suppose we are processing the linked list for event type $h_i$. $\{h_i\}$ is called a base event set in this step. We can examine all the paths including the event type $h_i$ from the event tree by following the linked list. Let us call the set of these paths $P_i$. These paths will help us to find the frequencies of episodes containing event type $h_i$. We have the following objective.

Objective B. From the paths in $P_i$, we should find all frequent episodes that contain $h_i$ but not any of $h_{i+1}, \ldots, h_H$.

The reason why we do not want to include $h_{i+1}, \ldots, h_H$ is that frequent episodes containing any of $h_{i+1}, \ldots, h_H$ have been processed in earlier iterations. Let us call the set of all frequent episodes in $DB$ that contain $h_i$ but not any of $h_{i+1}, \ldots, h_H$ the set $X_i$. We break up Objective B into two smaller objectives.

Objective B1. From the paths in $P_i$, we would like to find all frequent episodes in $X_i$ of the form $\{a\} \cup \{h_i\}$, where $a$ is a single event type.

Objective B2. From the paths in $P_i$, we would like to form a database of paths $DB'$ which can help us to find the set $S_i$ of all frequent episodes in $X_i$ that contain $h_i$ and at least two other event types.

$DB'$ is a conditional database which does not contain $\{h_i\}$, such that if we concatenate each conditional frequent episode in $DB'$ with $h_i$, the resulting episodes will be the set we want. With $DB'$, we shall build a conditional event tree $T'$ with its header table, in a similar way as the first event tree. Therefore we can repeat the mining process recursively to get all the conditional frequent episodes in $T'$.
Now we consider how we can get the set of paths $P_i$ and from there obtain a set of conditional paths $C_i$ in order to achieve Objectives B1 and B2. Naturally, we examine the linked list for $h_i$ and locate all paths that contain $h_i$. Let us call the event types $h_{i+1}, \ldots, h_H$ invalid and the other event types in the header table valid. A node labeled with an invalid (valid) event type is invalid (valid). Suppose we arrive at a node $x$ in the linked list. There are two possibilities.

1. If the node $x$ with event type $h_i$ is in the first-day part (the binary bit is 0), we first visit all the ancestor nodes of $x$ and include all of them in our conditional path prefix. We then perform a depth-first search to visit all the subpaths in the subtree rooted at node $x$; each such path has the potential to form a conditional path. Only valid nodes are used to form paths in $P_i$. Note that the nodes in the first-day part of the path below $x$ will be invalid and hence ignored.
2. If the node $x$ with event type $h_i$ is in the remaining-days part, we simply traverse up the path and ignore the subtree under this node. This is because all the nodes below $x$ will be invalid. Invalid nodes may appear above $x$, and they are also ignored.

For example, when we process the leftmost d node in the tree in Figure 2(b), we traverse up the tree and include all nodes except for the x node, since it is invalid. When we process the leftmost c node in the tree in Figure 2(b), we traverse up to node b and then do a depth-first search. We ignore the x node below it and include a but not d. Note that the downward traversal can be stopped when the current node has no child node or we have reached an invalid node.

Let us represent a path in the tree by $\langle (e_1{:}c_1, e_2{:}c_2, \ldots, e_p{:}c_p), (e'_1{:}c'_1, e'_2{:}c'_2, \ldots, e'_q{:}c'_q) \rangle$, where $e_i$ and $e'_j$ are event types, $c_i$ and $c'_j$ are their respective counts, the $e_i$ are event types in the first-day part, and the $e'_j$ are from the remaining-days part. Consider a path $p$ that we have traversed as above. We effectively do a few things for $p$:

Step 1. Remove invalid event types, namely $h_{i+1}, \ldots, h_H$.
Step 2. Adjust the counts of the nodes above $h_i$ in the path to be equal to that of $h_i$.
Step 3. If $h_i$ is in the first-day part, then move all event types in the remaining-days part to the first-day part.
Step 4. Remove $h_i$ from the path.

The resulting path is a conditional path for $h_i$. After we have finished with all nodes in the linked list for $h_i$, we have the complete set of conditional paths $C_i$ for $h_i$.

For example, for Figure 2(b), the conditional paths for x are ⟨(d:1), (b:1, c:1)⟩, ⟨(b:1, c:1, a:1, d:1), ()⟩, ⟨(b:1, a:1, c:1), (d:1)⟩, ⟨(b:1, a:1, d:1), ()⟩, and ⟨(b:1, d:1, a:1, c:1), ()⟩.
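Steps 1 to 4 above can be sketched as a small function that turns one traversed path into a conditional path. This is a hedged illustration, not the paper's code: the function name `conditional_path`, its argument layout (two lists of `(event, count)` pairs plus a set of invalid event types), and the assumption that the target event type occurs exactly once in the path are choices made for this sketch.

```python
def conditional_path(first, rest, target, invalid=()):
    """Apply Steps 1-4 to one path.

    first / rest: lists of (event, count) for the first-day part and the
    remaining-days part, in top-down order. target is h_i; invalid holds
    the event types h_{i+1} .. h_H.
    """
    in_first = target in [event for event, _ in first]
    target_count = dict(first + rest)[target]

    def clean(part, above):
        kept = []
        for event, count in part:
            if event == target:
                above = False  # later nodes lie below h_i
                continue       # Step 4: drop h_i itself
            if event in invalid:
                continue       # Step 1: drop invalid event types
            # Step 2: nodes above h_i take the count of h_i
            kept.append((event, target_count if above else count))
        return kept, above

    new_first, above = clean(first, True)
    new_rest, _ = clean(rest, above)
    if in_first:
        # Step 3: h_i sits in the first-day part, so merge the two parts
        new_first, new_rest = new_first + new_rest, []
    return new_first, new_rest


# Two of the paths that contain x in the tree of Example 1.
p1 = conditional_path([("d", 1)], [("b", 1), ("c", 1), ("x", 1)], "x")
p2 = conditional_path([("b", 5), ("d", 1), ("x", 1)], [("a", 1), ("c", 1)], "x")
# A path that contains d, processed while x is invalid.
p3 = conditional_path([("b", 5), ("c", 1), ("x", 1)], [("a", 1), ("d", 1)], "d", {"x"})
print(p1, p2, p3)
```

The count adjustment in Step 2 matters for shared prefixes: node b carries a count of 5 in the full tree, but only one of those windows also contains x on this path.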
The set $C_i$ forms our conditional database $DB'$. It helps us to achieve both Objectives B1 and B2. For Objective B2, we first determine those event types in $DB'$ with window frequencies that satisfy the minimum threshold requirement. This helps us to prune some event types when constructing the conditional event tree $T'$. The window frequency of $e$ is the sum of the counts of the nodes in $C_i$ with a label of $e$. For Objective B1, we need to find the single event types which, when combined with $h_i$, will form a frequent episode. For locating these event types we use the first-part frequency of event types. The first-part frequency of an event type $e$ in the set of conditional paths $C_i$ is the sum of the counts in the nodes with label $e$ in $C_i$ that are in the first-day part.

In the above we describe how we can form a conditional database $DB'$ with a base event set $\{h_i\}$, and a conditional event tree $T'$ can be built from $DB'$. In the header table for $T'$, the event types are sorted in descending order of the database frequencies in $DB'$. Event types in the conditional paths in $DB'$ are then sorted in the same order, in both the first-day part and the remaining-days part, before they are inserted into $T'$. We apply the mining process recursively on this event tree $T'$. $T'$ has its own header table, and we can repeat the linked-list processing with each entry in the header table. When we build another conditional event tree $T''$ for a certain header $h'_j$ of $T'$, the event base set is updated to $\{h_i, h'_j\}$. This means that frequent episodes uncovered from $T''$ are to be concatenated with $\{h_i, h'_j\}$ to form the resulting frequent episodes.
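The two frequencies used on the conditional paths can be sketched as below. The helper name `path_frequencies` is an assumption of this sketch; the data are the five conditional paths for x from Example 1, each written as a pair of `(event, count)` lists for the first-day part and the remaining-days part.

```python
from collections import Counter

# Conditional paths for x in Example 1: (first-day part, remaining-days part).
cond_paths_x = [
    ([("d", 1)], [("b", 1), ("c", 1)]),
    ([("b", 1), ("c", 1), ("a", 1), ("d", 1)], []),
    ([("b", 1), ("a", 1), ("c", 1)], [("d", 1)]),
    ([("b", 1), ("a", 1), ("d", 1)], []),
    ([("b", 1), ("d", 1), ("a", 1), ("c", 1)], []),
]


def path_frequencies(paths):
    """Return (window frequency, first-part frequency) per event type."""
    window_freq, first_part_freq = Counter(), Counter()
    for first, rest in paths:
        for event, count in first:
            window_freq[event] += count      # counts in either part
            first_part_freq[event] += count  # counts in the first-day part only
        for event, count in rest:
            window_freq[event] += count
    return window_freq, first_part_freq


window_freq, first_part_freq = path_frequencies(cond_paths_x)
print(dict(window_freq), dict(first_part_freq))
```

In this example the first-part frequency of b is 4, which equals the frequency of the episode {b, x} counted directly on the windows of Figure 1(b); the window frequency of b is 5 because one window holds both b and x only in its remaining days.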
Strategy 3. When a conditional event tree contains only a single path, the frequent episodes can be generated directly by forming the set $S_1$ of all possible subsets of the event types in the first-day part of the path, and then the set $S_2$ of all possible subsets of the event types in the remaining-days part. Any element of $S_1$ together with the event base set is a possible frequent episode. The union of any element of $S_1$, any element of $S_2$, and the event base set is also a possible frequent episode. The frequency of such an episode is the minimum count among the event types in the episode.

By the way we construct a conditional event tree, if a path contains event types in the remaining-days part, those event types correspond to windows which contain some episode $e$ with $h_i$ in the remaining-days part. For such windows to be counted for the episode $e$, there must be some event type in $e$ that occurs in the first-day part. Therefore, when we form an episode with an element in $S_2$, we must combine it with some element in $S_1$.
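Strategy 3 can be sketched with `itertools.combinations`. Several points here are assumptions of this sketch rather than statements from the paper: the function names, the choice to use only non-empty subsets of the first-day part, the explicit `min_support` filter applied to the "possible" episodes, and the single path with made-up counts used as input.

```python
from itertools import combinations


def nonempty_subsets(items):
    """Yield every non-empty subset of items as a tuple."""
    for size in range(1, len(items) + 1):
        yield from combinations(items, size)


def single_path_episodes(first, rest, base, min_support):
    """Enumerate episodes from a single-path conditional tree.

    first / rest: (event, count) lists for the first-day and remaining-days
    parts; base: the event base set. An element of S2 is only used together
    with a non-empty element of S1.
    """
    counts = dict(first + rest)
    first_events = [event for event, _ in first]
    rest_events = [event for event, _ in rest]
    episodes = {}
    for s1 in nonempty_subsets(first_events):
        for s2 in [()] + list(nonempty_subsets(rest_events)):
            # frequency = minimum count among the chosen event types
            support = min(counts[event] for event in s1 + s2)
            if support >= min_support:
                episodes[frozenset(s1 + s2) | frozenset(base)] = support
    return episodes


# Hypothetical single path with base event set {x}.
found = single_path_episodes([("b", 4), ("a", 3)], [("d", 2)], {"x"}, 3)
print(found)
```

Taking the minimum count works on a single path because counts never increase from the root downward, so the deepest chosen node determines how many windows contain the whole selection.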
Performance evaluation

To evaluate the performance of the proposed method, we conduct experiments on a Sun Ultra machine running SunOS. The programs are written in C. Both synthetic and real data sets are used.

Synthetic data. The synthetic data sets are generated from a modified version of a previously published synthetic data generator. The data is generated with the consideration of overlapping windows. That is, with a window size of $x$ days, the program considers what data it has generated in the previous $x-1$ days in order to choose suitable event types for the $x$-th day, so as to maintain the target frequencies of the frequent episodes. The data generator takes six main parameters, as listed in the following table.

Parameter   Description
|D|         Number of days
|T|         Average number of events per day
|I|         Average size of frequent episodes
|L|         Number of frequent episodes
M           Number of event types
W           Window size

Four datasets with different settings of |T|, |I|, |D|, and |M| are produced. With D1 and D2 we vary the thresholds and window sizes. With D3 and D4 we vary the number of days and event types.

In our implementation we used linked lists to keep the frequent episodes, one list for each episode size. Each of these lists is kept in order of decreasing frequency, for a ranked display to the user at the end of the mining. We measure the run time as the total execution time of both CPU time and I/O time. The run time in the experiments includes both tree construction and mining. Each data point in the graphs is the mean time of several runs of the experiment.
The run time decreases with the support threshold, as shown in Figure 3(a). As the support threshold increases, fewer frequent events are found and included in the subsequent conditional trees, and much less time is required to find the frequent event types in the smaller conditional trees.

[Fig. 3. Performance on synthetic datasets D1 and D2. (a) Run time (seconds) versus support threshold (%), at a fixed window size; (b) run time (seconds) versus window size (days), at a fixed threshold.]

Figure 3(b) shows the effect of different window sizes on the run time. The datasets D1 and D2 are used, and the experiments are run with the threshold fixed. When the window size increases, the execution time increases because more items are included in the windows and in the paths of the trees. The parameters of one dataset are greater than those of the other, and therefore its initial tree and conditional trees are larger. Its run time is consequently much larger than that of the other dataset at the larger window sizes.

To study the effect of the number of days in the dataset on the execution time, an experiment on dataset D3 is conducted, with the support threshold and the window size held fixed. The result in Figure 4(a) shows that the execution time increases linearly with the number of days.

The effect of the number of event types on the execution time is also investigated. The dataset D4 is used, again with a fixed support threshold and window size. The result is shown in Figure 4(b). The curve falls exponentially as the number of event types increases. When the number of event types is decreased, the distribution of event types is more concentrated and the frequencies of the event types are higher. Therefore fewer events are pruned when constructing the conditional trees, and the run time is longer.

Real data. The real data set consists of news events extracted from an internet repository of a number of local newspapers, more details of which are reported in [7].
It records event types over a span of days. For example, "Cheung Kong stock goes up" is an event type. An event for an event type occurs when it is reported in the collected news. In addition, we have collected stock data from the Datastream International Electronic Database: we have retrieved the Dow Jones industrial average, the Nasdaq Composite Index, the Hang Seng Index Future, the Hang Seng Index, and the prices of the top local companies for the same period of time.

[Fig. 4. Synthetic datasets at a fixed window size and threshold. (a) Run time (seconds) versus number of days (thousands); (b) run time (seconds) versus number of event types.]

The performance on the real dataset with different support thresholds and window sizes is shown in Figures 5(a) and 5(b). In Figure 5(a), the window size is fixed. The execution time decreases rapidly once the threshold rises above a certain level. This is because the supports of half of the most frequent events are close together; when the threshold is below that level, the pruning power in forming conditional trees is weak.

[Fig. 5. Real dataset. (a) Run time (seconds) versus support threshold (%), at a fixed window size; (b) run time (seconds) versus window size (days), at a fixed support threshold.]

The performance for varying window sizes is shown in Figure 5(b), with the threshold held fixed. The execution time increases steeply with the window size. The run time at the largest window size tested is too long and is not shown in the graph. When the window size is large, the tree paths are longer and include more items. As the supports of the items are close together, the subsequent conditional trees include nearly all event types from the original trees, and the sizes of the conditional trees cannot be reduced. Thus the mining time increases with the window size.

Table 1. Some of the results mined at a fixed threshold and window size

Episode
Nasdaq downs, PCCW downs
Cheung Kong ups, Nasdaq ups
Cheung Kong Holdings ups, China Mobile ups
Nasdaq ups, SHK Properties flats, HSBC flats
Cheung Kong ups, SHK Properties flats, HK Electric flats
China Mobile downs, Nasdaq downs, HK Electric flats
China Mobile downs, Hang Seng Index downs, HSBC flats
US increases interest rate, HSBC flats, Dow Jones flats

Real Dataset Results Interpretation

Since the frequencies of the events obtained from newspapers are much lower than those of the stock price movement events, we have set the threshold low enough to allow more episodes that include the newspaper events to be mined. We have selected some interesting episodes, mined at that threshold and window size, in Table 1. We notice some relationship between Nasdaq and PCCW (a telecom stock). We also see that Nasdaq may have little impact on SHK Properties (real estate) or HSBC (banking).

Acknowledgements

This research was supported by the RGC (the Hong Kong Research Grants Council) grant UGC REF.CUHK.

References
[1] R. Agrawal and R. Srikant. Mining sequential patterns. 11th International Conf. on Data Engineering, March 1995.
[2] M.S. Chen, J.S. Park, and P.S. Yu. Efficient Data Mining for Path Traversal Patterns. IEEE Transactions on Knowledge and Data Engineering, March/April 1998.
[3] J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation. SIGMOD, 2000.
[4] P.S. Kam and A.W.C. Fu. Discovering Temporal Patterns for Interval-Based Events. Proc. Second International Conference on Data Warehousing and Knowledge Discovery, 2000.
[5] H. Lu, J. Han, and L. Feng. Stock Movement Prediction and N-Dimensional Inter-Transaction Association Rules. Proc. of SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD), 1998.
[6] H. Mannila and H. Toivonen. Discovering generalized episodes using minimal occurrences. 2nd International Conf. on Knowledge Discovery and Data Mining, August 1996.
[7] Anny Ng and K.H. Lee. Event Extraction from Chinese Financial News. International Conference on Chinese Language Computing (ICCLC).
[8] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. Proceedings of the 17th IEEE International Conference on Data Engineering, 2001.
[9] P.C. Wong, W. Cowley, H. Foote, and E. Jurrus. Visualizing Sequential Patterns for Text Mining. Proceedings IEEE Information Visualization, October 2000.