Detecting Change in Multivariate Data Streams Using Minimum Subgraphs



slide-1
SLIDE 1

Detecting Change in Multivariate Data Streams Using Minimum Subgraphs

Robert Koyak
Operations Research Dept.
Naval Postgraduate School

Collaborative work with Dave Ruth, Emily Craparo, and Kevin Wood

slide-2
SLIDE 2

Basic Setup

Have $N$ observations assumed to be sampled independently from unknown, multivariate distributions

$F_j$: distribution of observation $j$

Homogeneity Hypothesis

$H_0: F_1 = F_2 = \cdots = F_N$

Heterogeneity Hypothesis

$H_1$: there exists some $k \in \{2, \ldots, N\}$ such that $F_1 = F_2 = \cdots = F_{k-1}$ and, for a discrepancy measure $\Delta$ between distributions, $\Delta(F_{k-1}, F_j) = \max_{k \le r \le j} \Delta(F_{k-1}, F_r)$ is strictly positive and nondecreasing for $j \in \{k, \ldots, N\}$

slide-3
SLIDE 3

Heterogeneity includes:

  • A single change in distribution at a known change point (“two-sample problem”)
  • A single change in distribution at an unknown change point
  • Directional drift (in mean or other features) that begins at an unknown point in the observation sequence

slide-4
SLIDE 4

Distance Matrix

$D = (d_{ij})$: $N \times N$ distance matrix with $d_{ij} = d(y_i, y_j)$ (Euclidean, Manhattan, etc.)

Maa, Pearl, and Bartoszynski (1996): for $Y_1, Y_2, Y_3$ independent, $\sim F$, and $Z_1, Z_2, Z_3$ independent, $\sim G$,

$F = G$ if and only if $d(Y_1, Y_2) \overset{\mathcal{L}}{=} d(Z_1, Z_2) \overset{\mathcal{L}}{=} d(Y_3, Z_3)$
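A minimal sketch of building the distance matrix, assuming the observations are given as tuples of numbers. The function name `distance_matrix` and the toy data are illustrative assumptions, not part of the original slides.

```python
import math

def distance_matrix(points, metric="euclidean"):
    """Return the N x N matrix D = (d_ij) of interpoint distances."""
    def dist(u, v):
        if metric == "euclidean":
            return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
        if metric == "manhattan":
            return sum(abs(a - b) for a, b in zip(u, v))
        raise ValueError(f"unknown metric: {metric}")
    return [[dist(u, v) for v in points] for u in points]

# Toy data (made up for illustration): four bivariate observations.
obs = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0), (0.0, 2.0)]
D = distance_matrix(obs)
```

Everything that follows in the graph-theoretic approach uses only `D`, so the choice of metric is the only place where the raw observations enter.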

slide-5
SLIDE 5

The distance matrix has the information needed to express departure from the homogeneity hypothesis. For the types of departure we want to detect, this information should be expressed in particular ways. How can we unlock it?

slide-6
SLIDE 6


The strategy we will explore is to fit a minimum subgraph (of some type) to the data treated as vertices in a complete, undirected graph. From the subgraph a statistic is derived that is sensitive to the departures from homogeneity that we wish to detect.

slide-7
SLIDE 7

A Graph-Theoretic Approach

Complete undirected graph: $G_N = (V, E)$, $|V| = N$, $|E| = N(N-1)/2$

Subgraph family $\mathcal{G}$ (e.g. spanning trees, k-factors, Hamiltonian paths or circuits)

Minimum subgraph $\hat{G} = (V, \hat{E})$ is defined by

$\hat{G} = \operatorname{argmin}_{G \in \mathcal{G}} \sum_{(i,j) \in E(G)} d_{ij}$

The test statistic is $\hat{T} = T(\hat{G})$

slide-8
SLIDE 8

Minimum Spanning Trees (MSTs)

  • Friedman and Rafsky (1979) used MSTs to define a multivariate extension of the runs test in the context of the two-sample problem
  • The test statistic is the number of edges in the MST that join vertices belonging to different samples
  • Small values of the statistic are evidence against homogeneity
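The MST edge count and its permutation p-value can be sketched as follows. This is an illustrative implementation under stated assumptions (Prim's algorithm on a full distance matrix, a lower-tail Monte Carlo permutation p-value); the function names and the toy data are hypothetical and not taken from Friedman and Rafsky.

```python
import random

def mst_edges(D):
    """Prim's algorithm on a full distance matrix; returns N-1 edges (i, j)."""
    n = len(D)
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest known link into the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        for v in range(n):
            if not in_tree[v] and D[u][v] < best[v]:
                best[v], parent[v] = D[u][v], u
    return edges

def cross_edges(edges, labels):
    """Number of edges joining vertices with different sample labels."""
    return sum(1 for i, j in edges if labels[i] != labels[j])

def fr_permutation_pvalue(D, labels, n_perm=2000, seed=0):
    """Lower-tail permutation p-value: few cross edges count against H0."""
    rng = random.Random(seed)
    edges = mst_edges(D)          # the MST does not depend on the labels
    observed = cross_edges(edges, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if cross_edges(edges, shuffled) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Toy data (made up): two well-separated groups on a line.
pts = [0, 1, 2, 10, 11, 12]
D = [[abs(a - b) for b in pts] for a in pts]
observed, p_value = fr_permutation_pvalue(D, [0, 0, 0, 1, 1, 1])
```

Because the tree is fitted once and only the labels are permuted, the cost of the permutation test is dominated by the single MST fit.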

slide-9
SLIDE 9

[Figure: scatter plot of Philadelphia vs. Schuylkill rates, points labeled by year (69 to 88)]

MST for breast cancer mortality rates, 1969 to 1988 (N = 20), relative to 1968 base. Next, treat Sample 1 as the years 1969–1978 and Sample 2 as the years 1979–1988

slide-10
SLIDE 10

[Figure: MST on the Philadelphia vs. Schuylkill scatter plot, years 69 to 88]

There are $\hat{T}_{\mathrm{MST}} = 11$ edges that join vertices in different samples. The p-value, obtained by a permutation test, is about 0.41

slide-11
SLIDE 11

Is anything really happening?

Spearman rank correlations vs. time, p-values:

  • Philadelphia: .0004
  • Schuylkill: .01

slide-12
SLIDE 12

Minimum Non-bipartite Matching (MNBM)

  • Also known as unipartite matching, 1-factor
  • Rosenbaum (2005) defined a “cross-match” test using MNBM analogous to that of Friedman and Rafsky
  • The test statistic is the number of edges in the MNBM that join vertices belonging to different samples
  • Small values of the statistic are evidence against homogeneity

slide-13
SLIDE 13

Cross-match test (Rosenbaum)

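$N = 2n$ observations ($n$ = number of matching edges)

Group 1 has $k$ observations, Group 2 has $N - k$ observations

$M_C$ = number of cross-matches

$M_1$ = number of matches within Group 1, so that $k = M_C + 2M_1$

$P(M_C = r) = \dfrac{2^r \, n!}{\binom{N}{k} \, r! \, \left(\frac{k-r}{2}\right)! \, \left(n - \frac{k+r}{2}\right)!}$, for $r \in \{0, \ldots, \min(k, N-k)\}$ with $k - r$ even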
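The exact null distribution of the cross-match count can be evaluated directly. The sketch below uses Rosenbaum's (2005) formula in exact rational arithmetic; the function names are illustrative assumptions.

```python
from fractions import Fraction
from math import comb, factorial

def crossmatch_pmf(N, k):
    """Exact null pmf of the number of cross-matches M_C when Group 1 has
    k of the N observations (N even)."""
    if N % 2:
        raise ValueError("N must be even")
    n = N // 2
    pmf = {}
    for r in range(0, min(k, N - k) + 1):
        if (k - r) % 2:          # the rest of Group 1 must pair up internally
            continue
        a1 = (k - r) // 2        # pairs inside Group 1
        a2 = (N - k - r) // 2    # pairs inside Group 2
        pmf[r] = Fraction(
            2 ** r * factorial(n),
            comb(N, k) * factorial(r) * factorial(a1) * factorial(a2),
        )
    return pmf

def crossmatch_pvalue(N, k, observed):
    """Lower-tail p-value: few cross-matches are evidence against homogeneity."""
    return sum(p for r, p in crossmatch_pmf(N, k).items() if r <= observed)

# N = 20 observations split 10/10 with six observed cross-matches.
p = float(crossmatch_pvalue(20, 10, 6))
```

With N = 20, k = 10 and six cross-matches, the lower-tail probability evaluates to 160692/184756, about 0.87.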

slide-14
SLIDE 14

[Figure: MNBM on the Philadelphia vs. Schuylkill scatter plot, years 69 to 88]

MNBM fit to the breast cancer mortality data. Count the number of edges that join vertices in different groups

slide-15
SLIDE 15

[Figure: MNBM on the Philadelphia vs. Schuylkill scatter plot, years 69 to 88]

There are $\hat{T}_{\mathrm{CM}} = 6$ edges that join vertices in different samples. The p-value, obtained from the exact null distribution, is about 0.87

slide-16
SLIDE 16

Extensions of the Cross-Match Test

Ruth (2009) and Ruth & Koyak (2011) introduce two extensions of the cross-match test to detect departures from homogeneity in the direction of $H_1$:

(1) An exact, simultaneous cross-match test for an unspecified change-point

$\hat{T}_{\mathrm{SCM}} = \min_k q_k(\hat{T}_{\mathrm{CM}}(k))$

(2) A sum of (vertex) pair maxima test

$\hat{T}_{\mathrm{SPM}} = \sum_{(i,j) \in \hat{E}} \max(i, j) = \frac{1}{2} \sum_{(i,j) \in \hat{E}} |i - j| + \frac{1}{4} N(N+1)$
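A short sketch of the sum of pair maxima, computed both directly and through the index differences. The null mean N(N+1)/3 and null variance N(N+1)(N−2)/180 used here are assumptions that hold for a uniformly random perfect matching; the code checks them by brute-force enumeration for N = 6. All function names are illustrative.

```python
from statistics import mean, pvariance

def spm_statistic(matching):
    """Sum of pair maxima over matched pairs of 1-based time indices."""
    return sum(max(i, j) for i, j in matching)

def spm_via_differences(matching):
    """Same statistic via max(i, j) = (i + j + |i - j|) / 2."""
    N = 2 * len(matching)
    return sum(abs(i - j) for i, j in matching) / 2 + N * (N + 1) / 4

def spm_null_mean(N):
    return N * (N + 1) / 3

def spm_null_variance(N):
    return N * (N + 1) * (N - 2) / 180

def all_matchings(vertices):
    """Yield every perfect matching of an even-sized list of vertices."""
    if not vertices:
        yield []
        return
    first, rest = vertices[0], vertices[1:]
    for idx, partner in enumerate(rest):
        remaining = rest[:idx] + rest[idx + 1:]
        for tail in all_matchings(remaining):
            yield [(first, partner)] + tail

# Enumerate all 15 perfect matchings on time indices 1..6.
values = [spm_statistic(m) for m in all_matchings(list(range(1, 7)))]
```

Matchings that pair observations close together in time give small index differences and therefore a small statistic, which is the direction expected under drift or a change point.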

slide-17
SLIDE 17

[Figure: MNBM on the Philadelphia vs. Schuylkill scatter plot, years 69 to 88]

SCM test has exact p-value of 0.59 for testing against an unspecified change-point

SPM test has approximate p-value of 0.41

slide-18
SLIDE 18

Some Theory

  • Friedman & Rafsky’s $\hat{T}_{\mathrm{MST}}$
      – Asymptotic normality under H0
      – Universal consistency under H1 for the two-sample problem (Henze & Penrose, 1999)
  • Rosenbaum’s $\hat{T}_{\mathrm{CM}}$
      – Asymptotic normality under H0
      – Consistency under restrictive assumptions
  • Ruth’s SPM test $\hat{T}_{\mathrm{SPM}}$
      – Asymptotic normality under H0
      – Consistency remains to be proven

slide-19
SLIDE 19

Ensemble Tests

Problem with graph-theoretic tests: a single minimum subgraph contains very limited information about $D$ and as such these tests are not very powerful

Tukey suggested fitting multiple "orthogonal" MSTs in Friedman & Rafsky's test and combining them (in a manner that was not specified)

Two subgraphs are orthogonal if they share no common edges

For MSTs this is problematic: existence of a fixed number of orthogonal MSTs (even two) is not assured!

For MNBMs we are assured at least $N/2$ orthogonal subgraphs (Anderson, 1971) constructed sequentially
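The sequential construction can be sketched by fitting a minimum-weight perfect matching, forbidding its edges, and refitting. The sketch below uses an exact bitmask dynamic program in place of a blossom algorithm, so it is only practical for small even N; the function names and toy data are assumptions made for illustration.

```python
from functools import lru_cache

def min_weight_matching(D, forbidden=frozenset()):
    """Exact minimum-weight perfect matching by bitmask DP (small even N only).
    Returns (cost, pairs); cost is inf if no perfect matching avoids `forbidden`."""
    n = len(D)
    INF = float("inf")

    @lru_cache(maxsize=None)
    def solve(mask):
        if mask == 0:
            return 0.0, ()
        i = (mask & -mask).bit_length() - 1      # lowest unmatched vertex
        rest = mask & ~(1 << i)
        best = (INF, ())
        for j in range(i + 1, n):
            if (rest >> j) & 1 and (i, j) not in forbidden:
                cost, pairs = solve(rest & ~(1 << j))
                if D[i][j] + cost < best[0]:
                    best = (D[i][j] + cost, ((i, j),) + pairs)
        return best

    return solve((1 << n) - 1)

def orthogonal_matchings(D, count):
    """Fit up to `count` matchings sequentially, never reusing an edge."""
    used, out = set(), []
    for _ in range(count):
        cost, pairs = min_weight_matching(D, frozenset(used))
        if cost == float("inf"):
            break
        out.append(list(pairs))
        used.update(pairs)
    return out

# Toy data (made up): six points on a line, N/2 = 3 orthogonal matchings.
pts = [0, 1, 2, 3, 10, 11]
D = [[abs(a - b) for b in pts] for a in pts]
ensemble = orthogonal_matchings(D, 3)
```

Each later matching is optimal only among matchings that avoid the edges already used, so the ensemble is built greedily and depends on the order of construction.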

slide-20
SLIDE 20

[Figure: Philadelphia vs. Schuylkill scatter plot, years 69 to 88]

First MNBM Fit to the Breast Cancer Mortality Data

slide-21
SLIDE 21

[Figure: Philadelphia vs. Schuylkill scatter plot, years 69 to 88]

First Two MNBMs Fit to the Breast Cancer Mortality Data

slide-22
SLIDE 22

[Figure: Philadelphia vs. Schuylkill scatter plot, years 69 to 88]

First Three MNBMs Fit to the Breast Cancer Mortality Data

slide-23
SLIDE 23

Structure of Ensembles

  • Ensemble pairs decompose into Hamiltonian cycles each having an even number of vertices
      – Under H0 all 1-factors are equally likely, but it is not true that all ensemble 2-factors are equally likely!
      – However, conditional on the cyclic structure, uniformity is true
      – Second-order properties do not depend on the cyclic structure
  • Ensemble 3-factors have more complex cyclic behavior and also exhibit triangles
      – Prevalence of triangles depends on the dimensionality of the data: lower dimension = more triangles

slide-24
SLIDE 24

Ensemble Tests

Ruth (2009) proposed an Ensemble Sum of Pair Maxima (ESPM) test based on fitting a sequence of $n = N/2$ orthogonal MNBMs and taking the cumulative sums of the SPM statistics. The test takes the following form:

$\hat{T}_{\mathrm{ESPM}} = \max_{k \in \{1, \ldots, n\}} \; c_{2,N}^{-1/2} \sum_{j=1}^{k} \left( c_{1,N} - \hat{T}_{\mathrm{SPM}}(j) \right)$

$c_{2,N} = N(N+1)(N-1)^2/180, \quad c_{1,N} = N(N+1)/3$

slide-25
SLIDE 25

Ensemble Tests

(1) Under $H_0$ the process

$B_N(t_k) = c_{2,N}^{-1/2} \sum_{j=1}^{k} \left( c_{1,N} - \hat{T}_{\mathrm{SPM}}(j) \right), \quad t_k = k/(N-1)$

(with $c_{1,N}$ and $c_{2,N}$ the centering and scaling constants of the ESPM test) has the same first two moments as a Brownian bridge

(2) Although the summands individually are asymptotically normal, the same is not true of the process itself!

(3) Unless the dimensionality of the observations is very large, classical Brownian bridge theory (Shorack & Wellner, 1987) produces critical values that violate the nominal level

(4) Ruth (2009) produced critical values for different values of $N$ and dimensionality $d$ using extensive simulations
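A hedged sketch of the normalized cumulative process. The centering constant N(N+1)/3 is the null mean of a single SPM statistic; the scaling constant N(N+1)(N−1)²/180 is an assumption, chosen so that the process has variance t(1−t) at t = k/(N−1) when the orthogonal matchings are exchangeable. The function names are illustrative.

```python
import math

def espm_process(spm_values, N):
    """Normalized cumulative process B_N(t_k) built from the SPM statistics
    of the first k orthogonal matchings, k = 1, ..., len(spm_values)."""
    c1 = N * (N + 1) / 3                      # null mean of one SPM statistic
    c2 = N * (N + 1) * (N - 1) ** 2 / 180     # scaling constant (assumed)
    total, path = 0.0, []
    for k, t_spm in enumerate(spm_values, start=1):
        total += c1 - t_spm                   # small SPM pushes the process up
        path.append((k / (N - 1), total / math.sqrt(c2)))
    return path

def espm_statistic(spm_values, N):
    """Maximum of the normalized process over the fitted matchings."""
    return max(b for _, b in espm_process(spm_values, N))

# N = 4: the three perfect matchings of {1, 2, 3, 4} have SPM values 6, 7, 7.
path = espm_process([6, 7, 7], 4)
```

In the N = 4 example all N − 1 = 3 matchings are used, so the centered sum returns to zero at t = 1, which is the tied-down behavior of a Brownian bridge.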

slide-26
SLIDE 26

Simulated critical values for N = 200


slide-27
SLIDE 27

100 Simulated $B_N(t_k)$, Bivar. Normal, Homogeneous

Critical (.05) = 1.19

slide-28
SLIDE 28

100 Simulated $B_N(t_k)$, Bivar. Normal, Mean Jump

Critical (.05) = 1.19

slide-29
SLIDE 29

[Figure: normalized process $B_N(t_k)$ plotted against the number of orthogonal matchings ($k$), with the .05 and .01 critical values marked]

$\hat{T}_{\mathrm{ESPM}} = 2.24$ has p-value less than .01

Heterogeneity is signaled when six or more matchings are used

slide-30
SLIDE 30

Power simulations, N = 200, jump at observation 101, Jump = norm of mean vector after the jump, nominal .05-level tests

a) Multivariate normal, mean, p = 5

Jump   SCM   SPM   ESPM   JJS
0      .05   .06   .04    .05
.5     .09   .10   .60    .52
1.0    .33   .41   1.00   1.00

b) Multivariate normal, mean, p = 20

Jump   SCM   SPM   ESPM   JJS
0      .05   .05   .05    .03
.5     .07   .09   .33    .20
1.0    .16   .22   .95    .95

x, 5

.5

slide-31
SLIDE 31

Power simulations, N = 200, jump at observation 101, nominal .05-level tests

c) Multivariate normal, covariance matrix, p = 5

Jump   SCM   SPM   ESPM   JJS
0      .05   .06   .05    .04
.5     .42   .51   .97    .15
1.0    .99   .99   1.00   .24

d) Multivariate normal mixture, mean, p = 5

Jump   SCM   SPM   ESPM   JJS
0      .05   .05   .04    .27
.5     .08   .09   .56    .38
1.0    .25   .36   .99    .85

1+ mult. norm

slide-32
SLIDE 32

Graph-theoretic Tests: Some Challenges and Possible Directions

  1. Computational
  2. Theoretical
  3. Alternate graph-theoretic approaches
  4. Adaptation to real-world problems

slide-33
SLIDE 33

Computational Challenges

Finding a MNBM requires $O(Nm \log(N))$ computation time using the Blossom V algorithm (Kolmogorov, 2009). For the complete graph, $m = O(N^2)$.

For ensemble tests the order of computation is about $O(N^4 \log(N))$, which is prohibitive with large sample sizes (e.g. $N \ge 1000$).

Possible strategies:

(1) Use a greedy algorithm
(2) Restrict the edge set (so that $m$ is of order $N$)
(3) Try something else
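A minimal sketch of strategy (1), assuming a complete graph given by its distance matrix. The function name `greedy_matching` is illustrative, and this simple heuristic is not claimed to be the one considered by the authors.

```python
def greedy_matching(D):
    """Greedy perfect matching on a complete graph with an even number of
    vertices: repeatedly take the shortest edge whose endpoints are both free."""
    n = len(D)
    edges = sorted((D[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    matched, pairs = set(), []
    for _, i, j in edges:
        if i not in matched and j not in matched:
            pairs.append((i, j))
            matched.update((i, j))
    return pairs

# Toy data (made up): greedy takes the short middle edge first and is then
# forced to use the longest edge, while the optimal matching costs 4.
pts = [0, 2, 3, 5]
D = [[abs(a - b) for b in pts] for a in pts]
pairs = greedy_matching(D)
greedy_cost = sum(D[i][j] for i, j in pairs)
```

Sorting all edges costs O(N² log N), far below the exact algorithm, but the example shows that the result need not be a minimum-weight matching.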

slide-34
SLIDE 34

Faster Matchings?

Simple greedy heuristics are difficult to extend to multiple matchings

Edge restriction heuristics. Sufficient conditions for a perfect matching to exist ($N$ even) include

  • A regular graph of degree $N/2$
  • A connected, claw-free graph
  • A Delaunay triangulation

Necessary and sufficient conditions: Tutte's Theorem

$\mathrm{odd}(V \setminus S) \le |S|$ for all $S \subseteq V$

where $\mathrm{odd}(V \setminus S)$ is the number of odd components that remain after removing $S$
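Tutte's condition can be checked by brute force on very small graphs, which is useful for testing an edge-restriction heuristic before running a matching algorithm. The sketch below enumerates every subset S, so it is exponential in the number of vertices; the function names are illustrative assumptions.

```python
from itertools import combinations

def odd_components(vertices, edges, removed):
    """Count connected components of odd size in the graph minus `removed`."""
    remaining = set(vertices) - set(removed)
    adj = {v: set() for v in remaining}
    for u, v in edges:
        if u in remaining and v in remaining:
            adj[u].add(v)
            adj[v].add(u)
    seen, odd = set(), 0
    for start in remaining:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:                      # depth-first search of one component
            u = stack.pop()
            size += 1
            for w in adj[u] - seen:
                seen.add(w)
                stack.append(w)
        odd += size % 2
    return odd

def satisfies_tutte(vertices, edges):
    """True if odd(V - S) <= |S| for every subset S (tiny graphs only)."""
    vertices = list(vertices)
    for r in range(len(vertices) + 1):
        for S in combinations(vertices, r):
            if odd_components(vertices, edges, S) > len(S):
                return False
    return True

cycle4 = [(0, 1), (1, 2), (2, 3), (3, 0)]   # has a perfect matching
claw = [(0, 1), (0, 2), (0, 3)]             # removing the center leaves 3 odd parts
```

For the 4-cycle the check succeeds, while for the claw it fails at S = {0}, matching the fact that a star on four vertices has no perfect matching.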

slide-35
SLIDE 35

Are MNBM tests universally consistent?

Asymptotic theory for MNBM is not straightforward even for a single matching, let alone ensembles.

Aldous & Steele (1992) theory for MSTs exploits perturbation localizability of MSTs (not applicable to matchings).

Interesting recent work: "Poisson Matching" (Holroyd et al. 2008)

slide-36
SLIDE 36

MNBM is a solution to the integer linear program

Minimize: $f(x) = \sum_{i<j} d_{ij} x_{ij}$

Subject to: $\sum_{i \ne j} x_{ij} = 1$, $j = 1, \ldots, N$ (with $x_{ij} = x_{ji}$), and $x_{ij} \in \{0, 1\}$

By replacing the integrality constraints with the interval constraints $0 \le x_{ij} \le 1$ a solution $\hat{x}$ can be obtained using LP. A "relaxed" SPM statistic can be defined by

$\hat{T}_{\mathrm{RSPM}} = \frac{1}{2} \sum_{i<j} \hat{x}_{ij} |i - j| + \frac{1}{4} N(N+1)$

slide-37
SLIDE 37

Solutions to RNBM satisfy $\hat{x}_{ij} \in \{0, \tfrac{1}{2}, 1\}$

To fit ensembles enforce the constraints $\sum_{i \ne j} \hat{x}_{ij} = k$ over a sequence of $k = 1, \ldots, n$ problems. There is no assurance that solutions will be "nested", however, which complicates theory

Performance of relaxed MNBM statistics compares favorably with that of regular MNBM

What about nearest neighbors?

slide-38
SLIDE 38


slide-39
SLIDE 39


slide-40
SLIDE 40


slide-41
SLIDE 41

Possible Applications

  • Process control (off-line, on-line)
  • Mechanical prognostics
  • Threat detection
  • Syndromic surveillance

In high-dimensional problems, it may be useful to couple graph-theoretic methods with methods to reduce dimensionality


slide-42
SLIDE 42

Dimension reduction

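Consider the optimization problem

$\min_{w} \sum_{(i,j) \in E} |i - j| \, x_{ij}$

s.t. $x \in \operatorname{argmin}_{x' \in X} \sum_{(i,j) \in E} d_{ij}(w) \, x'_{ij}$

$w \in \{0, 1\}^p, \quad \sum_r w_r = p'$

where $d_{ij}(w)$ is the distance between $y_i$ and $y_j$ computed on the coordinates selected by $w$

Vector $w$ projects into a low ($p'$)-dimensional space to minimize the sum of pair index differences in the resulting minimum-weight matching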

slide-43
SLIDE 43
  • Simplification 1: use Manhattan distance: $d_{ij} = \sum_r d_{ijr} w_r$, where $d_{ijr} = |y_{ir} - y_{jr}|$
  • Simplification 2: use relaxed matching instead of exact matching; enforce minimum-weight matching using strong duality.

$\min_{w, x, \pi} \sum_{(i,j) \in E} |i - j| \, x_{ij}$

s.t.

$\sum_{j : (i,j) \in E} x_{ij} = 1, \quad i \in V$

$\pi_i + \pi_j \le \sum_r d_{ijr} w_r, \quad (i,j) \in E$

$\sum_{(i,j) \in E} \sum_r d_{ijr} w_r \, x_{ij} = \sum_{v \in V} \pi_v$

$\sum_r w_r = p'$

$w \in \{0,1\}^p, \quad x \ge 0$