 
Detecting Change in Multivariate Data Streams Using Minimum Subgraphs
Robert Koyak, Operations Research Dept., Naval Postgraduate School
Collaborative work with Dave Ruth, Emily Craparo, and Kevin Wood
Basic Setup
Have $N$ observations assumed to be sampled independently from unknown multivariate distributions, with $F_j$ = distribution of observation $j$.
Homogeneity Hypothesis ($H_0$): $F_1 = F_2 = \cdots = F_N$
Heterogeneity Hypothesis ($H_1$): There exists some $k \in \{2, \dots, N\}$ such that $F_1 = F_2 = \cdots = F_{k-1}$, $F_{k-1} \neq F_k$, and $\Delta(F_1, F_j) = \max_{k \le r \le j} \Delta(F_1, F_r)$ is strictly positive and nondecreasing for $j \in \{k+1, \dots, N\}$, where $\Delta$ is a distance between distributions.
Heterogeneity includes:
• A single change in distribution at a known change point ("two-sample problem")
• A single change in distribution at an unknown change point
• Directional drift (in mean or other features) that begins at an unknown point in the observation sequence
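Distance Matrix
$D = [d_{ij}]$ is the $N \times N$ distance matrix (Euclidean, Manhattan, etc.), with $d_{ij} = d(y_i, y_j) = \|y_i - y_j\|$.
Maa, Pearl, and Bartoszynski (1996): if $Y_1, Y_2, Y_3$ are independent with distribution $F$ and $Z_1, Z_2, Z_3$ are independent with distribution $G$, then $F = G$ if and only if $d(Y_1, Y_2) \overset{L}{=} d(Z_1, Z_2) \overset{L}{=} d(Y_3, Z_3)$, where $\overset{L}{=}$ denotes equality in distribution.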
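A minimal sketch of building the distance matrix from raw observations, using only the Python standard library. The function name `distance_matrix` and the `metric` argument are illustrative choices, not part of the original slides.

```python
import math

def distance_matrix(points, metric="euclidean"):
    """Return the symmetric N x N matrix of pairwise distances.

    `points` is a list of equal-length numeric tuples (one per observation).
    `metric` is a hypothetical switch between the two distances named above.
    """
    n = len(points)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            diffs = [abs(a - b) for a, b in zip(points[i], points[j])]
            if metric == "euclidean":
                d = math.sqrt(sum(x * x for x in diffs))
            elif metric == "manhattan":
                d = sum(diffs)
            else:
                raise ValueError("unknown metric: " + metric)
            D[i][j] = D[j][i] = d  # distances are symmetric
    return D
```

Every test discussed later depends on the data only through this matrix, so changing the metric changes the fitted subgraph but not the testing machinery.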
The distance matrix has the information needed to express departure from the homogeneity hypothesis. For the types of departure we want to detect, this information should be expressed in particular ways. How can we unlock it?
The strategy we will explore is to fit a minimum subgraph (of some type) to the data treated as vertices in a complete, undirected graph. From the subgraph a statistic is derived that is sensitive to the departures from homogeneity that we wish to detect.
A Graph-Theoretic Approach
Complete undirected graph $G_N = (V, E)$, $|V| = N$, $|E| = N(N-1)/2$
Subgraph family $\mathcal{G}_N$ (e.g. spanning trees, k-factors, Hamiltonian paths or circuits)
Minimum subgraph $\hat{G} = (V, \hat{E})$ is defined by $\hat{G} = \operatorname{argmin}_{G = (V, E) \in \mathcal{G}_N} \sum_{(i,j) \in E} d_{ij}$
The test statistic is $\hat{\tau} = \tau(\hat{G})$
Minimum Spanning Trees (MSTs)
• Friedman and Rafsky (1979) used MSTs to define a multivariate extension of the runs test in the context of the two-sample problem
• The test statistic is the number of edges in the MST that join vertices belonging to different samples
• Small values of the statistic are evidence against homogeneity
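The counting step above can be sketched as follows. This is an illustrative implementation only: it uses Prim's algorithm on a dense distance matrix and a Monte Carlo permutation p-value, and the function names (`mst_edges`, `cross_count`, `permutation_pvalue`) are assumptions rather than anything taken from Friedman and Rafsky.

```python
import random

def mst_edges(D):
    """Prim's algorithm on a dense distance matrix; returns a list of (i, j) edges."""
    n = len(D)
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest known connection to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        for v in range(n):
            if not in_tree[v] and D[u][v] < best[v]:
                best[v] = D[u][v]
                parent[v] = u
    return edges

def cross_count(edges, labels):
    """Number of edges whose endpoints carry different sample labels."""
    return sum(labels[i] != labels[j] for i, j in edges)

def permutation_pvalue(edges, labels, n_perm=2000, seed=0):
    """Lower-tail Monte Carlo p-value: small cross counts are evidence against homogeneity."""
    rng = random.Random(seed)
    observed = cross_count(edges, labels)
    perm = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(perm)
        if cross_count(edges, perm) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

The tree is fitted once and only the labels are permuted, because the MST depends on the distances alone and not on sample membership.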
[Figure: MST for breast cancer mortality rates, 1969 to 1988 (N = 20), relative to 1968 base. Axes: Philadelphia (horizontal), Schuylkill (vertical); points labeled by year.]
Next, treat Sample 1 as the years 1969–1978 and Sample 2 as the years 1979–1988.
[Figure: the same MST with the two samples distinguished. Axes: Philadelphia (horizontal), Schuylkill (vertical).]
There are $\hat{\tau}_{MST} = 11$ edges that join vertices in different samples. The p-value, obtained by a permutation test, is about 0.41.
Is anything really happening?
Spearman rank correlations vs. time, p-values:
Philadelphia .0004
Schuylkill .01
Minimum Non-bipartite Matching (MNBM)
• Also known as unipartite matching, 1-factor
• Rosenbaum (2005) defined a "cross-match" test using MNBM analogous to that of Friedman and Rafsky
• The test statistic is the number of edges in the MNBM that join vertices belonging to different samples
• Small values of the statistic are evidence against homogeneity
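To make the object concrete, here is a small sketch that finds a minimum non-bipartite matching by exhaustive dynamic programming over vertex subsets. It is an assumption-laden toy, suitable only for small N; production code would normally use Edmonds' blossom algorithm instead. The name `min_matching` is hypothetical.

```python
from functools import lru_cache

def min_matching(D):
    """Return (total_cost, pairs) for a minimum-weight perfect matching.

    D is a symmetric distance matrix with an even number of rows.
    Exponential in N: this is a sketch, not a scalable solver.
    """
    n = len(D)
    if n % 2:
        raise ValueError("a perfect matching needs an even number of vertices")

    @lru_cache(maxsize=None)
    def solve(mask):
        # mask is a bitmask of the vertices that are still unmatched
        if mask == 0:
            return 0.0, ()
        i = (mask & -mask).bit_length() - 1   # lowest unmatched vertex
        rest = mask & ~(1 << i)
        best = None
        for j in range(i + 1, n):
            if rest >> j & 1:
                cost, pairs = solve(rest & ~(1 << j))
                cand = (D[i][j] + cost, ((i, j),) + pairs)
                if best is None or cand[0] < best[0]:
                    best = cand
        return best

    return solve((1 << n) - 1)
```

Always matching the lowest unmatched vertex first avoids enumerating the same matching in several orders.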
Cross-match test (Rosenbaum)
$n = N/2$ (number of matching edges)
Group 1 has $k$ observations; Group 2 has $N - k$ observations
$M_C$ = number of cross-matches
$M_1$ = number of matches within Group 1
$M_C = k - 2M_1$
$P(M_1 = r) = \dfrac{2^{k-2r}\, n!}{\binom{N}{k}\, r!\, (k-2r)!\, (n-k+r)!}$, $r \in \{\max(0, k-n), \dots, \lfloor k/2 \rfloor\}$
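The exact null distribution can be evaluated directly. The sketch below writes it in factorial form, $P(M_1 = r) = 2^{k-2r}\, n! \,/\, [\binom{N}{k}\, r!\, (k-2r)!\, (n-k+r)!]$ with $n = N/2$, which follows from counting perfect matchings with $r$ pairs inside Group 1; the function names are illustrative assumptions.

```python
from math import comb, factorial

def crossmatch_pmf(N, k):
    """Exact null distribution of the number of cross-matches M_C.

    N: total number of observations (even); k: size of Group 1.
    Returns a dict mapping each attainable M_C = k - 2r to its probability.
    """
    if N % 2:
        raise ValueError("N must be even")
    n = N // 2
    pmf = {}
    for r in range(max(0, k - n), k // 2 + 1):   # r = matches within Group 1
        num = 2 ** (k - 2 * r) * factorial(n)
        den = (comb(N, k) * factorial(r) * factorial(k - 2 * r)
               * factorial(n - k + r))
        pmf[k - 2 * r] = num / den
    return pmf

def crossmatch_pvalue(N, k, observed):
    """Lower-tail p-value: few cross-matches are evidence against homogeneity."""
    return sum(p for m, p in crossmatch_pmf(N, k).items() if m <= observed)
```

For example, with N = 20 and k = 10, an observed count of 6 cross-matches gives `crossmatch_pvalue(20, 10, 6)` of about 0.87.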
[Figure: MNBM fit to the breast cancer mortality data. Axes: Philadelphia (horizontal), Schuylkill (vertical); points labeled by year.]
Count the number of edges that join vertices in different groups.
[Figure: the same MNBM with the two samples distinguished. Axes: Philadelphia (horizontal), Schuylkill (vertical).]
There are $\hat{\tau}_{CM} = 6$ edges that join vertices in different samples. The p-value, obtained from the exact null distribution, is about 0.87.
Extensions of the Cross-Match Test
Ruth (2009) and Ruth & Koyak (2011) introduce two extensions of the cross-match test to detect departures from homogeneity in the direction of $H_1$:
(1) An exact, simultaneous cross-match test for an unspecified change-point: $\hat{\tau}_{SCM}(k_0, k_1) = \min_{k_0 \le k \le k_1} q_k(\hat{\tau}_{CM}(k))$
(2) A sum of (vertex) pair maxima test: $\hat{\tau}_{SPM} = \sum_{(i,j) \in \hat{E}} \max(i, j) = \frac{1}{2} \sum_{(i,j) \in \hat{E}} |i - j| + \frac{1}{4} N(N+1)$
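The sum of pair maxima can be written either as $\sum \max(i, j)$ over the matched pairs or as $\frac{1}{2} \sum |i - j| + \frac{1}{4} N(N+1)$, because $\max(i, j) = \frac{1}{2}(i + j) + \frac{1}{2}|i - j|$ and every index from 1 to $N$ appears exactly once in a perfect matching. The sketch below checks this by brute force on a small $N$; the helper names are hypothetical.

```python
def all_matchings(vertices):
    """Yield every perfect matching of an even-length list of vertex labels."""
    if not vertices:
        yield []
        return
    first, rest = vertices[0], vertices[1:]
    for idx, partner in enumerate(rest):
        remaining = rest[:idx] + rest[idx + 1:]
        for m in all_matchings(remaining):
            yield [(first, partner)] + m

def spm(matching):
    """Sum of pair maxima over the matched index pairs."""
    return sum(max(i, j) for i, j in matching)

def spm_via_differences(matching, N):
    """Same statistic written as half the summed index gaps plus N(N+1)/4."""
    return sum(abs(i - j) for i, j in matching) / 2 + N * (N + 1) / 4

N = 6
matchings = list(all_matchings(list(range(1, N + 1))))
values = [spm(m) for m in matchings]
mean_value = sum(values) / len(values)   # average over all 15 matchings
```

Because the statistic uses the time indices of matched observations, it grows when the matching pairs observations that are far apart in the sequence.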
[Figure: MNBM fit to the breast cancer mortality data. Axes: Philadelphia (horizontal), Schuylkill (vertical); points labeled by year.]
SCM test has exact p-value of 0.59 for testing against an unspecified change-point.
SPM test has approximate p-value of 0.41.
Some Theory
• Friedman & Rafsky's $\hat{\tau}_{MST}$
– Asymptotic normality under $H_0$
– Universal consistency under $H_1$ for the two-sample problem (Henze & Penrose, 1999)
• Rosenbaum's $\hat{\tau}_{CM}$
– Asymptotic normality under $H_0$
– Consistency under restrictive assumptions
• Ruth's SPM test $\hat{\tau}_{SPM}$
– Asymptotic normality under $H_0$
– Consistency remains to be proven
Ensemble Tests
Problem with graph-theoretic tests: a single minimum subgraph contains very limited information about $D$, and as such these tests are not very powerful.
Tukey suggested fitting multiple "orthogonal" MSTs in Friedman & Rafsky's test and combining them (in a manner that was not specified).
Two subgraphs are orthogonal if they share no common edges.
For MSTs this is problematic: existence of a fixed number of orthogonal MSTs (even two) is not assured!
For MNBMs we are assured at least $N/2$ orthogonal subgraphs (Anderson, 1971), constructed sequentially.
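One way to picture the sequential construction is to fit a minimum matching, forbid its edges, and fit again. The sketch below does this with a brute-force subset search that is only practical for small N; it is an illustration under that assumption, not the algorithm used by the authors, and both function names are hypothetical.

```python
from functools import lru_cache

def min_matching(D, forbidden=frozenset()):
    """Minimum-weight perfect matching avoiding the edges in `forbidden`.

    Returns (cost, pairs), or None when no perfect matching remains.
    Edges are stored as (i, j) with i < j.
    """
    n = len(D)

    @lru_cache(maxsize=None)
    def solve(mask):
        if mask == 0:
            return 0.0, ()
        i = (mask & -mask).bit_length() - 1   # lowest unmatched vertex
        best = None
        for j in range(i + 1, n):
            if mask >> j & 1 and (i, j) not in forbidden:
                sub = solve(mask & ~(1 << i) & ~(1 << j))
                if sub is not None and (best is None or D[i][j] + sub[0] < best[0]):
                    best = (D[i][j] + sub[0], ((i, j),) + sub[1])
        return best

    return solve((1 << n) - 1)

def orthogonal_matchings(D, count):
    """Fit up to `count` matchings in sequence, each sharing no edge with the earlier ones."""
    used, out = set(), []
    for _ in range(count):
        result = min_matching(D, frozenset(used))
        if result is None:
            break
        out.append(result[1])
        used.update(result[1])
    return out
```

Each later matching is minimal only among the edges left over, so its total length cannot be smaller than that of the matching fitted before it.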
[Figure: First MNBM Fit to the Breast Cancer Mortality Data. Axes: Philadelphia (horizontal), Schuylkill (vertical); points labeled by year.]
[Figure: First Two MNBMs Fit to the Breast Cancer Mortality Data. Same axes.]
[Figure: First Three MNBMs Fit to the Breast Cancer Mortality Data. Same axes.]
Structure of Ensembles
• Ensemble pairs decompose into Hamiltonian cycles each having an even number of vertices
– Under $H_0$ all 1-factors are equally likely but it is not true that all ensemble 2-factors are equally likely!
– However, conditional on the cyclic structure uniformity is true
– Second-order properties do not depend on the cyclic structure
• Ensemble 3-factors have more complex cyclic behavior and also exhibit triangles
– Prevalence of triangles depends on the dimensionality of the data: lower dimension = more triangles
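The cycle decomposition of an ensemble pair is easy to see in code: starting from any vertex, alternately follow an edge of the first matching and an edge of the second until the walk returns to its start. The sketch below assumes the two matchings are perfect and share no edges; the function name is hypothetical.

```python
def cycles_from_two_matchings(m1, m2):
    """Split the union of two edge-disjoint perfect matchings into cycles.

    m1, m2: lists of (a, b) pairs covering the same vertex set.
    Returns a list of cycles, each a list of vertices in traversal order.
    """
    p1, p2 = {}, {}
    for a, b in m1:
        p1[a], p1[b] = b, a
    for a, b in m2:
        p2[a], p2[b] = b, a
    seen, cycles = set(), []
    for start in sorted(p1):
        if start in seen:
            continue
        cycle, v = [], start
        while v not in seen:
            u = p1[v]            # step along the first matching
            seen.update((v, u))
            cycle.extend((v, u))
            v = p2[u]            # step along the second matching
        cycles.append(cycle)
    return cycles
```

Vertices are added two at a time, one edge of the first matching per step, which is why every cycle has an even number of vertices.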
Ensemble Tests
Ruth (2009) proposed an Ensemble Sum of Pair Maxima (ESPM) test based on fitting a sequence of $n = N/2$ orthogonal MNBMs and taking the cumulative sums of the SPM statistics. The test takes the following form:
$\hat{\tau}_{ESPM} = \max_{k \in \{1, \dots, n\}} \frac{1}{\sigma_N \sqrt{k}} \left( \sum_{j=1}^{k} \hat{\tau}_{SPM}(j) - c_{N,k} \right)$, with $\sigma_N^2 = N(N-1)(N+1)/180$ and $c_{N,k} = kN(N+1)/3$