Sequential data analysis with TraMineR, Part 2 (Gilbert Ritschard)

Sequential data analysis - 2
Dissimilarities among pairs of state sequences
Measures of dissimilarity between sequences

Example on the first 6 mvad sequences: non-normalized LCP distance

R> seqdist(mvad.seq[1:6, ], method = "LCP", norm = FALSE)
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    0  140  140  140  140  140
[2,]  140    0  140  140   90  140
[3,]  140  140    0   92  140  140
[4,]  140  140   92    0  140  140
[5,]  140   90  140  140    0  140
[6,]  140  140  140  140  140    0

8/7/2009gr 11/100

Example on the first 6 mvad sequences: normalized LCP distance

R> seqdist(mvad.seq[1:6, ], method = "LCP", norm = TRUE)
     [,1]      [,2]      [,3]      [,4]      [,5] [,6]
[1,]    0 1.0000000 1.0000000 1.0000000 1.0000000    1
[2,]    1 0.0000000 1.0000000 1.0000000 0.6428571    1
[3,]    1 1.0000000 0.0000000 0.6571429 1.0000000    1
[4,]    1 1.0000000 0.6571429 0.0000000 1.0000000    1
[5,]    1 0.6428571 1.0000000 1.0000000 0.0000000    1
[6,]    1 1.0000000 1.0000000 1.0000000 1.0000000    0
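The two LCP outputs above can be reproduced by hand. The sketch below is base R only and is not TraMineR's implementation. It assumes the non-normalized distance is $|x| + |y| - 2\,\mathrm{LCP}(x, y)$ and the normalized one is $1 - \mathrm{LCP}(x, y)/\sqrt{|x| \cdot |y|}$, which is consistent with the values 90 and 0.6428571 shown for sequences of length 70. The sequences `x` and `y` and the function names are hypothetical.

```r
# Length of the longest common prefix of two state vectors.
lcp_length <- function(x, y) {
  n <- min(length(x), length(y))
  if (n == 0) return(0L)
  same <- x[seq_len(n)] == y[seq_len(n)]
  if (all(same)) n else which(!same)[1] - 1L
}

# LCP-based distance (assumed definitions, see the paragraph above).
lcp_dist <- function(x, y, norm = FALSE) {
  a <- lcp_length(x, y)
  if (norm) {
    1 - a / sqrt(length(x) * length(y))
  } else {
    length(x) + length(y) - 2 * a
  }
}

# Hypothetical sequences of length 70 sharing their first 25 states.
x <- c(rep("EM", 25), rep("FE", 45))
y <- c(rep("EM", 25), rep("TR", 45))

d_raw  <- lcp_dist(x, y)               # 70 + 70 - 2 * 25
d_norm <- lcp_dist(x, y, norm = TRUE)  # 1 - 25 / 70
```

Two sequences that differ from the first position on get the maximal values, 140 and 1, as for most pairs in the matrices above.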

LCS: Longest Common Subsequence

LLCS: length of the longest common subsequence shared by two sequences.

Example:
  x: 1-1-1-2-2-3-3
  y: 1-1-1-4-3-3-4
  LLCS = 5

Distance measure, with $A_\ell(x, y)$ denoting the LLCS of $x$ and $y$ (so that $A_\ell(x, x) = |x|$):

$$d_{LCS}(x, y) = A_\ell(x, x) + A_\ell(y, y) - 2 A_\ell(x, y)$$

Normalized form:

$$D_{LCS}(x, y) = 1 - \frac{A_\ell(x, y)}{\sqrt{|x| \cdot |y|}}$$

For the example, $d_{LCS} = 7 + 7 - 2 \cdot 5 = 4$ and $D_{LCS} = 1 - 5/7 \approx 0.286$.

LLCS: example

R> x <- c(1, 1, 1, 2, 2, 3, 3)
R> y <- c(1, 1, 1, 4, 3, 3, 4)
R> seqdist(seqdef(rbind(x, y)), method = "LCS")
     [,1] [,2]
[1,]    0    4
[2,]    4    0
R> seqdist(seqdef(rbind(x, y)), method = "LCS", norm = TRUE)
          [,1]      [,2]
[1,] 0.0000000 0.2857143
[2,] 0.2857143 0.0000000
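To see where the values 4 and 0.2857143 come from, the LLCS can be computed with the textbook dynamic-programming recursion. This is a base-R sketch for illustration, not the code TraMineR runs. The function name `llcs` is hypothetical.

```r
# Length of the longest common subsequence via dynamic programming.
llcs <- function(x, y) {
  n <- length(x)
  m <- length(y)
  L <- matrix(0L, n + 1, m + 1)  # L[i + 1, j + 1]: LLCS of x[1:i] and y[1:j]
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      if (x[i] == y[j]) {
        L[i + 1, j + 1] <- L[i, j] + 1L
      } else {
        L[i + 1, j + 1] <- max(L[i, j + 1], L[i + 1, j])
      }
    }
  }
  L[n + 1, m + 1]
}

x <- c(1, 1, 1, 2, 2, 3, 3)
y <- c(1, 1, 1, 4, 3, 3, 4)

a     <- llcs(x, y)                            # common subsequence 1-1-1-3-3
d_lcs <- length(x) + length(y) - 2 * a         # non-normalized distance
D_lcs <- 1 - a / sqrt(length(x) * length(y))   # normalized distance
```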

Optimal matching (optimal alignment)

- Based on the Levenshtein (1966) distance.
- Inspired by alignment methods used in biology (DNA or protein sequences).
- Introduced in the social sciences by Abbott and Forrest (1986).

Optimal matching (OM): principle

We want to transform one sequence into the other, using two types of operations:

- Insertion or deletion of an element
- Substitution of an element

Each operation has a cost. The OM distance is the minimal cost of transforming one sequence into the other.

OM: example

Consider the two sequences:
  1:  0 1 1 2 2 2 3
  2:  0 1 1 2 2 3 3

Insertion of element '2' in sequence 2:
  1:  0 1 1 2 2 2 3
  2:  0 1 1 2 2 2 3 3

Deletion of element '3' in sequence 2:
  1:  0 1 1 2 2 2 3
  2:  0 1 1 2 2 2 3

The two sequences are now identical.

OM: substitution example

Consider the two sequences:
  1:  0 1 1 2 2 2 3
  2:  0 1 1 2 2 3 3

Substitution of '3' by element '2' in sequence 2:
  1:  0 1 1 2 2 2 3
  2:  0 1 1 2 2 2 3
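The minimal transformation cost can be found with the standard edit-distance recursion. The sketch below is base R, written for illustration only; it is not how `seqdist()` is implemented. It uses a single indel cost and, as an assumption for this example, a constant substitution cost. The function name `om_dist` and its arguments are hypothetical.

```r
# OM distance between two state vectors by dynamic programming.
# indel: cost of one insertion or deletion; sub: constant substitution cost.
om_dist <- function(x, y, indel = 1, sub = 2) {
  n <- length(x)
  m <- length(y)
  D <- matrix(0, n + 1, m + 1)
  D[, 1] <- (0:n) * indel  # delete all of x[1:i]
  D[1, ] <- (0:m) * indel  # insert all of y[1:j]
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      s <- if (x[i] == y[j]) 0 else sub
      D[i + 1, j + 1] <- min(D[i, j + 1] + indel,  # deletion
                             D[i + 1, j] + indel,  # insertion
                             D[i, j] + s)          # match or substitution
    }
  }
  D[n + 1, m + 1]
}

s1 <- c(0, 1, 1, 2, 2, 2, 3)
s2 <- c(0, 1, 1, 2, 2, 3, 3)

d_default   <- om_dist(s1, s2)                      # indel = 1, substitution = 2
d_cheap_sub <- om_dist(s1, s2, indel = 4, sub = 1)  # substitution is cheaper
```

With indel = 1 and substitution = 2, the insertion plus deletion shown in the first example and the single substitution shown in the second both cost 2. With indel = 4 and substitution = 1, the substitution path wins with cost 1.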

Assigning indel and substitution costs

- Indel: same cost for each insertion or deletion. The indel cost is a single constant.
- Substitution costs: each substitution may receive a different cost, given by a matrix of substitution costs. However, costs must be symmetric: $c_{i,j} = c_{j,i}$.

Defining substitution costs

- Unique cost $c_{ij} = c$ (the user should provide $c$).
- Based on transition rates (no additional input required):
  $$c_{i,j} = c_{j,i} = 2 - p(i_t \mid j_{t-1}) - p(j_t \mid i_{t-1})$$
- Custom costs (the user should provide the whole cost matrix).
- Learned optimal costs (Gauthier et al., 2008, and their TCOFFEE software).
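The transition-rate formula can be worked through on a tiny example. The sketch below uses base R and made-up sequences. It estimates $p(j_t \mid i_{t-1})$ as the share of transitions out of state $i$ that go to state $j$, which is an assumption about how the rates are computed. All object names are hypothetical.

```r
# Hypothetical toy data: rows are individuals, columns are time points.
seqs <- rbind(c("A", "A", "B", "B"),
              c("A", "B", "B", "B"),
              c("B", "B", "A", "A"))
states <- sort(unique(as.vector(seqs)))

# Pair each state at t-1 with the state at t and count the transitions.
from   <- factor(seqs[, -ncol(seqs)], levels = states)
to     <- factor(seqs[, -1], levels = states)
counts <- unclass(table(from, to))

# Row-wise proportions: trate[i, j] estimates p(j at t | i at t-1).
trate <- prop.table(counts, margin = 1)

# Substitution cost 2 - p(i | j) - p(j | i), with a zero diagonal.
cost <- 2 - trate - t(trate)
diag(cost) <- 0
```

Here half of the transitions out of A go to B and one fifth of those out of B go to A, so the cost of substituting A and B is 2 - 0.5 - 0.2 = 1.3. Rare transitions give costs close to 2.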

Using Optimal Matching in TraMineR

- Create the state sequence object with seqdef().
- Get a substitution cost matrix or compute one with seqsubm().
- Compute the matrix of OM distances with seqdist(..., method="OM", indel=..., sm=...).

Cost matrix: unique costs

R> subm.unique <- seqsubm(mvad.seq, method = "CONSTANT", cval = 2)
R> subm.unique
     EM-> FE-> HE-> JL-> SC-> TR->
EM->    0    2    2    2    2    2
FE->    2    0    2    2    2    2
HE->    2    2    0    2    2    2
JL->    2    2    2    0    2    2
SC->    2    2    2    2    0    2
TR->    2    2    2    2    2    0

Cost matrix: custom costs

R> subm.custom <- matrix(c(0, 1, 1, 2, 1, 1,
+                          1, 0, 1, 2, 1, 2,
+                          1, 1, 0, 3, 1, 2,
+                          2, 2, 3, 0, 3, 1,
+                          1, 1, 1, 3, 0, 2,
+                          1, 2, 2, 1, 2, 0),
+      nrow = 6, ncol = 6, byrow = TRUE,
+      dimnames = list(mvad.shortlab, mvad.shortlab))
R> subm.custom
   EM FE HE JL SC TR
EM  0  1  1  2  1  1
FE  1  0  1  2  1  2
HE  1  1  0  3  1  2
JL  2  2  3  0  3  1
SC  1  1  1  3  0  2
TR  1  2  2  1  2  0

Cost matrix: based on transition rates

R> subm.txrate <- seqsubm(mvad.seq, method = "TRATE")
R> subm.txrate
        EM->    FE->    HE->    JL->    SC->    TR->
EM-> 0.00000 1.97008 1.98723 1.95173 1.98536 1.95950
FE-> 1.97008 0.00000 1.99318 1.98266 1.99092 1.99235
HE-> 1.98723 1.99318 0.00000 1.99584 1.98184 1.99949
JL-> 1.95173 1.98266 1.99584 0.00000 1.99385 1.97808
SC-> 1.98536 1.99092 1.98184 1.99385 0.00000 1.99666
TR-> 1.95950 1.99235 1.99949 1.97808 1.99666 0.00000

Computing the distances

Using the substitution cost matrix, we compute the distances.

R> mvad.dist <- seqdist(mvad.seq, method = "OM", indel = 4,
+      sm = subm.custom, norm = TRUE)
R> round(mvad.dist[1:10, 1:10], digits = 2)
      [1]  [2]  [3]  [4]  [5]  [6]  [7]  [8]  [9] [10]
[1]  0.00 1.03 0.86 0.90 1.03 0.47 0.46 0.34 0.27 0.57
[2]  1.03 0.00 1.23 1.93 0.16 1.49 0.57 0.69 1.30 1.37
[3]  0.86 1.23 0.00 1.01 1.39 0.70 1.14 1.20 0.59 1.26
[4]  0.90 1.93 1.01 0.00 1.93 0.46 1.36 1.24 0.63 0.90
[5]  1.03 0.16 1.39 1.93 0.00 1.49 0.64 0.69 1.30 1.37
[6]  0.47 1.49 0.70 0.46 1.49 0.00 0.91 0.80 0.20 0.99
[7]  0.46 0.57 1.14 1.36 0.64 0.91 0.00 0.11 0.73 0.80
[8]  0.34 0.69 1.20 1.24 0.69 0.80 0.11 0.00 0.61 0.69
[9]  0.27 1.30 0.59 0.63 1.30 0.20 0.73 0.61 0.00 0.79
[10] 0.57 1.37 1.26 0.90 1.37 0.99 0.80 0.69 0.79 0.00

Section outline: Clustering and MDS

1 Dissimilarities among pairs of state sequences
  - Measures of dissimilarity between sequences: LCP, LCS, Optimal matching
  - Clustering and MDS: Cluster analysis, Plotting sequences by cluster, Multidimensional scaling (MDS)
  - Sequence dispersion
  - Analysis of sequence discrepancy

Cluster analysis

Once we have a dissimilarity (distance) matrix, we can run any clustering algorithm that accepts such a matrix as input. There are several possibilities in R, for instance in the cluster package:

- agnes(): agglomerative nesting, i.e. hierarchical clustering (average, ward, ...).
- diana(): divisive analysis.
- pam(): partitioning around medoids (non-hierarchical, faster, but the number of clusters must be set a priori).

Hierarchical clustering (Ward)

R> library(cluster)
R> mvad.clusterward <- agnes(mvad.dist, diss = T, method = "ward")
R> plot(mvad.clusterward, ask = F, which.plots = 2)

[Figure: Dendrogram of agnes(x = mvad.dist, diss = T, method = "ward"); vertical axis: Height (0 to 15); horizontal axis: mvad.dist; Agglomerative Coefficient = 0.99]

Warning!

Do not forget to specify the diss = T option. Otherwise (i.e. by default), the functions agnes(), diana(), pam(), ... treat the dissimilarity matrix as a data matrix and first compute the Euclidean distances between its rows.

Retrieving cluster membership

Select the number of clusters, cut the tree at the chosen level, and store the cluster membership in a vector.

R> mvad.cl3 <- cutree(mvad.clusterward, k = 3)
R> mvad.cl3[1:10]
 [1] 1 2 1 1 2 1 1 1 1 3
R> clust.labels <- c("Employment", "Education", "Jobless")
R> mvad.cl3.factor <- factor(mvad.cl3, levels = c(1, 2, 3),
+      labels = clust.labels)
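The cluster-then-cut workflow, and the pitfall behind the diss = T warning, can be tried on a toy dissimilarity matrix. The sketch below substitutes `stats::hclust()` with average linkage for `agnes()` so that it runs in base R; that substitution is a choice made for this example, not what the slides use. Passing `as.dist(d)` plays the role of diss = T, whereas `dist(d)` reproduces the mistake of recomputing Euclidean distances between the rows of the matrix. The matrix `d` is made up.

```r
# Hypothetical dissimilarities among 4 objects: {1, 2} and {3, 4} are close.
d <- matrix(c(0, 1, 5, 6,
              1, 0, 5, 6,
              5, 5, 0, 1,
              6, 6, 1, 0), nrow = 4, byrow = TRUE)

# Correct: use d itself as the dissimilarity input.
hc  <- hclust(as.dist(d), method = "average")
cl2 <- cutree(hc, k = 2)  # cluster membership vector

# Pitfall: dist(d) treats the rows of d as coordinates.
wrong <- as.vector(dist(d))
right <- as.vector(as.dist(d))

# Turn the membership vector into a labelled factor for grouped plots.
cl2.factor <- factor(cl2, levels = c(1, 2), labels = c("Group 1", "Group 2"))
```

The values in `wrong` differ from those in `right`, so a clustering built on them answers a different question than the one the dissimilarity measure was chosen for.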

Exploring clusters graphically

Three types of graphics:

1. Transversal distribution with seqdplot()
2. Frequency plots with seqfplot()
3. Individual index-plots with seqiplot()

Required argument: the state sequence object. Use group = cluster.membership.factor to get plots by cluster.

Transversal distributions

R> seqdplot(mvad.seq, group = mvad.cl3.factor)

Most frequent sequences

R> seqfplot(mvad.seq, group = mvad.cl3.factor)

Individual sequences

R> seqiplot(mvad.seq, group = mvad.cl3.factor, tlim = 0, border = NA,
+      space = 0)

Sorting sequences for i-plot display

The previous i-plots become clearer if we sort the sequences. Several possibilities:

- According to the distance to the most frequent sequence, to the centrotype, or to any other useful reference.
- According to the scores on the first factor of an MDS analysis.

Computing distance to most frequent sequence

Compute, in each cluster, the distances to the most frequent sequence (refseq = 0), using here the custom substitution cost matrix.

R> mvad.distom <- numeric(nrow(mvad))
R> mvad.distom[mvad.cl3 == 1] <- seqdist(mvad.seq[mvad.cl3 == 1, ],
+      refseq = 0, method = "OM", indel = 4, sm = subm.custom)
R> mvad.distom[mvad.cl3 == 2] <- seqdist(mvad.seq[mvad.cl3 == 2, ],
+      refseq = 0, method = "OM", indel = 4, sm = subm.custom)
R> mvad.distom[mvad.cl3 == 3] <- seqdist(mvad.seq[mvad.cl3 == 3, ],
+      refseq = 0, method = "OM", indel = 4, sm = subm.custom)

Sort: distance to most frequent sequence

R> seqiplot(mvad.seq, group = mvad.cl3.factor, tlim = 0, border = NA,
+      space = 0, sortv = mvad.distom)

Sort: first factor of MDS analysis

R> mds1d <- cmdscale(mvad.dist, k = 1)
R> seqiplot(mvad.seq, group = mvad.cl3.factor, tlim = 0, border = NA,
+      space = 0, sortv = mds1d)

Scatterplot (MDS)

Through multidimensional scaling (MDS), we get a scatterplot of the sequences.

R> mds2d <- cmdscale(mvad.dist, k = 2)
R> plot(mds2d, type = "n")
R> points(mds2d[mvad.cl3 == 1, ], pch = 16, col = "red")
R> points(mds2d[mvad.cl3 == 2, ], pch = 16, col = "blue")
R> points(mds2d[mvad.cl3 == 3, ], pch = 16, col = "green")
R> legend("bottomright", fill = c("red", "blue", "green"),
+      legend = clust.labels)

Sequence scatterplot colored by cluster

[Figure: MDS scatterplot of the sequences; x-axis: mds2d[,1]; y-axis: mds2d[,2]; legend: Employment, Education, Jobless]

Code for scatterplot colored by sex

R> plot(mds2d, type = "n")
R> points(mds2d[mvad$male == "yes", ], pch = 16, col = "red")
R> points(mds2d[mvad$male == "no", ], pch = 23, col = "blue")
R> legend("bottomright", col = c("red", "blue"), pch = c(16, 23),
+      legend = c("Men", "Women"))

Sequence scatterplot colored by sex

[Figure: MDS scatterplot of the sequences; x-axis: mds2d[,1]; y-axis: mds2d[,2]; legend: Men, Women]

Section outline: Sequence dispersion

Dispersion of the set of sequences

From the distance matrix, we get the pseudo-variance of the set of sequences. The sum of squares $SS$ can be expressed in terms of distances between pairs:

$$SS = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=i+1}^{n} (y_i - y_j)^2 = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=i+1}^{n} d_{ij}$$

Setting $d_{ij}$ equal to the OM, LCP, LCS, ... distance, we get $SS$. We can then apply the ANOVA principle (Studer et al., 2009).
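The identity between the usual sum of squares and the pairwise form is easy to check numerically. The sketch below uses base R and made-up numbers, with squared Euclidean differences standing in for $d_{ij}$.

```r
# Hypothetical numeric observations.
y <- c(2, 4, 4, 7, 9)
n <- length(y)

# Usual definition: squared deviations from the mean.
ss_direct <- sum((y - mean(y))^2)

# Pairwise form: (1 / n) * sum over i < j of d_ij, with d_ij = (y_i - y_j)^2.
d        <- as.matrix(dist(y))^2
ss_pairs <- sum(d[upper.tri(d)]) / n

# Dividing SS by n gives the (pseudo-)variance.
pseudo_var <- ss_pairs / n
```

Because the pairwise form needs only the $d_{ij}$, it still makes sense when the objects are sequences and no mean sequence exists. The division by n is consistent with the figures reported in this deck, where a total of 30434.455 over 712 sequences corresponds to a variance of 42.74502.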

Compute the sequence dispersion

R> distMatLCS <- seqdist(mvad.seq, method = "LCS")
R> distMatLCS[1:6, 1:7]
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    0  140  116  108  140   64   60
[2,]  140    0   72  140   22  140   80
[3,]  116   72    0   68   90   72   60
[4,]  108  140   68    0  140   46  112
[5,]  140   22   90  140    0  140   90
[6,]   64  140   72   46  140    0   68
R> dissvar(distMatLCS)
[1] 42.74502

Section outline: Analysis of sequence discrepancy

Analysis of sequence discrepancy

ANOVA-like analysis based on pairwise dissimilarities. We decompose the $SS$ (sum-of-squares equivalent):

$$SS_T = SS_B + SS_W$$

Here, with the formula shown earlier:

$$SS_T = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=i+1}^{n} d_{ij}$$

$$SS_W = \sum_{g} \left( \frac{1}{n_g} \sum_{i=1}^{n_g} \sum_{j=i+1}^{n_g} d_{ij,g} \right)$$

$$SS_B = SS_T - SS_W$$

Pseudo R-square and ANOVA table

ANOVA table for $m$ groups:

          Discrepancy   df                       Mean discr.    F
Between   SS_B          df_B = m - 1             SS_B / df_B    (SS_B / df_B) / (SS_W / df_W)
Within    SS_W          df_W = sum_g n_g - m     SS_W / df_W
Total     SS_T          df_T = n - 1

Pseudo $R^2$:

$$R^2 = \frac{SS_B}{SS_T}$$
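The decomposition and the table entries can be computed directly from a dissimilarity matrix and a grouping vector. The following base-R sketch follows the formulas on these slides; it is not the code behind `dissassoc()`. The helper names `diss_ss` and `pseudo_anova` and the toy data are hypothetical.

```r
# Sum-of-squares equivalent: (1 / n) * sum of d_ij over pairs i < j.
diss_ss <- function(d) sum(d[upper.tri(d)]) / nrow(d)

# Pseudo-ANOVA quantities for a dissimilarity matrix d and a grouping vector.
pseudo_anova <- function(d, group) {
  group <- factor(group)
  n <- nrow(d)
  m <- nlevels(group)
  ss_t <- diss_ss(d)
  ss_w <- sum(sapply(levels(group), function(g) {
    idx <- group == g
    diss_ss(d[idx, idx, drop = FALSE])  # within-group contribution
  }))
  ss_b <- ss_t - ss_w
  c(SS_T = ss_t, SS_W = ss_w, SS_B = ss_b,
    R2 = ss_b / ss_t,
    pseudoF = (ss_b / (m - 1)) / (ss_w / (n - m)))
}

# Toy check with squared Euclidean distances, where the classical
# ANOVA values are known.
y <- c(1, 2, 3, 7, 8, 9)
g <- c("a", "a", "a", "b", "b", "b")
d <- as.matrix(dist(y))^2
res <- pseudo_anova(d, g)
```

For these numbers the total is 58, the within part is 2 + 2 = 4, and the between part is 54, which gives a pseudo F of 54 with 1 and 4 degrees of freedom.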

Pseudo F

$$F = \frac{SS_B / (m - 1)}{SS_W / (n - m)}$$

- Normality is not defensible in this setting, so $F$ cannot be compared with an $F$ distribution.
- Significance is assessed through a permutation test.
- Permutation test: repeatedly reassign each covariate profile at random to one of the observed sequences and recompute $F$. This yields the empirical distribution of $F$ under independence.
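The permutation scheme can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the procedure implemented in TraMineR: the group labels are shuffled while the dissimilarity matrix is kept fixed, and the p-value counts the observed statistic among the permutations, which is one common convention and need not match what a given package reports. The names `pseudo_f` and `perm_test` and the toy data are hypothetical.

```r
# Pseudo F computed from a dissimilarity matrix and a grouping vector.
pseudo_f <- function(d, group) {
  ss <- function(m) sum(m[upper.tri(m)]) / nrow(m)
  group <- factor(group)
  n <- nrow(d)
  k <- nlevels(group)
  ss_w <- sum(sapply(levels(group), function(g) {
    idx <- group == g
    ss(d[idx, idx, drop = FALSE])
  }))
  ss_b <- ss(d) - ss_w
  (ss_b / (k - 1)) / (ss_w / (n - k))
}

# Permutation test: shuffle the labels R times and recompute F each time.
perm_test <- function(d, group, R = 999) {
  f_obs  <- pseudo_f(d, group)
  f_perm <- replicate(R, pseudo_f(d, sample(group)))
  list(pseudoF = f_obs,
       f_perm  = f_perm,
       p_value = mean(c(f_perm, f_obs) >= f_obs))
}

set.seed(1)  # only to make the toy run reproducible
y <- c(1, 2, 3, 7, 8, 9)
g <- c("a", "a", "a", "b", "b", "b")
d <- as.matrix(dist(y))^2
pt <- perm_test(d, g, R = 199)
```

A histogram of `pt$f_perm` shows the empirical distribution of F under independence, against which the observed value is judged.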

Analysis of sequence discrepancy

Running an ANOVA-like analysis for gcse5eq:

R> mvad.lcs <- seqdist(mvad.seq, method = "LCS")
R> da <- dissassoc(mvad.lcs, group = mvad$gcse5eq, R = 1000)

ANOVA output

R> print(da)
Pseudo ANOVA table:
             SS  df        MSE
Exp    2499.945   1 2499.94539
Res   27934.510 710   39.34438
Total 30434.455 711   42.80514

Test values (p-values based on 999 permutation):
  PseudoF   PseudoR2 PseudoF_Pval  PseudoT PseudoT_Pval
 63.54009 0.08214195            0 1.199912            0

Variance per level:
        n variance
no    452 37.48481
yes   260 42.27453
Total 712 42.74502

Distribution of pseudo F

R> hist(da, col = "cyan")

[Figure: histogram "Distribution of PseudoF"; x-axis: PseudoF; y-axis: Frequency]

Multiple factor analysis

Generalizes the previous approach to multiple covariates. There are different approaches. Here, we measure the additional contribution of each covariate $v$ once all other covariates have been accounted for. The $F$ statistic reads

$$F_v = \frac{(SS_{B_c} - SS_{B_v}) / p}{SS_{W_c} / (n - m - 1)}$$

where $SS_{B_c}$ and $SS_{W_c}$ are the explained and residual sums of squares of the full model, $SS_{B_v}$ is the explained sum of squares of the model after removing variable $v$, and $p$ is the number of indicators or contrasts used to encode the covariate $v$. Significance is again assessed through permutation tests.

Running a multiple factor analysis

R> da.mfac <- dissmfac(mvad.lcs ~ male + Grammar + funemp + gcse5eq +
+      fmpr + livboth, data = mvad, R = 1000)
R> print(da.mfac)
  Variable   PseudoF    PseudoR2 p_value
1     male  3.274802 0.003840223   0.026
2  Grammar 21.124081 0.024771330   0.000
3   funemp  4.483016 0.005257046   0.003
4  gcse5eq 75.725976 0.088800698   0.000
5     fmpr  2.715988 0.003184926   0.045
6  livboth  2.314571 0.002714201   0.078
7    Total 24.829102 0.174448528   0.000

Differences over time

How do differences between groups vary over time? For example, how do differences between men's and women's insertion trajectories vary over time?

- Compute $R^2$ for short sliding windows (length 2).
- We thus get a sequence of $R^2$ values, which can be plotted.
- Similarly, we can plot the series of
  - the total residual discrepancy ($SS_W$),
  - the residual discrepancy of each group ($SS_G$).

Differences over time

R> mvad.diff <- seqdiff(mvad.seq, group = mvad$gcse5eq)
R> mvad.diff$stat[1:4, ]
        PseudoF   PseudoR2  PseudoT
Sep.93 29.09196 0.03936176 2.313692
Oct.93 29.39664 0.03975760 2.223468
Nov.93 29.76849 0.04024027 2.265784
Dec.93 30.09793 0.04066750 2.304112
R> mvad.diff$variance[1:4, ]
              no       yes     Total
Sep.93 0.3688107 0.3113979 0.3620982
Oct.93 0.3691362 0.3127219 0.3629661
Nov.93 0.3704210 0.3133136 0.3642237
Dec.93 0.3725771 0.3146893 0.3663363

Plotting R-squares over time

R> plot(mvad.diff)

[Figure: PseudoR2 over time; x-axis: Sep.93 to Apr.99; y-axis: PseudoR2 (0.04 to 0.12)]

Plotting residual discrepancy over time

R> plot(mvad.diff, stat = "Variance")

[Figure: Variance over time; x-axis: Sep.93 to Apr.99; y-axis: Variance (0.20 to 0.35); series: no, yes, Total]

Tree structured discrepancy analysis

Objective: find the most important predictors and their interactions.

- Iteratively segment the cases using the values of covariates (predictors), such that the groups are as homogeneous as possible.
- At each step, we select the covariate and split with the highest $R^2$.
- The significance of the split is assessed through a permutation $F$ test.
- Growing stops when the selected split is not significant.
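A single selection step can be sketched in base R. The example below only compares the pseudo $R^2$ of two made-up binary covariates on toy data; it does not search over ways of grouping the categories of a covariate, does not run the permutation test, and does not recurse, all of which a full tree-growing procedure would need. The names `diss_r2`, `covs`, `u`, and `v` are hypothetical.

```r
# Pseudo R2 of a grouping: 1 - SS_W / SS_T, from a dissimilarity matrix.
diss_r2 <- function(d, group) {
  ss <- function(m) sum(m[upper.tri(m)]) / nrow(m)
  group <- factor(group)
  ss_w <- sum(sapply(levels(group), function(g) {
    idx <- group == g
    ss(d[idx, idx, drop = FALSE])
  }))
  1 - ss_w / ss(d)
}

# Toy dissimilarities (squared Euclidean) and two candidate covariates.
y <- c(1, 2, 3, 7, 8, 9)
d <- as.matrix(dist(y))^2
covs <- data.frame(u = c("a", "a", "a", "b", "b", "b"),
                   v = c("p", "q", "p", "q", "p", "q"))

# One step of tree growing: keep the covariate with the highest R2.
r2   <- sapply(covs, function(v) diss_r2(d, v))
best <- names(which.max(r2))
```

Here `u` separates low from high values and reaches an $R^2$ of 54/58, while `v` mixes them and reaches only 6/58, so `u` is selected.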

Growing the tree

R> dt <- disstree(mvad.lcs ~ male + Grammar + funemp + gcse5eq +
+      fmpr + livboth, data = mvad, R = 5000)
R> print(dt)
Dissimilarity tree
Global R2: 0.113
|-- Root [ 712 ] var: 42.7
  |-> gcse5eq R2: 0.0821
  |-- no [ 452 ] var: 37.5
    |-> funemp R2: 0.0107
    |-- no [ 362 ] var: 35.9
      |-> male R2: 0.0123
      |-- no [ 146 ] var: 38.7
      |-- yes [ 216 ] var: 33.3
    |-- yes [ 90 ] var: 41.8
  |-- yes [ 260 ] var: 42.3
    |-> Grammar R2: 0.0534
    |-- no [ 183 ] var: 42.2
    |-- yes [ 77 ] var: 34.9

Creating a Graphviz plot of the tree

Using the simplified interface to generate a file for Graphviz:

R> seqtree2dot(dt, "fg_mvadseqtree", seqdata = mvad.seq, type = "d",
+      border = NA, withlegend = FALSE, axes = FALSE, ylab = "",
+      yaxis = FALSE)

Graphical tree

[Figure: graphical tree]

Outline

1 Dissimilarities among pairs of state sequences
2 Mining event sequences
3 Conclusion: Sequence of analyses

Section outline: Event sequences

2 Mining event sequences
  - Event sequences
  - Creating event subsequences in TraMineR
  - Seeking frequent and discriminant subsequences
  - Looking for state patterns
  - Looking for specific subsequences
  - Temporal constraints

Analysis of event sequences

Objective

- Focus on events, rather than states.
- Interest in the patterns of events. Pattern of events: events that occur systematically together and in the same order.
- Are there typical "patterns" of events?

Relationship with covariates

- Which patterns best discriminate specific groups?
- Typical differences in event sequences between men and women.
- Event patterns vs typical state sequencing.

Association rules between event subsequences

- Sequence Leaving home → Childbirth generally followed by Marriage → Second Childbirth.
- Not yet available, but ... coming soon.


Sequential data analysis - 2 Sequential data analysis with TraMineR, Part 2 Gilbert Ritschard Department of Econometrics and Laboratory of Demography University of Geneva http://mephisto.unige.ch/biomining APA-ATI Workshop on Exploratory Data
