
Mining Event Histories

Mining Event Histories: Some New Insights on Personal Swiss Life Courses

Gilbert Ritschard

Dept of Econometrics and Laboratory of Demography, University of Geneva http://mephisto.unige.ch

PaVie Seminar, Lausanne, October 22, 2008

21/10/2008gr 1/95 Mining Event Histories

My talk is about life courses, so let me start with an example of a scientific life course.

date        event
1970-1979   Studies in econometrics
1980-1992   Mathematical Economics
1985-...    Work with Social scientists (Family studies)
            Interest in Statistics for social sciences
1990-1995   Interest in Neural Networks
2000-...    KDD and data mining (Clustering, supervised learning)
2003-...    Work with historians, demographers, psychologists (longitudinal data)
2005-...    KDD and Data mining approaches for analysing life course data
2007-...    Start a SNF project on “Mining Event Histories”


Outline

1 Sequence Analysis in Social Sciences
2 Survival Trees
3 Characterizing, rendering and clustering sequence data
4 Mining Frequent Episodes

Sequence Analysis in Social Sciences

Motivation

Individual life course paradigm.

Following macro quantities (e.g. #divorces, fertility rate, mean education level, ...) over time is insufficient for understanding social behavior. We need to follow individual life courses.

Data availability

  • Large panel surveys in many countries (SHP, CHER, SILC, GGP, ...).
  • Biographical retrospective surveys (FFS, ...).
  • Statistical matching of censuses, population registers and other administrative data.


Motivation

Need for suitable methods for discovering interesting knowledge from these individual longitudinal data.

Social scientists use
  • essentially survival analysis (Event History Analysis),
  • more rarely sequential data analysis (Optimal Matching, Markov Chain Models).

Could social scientists benefit from data-mining approaches?

Which methods? Are there specific issues with those methods for social scientists?


Motivation: KD in Social sciences

In KDD (Knowledge discovery in databases) and data mining, the focus is on prediction and classification: the aim is to reduce prediction and classification errors.
In social sciences, the aim is understanding and explaining (social) behaviors. Hence the focus is on the process rather than the output.


What kind of data?

What kind of data are we dealing with? Mainly categorical longitudinal data describing life courses Data can be in different forms ...


Ontology of longitudinal data (Aristotelian tree)

[Figure: tree diagram with the nodes: Longitudinal data; States; one state per time unit t; several states at each t; Events; time stamped events; event sequence; spell duration.]

Alternative views of Individual Longitudinal Data

Table: Time stamped events, record for Sandra

ending secondary school   in 1970
first job                 in 1971
marriage                  in 1973

Table: State sequence view, Sandra

year              1969      1970       1971       1972       1973
civil status      single    single     single     single     married
education level   primary   secondary  secondary  secondary  secondary
job               no        no         first      first      first


Transforming time stamped events into state sequences

Example: the “BioFam” data

Data from the retrospective survey conducted in 2002 by the Swiss Household Panel (SHP) (with support of Federal Statistical Office, Swiss National Fund for Scientific Research, University of Neuchatel).
  • Retrospective survey: 5560 individuals.
  • Retained familial life events: Leaving Home, First childbirth, First marriage and First divorce.
  • Age 15 to 45 → 2601 remaining individuals, born between 1909 and 1957.


Creating state sequences

Example of time stamped data:

individual   LHome   marriage   childbirth   divorce
1            1989    1990       1992         NA


Deriving the states

Need one state for each combination of events:

state   LHome    marriage   childbirth   divorce
0       no       no         no           no
1       yes      no         no           no
2       no       yes        yes/no       no
3       yes      yes        no           no
4       no       no         yes          no
5       yes      no         yes          no
6       yes      yes        yes          no
7       yes/no   yes        yes/no       yes


From events to states

Example of transformation:

events:
individual   LHome   marriage   childbirth   divorce
1            1989    1990       1992         NA

states:
individual   ...   1988   1989   1990   1991   1992   1993   ...
1            ...          1      3      3      6             ...
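The transformation from time stamped events to a state sequence can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the event-free combination is taken to be state 0, and a state is assumed to change in the calendar year in which the event occurs.

```python
def family_state(left_home, married, child, divorced):
    """Map a combination of already-occurred events to a state code 0..7."""
    if divorced:
        return 7                      # divorce overrides everything else
    if married:
        if not left_home:
            return 2                  # married, not left home (child yes/no)
        return 6 if child else 3
    if child:
        return 5 if left_home else 4
    return 1 if left_home else 0


def events_to_states(events, first_year, last_year):
    """Build the yearly state sequence from a dict of event years (None = NA)."""
    def occurred(name, year):
        when = events.get(name)
        return when is not None and when <= year

    return [
        family_state(
            occurred("LHome", year),
            occurred("marriage", year),
            occurred("childbirth", year),
            occurred("divorce", year),
        )
        for year in range(first_year, last_year + 1)
    ]


# Record of individual 1 from the example.
record = {"LHome": 1989, "marriage": 1990, "childbirth": 1992, "divorce": None}
states = events_to_states(record, 1988, 1993)
print(states)
```

Under these assumptions the values obtained for 1989 to 1992 are 1, 3, 3, 6, in line with the state row of the example.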


Issues with life course data

Incomplete sequences

Censored and truncated data: Cases falling out of observation before experiencing an event of interest. Sequences of varying length.

Time varying predictors.

Example: When analysing time to divorce, presence of children is a time varying predictor.

Data collected by clusters

Example: Household panel surveys. Multi-level analysis to account for unobserved shared characteristics of members of the same cluster.


Multi-level: Simple linear regression example

[Figure: plot with axes Education and Children, showing four fitted regression lines: y = 3.2 + 0.2 x, y = 6.2 - 0.8 x, y = 15.6 - 0.8 x, y = 12.5 - 0.8 x.]
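Why clustered data call for multi-level analysis can be illustrated with a small synthetic example. The sketch below is an assumption-laden toy, not the data behind the figure: it borrows the three intercepts and the common slope -0.8 from the equations above, invents the x ranges of the three clusters, and shows that a pooled regression ignoring the clusters can yield a positive slope although every within-cluster slope is negative.

```python
def ols(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


WITHIN_SLOPE = -0.8
# Hypothetical clusters: invented x values, intercepts taken from the slide.
clusters = {
    "A": ([1, 2, 3], 6.2),
    "B": ([6, 7, 8], 12.5),
    "C": ([11, 12, 13], 15.6),
}

data = {
    name: (xs, [intercept + WITHIN_SLOPE * x for x in xs])
    for name, (xs, intercept) in clusters.items()
}

within_slopes = {name: ols(xs, ys)[1] for name, (xs, ys) in data.items()}
pooled_x = [x for xs, _ in data.values() for x in xs]
pooled_y = [y for _, ys in data.values() for y in ys]
pooled_intercept, pooled_slope = ols(pooled_x, pooled_y)

print(within_slopes)   # every cluster has slope -0.8
print(pooled_slope)    # positive once the cluster structure is ignored
```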

Classical statistical approaches

Survival Approaches

Survival or Event history analysis (Blossfeld and Rohwer, 2002)

Focuses on one event. Concerned with duration until event occurs, or with hazard of experiencing event.

Survival curves: Distribution of duration until event occurs
$S(t) = p(T \ge t)$.

Hazard models: Regression like models for $S(t, x)$ or hazard $h(t) = p(T = t \mid T \ge t)$
$h(t, x) = g\big(t, \beta_0 + \beta_1 x_1 + \beta_2 x_2(t) + \cdots\big)$.
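The two quantities defined above can be estimated from discrete durations with a simple life table. The following is a minimal sketch with made-up durations; the function name and the convention that a case censored at t still counts as at risk at t are assumptions of this illustration.

```python
def life_table(durations, events):
    """Discrete-time estimates of h(t) = p(T = t | T >= t) and S(t) = p(T >= t).

    durations: observed time of each case (event or censoring time).
    events:    1 if the event occurred at that time, 0 if the case is censored.
    Returns a list of tuples (t, at_risk, n_events, hazard, survival).
    """
    rows = []
    survival = 1.0                       # S(1) = p(T >= 1) = 1
    for t in range(1, max(durations) + 1):
        at_risk = sum(1 for d in durations if d >= t)
        n_events = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        hazard = n_events / at_risk
        rows.append((t, at_risk, n_events, hazard, survival))
        survival *= 1.0 - hazard         # S(t + 1) = S(t) * (1 - h(t))
    return rows


# Toy data: four cases, the third one censored at t = 3.
table = life_table([2, 3, 3, 5], [1, 1, 0, 1])
for row in table:
    print(row)
```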


Survival curves (Switzerland, SHP 2002 biographical survey)

[Figure: survival curves for women, survival probability against age (10 to 80 years), one curve per event: Leaving home, Marriage, 1st Childbirth, Parents' death, Last child left, Divorce, Widowing.]


Analysis of sequences

Frequencies of given subsequences

Essentially event sequences, e.g. (First job → Marriage). Subsequences considered as categories ⇒ Methods for categorical data apply (Frequencies, cross tables, log-linear models, logistic regression, ...).

Markov chain models

State sequences. Focuses on transition rates between states. Does the rate also depend on previous states? How many previous states are significant?

Optimal Matching (Abbott and Forrest, 1986)

State sequences. Edit distance (Levenshtein, 1966; Needleman and Wunsch, 1970) between pairs of sequences. Clustering of sequences.
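The transition rates on which Markov chain models focus can be estimated by counting consecutive state pairs. The sketch below is a first-order illustration only, with toy sequences and a hypothetical function name; it does not address the question of how many previous states matter.

```python
from collections import Counter, defaultdict


def transition_rates(sequences):
    """Estimate first-order transition rates p(next state | current state)."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return {
        state: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for state, c in counts.items()
    }


# Toy state sequences (one list of states per individual).
toy_sequences = [["B", "B", "D"], ["A", "B", "C"], ["B", "B", "A"]]
rates = transition_rates(toy_sequences)
print(rates)
```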


Typology of methods for life course data

Issues (columns): duration/hazard, state/event sequencing. Questions (rows): descriptive, causality.

descriptive, duration/hazard
  • Survival curves: parametric (Weibull, Gompertz, ...) and non parametric (Kaplan-Meier, Nelson-Aalen) estimators.

descriptive, state/event sequencing
  • Frequencies of given patterns
  • Optimal matching clustering, MDS
  • Rendering sequences
  • Discovering typical episodes

causality, duration/hazard
  • Hazard regression models (Cox, ...)
  • Survival trees

causality, state/event sequencing
  • Markov models
  • Mobility trees
  • Discriminating episodes
  • Sequence Heterogeneity Analysis (Anova)

Survival Trees

SHP biographical retrospective survey

http://www.swisspanel.ch

SHP retrospective survey: 2001 (860) and 2002 (4700 cases). We consider only data collected in 2002. Data completed with variables from 2002 wave (language).

Characteristics of retained data for divorce (individuals who married at least once)

                           men     women   Total
Total                      1414    1656    3070
1st marriage dissolution   231     308     539
                           16.3%   18.6%   17.6%


Distribution by birth cohort

[Figure: histogram of birth year (1910 to 1960), frequency axis from 100 to 500.]

Marriage duration until divorce

Survival curves

[Figure: survival probability against marriage duration (10 to 40 years), one panel for women and one for men, one curve per birth cohort: 1942 and before, 1943-1952, 1953 and after.]

Marriage duration until divorce

Hazard model

Discrete time model (logistic regression on person-year data).
exp(B) gives the odds ratio, i.e. the multiplicative change in the odds h/(1 − h) when the covariate increases by one unit.

              exp(B)         Sig.
birthyr       1.0088         0.002
university    1.22           0.043
child         0.73           0.000
language
  unknown     1.47           0.000
  French      1.26           0.007
  German      1              ref
  Italian     0.89           0.537
Constant      0.0000000004   0.000
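Two ingredients of this discrete time model lend themselves to a short illustration: the person-year layout on which the logistic regression is fitted, and the reading of exp(B) as a multiplier of the odds. The sketch below uses hypothetical helper names and an arbitrary baseline hazard of 0.02; only the odds ratio 0.73 for child is taken from the table.

```python
def person_years(duration, event):
    """Expand one marriage into person-year rows (year, divorce indicator).

    A marriage observed for `duration` years contributes one row per year;
    the indicator is 1 only in the last year and only if a divorce occurred.
    """
    return [
        (year, 1 if (event and year == duration) else 0)
        for year in range(1, duration + 1)
    ]


def apply_odds_ratio(hazard, odds_ratio):
    """Return the hazard obtained after multiplying the odds h/(1-h) by odds_ratio."""
    odds = hazard / (1.0 - hazard)
    new_odds = odds * odds_ratio
    return new_odds / (1.0 + new_odds)


rows_divorced = person_years(3, True)    # divorce in the third year
rows_censored = person_years(2, False)   # still married when observation ends

baseline = 0.02                          # arbitrary illustrative hazard
with_child = apply_odds_ratio(baseline, 0.73)
print(rows_divorced, rows_censored, with_child)
```

For small hazards the odds and the hazard are close, so an odds ratio of 0.73 lowers the yearly hazard by roughly a quarter.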


Survival trees: Principle

Target is survival curve or some other survival characteristic.
Aim: Partition data set into groups that

  • differ as much as possible (max between class variability)

    Example: Segal (1988) maximizes difference in KM survival curves by selecting split with smallest p-value of Tarone-Ware Chi-square statistics

    $$TW = \frac{\sum_i w_i \left(d_{i1} - E(D_i)\right)}{\left(\sum_i w_i^2 \,\mathrm{var}(D_i)\right)^{1/2}}$$

  • are as homogeneous as possible (min within class variability)

    Example: Leblanc and Crowley (1992) maximize gain in deviance (-log-likelihood) of relative risk estimates.
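The Tarone-Ware statistic used as splitting criterion can be computed as follows for a candidate split into two groups. This is a sketch under the usual conventions, which the slide does not spell out and which are therefore assumptions here: weights $w_i = \sqrt{n_i}$ with $n_i$ the number at risk at event time $i$, and hypergeometric expectation and variance for the number of events $D_i$ in group 1.

```python
import math


def tarone_ware(times, events, groups):
    """Tarone-Ware statistic comparing group 1 with group 0.

    times:  observed durations; events: 1 = event, 0 = censored;
    groups: 1 or 0 for each case.
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    numerator = 0.0
    denominator = 0.0
    for t in event_times:
        n = sum(1 for u in times if u >= t)                      # at risk
        n1 = sum(1 for u, g in zip(times, groups) if u >= t and g == 1)
        d = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        d1 = sum(
            1 for u, e, g in zip(times, events, groups)
            if u == t and e == 1 and g == 1
        )
        weight = math.sqrt(n)
        expected = n1 * d / n
        variance = (
            n1 * (n - n1) * d * (n - d) / (n * n * (n - 1)) if n > 1 else 0.0
        )
        numerator += weight * (d1 - expected)
        denominator += weight ** 2 * variance
    return numerator / math.sqrt(denominator) if denominator > 0 else 0.0


# Group 1 experiences the event earlier than group 0: positive statistic.
tw_different = tarone_ware([1, 2, 3, 4], [1, 1, 1, 1], [1, 1, 0, 0])
# Identical groups: statistic equal to zero.
tw_same = tarone_ware([1, 2, 3, 1, 2, 3], [1] * 6, [1, 1, 1, 0, 0, 0])
print(tw_different, tw_same)
```

A tree-growing procedure would evaluate this statistic for every candidate split and retain the one with the smallest p-value.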


Divorce, Switzerland, Differences in KM Survival Curves I

[Figure: survival tree grown on differences in KM survival curves, with zoom.]

Divorce, Switzerland, Differences in KM Survival Curves II

[Figure: KM survival curves, survival probability (0.5 to 1.0) against marriage duration (10 to 40 years), for the groups:]
  • Cohort <=1940 & Non French Speaking & University
  • Cohort <=1940 & Non French Speaking & < University
  • Cohort <=1940 & French Speaking
  • Cohort > 1940 & No Child & University
  • Cohort > 1940 & No Child & < University
  • Cohort > 1940 & Child & German or Italian Speaking
  • Cohort > 1940 & Child & French or Unknown Speaking


Divorce, Switzerland, Relative risk


Hazard model with interaction

Adding interaction effects detected with the tree approach significantly improves the fit (sig ∆χ2 = 0.004).

                      exp(B)   Sig.
born after 1940       1.78     0.000
university            1.22     0.049
child                 0.94     0.619
language
  unknown             1.50     0.000
  French              1.12     0.282
  German              1        ref
  Italian             0.92     0.677
b_before_40*French    1.46     0.028
b_after_40*child      0.68     0.010
Constant              0.008    0.000
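Reading the interaction terms requires combining odds ratios. As a small worked example, and assuming the interaction variables are plain products of indicator variables (an assumption, since the slide does not give the coding), the odds ratio of a covariate within a cohort is the product of its main effect and the matching interaction effect:

```python
# exp(B) values copied from the table above.
odds_ratio = {
    "child": 0.94,
    "French": 1.12,
    "b_before_40*French": 1.46,
    "b_after_40*child": 0.68,
}

# Effect of having a child on the odds of divorce, by cohort.
child_born_before_1940 = odds_ratio["child"]
child_born_after_1940 = odds_ratio["child"] * odds_ratio["b_after_40*child"]

# Effect of being French speaking (reference: German), by cohort.
french_born_before_1940 = odds_ratio["French"] * odds_ratio["b_before_40*French"]
french_born_after_1940 = odds_ratio["French"]

print(round(child_born_after_1940, 3), round(french_born_before_1940, 3))
```

Under that assumption, the protective effect of children and the higher risk of French speakers are each concentrated in one cohort, which is what the tree suggested.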


Issues with survival trees in social sciences

1 Dealing with time varying predictors

Segal (1992) discusses a few possibilities, none being really satisfactory. Huang et al. (1998) propose a piecewise constant approach suitable for discrete variables and a limited number of changes. Room for development ...

2 Multi-level analysis

How can we account for multi-level effects in survival trees, and more generally in trees? Conjecture: Should be possible to include unobserved shared effect in deviance-based splitting criteria.

Characterizing, rendering and clustering sequence data

Sequence analysis

Survival approaches are not useful in a unitary (holistic) perspective of the whole life course.
Sequence analysis of the whole collection of life events is better suited for such a holistic approach (Billari, 2005).

Rendering sequences: Colorize your life courses
  • Results from the analysis of the retrospective Swiss Household Panel (SHP) survey.
  • Focus on visualization of life course data.


Evolution tendencies in familial life course trajectories

Sequence analysis techniques make it possible to test hypotheses about the evolution of these familial life trajectories (Elzinga and Liefbroer, 2007):
  • De-standardization: Some states and events of familial life are shared by decreasing proportions of the population, occur at more dispersed ages, and their duration is also more scattered.
  • De-institutionalization: Social and temporal organization of life courses becomes less driven by normative, legal or institutional rules.
  • Differentiation: The number of distinct steps lived by an individual increases.


Characterizing sets of sequences

Sequence of transversal measures (between entropy, ...): one measure per column (time point)

id   t1   t2   t3   · · ·
1    B    B    D    · · ·
2    A    B    C    · · ·
3    B    B    A    · · ·

Summary of longitudinal measures (sequence entropy, ...): one measure per row (individual)

id   t1   t2   t3   · · ·
1    B    B    D    · · ·
2    A    B    C    · · ·
3    B    B    A    · · ·

Other global characteristics: Central sequence, Sequence diversity, ...


Entropy

Entropy: Measure of uncertainty regarding state predictability.

$p_i$: proportion of cases (or time points) in state $i$.

Shannon: $h(p) = \sum_i -p_i \log_2(p_i)$

Other types of entropies: Quadratic (Gini), Daroczy, ...

Two ways of using entropies.

  • (Transversal) entropy of the state at each time (age) point: Entropy increases with the diversity of states observed at each time point (age).
  • (Longitudinal) entropy of each individual sequence: Entropy increases with the diversity of states during the observed life course and varies with the time spent in each state.
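Both uses of the Shannon entropy can be sketched on the three toy sequences (B B D), (A B C) and (B B A), restricted to their first three time points. The function name is illustrative; the transversal entropies are computed column by column, the longitudinal ones row by row.

```python
import math
from collections import Counter


def shannon_entropy(states):
    """Shannon entropy h(p) = sum_i -p_i log2(p_i) of a list of states."""
    total = len(states)
    return sum(
        -(count / total) * math.log2(count / total)
        for count in Counter(states).values()
    )


sequences = {
    1: ["B", "B", "D"],
    2: ["A", "B", "C"],
    3: ["B", "B", "A"],
}

# Transversal: entropy of the state distribution at each time point.
transversal = [shannon_entropy(column) for column in zip(*sequences.values())]

# Longitudinal: entropy of the states visited within each sequence.
longitudinal = {ident: shannon_entropy(seq) for ident, seq in sequences.items()}

print(transversal)
print(longitudinal)
```

At t2 all three cases are in state B, so the transversal entropy is zero; sequence 2 visits three distinct states and therefore has the highest longitudinal entropy.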


Illustrative data

Data from the 2002 SHP biographical survey.
Interested in the relationship between
  • Cohabitational trajectories (10 states)
  • Professional trajectories (8 states)

We use the coding retained by Gauthier (2007).
Focus on ages 20 to 45 (sequence length = 26 years).
1503 cases (751 women, 752 men).


Transversal entropy at each time (age) point

[Figure: transversal entropy by age (A20 to A44), two panels: Living Arrangement Trajectories and Professional Trajectories; one curve per birth cohort: 1910-1940, 1941-1950, 1951-1957.]


Transversal entropy at each time (age) point

[Figure: transversal entropy by age (A20 to A44) of Living Arrangement Trajectories, two panels: Men and Women; one curve per birth cohort: 1910-1940, 1941-1950, 1951-1957.]


Transversal entropy at each time (age) point

[Figure: transversal entropy by age (A20 to A44) of Professional Trajectories, two panels: Men and Women; one curve per birth cohort: 1910-1940, 1941-1950, 1951-1957.]


Hypothesis about longitudinal entropies

Cohabitational and professional life trajectories become
  • less stable
  • more diversified

Their entropy tends to increase for younger generations.
Are increases in professional trajectories related to increases in cohabitational trajectories?


Entropy of cohabitational trajectories

[Figure: longitudinal entropy (0.0 to 0.7) of Living Arrangement Trajectories by birth cohort (1910-1940, 1941-1950, 1951-1957), two panels: Men and Women.]

Men: p(F > f) = .000***; all 2 by 2 differences significant.
Women: p(F > f) = .073*; coh3 significantly (.02) different from coh1.


Entropy of professional trajectories

[Figure: longitudinal entropy (0.0 to 0.7) of Professional Trajectories by birth cohort (1910-1940, 1941-1950, 1951-1957), two panels: Men and Women.]

Men: p(F > f) = .002***
Women: p(F > f) = .001***

coh3 not significantly different from coh2


Correlation between cohabitational and professional entropies

            Overall    Men        Women
1910-1940   0.08 *     0.11 *     0.19 ***
1941-1950   0.12 **    0.14 **    0.30 ***
1951-1957   0.15 **    0.25 ***   0.31 ***


Clustering, Multidimensional scaling and more

Once you are able to compute pairwise distances between sequences you can, among others:
  • Cluster sequences.
  • Analyse the trajectory heterogeneity (Generalized ANOVA).
  • Make scatter plot representations of sets of sequences using multidimensional scaling.


Distances between sequences

Edit distance (known as Optimal matching in Social sciences)

(Levenshtein, 1966; Needleman and Wunsch, 1970; Abbott and Forrest, 1986)
  • d(x, y): minimal total cost of the insertions, deletions and substitutions required to transform sequence x into y.
  • Different solutions depending on indel and substitution costs.

Other metrics proposed by Elzinga (2008)

  • LCP: Longest common prefix (also longest common postfix)
  • LCS: Longest common subsequence (same as OM with indel cost = 1 and substitution cost = 2)
  • NMS: Number of matching subsequences
  • ...

Elzinga (2008) proposes a nice formalization of these metrics.
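The edit distance can be computed by dynamic programming. The sketch below is a simplified illustration with constant indel and substitution costs (state-dependent substitution costs, e.g. based on transition rates, are not handled); function names are hypothetical. It also checks the stated link with the longest common subsequence: with indel cost 1 and substitution cost 2, the distance equals len(x) + len(y) minus twice the LCS length.

```python
def om_distance(x, y, indel=1.0, subst=2.0):
    """Optimal matching (edit) distance with constant indel and substitution costs."""
    n, m = len(x), len(y)
    # d[i][j] = minimal cost of turning the first i states of x into the first j of y.
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel
    for j in range(1, m + 1):
        d[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if x[i - 1] == y[j - 1] else subst
            d[i][j] = min(
                d[i - 1][j] + indel,       # deletion
                d[i][j - 1] + indel,       # insertion
                d[i - 1][j - 1] + cost,    # match or substitution
            )
    return d[n][m]


def lcs_length(x, y):
    """Length of the longest common subsequence of x and y."""
    n, m = len(x), len(y)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]


x, y = "ABCD", "ACBD"
dist = om_distance(x, y, indel=1.0, subst=2.0)
lcs_based = len(x) + len(y) - 2 * lcs_length(x, y)
print(dist, lcs_based)
```

The resulting matrix of pairwise distances is the input required by clustering and multidimensional scaling.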


Clustering with OM distances: Dendrograms

1 192 290 319 360 1197 1333 1412 1036 504 532 204 388 1472 1280 343 40 633 42 520 632 357 833 41 642 944 1046 1123 673 257 685 795 1068 1074 318 508 1069 138 1161 306 1220 352 543 517 1188 162 819 1359 637 409 941 1215 706 1011 1375 221 986 255 437 940 576 1269 455 1391 760 1364 552 227 779 281 327 1445 326 270 1315 465 1494 743 1493 1203 1355 114 1022 1023 123 588 258 1248 428 210 1387 415 752 712 997 292 1331 524 505 957 317 439 929 955 1438 528 1039 1222 273 492 565 658 542 560 844 1356 634 933 1214 1040 1071 1450 324 1326 412 1093 792 15 175 224 783 1304 561 1075 122 1283 924 640 641 726 1012 1303 1433 1063 998 438 463 970 1052 1330 1436 1446 1371 31 67 488 782 1089 1107 1274 80 163 340 556 574 1108 1297 1432 553 402 789 519 987 1037 1279 222 307 499 1201 444 818 456 660 956 964 995 1374 26 157 811 1047 55 79 66 813 468 1148 1165 1357 97 692 1135 1017 1485 269 1219 1136 1133 464 530 1271 471 765 1281 149 181 784 807 981 1358 639 1175 1118 1117 747 764 1332 240 500 539 503 1256 1234 1496 1327 12 18 45 46 56 63 96 132 144 159 169 180 189 215 246 263 264 295 297 358 429 441 442 457 467 506 558 562 568 569 585 595 596 598 664 705 717 746 761 772 775 776 777 855 863 867 879 897 950 996 1005 1067 1083 1084 1100 1101 1113 1114 1144 1147 1258 1260 1344 1345 1354 1376 1498 1499 72 414 1353 59 938 493 945 559 850 296 1443 1019 1116 62 584 952 217 251 992 583 650 675 160 494 1079 250 671 23 1191 164 350 837 1405 1406 141 798 566 937 1038 1259 165 790 16 623 1225 425 1321 20 22 814 1264 225 237 1109 118 619 1018 1185 931 216 392 1465 934 1419 115 1268 1143 1221 496 511 862 184 1142 1210 435 346 669 1380 518 1154 636 1149 697 899 1119 117 305 405 466 932 936 228 875 781 145 202 1273 1457 170 1478 479 688 1322 155 193 943 1207 226 906 1196 19 778 1276 235 313 696 243 674 763 896 921 534 918 238 834 1488 510 868 1284 1313 1454 802 766 1173 1440 236 342 733 799 1470 801 284 331 302 976 1213 720 770 137 536 1179 601 1009 1307 294 902 1373 206 332 1430 578 1073 917 1302 1311 30 
120 690 840 418 458 1287 1177 232 285 1455 311 1320 451 809 683 1247 179 947 815 1263 880 905 121 1403 286 1408 708 554 134 434 657 762 276 1104 383 2 651 652 102 912 201 275 391 1395 1336 390 1340 1458 39 627 1155 820 1453 50 1103 188 507 198 86 711 119 1285 1349 291 800 448 849 604 1126 1489 60 400 404 581 913 413 871 1111 771 866 1190 579 1397 1398 38 985 704 1246 361 698 710 1245 197 616 1473 1270 961 836 89 247 804 545 106 1267 105 1024 659 1085 207 1346 1352 320 1070 234 373 374 573 1474 1061 1449 1202 1312 239 626 299 960 1056 1125 14 680 889 885 362 821 53 396 575 420 1365 667 54 888 572 1456 154 729 218 1255 1080 1180 1206 140 419 486 860 916 73 1102 701 891 214 1416 848 544 1072 328 1388 793 980 1360 602 672 823 593 756 1235 1329 194 622 516 681 363 1164 514 702 785 1381 812 1413 1157 407 603 978 994 835 1242 1475 469 1139 1277 883 687 1442 51 101 85 287 1090 443 904 325 678 753 1409 1468 1492 112 1041 230 1350 769 1305 547 1097 1323 990 1088 1128 1013 1059 1368 1029 131 922 1006 367 1261 156 476 278 886 150 662 366 314 1160 724 478 830 853 48 1183 1193 725 1288 293 1112 1252 557 262 1053 1208 502 768 810 1249 61 832 143 1168 546 336 703 241 427 267 1055 1447 1003 1167 1195 139 1169 993 1091 839 620 635 213 1158 977 1134 1076 1239 1361 153 629 1174 828 887 1004 1166 454 1393 890 935 1328 277 645 1186 1045 1317 1461 509 907 908 563 1057 666 1184 3 323 653 654 1266 126 1394 354 283 911 1015 1341 526 1385 36 129 523 178 1106 256 242 845 846 113 203 774 322 1087 309 1316 694 1194 1278 693 1265 919 803 274 738 1205 339 540 894 371 1250 7 424 182 308 1490 44 282 300 1486 608 1007 1189 1426 364 1400 794 1244 982 5 1211 489 723 1044 1129 1427 329 403 1324 289 399 430 408 735 927 870 1054 1121 1272 1290 35 580 587 1420 298 303 333 341 1152 453 1423 621 1081 379 700 452 490 431 570 676 707 21 49 1487 1159 1370 65 100 88 791 483 668 718 594 854 825 1343 529 838 1048 1229 43 744 426 369 1131 436 1431 1233 1096 87 773 872 1378 13 582 605 130 416 248 749 953 954 211 
1138 686 1192 34 447 1034 252 882 187 1386 610 527 895 1477 233 1162 1163 930 1231 1422 321 1402 484 577 1362 37 498 1300 52 312 1026 1127 1292 93 107 748 841 1301 541 1294 1338 84 1141 910 1286 1401 147 345 968 249 1031 1367 174 1262 625 691 754 1452 344 1021 567 847 713 1187 8 555 1212 355 1241 69 571 33 1467 714 148 245 272 822 865 1217 1464 387 618 1410 208 900 191 624 648 1178 316 334 1216 1254 591 989 10 94 95 356 377 460 551 592 649 677 928 942 1098 1124 1218 1407 1428 946 70 1156 599 1404 183 1092 780 521 806 817 965 1051 1237 1383 71 1172 338 1451 423 1425 135 901 151 1060 533 788 372 1335 401 647 564 1153 874 1318 1382 91 1314 158 1309 1501 271 966 259 1481 1001 398 446 732 858 983 136 537 609 351 386 432 689 750 606 612 615 728 755 1033 1243 1459 736 903 1014 1137 47 104 172 376 394 589 878 1035 1238 1348 1429 1480 1497 477 116 665 859 1150 1384 1424 330 737 1058 417 550 1463 607 83 335 133 375 487 125 661 1293 111 829 742 979 1082 1130 4 385 538 684 740 876 1414 1415 146 185 923 974 310 353 730 613 614 124 265 406 462 915 991 1275 1500 186 279 347 393 459 670 864 884 898 909 962 963 1032 1049 1105 1176 1257 1295 597 1008 17 1199 646 1065 1306 1334 24 1351 1140 177 861 1441 27 209 450 90 231 722 739 958 1077 370 881 1227 1228 1363 365 1434 411 1122 395 973 797 1392 474 767 481 731 869 1151 25 472 851 972 1016 1372 190 513 1095 1484 337 525 959 1181 1310 92 1020 1337 1379 512 1042 1094 827 1389 969 1200 497 826 1062 1435 1476 68 98 410 656 709 719 1099 1298 1469 288 301 758 470 515 461 824 586 1232 831 1399 78 857 939 128 843 1396 786 1291 176 482 535 196 110 721 926 971 1078 304 449 805 1025 1064 1115 6 57 127 142 200 253 260 268 397 631 638 643 644 842 892 949 1028 1030 1120 1145 1146 1198 1204 1282 1503 75 82 108 440 491 751 796 1342 171 548 1027 1066 359 873 1209 378 1002 58 161 166 168 173 195 223 261 422 475 480 590 600 617 628 699 715 787 856 877 967 984 1000 1050 1224 1308 1319 1417 1482 1483 1491 1502 29 167 199 368 389 433 445 485 611 630 734 
1086 1132 1170 1325 1377 1448 1110 1437 1296 109 220 229 280 384 421 655 920 975 219 380 1223 1471 74 76 77 212 266 549 759 948 1043 1289 1439 9 473 808 988 1182 1230 1299 64 315 152 816 1444 381 495 501 745 852 951 999 1010 1253 1366 1421 1466 757 99 663 205 741 11 103 244 348 679 682 893 925 1226 1236 1251 1339 1347 1390 1479 1495 28 32 254 522 531 716 914 1171 1240 1369 1411 1418 1460 1462 81 349 695 382 727 5000 10000 15000 20000 Cohabitational trajectories, Ward method (OM Distances, Indel=1, Subst. Cost based on Trans. Rate) Individual trajectories Height 1 2 5 6 7 13 24 29 30 34 36 37 38 42 43 46 50 53 62 64 68 69 76 84 86 92 95 97 101 104 112 116 117 118 120 124 129 137 139 149 156 158 163 166 168 172 173 177 181 188 190 197 202 206 210 214 215 217 220 224 228 229 230 232 235 236 246 249 254 257 258 264 269 271 272 275 276 278 282 287 289 294 296 297 302 307 308 314 315 318 322 325 330 332 333 337 343 347 348 351 352 353 360 362 365 367 369 372 375 376 378 384 392 393 399 403 412 415 419 421 426 428 434 438 441 451 453 457 461 463 464 466 468 477 478 483 485 489 490 492 494 495 499 505 506 507 514 518 521 525 526 528 532 533 538 540 543 549 550 552 554 558 561 562 566 568 569 570 576 578 579 582 586 588 592 594 595 596 598 599 604 606 613 616 626 629 631 632 633 638 640 648 653 657 661 662 664 669 671 676 679 684 685 689 691 693 700 705 708 710 712 717 720 733 734 736 737 738 740 747 748 752 753 754 756 758 761 769 794 798 799 800 801 802 803 805 808 816 820 827 828 835 840 850 853 856 858 866 868 870 871 875 877 881 884 888 890 892 894 899 902 904 909 913 914 917 924 925 929 937 938 940 942 943 945 948 949 952 958 960 964 966 969 970 971 972 973 974 976 978 983 984 987 990 991 995 996 1000 1004 1006 1011 1017 1018 1020 1023 1024 1025 1027 1032 1033 1034 1038 1040 1043 1051 1057 1059 1066 1070 1071 1073 1075 1077 1087 1089 1090 1094 1096 1112 1113 1114 1115 1120 1121 1123 1128 1130 1152 1157 1162 1166 1169 1171 1175 1176 1183 1184 1185 1187 1188 1189 1191 
1195 1197 1199 1201 1204 1208 1211 1213 1215 1219 1222 1228 1238 1243 1247 1253 1258 1261 1262 1263 1265 1267 1270 1274 1275 1277 1279 1282 1290 1297 1302 1303 1307 1309 1313 1318 1320 1324 1325 1329 1330 1332 1333 1339 1341 1343 1348 1358 1360 1363 1365 1366 1367 1371 1374 1376 1377 1383 1387 1392 1396 1398 1404 1409 1410 1415 1417 1418 1419 1420 1422 1425 1427 1428 1431 1432 1438 1441 1444 1445 1447 1448 1449 1450 1456 1463 1464 1470 1473 1476 1479 1489 1490 1494 1495 1498 1499 1501 1408 12 15 89 98 114 150 174 192 260 281 355 391 410 465 471 475 523 531 539 567 600 617 619 635 636 646 651 658 659 678 713 765 789 807 809 818 842 847 886 906 955 1007 1081 1099 1104 1110 1135 1182 1236 1250 1288 1295 1345 1350 1354 1382 1388 1389 1401 1403 1405 1413 1430 1439 1442 1471 1484 87 305 1207 196 1069 1194 498 1047 649 123 171 199 209 855 250 843 320 656 1009 922 1052 1165 31 1019 1101 143 1393 911 354 274 534 472 602 707 851 1256 1264 182 216 238 266 313 383 406 470 724 766 773 860 915 919 956 1002 1030 1055 1056 1063 1144 1147 1198 1210 1272 1314 1322 1327 1446 1452 992 48 331 481 61 1435 179 993 1352 134 442 417 907 1015 1353 270 1496 791 918 981 4 637 504 910 212 1235 373 444 723 625 1370 459 735 690 27 41 234 251 279 439 563 672 681 683 739 782 1029 1092 1125 1143 1196 1202 1232 1300 1349 1486 40 502 82 146 285 370 519 557 781 811 830 837 1045 1048 1091 1203 1223 1229 1242 1455 91 22 205 496 545 593 715 741 751 825 861 1170 1173 1234 1244 1276 138 1064 721 23 67 75 345 359 361 445 590 608 719 743 767 864 873 1061 1131 1304 1334 1373 1481 47 385 1337 730 824 335 1159 1381 630 810 1072 916 954 10 336 155 364 982 170 786 931 327 509 19 610 1039 1306 930 680 574 1311 1485 1375 35 1037 255 624 17 20 49 55 110 221 277 310 407 517 839 876 1323 1368 57 921 836 1416 59 901 1126 430 790 997 78 572 1036 79 194 848 100 1008 838 780 119 620 660 262 328 947 1200 1257 622 267 1190 857 74 655 1317 1468 164 729 905 564 1326 1328 1174 1453 3 772 418 1231 950 1312 1338 1491 1076 1053 
[Figure: dendrograms of individual trajectories (panels: LA, Prof). Recoverable caption: Professional trajectories, Ward method (OM distances, indel = 1, substitution costs based on transition rates). Horizontal axis: individual trajectories; vertical axis: height, ticks from 5000 to 20000.]
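The dendrogram caption mentions OM distances with an indel cost of 1 and substitution costs based on transition rates. The following is a minimal sketch of how such a distance matrix could be computed, not the computation used for the slides. It assumes the substitution cost rule cost(a, b) = 2 − p(a|b) − p(b|a); the state codes and the toy sequences are hypothetical.

```python
from itertools import product


def transition_rates(sequences, states):
    """Estimate p(b | a): share of transitions leaving state a that go to b."""
    counts = {(a, b): 0 for a, b in product(states, repeat=2)}
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
    rates = {}
    for a in states:
        total = sum(counts[(a, b)] for b in states)
        for b in states:
            rates[(a, b)] = counts[(a, b)] / total if total else 0.0
    return rates


def substitution_costs(rates, states):
    """Assumed rule: cost(a, b) = 2 - p(b | a) - p(a | b), and 0 when a == b."""
    return {
        (a, b): 0.0 if a == b else 2.0 - rates[(a, b)] - rates[(b, a)]
        for a, b in product(states, repeat=2)
    }


def om_distance(x, y, subst, indel=1.0):
    """Optimal matching distance by dynamic programming (weighted edit distance)."""
    rows, cols = len(x) + 1, len(y) + 1
    table = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * indel
    for j in range(1, cols):
        table[0][j] = j * indel
    for i in range(1, rows):
        for j in range(1, cols):
            table[i][j] = min(
                table[i - 1][j] + indel,  # delete x[i - 1]
                table[i][j - 1] + indel,  # insert y[j - 1]
                table[i - 1][j - 1] + subst[(x[i - 1], y[j - 1])],  # substitute
            )
    return table[-1][-1]


# Hypothetical toy data: F = full time, P = part time, H = at home.
states = ["F", "P", "H"]
sequences = ["FFFFPP", "FFHHHP", "FFFFFF", "HHHHPP"]
costs = substitution_costs(transition_rates(sequences, states), states)
distances = [[om_distance(a, b, costs) for b in sequences] for a in sequences]
```

A pairwise matrix like `distances` is the kind of input a Ward hierarchical clustering would take to produce a dendrogram.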


LA, State distribution by age, within cluster

[Figure: state distribution by age (A20 to A44) within each cluster; vertical axis: Freq., 0.0 to 1.0.]

  • Type 1: Conjugal Trajectories (16 %), n=235
  • Type 2: Parental Trajectories, Slow Transition (19 %), n=285
  • Type 3: Parental Trajectories, Fast Transition (48 %), n=714
  • Type 4: Nestalgic Trajectories (7 %), n=110
  • Type 5: Solo or Reconstituted Family Trajectories (11 %), n=159

States: Biological father and mother; One biological parent; One biological parent with her/his partner; Alone; With partner; Partner and biological child; Partner and non biological child; Biological child and no partner; Friends; Other.


LA, Most frequent sequences by cluster

[Figure: ten most frequent sequences in each cluster, ages A20 to A44, with their frequencies.]

  • Type 1: Conjugal Trajectories (16 %), n=235: 3.4%, 3.4%, 3%, 2.5%, 2.5%, 2.1%, 2.1%, 2.1%, 2.1%, 1.7%
  • Type 2: Parental Trajectories, Slow Transition (19 %), n=285: 1.1%, 1.1%, 1.1%, 1.1%, 0.7%, 0.7%, 0.7%, 0.7%, 0.7%, 0.7%
  • Type 3: Parental Trajectories, Fast Transition (48 %), n=714: 4.5%, 3.5%, 2.5%, 2.4%, 2.4%, 2.2%, 2%, 1.8%, 1.7%, 1.5%
  • Type 4: Nestalgic Trajectories (7 %), n=110: 61.8%, 1.8%, 1.8%, 0.9%, 0.9%, 0.9%, 0.9%, 0.9%, 0.9%, 0.9%
  • Type 5: Solo or Reconstituted Family Trajectories (11 %), n=159: 3.1%, 1.9%, 1.9%, 1.9%, 1.9%, 1.3%, 1.3%, 1.3%, 1.3%, 1.3%

States: Biological father and mother; One biological parent; One biological parent with her/his partner; Alone; With partner; Partner and biological child; Partner and non biological child; Biological child and no partner; Friends; Other.


LA, Sequence diversity within cluster


LA, Birth year distribution by cluster

[Figure: birth year density by cluster; horizontal axis: birth year, 1910 to 1960; vertical axis: density, 0.00 to 0.06. Panels: Type 1: Conjugal Trajectories (16 %); Type 2: Parental Trajectories, Slow Transition (19 %); Type 3: Parental Trajectories, Fast Transition (48 %); Type 4: Nestalgic Trajectories (7 %); Type 5: Solo or Reconstituted Family Trajectories (11 %); Overall.]


Prof, State distribution by age, within cluster

[Figure: state distribution by age (A20 to A44) within each cluster; vertical axis: Freq., 0.0 to 1.0.]

  • Type 1: Full Time Trajectories (53 %), n=795
  • Type 2: Mixed Part Time − Home Trajectories (13 %), n=155
  • Type 3: At Home Trajectories (16 %), n=277
  • Type 4: Part Time Trajectories (7 %), n=101
  • Type 5: Missing Data (11 %), n=175

States: Missing; Full time; Part time; Negative break; Positive break; At home; Retired; Education.


Prof, Most frequent sequences by cluster

[Figure: ten most frequent sequences in each cluster, ages A20 to A44, with their frequencies.]

  • Type 1: Full Time Trajectories (53 %), n=795: 56.6%, 8.4%, 3.8%, 2.8%, 2.4%, 2.3%, 1.9%, 1.8%, 0.8%, 0.6%
  • Type 2: Mixed Part Time − Home Trajectories (13 %), n=155: 2.6%, 2.6%, 2.6%, 1.9%, 1.9%, 1.9%, 1.9%, 1.9%, 1.3%, 1.3%
  • Type 3: At Home Trajectories (16 %), n=277: 5.4%, 4%, 4%, 3.2%, 3.2%, 3.2%, 2.9%, 2.2%, 2.2%, 2.2%
  • Type 4: Part Time Trajectories (7 %), n=101: 5%, 5%, 5%, 4%, 2%, 2%, 2%, 2%, 2%, 2%
  • Type 5: Missing Data (11 %), n=175: 34.9%, 4.6%, 3.4%, 3.4%, 2.9%, 2.3%, 2.3%, 1.7%, 1.7%, 1.7%

States: Missing; Full time; Part time; Negative break; Positive break; At home; Retired; Education.


Prof, Sequence diversity within cluster


Prof, Birth year distribution by cluster

[Figure: birth year density by cluster; horizontal axis: birth year, 1910 to 1960; vertical axis: density, 0.00 to 0.06. Panels: Type 1: Full Time Trajectories (53 %); Type 2: Mixed Part Time − Home Trajectories (13 %); Type 3: At Home Trajectories (16 %); Type 4: Part Time Trajectories (7 %); Type 5: Missing Data (11 %); Overall.]

Heterogeneity analysis and sequence discrimination

Heterogeneity of set of sequences

Sum of squares can be expressed in terms of the distances between each pair of points:

$$SS = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=i+1}^{n} (y_i - y_j)^2 = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=i+1}^{n} d_{ij}$$

Setting $d_{ij}$ to the OM, LCP, LCS, ... distance, we get a measure of diversity or heterogeneity of sequences.

Can apply ANOVA principle to sequences.
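The identity above can be checked numerically, and the same pairwise form gives a heterogeneity measure for any distance matrix. The sketch below is illustrative only: the function names are hypothetical, and the pseudo-R² is one possible reading of the "ANOVA principle" (share of the total sum of squares not left within groups), not necessarily the statistic used in the talk.

```python
def sum_of_squares(values):
    """Classical SS: squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def discrepancy_ss(dist):
    """Pairwise form: SS = (1/n) * sum over pairs i < j of d_ij."""
    n = len(dist)
    return sum(dist[i][j] for i in range(n) for j in range(i + 1, n)) / n


def pseudo_r2(dist, groups):
    """Assumed ANOVA-like ratio: (SS_total - SS_within) / SS_total."""
    ss_total = discrepancy_ss(dist)
    ss_within = 0.0
    for g in set(groups):
        idx = [i for i, label in enumerate(groups) if label == g]
        sub = [[dist[i][j] for j in idx] for i in idx]
        ss_within += discrepancy_ss(sub)
    return (ss_total - ss_within) / ss_total


# Check the identity on ordinary numbers, with d_ij = (y_i - y_j)^2.
y = [1.0, 2.0, 4.0, 7.0]
d = [[(a - b) ** 2 for b in y] for a in y]
ss_classical = sum_of_squares(y)
ss_pairwise = discrepancy_ss(d)
r2 = pseudo_r2(d, ["a", "a", "b", "b"])
```

With sequences, `d` would hold OM (or LCP, LCS, ...) distances instead of squared differences, and `groups` could be a covariate such as sex.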


Heterogeneity analysis

Professional trajectories by sex


Sequence Tree

Multidimensional Scaling representation of sequences

Multidimensional Scaling: Principle

Let D be a distance matrix between sequences, computed using the OM, LCP, LCS, ... metrics. Multidimensional Scaling consists in

  • Finding a set of real valued variables $(f_1, f_2)$ such that the $\delta_{ij} = \sqrt{(f_{i1} - f_{j1})^2 + (f_{i2} - f_{j2})^2}$ best approximate the distances between sequences.
  • Plotting the points in the $(f_1, f_2)$ space.
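As an illustration of the principle, here is a minimal sketch of classical (Torgerson) scaling, which is one way among several to obtain such coordinates; the slides do not say which MDS variant was used. It assumes the two leading eigenvalues of the double-centered matrix are positive and dominant, which need not hold for every sequence distance. All function names are hypothetical.

```python
import math


def matvec(mat, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]


def double_center(dist):
    """B = -1/2 * J * D^2 * J, with J the centering matrix."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in sq]
    total = sum(row) / n
    return [[-0.5 * (sq[i][j] - row[i] - row[j] + total) for j in range(n)]
            for i in range(n)]


def top_eigenpair(mat, iterations=500):
    """Dominant eigenvalue and unit eigenvector by power iteration."""
    vec = [math.sin(i + 1.0) for i in range(len(mat))]  # arbitrary start vector
    for _ in range(iterations):
        nxt = matvec(mat, vec)
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            return 0.0, vec
        vec = [v / norm for v in nxt]
    value = sum(a * b for a, b in zip(vec, matvec(mat, vec)))  # Rayleigh quotient
    return value, vec


def classical_mds(dist, dims=2):
    """Coordinates f_1, ..., f_dims whose Euclidean distances approximate dist."""
    b = double_center(dist)
    n = len(b)
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        value, vec = top_eigenpair(b)
        scale = math.sqrt(max(value, 0.0))
        for i in range(n):
            coords[i][k] = scale * vec[i]
        # Deflate so the next pass finds the following eigenpair.
        b = [[b[i][j] - value * vec[i] * vec[j] for j in range(n)] for i in range(n)]
    return coords


# Sanity check on points that really live in a plane: distances are recovered.
points = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]
dist = [[math.dist(p, q) for q in points] for p in points]
coords = classical_mds(dist)
```

For sequence data, `dist` would be the OM (or other) distance matrix, and the two columns of `coords` are what gets plotted as MDS axis 1 and MDS axis 2.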


Multidimensional Scaling

[Figure: two multidimensional scaling plots, MDS axis 1 versus MDS axis 2, points colored by cluster.]

  • Cohabitation clusters: Type 1: Conjugal Trajectories (16 %); Type 2: Parental Trajectories, Slow Transition (19 %); Type 3: Parental Trajectories, Fast Transition (48 %); Type 4: Nestalgic Trajectories (7 %); Type 5: Solo or Reconstituted Family Trajectories (11 %)
  • Professional trajectory clusters: Type 1: Full Time Trajectories (53 %); Type 2: Mixed Part Time − Home Trajectories (13 %); Type 3: At Home Trajectories (16 %); Type 4: Part Time Trajectories (7 %); Type 5: Missing Data (11 %)


Mining Frequent Episodes

(Time stamped) event sequences

What can we expect from frequent episodes mining?

  • GSP (Srikant and Agrawal, 1996)
  • MINEPI, WINEPI (Mannila et al., 1997)
  • TCG, TAG (Bettini et al., 1996)
  • SPADE (Zaki, 2001)

Are there specific issues when applying these methods in social sciences?

What Is It About?

Frequent episodes. What is it?

Episode: Collection of events occurring frequently together.

Mining typical (frequent) episodes:

  • Specialized case of mining frequent itemsets.
  • Time dimension ⇒ Partially ordered events.

More complex than unordered itemsets: User must

  • specify time constraints (and episode structure constraints).
  • select a counting method.


Episode structure constraints

For people who leave home within 2 years of age 17, what are the typical events occurring until they get married and have a first child?

[Diagram: episode graph with node labels LH,17, ??, C1 and M; window labels w = 2 and w = 1; edge labels (0, 4), (0, 3) and (0, 1, 10). Annotations: elastic event constraints, parallel node constraint, edge constraints.]


Counting methods (Joshi et al., 2001)

Time axis: 20 21 22 23 24. Events: U U U C C C.

Searching (U,C) with min gap = 1, max gap = 2, win size = 2.

  • indiv. with episode: COBJ = 1
  • windows with episode: CWIN = 3
  • min win. with episode: CminWIN = 2
  • distinct occurrences: CDIS_o = 5
  • dist. occ. without overlap: CDIS = 3
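The five counts can be reproduced with a short script. This is a sketch under stated assumptions: the slide does not give the exact event times, so U is placed at 20, 21, 22 and C at 22, 23, 24, a placement that happens to be consistent with all five reported counts. The counting rules in the comments are simplified readings of the methods of Joshi et al. (2001), not their formal definitions.

```python
from itertools import product

# Assumed placement of the events on the 20..24 axis (not given explicitly).
u_times = [20, 21, 22]
c_times = [22, 23, 24]
MIN_GAP, MAX_GAP, WIN_SIZE = 1, 2, 2
T_START, T_END = 20, 24

# All (U, C) pairs whose gap respects the min/max gap constraints.
occurrences = [(u, c) for u, c in product(u_times, c_times)
               if MIN_GAP <= c - u <= MAX_GAP]

# COBJ: does the individual show the episode at least once?
c_obj = 1 if occurrences else 0

# CWIN: windows [t, t + WIN_SIZE] inside the observation span
# that contain at least one occurrence.
c_win = sum(
    any(t <= u and c <= t + WIN_SIZE for u, c in occurrences)
    for t in range(T_START, T_END - WIN_SIZE + 1)
)

# CminWIN: occurrence intervals that contain no other occurrence interval.
intervals = set(occurrences)
c_min_win = sum(
    not any(o != iv and iv[0] <= o[0] and o[1] <= iv[1] for o in intervals)
    for iv in intervals
)

# CDIS_o: distinct occurrences, an event may take part in several of them.
c_dis_o = len(occurrences)

# CDIS: occurrences sharing no event; greedy pairing of each U
# with the earliest still unused C that satisfies the gap constraints.
free_c = sorted(c_times)
c_dis = 0
for u in sorted(u_times):
    match = next((c for c in free_c if MIN_GAP <= c - u <= MAX_GAP), None)
    if match is not None:
        free_c.remove(match)
        c_dis += 1

counts = {"COBJ": c_obj, "CWIN": c_win, "CminWIN": c_min_win,
          "CDIS_o": c_dis_o, "CDIS": c_dis}
```

The same sequence thus yields five different supports, which is why the counting method has to be chosen with the type of event in mind.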


Example: Counting alternate structures (COBJ, no max gap)

[Figure: bar chart of the frequency of each alternate structure; vertical axis: 0% to 30%. Categories:]

  • Child<Marriage, Marriage<Child, Child=Marriage
  • Child<Job, Job<Child, Child=Job
  • Child<Educ end, Educ end<Child, Child=Educ end
  • Marriage<Job, Job<Marriage, Marriage=Job
  • Marriage<Educ end, Educ end<Marriage, Marriage=Educ end
  • Job<Educ end, Educ end<Job, Job=Educ end

Switzerland, SHP 2002 biographical survey (n = 5560).
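Counting alternate structures with COBJ and no maximum gap amounts to asking, for each pair of events, in what order each individual experienced them. The sketch below uses three hypothetical individuals (not SHP data) and assumes shares are taken over all individuals, including those who did not experience both events.

```python
from collections import Counter

# Hypothetical ages at which three individuals experienced each event.
people = [
    {"Marriage": 25, "Child": 27, "Job": 22},
    {"Marriage": 30, "Child": 30, "Job": 24},
    {"Marriage": 28, "Child": 26},  # no Job event recorded
]


def order_shares(people, a, b):
    """COBJ-style shares: each individual counts at most once per structure."""
    counts = Counter()
    for events in people:
        if a in events and b in events:
            if events[a] < events[b]:
                counts[f"{a}<{b}"] += 1
            elif events[a] > events[b]:
                counts[f"{b}<{a}"] += 1
            else:
                counts[f"{a}={b}"] += 1
    return {key: counts[key] / len(people) for key in counts}


shares = order_shares(people, "Child", "Marriage")
job_shares = order_shares(people, "Child", "Job")
```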

Frequent and discriminant episodes

Frequent episodes, cohabitational and professional trajectories

[Figure: two bar charts of episode frequencies.]

Cohabitational trajectories (frequency axis 0.0 to 0.6), episodes shown:

  • (Biological father and mother)
  • (With partner>Partner and biological child)
  • (Biological father and mother)−(With partner>Partner and biological child)
  • (Alone>With partner)
  • (Biological father and mother>With partner)
  • (Biological father and mother)−(Biological father and mother>With partner)
  • (Biological father and mother>Alone)
  • (Biological father and mother)−(Biological father and mother>Alone)
  • (Biological father and mother>Partner and biological child)
  • (Biological father and mother)−(Biological father and mother>Partner and biological child)
  • (Biological father and mother>With partner)−(With partner>Partner and biological child)
  • (Biological father and mother)−(Biological father and mother>With partner)−(With partner>Partner and biological child)
  • (Alone>With partner)−(With partner>Partner and biological child)
  • (Biological father and mother)−(Alone>With partner)
  • (Biological father and mother>Alone)−(Alone>With partner)

Professional trajectories (frequency axis 0.0 to 0.5), episodes shown:

  • (Full time)
  • (Education)
  • (Education>Full time)
  • (Education)−(Education>Full time)
  • (Full time>At home)
  • (Full time)−(Full time>At home)
  • (Full time>Part time)
  • (At home>Part time)
  • (Missing)
  • (Full time)−(Full time>Part time)
  • (Full time>At home)−(At home>Part time)
  • (Full time)−(At home>Part time)
  • (Part time>Full time)
  • (Full time)−(Full time>At home)−(At home>Part time)


Discriminant episodes, professional trajectories, women

[Figure: six bar charts, one per discriminant episode, showing its frequency in each cluster.]

  • (Full time>At home): clusters FullTime, Mixed, AtHome, PartTime; axis 0.0 to 0.7
  • (Full time)−(Full time>At home): clusters FullTime, Mixed, AtHome, PartTime; axis 0.0 to 0.5
  • (At home>Part time): clusters FullTime, Mixed, AtHome, PartTime; axis 0.0 to 0.4
  • (Full time>Part time): clusters FullTime, Mixed, AtHome, PartTime, Missing; axis 0.0 to 0.5
  • (Full time>At home)−(At home>Part time): clusters Mixed, AtHome, PartTime; axis 0.00 to 0.30
  • (Full time)−(At home>Part time): clusters Mixed, AtHome, PartTime; axis 0.00 to 0.25


Discriminant episodes, professional trajectories, men

[Figure: six bar charts, one per discriminant episode, showing its frequency in each cluster.]

  • (Missing): clusters FullTime, Mixed, Missing; axis 0.0 to 0.4
  • (Education>Missing): clusters FullTime, Mixed, PartTime, Missing; axis 0.00 to 0.35
  • (Full time): clusters FullTime, Mixed, PartTime, Missing; axis 0.0 to 0.6
  • (Education>Full time): clusters FullTime, Mixed, PartTime, Missing; axis 0.00 to 0.35
  • (Education)−(Education>Full time): clusters FullTime, Mixed, PartTime, Missing; axis 0.00 to 0.30
  • (Education): clusters FullTime, Mixed, PartTime, Missing; axis 0.0 to 0.7


Summary

Data mining approaches (survival trees, clustering sequences, frequent episodes) have a promising future in life course analysis.

They complement classical statistical outcomes with new insights.

Their use within social sciences raises specific issues:

  • Accounting for multi-level effects when growing survival trees or mining association rules.
  • Handling time varying predictors in survival trees.
  • Selecting relevant counting methods (event dependent)?
  • Suitable criteria for measuring association strength between frequent episodes.
  • ...


Our TraMineR R-package

Let me finish with an Add ...

TraMineR, a free life trajectory mining tool for the free open source R statistical environment.

  • Downloadable from http://cran.r-project.org (CRAN)
  • See also http://mephisto.unige.ch/biomining


Thank You!

Appendix: Zoomed tree

Divorce, Switzerland, Differences in KM Survival Curves I

Appendix: For Further Reading

Abbott, A. and J. Forrest (1986). Optimal matching methods for historical sequences. Journal of Interdisciplinary History 16, 471–494.

Bettini, C., X. S. Wang, and S. Jajodia (1996). Testing complex temporal relationships involving multiple granularities and its application to data mining (extended abstract). In PODS ’96: Proceedings of the fifteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, New York, pp. 68–78. ACM Press.

Billari, F. C. (2005). Life course analysis: Two (complementary) cultures? Some reflections with examples from the analysis of transition to adulthood. In P. Ghisletta, J.-M. Le Goff, R. Levy, D. Spini, and E. Widmer (Eds.), Towards an Interdisciplinary Perspective on the Life Course, Advances in Life Course Research, Vol. 10, pp. 267–288. Amsterdam: Elsevier.

Blossfeld, H.-P. and G. Rohwer (2002). Techniques of Event History Modeling, New Approaches to Causal Analysis (2nd ed.). Mahwah NJ: Lawrence Erlbaum.

Elzinga, C. H. (2008). Sequence analysis: Metric representations of categorical time series. Sociological Methods and Research. Forthcoming.

Elzinga, C. H. and A. C. Liefbroer (2007). De-standardization of family-life trajectories of young adults: A cross-national comparison using sequence analysis. European Journal of Population 23, 225–250.

Gauthier, J.-A. (2007). Empirical categorizations of social trajectories: A sequential view on the life course. Thèse, Université de Lausanne, Faculté des sciences sociales et politique (SSP), Lausanne.

Huang, X., S. Chen, and S. Soong (1998). Piecewise exponential survival trees with time-dependent covariates. Biometrics 54, 1420–1433.

Joshi, M. V., G. Karypis, and V. Kumar (2001). A universal formulation of sequential patterns. In Proceedings of the KDD’2001 workshop on Temporal Data Mining, San Francisco, August 2001.

Leblanc, M. and J. Crowley (1992). Relative risk trees for censored survival data. Biometrics 48, 411–425.

Levenshtein, V. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10, 707–710.

Mannila, H., H. Toivonen, and A. I. Verkamo (1997). Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery 1(3), 259–289.

Needleman, S. and C. Wunsch (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48, 443–453.

Segal, M. R. (1988). Regression trees for censored data. Biometrics 44, 35–47.

Segal, M. R. (1992). Tree-structured methods for longitudinal data. Journal of the American Statistical Association 87(418), 407–418.

Srikant, R. and R. Agrawal (1996). Mining sequential patterns: Generalizations and performance improvements. In P. M. G. Apers, M. Bouzeghoub, and G. Gardarin (Eds.), Advances in Database Technologies – 5th International Conference on Extending Database Technology (EDBT’96), Avignon, France, Volume 1057, pp. 3–17. Springer-Verlag.

Zaki, M. J. (2001). SPADE: An efficient algorithm for mining frequent sequences. Machine Learning 42(1/2), 31–60.