Extracting knowledge from life courses: clustering and visualization

SLIDE 1

Extracting knowledge from life courses

Extracting knowledge from life courses: clustering and visualization¹

Nicolas S. Müller, Alexis Gabadinho, Gilbert Ritschard, Matthias Studer

Department of Econometrics, University of Geneva

10th International Conference on Data Warehousing and Knowledge Discovery, Torino 2008

¹ This study was carried out within the Swiss National Science Foundation project SNSF 100012-113998/1.

12/9/2008 nsm 1/34

SLIDE 2

Outline

1. Introduction to the life course perspective
2. Working with life course data
3. Familial life course analysis
4. Visualization
5. Conclusion

SLIDE 3

Introduction to the life course perspective / Sociological theory

Individual life course paradigm.

Following macro quantities (e.g. number of divorces, fertility rate, mean education level, ...) over time is insufficient for understanding social behavior.
We need to follow individual life courses.
The life course must be seen as a "whole", not only as separate events.

Data availability for familial life courses

Large panel surveys in many countries (SHP, CHER, SILC, GGP, ...).
Biographical retrospective surveys (FFS, ...).
Statistical matching of censuses, population registers and other administrative data.

SLIDE 4

Introduction to the life course perspective / Sociological theory

An example: my academic life

My academic life as an example of a life course:
In 2006, I received a master's degree in sociology.
In 2006, I began working as a research assistant at the Department of Econometrics.
In 2007, I began working as a teaching assistant at the Department of Econometrics (statistics for social sciences).
In 2008, I received a master's degree in information systems.
This is why I am here today, presenting a study that mixes algorithms, statistics and sociology.

SLIDE 5

Introduction to the life course perspective / Sociological theory

What are we looking for?

We wanted to see how typical life courses evolved through the 20th century.
We created a typology of familial life courses in order to verify some sociological hypotheses.
We decided to use sequence analysis in order to be consistent with the life course paradigm.

SLIDE 6

Working with life course data / Data structures

How can we represent a life course?

SLIDE 7

Working with life course data / Data structures

Alternative views of Individual Longitudinal Data

Table: Time-stamped event sequence

leaving home in 1970
marriage in 1971
first child in 1973

Table: State sequence view

year        1969  1970  1971  1972  1973
left home   no    yes   yes   yes   yes
is married  no    no    yes   yes   yes
has child   no    no    no    no    yes
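The two views carry the same information. A minimal sketch of the conversion from time-stamped events to yearly yes/no states, in Python; the function name and the dictionary keys are illustrative choices, not part of the study:

```python
def events_to_states(event_years, first_year, last_year):
    """For each event, return one flag per year: True once the event has occurred."""
    years = range(first_year, last_year + 1)
    return {
        event: [year >= occurred for year in years]
        for event, occurred in event_years.items()
    }

# Time-stamped events of the example individual
events = {"left home": 1970, "is married": 1971, "has child": 1973}
states = events_to_states(events, 1969, 1973)

for name, flags in states.items():
    print(f"{name:<11}", " ".join("yes" if f else "no " for f in flags))
```

An event that never occurs is simply left out of the dictionary in this sketch.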

SLIDE 8

Working with life course data / From events to states

To create a single sequence per individual, we define one state per combination of events that have or have not occurred.

state  LHome   marriage  childbirth  divorce
0      no      no        no          no
1      yes     no        no          no
2      no      yes       yes/no      no
3      yes     yes       no          no
4      no      no        yes         no
5      yes     no        yes         no
6      yes     yes       yes         no
7      yes/no  yes       yes/no      yes

SLIDE 9

Working with life course data / From events to states

The previous example can then be translated into a single sequence

Table: State sequence view

individual  1969  1970  1971  1972  1973
id1         0     1     3     3     6
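The coding of the previous slide can be sketched as a small Python function. This is an illustration only: the function name is hypothetical, and it assumes that the combination with no event at all is coded 0, as the example sequences with state "0" suggest.

```python
def state_code(left_home, married, child, divorced):
    """Map the four yes/no event flags to a single state code (0 to 7)."""
    if divorced:
        return 7                      # divorced, whatever the other flags
    if married:
        if not left_home:
            return 2                  # married, has not left home (child: yes or no)
        return 6 if child else 3      # left home and married, with or without child
    # neither married nor divorced
    return {(False, False): 0, (True, False): 1,
            (False, True): 4, (True, True): 5}[(left_home, child)]

# Example individual: leaves home in 1970, marries in 1971, first child in 1973
events = {"left_home": 1970, "married": 1971, "child": 1973}
sequence = []
for year in range(1969, 1974):
    flags = {name: year >= when for name, when in events.items()}
    sequence.append(state_code(flags["left_home"], flags["married"],
                               flags["child"], divorced=False))
print(sequence)  # [0, 1, 3, 3, 6]
```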

SLIDE 10

Working with life course data / Methods

Analysis of sequences

Frequencies of given subsequences

Essentially event sequences. Subsequences considered as categories ⇒ Methods for categorical data apply (Frequencies, cross tables, log-linear models, logistic regression, ...).

Markov chain models

State sequences. Focuses on transition rates between states. Does the rate also depend on previous states? How many previous states are significant?

Optimal Matching

Based on the Levenshtein distance (edit distance between pairs of sequences).
Applies to state sequences.
Allows the clustering of sequences.

SLIDE 11

Working with life course data / Methods

Distances between sequences

Levenshtein distance (known as Optimal matching in Social sciences)

d(x, y): total cost of the insertions, deletions and substitutions required to transform sequence x into sequence y. For example:

Sequence x is "0-0-0-1-3" and sequence y is "0-0-1-1".
If a substitution costs 2 and an insertion or deletion (indel) costs 1, d(x, y) = 3 (delete "3", substitute the third "0" by "1").

Different solutions are obtained depending on the indel and substitution costs.
We can attribute specific substitution costs to each pair of states.
Details of the algorithm are in the paper (Needleman-Wunsch algorithm).
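The distance can be computed by dynamic programming. The sketch below is a generic edit-distance recurrence written for illustration, not the implementation used in the study; `om_distance` and its parameters are hypothetical names, and the default substitution cost of 2 is taken from the example above.

```python
def om_distance(x, y, indel=1.0, sub_cost=lambda a, b: 2.0):
    """Optimal matching (generalized Levenshtein) distance between two sequences."""
    n, m = len(x), len(y)
    # d[i][j] = minimal cost of turning the first i states of x
    # into the first j states of y
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel
    for j in range(1, m + 1):
        d[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = x[i - 1] == y[j - 1]
            d[i][j] = min(
                d[i - 1][j] + indel,      # delete x[i-1]
                d[i][j - 1] + indel,      # insert y[j-1]
                d[i - 1][j - 1] + (0.0 if same else sub_cost(x[i - 1], y[j - 1])),
            )
    return d[n][m]

x = [0, 0, 0, 1, 3]
y = [0, 0, 1, 1]
print(om_distance(x, y))  # 3.0
```

Passing a different `sub_cost` function gives state-specific substitution costs.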

SLIDE 12

Familial life course analysis / Data source

Presentation of the “BioFam” data

Data from the retrospective survey conducted in 2002 by the Swiss Household Panel (SHP), with the support of the Federal Statistical Office, the Swiss National Fund for Scientific Research and the University of Neuchâtel.
Retrospective survey: 5560 individuals.
Retained familial life events: leaving home, first childbirth, first marriage and first divorce.
Age 15 to 30 → 4318 remaining individuals, born between 1909 and 1972.

SLIDE 13

Familial life course analysis / Optimal matching method

Application to the familial life courses data

1. Creation of sequences of states
2. Optimal matching analysis
   Indel costs were fixed at 1.
   Substitution costs were based on the transition rates:
   $c[w(i, j)] = c[w(j, i)] = 2 - p(i_t \mid j_{t-1}) - p(j_t \mid i_{t-1})$
   We compute the distance between each pair of sequences.
3. The resulting distance matrix is used in an agglomerative cluster analysis (Ward method).
4. Visualization and interpretation of the results with specific plots.
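The transition-based substitution cost of step 2 can be sketched as follows. This is an illustration under simple assumptions: transition rates are estimated from all consecutive pairs of states, unobserved transitions get rate 0, and the toy sequences are made up (they are not BioFam data). The function names are hypothetical.

```python
from collections import Counter

def transition_rates(sequences):
    """Estimate p(b at t | a at t-1) from all consecutive state pairs."""
    pair_counts = Counter()
    from_counts = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pair_counts[(a, b)] += 1
            from_counts[a] += 1
    # key (a, b) holds the rate of moving from state a to state b
    return {(a, b): n / from_counts[a] for (a, b), n in pair_counts.items()}

def substitution_cost(i, j, rates):
    """Cost 2 - p(i_t | j_{t-1}) - p(j_t | i_{t-1}); identical states cost 0."""
    if i == j:
        return 0.0
    return 2.0 - rates.get((j, i), 0.0) - rates.get((i, j), 0.0)

# Toy sequences, invented for illustration only
toy = [[0, 0, 1, 3], [0, 1, 1, 3], [0, 0, 0, 1]]
rates = transition_rates(toy)
print(substitution_cost(0, 1, rates))  # 1.5
```

States that often follow one another are cheap to substitute; states never observed in succession keep the maximal cost of 2.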

SLIDE 14

Familial life course analysis / Clustering

[Figure: Dendrogram of optimal matching distances (indel 1); the leaves are the individual sequences; axis: Height]

SLIDE 15

Visualization / Density plots

Description

A density plot shows the proportion of individuals in each state at each age.
It presents aggregated data, so it is not really suitable for a life course interpretation.
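The quantities behind a density plot are simple cross-sectional proportions. A minimal Python sketch, assuming all sequences have the same length and are aligned on age; the function name and the toy data are illustrative:

```python
from collections import Counter

def state_distribution(sequences):
    """For each time position, return {state: proportion of individuals in it}."""
    n = len(sequences)
    return [
        {state: count / n for state, count in Counter(column).items()}
        for column in zip(*sequences)   # one column per age
    ]

# Toy sequences, invented for illustration only
toy = [[0, 0, 1, 3], [0, 1, 1, 3], [0, 0, 0, 1], [0, 1, 3, 6]]
for age, dist in enumerate(state_distribution(toy), start=15):
    print(age, dist)
```

Each column is summarized independently, which is why the individual trajectories are no longer visible in such a plot.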

SLIDE 16

Visualization / Density plots

Density plots (1/2)

[Figure: density plots for Cluster 1 (n=1357) and Cluster 2 (n=854); horizontal axis: age (a15 to a29); vertical axis: Freq. (0.0 to 1.0)]

SLIDE 17

Visualization / Density plots

Density plots (2/2)

[Figure: density plots for Cluster 3 (n=705) and Cluster 4 (n=1402); horizontal axis: age (a15 to a29); vertical axis: Freq. (0.0 to 1.0)]

SLIDE 18

Visualization / Density plots

Frequency plots

Plot of the n most frequent sequences.
Individual life sequences are plotted.
The wider the bar representing the sequence, the more frequent it is.
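The bar widths come from a plain frequency count of whole sequences. A minimal Python sketch (illustrative function name, made-up toy data):

```python
from collections import Counter

def most_frequent(sequences, n=10):
    """Return the n most frequent sequences with their relative frequencies."""
    counts = Counter(tuple(seq) for seq in sequences)
    total = len(sequences)
    return [(seq, count / total) for seq, count in counts.most_common(n)]

# Toy sequences, invented for illustration only
toy = [[0, 0, 1], [0, 0, 1], [0, 1, 1], [0, 0, 1], [0, 1, 3]]
for seq, share in most_frequent(toy, n=2):
    print(seq, f"{share:.0%}")
```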

SLIDE 19

Visualization / Frequency plots

Frequency plots (1/2)

[Figure: frequency plots of the 10 most frequent sequences in each cluster; horizontal axis: age (a15 to a29)
Cluster 1 (n=1357): 3%, 2.5%, 2.4%, 2.3%, 2.3%, 2.2%, 2.2%, 2.1%, 1.8%, 1.7%
Cluster 2 (n=854): 37.9%, 4.7%, 3.8%, 3.6%, 3.6%, 3.6%, 3.5%, 3.3%, 3.2%, 2.8%]

SLIDE 20

Visualization / Frequency plots

Frequency plots (2/2)

[Figure: frequency plots of the 10 most frequent sequences in each cluster; horizontal axis: age (a15 to a29)
Cluster 3 (n=705): 18.2%, 13.9%, 10.9%, 10.2%, 6.4%, 5.4%, 5%, 3%, 2%, 1.7%
Cluster 4 (n=1402): 4.7%, 4.5%, 4.3%, 3.5%, 2.4%, 2.4%, 2%, 1.9%, 1.7%, 1.6%]

SLIDE 21

Visualization / Indexplots

Index plots

Each sequence is represented by a stacked bar (or line).
Plot the first n sequences (not necessarily the most frequent).
Sequences are sorted by their edit distance to the most frequent sequence.
Index plots of all sequences show the diversity of the sequences.
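The ordering used for an index plot can be sketched as follows. The slide does not state which costs are used for the sorting, so this Python example assumes a unit-cost edit distance; the function names and toy data are illustrative.

```python
from collections import Counter

def edit_distance(x, y):
    """Unit-cost Levenshtein distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        curr = [i]
        for j, b in enumerate(y, start=1):
            curr.append(min(prev[j] + 1,               # delete a
                            curr[j - 1] + 1,           # insert b
                            prev[j - 1] + (a != b)))   # substitute, or match at cost 0
        prev = curr
    return prev[-1]

def index_plot_order(sequences):
    """Sort sequences by edit distance to the most frequent sequence."""
    reference = Counter(tuple(s) for s in sequences).most_common(1)[0][0]
    return sorted(sequences, key=lambda s: edit_distance(s, reference))

# Toy sequences, invented for illustration only
toy = [[0, 1, 3], [0, 0, 1], [0, 0, 1], [0, 0, 0]]
print(index_plot_order(toy))  # [[0, 0, 1], [0, 0, 1], [0, 0, 0], [0, 1, 3]]
```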

SLIDE 22

Visualization / Indexplots

Indexplots (1/2)

SLIDE 23

Visualization / Indexplots

Indexplots (2/2)

SLIDE 24

Visualization / Indexplots

What can we learn from these clusters?

Using logistic regression modelling, we can identify cohort and gender effects in the cluster membership.

For example, women have an odds ratio of almost 2 for belonging to cluster 1, meaning that their odds of being in this cluster are about twice those of men.
The same holds for birth year: the older the individual, the higher the chance of being in cluster 1 ("classical" familial life courses).
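To make the notion of odds ratio concrete, here is a small Python computation on a 2x2 table. The counts are hypothetical, chosen only so that the ratio equals 2; they are not results from the study.

```python
import math

def odds_ratio(in_a, out_a, in_b, out_b):
    """Odds of cluster membership in group A divided by the odds in group B."""
    return (in_a / out_a) / (in_b / out_b)

# Hypothetical counts, invented for illustration (not the study's data)
women_in, women_out = 400, 600   # 40% of women in the cluster
men_in, men_out = 250, 750       # 25% of men in the cluster

ratio = odds_ratio(women_in, women_out, men_in, men_out)
print(round(ratio, 2))             # 2.0
print(round(math.log(ratio), 3))   # 0.693
```

In a logistic regression with gender as the only predictor, the exponential of the gender coefficient equals this odds ratio. Note that the proportions themselves (40% against 25%) differ by a factor of 1.6, not 2: an odds ratio compares odds, not probabilities.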

SLIDE 25

Visualization / Characteristics of sequences: Entropy

Definition

Entropy: measure of uncertainty regarding sequence predictability.

$p_i$: proportion of cases (or time points) in state $i$.

Shannon entropy: $h(p) = \sum_i -p_i \log_2(p_i)$

Other types of entropy: quadratic (Gini), Daróczy, ...

Two ways of using entropies.

Entropy of the state at each time (age) point: entropy increases with the diversity of states observed at each time point (age).
Entropy of each individual sequence: entropy increases with the diversity of states during the observed life course and varies with the time spent in each state.
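Both uses rely on the same formula, applied either to a column (all individuals at one age) or to a row (one individual over time). A minimal Python sketch; the function name and toy data are illustrative, and the optional division by the log of the number of possible states is one common normalization to the range 0 to 1, added here as an assumption:

```python
import math
from collections import Counter

def shannon_entropy(states, alphabet_size=None):
    """Shannon entropy (base 2) of the state proportions in `states`."""
    total = len(states)
    h = sum(-(c / total) * math.log2(c / total) for c in Counter(states).values())
    if alphabet_size and alphabet_size > 1:
        h /= math.log2(alphabet_size)   # optional normalization to [0, 1]
    return h

# Toy sequences, invented for illustration only
toy = [[0, 0, 1, 3], [0, 1, 1, 3], [0, 0, 0, 1], [0, 1, 3, 6]]

# 1. Entropy of the state distribution at each time (age) point
by_age = [shannon_entropy(column) for column in zip(*toy)]

# 2. Entropy of each individual sequence
by_individual = [shannon_entropy(seq) for seq in toy]

print(by_age)
print(by_individual)
```

At the first age everyone is in the same state, so the cross-sectional entropy is 0; it grows as the states diversify.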

SLIDE 26

Visualization / Characteristics of sequences: Entropy

Entropy of the state at each time (age) point

[Figure: Entropy by age, one curve per cluster (Cluster 1 to Cluster 4); horizontal axis: Age (A15 to A29); vertical axis: Entropy (0.0 to 1.0)]

SLIDE 27

Visualization / Characteristics of sequences: Entropy

Entropy - boxplots

[Figure: boxplots of seqient(seqfam) for clusters 1 to 4; vertical axis from 0.0 to 0.6]

SLIDE 28

Conclusion / R Module

TraMineR

The TraMineR R module provides methods to analyze life courses:
Computation of distances between sequences (optimal matching, LCS, LCP)
Descriptive measures of sequences (entropy, turbulence)
Sequence visualization tools (density/index/frequency plots)
Frequent sub-sequence mining

SLIDE 29

Conclusion / This is

The End

Thank you!
