SLIDE 1

Introduction State sequences Event sequences Conclusion References

Highlighting changes and differences in 20th century Swiss life trajectories with TraMineR

Gilbert Ritschard, Alexis Gabadinho, Nicolas S. Müller, Matthias Studer

Department of Econometrics and Laboratory of Demography, University of Geneva
http://mephisto.unige.ch/traminer

Swiss Statistical Meeting, Geneva, October 28-30, 2009

23/10/2009gr 1/60

SLIDE 2

Outline

1. Introduction
2. State sequences
3. Event sequences
4. Conclusion


SLIDE 4

Section outline

1. Introduction
  • Objectives
  • TraMineR
  • Data

SLIDE 5

Objectives

Illustrate some of the many exploratory features of TraMineR, a package for Life Trajectory Mining in R
  • State sequences (education, full time, at home, part time, ...)
  • Event sequences (ending education, starting job, ...)

Highlight results about Swiss occupational trajectories
  • Differences between women and men
  • Evolution across birth cohorts

Using data from the 2002 biographical retrospective survey carried out by the Swiss Household Panel


SLIDE 8

TraMineR's features

  • Handling of longitudinal data and conversion between various sequence formats
  • Plotting sequences (density plot, frequency plot, index plot and more)
  • Centro-type and discrepancy measure of a set of sequences
  • Individual longitudinal characteristics of sequences (length, time in each state, longitudinal entropy, turbulence and more)
  • Sequence transversal characteristics by age point (transversal state distribution, transversal entropy, modal state)
  • Other aggregated characteristics (transition rates, average duration in each state, sequence frequency)
  • Dissimilarities between pairs of sequences (Optimal matching, longest common subsequence, Hamming, Dynamic Hamming, Multichannel and more)
  • ANOVA-like analysis of sequences and tree structured ANOVA from dissimilarities
  • Extracting frequent event subsequences
  • Identifying most discriminating event subsequences
  • Association rules between subsequences


SLIDE 10

The data

  • Derived from the 2002 biographical SHP survey
  • Yearly data
  • 1503 life trajectories between ages 20 and 45 (25 years length)
  • Focus on
      – Occupational trajectories (8 states)
      – Cohabitational trajectories (10 states)


SLIDE 12

Section outline

2. State sequences
  • Basic plots for state sequences
  • Characterizing a set of sequences
  • Individual longitudinal characteristics
  • Computing and exploring pairwise dissimilarities
  • Analysis of sequence discrepancy (ANOVA)
  • Tree structured discrepancy analysis

SLIDE 13

Rendering state sequences

SLIDE 14

Mean time in each state

[Figure: mean time in years (axis from 5 to 15) spent in each state, for men and women. States: Missing, Full time, Part time, Neg. break, Pos. break, At home, Retired, Education]


SLIDE 16

Characterizing a set of sequences

Sequence of transversal measures (modal state, between entropy, ...)

  id  t1  t2  t3  ···
  1   B   B   D   ···
  2   A   B   C   ···
  3   B   B   A   ···

Summary of longitudinal measures (within entropy, transition rates, mean duration, ...)

  id  t1  t2  t3  ···
  1   B   B   D   ···
  2   A   B   C   ···
  3   B   B   A   ···

Other global characteristics: centro-type sequence, diversity of sequences, ...


SLIDE 19

Heterogeneity: Sequence of transversal entropies

Cohabitational vs Occupational

[Figure: two panels, "Cohabitational Trajectories" and "Occupational Trajectories"; transversal entropy (y-axis, 0.3 to 0.8) by age (x-axis, A20 to A44), one curve per birth cohort: 1910−1924, 1925−1945, 1946−1957]
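As an illustration of what a sequence of transversal entropies is, here is a minimal Python sketch (not TraMineR code). It computes, for each time position, the Shannon entropy of the state distribution observed across all sequences. The function name `transversal_entropies` is hypothetical, and dividing by the log of the alphabet size is an assumption about the normalisation behind the plotted values.

```python
import math
from collections import Counter

def transversal_entropies(sequences, alphabet_size=None):
    """Return one normalised Shannon entropy per time position (column).

    Assumption: entropies are divided by log(alphabet size) so that they
    lie between 0 (everyone in the same state) and 1 (uniform distribution).
    """
    if alphabet_size is None:
        alphabet_size = len({state for seq in sequences for state in seq})
    max_entropy = math.log(alphabet_size) if alphabet_size > 1 else 1.0
    entropies = []
    for column in zip(*sequences):  # states observed at one position
        n = len(column)
        counts = Counter(column)
        h = sum((c / n) * math.log(n / c) for c in counts.values())
        entropies.append(h / max_entropy)
    return entropies

# Toy data: three short sequences over the states A, B, C, D.
toy = ["BBD", "ABC", "BBA"]
entropy_by_position = transversal_entropies(toy)
```

In the toy data every sequence is in state B at the second position, so the entropy there is zero, while the third position, with three different states, has the highest entropy.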

SLIDE 20

Heterogeneity: Sequence of transversal entropies

Occupational, Women vs Men

[Figure: two panels, "Women: Occupational Trajectories" and "Men: Occupational Trajectories"; transversal entropy (y-axis, 0.2 to 0.8) by age (x-axis, A20 to A44), one curve per birth cohort: 1910−1924, 1925−1945, 1946−1957]


SLIDE 22

Longitudinal entropy

[Figure: two panels, "Men: Occupational Trajectories" and "Women: Occupational Trajectories"; longitudinal entropy (0.0 to 0.7) by birth cohort: 1910−1924, 1925−1945, 1946−1957]

SLIDE 23

Number of distinct successive states (i.e. transitions)

[Figure: two panels, "Men: Occupational Trajectories" and "Women: Occupational Trajectories"; number of transitions (2 to 10) by birth cohort: 1910−1924, 1925−1945, 1946−1957]

SLIDE 24

Entropy versus Number of transitions

[Figure: two panels, "Men" and "Women"; axes are Entropy and #Transitions]

SLIDE 25

Sequence complexity

Combines longitudinal entropy and number of transitions

[Figure: two panels, "Men: Occupational Trajectories" and "Women: Occupational Trajectories"; complexity (0.0 to 0.5) by birth cohort: 1910−1924, 1925−1945, 1946−1957]
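The three individual measures discussed here can be sketched in a few lines of Python. This is an illustration, not the package's code: the slides only say that complexity "combines" entropy and the number of transitions, so the geometric-mean form used below is an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter
from itertools import groupby

def longitudinal_entropy(seq, alphabet_size):
    """Entropy of the time spent in each state, normalised to [0, 1]."""
    n = len(seq)
    h = sum((c / n) * math.log(n / c) for c in Counter(seq).values())
    return h / math.log(alphabet_size)

def n_transitions(seq):
    """Number of state changes (distinct successive states minus one)."""
    return len([state for state, _ in groupby(seq)]) - 1

def complexity(seq, alphabet_size):
    """Assumed combination: geometric mean of the normalised number of
    transitions and the normalised longitudinal entropy."""
    if len(seq) < 2:
        raise ValueError("sequence must have at least two positions")
    q = n_transitions(seq) / (len(seq) - 1)
    return math.sqrt(q * longitudinal_entropy(seq, alphabet_size))

# Toy trajectory: E = education, F = full time, H = at home, P = part time.
trajectory = "EEFFFFHHPP"
c_mixed = complexity(trajectory, alphabet_size=8)
c_stable = complexity("FFFF", alphabet_size=8)  # no change at all
```

A trajectory that never changes state has zero transitions and zero entropy, so its complexity is zero under this definition.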


SLIDE 27

Pairwise dissimilarities between sequences

Distance between sequences
  • Different metrics (LCP, LCS, OM, HAM, DHD, ...)

Once we have pairwise dissimilarities, we can
  • Determine a central sequence (centro-type)
  • Measure the discrepancy between sequences
  • Cluster a set of sequences
  • MDS scatterplot representation of sequences
  • Discrepancy analysis of a set of sequences (ANOVA)
  • Tree Structured Discrepancy Analysis (Induction trees)
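To make the first use of a dissimilarity matrix concrete, the sketch below builds a pairwise Hamming distance matrix and picks the centro-type as the sequence with the smallest total distance to all others. The helper names are hypothetical, and taking the medoid as the centro-type is an assumption about how the central sequence is defined.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

def pairwise(sequences, dist):
    """Symmetric matrix of dissimilarities between all pairs."""
    n = len(sequences)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = dist(sequences[i], sequences[j])
    return d

def centrotype(d):
    """Index of the medoid: smallest sum of distances to the others."""
    return min(range(len(d)), key=lambda i: sum(d[i]))

toy = ["BBD", "ABC", "BBA"]
d = pairwise(toy, hamming)
center = centrotype(d)  # ties are resolved in favour of the first index
```

Any other metric (OM, LCS, ...) could be passed as `dist` without changing the rest of the sketch.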


SLIDE 31

Deriving clusters from pairwise dissimilarities

For each of the two sets of sequences (cohabitational and occupational):

  • Compute pairwise dissimilarities (a 1503 × 1503 matrix). Here, we used Optimal Matching (OM).
      – For each pair {x, y} of sequences, OM is the minimal cost of transforming one sequence into the other
      – Insertion/deletion (indel) cost = 1
      – Substitution cost $c_{i,j} = c_{j,i} = 2 - p(i_t \mid j_{t-1}) - p(j_t \mid i_{t-1})$
  • Cluster by plugging the obtained dissimilarity matrix into any clustering algorithm. We used an agglomerative hierarchical method with Ward's criterion and retained the partition into 5 clusters.
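The OM computation described above can be sketched with a standard edit-distance dynamic program. This is an illustrative Python version, not the package's implementation; the function names are hypothetical, and the transition rates are estimated here from the toy sequences themselves.

```python
from collections import Counter

def transition_rates(sequences):
    """rates[(i, j)]: probability of being in j at t given i at t-1."""
    pair_counts, from_counts = Counter(), Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pair_counts[(a, b)] += 1
            from_counts[a] += 1
    return {k: v / from_counts[k[0]] for k, v in pair_counts.items()}

def substitution_cost(i, j, rates):
    """c_ij = 2 - p(i_t | j_{t-1}) - p(j_t | i_{t-1}), zero when i == j."""
    if i == j:
        return 0.0
    return 2.0 - rates.get((j, i), 0.0) - rates.get((i, j), 0.0)

def om_distance(x, y, rates, indel=1.0):
    """Minimal total cost of turning x into y (row-by-row edit distance)."""
    prev = [k * indel for k in range(len(y) + 1)]
    for a in x:
        cur = [prev[0] + indel]
        for k, b in enumerate(y, start=1):
            cur.append(min(prev[k] + indel,          # delete a
                           cur[k - 1] + indel,       # insert b
                           prev[k - 1] + substitution_cost(a, b, rates)))
        prev = cur
    return prev[-1]

toy = ["AAB", "ABB", "AAA"]
rates = transition_rates(toy)
dist_01 = om_distance(toy[0], toy[1], rates)
```

In the toy data, A is followed by B in 2 of 5 cases and B is never followed by A, so substituting A for B costs 2 − 0.4 − 0 = 1.6, which is cheaper than one deletion plus one insertion.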


SLIDE 33

Cluster analysis: determining typologies

[Figure: state distribution by age (x-axis, A20 to A44; frequencies from 0.0 to 1.0), one panel per cluster]

  • Type 1: Full Time Trajectories (52 %), n=776
  • Type 2: Mixed Occupational Trajectories (22 %), n=333
  • Type 3: Return Trajectories (11 %), n=166
  • Type 4: At Home Trajectories (9.5 %), n=144
  • Type 5: Part Time Trajectories (5.5 %), n=84

States: Missing, Full time, Part time, Neg. break, Pos. break, At home, Retired, Education

SLIDE 34

Cluster analysis: i-plots (sorted by 1st MDS factor)

SLIDE 35

Cluster analysis: representative sequences

[Figure: one panel per cluster showing the representative sequence by age (A20 to A44), selected with criterion=frequency, together with (A) the discrepancy (mean dist. to center) and (B) the mean dist. to the representative seq., on an axis from 13 to 52]

  Cluster                                            N    Neighbours   Share
  Type 1: Full Time Trajectories (52 %)              776  708          91.2%
  Type 2: Mixed Occupational Trajectories (22 %)     333  186          55.9%
  Type 3: Return Trajectories (11 %)                 166  63           38%
  Type 4: At Home Trajectories (9.5 %)               144  112          77.8%
  Type 5: Part Time Trajectories (5.5 %)             84   39           46.4%

States: Missing, Full time, Part time, Neg. break, Pos. break, At home, Retired, Education

SLIDE 36

Birth year distribution by cluster

[Figure: density of birth year (x-axis, 1910 to 1960; density from 0.00 to 0.06), one panel for each of Type 1: Full Time Trajectories (52 %), Type 2: Mixed Occupational Trajectories (22 %), Type 3: Return Trajectories (11 %), Type 4: At Home Trajectories (9.5 %), Type 5: Part Time Trajectories (5.5 %), plus an Overall panel]

SLIDE 37

MDS: Scatterplot view of sequences

[Figure: two scatterplots of the sequences on the first two MDS coordinates (axes mds.prof[,1] and mds.prof[,2]). One distinguishes the five cluster types (Type 1: Full Time Trajectories (52 %), Type 2: Mixed Occupational Trajectories (22 %), Type 3: Return Trajectories (11 %), Type 4: At Home Trajectories (9.5 %), Type 5: Part Time Trajectories (5.5 %)); the other distinguishes the birth cohorts 1910−1924, 1925−1945, 1946−1957]


SLIDE 39

Dispersion of the set of sequences

From the distance matrix, we get the pseudo-variance of the set of sequences. The sum of squares SS can be expressed in terms of distances between pairs:

$$\mathrm{SS} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=i+1}^{n} (y_i - y_j)^2 = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=i+1}^{n} d_{ij}$$

Setting $d_{ij}$ equal to the OM, LCP, LCS, ... distance, we get SS. We can then apply the ANOVA principle (Studer et al., 2009).
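The identity between the usual sum of squares and the pairwise form can be checked numerically. The sketch below is only a sanity check on made-up numbers: for numeric data it takes $d_{ij}$ to be the squared difference $(y_i - y_j)^2$, and the function names are hypothetical.

```python
def ss_from_mean(y):
    """Classical sum of squared deviations from the mean."""
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y)

def ss_from_pairs(d):
    """Same quantity from pairwise dissimilarities: (1/n) * sum_{i<j} d_ij."""
    n = len(d)
    return sum(d[i][j] for i in range(n) for j in range(i + 1, n)) / n

# Made-up numeric data; d_ij is the squared difference between two values.
y = [1.0, 2.0, 4.0, 7.0]
d = [[(a - b) ** 2 for b in y] for a in y]

ss_direct = ss_from_mean(y)
ss_pairs = ss_from_pairs(d)
```

Because `ss_from_pairs` only needs the matrix `d`, it applies unchanged when `d` holds sequence dissimilarities, for which no mean exists.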


SLIDE 42

Analysis of sequence discrepancy

ANOVA-like analysis based on pairwise dissimilarities. We decompose the SS (sum of squares equivalent):

$$\mathrm{SST} = \mathrm{SSB} + \mathrm{SSW}$$

Here, with the formula shown earlier,

$$\mathrm{SST} = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=i+1}^{n} d_{ij}, \qquad \mathrm{SSW} = \sum_{g} \frac{1}{n_g} \sum_{i=1}^{n_g} \sum_{j=i+1}^{n_g} d_{ij,g}, \qquad \mathrm{SSB} = \mathrm{SST} - \mathrm{SSW}$$

SLIDE 43

Pseudo R-square and ANOVA Table

ANOVA table for m groups

            Discrepancy   df                          Mean Discr.      F
  Between   SSB           $df_B = m - 1$              SSB / $df_B$     (SSB / SSW) · ($df_W$ / $df_B$)
  Within    SSW           $df_W = \sum_g n_g - m$     SSW / $df_W$
  Total     SST           $df_T = n - 1$

Pseudo R2: $R^2 = \mathrm{SSB} / \mathrm{SST}$
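The entries of this table can be computed directly from a dissimilarity matrix and a grouping variable. The following Python sketch is illustrative only: the function names and the toy data are hypothetical, and squared differences between made-up numbers stand in for sequence dissimilarities.

```python
def ss(d, idx):
    """Discrepancy (sum-of-squares equivalent) of the cases listed in idx."""
    if len(idx) < 2:
        return 0.0
    return sum(d[i][j] for a, i in enumerate(idx) for j in idx[a + 1:]) / len(idx)

def discrepancy_table(d, groups):
    """SST, SSW, SSB, degrees of freedom, pseudo F and pseudo R2."""
    n = len(groups)
    levels = sorted(set(groups))
    m = len(levels)
    sst = ss(d, list(range(n)))
    ssw = sum(ss(d, [i for i in range(n) if groups[i] == g]) for g in levels)
    ssb = sst - ssw
    df_b, df_w = m - 1, n - m
    return {"SSB": ssb, "SSW": ssw, "SST": sst,
            "dfB": df_b, "dfW": df_w, "dfT": n - 1,
            "F": (ssb / df_b) / (ssw / df_w),
            "R2": ssb / sst}

# Toy data: six cases in two groups, d_ij = squared difference.
y = [1.0, 2.0, 4.0, 7.0, 8.0, 9.0]
groups = ["a", "a", "a", "b", "b", "b"]
d = [[(u - v) ** 2 for v in y] for u in y]
table = discrepancy_table(d, groups)
```

With these toy values the two groups are well separated, so most of the total discrepancy ends up in SSB.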


SLIDE 45

Pseudo F

$$F = \frac{\mathrm{SSB}/(m-1)}{\mathrm{SSW}/(n-m)}$$

  • Normality is not defensible in this setting, so F cannot be compared with an F distribution.
  • Significance is assessed through a permutation test.
  • Permutation test: iteratively and randomly reassign each covariate profile to one of the observed sequences and recompute F. This gives the empirical distribution of F under independence.
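A permutation test of this kind can be sketched as follows in Python. The code is a minimal illustration under stated assumptions: group labels are shuffled among the cases, the p-value is taken as the share of permutations whose F is at least as large as the observed one (other conventions exist), and all names and data are hypothetical.

```python
import random

def ss(d, idx):
    """Discrepancy (sum-of-squares equivalent) of the cases listed in idx."""
    if len(idx) < 2:
        return 0.0
    return sum(d[i][j] for a, i in enumerate(idx) for j in idx[a + 1:]) / len(idx)

def pseudo_f(d, groups):
    """F = [SSB / (m - 1)] / [SSW / (n - m)] from a dissimilarity matrix."""
    n = len(groups)
    levels = sorted(set(groups))
    m = len(levels)
    sst = ss(d, list(range(n)))
    ssw = sum(ss(d, [i for i in range(n) if groups[i] == g]) for g in levels)
    return ((sst - ssw) / (m - 1)) / (ssw / (n - m))

def permutation_test(d, groups, n_perm=999, seed=1):
    """Observed pseudo F and its permutation p-value."""
    rng = random.Random(seed)
    observed = pseudo_f(d, groups)
    labels = list(groups)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # break any link between groups and cases
        if pseudo_f(d, labels) >= observed:
            exceed += 1
    return observed, exceed / n_perm

# Toy data: d_ij = squared difference between made-up numbers.
y = [1.0, 2.0, 4.0, 7.0, 8.0, 9.0]
groups = ["a", "a", "a", "b", "b", "b"]
d = [[(u - v) ** 2 for v in y] for u in y]
f_obs, p_value = permutation_test(d, groups)
```

The dissimilarity matrix is computed once; only the labels are reshuffled, which keeps each permutation cheap.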


SLIDE 47

Analysis of sequence discrepancy

Running an ANOVA like analysis for cohort3b

  Pseudo ANOVA table:
                 SS    df       MSE
  Exp      106.4437     2  53.22183
  Res    15645.8712  1500  10.43058
  Total  15752.3148  1502  10.48756

  Test values (p-values based on 999 permutation):
  PseudoF   PseudoR2      PseudoF_Pval   PseudoT    PseudoT_Pval
  5.10248   0.006757335   0              7.361347

  Variance per level:
                 n   variance
  1910-1924     71   7.713761
  1925-1945    659   9.651546
  1946-1957    773  11.303784
  Total       1503  10.480582
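The reported test values follow from the sums of squares in the table, which can be verified with a few lines of arithmetic. The Python check below only reuses the numbers printed above; the variable names are hypothetical.

```python
# Values copied from the pseudo ANOVA table above.
ss_exp, ss_res, ss_total = 106.4437, 15645.8712, 15752.3148
df_exp, df_res, df_total = 2, 1500, 1502
n_cases = 1503

pseudo_r2 = ss_exp / ss_total                      # explained share
pseudo_f = (ss_exp / df_exp) / (ss_res / df_res)   # ratio of mean discrepancies
total_mse = ss_total / df_total                    # "MSE" of the Total row
total_variance = ss_total / n_cases                # "variance" of the Total level
```

The last two lines show why the Total row reports 10.48756 while the Total level reports 10.480582: the first divides by n − 1 = 1502, the second by n = 1503.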

SLIDE 48

Distribution of pseudo F

[Figure: histogram titled "Distribution of PseudoF"; x-axis PseudoF (1 to 4), y-axis Frequency (20 to 120)]

SLIDE 49

Multiple factor analysis

  • Generalize the previous approach to multiple covariates.
  • Here, we consider Type III effects: measure the additional contribution of each covariate v once all other covariates are accounted for.
  • The F statistic reads

$$F_v = \frac{(\mathrm{SSB}_c - \mathrm{SSB}_v)/p}{\mathrm{SSW}_c/(n - m - 1)}$$

    where $\mathrm{SSB}_c$ and $\mathrm{SSW}_c$ are the explained and residual sums of squares of the full model, $\mathrm{SSB}_v$ is the explained sum of squares of the model after removing variable v, and p is the number of indicators or contrasts used to encode the covariate v.
  • Significance is again assessed through permutation tests.

SLIDE 50

Running a Multiple factor analysis

     Variable      PseudoF     PseudoR2      p_value
  1  sex        486.157573  0.222836269  0.000000000
  2  cohort3b     5.297978  0.004856786  0.000999001
  3  edu_lev     33.998319  0.046750636  0.000000000
  4  Total      114.523325  0.314748465  0.000000000

SLIDE 51

Differences over time

  • How do differences between groups vary over time?
  • At which age do trajectories differ most across birth cohorts?
  • Compute R2 for short sliding windows (length 2)
  • We thus get a sequence of R2 values, which can be plotted
  • Similarly, we can plot series of
      – total residual discrepancy (SSW)
      – residual discrepancy of each group (SSG)
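The sliding-window idea can be illustrated with a short Python sketch. It cuts every sequence into windows of length 2, computes a dissimilarity on each window and derives one pseudo R2 per window. Using the Hamming distance here is an assumption made to keep the example small, and all names and data are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def pseudo_r2(seqs, groups, dist):
    """R2 = SSB / SST computed from pairwise dissimilarities."""
    n = len(seqs)

    def ss(idx):
        if len(idx) < 2:
            return 0.0
        return sum(dist(seqs[i], seqs[j])
                   for a, i in enumerate(idx) for j in idx[a + 1:]) / len(idx)

    sst = ss(list(range(n)))
    ssw = sum(ss([i for i in range(n) if groups[i] == g])
              for g in sorted(set(groups)))
    return (sst - ssw) / sst if sst > 0 else 0.0

def sliding_r2(seqs, groups, width=2):
    """One pseudo R2 per window of `width` consecutive positions."""
    length = len(seqs[0])
    return [pseudo_r2([s[t:t + width] for s in seqs], groups, hamming)
            for t in range(length - width + 1)]

# Toy data: E = education, F = full time, H = at home.
seqs = ["EEFF", "EEFF", "EEHH", "EEHH"]
groups = ["m", "m", "w", "w"]
r2_by_window = sliding_r2(seqs, groups)
```

In the toy data both groups are identical in the first window and completely separated afterwards, so the series of R2 values jumps from 0 to 1.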


SLIDE 54

Plotting R-squares over time

[Figure: "Birth cohorts"; PseudoR2 (y-axis, 0.004 to 0.009) by age (x-axis, A20 to A44)]

SLIDE 55

Plotting residual discrepancy over time

[Figure: "Birth cohorts"; Variance (y-axis, 0.26 to 0.32) by age (x-axis, A20 to A44), one curve for each of 1910−1924, 1925−1945, 1946−1957 and Total]

SLIDE 56

Plotting residual discrepancy over time

[Figure, panel 1: "Birth cohorts"; Variance (y-axis, 0.26 to 0.32) by age (x-axis, A20 to A44), one curve for each of 1910−1924, 1925−1945, 1946−1957 and Total]

[Figure, panel 2: "Occupational Trajectories"; Entropy (y-axis, 0.3 to 0.7) by age (x-axis, A20 to A44), one curve for each of 1910−1924, 1925−1945, 1946−1957]


SLIDE 58

Tree structured discrepancy analysis

  • Objective: find the most important predictors and their interactions.
  • Iteratively segment the cases using values of covariates (predictors), so that groups are as homogeneous as possible.
  • At each step, we select the covariate and split with the highest R2.
  • Significance of the split is assessed through a permutation F test.
  • Growing stops when the selected split is not significant.
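The split-selection step can be sketched in Python as an exhaustive search over binary splits of each covariate, keeping the one with the highest R2. This is a simplified illustration rather than the package's algorithm: the permutation test and the stopping rule are omitted, and the function names and toy data are hypothetical.

```python
from itertools import combinations

def ss(d, idx):
    """Discrepancy (sum-of-squares equivalent) of the cases listed in idx."""
    if len(idx) < 2:
        return 0.0
    return sum(d[i][j] for a, i in enumerate(idx) for j in idx[a + 1:]) / len(idx)

def best_split(d, idx, covariates):
    """Return (R2, covariate name, levels sent to the left child)."""
    total = ss(d, idx)
    best = None
    for name, values in covariates.items():
        levels = sorted({values[i] for i in idx})
        # Try every way of sending a subset of levels to the left child.
        for size in range(1, len(levels) // 2 + 1):
            for subset in combinations(levels, size):
                left = [i for i in idx if values[i] in subset]
                right = [i for i in idx if values[i] not in subset]
                if not left or not right or total == 0:
                    continue
                r2 = (total - (ss(d, left) + ss(d, right))) / total
                if best is None or r2 > best[0]:
                    best = (r2, name, subset)
    return best

# Toy data: d_ij = squared difference between made-up numbers.
y = [1.0, 2.0, 4.0, 7.0, 8.0, 9.0]
d = [[(u - v) ** 2 for v in y] for u in y]
covariates = {
    "sex": ["m", "m", "m", "w", "w", "w"],
    "cohort": ["c1", "c2", "c1", "c2", "c1", "c2"],
}
r2, variable, left_levels = best_split(d, list(range(len(y))), covariates)
```

Growing a full tree would apply `best_split` recursively to the two resulting groups, testing each retained split before going further.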


SLIDE 60

Growing the tree

  Dissimilarity tree
  Global R2: 0.229
  |-- Root [ 1503 ] var: 10.5
     |-> sex R2: 0.179
     |-- man [ 752 ] var: 4.37
        |-> edu_lev R2: 0.143
        |-- University [ 157 ] var: 6.28
        |-- Compulsory/College+Prof/Prof.HS [ 595 ] var: 3.08
     |-- woman [ 751 ] var: 12.8
        |-> edu_lev R2: 0.0206
        |-- Compulsory/College+Prof [ 632 ] var: 12.5
           |-> edu_lev R2: 0.00905
           |-- Compulsory [ 116 ] var: 12.0
           |-- College+Prof [ 516 ] var: 12.5
              |-> cohort3b R2: 0.00714
              |-- 1946-1957 [ 280 ] var: 12.5
              |-- 1910-1924/1925-1945 [ 236 ] var: 12.2
        |-- Prof.HS/University [ 119 ] var: 13.1

SLIDE 61

Graphical Tree


SLIDE 63

Event sequences

Time stamped events
  (end education, 21) (start full time job, 21) (at home, 28) (start part time, 29)

  • Which are the most typical sequencings?
  • Which are the most typical events that occur after the sub-sequence (leaving home, ending education)?
  • Which sequencings differ most among groups?
  • ...

Unlike state sequences, event sequences are hard to visualize


SLIDE 66

Events, Transitions and States

  • An event occurs at a given time (leaving home, starting job, ...)
  • Transition: set of events occurring simultaneously
  • A transition corresponds to a state change
  • Easy to transform between state and transition sequences
  • Converting to and from events requires additional information
  • To illustrate, we consider hereafter the events defined by state changes in our previous trajectories
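Deriving events from state changes can be sketched as follows in Python. The sketch records the initial state and then one "from>to" event at each age where the state changes, a labelling that mirrors sub-sequences such as (Full time)−(Full time>At home). The function name and the `start_age` parameter are hypothetical.

```python
def state_to_events(states, start_age=20):
    """Turn a yearly state sequence into time-stamped events.

    The first event is the initial state; each later event is a state
    change written "previous>next", stamped with the age at which the
    new state is first observed.
    """
    if not states:
        return []
    events = [(states[0], start_age)]
    for t in range(1, len(states)):
        if states[t] != states[t - 1]:
            events.append((f"{states[t - 1]}>{states[t]}", start_age + t))
    return events

trajectory = ["Education", "Education", "Full time", "Full time", "At home"]
events = state_to_events(trajectory)
# events == [("Education", 20), ("Education>Full time", 22),
#            ("Full time>At home", 24)]
```

Going the other way, from these events back to states, is possible here only because each event label carries both the previous and the next state.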


SLIDE 68

Event sequences: discriminating sub-sequences

Between sex, frequencies

[Figure: two panels (man, woman) giving the frequency (x-axis, 0.0 to 0.6) of each sub-sequence listed below. Legend, Pearson residuals: Negative 0.01, Negative 0.05, neutral, Positive 0.05, Positive 0.01]

  • (Full time>At home)
  • (Full time)−(Full time>At home)
  • (At home>Part time)
  • (Full time>At home)−(At home>Part time)
  • (Full time)−(At home>Part time)
  • (Full time>Part time)
  • (Full time)−(Full time>At home)−(At home>Part time)
  • (Full time)−(Full time>Part time)
  • (Education>Full time)
  • (Education)
  • (Education)−(Education>Full time)
  • (Part time>Full time)
  • (Full time)
  • (Missing)

slide-69
SLIDE 69

Introduction State sequences Event sequences Conclusion References

Event sequences: discriminating sub-sequences

Between sex, residuals

[Figure: Pearson residuals (axis from −10 to 10) for the same 14 event subsequences as in the between-sex frequency plot, one panel per sex (man, woman). Bar shading follows the Pearson residuals legend: Negative 0.01, Negative 0.05, neutral, Positive 0.05, Positive 0.01.]

SLIDE 70

Event sequences: discriminating between birth cohorts

frequencies

[Figure: frequencies (axis from 0.0 to 0.6) of 14 discriminating event subsequences, one panel per birth cohort (1910−1924, 1925−1945, 1946−1957). Bar shading follows the Pearson residuals legend: Negative 0.01, Negative 0.05, neutral, Positive 0.05, Positive 0.01.]

Subsequences shown: (Full time>Part time); (Full time)−(Full time>Part time); (At home>Part time); (Missing); (Education); (Full time)−(At home>Part time); (Full time>At home)−(At home>Part time); (Full time); (Part time>Full time); (Full time)−(Full time>At home)−(At home>Part time); (Education)−(Education>Full time); (Education>Full time); (Full time)−(Full time>At home); (Full time>At home)

SLIDE 71

Event sequences: discriminating between birth cohorts

residuals

[Figure: Pearson residuals (axis from −3 to 4) for the same 14 event subsequences as in the birth-cohort frequency plot, one panel per birth cohort (1910−1924, 1925−1945, 1946−1957). Bar shading follows the Pearson residuals legend: Negative 0.01, Negative 0.05, neutral, Positive 0.05, Positive 0.01.]

SLIDE 72

Outline

1 Introduction
2 State sequences
3 Event sequences
4 Conclusion

SLIDE 73

Conclusion 1: about sequence analysis

Analyzing trajectories until age 45 implies ignoring recent generations
Most recent birth year is 1957 (2002 − 45)
Missing data in sequences is a crucial issue
TraMineR permits different handling of left, right and in-between missing values

consider as a specific state
drop (shifts state sequences left)
impute, but how?

Weights

Can be handled in sequence rendering (weighted transversal characteristics)
Not really an issue for computing dissimilarities and longitudinal characteristics
We are working on a solution for permutation tests
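The first two options for missing values can be sketched in Python as below. This is only an illustration under assumptions, not TraMineR code: missing values are encoded as `None`, the label `"Missing"` and the helper names are chosen for this example, and the toy sequence is invented. The last helper separates left, in-between and right missing values, since each kind may call for a different treatment.

```python
from typing import List, Optional, Tuple

MISSING = "Missing"  # label of the explicit missing state (assumed name)


def missing_as_state(seq: List[Optional[str]]) -> List[str]:
    """Option 1: treat every missing value as a specific state."""
    return [MISSING if s is None else s for s in seq]


def drop_missing(seq: List[Optional[str]]) -> List[str]:
    """Option 2: delete missing values; later states shift to the left."""
    return [s for s in seq if s is not None]


def count_missing(seq: List[Optional[str]]) -> Tuple[int, int, int]:
    """Count (left, in-between, right) missing values of a sequence."""
    observed = [i for i, s in enumerate(seq) if s is not None]
    if not observed:
        # Entirely missing sequence: count everything as left missing.
        return len(seq), 0, 0
    first, last = observed[0], observed[-1]
    gaps = sum(1 for s in seq[first:last + 1] if s is None)
    return first, gaps, len(seq) - 1 - last


# Toy sequence (hypothetical data) with left, in-between and right missings.
seq = [None, "Education", None, "Full time", "Full time", None, None]
as_state = missing_as_state(seq)
dropped = drop_missing(seq)
left, gaps, right = count_missing(seq)
print(as_state, dropped, (left, gaps, right))
```

Dropping shortens the sequence and moves later states to earlier positions, so position-wise comparisons between sequences no longer refer to the same ages.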

SLIDE 76

Conclusion 2: extending analysis

Since it runs in R, TraMineR's output can easily be combined in the same script with other R procedures
We have shown: cluster analysis, MDS, ...
In Widmer and Ritschard (2009), we studied

the relationship between occupational and cohabitational trajectories, by regressing the longitudinal entropies of each of them on both occupational and cohabitational clusters while controlling for birth cohorts and sex
cluster membership, by means of logistic regressions
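For readers unfamiliar with the term, the longitudinal entropy of a sequence is commonly taken to be the Shannon entropy of the distribution of states within that sequence. The Python sketch below follows that definition; it is a hedged illustration, not the code used in the cited study. The function name, the normalization by the logarithm of the alphabet size and the toy sequences are assumptions of this example.

```python
import math
from collections import Counter
from typing import Sequence


def longitudinal_entropy(
    states: Sequence[str], alphabet_size: int, normalize: bool = True
) -> float:
    """Shannon entropy of the state distribution within one sequence.

    Assumes a non-empty sequence and alphabet_size >= 2. With
    normalize=True the value is divided by log(alphabet_size), so it
    lies between 0 (a single state) and 1 (all states equally frequent).
    """
    n = len(states)
    counts = Counter(states)
    h = sum((c / n) * math.log(n / c) for c in counts.values())
    if normalize:
        h /= math.log(alphabet_size)
    return h


# Toy sequences over an assumed alphabet of 4 states (hypothetical data).
stable = ["Full time"] * 6
mixed = ["Education", "Education", "Full time", "Full time"]
h_stable = longitudinal_entropy(stable, alphabet_size=4)
h_mixed = longitudinal_entropy(mixed, alphabet_size=4)
print(h_stable, h_mixed)
```

One value per individual is obtained this way, which is what makes the entropy usable as the dependent variable of a regression.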

SLIDE 77

Conclusion 3: about TraMineR

TraMineR is a unique and powerful tool for discrete sequences
Can do much more than shown in this presentation, for instance

sequence data management
conversion between event and state sequences
multiple metrics, including multi-channel for parallel sequences
dissimilarities between event sequences
discovering association rules between event subsequences
...

... and, like R, it is available for free on CRAN: http://cran.r-project.org
See also the package web page: http://mephisto.unige.ch/traminer

SLIDE 78

Thank you!

SLIDE 79

References I

Gabadinho, A., G. Ritschard, M. Studer, and N. S. Müller (2008). Mining sequence data in R with TraMineR: A user’s guide. Technical report, Department of Econometrics and Laboratory of Demography, University of Geneva, Geneva.

Ritschard, G., A. Gabadinho, N. S. Müller, and M. Studer (2008). Mining event histories: A social science perspective. International Journal of Data Mining, Modelling and Management 1(1), 68–90.

Studer, M., G. Ritschard, A. Gabadinho, and N. S. Müller (2009). Discrepancy analysis of complex objects using dissimilarities. In H. Briand, F. Guillet, G. Ritschard, and D. A. Zighed (Eds.), Advances in Knowledge Discovery and Management, Studies in Computational Intelligence. Berlin: Springer. (forthcoming).

Widmer, E. and G. Ritschard (2009). The de-standardization of the life course: Are men and women equal? Advances in Life Course Research 14(1-2), 28–39.
