SLIDE 1

Sequence analysis Reviewing distances Simulations Conclusion References

What matters in differences between life trajectories? A comparative review of sequence dissimilarity measures

Matthias Studer (1, 2) & Gilbert Ritschard (1, 2)

(1) LIVES NCCR
(2) Institute of Demography and Socioeconomics, University of Geneva

FORS – SSP Methods and Research meetings, University of Lausanne, December 1, 2015

SLIDE 2

Outline

1. Sequence analysis
2. Reviewing distances
3. Simulations
4. Conclusion

SLIDE 4

Sequence Analysis in the Social Sciences

SA aims to describe trajectories.

  • Professional careers.
  • Cohabitational life courses.
  • History of organizations.

Typology of the trajectories.

Common questions in sequence analysis:

  • What are the typical patterns of trajectories?
  • How are the trajectories related to explanatory factors?
  • How is a given outcome related to a previous trajectory?

SLIDE 5

Sequence analysis: common strategy

Code processes/trajectories as state sequences.

[Figure: index plot of individual state sequences, Sep.93 to Sep.98. States: employment, further education, higher education, joblessness, school, training.]

Compute distances between sequences, e.g. with optimal matching.

SLIDE 6

Typology of processes

Reveals main patterns.

[Figure: state distribution plots for three clusters, with time (Sep.93 to Mar.99) on the horizontal axis and frequency on the vertical axis. Panels: Employment (n=481), Higher Education (n=169), Joblessness (n=62). Legend: Employment, Further Education, Higher Education, Joblessness, School, Training.]

SLIDE 7

Optimal Matching

“Optimal Matching”: distance measure between sequences.

Definition: number of operations needed to transform one sequence into another.

  • Substitution.
  • Insertion–deletion (indel).

Operation costs can be weighted.
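The definition above can be sketched as a dynamic program. This is a minimal illustration, not the TraMineR implementation: the function names `om_distance` and `constant_subst`, the single indel cost of 1 and the constant substitution cost of 2 are all assumptions chosen for the example.

```python
def constant_subst(a, b, cost=2.0):
    """Hypothetical substitution cost: 0 for identical states, `cost` otherwise."""
    return 0.0 if a == b else cost


def om_distance(x, y, subst_cost=constant_subst, indel=1.0):
    """Minimal total cost of turning sequence x into sequence y.

    Allowed operations: substitution (cost given by subst_cost) and
    insertion or deletion (each costing `indel`).
    """
    n, m = len(x), len(y)
    # d[i][j] = minimal cost of turning x[:i] into y[:j]
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel          # delete the first i states of x
    for j in range(1, m + 1):
        d[0][j] = j * indel          # insert the first j states of y
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + indel,                              # deletion
                d[i][j - 1] + indel,                              # insertion
                d[i - 1][j - 1] + subst_cost(x[i - 1], y[j - 1]), # substitution
            )
    return d[n][m]


if __name__ == "__main__":
    print(om_distance("AAB", "ABB"))
```

With these costs a substitution is exactly as expensive as one deletion plus one insertion, so the weighting of operations decides which kind of edit the distance prefers.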

SLIDE 8

Criticism

Many criticisms (Levine, 2000; Wu, 2000; Elzinga, 2003):

  • Lacks a sociological interpretation.
  • High number of parameters.
  • Parameter values set by the user.
  • Timing and sequencing are not sufficiently taken into account.

SLIDE 9

New developments

New developments as answers to criticisms (Aisenbrey and Fasang, 2010):

  • New distance measures.
  • New methods to automatically compute parameter values.

The result is many distance measures.

  • Seven articles in Sociological Methods & Research.
  • Each having at least one parameter.

Scattered development.

  • Each answers one criticism at a time.
  • Each is compared only to classic OM.

SLIDE 11

Choosing a distance

Common questions from SA users:

  • How to choose a distance measure?
  • How to set the parameters?

Aim: help SA users to choose a distance and set the parameters.

  • Review all distance measures.
  • Provide guidelines.

SLIDE 14

Review of distance measure properties

Columns: Measure | Type (Dis, Att, Edt) | Description | Properties (Metric, Eucl, T.warp, S.dep, Ctxt) | Parameters (Subst.; Indels; Others). An "x" marks a type or property that applies; the extracted text keeps the marks but not the column each one belongs to.

CHI2, EUCLID | x | Distance between per period state distributions | x x x | Others: number of periods K
CHI2fut (Rousset) | x | Position-wise state distances based on shared future | x x x | Others: time-lag weighting function
NMS (Elzinga) | x | Based on number of matching subsequences | x x x x |
SVRspell (Elzinga & Studer) | x | Based on number of matching spell subsequences with spell-length weights | x x x x x | Subst.: User; Others: subsequence length weight a, spell duration weight b
HAM (Hamming) | x x | Number of mismatches | x x(b) |
generalized | x | Sum of mismatches with state-dependent weights | x(a) x(b,c) x | Subst.: User
DHD (Lesnard) | x | Sum of mismatches with position-wise state-dependent weights | x x | Subst.: Data
OM | x | Minimum cost for turning x into y using theoretically defined costs | x(a) x x | Subst.: User; Indels: Mult
LCS / OM(1,2) / Levenshtein-II | x x | Based on length of LCS / number of indels | x x |
feature | x | Costs based on state features | x x x | Subst.: Features; Indels: Single; Others: state features
future (new) | x | Costs based on similarity between conditional state distributions q periods ahead | x x x | Subst.: Data; Indels: Single; Others: forward lag q
trate | x | Costs based on transition rates | x x | Subst.: Data; Indels: Single; Others: transition lag q
opt(na) (Gauthier) | x | Costs adjusted to increase similarity between similar sequences | (n) x x | Subst.: Data; Indels: Single; Others: similarity rate
indels, indelslog (new) | x | State-dependent indels based on inverse or log inverse state frequencies | x x x | Indels: Auto
OMloc (Hollister) | x | Context-dependent indel costs | x x x | Subst.: User; Indels: Auto; Others: expansion cost e, context g
OMslen (Halpin) | x | Costs weighted by spell length | x x x x | Subst.: User; Indels: Mult(na); Others: spell length weight h
OMspell (new) | x | OM between sequences of spells | x(a) x x x | Subst.: User; Indels: Mult(na); Others: expansion cost e
OMstran (new) | x | OM between sequences of transitions | x(a) x x x | Subst.: User; Indels: Mult; Others: origin-transition trade-off w, transition indel cost function

(a) If costs fulfil the triangle inequality.
(b) Squared Euclidean distance.
(c) If costs are squared Euclidean distances.
(na) Not available in TraMineR.
(n) Can generate negative dissimilarities.

SLIDE 15

Review

Theoretical review.

  • Many distance measures.
  • Highlights the mathematical properties of the distances.

Many non-metric dissimilarities.

  • 5 out of 7 distances published in SMR do not satisfy the triangle inequality.
  • 2 have serious issues (wrong algorithm or negative distances).

Overlooked mathematical properties?
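Whether a given dissimilarity matrix satisfies the triangle inequality can be checked by brute force. The sketch below is only an illustration: the function name `triangle_violations` and the tolerance are assumptions, and the cubic-time loop is practical for small matrices only.

```python
from itertools import permutations


def triangle_violations(d, tol=1e-12):
    """Return all index triples (i, j, k) with d[i][j] > d[i][k] + d[k][j].

    `d` is a square matrix (list of lists) of pairwise dissimilarities.
    An empty result means the triangle inequality holds for every triple.
    """
    n = len(d)
    return [
        (i, j, k)
        for i, j, k in permutations(range(n), 3)
        if d[i][j] > d[i][k] + d[k][j] + tol
    ]


if __name__ == "__main__":
    points = [0.0, 1.0, 2.0]
    # Absolute differences form a metric; squared differences do not.
    absolute = [[abs(p - q) for q in points] for p in points]
    squared = [[(p - q) ** 2 for q in points] for p in points]
    print(triangle_violations(absolute))  # []
    print(triangle_violations(squared))   # [(0, 2, 1), (2, 0, 1)]
```

The squared case shows why a squared Euclidean distance is not a metric: going from 0 to 2 directly costs 4, while passing through 1 costs only 2.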

SLIDE 16

Reviewing distances

How to choose a distance measure? How to evaluate a distance measure?

A distance measure defines how two sequences are compared.

Which aspects should we use to compare trajectories?

  • Sociological issue.
  • Five aspects based on Settersten and Mayer (1997) and Billari et al. (2006).

SLIDE 19

Sequence comparison aspects

  • Experienced states: similar sequences should have some states/events in common.
  • Distribution: total exposure time.
  • Timing: age in a state/time an event occurs.
  • Spell duration: consecutive time spent.
  • Sequencing: order of the states/events in the sequence.
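The five aspects can be read off a single state sequence, as the following sketch suggests. The function names `spells` and `describe` are hypothetical, and taking "timing" to be the first position at which each state is observed is a simplifying assumption made for this example.

```python
from collections import Counter
from itertools import groupby


def spells(seq):
    """Collapse a state sequence into (state, duration) spells."""
    return [(state, len(list(run))) for state, run in groupby(seq)]


def describe(seq):
    """Summarise one sequence along the five comparison aspects."""
    sp = spells(seq)
    first_position = {}
    for t, state in enumerate(seq):
        first_position.setdefault(state, t)  # keep the first occurrence only
    return {
        "experienced": set(seq),                    # which states occur at all
        "distribution": dict(Counter(seq)),         # total time spent in each state
        "timing": first_position,                   # when each state is first observed
        "spell_durations": sp,                      # consecutive time spent per spell
        "sequencing": [state for state, _ in sp],   # order of distinct successive states
    }


if __name__ == "__main__":
    print(describe("AABBBC"))
```

Two sequences can agree on some of these summaries and differ on others, which is why distance measures that weight the aspects differently can rank the same pair of trajectories differently.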

SLIDE 25

Simulations

The sensitivity to each aspect varies between distance measures.

Use simulations to measure this sensitivity.

Aim: describe the behaviour of each distance/configuration of parameters.

Generate two groups of sequences.

  • Groups differ on one aspect.
  • Measure the ability of each distance to discriminate between groups.
  • Based on discrepancy analysis (pseudo-R²).

Randomize untested aspects: groups should only differ on one aspect.
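A pseudo-R² of this kind can be computed directly from a dissimilarity matrix and group labels. The sketch below assumes the usual discrepancy-analysis convention in which a sum of squares is the sum of pairwise dissimilarities within a set divided by the size of that set; whether the dissimilarities are squared beforehand is left to the caller. The function names are hypothetical.

```python
def sum_of_squares(d, idx):
    """Sum of pairwise dissimilarities among the objects in idx, divided by len(idx)."""
    total = sum(d[i][j] for a, i in enumerate(idx) for j in idx[a + 1:])
    return total / len(idx)


def pseudo_r2(d, groups):
    """Share of the total discrepancy explained by the grouping.

    `d` is a symmetric dissimilarity matrix, `groups` a list of group labels.
    Assumes the total discrepancy is non-zero.
    """
    everyone = list(range(len(groups)))
    ss_total = sum_of_squares(d, everyone)
    ss_within = 0.0
    for g in set(groups):
        members = [i for i in everyone if groups[i] == g]
        ss_within += sum_of_squares(d, members)
    return 1.0 - ss_within / ss_total


if __name__ == "__main__":
    # Perfectly separated groups: no within-group dissimilarity at all.
    d = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [1, 1, 0, 0],
         [1, 1, 0, 0]]
    print(pseudo_r2(d, [0, 0, 1, 1]))  # 1.0
```

A distance that is sensitive to the aspect on which the two simulated groups differ should give a high value; a distance that ignores that aspect should give a value close to what random grouping would produce.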

SLIDE 27

Sequencing Simulation

Generate two groups of sequences.

  • Group 1: x = (A, B, C)
  • Group 2: x = (C, B, A)
  • Durations and timings random in both groups.

2’000’000 sequences.
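A toy version of this sequencing simulation might look as follows. It is a sketch under assumptions, not the authors' generator: the sequence length of 20, the uniform choice of cut points, the seed and the function names are all hypothetical.

```python
import random


def random_sequence(order, length, rng):
    """Build one sequence visiting the states in `order`, with random spell lengths.

    Distinct cut points are drawn uniformly, so every state lasts at least
    one period and the spell durations (hence the timings) are random.
    """
    cuts = sorted(rng.sample(range(1, length), len(order) - 1))
    bounds = [0] + cuts + [length]
    return "".join(
        state * (end - start)
        for state, start, end in zip(order, bounds, bounds[1:])
    )


def simulate(n_per_group, length=20, seed=0):
    """Return two groups that differ only in the order of the states."""
    rng = random.Random(seed)
    group1 = [random_sequence("ABC", length, rng) for _ in range(n_per_group)]
    group2 = [random_sequence("CBA", length, rng) for _ in range(n_per_group)]
    return group1, group2


if __name__ == "__main__":
    g1, g2 = simulate(3)
    print(g1)
    print(g2)
```

Because spell lengths are drawn the same way in both groups, any ability of a distance to separate the groups has to come from its sensitivity to sequencing.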

[Figure: plots of the simulated sequences over positions V2 to V20, in two panels labelled "Base" and "1".]

SLIDE 28

Sets of simulations

State based:

  • Sequencing: difference of patterns; random small perturbations.
  • Timing: age at the beginning of a spell.
  • Duration: duration of a spell.

Event based (based on three events e1, e2, e3):

  • Sequencing: order of underlying events.
  • Timing: age at a given event.
  • Duration: “spacing” between events.

Simulations chosen among those considered in Studer (2012).

SLIDE 29

Distances included in the analysis

Distance | Configurations
Distribution-based | EUCLID(K=1, 5, 20) (Euclidean), CHI2(K=1, 5, 20) (χ2-distance between distributions within K periods), CHI2fut (metric based on distributions of subsequent states)
Hamming | HAM (simple and generalized Hamming), DHD (Dynamic Hamming)
Optimal Matching (OM) | OM, OM(i=1.5), OM(trate), OM(indelslog), OM(indels), OM(future)
Localized Optimal Matching (OMloc) | OMloc(e=0, 0.1, 0.25, 0.4)
Spell-Length-Sensitive Optimal Matching (OMslen) | OMslen(h=1, i=1, 1.5, 5), OMslen(i=1, 1.5, 5)
Optimal Matching of Spell Sequences (OMspell) | OMspell(e=0, 0.1, 0.5, 1), OMspell(e=0, 0.1, 0.5, 1, i=2)
Optimal Matching of Transition Sequences (OMstran) | OMstran(w=0, 0.1, 0.5), OMstran(i=1.5, w=0.1, 0.5), OMstran(i=5, w=0.1, 0.5), OMstran(tm=raw)
Number of Matching Subsequences (NMS) | NMS
Subsequence Vectorial Representation (SVRspell) | SVRspell(b=0, 1, 2, 3), SVRspell(b=0, 1, 2, 3, a=1)

SLIDE 30

Scores for state-based simulations

[Figure: scatter plot of the scores of each distance configuration, with axes "Sequencing vs Temporality" and "Duration vs Timing". Points are labelled by configuration and grouped by family: CHI2, CHI2fut, DHD, EUCLID, HAM, NMS, OM, OMloc, OMslen, OMspell, OMstran, SVRspell.]

SLIDE 31

Scores for event-based simulations

[Figure: scatter plot of the scores of each distance configuration, with axes "Order vs Temporality" and "Duration vs Positioning". Points are labelled by configuration, including variants marked "ec", and grouped by family: CHI2, CHI2fut, DHD, EUCLID, HAM, NMS, OM, OMloc, OMslen, OMspell, OMstran, SVRspell.]

SLIDE 32

Random perturbation vs sequencing

[Figure: scatter plot of the scores of each distance configuration, with axes "Sequencing" and "Random perturbation". Points are labelled by configuration and grouped by family: CHI2, CHI2fut, DHD, EUCLID, HAM, NMS, OM, OMloc, OMslen, OMspell, OMstran, SVRspell.]

SLIDE 34

Conclusions

Similar overall scores for all distances, except NMS.

Strange results for non-metric distances:

  • Localized OM.
  • Duration-sensitive OM.
  • “Optimized costs”.

Advice: avoid non-metric distances.

Limited effect of data-driven substitution costs.

  • Is the added complexity worth it?

Alternatives are available.

SLIDE 35

Guidelines

Similar overall scores imply that a choice is needed.

Which aspects to focus on? Example of family destandardisation:

  • Pattern change (rise of unmarried cohabitation).
  • Changes in age norms (age at marriage).
  • Changes in spacing (marriage–first child).

The choice depends on the definition of the research question.

SLIDE 36

Guidelines

Timing:

  • Hamming distances.

Duration:

  • Optimal matching.
  • Optimal matching of spells.
  • Distribution-based distances.

Sequencing (depending on sensitivity to small perturbations):

  • SVRspell (very sensitive).
  • Optimal matching of spells (in between).
  • Optimal matching of transitions (less sensitive).

Intermediate position:

  • SVRspell.
  • Optimal matching of transitions.
  • Optimal matching of spells.

SLIDE 37

Other uses

By using one distance measure sensitive to each aspect:

  • Distinction stemming from each aspect.
  • Structuration of the data according to each aspect.

In practice, aspects may be correlated.

SLIDE 38

Contributions

  • Review of sequence dissimilarities.
  • Guidelines.
  • Methodology to evaluate sequence dissimilarities.

New contributions:

  • Two new distance measures (OMspell and OM of transitions), mostly sensitive to sequencing.
  • New strategies to set costs.

All distance measures will be included in the TraMineR software. They are currently in the development R package “seqdist2”.

SLIDE 39

Studer, M. and G. Ritschard (2015). What matters in differences between life trajectories: a comparative review of sequence dissimilarity measures. Journal of the Royal Statistical Society: Series A (Statistics in Society). DOI: 10.1111/rssa.12125.

SLIDE 40

References I

Aisenbrey, S. and A. E. Fasang (2010). New life for old ideas: The “second wave” of sequence analysis bringing the “course” back into the life course. Sociological Methods & Research 38(3), 430–462.

Billari, F. C., J. Fürnkranz, and A. Prskawetz (2006). Timing, sequencing, and quantum of life course events: A machine learning approach. European Journal of Population 22(1), 37–65.

Elzinga, C. H. (2003). Sequence similarity: A non-aligning technique. Sociological Methods & Research 31, 214–231.

Levine, J. (2000). But what have you done for us lately. Sociological Methods & Research 29(1), 35–40.

SLIDE 41

References II

Settersten, R. A., Jr. and K. U. Mayer (1997). The measurement of age, age structuring, and the life course. Annual Review of Sociology 23, 233–261.

Studer, M. (2012). Étude des inégalités de genre en début de carrière académique à l’aide de méthodes innovatrices d’analyse de données séquentielles. Thèse de doctorat no 777, Faculté des sciences économiques et sociales, Université de Genève.

Wu, L. L. (2000). Some comments on ‘Sequence analysis and optimal matching methods in sociology: Review and prospect’. Sociological Methods & Research 29(1), 41–64.
