Sequential data analysis with TraMineR, Part 2 (Gilbert Ritschard)



slide-1
SLIDE 1

Sequential data analysis - 2

Sequential data analysis with TraMineR, Part 2

Gilbert Ritschard

Department of Econometrics and Laboratory of Demography University of Geneva http://mephisto.unige.ch/biomining

APA-ATI Workshop on Exploratory Data Mining University of Southern California, Los Angeles, CA, July 2009

8/7/2009gr 1/100

slide-2
SLIDE 2

Sequential data analysis - 2

Outline

1. Dissimilarities among pairs of state sequences
2. Mining event sequences
3. Conclusion: Sequence of analyses

8/7/2009gr 2/100


slide-4
SLIDE 4

Sequential data analysis - 2 Dissimilarities among pairs of state sequences

(Recall) Creating the state sequence object

R> library(TraMineR)
R> data(mvad)
R> names(mvad)
 [1] "id"        "weight"    "male"      "catholic"  "Belfast"   "N.Eastern"
 [7] "Southern"  "S.Eastern" "Western"   "Grammar"   "funemp"    "gcse5eq"
[13] "fmpr"      "livboth"   "Jul.93"    "Aug.93"    "Sep.93"    "Oct.93"
[19] "Nov.93"    "Dec.93"    "Jan.94"    "Feb.94"    "Mar.94"    "Apr.94"
[25] "May.94"    "Jun.94"    "Jul.94"    "Aug.94"    "Sep.94"    "Oct.94"
[31] "Nov.94"    "Dec.94"    "Jan.95"    "Feb.95"    "Mar.95"    "Apr.95"
[37] "May.95"    "Jun.95"    "Jul.95"    "Aug.95"    "Sep.95"    "Oct.95"
[43] "Nov.95"    "Dec.95"    "Jan.96"    "Feb.96"    "Mar.96"    "Apr.96"
[49] "May.96"    "Jun.96"    "Jul.96"    "Aug.96"    "Sep.96"    "Oct.96"
[55] "Nov.96"    "Dec.96"    "Jan.97"    "Feb.97"    "Mar.97"    "Apr.97"
[61] "May.97"    "Jun.97"    "Jul.97"    "Aug.97"    "Sep.97"    "Oct.97"
[67] "Nov.97"    "Dec.97"    "Jan.98"    "Feb.98"    "Mar.98"    "Apr.98"
[73] "May.98"    "Jun.98"    "Jul.98"    "Aug.98"    "Sep.98"    "Oct.98"
[79] "Nov.98"    "Dec.98"    "Jan.99"    "Feb.99"    "Mar.99"    "Apr.99"
[85] "May.99"    "Jun.99"
R> mvad.lab <- seqstatl(mvad[, 17:86])
R> mvad.shortlab <- c("EM", "FE", "HE", "JL", "SC", "TR")
R> mvad.seq <- seqdef(mvad, 17:86, states = mvad.shortlab, labels = mvad.lab)

8/7/2009gr 4/100

slide-5
SLIDE 5

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Measures of dissimilarity between sequences

Section outline

1. Dissimilarities among pairs of state sequences

  • Measures of dissimilarity between sequences: LCP, LCS, Optimal matching
  • Clustering and MDS: Cluster analysis, Plotting sequences by cluster, Multidimensional scaling (MDS)
  • Sequence dispersion
  • Analysis of sequence discrepancy

8/7/2009gr 5/100

slide-6
SLIDE 6

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Measures of dissimilarity between sequences

Dissimilarities between pairs of sequences

Distance between sequences

  • Different metrics (LCP, LCS, OM)

Once we have pairwise dissimilarities, we can

  • Determine a central sequence (centro-type)
  • Measure the discrepancy between sequences
  • Cluster a set of sequences
  • MDS scatterplot representation of sequences
  • Heterogeneity analysis of a set of sequences (ANOH)
  • Dissimilarity analysis (Induction trees)

8/7/2009gr 6/100


slide-10
SLIDE 10

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Measures of dissimilarity between sequences

Dissimilarity measures provided by TraMineR

Three measures available:

1. Longest Common Prefix (LCP)
2. Longest Common Subsequence (LCS)
3. Optimal Matching (OM)

8/7/2009gr 7/100

slide-11
SLIDE 11

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Measures of dissimilarity between sequences

LCP : Longest Common Prefix

LCP: longest common prefix between two sequences. LLCP: Length of LCP

R> mvad.seq[2, ]
  Sequence
2 FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-
R> mvad.seq[5, ]
  Sequence
5 FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-FE-
R> seqLLCP(mvad.seq[2, ], mvad.seq[5, ])
[1] 25

The LLCP between the two sequences is 25.
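The prefix length can also be checked by hand. The following is a minimal base R sketch, not TraMineR's implementation: the helper llcp() and the two toy sequences are hypothetical and only illustrate what seqLLCP() measures, namely the number of leading positions at which both sequences agree.

```r
# Hypothetical helper: length of the longest common prefix of two vectors
llcp <- function(x, y) {
  n <- min(length(x), length(y))
  if (n == 0) return(0L)
  # Positions (within the shared length) where the two sequences differ
  mismatch <- which(x[seq_len(n)] != y[seq_len(n)])
  if (length(mismatch) == 0) n else mismatch[1] - 1L
}

# Toy sequences (assumed data): they agree on the first two positions only
x <- c("FE", "FE", "FE", "EM", "EM")
y <- c("FE", "FE", "HE", "EM", "EM")
llcp(x, y)
```

Note that the agreement on positions 4 and 5 does not count: only the common start of the two sequences matters for the LCP.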

8/7/2009gr 8/100

slide-12
SLIDE 12

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Measures of dissimilarity between sequences

LCP Distance

LLCP is a measure of proximity. Following Elzinga (2008), we transform it into a distance:

distance: $d_P(x, y) = |x| + |y| - 2A_p(x, y)$

normalized distance: $D_P(x, y) = 1 - \dfrac{A_p(x, y)}{\sqrt{|x| \cdot |y|}}$

where $A_p(x, y)$ is the LLCP between $x$ and $y$, and $|x|$ and $|y|$ are the lengths of $x$ and $y$.
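As a worked check of the two formulas, take mvad sequences 2 and 5, whose LLCP is 25. The length of 70 used here is an assumption derived from the 70 monthly columns (17:86) passed to seqdef(); the results agree with the seqdist() output reported for this pair.

```latex
\begin{align*}
d_P(x, y) &= |x| + |y| - 2A_p(x, y) = 70 + 70 - 2 \cdot 25 = 90, \\
D_P(x, y) &= 1 - \frac{A_p(x, y)}{\sqrt{|x| \cdot |y|}}
           = 1 - \frac{25}{\sqrt{70 \cdot 70}}
           = 1 - \frac{25}{70} \approx 0.6429.
\end{align*}
```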

8/7/2009gr 9/100


slide-16
SLIDE 16

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Measures of dissimilarity between sequences

LCP in TraMineR

The seqdist() TraMineR function computes the matrix of the distances between all pairs of sequences. The method=... option should be used to select the distance measure. With option norm=TRUE we get the normalized form.

8/7/2009gr 10/100


slide-19
SLIDE 19

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Measures of dissimilarity between sequences

Example on the first 6 mvad sequences

Non-normalized LCP distance

R> seqdist(mvad.seq[1:6, ], method = "LCP", norm = FALSE)
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    0  140  140  140  140  140
[2,]  140    0  140  140   90  140
[3,]  140  140    0   92  140  140
[4,]  140  140   92    0  140  140
[5,]  140   90  140  140    0  140
[6,]  140  140  140  140  140    0

8/7/2009gr 11/100

slide-20
SLIDE 20

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Measures of dissimilarity between sequences

Example on the first 6 mvad sequences

Normalized LCP distance

R> seqdist(mvad.seq[1:6, ], method = "LCP", norm = TRUE)
     [,1]      [,2]      [,3]      [,4]      [,5] [,6]
[1,]    0 1.0000000 1.0000000 1.0000000 1.0000000    1
[2,]    1 0.0000000 1.0000000 1.0000000 0.6428571    1
[3,]    1 1.0000000 0.0000000 0.6571429 1.0000000    1
[4,]    1 1.0000000 0.6571429 0.0000000 1.0000000    1
[5,]    1 0.6428571 1.0000000 1.0000000 0.0000000    1
[6,]    1 1.0000000 1.0000000 1.0000000 1.0000000    0

8/7/2009gr 12/100

slide-21
SLIDE 21

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Measures of dissimilarity between sequences

LCS: Longest Common Subsequences

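LLCS: length of the longest common subsequence shared by two sequences. Example:

x : 1-1-1-2-2-3-3
y : 1-1-1-4-3-3-4

LLCS = 5

Distance measure: $d_{LCS}(x, y) = A_\ell(x, x) + A_\ell(y, y) - 2A_\ell(x, y)$

Normalized form: $D_{LCS}(x, y) = 1 - \dfrac{A_\ell(x, y)}{\sqrt{|x| \cdot |y|}}$

where $A_\ell(x, y)$ is the LLCS of $x$ and $y$, so that $A_\ell(x, x) = |x|$.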
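The LLCS of the example can be verified with the classical dynamic-programming recursion. This is a base R sketch under stated assumptions: llcs() is a hypothetical helper, not a TraMineR function, and the normalized form is taken as one minus the LLCS divided by the geometric mean of the lengths, by analogy with the LCP distance.

```r
# Hypothetical helper: length of the longest common subsequence (dynamic programming)
llcs <- function(x, y) {
  n <- length(x)
  m <- length(y)
  # L[i + 1, j + 1] holds the LLCS of the first i elements of x and the first j of y
  L <- matrix(0L, n + 1, m + 1)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      L[i + 1, j + 1] <- if (x[i] == y[j]) {
        L[i, j] + 1L
      } else {
        max(L[i, j + 1], L[i + 1, j])
      }
    }
  }
  L[n + 1, m + 1]
}

x <- c(1, 1, 1, 2, 2, 3, 3)
y <- c(1, 1, 1, 4, 3, 3, 4)

a <- llcs(x, y)                                # common subsequence 1-1-1-3-3
d <- llcs(x, x) + llcs(y, y) - 2 * a           # raw LCS distance
D <- 1 - a / sqrt(length(x) * length(y))       # normalized form (assumed as stated above)
c(LLCS = a, distance = d, normalized = D)
```

Unlike the prefix, a subsequence may skip elements, which is why the shared 3-3 near the end still contributes to the LLCS.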

8/7/2009gr 13/100


slide-25
SLIDE 25

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Measures of dissimilarity between sequences

LLCS: example

R> x <- c(1, 1, 1, 2, 2, 3, 3)
R> y <- c(1, 1, 1, 4, 3, 3, 4)
R> seqdist(seqdef(rbind(x, y)), method = "LCS")
     [,1] [,2]
[1,]    0    4
[2,]    4    0
R> seqdist(seqdef(rbind(x, y)), method = "LCS", norm = TRUE)
          [,1]      [,2]
[1,] 0.0000000 0.2857143
[2,] 0.2857143 0.0000000

8/7/2009gr 14/100

slide-26
SLIDE 26

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Measures of dissimilarity between sequences

Optimal matching (optimal alignment)

  • Based on the Levenshtein (1966) distance
  • Inspired by the alignment methods used in biology (DNA or protein sequences)
  • Introduced in the social sciences by Abbott and Forrest (1986)

8/7/2009gr 15/100


slide-30
SLIDE 30

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Measures of dissimilarity between sequences

Optimal matching (OM): principle

The goal is to transform one sequence into the other, using two types of operations:

  • Insertion or deletion (indel) of an element
  • Substitution of an element

Each operation has a cost. The OM distance is the minimal total cost of transforming one sequence into the other.

8/7/2009gr 16/100


slide-33
SLIDE 33

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Measures of dissimilarity between sequences

OM : example

Consider the two sequences:

1: 1 1 2 2 2 3
2: 1 1 2 2 3 3

Insertion of element ‘2’ into sequence 2:

1: 1 1 2 2 2 3
2: 1 1 2 2 2 3 3

Deletion of element ‘3’ from sequence 2:

1: 1 1 2 2 2 3
2: 1 1 2 2 2 3

The two sequences are now identical.

8/7/2009gr 17/100


slide-37
SLIDE 37

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Measures of dissimilarity between sequences

OM: substitution example

Consider the two sequences:

1: 1 1 2 2 2 3
2: 1 1 2 2 3 3

Substitution of ‘3’ by element ‘2’ in sequence 2:

1: 1 1 2 2 2 3
2: 1 1 2 2 2 3
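Both routes (one insertion plus one deletion, or a single substitution) can be compared with a small dynamic-programming sketch in base R. This is an illustration under assumptions, not TraMineR's algorithm: om_distance() is a hypothetical helper, the example reads the two sequences as 1 1 2 2 2 3 and 1 1 2 2 3 3, and the costs (indel 1, substitution 2) are chosen for illustration only.

```r
# Hypothetical helper: minimal edit cost between two sequences,
# given a single indel cost and a matrix sm of substitution costs
om_distance <- function(x, y, indel, sm) {
  n <- length(x)
  m <- length(y)
  D <- matrix(0, n + 1, m + 1)
  D[, 1] <- (0:n) * indel   # delete all leading elements of x
  D[1, ] <- (0:m) * indel   # insert all leading elements of y
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      sub <- sm[as.character(x[i]), as.character(y[j])]
      D[i + 1, j + 1] <- min(D[i, j] + sub,          # substitution (cost 0 if equal)
                             D[i, j + 1] + indel,    # deletion
                             D[i + 1, j] + indel)    # insertion
    }
  }
  D[n + 1, m + 1]
}

states <- c("1", "2", "3")
sm <- matrix(2, 3, 3, dimnames = list(states, states))  # constant substitution cost 2
diag(sm) <- 0

x <- c(1, 1, 2, 2, 2, 3)
y <- c(1, 1, 2, 2, 3, 3)
om_distance(x, y, indel = 1, sm = sm)
```

With these costs the insertion-deletion route and the substitution route both cost 2; lowering the substitution cost below twice the indel cost makes the substitution the cheaper route.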

8/7/2009gr 18/100


slide-39
SLIDE 39

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Measures of dissimilarity between sequences

Assigning indel and substitution costs

Indel costs:

  • Same cost for each insertion or deletion: the indel cost is a single constant.

Substitution costs:

  • Each substitution may receive a different cost, stored in a matrix of substitution costs.
  • However, costs are symmetrical: $c_{i,j} = c_{j,i}$.

8/7/2009gr 19/100


slide-41
SLIDE 41

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Measures of dissimilarity between sequences

Defining substitution costs

  • Constant cost: $c_{i,j} = c$ (the user provides $c$)
  • Based on transition rates (no additional input required):

    $c_{i,j} = c_{j,i} = 2 - p(i_t \mid j_{t-1}) - p(j_t \mid i_{t-1})$

  • Custom costs (the user provides the whole cost matrix)
  • Learned optimal costs (Gauthier et al., 2008, and their TCOFFEE software)
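The transition-rate formula can be illustrated in base R. This sketch only mirrors the formula; it makes no claim about the internals of seqsubm(), and the three toy sequences and all object names are hypothetical.

```r
# Toy state sequences (assumed data), one row per individual
states <- c("A", "B", "C")
seqs <- rbind(c("A", "A", "B", "B"),
              c("A", "B", "B", "C"),
              c("C", "C", "A", "A"))

# Pair each state with its successor and count the transitions
from <- as.vector(seqs[, -ncol(seqs)])
to   <- as.vector(seqs[, -1])
counts <- unclass(table(factor(from, states), factor(to, states)))

# trate[i, j] estimates p(j at t | i at t - 1): each row sums to 1
trate <- counts / rowSums(counts)

# Substitution cost: 2 minus the two transition rates between i and j
cost <- 2 - trate - t(trate)
diag(cost) <- 0
round(cost, 3)
```

Frequent transitions between two states lower the cost of substituting one for the other, and adding the transposed rate matrix is what makes the costs symmetrical.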

8/7/2009gr 20/100


slide-44
SLIDE 44

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Measures of dissimilarity between sequences

Using Optimal Matching in TraMineR

1. Create the state sequence object with seqdef().
2. Get a substitution cost matrix: supply your own, or compute one with seqsubm().
3. Compute the matrix of OM distances with seqdist(..., method="OM", indel=..., sm=...).

8/7/2009gr 21/100


slide-47
SLIDE 47

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Measures of dissimilarity between sequences

Cost Matrix: Unique Costs

R> subm.unique <- seqsubm(mvad.seq, method = "CONSTANT", cval = 2)
R> subm.unique
     EM-> FE-> HE-> JL-> SC-> TR->
EM->    0    2    2    2    2    2
FE->    2    0    2    2    2    2
HE->    2    2    0    2    2    2
JL->    2    2    2    0    2    2
SC->    2    2    2    2    0    2
TR->    2    2    2    2    2    0

8/7/2009gr 22/100

slide-48
SLIDE 48

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Measures of dissimilarity between sequences

Cost Matrix: Custom Costs

R> subm.custom <- matrix(c(0, 1, 1, 2, 1, 1,
+                          1, 0, 1, 2, 1, 2,
+                          1, 1, 0, 3, 1, 2,
+                          2, 2, 3, 0, 3, 1,
+                          1, 1, 1, 3, 0, 2,
+                          1, 2, 2, 1, 2, 0), nrow = 6, ncol = 6, byrow = TRUE,
+     dimnames = list(mvad.shortlab, mvad.shortlab))
R> subm.custom
   EM FE HE JL SC TR
EM  0  1  1  2  1  1
FE  1  0  1  2  1  2
HE  1  1  0  3  1  2
JL  2  2  3  0  3  1
SC  1  1  1  3  0  2
TR  1  2  2  1  2  0

8/7/2009gr 23/100

slide-49
SLIDE 49

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Measures of dissimilarity between sequences

Cost Matrix: Based on Transition Rates

R> subm.txrate <- seqsubm(mvad.seq, method = "TRATE")
R> subm.txrate
        EM->    FE->    HE->    JL->    SC->    TR->
EM-> 0.00000 1.97008 1.98723 1.95173 1.98536 1.95950
FE-> 1.97008 0.00000 1.99318 1.98266 1.99092 1.99235
HE-> 1.98723 1.99318 0.00000 1.99584 1.98184 1.99949
JL-> 1.95173 1.98266 1.99584 0.00000 1.99385 1.97808
SC-> 1.98536 1.99092 1.98184 1.99385 0.00000 1.99666
TR-> 1.95950 1.99235 1.99949 1.97808 1.99666 0.00000

8/7/2009gr 24/100

slide-50
SLIDE 50

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Measures of dissimilarity between sequences

Computing the distances

Using the substitution cost matrix, we compute distances

R> mvad.dist <- seqdist(mvad.seq, method = "OM", indel = 4,
+     sm = subm.custom, norm = TRUE)
R> round(mvad.dist[1:10, 1:10], digits = 2)
      [1]  [2]  [3]  [4]  [5]  [6]  [7]  [8]  [9] [10]
 [1] 0.00 1.03 0.86 0.90 1.03 0.47 0.46 0.34 0.27 0.57
 [2] 1.03 0.00 1.23 1.93 0.16 1.49 0.57 0.69 1.30 1.37
 [3] 0.86 1.23 0.00 1.01 1.39 0.70 1.14 1.20 0.59 1.26
 [4] 0.90 1.93 1.01 0.00 1.93 0.46 1.36 1.24 0.63 0.90
 [5] 1.03 0.16 1.39 1.93 0.00 1.49 0.64 0.69 1.30 1.37
 [6] 0.47 1.49 0.70 0.46 1.49 0.00 0.91 0.80 0.20 0.99
 [7] 0.46 0.57 1.14 1.36 0.64 0.91 0.00 0.11 0.73 0.80
 [8] 0.34 0.69 1.20 1.24 0.69 0.80 0.11 0.00 0.61 0.69
 [9] 0.27 1.30 0.59 0.63 1.30 0.20 0.73 0.61 0.00 0.79
[10] 0.57 1.37 1.26 0.90 1.37 0.99 0.80 0.69 0.79 0.00

8/7/2009gr 25/100


slide-52
SLIDE 52

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Clustering and MDS

Cluster analysis

Once we have a dissimilarity (distance) matrix, we can run any clustering algorithm that accepts such a matrix as input. There are several possibilities in R, for instance in the cluster package:

  • agnes(): agglomerative nesting, i.e. hierarchical clustering (average, Ward, ...).
  • diana(): divisive analysis.
  • pam(): partitioning around medoids (non-hierarchical and faster, but the number of clusters must be set a priori).
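As a small illustration of clustering directly from dissimilarities, here is a sketch on a toy 4 x 4 matrix. The matrix values and object names are hypothetical; only agnes(), pam(), as.hclust() and cutree() are actual functions of the cluster and stats packages.

```r
library(cluster)

# Toy dissimilarity matrix (assumed values): items 1-2 are close, items 3-4 are close
d <- matrix(c(0, 1, 6, 7,
              1, 0, 5, 6,
              6, 5, 0, 2,
              7, 6, 2, 0), nrow = 4, byrow = TRUE)

# Hierarchical clustering (Ward) computed on the dissimilarities themselves
ward <- agnes(as.dist(d), diss = TRUE, method = "ward")
cl.ward <- cutree(as.hclust(ward), k = 2)

# Partitioning around medoids: the number of clusters k is fixed in advance
cl.pam <- pam(as.dist(d), k = 2, diss = TRUE)$clustering

cl.ward
cl.pam
```

Both methods should separate items 1-2 from items 3-4 here; on real sequence data the hierarchical and partitioning solutions need not coincide.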

8/7/2009gr 27/100


slide-54
SLIDE 54

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Clustering and MDS

Hierarchical clustering (Ward)

R> library(cluster)
R> mvad.clusterward <- agnes(mvad.dist, diss = TRUE, method = "ward")
R> plot(mvad.clusterward, ask = FALSE, which.plots = 2)

[Figure: dendrogram of the agnes() Ward clustering of mvad.dist; vertical axis: Height; agglomerative coefficient = 0.99]

8/7/2009gr 28/100

slide-55
SLIDE 55

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Clustering and MDS

Warning!!!

Do not forget to specify the diss = TRUE option. Otherwise (i.e. by default), the functions agnes(), diana(), pam(), ... treat the dissimilarity matrix as a data table and first compute the Euclidean distances between its rows.

8/7/2009gr 29/100

slide-56
SLIDE 56

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Clustering and MDS

Retrieving cluster membership

Select the number of clusters, cut tree at chosen level, and store cluster membership into a vector.

R> mvad.cl3 <- cutree(mvad.clusterward, k = 3)
R> mvad.cl3[1:10]
 [1] 1 2 1 1 2 1 1 1 1 3
R> clust.labels <- c("Employment", "Education", "Jobless")
R> mvad.cl3.factor <- factor(mvad.cl3, levels = c(1, 2, 3),
+     labels = clust.labels)

8/7/2009gr 30/100

slide-57
SLIDE 57

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Clustering and MDS

Exploring clusters graphically

Three types of graphics:

1. Transversal distributions with seqdplot()
2. Frequency plots with seqfplot()
3. Individual index-plots with seqiplot()

Required argument: state sequence object. Use group = cluster.membership.factor to get plots by cluster.

8/7/2009gr 31/100

slide-58
SLIDE 58

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Clustering and MDS

Transversal Distributions

R> seqdplot(mvad.seq, group = mvad.cl3.factor)

8/7/2009gr 32/100

slide-59
SLIDE 59

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Clustering and MDS

Most frequent sequences

R> seqfplot(mvad.seq, group = mvad.cl3.factor)

8/7/2009gr 33/100

slide-60
SLIDE 60

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Clustering and MDS

Individual sequences

R> seqiplot(mvad.seq, group = mvad.cl3.factor, tlim = 0, border = NA,
+     space = 0)

8/7/2009gr 34/100

slide-61
SLIDE 61

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Clustering and MDS

Sorting sequences for i-plot display

The previous i-plots become clearer if we sort the sequences. Several sort keys are possible:

  • distance to the most frequent sequence;
  • distance to the centro-type or any other useful reference;
  • scores on the first factor of an MDS analysis.

8/7/2009gr 35/100

slide-62
SLIDE 62

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Clustering and MDS

Computing distance to most frequent sequence

Compute, in each cluster, distances to most frequent sequence (refseq = 0). Using here the custom substitution cost matrix.

R> mvad.distom <- numeric(nrow(mvad))
R> mvad.distom[mvad.cl3 == 1] <- seqdist(mvad.seq[mvad.cl3 == 1, ],
+     refseq = 0, method = "OM", indel = 4, sm = subm.custom)
R> mvad.distom[mvad.cl3 == 2] <- seqdist(mvad.seq[mvad.cl3 == 2, ],
+     refseq = 0, method = "OM", indel = 4, sm = subm.custom)
R> mvad.distom[mvad.cl3 == 3] <- seqdist(mvad.seq[mvad.cl3 == 3, ],
+     refseq = 0, method = "OM", indel = 4, sm = subm.custom)

8/7/2009gr 36/100

slide-63
SLIDE 63

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Clustering and MDS

Sort: Distance to most frequent sequence

R> seqiplot(mvad.seq, group = mvad.cl3.factor, tlim = 0, border = NA,
+     space = 0, sortv = mvad.distom)

8/7/2009gr 37/100

slide-64
SLIDE 64

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Clustering and MDS

Sort: First factor of MDS analysis

R> mds1d <- cmdscale(mvad.dist, k = 1)
R> seqiplot(mvad.seq, group = mvad.cl3.factor, tlim = 0, border = NA,
+     space = 0, sortv = mds1d)
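What the sortv argument needs is one numeric score per sequence. The base-R sketch below illustrates how cmdscale() with k = 1 yields such scores and how they induce an ordering. The 5 x 5 dissimilarity matrix is a made-up toy example (not the mvad distances), and the sign of the MDS axis is arbitrary, so the ordering may come out reversed.

```r
# Toy dissimilarities between five hypothetical sequences (assumption:
# these values are invented for illustration, not taken from mvad).
d <- matrix(c(0, 2, 6, 8, 9,
              2, 0, 4, 6, 7,
              6, 4, 0, 2, 3,
              8, 6, 2, 0, 1,
              9, 7, 3, 1, 0), nrow = 5, byrow = TRUE)

# Classical MDS with one dimension: an n x 1 matrix of scores
mds1 <- cmdscale(d, k = 1)

# Ordering of the sequences along the first MDS axis
ord <- order(mds1[, 1])
print(ord)
```

Passing the score vector itself (rather than the ordering) as sortv, as done on the slide, lets the plotting function do the sorting within each group.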

8/7/2009gr 38/100

slide-65
SLIDE 65

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Clustering and MDS

Scatterplot (MDS)

Through Multidimensional Scaling (MDS), we get a scatter plot of sequences

R> mds2d <- cmdscale(mvad.dist, k = 2)
R> plot(mds2d, type = "n")
R> points(mds2d[mvad.cl3 == 1, ], pch = 16, col = "red")
R> points(mds2d[mvad.cl3 == 2, ], pch = 16, col = "blue")
R> points(mds2d[mvad.cl3 == 3, ], pch = 16, col = "green")
R> legend("bottomright", fill = c("red", "blue", "green"),
+     legend = clust.labels)

8/7/2009gr 39/100

slide-66
SLIDE 66

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Clustering and MDS

Sequence scatterplot colored by cluster

[Figure: MDS scatterplot of the sequences, mds2d[,1] (horizontal axis) against mds2d[,2] (vertical axis); points colored by cluster: Employment, Education, Jobless.]

8/7/2009gr 40/100

slide-67
SLIDE 67

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Clustering and MDS

Code for scatterplot colored by sex

R> plot(mds2d, type = "n")
R> points(mds2d[mvad$male == "yes", ], pch = 16, col = "red")
R> points(mds2d[mvad$male == "no", ], pch = 23, col = "blue")
R> legend("bottomright", col = c("red", "blue"), pch = c(16,
+     23), legend = c("Men", "Women"))

8/7/2009gr 41/100

slide-68
SLIDE 68

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Clustering and MDS

Sequence scatterplot colored by sex

[Figure: MDS scatterplot of the sequences, mds2d[,1] (horizontal axis) against mds2d[,2] (vertical axis); points marked by sex: Men, Women.]

8/7/2009gr 42/100

slide-69
SLIDE 69

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Sequence dispersion

Section outline

1

Dissimilarities among pairs of state sequences Measures of dissimilarity between sequences

LCP LCS Optimal matching

Clustering and MDS

Cluster analysis Plotting sequences by cluster Multidimensional scaling (MDS)

Sequence dispersion Analysis of sequence discrepancy

8/7/2009gr 43/100

slide-70
SLIDE 70

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Sequence dispersion

Dispersion of the set of sequences

From the distance matrix, we get the pseudo-variance of the set of sequences. The sum of squares SS can be expressed in terms of distances between pairs:

\[
SS = \sum_{i=1}^{n} (y_i - \bar{y})^2
   = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=i+1}^{n} (y_i - y_j)^2
   = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=i+1}^{n} d_{ij}
\]

Setting d_ij equal to the OM, LCP, LCS ... distance, we get SS. We can then apply the ANOVA principle (Studer et al., 2009).
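The identity between the usual sum of squares and its pairwise form can be checked numerically. The base-R sketch below uses a small made-up numeric vector and takes d_ij as the squared difference between two values. The last line divides SS by n, which is the scaling that matches the dissvar() and ANOVA outputs reported in these slides (30434.455 / 712 = 42.745).

```r
# Toy data (invented for illustration)
y <- c(2, 4, 4, 7, 9)
n <- length(y)

# Usual sum of squares around the mean
ss_direct <- sum((y - mean(y))^2)

# Pairwise form: d_ij = (y_i - y_j)^2, summed over pairs i < j, divided by n
d <- as.matrix(dist(y))^2
ss_pairs <- sum(d[upper.tri(d)]) / n

# Pseudo-variance: SS divided by n
pseudo_var <- ss_pairs / n

print(c(ss_direct = ss_direct, ss_pairs = ss_pairs, pseudo_var = pseudo_var))
```

Only the pairwise form carries over to sequences, for which no mean exists but a dissimilarity matrix does.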

8/7/2009gr 44/100

slide-73
SLIDE 73

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Sequence dispersion

Compute the sequence dispersion

R> distMatLCS <- seqdist(mvad.seq, method = "LCS")
R> distMatLCS[1:6, 1:7]
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    0  140  116  108  140   64   60
[2,]  140    0   72  140   22  140   80
[3,]  116   72    0   68   90   72   60
[4,]  108  140   68    0  140   46  112
[5,]  140   22   90  140    0  140   90
[6,]   64  140   72   46  140    0   68
R> dissvar(distMatLCS)
[1] 42.74502

8/7/2009gr 45/100

slide-74
SLIDE 74

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Analysis of sequence discrepancy

Section outline

1

Dissimilarities among pairs of state sequences Measures of dissimilarity between sequences

LCP LCS Optimal matching

Clustering and MDS

Cluster analysis Plotting sequences by cluster Multidimensional scaling (MDS)

Sequence dispersion Analysis of sequence discrepancy

8/7/2009gr 46/100

slide-75
SLIDE 75

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Analysis of sequence discrepancy

Analysis of sequence discrepancy

ANOVA-like analysis based on pairwise dissimilarities. We decompose the SS (sum-of-squares equivalent):

\[ SST = SSB + SSW \]

Here, with the formula shown earlier,

\[
SST = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=i+1}^{n} d_{ij}, \qquad
SSW = \sum_{g} \left( \frac{1}{n_g} \sum_{i=1}^{n_g} \sum_{j=i+1}^{n_g} d_{ij,g} \right), \qquad
SSB = SST - SSW
\]
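The decomposition can be sketched in a few lines of base R. The helper names ss_from_dist() and discrepancy() are hypothetical (they are not TraMineR functions), and the data are a made-up numeric example with squared differences as dissimilarities, so that the result can be checked against the classical sums of squares.

```r
# Sum-of-squares equivalent from a matrix of pairwise dissimilarities
ss_from_dist <- function(d) {
  d <- as.matrix(d)
  sum(d[upper.tri(d)]) / nrow(d)
}

# SST, SSW, SSB and pseudo R2 for a grouping factor
discrepancy <- function(d, group) {
  d <- as.matrix(d)
  group <- as.factor(group)
  sst <- ss_from_dist(d)
  # Within part: apply the same formula inside each group and add up
  ssw <- sum(sapply(levels(group), function(g) {
    idx <- which(group == g)
    ss_from_dist(d[idx, idx, drop = FALSE])
  }))
  c(SST = sst, SSW = ssw, SSB = sst - ssw, R2 = (sst - ssw) / sst)
}

# Toy example: two well separated groups
y <- c(1, 2, 3, 11, 12, 13)
g <- c("a", "a", "a", "b", "b", "b")
d <- as.matrix(dist(y))^2
res <- discrepancy(d, g)
print(res)
```

With sequence data, d would be replaced by the matrix returned by seqdist().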

8/7/2009gr 47/100

slide-76
SLIDE 76

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Analysis of sequence discrepancy

Pseudo R-square and ANOVA Table

ANOVA table for m groups

          Discrepancy   df                   Mean Discr.   F
Between   SSB           dfB = m − 1          SSB / dfB     (SSB / SSW) (dfW / dfB)
Within    SSW           dfW = Σ_g n_g − m    SSW / dfW
Total     SST           dfT = n − 1

Pseudo R2: R2 = SSB / SST

8/7/2009gr 48/100

slide-78
SLIDE 78

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Analysis of sequence discrepancy

Pseudo F

Pseudo F:

\[ F = \frac{SSB / (m - 1)}{SSW / (n - m)} \]

Normality is not defensible in this setting, so F cannot be compared with an F distribution. Significance is assessed through a permutation test: repeatedly reassign each covariate profile at random to one of the observed sequences and recompute F. This gives the empirical distribution of F under independence.
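A minimal base-R sketch of such a permutation test follows. The function pseudo_f() is a hypothetical helper, the data are simulated, and the "+ 1" in the p-value is one common convention that I assume here; the output shown in these slides reports a p-value of 0, so the package evidently uses a different convention.

```r
# Pseudo F from a dissimilarity matrix and a grouping factor
pseudo_f <- function(d, group) {
  group <- as.factor(group)
  n <- nrow(d)
  m <- nlevels(group)
  ss <- function(x) sum(x[upper.tri(x)]) / nrow(x)
  sst <- ss(d)
  ssw <- sum(sapply(levels(group), function(g) {
    ss(d[group == g, group == g, drop = FALSE])
  }))
  ((sst - ssw) / (m - 1)) / (ssw / (n - m))
}

# Simulated example: two groups with clearly different means
set.seed(1)
y <- c(rnorm(20, mean = 0), rnorm(20, mean = 3))
group <- rep(c("no", "yes"), each = 20)
d <- as.matrix(dist(y))^2

f_obs <- pseudo_f(d, group)

# Shuffling the group labels breaks any link between covariate and data
R <- 999
f_perm <- replicate(R, pseudo_f(d, sample(group)))

# Share of permuted F values at least as large as the observed one
p_value <- (sum(f_perm >= f_obs) + 1) / (R + 1)
print(c(F = f_obs, p = p_value))
```

The dissimilarity matrix is computed once; only the labels are permuted, which keeps each replication cheap.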

8/7/2009gr 49/100

slide-79
SLIDE 79

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Analysis of sequence discrepancy

Analysis of sequence discrepancy

Running an ANOVA like analysis for gcse5eq

R> mvad.lcs <- seqdist(mvad.seq, method = "LCS")
R> da <- dissassoc(mvad.lcs, group = mvad$gcse5eq, R = 1000)

8/7/2009gr 50/100

slide-80
SLIDE 80

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Analysis of sequence discrepancy

ANOVA output

R> print(da)
Pseudo ANOVA table:
             SS  df        MSE
Exp    2499.945   1 2499.94539
Res   27934.510 710   39.34438
Total 30434.455 711   42.80514

Test values (p-values based on 999 permutation):
   PseudoF   PseudoR2 PseudoF_Pval  PseudoT PseudoT_Pval
  63.54009 0.08214195            0 1.199912

Variance per level:
        n variance
no    452 37.48481
yes   260 42.27453
Total 712 42.74502

8/7/2009gr 51/100

slide-81
SLIDE 81

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Analysis of sequence discrepancy

Distribution of pseudo F

R> hist(da, col = "cyan")

[Figure: histogram titled "Distribution of PseudoF"; horizontal axis PseudoF, vertical axis Frequency.]

8/7/2009gr 52/100

slide-82
SLIDE 82

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Analysis of sequence discrepancy

Multiple factor analysis

Generalize the previous approach to multiple covariates. There are different approaches. Here, we measure the additional contribution of each covariate v once all other covariates have been accounted for. The F statistic reads

\[ F_v = \frac{(SSB_c - SSB_v) / p}{SSW_c / (n - m - 1)} \]

where SSB_c and SSW_c are the explained and residual sums of squares of the full model, SSB_v is the explained sum of squares of the model after removing variable v, and p is the number of indicators or contrasts used to encode the covariate v.

Significance is again assessed through permutation tests.

8/7/2009gr 53/100

slide-83
SLIDE 83

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Analysis of sequence discrepancy

Running a Multiple factor analysis

R> da.mfac <- dissmfac(mvad.lcs ~ male + Grammar + funemp + gcse5eq +
+     fmpr + livboth, data = mvad, R = 1000)
R> print(da.mfac)
  Variable   PseudoF    PseudoR2 p_value
1     male  3.274802 0.003840223   0.026
2  Grammar 21.124081 0.024771330   0.000
3   funemp  4.483016 0.005257046   0.003
4  gcse5eq 75.725976 0.088800698   0.000
5     fmpr  2.715988 0.003184926   0.045
6  livboth  2.314571 0.002714201   0.078
7    Total 24.829102 0.174448528   0.000

8/7/2009gr 54/100

slide-84
SLIDE 84

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Analysis of sequence discrepancy

Differences over time

How do differences between groups vary over time? How do differences between men's and women's insertion trajectories vary over time?

  • Compute R2 for short sliding windows (length 2).
  • We thus get a sequence of R2 values, which can be plotted.
  • Similarly, we can plot series of
      the total residual discrepancy (SSW);
      the residual discrepancy of each group (SSG).

8/7/2009gr 55/100

slide-87
SLIDE 87

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Analysis of sequence discrepancy

Differences over time

R> mvad.diff <- seqdiff(mvad.seq, group = mvad$gcse5eq)
R> mvad.diff$stat[1:4, ]
        PseudoF   PseudoR2  PseudoT
Sep.93 29.09196 0.03936176 2.313692
Oct.93 29.39664 0.03975760 2.223468
Nov.93 29.76849 0.04024027 2.265784
Dec.93 30.09793 0.04066750 2.304112
R> mvad.diff$variance[1:4, ]
              no       yes     Total
Sep.93 0.3688107 0.3113979 0.3620982
Oct.93 0.3691362 0.3127219 0.3629661
Nov.93 0.3704210 0.3133136 0.3642237
Dec.93 0.3725771 0.3146893 0.3663363

8/7/2009gr 56/100

slide-88
SLIDE 88

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Analysis of sequence discrepancy

Plotting R-squares over time

R> plot(mvad.diff)

[Figure: PseudoR2 (vertical axis, ticks from 0.04 to 0.12) plotted over time, from Sep.93 to Apr.99.]

8/7/2009gr 57/100

slide-89
SLIDE 89

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Analysis of sequence discrepancy

Plotting residual discrepancy over time

R> plot(mvad.diff, stat = "Variance")

[Figure: Variance (vertical axis, ticks from 0.20 to 0.35) plotted over time, from Sep.93 to Apr.99, with one curve each for no, yes and Total.]

8/7/2009gr 58/100

slide-90
SLIDE 90

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Analysis of sequence discrepancy

Tree structured discrepancy analysis

Objective: find the most important predictors and their interactions.

  • Iteratively segment the cases using values of the covariates (predictors), such that the groups are as homogeneous as possible.
  • At each step, we select the covariate and split with the highest R2.
  • The significance of the split is assessed through a permutation F test.
  • Growing stops when the selected split is not significant.

8/7/2009gr 59/100

slide-92
SLIDE 92

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Analysis of sequence discrepancy

Growing the tree

R> dt <- disstree(mvad.lcs ~ male + Grammar + funemp + gcse5eq +
+     fmpr + livboth, data = mvad, R = 5000)
R> print(dt)
Dissimilarity tree
 Global R2: 0.113
 |-- Root [ 712 ] var: 42.7
   |-> gcse5eq R2: 0.0821
   |-- no [ 452 ] var: 37.5
     |-> funemp R2: 0.0107
     |-- no [ 362 ] var: 35.9
       |-> male R2: 0.0123
       |-- no [ 146 ] var: 38.7
       |-- yes [ 216 ] var: 33.3
     |-- yes [ 90 ] var: 41.8
   |-- yes [ 260 ] var: 42.3
     |-> Grammar R2: 0.0534
     |-- no [ 183 ] var: 42.2
     |-- yes [ 77 ] var: 34.9

8/7/2009gr 60/100

slide-93
SLIDE 93

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Analysis of sequence discrepancy

Creating a Graphviz plot of the tree

Using simplified interface to generate a file for GraphViz

R> seqtree2dot(dt, "fg_mvadseqtree", seqdata = mvad.seq, type = "d",
+     border = NA, withlegend = FALSE, axes = FALSE, ylab = "",
+     yaxis = FALSE)

8/7/2009gr 61/100

slide-94
SLIDE 94

Sequential data analysis - 2 Dissimilarities among pairs of state sequences Analysis of sequence discrepancy

Graphical Tree

8/7/2009gr 62/100

slide-95
SLIDE 95

Sequential data analysis - 2 Mining event sequences

Outline

1

Dissimilarities among pairs of state sequences

2

Mining event sequences

3

Conclusion: Sequence of analyses

8/7/2009gr 63/100

slide-96
SLIDE 96

Sequential data analysis - 2 Mining event sequences Event sequences

Section outline

2

Mining event sequences Event sequences Creating event subsequences in TraMineR Seeking frequent and discriminant subsequences Looking for state patterns Looking for specific subsequences Temporal constraints

8/7/2009gr 64/100

slide-97
SLIDE 97

Sequential data analysis - 2 Mining event sequences Event sequences

Analysis of event sequences

Objective

  • Focus on events, rather than states. Interest in the patterns of events.
      Pattern of events: events that occur systematically together and in the same order.
      Are there typical “patterns” of events?
  • Relationship with covariates.
      Which patterns best discriminate specific groups? Typical differences in event sequences between men and women.
  • Event patterns vs typical state sequencing.
  • Association rules between event subsequences:
      The sequence Leaving home → Childbirth is generally followed by Marriage → Second Childbirth.
      Not yet available, but ... coming soon.

8/7/2009gr 65/100

slide-102
SLIDE 102

Sequential data analysis - 2 Mining event sequences Event sequences

Events and transitions

Event sequence: time-ordered transitions.
Transition: set of non-ordered events.

Example: (LHome, Union) → (Marriage) → (Childbirth)

(LHome, Union) and (Marriage) are transitions.
“LHome”, “Union” and “Marriage” are events.

8/7/2009gr 66/100

slide-104
SLIDE 104

Sequential data analysis - 2 Mining event sequences Event sequences

Subsequence

A subsequence B of a sequence A is an event sequence such that

  • each event of B is an event of A;
  • the events occur in B in the same (weak) order as in A.

Example:

A  (LHome, Union) → (Marriage) → (Childbirth)
B  (LHome, Marriage) → (Childbirth)
C  (LHome) → (Childbirth)

C is a subsequence of A and B, since the order of events is respected.
B is not a subsequence of A, since “LHome” and “Marriage” occur simultaneously in B, whereas “Marriage” occurs after “LHome” in A.
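The definition can be made concrete with a small base-R check. This is a sketch under an assumption of mine: each transition of B must be contained in a distinct, later-or-equal-ordered transition of A, so that events simultaneous in B must also be simultaneous in A. The function is_subsequence() and the list representation are hypothetical, not part of TraMineR.

```r
# A sequence is a list of transitions; a transition is a character
# vector of simultaneous events.
is_subsequence <- function(b, a) {
  pos <- 0  # index of the last transition of `a` already used
  for (tr in b) {
    found <- FALSE
    # Scan forward for the next transition of `a` containing all events of `tr`
    while (pos < length(a) && !found) {
      pos <- pos + 1
      found <- all(tr %in% a[[pos]])
    }
    if (!found) return(FALSE)
  }
  TRUE
}

A <- list(c("LHome", "Union"), "Marriage", "Childbirth")
B <- list(c("LHome", "Marriage"), "Childbirth")
C <- list("LHome", "Childbirth")

c_in_a <- is_subsequence(C, A)
c_in_b <- is_subsequence(C, B)
b_in_a <- is_subsequence(B, A)
print(c(C_in_A = c_in_a, C_in_B = c_in_b, B_in_A = b_in_a))
```

Matching each transition of B to the earliest admissible transition of A is sufficient here, because an earlier match never rules out a later one.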

8/7/2009gr 67/100

slide-106
SLIDE 106

Sequential data analysis - 2 Mining event sequences Event sequences

Frequent and discriminant subsequences

Support of a subsequence: number of sequences that contain the subsequence.

A frequent subsequence is a subsequence whose support is greater than a minimal support.
A subsequence is discriminant between groups if its support varies significantly across groups.

8/7/2009gr 68/100

slide-109
SLIDE 109

Sequential data analysis - 2 Mining event sequences Creating event subsequences in TraMineR

Section outline

2

Mining event sequences Event sequences Creating event subsequences in TraMineR Seeking frequent and discriminant subsequences Looking for state patterns Looking for specific subsequences Temporal constraints

8/7/2009gr 69/100

slide-110
SLIDE 110

Sequential data analysis - 2 Mining event sequences Creating event subsequences in TraMineR

Data Format

To do event sequence analysis in TraMineR, we need an event sequence object. We create it with seqecreate(), to which we provide data in either of the following forms:

  • Time Stamped Event (TSE) format, which lets us specify the events directly.
  • A state sequence object together with a choice of automatic conversion:
      transition  A distinct event for the transition between each pair of states.
      state       A distinct event for the start of a spell in a given state.
      period      An event for the start and another for the end of a spell in a given state.

8/7/2009gr 70/100

slide-113
SLIDE 113

Sequential data analysis - 2 Mining event sequences Creating event subsequences in TraMineR

“Time Stamped Event” (TSE)

id         Individual identifier.
timestamp  Time stamp (real valued) of the event.
event      The code (string) of the event.

One line per event.

R> data(actcal.tse)
R> head(actcal.tse)
  id time      event
1  1    0   PartTime
2  2    0 NoActivity
3  2    4      Start
4  2    4   FullTime
5  2   11       Stop
6  3    0   PartTime
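To see how a TSE table relates to transitions, the base-R sketch below groups events that share the same id and time stamp. The small data frame is a made-up example with hypothetical event codes A, B and C (it is not actcal.tse), and the grouping only illustrates the idea; it is not how seqecreate() is implemented.

```r
# Toy TSE data: one line per event (invented values)
tse <- data.frame(
  id    = c(1, 1, 1, 2, 2),
  time  = c(0, 5, 5, 0, 8),
  event = c("A", "B", "C", "A", "C"),
  stringsAsFactors = FALSE
)

# Events with the same id and time stamp form one transition
trans <- aggregate(event ~ id + time, data = tse,
                   FUN = paste, collapse = ",")

# Order transitions by individual, then by time
trans <- trans[order(trans$id, trans$time), ]
print(trans)
```

Individual 1 thus has the transition (A) at time 0 followed by the transition (B,C) at time 5.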

8/7/2009gr 71/100

slide-114
SLIDE 114

Sequential data analysis - 2 Mining event sequences Creating event subsequences in TraMineR

Creating an event sequence object

Using the TSE format

Function seqecreate(). With arguments id, timestamp and event we provide the columns of the TSE format.

R> actcal.seqe <- seqecreate(id = actcal.tse$id,
+     timestamp = actcal.tse$time, event = actcal.tse$event)

Alternatively, we can use the data argument

R> actcal.seqe <- seqecreate(data = actcal.tse)

8/7/2009gr 72/100

slide-115
SLIDE 115

Sequential data analysis - 2 Mining event sequences Creating event subsequences in TraMineR

Creating an event sequence object

From a state sequence object

Function seqecreate(). The argument tevent sets the choice of automatic conversion. Here, we want one event per transition:

R> data(mvad)
R> mvad.shortlab <- c("EM", "FE", "HE", "JL", "SC", "TR")
R> mvad.seq <- seqdef(mvad[, 17:86], labels = mvad.shortlab)
R> mvad.seqe <- seqecreate(mvad.seq, tevent = "transition")

8/7/2009gr 73/100

slide-116
SLIDE 116

Sequential data analysis - 2 Mining event sequences Creating event subsequences in TraMineR

Transition definition matrix

Conversion is done by means of a transition definition matrix. Here is how we can generate it (seqecreate() makes it for you):

R> seqetm(mvad.seq, method = "transition")
            employment FE      HE      joblessness school  training
employment  "EM"       "EM>FE" "EM>HE" "EM>JL"     "EM>SC" "EM>TR"
FE          "FE>EM"    "FE"    "FE>HE" "FE>JL"     "FE>SC" "FE>TR"
HE          "HE>EM"    "HE>FE" "HE"    "HE>JL"     "HE>SC" "HE>TR"
joblessness "JL>EM"    "JL>FE" "JL>HE" "JL"        "JL>SC" "JL>TR"
school      "SC>EM"    "SC>FE" "SC>HE" "SC>JL"     "SC"    "SC>TR"
training    "TR>EM"    "TR>FE" "TR>HE" "TR>JL"     "TR>SC" "TR"

8/7/2009gr 74/100

slide-117
SLIDE 117

Sequential data analysis - 2 Mining event sequences Creating event subsequences in TraMineR

Event sequence representation

Each sequence is displayed in the following form:

(e1,e2,...)-time-(e2,...)-time

where

  • (e1,e2,...) is the transition defined by the simultaneous occurrence of events e1, e2, ...;
  • time is the time (numerical value) between two transitions (or to the end of the observation time).

R> print(mvad.seqe[2])
[1] (FE)-36.00-(FE>HE)-34.00
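The "transition" conversion and the display format can be mimicked in base R with rle(). This is only a sketch: to_transition_events() is a hypothetical helper, and the input (36 months of FE followed by 34 months of HE) is my reading of the output shown above, not data extracted from mvad.

```r
# Convert a vector of states into the "(event)-time-..." display,
# with one event for the initial state and one per change of state.
to_transition_events <- function(states) {
  runs <- rle(as.character(states))
  k <- length(runs$values)
  events <- runs$values[1]
  if (k > 1) {
    # Event "X>Y" for each change from state X to state Y
    events <- c(events, paste(runs$values[-k], runs$values[-1], sep = ">"))
  }
  # Time spent after each event is the length of the spell that follows it
  paste0("(", events, ")-", runs$lengths, collapse = "-")
}

x <- c(rep("FE", 36), rep("HE", 34))
shown <- to_transition_events(x)
print(shown)
```

The two durations add up to the 70 monthly positions of the state sequence.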

8/7/2009gr 75/100

slide-119
SLIDE 119

Sequential data analysis - 2 Mining event sequences Seeking frequent and discriminant subsequences

Section outline

2

Mining event sequences Event sequences Creating event subsequences in TraMineR Seeking frequent and discriminant subsequences Looking for state patterns Looking for specific subsequences Temporal constraints

8/7/2009gr 76/100

slide-120
SLIDE 120

Sequential data analysis - 2 Mining event sequences Seeking frequent and discriminant subsequences

Finding most frequent subsequences

Function seqefsub(), to which we must provide

  • the event sequences (an event sequence object);
  • the minimal support (with argument pMinSupport).

R> mvad.fsubseq <- seqefsub(mvad.seqe, pMinSupport = 0.01)
R> mvad.fsubseq[1:5]
   Subsequence   Support Count
1         (FE) 0.3862360   275
2      (FE>EM) 0.2879213   205
3      (TR>EM) 0.2528090   180
4         (SC) 0.2514045   179
5 (FE)-(FE>EM) 0.2289326   163

Computed on 712 event sequences
   Constraint           Value
  countMethod One by sequence

8/7/2009gr 77/100

slide-123
SLIDE 123

Sequential data analysis - 2 Mining event sequences Seeking frequent and discriminant subsequences

Graphical display of most frequent subsequences

We can simply apply plot() to the object returned by seqefsub(). Use indexes ([1:15]) to select the subsequences to include (subsequences are sorted by decreasing frequency). Other arguments are passed to the function barplot().

R> plot(mvad.fsubseq[1:15], col = "cyan", ylab = "Frequency",
+     xlab = "Subsequences", cex = 1.5)

[Figure: bar plot of Frequency (vertical axis, 0.0 to 0.3) by Subsequences (horizontal axis) for the 15 most frequent subsequences, in decreasing order: (FE), (FE>EM), (TR>EM), (SC), (FE)−(FE>EM), (TR), (JL>EM), (EM>JL), (TR)−(TR>EM), (EM), (SC>HE), (EM>JL)−(JL>EM), (SC)−(SC>HE), (FE>JL), (HE>EM).]

8/7/2009gr 78/100

slide-124
SLIDE 124

Sequential data analysis - 2 Mining event sequences Seeking frequent and discriminant subsequences

Finding most discriminant subsequences

The aim is to identify the frequent subsequences that are most strongly related to a given factor.

  • Discriminant power is evaluated with the p-value of a Chi-square independence test.
  • Function seqecmpgroup(), to which we provide the frequent subsequence object and a group factor (gcse5eq).
  • A Bonferroni correction is applied when passing the argument method="bonferroni".

8/7/2009gr 79/100

slide-127
SLIDE 127

Sequential data analysis - 2 Mining event sequences Seeking frequent and discriminant subsequences

Seeking discriminant subsequences

R> mvad.discr <- seqecmpgroup(mvad.fsubseq, group = mvad$gcse5eq)
R> mvad.discr[1:5]
   Subsequence    Support      p.value statistic index    Freq.no  Freq.yes
1      (SC>HE) 0.10393258 1.445408e-19  81.88088    11 0.02433628 0.2423077
2 (SC)-(SC>HE) 0.09831461 7.250286e-18  74.14723    13 0.02433628 0.2269231
3      (HE>EM) 0.08426966 7.487216e-13  51.41219    15 0.02654867 0.1846154
4      (EM>HE) 0.07162921 5.019013e-12  47.67954    21 0.01991150 0.1615385
5         (SC) 0.25140449 7.798571e-12  46.81571     4 0.16592920 0.4000000
   Resid.no Resid.yes
1 -5.249117  6.920999
2 -5.016083  6.613742
3 -4.227342  5.573781
4 -4.108312  5.416839
5 -3.624293  4.778657

Computed on 712 event sequences
   Constraint           Value
  countMethod One by sequence
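The first row of this output can be reproduced by hand with base R, which clarifies what the columns mean. The counts below are derived from the reported frequencies and group sizes (0.02433628 x 452 = 11 and 0.2423077 x 260 = 63). That the statistic corresponds to chisq.test() with its default continuity correction is my inference from the numbers, not something the slides state. The Bonferroni step is applied to the five listed p-values only, for illustration; the package would adjust over all tested subsequences.

```r
# Presence/absence of the subsequence (SC>HE) by gcse5eq group
tab <- matrix(c(11, 452 - 11,
                63, 260 - 63),
              nrow = 2, byrow = TRUE,
              dimnames = list(gcse5eq = c("no", "yes"),
                              subsequence = c("present", "absent")))

# Chi-square independence test (default continuity correction for 2 x 2)
test <- chisq.test(tab)
print(test$statistic)

# Pearson residuals of the "present" column, one per group
print(test$residuals[, "present"])

# Bonferroni adjustment of the five p-values listed in the output
p_five <- c(1.445408e-19, 7.250286e-18, 7.487216e-13,
            5.019013e-12, 7.798571e-12)
p_adj <- p.adjust(p_five, method = "bonferroni")
print(p_adj)
```

The residuals show the direction of the association: the subsequence is under-represented in the "no" group and over-represented in the "yes" group.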


slide-128
SLIDE 128

Sequential data analysis - 2 Mining event sequences Seeking frequent and discriminant subsequences

Graphical display, frequencies

R> plot(mvad.discr[1:15], cex = 1.5)

[Figure: frequencies of the 15 most discriminant subsequences, one panel per gcse5eq group ("no", "yes"), frequency axis from 0.0 to 0.4. Color legend: Pearson residuals (-4, -2, neutral, 2, 4).]
Subsequences displayed: (SC>HE), (SC)-(SC>HE), (HE>EM), (EM>HE), (SC), (TR>EM), (TR), (SC)-(HE>EM), (FE>HE), (SC)-(EM>HE), (SC)-(SC>HE)-(HE>EM), (SC>HE)-(HE>EM), (FE)-(FE>HE), (TR)-(TR>EM), (EM)

slide-129
SLIDE 129

Sequential data analysis - 2 Mining event sequences Seeking frequent and discriminant subsequences

Graphical display, residuals

R> plot(mvad.discr[1:15], ptype = "resid", cex = 1.5)

[Figure: Pearson residuals of the 15 most discriminant subsequences, one panel per gcse5eq group ("no", "yes"), residual axis from -4 to 6. Color legend: Pearson residuals (-4, -2, neutral, 2, 4).]
Subsequences displayed: (SC>HE), (SC)-(SC>HE), (HE>EM), (EM>HE), (SC), (TR>EM), (TR), (SC)-(HE>EM), (FE>HE), (SC)-(EM>HE), (SC)-(SC>HE)-(HE>EM), (SC>HE)-(HE>EM), (FE)-(FE>HE), (TR)-(TR>EM), (EM)

slide-130
SLIDE 130

Sequential data analysis - 2 Mining event sequences Looking for state patterns

Section outline

2 Mining event sequences
  Event sequences
  Creating event subsequences in TraMineR
  Seeking frequent and discriminant subsequences
  Looking for state patterns
  Looking for specific subsequences
  Temporal constraints

slide-131
SLIDE 131

Sequential data analysis - 2 Mining event sequences Looking for state patterns

Looking for state patterns

By assigning an event to the start of each spell spent in a given state, frequent subsequences correspond to state patterns.
We can thus, for example, look for the state patterns that best discriminate clusters.

R> mvad.pat <- seqecreate(mvad.seq, tevent = "state")
R> mvad.pat.fsubseq <- seqefsub(mvad.pat, pMinSupport = 0.01)
R> discr.pat.cluster <- seqecmpgroup(mvad.pat.fsubseq, group = mvad.cl3.factor)
R> plot(discr.pat.cluster[1:10])
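What "one event at the start of each spell" means can be sketched in base R with rle(). The state sequence below is a made-up toy example (not taken from mvad), and spell_events() is a hypothetical helper written for this illustration; seqecreate() performs the conversion internally and may differ in detail.

```r
# Hypothetical helper: turn a vector of monthly states into spell-start events.
spell_events <- function(states) {
  runs  <- rle(states)                            # consecutive runs of the same state = spells
  start <- cumsum(c(0, head(runs$lengths, -1)))   # time elapsed before each spell starts
  data.frame(event = runs$values, time = start, duration = runs$lengths,
             stringsAsFactors = FALSE)
}

# Toy sequence: 3 months in school, 2 months jobless, 4 months employed
toy <- c("SC", "SC", "SC", "JL", "JL", "EM", "EM", "EM", "EM")
ev  <- spell_events(toy)
print(ev)

# State pattern of the sequence, written in the (A)-(B) notation used on the slides
pattern <- paste0("(", ev$event, ")", collapse = "-")
print(pattern)
```

Each spell contributes exactly one event, so the pattern records the order of the states visited and ignores how long each spell lasted.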

slide-133
SLIDE 133

Sequential data analysis - 2 Mining event sequences Looking for state patterns

Discriminant state patterns

Frequencies of 10 most discriminant

[Figure: frequencies of the 10 most discriminant state patterns, one panel per cluster (Employment, Education, Jobless), frequency axis from 0.0 to 0.8. Color legend: Pearson residuals (-4, -2, neutral, 2, 4).]
Patterns displayed: (HE), (SC)-(HE), (EM), (FE)-(HE), (JL), (TR)-(EM), (SC), (TR), (HE)-(EM), (EM)-(HE)

slide-134
SLIDE 134

Sequential data analysis - 2 Mining event sequences Looking for state patterns

Discriminant state patterns

Frequencies of next 15

R> plot(discr.pat.cluster[11:25])

[Figure: frequencies of the next 15 discriminant state patterns, one panel per cluster (Employment, Education, Jobless), frequency axis from 0.0 to 0.5. Color legend: Pearson residuals (-4, -2, neutral, 2, 4).]
Patterns displayed: (EM)-(JL), (SC)-(HE)-(EM), (SC)-(EM)-(HE), (TR)-(JL), (FE)-(JL), (SC)-(JL), (SC)-(FE)-(JL), (FE)-(EM)-(HE), (JL)-(EM), (JL)-(JL), (FE)-(EM)-(JL), (FE)-(HE)-(EM), (EM)-(EM)-(JL), (JL)-(EM)-(JL), (SC)-(FE)-(HE)

slide-135
SLIDE 135

Sequential data analysis - 2 Mining event sequences Looking for state patterns

Discriminant state patterns

Residuals of 10 most discriminant

R> plot(discr.pat.cluster[1:10], ptype = "resid")

[Figure: Pearson residuals of the 10 most discriminant state patterns, one panel per cluster (Employment, Education, Jobless), residual axis from -5 to 15. Color legend: Pearson residuals (-4, -2, neutral, 2, 4).]
Patterns displayed: (HE), (SC)-(HE), (EM), (FE)-(HE), (JL), (TR)-(EM), (SC), (TR), (HE)-(EM), (EM)-(HE)

slide-137
SLIDE 137

Sequential data analysis - 2 Mining event sequences Looking for specific subsequences

Looking for specific subsequences

With seqefsub() we can also search for predefined subsequences, i.e. find the sequences that contain at least one of the provided subsequences.

For example, (JL) → (EM) and (EM) → (JL).

R> subseq <- c("(JL)-(EM)", "(EM)-(JL)")
R> mysubseq <- seqefsub(mvad.pat, strsubseq = subseq)
R> mysubseq
  Subsequence   Support
1   (JL)-(EM) 0.2303371
2   (EM)-(JL) 0.1910112

Computed on 712 event sequences
   Constraint           Value
  countMethod One by sequence
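Support is the share of sequences that contain the subsequence at least once (the "One by sequence" counting shown in the output). A minimal base R sketch on three made-up event sequences follows; contains_subseq() is a hypothetical helper, not a TraMineR function, and it ignores timing and simultaneous events.

```r
# Hypothetical helper: TRUE if the events in 'sub' occur in 'events' in the same order
# (not necessarily consecutively).
contains_subseq <- function(events, sub) {
  pos <- 0
  for (e in sub) {
    hits <- which(events == e)
    hits <- hits[hits > pos]
    if (length(hits) == 0) return(FALSE)
    pos <- hits[1]   # earliest possible match leaves the most room for later events
  }
  TRUE
}

# Toy event sequences (made up for illustration)
toy <- list(
  c("SC", "JL", "EM"),
  c("FE", "EM", "JL", "EM"),
  c("SC", "HE", "EM")
)

# Support = proportion of sequences in which the subsequence is found
support <- function(sub) mean(vapply(toy, contains_subseq, logical(1), sub = sub))

print(support(c("JL", "EM")))   # share of toy sequences containing (JL)-(EM)
print(support(c("EM", "JL")))   # share of toy sequences containing (EM)-(JL)
```

Read the same way, the support of 0.2303371 reported for (JL)-(EM) is consistent with 164 of the 712 sequences containing that pattern.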


slide-138
SLIDE 138

Sequential data analysis - 2 Mining event sequences Looking for specific subsequences

Matrix of occurrences

The function seqeapplysub() generates a matrix with the event sequences as rows and the provided subsequences as columns. The matrix is filled with either
  • the number of occurrences of the subsequences (method = "count"), or
  • the age at first occurrence (method = "age").

R> mysubseq.occ <- seqeapplysub(mysubseq, method = "count")
R> mysubseq.occ[c(655, 701), ]
                                           (JL)-(EM) (EM)-(JL)
(SC)-24.00-(JL)-3.00-(EM)-43.00                    1         0
(FE)-4.00-(EM)-39.00-(JL)-13.00-(EM)-14.00         1         1
R> mysubseq.age <- seqeapplysub(mysubseq, method = "age")
R> mysubseq.age[c(655, 701), ]
                                           (JL)-(EM) (EM)-(JL)
(SC)-24.00-(JL)-3.00-(EM)-43.00                   24        -1
(FE)-4.00-(EM)-39.00-(JL)-13.00-(EM)-14.00        43         4
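The two rows shown can be reproduced by hand. The sketch below rebuilds the event times of sequences 655 and 701 from the printed durations and computes, for a two-event subsequence, the time of the first event of its earliest occurrence, returning -1 when the subsequence is absent (the convention suggested by the output). first_age() is a hypothetical helper written for this illustration; it is not how seqeapplysub() is implemented.

```r
# Event times rebuilt from the printed sequences (cumulative durations, in months)
s655 <- data.frame(event = c("SC", "JL", "EM"),       time = c(0, 24, 27))
s701 <- data.frame(event = c("FE", "EM", "JL", "EM"), time = c(0, 4, 43, 56))

# Hypothetical helper: time at which the earliest occurrence of (a)-(b) starts, or -1 if none
first_age <- function(s, a, b) {
  for (i in which(s$event == a)) {
    # an occurrence exists if event b happens strictly after this instance of a
    if (any(s$event == b & s$time > s$time[i])) return(s$time[i])
  }
  -1
}

ages <- rbind(
  s655 = c(first_age(s655, "JL", "EM"), first_age(s655, "EM", "JL")),
  s701 = c(first_age(s701, "JL", "EM"), first_age(s701, "EM", "JL"))
)
colnames(ages) <- c("(JL)-(EM)", "(EM)-(JL)")
print(ages)

# Presence indicator: 1 if the subsequence is found at least once, 0 otherwise
print((ages >= 0) * 1)
```

The "age" is thus the time stamp of the first event of the subsequence, not of its last event.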

slide-140
SLIDE 140

Sequential data analysis - 2 Mining event sequences Temporal constraints

Temporal constraints

The search can be subjected to constraints defined with seqeconstraint(), which accepts the following arguments:

  maxGap      Maximal time between two events.
  windowSize  Maximal duration of the subsequence.
  ageMin      Minimal age at start of the subsequence.
  ageMax      Maximal age at start of the subsequence.
  ageMaxEnd   Maximal age at end of the subsequence.

There is no need to specify all of them.
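How these constraints act on a single occurrence of a subsequence can be sketched in base R. The function below is a hypothetical illustration built from the definitions above, with NULL meaning "no constraint"; it is not TraMineR's implementation, and whether TraMineR treats the bounds as strict or inclusive is not stated on the slide (inclusive bounds are assumed here).

```r
# Hypothetical checker: 'times' are the ages (time stamps) of the matched events, in order.
# Each constraint is optional; NULL means it is not applied.
satisfies <- function(times, maxGap = NULL, windowSize = NULL,
                      ageMin = NULL, ageMax = NULL, ageMaxEnd = NULL) {
  start <- times[1]
  end   <- times[length(times)]
  gaps  <- diff(times)                       # time between consecutive events
  ok <- TRUE
  if (!is.null(maxGap))     ok <- ok && all(gaps <= maxGap)
  if (!is.null(windowSize)) ok <- ok && (end - start) <= windowSize
  if (!is.null(ageMin))     ok <- ok && start >= ageMin
  if (!is.null(ageMax))     ok <- ok && start <= ageMax
  if (!is.null(ageMaxEnd))  ok <- ok && end <= ageMaxEnd
  ok
}

# Two events 3 months apart, starting at month 24
print(satisfies(c(24, 27), windowSize = 6))             # fits in a 6-month window
# Two events 13 months apart, starting at month 43
print(satisfies(c(43, 56), windowSize = 6))             # does not fit
print(satisfies(c(43, 56), maxGap = 13, ageMin = 12))   # gap and start-age constraints both met
```

Because every argument defaults to "not applied", only the constraints of interest need to be passed, which mirrors the remark that not all of them have to be specified.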

slide-145
SLIDE 145

Sequential data analysis - 2 Mining event sequences Temporal constraints

Setting temporal constraints, example 1

R> myconstraint <- seqeconstraint(windowSize = 6)
R> mysubseq <- seqefsub(mvad.pat, constraint = myconstraint,
+     pMinSupport = 0.01)
R> mysubseq[1:10]
   Subsequence    Support Count
1         (EM) 0.83426966   594
2         (FE) 0.49016854   349
3         (TR) 0.34971910   249
4         (JL) 0.34269663   244
5         (SC) 0.27387640   195
6         (HE) 0.25561798   182
7    (JL)-(EM) 0.11938202    85
8    (EM)-(JL) 0.05477528    39
9    (EM)-(HE) 0.04915730    35
10   (TR)-(EM) 0.04073034    29

Computed on 712 event sequences
   Constraint           Value
   windowSize               6
  countMethod One by sequence

slide-146
SLIDE 146

Sequential data analysis - 2 Mining event sequences Temporal constraints

Setting temporal constraints, example 2

R> myconstraint <- seqeconstraint(maxGap = 2, ageMin = 12)
R> mysubseq <- seqefsub(mvad.pat, constraint = myconstraint,
+     pMinSupport = 0.01)
R> mysubseq[1:10]
   Subsequence    Support Count
1         (EM) 0.67837079   483
2         (JL) 0.29213483   208
3         (HE) 0.25561798   182
4         (FE) 0.14606742   104
5         (TR) 0.12078652    86
6    (JL)-(EM) 0.05337079    38
7         (SC) 0.02528090    18
8    (EM)-(HE) 0.01825843    13
9    (EM)-(JL) 0.01123596     8
10   (EM)-(FE) 0.00983146     7

Computed on 712 event sequences
   Constraint           Value
       maxGap               2
       ageMin              12
  countMethod One by sequence

slide-147
SLIDE 147

Sequential data analysis - 2 Conclusion: Sequence of analyses

Outline

1 Dissimilarities among pairs of state sequences
2 Mining event sequences
3 Conclusion: Sequence of analyses

slide-148
SLIDE 148

Sequential data analysis - 2 Conclusion: Sequence of analyses

Sequence of analyses (state sequences) - I

Explore sequence distribution (seqdplot, seqfplot, seqiplot)
Characteristics of the set of sequences
  • Representative sequence (most frequent, centroid, ...)
  • Dispersion of the sequences (from dissimilarity measures)
  • Sequence of transversal characteristics (entropies, modal states, ...)
  • Distribution of longitudinal characteristics (entropy, turbulence, time spent in each state, ...)
Association between longitudinal characteristics of parallel sequences (family-profession, ego-partner, ...)
Preceding analyses by groups (sex, birth cohorts, ...), comparisons

slide-152
SLIDE 152

Sequential data analysis - 2 Conclusion: Sequence of analyses

Sequence of analyses (state sequences) - II

Study of similarities between individual sequences
  • Building typologies (cluster analysis)
  • Relationships between clusters and covariates (sex, cohort, ...), logistic models, ...
  • Scatterplots (multidimensional scaling)
  • ANOD (analysis of discrepancy): part of the discrepancy explained by one or several factors
  • Segmentation into homogeneous groups through a tree-structured approach

slide-153
SLIDE 153

Sequential data analysis - 2 Conclusion: Sequence of analyses

Sequence of analyses (event sequences)

Seek frequent subsequences
Within constraints
  • Initial and final family life events (leaving home and marriage, for example)
  • Minimal and maximal ages (between 20 and 50 years old)
  • Maximal sequence time span (10 years)
Relationship between frequent event subsequences and covariates (sex, cohort, ...), logistic models.
Effect of experiencing the subsequence on entropy, or on another response variable, ...
Finding the most discriminant subsequences for a given categorical variable (sex, cluster, ...)

slide-157
SLIDE 157

Sequential data analysis - 2 Conclusion: Sequence of analyses

References I

Abbott, A. and J. Forrest (1986). Optimal matching methods for historical sequences. Journal of Interdisciplinary History 16, 471–494.
Abbott, A. and A. Tsay (2000). Sequence analysis and optimal matching methods in sociology, Review and prospect. Sociological Methods and Research 29(1), 3–33. (With discussion, pp 34-76).
Billari, F. C. (2001). The analysis of early life courses: Complex description of the transition to adulthood. Journal of Population Research 18(2), 119–142.
Elzinga, C. H. (2008). Sequence analysis: Metric representations of categorical time series. Sociological Methods and Research. In revision.
Gabadinho, A., G. Ritschard, M. Studer, and N. S. Müller (2008). Mining sequence data in R with TraMineR: A user’s guide. Technical report, Department of Econometrics and Laboratory of Demography, University of Geneva, Geneva. (TraMineR is on CRAN, the Comprehensive R Archive Network).
Gauthier, J.-A., E. D. Widmer, P. Bucher, and C. Notredame (2008). How much does it cost? Optimization of costs in sequence analysis of social science data. Sociological Methods and Research. (forthcoming).

slide-158
SLIDE 158

Sequential data analysis - 2 Conclusion: Sequence of analyses

References II

Levenshtein, V. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10, 707–710.
McVicar, D. and M. Anyadike-Danes (2002). Predicting successful and unsuccessful transitions from school to work using sequence methods. Journal of the Royal Statistical Society A 165(2), 317–334.
Ritschard, G., A. Gabadinho, N. S. Müller, and M. Studer (2008). Mining event histories: A social science perspective. International Journal of Data Mining, Modelling and Management 1(1), 68–90.
Ritschard, G., A. Gabadinho, M. Studer, and N. S. Müller (2009). Converting between various sequence representations. In Z. Ras and A. Dardzinska (Eds.), Advances in Data Management, Volume 223 of Studies in Computational Intelligence, pp. 155–175. Berlin: Springer.
Studer, M., G. Ritschard, A. Gabadinho, and N. S. Müller (2009). Discrepancy analysis of complex objects using dissimilarities. In H. Briand, F. Guillet, G. Ritschard, and D. A. Zighed (Eds.), Advances in Knowledge Discovery and Management, Studies in Computational Intelligence. Berlin: Springer. (submitted).