Visualization Techniques for the Integration of Rank Data


Michael G. Schimek (1), Eva Budinská (2). (1) Medical University of Graz, Graz, Austria; (2) Swiss Institute of Bioinformatics, Lausanne, Switzerland. COMPSTAT 2010, Paris, France, August 22-27, 2010.


SLIDE 1

Visualization Techniques for the Integration of Rank Data

Michael G. Schimek (1), Eva Budinská (2)

(1) Medical University of Graz, Graz, Austria
(2) Swiss Institute of Bioinformatics, Lausanne, Switzerland

COMPSTAT 2010, Paris, France, August 22-27, 2010

M. G. Schimek & E. Budinská: Visualization Techniques for the Integration of Rank Data

SLIDE 2

Motivation

  • In various fields of application we are confronted with lists of distinct objects in rank order.
  • The ordering might be due to a measure of strength of evidence, or to an assessment based on expert knowledge or a technical device.
  • The ranking might also represent some measurement taken on the objects which might not be comparable across the lists, for instance because of different assessment technologies or levels of measurement error.
  • Our aim is to consolidate such lists of common objects and to provide computationally tractable solutions, hence appropriate algorithms and graphs.

SLIDE 3

General assumptions

  • Let us assume ℓ assessors or laboratories (j = 1, 2, ..., ℓ) assigning rank positions to the same set of N distinct objects.
  • The N distinct objects are assessed according to the extent to which a particular attribute is present.
  • All assessors, independently of each other, rank the same objects between 1 and N on the basis of relative performance.
  • The ranking is from 1 to N, without ties.
  • Missing assessments are allowed.
  • The ℓ assessors produce ℓ rank lists τj.
  • There are (ℓ² − ℓ)/2 possible pairs of such lists τj.
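The pair count (ℓ² − ℓ)/2 is simply the number of unordered pairs of lists. A minimal sketch that enumerates them; the function name `list_pairs` is hypothetical and only serves as an illustration:

```python
from itertools import combinations


def list_pairs(num_lists):
    """Return all unordered pairs (i, j), i < j, of list indices 1..num_lists."""
    return list(combinations(range(1, num_lists + 1), 2))


pairs = list_pairs(3)
print(pairs)       # [(1, 2), (1, 3), (2, 3)]
print(len(pairs))  # 3, which equals (3**2 - 3) / 2
```

For ℓ = 3 lists this gives the three pairwise comparisons (τ1, τ2), (τ1, τ3), and (τ2, τ3).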

SLIDE 4

The problem

  • In most applications, especially for large or huge numbers N of objects, it is unlikely that consensus prevails.
  • As a result, only the top-ranked objects matter (the remaining ones show random ordering).
  • Quite often we observe a general decrease, not necessarily monotone, of the probability for consensus rankings with increasing distance from the top rank position.
  • Typically there is reasonable conformity in the rankings for the first, say k, elements of the lists: the notion of top-k rank lists.
  • Tasks: consensus in preference and voting, integration of search engine results, meta-analysis of microarray experiments.

SLIDE 5

A motivating example: U.S. college preference data

  • Avery et al. (2005) developed a statistical model which allows the construction of a ranking of U.S. undergraduate programs based on students' revealed preferences.
  • Data from 1357 high-achieving students (90th percentile of all SAT takers) seeking admission.
  • N = 110 colleges and universities taking part in the national ranking (matriculation tournaments).
  • For each college/university there are two rankings of interest: matriculation rank (MR) and preference rank (PR).
  • There are no missing assignments.
  • Question: Is there a top list of conforming rank assignments?

SLIDE 6

A motivating example: U.S. college preference data

College Name                              MR   PR
Harvard University (o1)                    1    1
California Inst. of Technology (o2)        2    7
Yale University (o3)                       3    5
Massachusetts Inst. of Technology (o4)     4    3
Stanford University (o5)                   5    2
Princeton University (o6)                  6    4
Brown University (o7)                      7    6
Columbia University (o8)                   8    8
Amherst College (o9)                       9   13
Dartmouth College (o10)                   10   11
Wellesley College (o11)                   11   33
University of Pennsylvania (o12)          12   12
University of Notre Dame (o13)            13   14
Swarthmore College (o14)                  14   10
Cornell University (o15)                  15   15
Georgetown University (o16)               16    9
. . .                                    . . . . . .

SLIDE 7

The data stream input

  • The indicator variable takes the value Ij = 1 if the rank given by the second assessor to the object ranked j by the first assessor is at most δ positions away from j, and Ij = 0 otherwise ⇒ data stream.
  • Concordance is assumed for an arbitrary object o when its rank in τi is at most δ index positions apart from its rank in τj.
  • The data stream depends on the distance parameter δ.
  • δ is defined by the shift in index positions of a particular object o in one list, say τi, with respect to the other list, say τj.
  • A sequence of data streams ordered according to δ represents the reduction of discordance.
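The indicator definition translates directly into code. A minimal sketch, assuming two complete rank vectors without ties; the function name `data_stream` is hypothetical, and the data are the MR and PR ranks of the 16 colleges listed on slide 6:

```python
def data_stream(ranks_first, ranks_second, delta):
    """Return the 0/1 stream I_1, I_2, ... for one pair of rank lists.

    Objects are ordered by their rank in the first list. I_j is 1 when the
    second assessor's rank differs from j by at most delta, and 0 otherwise.
    """
    paired = sorted(zip(ranks_first, ranks_second))
    return [1 if abs(first - second) <= delta else 0 for first, second in paired]


# Matriculation rank (MR) and preference rank (PR) of objects o1..o16.
mr = list(range(1, 17))
pr = [1, 7, 5, 3, 2, 4, 6, 8, 13, 11, 33, 12, 14, 10, 15, 9]

for delta in range(6):
    stream = data_stream(mr, pr, delta)
    print(delta, stream, "zeros:", stream.count(0))
```

For δ = 0 to 5 the zero counts come out as 12, 8, 6, 5, 3, 2, so each increase in δ removes some discordance from the stream.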

SLIDE 8

U.S. college data: data streams for δ = 0 to 5

Object    MR   PR   δ=0  δ=1  δ=2  δ=3  δ=4  δ=5
o1         1    1     1    1    1    1    1    1
o2         2    7     0    0    0    0    0    1
o3         3    5     0    0    1    1    1    1
o4         4    3     0    1    1    1    1    1
o5         5    2     0    0    0    1    1    1
o6         6    4     0    0    1    1    1    1
o7         7    6     0    1    1    1    1    1
o8         8    8     1    1    1    1    1    1
o9         9   13     0    0    0    0    1    1
o10       10   11     0    1    1    1    1    1
o11       11   33     0    0    0    0    0    0
o12       12   12     1    1    1    1    1    1
o13       13   14     0    1    1    1    1    1
o14       14   10     0    0    0    0    1    1
o15       15   15     1    1    1    1    1    1
o16       16    9     0    0    0    0    0    0
#(zeros)             12    8    6    5    3    2

SLIDE 9

Selection of k̂ for list truncation

  • Moderate deviation-based inference for random degeneration in paired rank lists (Hall and Schimek, 2008, 2010).
  • For the estimation of the point of degeneration j0 into noise, independent Bernoulli random variables are assumed.
  • A general decrease of the probability pj (need not be monotone) for concordance of rankings with increasing distance j from the top rank is assumed.
  • A distance parameter δ and a tuning parameter ν are required to account for the closeness of the assessors' rankings and the degree of randomness in the assignments.
  • The algorithm represents a simplified mathematical model; it is embedded in an iterative scheme to account for irregular rankings.
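The role of ν as a window over the data stream can be pictured with a deliberately naive stand-in. This is not the moderate-deviation procedure of Hall and Schimek and it omits the iterative scheme; the function name, the fixed threshold of 0.5, and the toy stream are all assumptions made for illustration only:

```python
def estimate_degeneration_point(stream, nu, threshold=0.5):
    """Naive window rule (illustrative assumption, not the published method).

    Slide a window of nu consecutive indicators over the stream and return
    the 1-based start index of the first window whose mean concordance is
    at or below `threshold`. Return None if no window qualifies.
    """
    for start in range(len(stream) - nu + 1):
        window = stream[start:start + nu]
        if sum(window) / nu <= threshold:
            return start + 1
    return None


# Toy data stream: mostly concordant at the top, noisy further down.
toy_stream = [1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1]
j0_hat = estimate_degeneration_point(toy_stream, nu=4)
print(j0_hat)      # 5
print(j0_hat - 1)  # 4 objects kept in the truncated top list
```

A larger ν averages over more positions and so reacts less to isolated zeros, which is one way to read ν as a smoothing parameter.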

SLIDE 10

∆-plot for matriculation rank and preference rank of U.S. colleges

[Figure: ∆-plot showing the number of 0's in the data stream against delta; both axes have tick marks from 20 to 100.]
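The points of a ∆-plot are the zero counts of the data stream for a range of δ values. A minimal sketch of that computation, assuming complete rank vectors without ties; the function name `delta_plot_points` is hypothetical, and the input covers only the 16 colleges listed on slide 6, not all N = 110:

```python
def delta_plot_points(ranks_first, ranks_second, max_delta):
    """Return (delta, number of zeros in the data stream) for delta = 0..max_delta."""
    gaps = [abs(first - second) for first, second in zip(ranks_first, ranks_second)]
    # An object contributes a zero exactly when its rank gap exceeds delta.
    return [(delta, sum(1 for gap in gaps if gap > delta))
            for delta in range(max_delta + 1)]


mr = list(range(1, 17))
pr = [1, 7, 5, 3, 2, 4, 6, 8, 13, 11, 33, 12, 14, 10, 15, 9]

for delta, zeros in delta_plot_points(mr, pr, 22):
    print(delta, zeros)
```

The counts can only decrease or stay level as δ grows, so the δ at which the curve flattens is a natural candidate for the distance parameter.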

SLIDE 11

U.S. college preference data: inference results

  • The choice of δ is based on the ∆-plot.
  • Sharp decline of #(zeros), especially for δ values up to about 20 (around δ = 45 almost no discordance is left).
  • Pilot sample size ν ≥ 4 (functions as a smoothing parameter).
  • For δ = 10 and ν = 4 we obtain the smallest of all stable results: ĵ0 = 16 (15 top-ranking colleges).
  • For δ = 20 and ν = 28 we obtain ĵ0 = 71 (70 top-ranking colleges).
  • Both results make sense and depend on the goal of the study (more than one result because of modest separability).

SLIDE 12

An application: meta-analysis of microarray data

  • Breast cancer data due to Sørlie et al. (2003).
  • Study goal: identification of breast tumor subtypes from gene expression measured by microarrays.
  • Here we consider selected expression data from three independent patient cohorts, called Norway, Norway FU, and Stanford, hybridized on different platforms.
  • Only genes (unique gene symbols) common to all platforms are analyzed.
  • 3 ranked lists, τ1, τ2, and τ3, each of length N = 5812.
  • Our task: identification of a subset of genes supported by all 3 cohorts that can be used for further unsupervised analysis of subtypes of breast cancer.

SLIDE 13

Estimates of j0 for a range of δ values, combining pairwise the lists τ1, τ2, and τ3 (r = 1.2, C = 0.4)

[Figure: j0 estimate plotted against δ (δ from 20 to 4000, j0 estimate from 250 to 5500), with one curve per list pair: tau1 tau2, tau1 tau3, and tau2 tau3.]

SLIDE 14

Aggregation map: Graphical integration of paired ranked lists

  • Define a partial reference list τ₁⁰: any one of the 2 lists with maxj(k̂j) objects among all pairwise comparisons (τ₁⁰ gives the ordering of the objects oi on the vertical axis of the plot).
  • The partial lists τ2, τ3, ..., τℓ are ordered from highest to lowest by their individual kj when compared to the reference list τ₁⁰ (one column per list).
  • In each cell we represent: (1) top-k membership, where 'yes' is denoted by the color grey and 'no' by white; (2) the distance of a current object oi ∈ τ₁⁰ from its position in the other list, on a color scale from red (identical) to yellow (far distant). An integer value denotes the distance, with negative sign if to the left and positive sign if to the right.
  • Implemented in R utilizing the grid add-on package of Murrell (2006).
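The content of one column of the aggregation map can be sketched without any graphics. This is an illustration only, not the authors' R implementation: the function name, the toy object labels, and the sign convention (position in the other list minus position in the reference list) are assumptions.

```python
def aggregation_cells(reference, other, k):
    """Compute the cell contents of one aggregation-map column.

    reference: objects of the partial reference list, best first.
    other:     objects of the compared list, best first.
    k:         top-k length used for the compared list.

    Returns one (object, in_top_k, signed_distance) tuple per reference
    object. Objects missing from the compared list get (object, False, None).
    """
    position = {obj: idx for idx, obj in enumerate(other, start=1)}
    cells = []
    for ref_pos, obj in enumerate(reference, start=1):
        other_pos = position.get(obj)
        if other_pos is None:
            cells.append((obj, False, None))
        else:
            cells.append((obj, other_pos <= k, other_pos - ref_pos))
    return cells


# Toy lists with hypothetical labels.
reference_list = ["a", "b", "c"]
other_list = ["b", "a", "d", "c"]
print(aggregation_cells(reference_list, other_list, k=3))
# [('a', True, 1), ('b', True, -1), ('c', False, 1)]
```

In a plot, the boolean would select the grey or white marking and the absolute distance would be mapped onto the red-to-yellow scale.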

SLIDE 15

Aggregation map for δ = 450, combining τ1, τ2, and τ3

[Figure: aggregation map, delta=450, nu=10; color scale from 0.1 to 1; rows indexed 1 to 57. Each cell shows the signed rank distance of an object from its position in the compared list.]

First panel, reference list τ2 (57 objects), with columns for τ3 and τ1:
ESR1, S100A8, FABP7, SCGB2A2, S100P, TFF3, ABHD2, UGT2B4, S100A1, MGP, NPY1R, HMGCS2, DEFA1, GRIA2, IGLL1, APOD, WFDC2, BF, FABP4, CXCL14, AKR1C3, CRABP1, ENPP5, CEACAM6, GABRP, AMPH, TRPV6, CD24, EGFR, PVALB, CHGB, LBP, MSX2, CYP4Z1, LIV-1, AQP3, HLA-DRB5, KRT5, PPP1R14C, GJB1, RERG, SLPI, MAOB, ERBB2, CNTNAP2, GATA3, CHI3L1, TIMP4, MYCN, KRT17, RIMS2, PRAME, COL17A1, H11, STAT4, 5'OY11.1, LOC120224

Second panel, reference list τ1 (37 objects), with a column for τ3:
APOB, GABRP, SERPINA6, PLP1, CYP4Z1, FABP7, FGF12, ESR1, EXTL2, SCGB2A2, MS4A6A, LRRN3, CYP2A6, S100A8, TLX1, FLJ25270, MYCN, TFF3, NTRK2, GRB14, TRIM29, LRP2, GRIA2, FLJ23834, KIAA0062, FOLR1, KIAA1036, CHGB, ANXA8, GRIK1, PON3, CD79A, CD79B, COL17A1, JDP1, INPP5A, S100A1

SLIDE 16

Summary and conclusions

  • Irregularities, typical for empirical ranked lists, can be well represented by means of data streams.
  • Data streams are distance-dependent: the distance can be evaluated via the ∆-plot.
  • Data stream input is sufficient for (1) inference on the degradation of information and (2) the graphical integration of top-ranked objects.
  • The aggregation map, a new graphical tool, provides additional insight into a top-k set of objects.
  • The approach is computationally tractable and efficient.
  • The procedures will soon be available in the R package TopKLists.
  • The approach has already demonstrated its practical value.
