Computational geometry and statistical depth measures



SLIDE 1

Interface 04 1

Computational geometry and statistical depth measures

Eynat Rafalin

Computer Science Department, Tufts University www.cs.tufts.edu/research/geometry

Joint work with Prof. Diane Souvaine

SLIDE 2

Outline of talk

Data analysis, computational geometry and depth-based statistics

Applications

– A basic technique: the duality transform
– Least Median of Squares (LMS) regression in optimal time
– Half-space depth contours in optimal time
– Depth contours
– Simplicial depth

Future research

SLIDE 3

Computational Geometry

Deals with problems that require geometric algorithms for their solutions.

Systematic study of algorithms and data structures for geometric objects, with a focus on exact algorithms that are asymptotically fast.

At the outset: once exact algorithms have been obtained and refined, and are still slow, move to approximation algorithms.

Computational geometry is everywhere!

SLIDE 4

Computational geometry & Statistics – data analysis

SLIDE 5

Multivariate analysis by Data depth

Data depth - A way of measuring how deep a given point x in R^d is relative to F, a probability distribution, or relative to a given data cloud.

Examples:

– Halfspace (Location, Tukey) depth (Hodges 55, Tukey 75)
– Simplicial depth (Liu 90)
– Convex Hull Peeling depth (Barnett 76, Eddy 82)
– Regression depth (Rousseeuw & Hubert 99)
– Mahalanobis depth (Mahalanobis 36)
– Oja depth (Oja 83)

SLIDE 6

Multivariate analysis by Data depth

Data depth - A way of measuring how deep a given point x in R^d is relative to F, a probability distribution, or relative to a given data cloud.

The concept provides a center-outward ordering of points.

Non-parametric, multivariate statistics.

Robust.

Affine invariance - for many depth functions the choice of axes does not affect the depth values.

SLIDE 7

Outliers and Robustness

Observations that deviate from the main part of the data (outliers) can have an undesirable influence on the analysis of the data.

A robust depth function yields reasonable results even if several unannounced outliers occur in the data [Handbook of statistics 15, Rao & Maddala].

For example

– Depth contours are nested contours that enclose regions with increasing depth
– For half-space depth contours: in the presence of m outliers only the m outermost depth contours may be corrupted by the outliers, but the inner set of depth contours will maintain its shape [Donoho & Gasko 92].

SLIDE 8

Data depth: a characterization, visualization and quantification tool

Deepest point

Outliers

Depth contours

Bag-plot (Box-plot) [Rousseeuw, Ruts, Tukey 99]

Scale curve as a measure of scale [Liu, Parelius, Singh 99]

Fan plot as a measure of tailedness [Liu, Parelius, Singh 99]

Robustified classification and cluster analysis [Rousseeuw, Ruts 96]

SLIDE 9

Fan plots [Liu, Parelius & Singh 99]

50 data points, created from a random distribution, with covariance matrix 4 times identity. The fans are created for data sets containing the 1/6, 2/6, ... central regions. For each region the area of the convex hull (CH) of 2, 4, 6, ...% of the points is computed.

[Figure: fan plot. Axis labels: relative area (CH of p% / CH); percentile of points.]

SLIDE 10

The continuous and finite sample case

Most depth functions are defined with respect to a probability distribution F, considering {X1, ..., Xn} random observations from F.

The finite sample version of the depth function is obtained by replacing F by Fn, the empirical distribution of the sample {X1, ..., Xn}.

In general, computational geometers study the finite sample case!

SLIDE 11

Applications

SLIDE 12

Applications

History

– Shamos, Geometry and statistics: problems at the interface, 1976
– Bentley & Shamos, A problem in multivariate statistics: algorithm, data structure and applications, 1977

SLIDE 13

Star spectrum

Data set of the stellar cluster CYGOB1 (Leroy & Rousseeuw 87)

[Figure: scatter plot of the data set. Axis label: logarithm of light intensity.]

Given a set of points, find a line such that the sum of the squares of the residuals is minimized.

SLIDE 14

Star spectrum

Data set of the stellar cluster CYGOB1 (Leroy & Rousseeuw 87)

[Figure: scatter plot of the data set. Axis label: logarithm of light intensity.]

Given a set of points, find a line such that the median of the squares of the residuals is minimized.

SLIDE 15

Least Median of Squares Regression

Ordinary least sum of squares – low breakdown point

Least median of squares – high breakdown point

Given a set of points, find a line such that the median of the squares of the residuals is minimized.

Equivalently: find two parallel lines at minimum vertical distance from each other with half of the data points in the slab they define.

Naïve approach: O(n^3)

O(n^2 log n) time algorithm for computing the LMS line in R^2 [Souvaine, Steele 87]

An O(n^2) algorithm using duality and topological sweep [Edelsbrunner, Souvaine 90]

[Figure: points A, B, C and line l.]
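To make the slab formulation concrete, here is a minimal brute-force sketch in Python. It uses a known characteristic of the LMS slab (one bounding line passes through two data points), so only slopes defined by pairs of points are tried. The name `lms_line_bruteforce` and the convention h = floor(n/2) + 1 for "half of the data points" are assumptions of this sketch, not taken from the talk. It runs in O(n^3 log n) time, slower than every bound quoted on the slide, and is meant only as an illustration.

```python
from itertools import combinations


def lms_line_bruteforce(points):
    """Return (slope, intercept, half_width) of a narrowest vertical slab
    covering h = n // 2 + 1 points; the LMS line is the slab's midline.

    Assumes at least two points with distinct x coordinates.
    """
    pts = list(points)
    n = len(pts)
    h = n // 2 + 1  # assumed convention for "half of the data points"
    best = None
    for (x1, y1), (x2, y2) in combinations(pts, 2):
        if x1 == x2:
            continue  # vertical direction: no finite slope
        slope = (y2 - y1) / (x2 - x1)
        # Intercepts of the lines with this slope through each data point.
        intercepts = sorted(y - slope * x for x, y in pts)
        # Slide a window over h consecutive intercepts; its extent is the
        # vertical width of a slab covering exactly those h points.
        for i in range(n - h + 1):
            width = intercepts[i + h - 1] - intercepts[i]
            if best is None or width < best[0]:
                mid = (intercepts[i] + intercepts[i + h - 1]) / 2
                best = (width, slope, mid)
    width, slope, intercept = best
    return slope, intercept, width / 2


# Five points on y = 2x + 1 plus two gross outliers.
data = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 9), (2, 50), (3, -40)]
slope, intercept, half_width = lms_line_bruteforce(data)
```

In this small example the outliers do not move the fitted line, which is the high-breakdown behaviour the slide refers to.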

SLIDE 16

Points and lines

It is hard to find an order in a set of points. An arrangement of lines is easier.

A set of points can be transformed into an arrangement of lines, preserving important properties, using the duality transform T: a point (a,b) → a line y=ax+b

SLIDE 17

Duality

Primal → Dual (transform T)

– a point (a,b) → a line y=ax+b
– a line y=cx+d → a point (-c, d)

Preserves slope, vertical distance and the above/below relationship.

Example:

– A = (1,2) → TA: y=x+2
– B = (2,1) → TB: y=2x+1
– C = (3,0) → TC: y=3x
– D = (4,-1) → TD: y=4x-1
– l: y=-x+3 → T(l) = (1,3)
– m: y=-2x+2 → T(m) = (2,2)
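The example values on this slide can be checked with a few lines of Python. This is only a sketch; the helper names (`point_to_line`, `line_to_point`, `height_above`) are made up for the illustration. Lines are stored as (slope, intercept) pairs.

```python
def point_to_line(a, b):
    """Dual of the point (a, b): the line y = a*x + b as (slope, intercept)."""
    return (a, b)


def line_to_point(c, d):
    """Dual of the line y = c*x + d: the point (-c, d)."""
    return (-c, d)


def height_above(point, line):
    """Signed vertical distance from the line up to the point."""
    x, y = point
    slope, intercept = line
    return y - (slope * x + intercept)


A, B, C, D = (1, 2), (2, 1), (3, 0), (4, -1)
l = (-1, 3)  # y = -x + 3
m = (-2, 2)  # y = -2x + 2

# Incidence: each point lies on l, so each dual line passes through T(l).
on_l = [height_above(p, l) == 0 for p in (A, B, C, D)]
through_Tl = [height_above(line_to_point(*l), point_to_line(*p)) == 0
              for p in (A, B, C, D)]

# Vertical distance and above/below: A lies above m by some amount, and the
# dual line T(A) lies above the dual point T(m) by the same amount.
primal_gap = height_above(A, m)
dual_gap = -height_above(line_to_point(*m), point_to_line(*A))
```

Here `dual_gap` is negated because `height_above` measures point minus line, while in the dual the roles of point and line are swapped.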

SLIDE 18

LMS

[Figure: LMS in the primal (points A, B, C, x, y, z and line l) and in the dual (lines TA, TB, TC, Tx, Ty, Tz and point Tl).]

The LMS line bisects a slab bounded by 2 parallel lines, one of which goes through 2 data points and the other goes through one data point.

(Provable characteristics of LMS)

SLIDE 19

Least Median of Squares (LMS) Regression

– The LMS line can be computed in 2D in O(n^2) [Edelsbrunner, Souvaine 90]. Earlier result: [Souvaine, Steele 87]
– Practical approximation algorithm [Mount, Netanyahu, Romanik, Silverman, Yu 97], [Mount, Erickson, Har-Peled 04]

SLIDE 20

Half-space depth

SLIDE 21

The half-space depth of a point p is the minimum number of points of a given set S lying in any closed halfplane bounded by a line through p.

Question – how to compute the half-space depth contours efficiently? (naive cost per point – O(n^2))

[Figure: points A, B, C, D, E, F, G and query point p.]
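A naive O(n^2)-per-point computation of this definition might look like the following Python sketch. It is an illustration under stated assumptions, not the method from the talk: the function name `halfspace_depth` is hypothetical, and points are assumed to be tuples of integers (or other exact numbers) so that the sign tests are exact. The idea is that the count only changes when the line through p passes a data point, so it suffices to look at lines rotated infinitesimally to either side of each direction p→q.

```python
def halfspace_depth(p, points):
    """Minimum number of points of `points` in a closed half-plane whose
    bounding line passes through p (naive O(n^2) evaluation)."""
    px, py = p
    coincident = sum(1 for q in points if q == p)  # in every closed half-plane
    best = None
    for qx, qy in points:
        dx, dy = qx - px, qy - py
        if dx == 0 and dy == 0:
            continue
        left = right = forward = backward = 0
        for rx, ry in points:
            ex, ey = rx - px, ry - py
            if ex == 0 and ey == 0:
                continue
            cross = dx * ey - dy * ex
            if cross > 0:
                left += 1
            elif cross < 0:
                right += 1
            elif dx * ex + dy * ey > 0:
                forward += 1   # on the line, same side of p as q
            else:
                backward += 1  # on the line, opposite side of p
        # Rotating the line slightly pushes the collinear points to one side
        # or the other; these are the four resulting half-plane counts.
        candidate = min(left + backward, right + forward,
                        left + forward, right + backward)
        if best is None or candidate < best:
            best = candidate
    return coincident + (best or 0)


square = [(0, 0), (2, 0), (2, 2), (0, 2)]
depth_center = halfspace_depth((1, 1), square)
depth_corner = halfspace_depth((0, 0), square)
depth_outside = halfspace_depth((5, 5), square)
```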

SLIDE 22

The depth of a point p – the minimum number of points of S lying in any closed halfspace determined by a line through p.

Primal → Dual

– A line l through p → a point T(l) on the line T(p)
– k points in the half-plane above the line l through p → k lines above the point T(l)

To count how many lines lie above a point, look at its level in the arrangement.

[Figure: primal points A, B, C, D, E, F, G and p with line l; dual lines TA, TB, TC, TD, TE, TF, TG and T(p) with point T(l).]

SLIDE 23

All the half-space depth contours in R^2 can be computed in O(n^2) time using topological sweep [Miller, Ramaswami, Rousseeuw, Sellares, Souvaine, Streinu, Struyf 01]

[Figure: primal points A, B, C, D, E, F, G and p with the depth 1 and depth 2 contours; dual lines TA, TB, TC, TD, TE, TF, TG and Tp.]

SLIDE 24

Half-space depth contours

To compute the k-th half-space depth contour (all points of depth at least k):

The minimum number of points lying in any closed half-space determined by a line through p corresponds to the minimum level of a dual point T(l) on the dual line T(p), so find the k-th level in the dual.

SLIDE 25

Sweeping an arrangement of lines

Vertical line sweep

– Report all intersection pairs
– sorted in order of x coordinate
– O(n^2 log n) time and O(n) space

Topological line sweep

– Report all intersection pairs
– according to a partial order related to the levels of the arrangement
– O(n^2) time and O(n) space
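For reference, the output of a vertical line sweep (all intersection pairs in x order) can be reproduced by a much simpler brute-force routine, sketched below in Python. This is not a sweep: it stores every intersection and therefore needs O(n^2) space rather than O(n), although the O(n^2 log n) time matches. The function name and the (slope, intercept) representation with integer coefficients are assumptions of the sketch.

```python
from fractions import Fraction
from itertools import combinations


def intersections_sorted_by_x(lines):
    """All pairwise intersections of non-parallel lines, sorted by x.

    `lines` is a list of (slope, intercept) pairs with integer entries.
    Returns tuples (x, y, i, j) where i < j index the two lines.
    """
    events = []
    for (i, (a1, b1)), (j, (a2, b2)) in combinations(enumerate(lines), 2):
        if a1 == a2:
            continue  # parallel lines never cross
        x = Fraction(b2 - b1, a1 - a2)  # exact rational arithmetic
        events.append((x, a1 * x + b1, i, j))
    events.sort()
    return events


# y = x, y = -x + 2, y = 0
events = intersections_sorted_by_x([(1, 0), (-1, 2), (0, 0)])
pairs_in_order = [(i, j) for _, _, i, j in events]
```

A topological sweep would instead be free to report these pairs in any order consistent with the arrangement, which is what removes the log factor.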

SLIDE 26

Duality in 3D

Primal → Dual (transform T): a point (a,b,c) → a plane z=ax+by+c

SLIDE 27

Half-space depth in R^d

The depth of a point p is the minimum number of points of a given set S lying in any closed half-space bounded by a hyperplane through p.

SLIDE 28

Collaboration – half-space depth

The depth of a single point can be computed in O(n log n) [Rousseeuw & Ruts 1996]. The lower bound is Ω(n log n) [Aloupis, Cortes, Gomez, Soss, Toussaint 02]

Computing the 2D Tukey median can be done in O(n log^5 n) [Matousek 1991], and was improved to O(n log^3 n) [Langerman, Steiger 03]

Computing all 2D depth contours can be done in O(n^2) time using duality & topological sweep [Miller, Ramaswami, Rousseeuw, Sellares, Souvaine, Streinu, Struyf 01]

Another approach for computing depth contours uses parallel arrangement construction [Fukuda & Rosta 02]

Half-space depth contours can be computed for display in 2D using hardware assisted computation [Krishnan, Mustafa, Venkatasubramanian 02]

SLIDE 29

Depth Contours

SLIDE 30

Depth Contours

Nested contours that enclose regions with increasing depth.

First introduced by Tukey as a data visualization tool for two-dimensional data (half-space depth contours) [Tukey 75]

Provide powerful tools to visualize and compare data sets.

SLIDE 31

The continuous case

Let D_F(x), x∈R^d, be the value of a given depth function for point x with respect to a probability distribution F.

The set {x∈R^d: D_F(x)=t} is the contour of depth t.

(We usually refer to the region enclosed by the contour of depth t, the set R_F(t)={x∈R^d: D_F(x) ≥ t})

The α central region, C_α (0≤α≤1), is the smallest region enclosed by depth contours with probability α.

SLIDE 32

The finite sample case

Let D(x), x∈R^d, be the value of a given depth function for point x with respect to a sample set S.

Rank approach

– The sample α central region is the convex hull of the most central fraction α of the sample points

Coverage approach

– The sample contour of depth t is the boundary of the points in R^d with depth ≥ t
– The sample α central region, two variants:

  • Enclose all points that are of depth at least D(X[⌈αn⌉]), which is the depth of the ⌈αn⌉-th deepest sample point [Donoho & Gasko 92]

  • Take the two adjacent contours D_k and D_{k-1}, where D(X[⌈αn⌉]) = k, and interpolate between the coverage of the two contours, according to the percentage of points inside each, using the deepest point as center [Rousseeuw & Ruts 96]
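The rank approach above can be sketched in a few lines of Python: sort the sample by depth, keep the ⌈αn⌉ deepest points, and take their convex hull. The depth values are taken as given input (any depth function could supply them). The function names, the monotone-chain hull and the arbitrary handling of ties in depth are choices made for this sketch only.

```python
import math


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def rank_central_region(points, depths, alpha):
    """Convex hull of the ceil(alpha * n) deepest sample points.

    Ties in depth are broken by input order (an arbitrary choice).
    """
    k = math.ceil(alpha * len(points))
    order = sorted(range(len(points)), key=lambda i: -depths[i])
    return convex_hull([points[i] for i in order[:k]])


pts = [(0, 0), (4, 0), (4, 4), (0, 4),
       (1, 1), (3, 1), (3, 3), (1, 3), (2, 2)]
depths = [1, 1, 1, 1, 2, 2, 2, 2, 4]  # illustrative depth values
region = rank_central_region(pts, depths, 0.5)
```

With these illustrative depths the 0.5 central region is the inner square, whose vertices are all sample points, as expected for the rank approach.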

SLIDE 33

Cover and Rank

The data set consists of 50 points, drawn from a bivariate normal distribution with mean (0,0) and covariance 4 times identity.

Rank: contours enclose 10, 20, ..., 100% of the data points. The vertices of the contours are the original points of the data. Several contours share vertices.

Cover: 17 contours; not all vertices are original points of the data set.

SLIDE 34

Degenerate contours

Rank approach

– The sample α central region is the convex hull of the most central fraction α of the sample points

Coverage approach

– The sample contour of depth t is the boundary of the points in R^d with depth ≥ t
– The sample α central region, two variants:

  • Enclose all points that are of depth at least D(X[⌈αn⌉]), which is the depth of the ⌈αn⌉-th deepest sample point [Donoho & Gasko 92]

  • Take the two adjacent contours D_k and D_{k-1}, where D(X[⌈αn⌉]) = k, and interpolate between the coverage of the two contours, according to the percentage of points inside each, using the deepest point as center [Rousseeuw & Ruts 96]

SLIDE 35

The cover approach was most frequently studied by computational geometers.

The rank approach may produce contours more efficiently and provides a reasonable, and perhaps less expensive, approximation.

SLIDE 36

Simplicial Depth

SLIDE 37

Simplicial depth [Liu 90]

The simplicial depth of a point x w.r.t. a data set S in R^d is the fraction of the closed simplices defined by points in S that contain x [Liu 90]

Dimension and simplex: R^1 – line segment; R^2 – triangle; R^3 – tetrahedron

[Figure: triangle A, B, C with points x1, x2, x3. SD(x1) = 1, SD(x2) = 1, SD(x3) = 0.]

SLIDE 38

Simplicial Depth [Liu 90]

Total number of simplices = (5 choose 3) = 10

[Figure: data points A, B, C, D, E and a point x; cells and boundaries labelled with depth values, including .3, .35, .4, .5, .6 and .8.]

Fails to satisfy [Zuo & Serfling 00]

– Monotonicity - as a point x moves away from the 'deepest point' along any fixed ray through the center, the depth at x should decrease monotonically.
– Maximality - the depth function should attain maximum value at the center.

Depth of points on facets causes discontinuities in the depth function.

Averaging number of closed and open simplices containing x [Burr, R, Souvaine 03]

SLIDE 39

Revised definition [Burr, R, Souvaine 03]

Given a data set S={X1, …, Xn} in R^d, the simplicial depth of a point x is the average of the fraction of closed simplices containing x and the fraction of open simplices containing x.

Equivalently

SD_BRS(S;x) = (ρ(S,x) + 1/2 σ(S,x)) / (n choose d+1)

ρ(S,x) - the number of simplices with data points as vertices which contain x in their open interior

σ(S,x) - the number of simplices with data points as vertices which contain x in their boundary.
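For d = 2 the two counts can be obtained by brute force over all triangles, as in the Python sketch below. It is an O(n^3) illustration only, far from the O(n log n) algorithms cited in the talk. The function names are hypothetical, coordinates are assumed to be integers so that orientation tests are exact, and the data points are assumed to be in general position (no three collinear). The counts are divided by the total number of triangles so that the result is a fraction, as in the verbal definition.

```python
from fractions import Fraction
from itertools import combinations


def _cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def simplicial_counts(x, points):
    """Return (rho, sigma, total): triangles with x strictly inside,
    triangles with x on the boundary, and the number of triangles."""
    rho = sigma = total = 0
    for a, b, c in combinations(points, 3):
        total += 1
        d = (_cross(a, b, x), _cross(b, c, x), _cross(c, a, x))
        if any(v < 0 for v in d) and any(v > 0 for v in d):
            continue  # x is outside this triangle
        if any(v == 0 for v in d):
            sigma += 1  # x on an edge or at a vertex
        else:
            rho += 1    # x in the open interior
    return rho, sigma, total


def simplicial_depths(x, points):
    """Return (closed-simplex depth, revised depth) as exact fractions."""
    rho, sigma, total = simplicial_counts(x, points)
    closed = Fraction(rho + sigma, total)
    revised = Fraction(2 * rho + sigma, 2 * total)  # (rho + sigma/2) / total
    return closed, revised


S = [(0, 0), (4, 0), (0, 4), (1, 1)]
at_data_point = simplicial_depths((1, 1), S)  # x coincides with a data point
on_edge = simplicial_depths((2, 0), S)        # x on an edge of the hull
```

In the first query three of the four triangles have x as a vertex, so the closed definition counts all four while the revised one gives the boundary cases half weight.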

SLIDE 40

Properties of the revised definition

Reduces to the original definition for continuous distributions and for points lying in the interior of cells.

Keeps ranking order of data points.

Corrects irregularity at boundaries of simplices, making the depth of a point on the boundary between two cells the average of the depth of the two cells.

Fixes Zuo & Serfling's counterexamples.

Invariant under dimensions change for R^1, R^2.

Can be calculated using the existing algorithms, with slight modifications.

SLIDE 41

Open problems

Data points are still over counted.

The revised definition still attains neither maximality at the center nor monotonicity relative to the deepest point.

The data points A, B, and C all have depth 587/1120 and the data point D, which is at the unique center of the data set, has depth 355/1120.

SLIDE 42

The dual of Simplicial depth

[Figure: primal points A, B, C and x; dual lines TA, TB, TC and TX. Coordinates shown: (0,1), (2,-2), (-2,-.5), (-1,0).]

SLIDE 43

Simplicial depth

The depth of a point can be computed in 2D in O(n log n) time [Gil, Steiger, Wigderson 92], [Khuller, Mitchell 90], [Rousseeuw, Ruts 96].

This matches the lower bound [Khuller, Mitchell 90], [Aloupis, Cortes, Gomez, Soss, Toussaint 02]

The depth of all n points can be computed in 2D in O(n^2) [Gil, Steiger, Wigderson 92], [Khuller, Mitchell 90]

The simplicial median can be computed in 2D in O(n^4) time [Aloupis, Langerman, Toussaint 01]

In 3D the depth of a point can be computed in O(n^2) time [Gil, Steiger, Wigderson 92], [Cheng & Ouyang 01]

SLIDE 44

Future research

?

SLIDE 45

A software tool for depth based statistical analysis

Easy to use, efficient and expandable interface.

For exploratory statistical research

For exploratory statistical research

SLIDE 46

Additional details: http://www.cs.tufts.edu/research/geometry/