Interface 04 1
Computational geometry and statistical depth measures
Eynat Rafalin
Computer Science Department, Tufts University www.cs.tufts.edu/research/geometry
Joint work with Prof. Diane Souvaine
Interface 04 2
Outline of talk
Data analysis, computational geometry and depth-based statistics
Applications
– A basic technique: the duality transform
– Least Median of Squares (LMS) regression in optimal time
– Half-space depth contours in optimal time
– Depth contours
– Simplicial depth
Future research
Interface 04 3
Deals with problems that require geometric algorithms for
their solutions.
Systematic study of algorithms and data structures for
geometric objects, with a focus on exact algorithms that are asymptotically fast.
At the outset: once exact algorithms have been obtained,
refined, and are still slow, then move to approximation algorithms.
Interface 04 4
Interface 04 5
Data depth: a way of measuring how deep (central) a given point is relative to a data set or distribution
Examples:
– Halfspace (Location, Tukey) depth (Hodges 55, Tukey 75)
– Simplicial depth (Liu 90)
– Convex Hull Peeling depth (Barnett 76, Eddy 82)
– Regression depth (Rousseeuw & Hubert 99)
– Mahalanobis depth (Mahalanobis 36)
– Oja depth (Oja 83)
Interface 04 6
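Data depth: a way of measuring how deep (central) a given point is relative to a data set or distribution
The concept provides a center-outward ordering of the points
Non-parametric, multivariate statistics
Robust
Affine invariance holds for many depth functions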
Interface 04 7
Observations that deviate from the main part of
the data (outliers) can have an undesirable influence on the analysis of the data
A robust depth function yields reasonable
results even if several unannounced outliers occur in the data [Handbook of statistics 15, Rao & Maddala].
For example:
– Depth contours are nested contours that enclose regions with increasing depth
– For half-space depth contours: in the presence of m outliers, contours may be corrupted by the outliers, but the inner set retains its shape [Donoho & Gasko 92].
Interface 04 8
Deepest point
Outliers
Depth contours
Bag-plot (Box-plot) [Rousseeuw, Ruts, Tukey 99]
Scale curve as a measure of scale (dispersion)
Fan plot as a measure of tailedness [Liu, Parelius, Singh 99]
Robustified classification and cluster analysis [Rousseeuw, Ruts 96]
Interface 04 9
Fan plot example: 50 data points, created from a random distribution with covariance matrix 4 times the identity. The fans are created for data sets containing the 1/6, 2/6, ... central regions. For each region, the area of the convex hull (CH) of 2, 4, 6, ...% of the points is computed.
[Plot axes: relative area (CH of p% / CH) against percentile of points]
Interface 04 10
Most depth functions are defined with respect to a
probability distribution F, considering {X1, ..., Xn} random observations from F.
The finite sample version of the depth function is
obtained by replacing F with the empirical distribution of the sample {X1, ..., Xn}.
In general, computational geometers study the
finite sample case!
Interface 04 12
History
– Shamos, Geometry and statistics: problems at the interface, 1976
– Bentley & Shamos, A problem in multivariate statistics: algorithm, data structure and applications, 1977
Interface 04 13
Star spectrum
Data set of the stellar cluster CYGOB1 (Leroy & Rousseeuw 87)
[Plot axis: logarithm of light intensity]
Given a set of points, find a line such that the sum of the squares of the residuals is minimized
Interface 04 14
Star spectrum
Data set of the stellar cluster CYGOB1 (Leroy & Rousseeuw 87)
[Plot axis: logarithm of light intensity]
Given a set of points, find a line such that the median of the squares of the residuals is minimized
Interface 04 15
Ordinary least sum of squares: low breakdown point
Least median of squares: high breakdown point
Given a set of points, find a line such that the median of the squares of the residuals is minimized
Equivalently, find two parallel lines at minimum vertical distance from each other that enclose at least half of the points
Naïve approach: O(n^3)
O(n^2 log n) time algorithm for computing the LMS line in R^2 [Souvaine, Steele 87]
An O(n^2) algorithm using duality and topological sweep [Edelsbrunner, Souvaine 90]
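To make the slab formulation concrete, here is a minimal brute-force sketch in Python. It is not any of the cited algorithms. It assumes the optimal slope is determined by a pair of data points (a known property of the LMS line for points in general position) and takes "median" to mean the h-th smallest squared residual with h = n//2 + 1. Both the function name `lms_line` and that choice of h are assumptions made for illustration. Sorting inside the pair loop makes this roughly O(n^3 log n).

```python
from itertools import combinations

def lms_line(points):
    """Brute-force LMS sketch: returns (slope, intercept, median_squared_residual)."""
    n = len(points)
    h = n // 2 + 1                      # assumed coverage: just over half of the points
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue                    # a vertical line cannot be written as y = ax + b
        slope = (y2 - y1) / (x2 - x1)
        # Intercepts of the lines with this slope through each data point.
        offsets = sorted(y - slope * x for x, y in points)
        # Narrowest slab of this slope (measured vertically) that covers h points.
        for i in range(n - h + 1):
            width = offsets[i + h - 1] - offsets[i]
            if best is None or width < best[0]:
                middle = (offsets[i] + offsets[i + h - 1]) / 2
                best = (width, slope, middle)
    if best is None:
        raise ValueError("need at least two points with distinct x coordinates")
    width, slope, intercept = best
    # The line bisecting the slab has h-th smallest squared residual (width/2)^2.
    return slope, intercept, (width / 2) ** 2
```

For example, seven points on y = 2x + 1 plus three gross outliers still give back the line y = 2x + 1, which illustrates the high breakdown point.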
Interface 04 16
It is hard to find an order in a set of points. An arrangement of lines is easier.
A set of points can be transformed into an arrangement of lines, preserving important properties, using the duality transform T: a point (a,b) → a line y=ax+b
Interface 04 17
Primal → Dual (transform T)
A point (a,b) → a line y=ax+b
A line y=cx+d → a point (-c,d)
Preserves slope, vertical distance and the above/below relationship
Example:
A=(1,2) → TA: y=x+2
B=(2,1) → TB: y=2x+1
C=(3,0) → TC: y=3x
D=(4,-1) → TD: y=4x-1
l: y=-x+3 → T(l)=(1,3)
m: y=-2x+2 → T(m)=(2,2)
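A small Python sketch of the transform, using the example above. The function names are hypothetical, and a line y = mx + k is represented as the pair (m, k), which is an assumption of this sketch. It checks that points on l map to lines through T(l), and that the vertical distance between a point and a line keeps its magnitude in the dual (a point above a line becomes a line above a point).

```python
def dual_of_point(a, b):
    """Point (a, b) -> line y = a*x + b, returned as (slope, intercept)."""
    return (a, b)

def dual_of_line(c, d):
    """Line y = c*x + d -> point (-c, d)."""
    return (-c, d)

def height_above(point, line):
    """Signed vertical distance from the line y = m*x + k up to the point."""
    (x, y), (m, k) = point, line
    return y - (m * x + k)

# The four points of the example all lie on l: y = -x + 3.
example_points = {"A": (1, 2), "B": (2, 1), "C": (3, 0), "D": (4, -1)}
l = (-1, 3)
T_l = dual_of_line(*l)                 # the point (1, 3)

# Incidence is preserved: every dual line passes through T(l).
on_dual = [height_above(T_l, dual_of_point(*p)) == 0 for p in example_points.values()]

# A point 2 units above l becomes a line 2 units above T(l).
p = (0, 5)
primal_gap = height_above(p, l)                      # p relative to l
dual_gap = height_above(T_l, dual_of_point(*p))      # T(l) relative to T(p)
```

Here `primal_gap` and `dual_gap` have the same magnitude and opposite sign, because the roles of point and line are swapped.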
Interface 04 18
[Figure: LMS in the primal (points A, B, C, x, y, z and the LMS line l) and in the dual (lines TA, TB, TC, Tx, Ty, Tz and the point Tl)]
The LMS line bisects a slab bounded by 2 parallel lines, one of which goes through 2 data points and the other goes through 1 data point
(Provable characteristics of LMS)
Interface 04 19
Least Median of Squares (LMS) Regression
– The LMS line can be computed in 2D in O(n^2) time [Edelsbrunner, Souvaine 90]. Earlier result: [Souvaine, Steele 87]
– Practical approximation algorithm [Mount, Netanyahu, Romanik, Silverman, Wu 97], [Mount, Erickson, Har-Peled 04]
Interface 04 21
The half-space depth of a point p is the minimum number of points of a given set S lying in any closed half-plane bounded by a line through p
Question: how to compute the half-space depth contours efficiently? (naive cost per point: O(n^2))
[Figure: data points A, B, C, D, E, F, G and a query point p]
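The definition can be turned into a naive O(n^2) computation for a single point, sketched below in Python. The idea (an assumption of this sketch, not the method of the cited papers) is that the count for a closed half-plane only changes when its boundary line passes through a data point. It is therefore enough to test one direction strictly between each pair of consecutive critical directions. The function name and the floating-point tolerance `eps` are hypothetical.

```python
import math

def halfspace_depth(p, points, eps=1e-12):
    """Minimum number of points in any closed half-plane whose boundary passes through p."""
    px, py = p
    vecs = [(x - px, y - py) for x, y in points]
    # Normal directions for which some data point lies exactly on the boundary line.
    critical = []
    for dx, dy in vecs:
        if dx != 0 or dy != 0:
            angle = math.atan2(dy, dx)
            critical.append((angle + math.pi / 2) % (2 * math.pi))
            critical.append((angle - math.pi / 2) % (2 * math.pi))
    if not critical:
        return len(vecs)                # every data point coincides with p
    critical.sort()
    best = len(vecs)
    for i, start in enumerate(critical):
        end = critical[i + 1] if i + 1 < len(critical) else critical[0] + 2 * math.pi
        if end - start < eps:
            continue                    # repeated critical direction, no arc in between
        mid = (start + end) / 2         # a normal direction strictly inside the arc
        ux, uy = math.cos(mid), math.sin(mid)
        count = sum(1 for dx, dy in vecs if dx * ux + dy * uy >= 0)
        best = min(best, count)
    return best
```

For the four corners of a square plus its center, the center has depth 3, a corner has depth 1, and any point outside the square has depth 0.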
Interface 04 22
The depth of a point p: the minimum number of points of S lying in any closed half-space determined by a line through p
Primal: a line l through p, with k points in the half-plane above the line l
Dual: a point T(l) on the line T(p), with k lines above the point T(l)
To count how many lines lie above a point, look at the level of the point
[Figure: points A, B, C, D, E, F, G, p and line l in the primal; lines TA, TB, TC, TD, TE, TF, TG, T(p) and point T(l) in the dual]
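The primal/dual correspondence on this slide can be checked numerically. The Python sketch below counts the data points on or above a line l: y = cx + d, and the dual lines passing on or above the dual point T(l) = (-c, d). It uses the point (a,b) → line y = ax + b convention from the duality slide. Function names are hypothetical, and integer coordinates are assumed so the comparisons are exact.

```python
def count_points_on_or_above(points, c, d):
    """Primal count: data points (a, b) with b >= c*a + d."""
    return sum(1 for a, b in points if b >= c * a + d)

def count_dual_lines_on_or_above(points, c, d):
    """Dual count: lines y = a*x + b passing on or above the point (-c, d)."""
    x, y = -c, d
    return sum(1 for a, b in points if a * x + b >= y)

data = [(0, 0), (1, 3), (2, 1), (3, 5), (-1, 2)]
c, d = 1, 0                             # l: y = x, which passes through p = (1, 1)
primal_k = count_points_on_or_above(data, c, d)
dual_k = count_dual_lines_on_or_above(data, c, d)
```

Both counts come out equal, so minimizing over lines through p in the primal amounts to examining levels of points along the line T(p) in the dual.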
Interface 04 23
All the half-space depth contours in R^2 can be computed in O(n^2) time using topological sweep
[Miller, Ramaswami, Rousseeuw, Sellares, Souvaine, Streinu, Struyf 01]
[Figure: points A, B, C, D, E, F, G and p in the primal, with the contours of depth 1 and depth 2; lines TA, TB, TC, TD, TE, TF, TG and Tp in the dual]
Interface 04 24
To compute the k-th half-space depth
contour (all points of depth at least k)
The minimum number of points lying in any
closed half-space determined by a line through p → the minimum level of a dual point T(l) on the line T(p) → find the k-th level in the dual
Interface 04 25
Vertical line sweep
– Report all intersection pairs
– Sorted in order of x coordinate
– O(n^2 log n) time and O(n) space
Topological line sweep
– Report all intersection pairs
– According to a partial order
– O(n^2) time and O(n) space
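For reference, the output of a vertical line sweep (all intersection pairs in x order) can be reproduced by the naive Python sketch below. It is not a sweep: it stores all O(n^2) intersections and sorts them, so it matches the O(n^2 log n) time but not the O(n) space. Lines are given as (slope, intercept) pairs with integer coefficients, an assumption that lets `Fraction` keep the x coordinates exact. The function name is hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def intersections_by_x(lines):
    """lines: list of (slope, intercept). Returns [(x, i, j)] sorted by x coordinate."""
    events = []
    for (i, (m1, k1)), (j, (m2, k2)) in combinations(enumerate(lines), 2):
        if m1 != m2:                    # parallel lines never intersect
            # m1*x + k1 = m2*x + k2  =>  x = (k2 - k1) / (m1 - m2)
            events.append((Fraction(k2 - k1, m1 - m2), i, j))
    events.sort()
    return events

events = intersections_by_x([(1, 0), (-1, 0), (0, 1)])
```

A topological sweep reports the same pairs but only in an order consistent with a partial order, which is what removes the log factor.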
Interface 04 26
Duality in 3D (transform T), primal → dual: a point (a,b,c) → a plane z=ax+by+c
Interface 04 27
The depth of a point p is the minimum
Interface 04 28
The depth of a single point can be computed in O(n log n) time [Rousseeuw & Ruts 1996]. The lower bound is Ω(n log n) [Aloupis, Cortes, Gomez, Soss, Toussaint 02]
Computing the 2D Tukey median can be done in O(n log^5 n) time [Matousek 1991], and was improved to O(n log^3 n) [Langerman, Steiger 03]
Computing all 2D depth contours can be done in O(n^2) time using duality and topological sweep [Miller, Ramaswami, Rousseeuw, Sellares, Souvaine, Streinu, Struyf 01]
Another approach for computing depth contours uses parallel arrangement construction [Fukuda & Rosta 02]
Half-space depth contours can be computed for display in 2D using hardware-assisted computation [Krishnan, Mustafa, Venkatasubramanian 02]
Interface 04 30
Depth contours are nested contours that enclose regions with
increasing depth.
First introduced by Tukey as a data visualization
tool for two-dimensional data (half-space depth contours) [Tukey 75]
Provide powerful tools to visualize and compare
data sets.
Interface 04 31
Let D_F(x), x ∈ R^d, be the value of a given depth
function for point x with respect to a probability distribution F.
The set {x ∈ R^d : D_F(x) = t} is the contour of depth t.
(We usually refer to the region enclosed by the contour of depth t, the set R_F(t) = {x ∈ R^d : D_F(x) ≥ t})
The α central region, C_α (0 ≤ α ≤ 1), is the smallest
region enclosed by depth contours with probability α.
Interface 04 32
Let D(x), x ∈ R^d, be the value of a given depth function
for point x with respect to a sample set S.
Rank approach
– The sample α central region is the convex hull of the most central fraction α of the sample points
Coverage approach
– The sample contour of depth t is the boundary of the set of points in R^d with depth ≥ t
– The sample α central region is obtained in one of two ways: take the contour at the depth of the ⌈αn⌉-th deepest sample point [Donoho & Gasko 92], or interpolate between the coverage of the two contours, according to the percentage of points inside each, using the deepest point as center [Rousseeuw & Ruts 96]
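The rank approach is simple enough to sketch directly in Python: sort the sample by depth, keep the ⌈αn⌉ deepest points, and take their convex hull. The sketch below takes the depth values as an input list (computed by any depth function), breaks ties at the cutoff by input order, and uses Andrew's monotone chain for the hull. The function names and the tie-breaking rule are assumptions for illustration.

```python
import math

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def rank_central_region(points, depths, alpha):
    """Convex hull of the ceil(alpha * n) deepest sample points (rank approach)."""
    k = math.ceil(alpha * len(points))
    order = sorted(range(len(points)), key=lambda i: -depths[i])  # deepest first
    return convex_hull([points[i] for i in order[:k]])

sample = [(0, 0), (4, 0), (4, 4), (0, 4), (1, 1), (3, 1), (2, 3), (2, 2)]
sample_depths = [1, 1, 1, 1, 2, 2, 2, 3]        # illustrative depth values, not computed
half_region = rank_central_region(sample, sample_depths, 0.5)
full_region = rank_central_region(sample, sample_depths, 1.0)
```

By construction, every vertex of a region produced this way is an original data point.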
Interface 04 33
Contours enclose 10, 20, ..., 100% of the data points. The vertices of the contours are the original points of the data set.
The data set consists of 50 points, drawn from a bivariate normal distribution with mean (0,0) and covariance 4 times the identity. 17 contours; not all vertices are original points of the data set.
Interface 04 35
The coverage approach has been the one most frequently
studied by computational geometers
The rank approach may produce contours
more efficiently and provides a reasonable, and perhaps less expensive, approximation.
Interface 04 37
The simplicial depth of a point x w.r.t. a data set S
in R^d is the fraction of the closed simplices defined by points in S that contain x [Liu 90]
Simplex by dimension: R^1 segment, R^2 triangle, R^3 tetrahedron
[Figure: triangle ABC with query points x1, x2, x3: SD(x1) = 1, SD(x2) = 1, SD(x3) = 0]
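In the plane the definition reads: count the closed triangles with vertices in S that contain x, and divide by the total number of triangles. A naive O(n^3) Python sketch follows. The function names are hypothetical, and the data points are assumed to be in general position (no three collinear), since a degenerate triangle would need separate handling.

```python
from fractions import Fraction
from itertools import combinations

def orient(a, b, c):
    """Twice the signed area of triangle abc (positive if counter-clockwise)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def in_closed_triangle(x, a, b, c):
    """True if x lies inside or on the boundary of the non-degenerate triangle abc."""
    signs = (orient(a, b, x), orient(b, c, x), orient(c, a, x))
    return not (min(signs) < 0 < max(signs))

def simplicial_depth(x, points):
    """Fraction of closed triangles spanned by the data points that contain x."""
    triangles = list(combinations(points, 3))
    inside = sum(in_closed_triangle(x, *t) for t in triangles)
    return Fraction(inside, len(triangles))

triangle = [(0, 0), (4, 0), (0, 4)]
```

With a single triangle, an interior point and a boundary point both get depth 1 and an exterior point gets depth 0. The interior and boundary cases are not distinguished.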
Interface 04 38
Total number of simplices = C(5,3) = 10
[Figure: data points A, B, C, D, E and a point x, annotated with simplicial depth values ranging from .3 to .8]
The sample simplicial depth fails to satisfy [Zuo & Serfling 00]
– Monotonicity: as a point x moves away from the `deepest point' along any fixed ray through the center, the depth at x should decrease monotonically.
– Maximality: the depth function should attain maximum value at the center
Depth of points on facets causes
discontinuities in the depth function.
Averaging the number of closed and open simplices [Burr, Rafalin, Souvaine 03]
Interface 04 39
Given a data set S = {X1, ..., Xn} in R^d, the simplicial
depth of a point x is the average of the fraction of closed simplices containing x and the fraction of open simplices containing x
Equivalently
SD_BRS(S; x) = (ρ(S,x) + 1/2 σ(S,x)) / C(n, d+1)
ρ(S,x): the number of simplices with data points as vertices
which contain x in their open interior
σ(S,x): the number of simplices with data points as vertices
which contain x in their boundary
C(n, d+1): the total number of simplices with data points as vertices
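A naive 2D Python sketch of the revised definition: each triangle containing x in its open interior counts 1, each triangle with x on its boundary counts 1/2, and the sum is divided by the total number of triangles so that the result is a fraction (matching the "average of fractions" wording). The function names are hypothetical, and general position of the data points (no three collinear) is assumed.

```python
from fractions import Fraction
from itertools import combinations

def orient(a, b, c):
    """Twice the signed area of triangle abc (positive if counter-clockwise)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def revised_simplicial_depth(x, points):
    """(rho + sigma/2) / (number of triangles), for non-degenerate triangles."""
    rho = sigma = total = 0
    for a, b, c in combinations(points, 3):
        total += 1
        signs = (orient(a, b, x), orient(b, c, x), orient(c, a, x))
        if min(signs) > 0 or max(signs) < 0:
            rho += 1                    # x strictly inside the triangle
        elif not (min(signs) < 0 < max(signs)):
            sigma += 1                  # x on an edge or at a vertex
    return Fraction(2 * rho + sigma, 2 * total)

triangle = [(0, 0), (4, 0), (0, 4)]
```

For a single triangle, the three cases now separate: interior points get 1, boundary points get 1/2, exterior points get 0.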
Interface 04 40
Reduces to the original definition for continuous distributions and for points lying in the interior of cells.
Keeps ranking order of data points
Corrects irregularity at boundaries of simplices, making the depth of a point on the boundary between two cells the average of the depth of the two cells.
Fixes Zuo & Serfling’s counterexamples
Invariant under dimension change for R^1, R^2
Can be calculated using the existing algorithms, with slight modifications
Interface 04 41
Data points are still overcounted.
The revised definition still neither attains maximality at the center nor satisfies monotonicity relative to the deepest point.
The data points A, B, and C all have depth 587/1120, and the data point D, which is at the unique center of the data set, has depth 355/1120.
Interface 04 42
[Figure: points A, B, C and x in the primal with dual lines TA, TB, TC, TX; labeled coordinates: (0,1), (2,-2), (-2,-.5), (-1,0)]
Interface 04 43
The simplicial depth of a point can be computed in 2D in O(n log n) time
[Gil, Steiger, Wigderson 92], [Khuller, Mitchell 90], [Rousseeuw, Ruts 96].
This matches the lower bound [Khuller, Mitchell 90],
[Aloupis, Cortes, Gomez, Soss, Toussaint 02]
The depth of all n points can be computed in 2D in
O(n^2) time [Gil, Steiger, Wigderson 92], [Khuller, Mitchell 90]
The simplicial median in 2D can be computed in O(n^4) time [Aloupis,
Langerman, Toussaint 01]
In 3D the depth of a point can be computed in O(n^2)
time [Gil, Steiger, Wigderson 92], [Cheng & Ouyang 01]
Interface 04 45
Easy to use, efficient and expandable
interface.
For exploratory statistical research
Interface 04 46
Additional details: http://www.cs.tufts.edu/research/geometry/