Data Mining and Machine Learning: Fundamental Concepts and - - PowerPoint PPT Presentation

SLIDE 1

Data Mining and Machine Learning: Fundamental Concepts and Algorithms

dataminingbook.info

Mohammed J. Zaki (1), Wagner Meira Jr. (2)

(1) Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA

(2) Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chapter 6: High-dimensional Data

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 6: High-dimensional Data 1 / 21

SLIDE 2

High-dimensional Space

Let D be an n × d data matrix. In data mining the data is typically very high-dimensional. Understanding the nature of high-dimensional space, or hyperspace, is very important, especially because it does not behave like the more familiar geometry in two or three dimensions.

Hyper-rectangle: The data space is a d-dimensional hyper-rectangle

$$R_d = \prod_{j=1}^{d} \big[\min(X_j), \max(X_j)\big]$$

where $\min(X_j)$ and $\max(X_j)$ specify the range of $X_j$.

Hypercube: Assume the data is centered, and let m denote the maximum attribute value

$$m = \max_{j=1}^{d} \max_{i=1}^{n} \big\{ |x_{ij}| \big\}$$

The data hyperspace can be represented as a hypercube, centered at 0, with all sides of length $l = 2m$, given as

$$H_d(l) = \big\{ \mathbf{x} = (x_1, x_2, \ldots, x_d)^T \;\big|\; \forall i,\ x_i \in [-l/2,\ l/2] \big\}$$

The unit hypercube has all sides of length $l = 1$, and is denoted as $H_d(1)$.

SLIDE 3

Hypersphere

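Assume that the data has been centered, so that $\boldsymbol{\mu} = \mathbf{0}$. Let r denote the largest magnitude among all points:

$$r = \max_{i} \big\{ \|\mathbf{x}_i\| \big\}$$

The data hyperspace can be represented as a d-dimensional hyperball centered at 0 with radius r, defined as

$$B_d(r) = \big\{ \mathbf{x} \;\big|\; \|\mathbf{x}\| \le r \big\} \quad\text{or}\quad B_d(r) = \Big\{ \mathbf{x} = (x_1, x_2, \ldots, x_d) \;\Big|\; \sum_{j=1}^{d} x_j^2 \le r^2 \Big\}$$

The surface of the hyperball is called a hypersphere, and it consists of all the points exactly at distance r from the center of the hyperball:

$$S_d(r) = \big\{ \mathbf{x} \;\big|\; \|\mathbf{x}\| = r \big\} \quad\text{or}\quad S_d(r) = \Big\{ \mathbf{x} = (x_1, x_2, \ldots, x_d) \;\Big|\; \sum_{j=1}^{d} x_j^2 = r^2 \Big\}$$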
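The hypercube side length l = 2m and the hypersphere radius r defined above can be computed directly from a centered data matrix. The following is a minimal sketch, assuming the data is held as a list of rows; the helper names (`center`, `hypercube_side`, `hypersphere_radius`) and the synthetic Gaussian data are illustrative assumptions, not part of the slides.

```python
import math
import random


def center(points):
    """Subtract the per-attribute mean so the data has mean 0."""
    n, d = len(points), len(points[0])
    mu = [sum(p[j] for p in points) / n for j in range(d)]
    return [[p[j] - mu[j] for j in range(d)] for p in points]


def hypercube_side(points):
    """l = 2m, where m is the largest absolute attribute value."""
    m = max(abs(v) for p in points for v in p)
    return 2.0 * m


def hypersphere_radius(points):
    """r = largest Euclidean norm among all points."""
    return max(math.sqrt(sum(v * v for v in p)) for p in points)


# Synthetic stand-in data (assumption): 150 points in 4 dimensions.
random.seed(0)
data = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(150)]
Z = center(data)
l = hypercube_side(Z)
r = hypersphere_radius(Z)
print(f"l = {l:.2f}, r = {r:.2f}")
```

Because the point attaining the largest coordinate magnitude m has norm at least m, the radius always satisfies l/2 ≤ r ≤ √d · l/2.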

SLIDE 4

Iris Data Hyperspace: Hypercube and Hypersphere

l = 4.12 and r = 2.19

[Figure: centered Iris data plotted on X1 (sepal length) versus X2 (sepal width), enclosed by the data hypercube and the hypersphere of radius r.]

SLIDE 5

High-dimensional Volumes

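Hypercube: The volume of a hypercube with edge length l is given as

$$\text{vol}(H_d(l)) = l^d$$

Hypersphere: The volume of a hyperball and its corresponding hypersphere is identical. The volume of a hypersphere is given as

In 1D: $\text{vol}(S_1(r)) = 2r$

In 2D: $\text{vol}(S_2(r)) = \pi r^2$

In 3D: $\text{vol}(S_3(r)) = \frac{4}{3}\pi r^3$

In d dimensions:

$$\text{vol}(S_d(r)) = K_d\, r^d = \left( \frac{\pi^{d/2}}{\Gamma\!\left(\frac{d}{2} + 1\right)} \right) r^d$$

where

$$\Gamma\!\left(\frac{d}{2} + 1\right) = \begin{cases} \left(\frac{d}{2}\right)! & \text{if } d \text{ is even} \\[4pt] \sqrt{\pi}\, \dfrac{d!!}{2^{(d+1)/2}} & \text{if } d \text{ is odd} \end{cases}$$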
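As a quick check of the general formula, the sketch below evaluates it with the standard-library `math.gamma` and compares the even/odd closed form for Γ(d/2 + 1) against the library value. The function names are illustrative assumptions.

```python
import math


def hypersphere_volume(d, r=1.0):
    """vol(S_d(r)) = pi^(d/2) / Gamma(d/2 + 1) * r^d."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r ** d


def double_factorial(n):
    """n!! = n * (n - 2) * (n - 4) * ... down to 1 or 2."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


def gamma_half_plus_one(d):
    """Closed form of Gamma(d/2 + 1), split by the parity of d."""
    if d % 2 == 0:
        return float(math.factorial(d // 2))
    return math.sqrt(math.pi) * double_factorial(d) / 2 ** ((d + 1) // 2)


for d in (1, 2, 3):
    print(d, hypersphere_volume(d, r=2.0))
```

With r = 2 the three printed values correspond to 2r, πr², and (4/3)πr³ respectively.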

SLIDE 6

Volume of Unit Hypersphere

With increasing dimensionality the hypersphere volume first increases up to a point, and then starts to decrease, and ultimately vanishes. In particular, for the unit hypersphere with r = 1,

$$\lim_{d \to \infty} \text{vol}(S_d(1)) = \lim_{d \to \infty} \frac{\pi^{d/2}}{\Gamma\!\left(\frac{d}{2} + 1\right)} \to 0$$

[Figure: volume of the unit hypersphere, vol(S_d(1)), plotted against the dimensionality d for d up to 50.]
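The rise-then-vanish behavior of the unit hypersphere volume can be reproduced numerically. This is a small sketch under the formula above; the range 1 to 50 mirrors the plotted axis, and the variable names are assumptions.

```python
import math


def unit_hypersphere_volume(d):
    """vol(S_d(1)) = pi^(d/2) / Gamma(d/2 + 1)."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)


volumes = {d: unit_hypersphere_volume(d) for d in range(1, 51)}
peak_d = max(volumes, key=volumes.get)  # dimension with the largest volume

print("peak dimension:", peak_d)
print("volume at peak:", volumes[peak_d])
print("volume at d = 50:", volumes[50])
```

Over integer dimensions the maximum is attained at d = 5, where the volume equals 8π²/15; by d = 50 the volume is numerically negligible.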

SLIDE 7

Hypersphere Inscribed within Hypercube

Consider the space enclosed within the largest hypersphere that can be accommodated within a hypercube (which represents the dataspace). The ratio of the volume of the hypersphere of radius r to the hypercube with side length l = 2r is given as

In 2 dimensions:

$$\frac{\text{vol}(S_2(r))}{\text{vol}(H_2(2r))} = \frac{\pi r^2}{4 r^2} = \frac{\pi}{4} = 78.5\%$$

In 3 dimensions:

$$\frac{\text{vol}(S_3(r))}{\text{vol}(H_3(2r))} = \frac{\frac{4}{3}\pi r^3}{8 r^3} = \frac{\pi}{6} = 52.4\%$$

In d dimensions:

$$\lim_{d \to \infty} \frac{\text{vol}(S_d(r))}{\text{vol}(H_d(2r))} = \lim_{d \to \infty} \frac{\pi^{d/2}}{2^d\, \Gamma\!\left(\frac{d}{2} + 1\right)} \to 0$$

As the dimensionality increases, most of the volume of the hypercube is in the “corners,” whereas the center is essentially empty.
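The shrinking share of the hypercube occupied by the inscribed hypersphere can be tabulated with a few lines. This sketch assumes only the ratio formula above (the radius cancels); the chosen dimensions are illustrative.

```python
import math


def inscribed_ratio(d):
    """vol(S_d(r)) / vol(H_d(2r)); independent of r."""
    return math.pi ** (d / 2) / (2 ** d * math.gamma(d / 2 + 1))


for d in (2, 3, 5, 10, 20):
    print(f"d = {d:2d}: ratio = {inscribed_ratio(d):.6%}")
```

The first two rows reproduce π/4 and π/6; the remaining volume, 1 minus the ratio, is what lies in the "corners."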

SLIDE 8

Hypersphere Inscribed inside a Hypercube

[Figure: a hypersphere of radius r inscribed inside a hypercube that spans [−r, r] along each axis.]

SLIDE 9

Conceptual View of High-dimensional Space

Two, three, four, and higher dimensions

All the volume of the hyperspace is in the corners, with the center being essentially empty. High-dimensional space looks like a rolled-up porcupine!

[Figure panels: (a) 2D, (b) 3D, (c) 4D, (d) dD]

SLIDE 10

Volume of a Thin Shell

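The volume of a thin hypershell of width $\epsilon$ is given as

$$\text{vol}(S_d(r, \epsilon)) = \text{vol}(S_d(r)) - \text{vol}(S_d(r - \epsilon)) = K_d\, r^d - K_d\, (r - \epsilon)^d$$

The ratio of the volume of the thin shell to the volume of the outer sphere:

$$\frac{\text{vol}(S_d(r, \epsilon))}{\text{vol}(S_d(r))} = \frac{K_d\, r^d - K_d\, (r - \epsilon)^d}{K_d\, r^d} = 1 - \left(1 - \frac{\epsilon}{r}\right)^d$$

As d increases, we have

$$\lim_{d \to \infty} \frac{\text{vol}(S_d(r, \epsilon))}{\text{vol}(S_d(r))} = \lim_{d \to \infty} 1 - \left(1 - \frac{\epsilon}{r}\right)^d \to 1$$

[Figure: a hypershell of width ε between the inner radius r − ε and the outer radius r.]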
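To get a feel for how quickly the shell dominates, the ratio can be evaluated for a fixed, small width. The sketch below assumes a shell width of 1% of the radius, a value chosen purely for illustration.

```python
def shell_fraction(d, eps, r=1.0):
    """Fraction of vol(S_d(r)) lying in the outer shell of width eps."""
    return 1.0 - (1.0 - eps / r) ** d


# Shell width of 1% of the radius (an illustrative assumption).
for d in (2, 10, 100, 500, 1000):
    print(f"d = {d:4d}: shell fraction = {shell_fraction(d, eps=0.01):.4f}")
```

Even with such a thin shell, the fraction approaches 1 as d grows, since (1 − ε/r)^d decays geometrically in d.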

SLIDE 11

Diagonals in Hyperspace

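Consider a d-dimensional hypercube, with origin $\mathbf{0}_d = (0_1, 0_2, \ldots, 0_d)$, and bounded in each dimension in the range [−1, 1]. Each "corner" of the hyperspace is a d-dimensional vector of the form $(\pm 1_1, \pm 1_2, \ldots, \pm 1_d)^T$.

Let $\mathbf{e}_i = (0_1, \ldots, 1_i, \ldots, 0_d)^T$ denote the d-dimensional canonical unit vector in dimension i, and let $\mathbf{1}$ denote the d-dimensional diagonal vector $(1_1, 1_2, \ldots, 1_d)^T$.

Consider the angle $\theta_d$ between the diagonal vector $\mathbf{1}$ and the first axis $\mathbf{e}_1$, in d dimensions:

$$\cos\theta_d = \frac{\mathbf{e}_1^T \mathbf{1}}{\|\mathbf{e}_1\|\, \|\mathbf{1}\|} = \frac{\mathbf{e}_1^T \mathbf{1}}{\sqrt{\mathbf{e}_1^T \mathbf{e}_1}\, \sqrt{\mathbf{1}^T \mathbf{1}}} = \frac{1}{\sqrt{1}\, \sqrt{d}} = \frac{1}{\sqrt{d}}$$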

SLIDE 12

Diagonals in Hyperspace

As d increases, we have

$$\lim_{d \to \infty} \cos\theta_d = \lim_{d \to \infty} \frac{1}{\sqrt{d}} \to 0$$

which implies that

$$\lim_{d \to \infty} \theta_d \to \frac{\pi}{2} = 90^\circ$$
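The angle between the diagonal vector and the first axis can also be computed from the vectors themselves rather than from the closed form 1/√d. A minimal sketch, with an illustrative function name:

```python
import math


def diagonal_axis_angle_deg(d):
    """Angle (degrees) between the diagonal 1 and the axis e_1 in d dimensions."""
    e1 = [1.0] + [0.0] * (d - 1)
    ones = [1.0] * d
    dot = sum(a * b for a, b in zip(e1, ones))
    norm_e1 = math.sqrt(sum(a * a for a in e1))
    norm_ones = math.sqrt(sum(b * b for b in ones))
    cos_theta = dot / (norm_e1 * norm_ones)
    return math.degrees(math.acos(cos_theta))


for d in (2, 3, 10, 100, 10000):
    print(f"d = {d:5d}: theta = {diagonal_axis_angle_deg(d):.3f} degrees")
```

In 2D the angle is 45 degrees, and it grows monotonically toward 90 degrees as d increases.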

SLIDE 13

Angle between Diagonal Vector 1 and e1

[Figure: the angle θ between the diagonal vector 1 and the axis e1, shown (a) in 2D and (b) in 3D, with each axis spanning [−1, 1].]

In high dimensions all of the diagonal vectors are essentially perpendicular (orthogonal) to all of the coordinate axes! Each of the $2^{d-1}$ new axes connecting pairs of the $2^d$ corners is essentially orthogonal to all of the d principal coordinate axes. Thus, in effect, high-dimensional space has an exponential number of orthogonal "axes."

SLIDE 14

Density of the Multivariate Normal

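Consider the standard multivariate normal distribution with $\boldsymbol{\mu} = \mathbf{0}$ and $\boldsymbol{\Sigma} = \mathbf{I}$:

$$f(\mathbf{x}) = \frac{1}{(\sqrt{2\pi})^d} \exp\left\{ -\frac{\mathbf{x}^T \mathbf{x}}{2} \right\}$$

The peak of the density is at the mean. Consider the set of points x with density at least an α fraction of the density at the mean:

$$\frac{f(\mathbf{x})}{f(\mathbf{0})} \ge \alpha$$

$$\exp\left\{ -\frac{\mathbf{x}^T \mathbf{x}}{2} \right\} \ge \alpha$$

$$\mathbf{x}^T \mathbf{x} \le -2\ln(\alpha)$$

$$\sum_{i=1}^{d} (x_i)^2 \le -2\ln(\alpha)$$

The sum of squares of d IID standard normal random variables follows a chi-squared distribution with d degrees of freedom, $\chi^2_d$. Thus,

$$P\left( \frac{f(\mathbf{x})}{f(\mathbf{0})} \ge \alpha \right) = F_{\chi^2_d}\big(-2\ln(\alpha)\big)$$

where $F_{\chi^2_d}$ is the CDF of the $\chi^2_d$ distribution.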
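The probability above can be evaluated without external libraries. The sketch below is one possible approach, not taken from the slides: it computes the chi-squared CDF through the standard power series for the regularized lower incomplete gamma function, P(d/2, x/2). The function names and the fixed number of series terms are assumptions; the series converges quickly for small x such as −2 ln(0.5), but a library routine would be preferable for large arguments.

```python
import math


def chi2_cdf(x, d, terms=200):
    """CDF of the chi-squared distribution with d degrees of freedom.

    Uses gamma(a, z) = z^a e^(-z) * sum_k z^k / (a (a+1) ... (a+k))
    with a = d/2 and z = x/2, divided by Gamma(a).
    """
    if x <= 0:
        return 0.0
    a, z = d / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    for k in range(1, terms):
        term *= z / (a + k)
        total += term
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))


def prob_density_at_least(alpha, d):
    """P(f(x)/f(0) >= alpha) for the standard d-dimensional normal."""
    return chi2_cdf(-2.0 * math.log(alpha), d)


for d in (1, 2, 3, 10):
    print(f"d = {d:2d}: P = {prob_density_at_least(0.5, d):.5f}")
```

For α = 0.5 this gives probabilities of roughly 0.76, 0.50, 0.29, and 0.00075 for d = 1, 2, 3, and 10.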

SLIDE 15

Density Contour for α Fraction of the Density at the Mean: One Dimension

Let α = 0.5, then $-2\ln(0.5) = 1.386$ and $F_{\chi^2_1}(1.386) = 0.76$. Thus, 24% of the density is in the tail regions.

[Figure: the one-dimensional standard normal density over [−4, 4], with the α = 0.5 density contour marked.]

SLIDE 16

Density Contour for α Fraction of the Density at the Mean: Two Dimensions

Let α = 0.5, then $-2\ln(0.5) = 1.386$ and $F_{\chi^2_2}(1.386) = 0.50$. Thus, 50% of the density is in the tail regions.

[Figure: the two-dimensional standard normal density f(x) over X1 and X2, each in [−4, 4], with the α = 0.5 density contour marked.]

SLIDE 17

Chi-Squared Distribution: P(f (x)/f (0) ≥ α)

For α = 0.5, this probability decreases rapidly with dimensionality. For 2D, it is 0.5. For 3D it is 0.29, i.e., 71% of the density is in the tails. By d = 10, it decreases to 0.075%, that is, 99.925% of the points lie in the extreme or tail regions.

[Figure: two chi-squared density plots of f(x) against x, one annotated F = 0.5 and the other F = 0.29.]

SLIDE 18

Hypersphere Volume: Polar Coordinates in 2D

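[Figure: the point (x1, x2) in the X1, X2 plane, at distance r from the origin and angle θ1 from the X1 axis.]

The point $\mathbf{x} = (x_1, x_2)$ in polar coordinates:

$$x_1 = r\cos\theta_1 = r c_1 \qquad x_2 = r\sin\theta_1 = r s_1$$

where $r = \|\mathbf{x}\|$, and $\cos\theta_1 = c_1$ and $\sin\theta_1 = s_1$.

The Jacobian matrix for this transformation is given as

$$J(\theta_1) = \begin{pmatrix} \dfrac{\partial x_1}{\partial r} & \dfrac{\partial x_1}{\partial \theta_1} \\[6pt] \dfrac{\partial x_2}{\partial r} & \dfrac{\partial x_2}{\partial \theta_1} \end{pmatrix} = \begin{pmatrix} c_1 & -r s_1 \\ s_1 & r c_1 \end{pmatrix}$$

Hypersphere volume is obtained by integration over r and θ1 (with r > 0, and 0 ≤ θ1 ≤ 2π):

$$\text{vol}(S_2(r)) = \int_{r} \int_{\theta_1} \big|\det(J(\theta_1))\big| \, dr\, d\theta_1 = \int_0^r \int_0^{2\pi} r \, dr\, d\theta_1 = \int_0^r r\, dr \int_0^{2\pi} d\theta_1 = \frac{r^2}{2}\bigg|_0^r \cdot\; \theta_1 \bigg|_0^{2\pi} = \pi r^2$$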

SLIDE 19

Hypersphere Volume: Polar Coordinates in 3D

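[Figure: the point (x1, x2, x3) in the X1, X2, X3 space, at distance r from the origin with angles θ1 and θ2.]

The point $\mathbf{x} = (x_1, x_2, x_3)$ in polar coordinates:

$$x_1 = r\cos\theta_1\cos\theta_2 = r c_1 c_2 \qquad x_2 = r\cos\theta_1\sin\theta_2 = r c_1 s_2 \qquad x_3 = r\sin\theta_1 = r s_1$$

The Jacobian matrix is given as

$$J(\theta_1, \theta_2) = \begin{pmatrix} c_1 c_2 & -r s_1 c_2 & -r c_1 s_2 \\ c_1 s_2 & -r s_1 s_2 & r c_1 c_2 \\ s_1 & r c_1 & 0 \end{pmatrix}$$

The volume of the hypersphere for d = 3 is obtained via a triple integral with r > 0, −π/2 ≤ θ1 ≤ π/2, and 0 ≤ θ2 ≤ 2π:

$$\text{vol}(S_3(r)) = \int_{r} \int_{\theta_1} \int_{\theta_2} \big|\det(J(\theta_1, \theta_2))\big| \, dr\, d\theta_1\, d\theta_2 = \frac{4}{3}\pi r^3$$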
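The 3D result can be sanity-checked numerically. The sketch below builds the Jacobian for x1 = r c1 c2, x2 = r c1 s2, x3 = r s1, confirms at a sample point that its determinant equals −r² cos θ1, and approximates the triple integral of |det J| with a midpoint rule. The step count and function names are assumptions, and the integration is only approximate.

```python
import math


def jacobian_3d(r, t1, t2):
    """Jacobian of (x1, x2, x3) with respect to (r, theta1, theta2)."""
    c1, s1 = math.cos(t1), math.sin(t1)
    c2, s2 = math.cos(t2), math.sin(t2)
    return [
        [c1 * c2, -r * s1 * c2, -r * c1 * s2],
        [c1 * s2, -r * s1 * s2, r * c1 * c2],
        [s1, r * c1, 0.0],
    ]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along row 1."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def sphere_volume_numeric(radius, steps=40):
    """Midpoint-rule approximation of the triple integral of |det J|."""
    dr = radius / steps
    dt1 = math.pi / steps        # theta1 ranges over [-pi/2, pi/2]
    dt2 = 2 * math.pi / steps    # theta2 ranges over [0, 2*pi]
    total = 0.0
    for i in range(steps):
        r = (i + 0.5) * dr
        for j in range(steps):
            t1 = -math.pi / 2 + (j + 0.5) * dt1
            for k in range(steps):
                t2 = (k + 0.5) * dt2
                total += abs(det3(jacobian_3d(r, t1, t2)))
    return total * dr * dt1 * dt2


print(sphere_volume_numeric(1.0), 4.0 / 3.0 * math.pi)
```

Since |det J| = r² cos θ1 does not depend on θ2, the integral factorizes into three one-dimensional integrals, which is what the d-dimensional derivation exploits.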

SLIDE 20

Hypersphere Volume in d Dimensions

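The determinant of the d-dimensional Jacobian matrix is

$$\det(J(\theta_1, \theta_2, \ldots, \theta_{d-1})) = (-1)^d\, r^{d-1}\, c_1^{d-2}\, c_2^{d-3} \cdots c_{d-2}$$

The volume of the hypersphere is given by the d-dimensional integral with r > 0, −π/2 ≤ θi ≤ π/2 for all i = 1, ..., d − 2, and 0 ≤ θ_{d−1} ≤ 2π:

$$\text{vol}(S_d(r)) = \int_{r} \int_{\theta_1} \int_{\theta_2} \cdots \int_{\theta_{d-1}} \big|\det(J(\theta_1, \theta_2, \ldots, \theta_{d-1}))\big| \, dr\, d\theta_1\, d\theta_2 \cdots d\theta_{d-1}$$

$$= \int_0^r r^{d-1}\, dr \int_{-\pi/2}^{\pi/2} c_1^{d-2}\, d\theta_1 \cdots \int_{-\pi/2}^{\pi/2} c_{d-2}\, d\theta_{d-2} \int_0^{2\pi} d\theta_{d-1}$$

$$= \frac{r^d}{d} \cdot \frac{\Gamma\!\left(\frac{d-1}{2}\right) \Gamma\!\left(\frac{1}{2}\right)}{\Gamma\!\left(\frac{d}{2}\right)} \cdot \frac{\Gamma\!\left(\frac{d-2}{2}\right) \Gamma\!\left(\frac{1}{2}\right)}{\Gamma\!\left(\frac{d-1}{2}\right)} \cdots \frac{\Gamma(1)\, \Gamma\!\left(\frac{1}{2}\right)}{\Gamma\!\left(\frac{3}{2}\right)} \cdot 2\pi$$

The product telescopes, leaving d − 2 factors of $\Gamma\!\left(\frac{1}{2}\right) = \sqrt{\pi}$ in the numerator and $\Gamma\!\left(\frac{d}{2}\right)$ in the denominator:

$$= \frac{\pi\, \Gamma\!\left(\frac{1}{2}\right)^{d-2}\, r^d}{\frac{d}{2}\, \Gamma\!\left(\frac{d}{2}\right)} = \left( \frac{\pi^{d/2}}{\Gamma\!\left(\frac{d}{2} + 1\right)} \right) r^d$$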
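The factorized integral can be checked against the closed form by multiplying the individual one-dimensional results. This sketch assumes the standard identity that the integral of cos^k θ over [−π/2, π/2] equals Γ((k+1)/2) Γ(1/2) / Γ(k/2 + 1); the function names are illustrative.

```python
import math


def cos_power_integral(k):
    """Integral of cos(theta)^k over [-pi/2, pi/2], in closed form."""
    return math.gamma((k + 1) / 2) * math.gamma(0.5) / math.gamma(k / 2 + 1)


def volume_by_factors(d, r=1.0):
    """Radial integral times the d - 1 angular integrals (for d >= 2)."""
    vol = r ** d / d                      # integral of r^(d-1) over [0, r]
    for k in range(d - 2, 0, -1):         # exponents d-2, d-3, ..., 1
        vol *= cos_power_integral(k)
    return vol * 2 * math.pi              # integral over theta_{d-1}


def volume_closed_form(d, r=1.0):
    """pi^(d/2) / Gamma(d/2 + 1) * r^d."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r ** d


for d in (2, 3, 4, 10):
    print(d, volume_by_factors(d, 1.5), volume_closed_form(d, 1.5))
```

For d = 2 there are no cosine factors and the product reduces to (r²/2) · 2π = πr², matching the 2D derivation.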
