
Data Mining and Machine Learning: Fundamental Concepts and - PowerPoint PPT Presentation



  1. Data Mining and Machine Learning: Fundamental Concepts and Algorithms (dataminingbook.info)
     Mohammed J. Zaki (1) and Wagner Meira Jr. (2)
     (1) Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
     (2) Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
     Chapter 6: High-dimensional Data
     Zaki & Meira Jr. (RPI and UFMG), Data Mining and Machine Learning, Chapter 6: High-dimensional Data, slide 1 / 21

  2. High-dimensional Space
     Let $D$ be an $n \times d$ data matrix. In data mining the data is typically very high dimensional. Understanding the nature of high-dimensional space, or hyperspace, is very important, especially because it does not behave like the more familiar geometry in two or three dimensions.
     Hyper-rectangle: The data space is a $d$-dimensional hyper-rectangle
     $$R_d = \prod_{j=1}^{d} \bigl[\min(X_j), \max(X_j)\bigr]$$
     where $\min(X_j)$ and $\max(X_j)$ specify the range of $X_j$.
     Hypercube: Assume the data is centered, and let $m$ denote the largest absolute attribute value
     $$m = \max_{j=1}^{d} \max_{i=1}^{n} \bigl\{ |x_{ij}| \bigr\}$$
     The data hyperspace can be represented as a hypercube, centered at $\mathbf{0}$, with all sides of length $l = 2m$, given as
     $$H_d(l) = \bigl\{ \mathbf{x} = (x_1, x_2, \ldots, x_d)^T \mid \forall i,\ x_i \in [-l/2, l/2] \bigr\}$$
     The unit hypercube has all sides of length $l = 1$, and is denoted as $H_d(1)$.
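The hyper-rectangle and hypercube definitions above can be sketched in a few lines of plain Python. The data matrix `D` below is made-up toy data, and the helper names (`center`, `hyper_rectangle`, `hypercube_side`, `in_hypercube`) are hypothetical, chosen only for illustration.

```python
# Toy data matrix (hypothetical values): n = 4 points, d = 3 attributes.
D = [
    [5.0, 3.5, 1.5],
    [6.0, 2.5, 4.5],
    [7.0, 3.0, 6.0],
    [6.0, 3.0, 4.0],
]

def center(data):
    """Subtract the per-attribute mean so that the data has mean 0."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in data]

def hyper_rectangle(data):
    """Return the (min(X_j), max(X_j)) range of every attribute X_j."""
    d = len(data[0])
    return [(min(row[j] for row in data), max(row[j] for row in data))
            for j in range(d)]

def hypercube_side(centered):
    """Return l = 2m, where m is the largest absolute attribute value."""
    m = max(abs(x) for row in centered for x in row)
    return 2 * m

def in_hypercube(x, l):
    """Check whether every coordinate of x lies in [-l/2, l/2]."""
    return all(-l / 2 <= xi <= l / 2 for xi in x)

Z = center(D)            # centered copy of the toy data
ranges = hyper_rectangle(Z)
l = hypercube_side(Z)
print(ranges, l, all(in_hypercube(x, l) for x in Z))
```

By construction, every centered point falls inside the hypercube $H_d(l)$, since $l/2 = m$ bounds the absolute value of every coordinate.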

  3. Hypersphere
     Assume that the data has been centered, so that $\boldsymbol{\mu} = \mathbf{0}$. Let $r$ denote the largest magnitude among all points:
     $$r = \max_{i} \bigl\{ \|\mathbf{x}_i\| \bigr\}$$
     The data hyperspace can be represented as a $d$-dimensional hyperball centered at $\mathbf{0}$ with radius $r$, defined as
     $$B_d(r) = \bigl\{ \mathbf{x} \mid \|\mathbf{x}\| \le r \bigr\} \quad \text{or} \quad B_d(r) = \Bigl\{ \mathbf{x} = (x_1, x_2, \ldots, x_d) \Bigm| \sum_{j=1}^{d} x_j^2 \le r^2 \Bigr\}$$
     The surface of the hyperball is called a hypersphere, and it consists of all the points exactly at distance $r$ from the center of the hyperball:
     $$S_d(r) = \bigl\{ \mathbf{x} \mid \|\mathbf{x}\| = r \bigr\} \quad \text{or} \quad S_d(r) = \Bigl\{ \mathbf{x} = (x_1, x_2, \ldots, x_d) \Bigm| \sum_{j=1}^{d} (x_j)^2 = r^2 \Bigr\}$$
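As a minimal sketch of these definitions, the snippet below computes the radius $r$ for a small set of already-centered points and tests membership in the hyperball and on the hypersphere. The points are made-up toy values, and the function names and the floating-point tolerance are assumptions for illustration.

```python
import math

# Already-centered toy points (hypothetical values), d = 3.
Z = [
    [-1.0, 0.5, -2.5],
    [0.0, -0.5, 0.5],
    [1.0, 0.0, 2.0],
    [0.0, 0.0, 0.0],
]

def norm(x):
    """Euclidean norm ||x||."""
    return math.sqrt(sum(xj * xj for xj in x))

def radius(points):
    """r = largest magnitude among all points."""
    return max(norm(x) for x in points)

def in_hyperball(x, r):
    """x is in B_d(r) when sum_j x_j^2 <= r^2."""
    return sum(xj * xj for xj in x) <= r * r + 1e-12

def on_hypersphere(x, r, tol=1e-9):
    """x is on S_d(r) when ||x|| equals r (up to a float tolerance)."""
    return abs(norm(x) - r) <= tol

r = radius(Z)
print(r, [in_hyperball(x, r) for x in Z], [on_hypersphere(x, r) for x in Z])
```

Only the point with the largest magnitude lies on the hypersphere; all the others lie strictly inside the hyperball.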

  4. Iris Data Hyperspace: Hypercube and Hypersphere
     [Figure: scatter plot of the centered Iris data, $X_1$: sepal length versus $X_2$: sepal width, enclosed by the hypercube with $l = 4.12$ and the hypersphere with $r = 2.19$.]

  5. High-dimensional Volumes
     Hypercube: The volume of a hypercube with edge length $l$ is given as
     $$\mathrm{vol}(H_d(l)) = l^d$$
     Hypersphere: The volume of a hyperball and its corresponding hypersphere is identical. The volume of a hypersphere is given as
     In 1D: $\mathrm{vol}(S_1(r)) = 2r$
     In 2D: $\mathrm{vol}(S_2(r)) = \pi r^2$
     In 3D: $\mathrm{vol}(S_3(r)) = \frac{4}{3} \pi r^3$
     In $d$ dimensions:
     $$\mathrm{vol}(S_d(r)) = K_d \, r^d = \left( \frac{\pi^{d/2}}{\Gamma\left(\frac{d}{2} + 1\right)} \right) r^d$$
     where
     $$\Gamma\left(\frac{d}{2} + 1\right) = \begin{cases} \left(\frac{d}{2}\right)! & \text{if } d \text{ is even} \\ \sqrt{\pi} \left( \frac{d!!}{2^{(d+1)/2}} \right) & \text{if } d \text{ is odd} \end{cases}$$
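The volume formulas above translate directly into code. The sketch below evaluates $\Gamma(d/2 + 1)$ with the even/odd closed forms and compares the result with Python's built-in `math.gamma`; the function names `gamma_half`, `hypersphere_volume`, and `hypercube_volume` are hypothetical.

```python
import math

def gamma_half(d):
    """Gamma(d/2 + 1) via the even/odd closed forms."""
    if d % 2 == 0:
        return float(math.factorial(d // 2))
    # Double factorial d!! = d * (d - 2) * ... * 1 for odd d.
    double_fact = 1
    for k in range(d, 0, -2):
        double_fact *= k
    return math.sqrt(math.pi) * double_fact / 2 ** ((d + 1) // 2)

def hypersphere_volume(d, r=1.0):
    """vol(S_d(r)) = K_d * r^d with K_d = pi^(d/2) / Gamma(d/2 + 1)."""
    return math.pi ** (d / 2) / gamma_half(d) * r ** d

def hypercube_volume(d, l=1.0):
    """vol(H_d(l)) = l^d."""
    return l ** d

for d in (1, 2, 3):
    print(d, hypersphere_volume(d, r=2.0), hypercube_volume(d, l=4.0))
```

For $d = 1, 2, 3$ the general formula reduces to the familiar $2r$, $\pi r^2$, and $\frac{4}{3}\pi r^3$.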

  6. Volume of Unit Hypersphere
     With increasing dimensionality the hypersphere volume first increases up to a point, then starts to decrease, and ultimately vanishes. In particular, for the unit hypersphere with $r = 1$,
     $$\lim_{d \to \infty} \mathrm{vol}(S_d(1)) = \lim_{d \to \infty} \frac{\pi^{d/2}}{\Gamma\left(\frac{d}{2} + 1\right)} \to 0$$
     [Figure: plot of $\mathrm{vol}(S_d(1))$ versus $d$, for $d$ from 0 to 50.]
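The rise and fall of the unit hypersphere volume can be checked numerically. This is a small sketch using `math.gamma` from the standard library; the range $d = 1, \ldots, 50$ mirrors the plot, and the names `unit_hypersphere_volume`, `volumes`, and `peak_d` are hypothetical.

```python
import math

def unit_hypersphere_volume(d):
    """vol(S_d(1)) = pi^(d/2) / Gamma(d/2 + 1)."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

# Volume of the unit hypersphere for d = 1, ..., 50.
volumes = {d: unit_hypersphere_volume(d) for d in range(1, 51)}

# Dimension at which the volume is largest.
peak_d = max(volumes, key=volumes.get)

print(peak_d, volumes[peak_d])
print(volumes[10], volumes[20], volumes[50])
```

Over integer dimensions the volume peaks at $d = 5$ (about 5.26) and then decays rapidly toward zero.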

  7. Hypersphere Inscribed within Hypercube
     Consider the space enclosed within the largest hypersphere that can be accommodated within a hypercube (which represents the dataspace). The ratio of the volume of the hypersphere of radius $r$ to the hypercube with side length $l = 2r$ is given as
     In 2 dimensions:
     $$\frac{\mathrm{vol}(S_2(r))}{\mathrm{vol}(H_2(2r))} = \frac{\pi r^2}{4 r^2} = \frac{\pi}{4} = 78.5\%$$
     In 3 dimensions:
     $$\frac{\mathrm{vol}(S_3(r))}{\mathrm{vol}(H_3(2r))} = \frac{\frac{4}{3} \pi r^3}{8 r^3} = \frac{\pi}{6} = 52.4\%$$
     In $d$ dimensions:
     $$\lim_{d \to \infty} \frac{\mathrm{vol}(S_d(r))}{\mathrm{vol}(H_d(2r))} = \lim_{d \to \infty} \frac{\pi^{d/2}}{2^d \, \Gamma\left(\frac{d}{2} + 1\right)} \to 0$$
     As the dimensionality increases, most of the volume of the hypercube is in the "corners," whereas the center is essentially empty.
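A short numerical sketch of the inscribed-hypersphere ratio follows. Because the radius cancels, the ratio depends only on $d$; the function name `inscribed_ratio` is hypothetical.

```python
import math

def inscribed_ratio(d):
    """vol(S_d(r)) / vol(H_d(2r)) = pi^(d/2) / (2^d * Gamma(d/2 + 1))."""
    return math.pi ** (d / 2) / (2 ** d * math.gamma(d / 2 + 1))

for d in (2, 3, 5, 10, 20):
    # Print the ratio as a percentage of the hypercube volume.
    print(d, f"{100 * inscribed_ratio(d):.6f}%")
```

The first two rows reproduce the 78.5% and 52.4% figures from the slide, and by $d = 10$ the inscribed hypersphere already occupies well under 1% of the hypercube.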

  8. Hypersphere Inscribed inside a Hypercube
     [Figure: a hypersphere of radius $r$ inscribed inside a hypercube, with each axis ranging from $-r$ to $r$.]

  9. Conceptual View of High-dimensional Space
     Two, three, four, and higher dimensions: essentially all the volume of the hyperspace is in the corners, with the center being essentially empty. High-dimensional space looks like a rolled-up porcupine!
     [Figure panels: (a) 2D, (b) 3D, (c) 4D, (d) $d$D.]

  10. Volume of a Thin Shell
     The volume of a thin hypershell of width $\epsilon$ is given as
     $$\mathrm{vol}(S_d(r, \epsilon)) = \mathrm{vol}(S_d(r)) - \mathrm{vol}(S_d(r - \epsilon)) = K_d \, r^d - K_d (r - \epsilon)^d$$
     The ratio of the volume of the thin shell to the volume of the outer sphere is
     $$\frac{\mathrm{vol}(S_d(r, \epsilon))}{\mathrm{vol}(S_d(r))} = \frac{K_d \, r^d - K_d (r - \epsilon)^d}{K_d \, r^d} = 1 - \left(1 - \frac{\epsilon}{r}\right)^d$$
     As $d$ increases, we have
     $$\lim_{d \to \infty} \frac{\mathrm{vol}(S_d(r, \epsilon))}{\mathrm{vol}(S_d(r))} = \lim_{d \to \infty} 1 - \left(1 - \frac{\epsilon}{r}\right)^d \to 1$$
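To make the thin-shell limit concrete, the sketch below evaluates the ratio $1 - (1 - \epsilon/r)^d$ for a few dimensions. The values $r = 1$ and $\epsilon = 0.01$ are arbitrary choices for illustration, and `shell_fraction` is a hypothetical name.

```python
def shell_fraction(d, r=1.0, eps=0.01):
    """Fraction of the hypersphere volume lying in the outer shell of width eps."""
    return 1.0 - (1.0 - eps / r) ** d

for d in (2, 3, 10, 100, 1000):
    print(d, shell_fraction(d))
```

With a shell that is only 1% of the radius wide, the shell holds about 2% of the volume in 2D but more than 99.99% of it at $d = 1000$.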

  11. Diagonals in Hyperspace
     Consider a $d$-dimensional hypercube, with origin $\mathbf{0}_d = (0_1, 0_2, \ldots, 0_d)$, and bounded in each dimension in the range $[-1, 1]$. Each "corner" of the hyperspace is a $d$-dimensional vector of the form $(\pm 1_1, \pm 1_2, \ldots, \pm 1_d)^T$.
     Let $\mathbf{e}_i = (0_1, \ldots, 1_i, \ldots, 0_d)^T$ denote the $d$-dimensional canonical unit vector in dimension $i$, and let $\mathbf{1}$ denote the $d$-dimensional diagonal vector $(1_1, 1_2, \ldots, 1_d)^T$.
     Consider the angle $\theta_d$ between the diagonal vector $\mathbf{1}$ and the first axis $\mathbf{e}_1$, in $d$ dimensions:
     $$\cos \theta_d = \frac{\mathbf{e}_1^T \mathbf{1}}{\|\mathbf{e}_1\| \, \|\mathbf{1}\|} = \frac{\mathbf{e}_1^T \mathbf{1}}{\sqrt{\mathbf{e}_1^T \mathbf{e}_1} \, \sqrt{\mathbf{1}^T \mathbf{1}}} = \frac{1}{\sqrt{1} \, \sqrt{d}} = \frac{1}{\sqrt{d}}$$

  12. Diagonals in Hyperspace
     As $d$ increases, we have
     $$\lim_{d \to \infty} \cos \theta_d = \lim_{d \to \infty} \frac{1}{\sqrt{d}} \to 0$$
     which implies that
     $$\lim_{d \to \infty} \theta_d \to \frac{\pi}{2} = 90^\circ$$
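The angle between the diagonal vector $\mathbf{1}$ and the axis $\mathbf{e}_1$ can be computed directly from the dot-product definition. The following is a minimal sketch; `diagonal_axis_angle_deg` is a hypothetical name, and the vectors are built explicitly rather than using the closed form $1/\sqrt{d}$, so that the two can be compared.

```python
import math

def diagonal_axis_angle_deg(d):
    """Angle in degrees between the diagonal 1 = (1, ..., 1) and the axis e_1."""
    ones = [1.0] * d
    e1 = [1.0] + [0.0] * (d - 1)
    dot = sum(a * b for a, b in zip(e1, ones))
    norm_e1 = math.sqrt(sum(a * a for a in e1))
    norm_ones = math.sqrt(sum(b * b for b in ones))
    return math.degrees(math.acos(dot / (norm_e1 * norm_ones)))

for d in (2, 3, 10, 100, 10000):
    # The second value uses the closed form cos(theta_d) = 1 / sqrt(d).
    print(d, diagonal_axis_angle_deg(d), math.degrees(math.acos(1 / math.sqrt(d))))
```

The angle is 45 degrees in 2D, about 54.7 degrees in 3D, and approaches 90 degrees as $d$ grows.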

  13. Angle between Diagonal Vector $\mathbf{1}$ and $\mathbf{e}_1$
     [Figure panels: (a) In 2D, (b) In 3D, each showing the angle $\theta$ between the diagonal vector $\mathbf{1}$ and the axis $\mathbf{e}_1$.]
     In high dimensions all of the diagonal vectors are nearly perpendicular (or orthogonal) to all the coordinate axes! Each of the $2^{d-1}$ new axes connecting pairs of the $2^d$ corners is essentially orthogonal to all of the $d$ principal coordinate axes! Thus, in effect, high-dimensional space has an exponential number of orthogonal "axes."
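The count of $2^{d-1}$ diagonals can be verified by brute-force enumeration for small $d$. The sketch below keeps one corner from each antipodal pair $\{\mathbf{c}, -\mathbf{c}\}$ by fixing the first coordinate to $+1$ (an implementation choice made here, not something taken from the slides); the function names are hypothetical.

```python
import itertools
import math

def diagonals(d):
    """One representative corner per antipodal pair of corners of [-1, 1]^d."""
    # Fixing the first coordinate to +1 selects exactly one corner of each pair {c, -c}.
    return [(1,) + rest for rest in itertools.product((1, -1), repeat=d - 1)]

def cos_with_axis(diag, i):
    """Cosine of the angle between a diagonal and the unit axis e_i."""
    return diag[i] / math.sqrt(sum(c * c for c in diag))

d = 4
diags = diagonals(d)
print(len(diags), 2 ** (d - 1))
# Every diagonal makes the same absolute cosine, 1/sqrt(d), with every axis.
print({round(abs(cos_with_axis(v, i)), 12) for v in diags for i in range(d)})
```

Enumeration is only feasible for small $d$, since the number of diagonals doubles with every added dimension.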

  14. Density of the Multivariate Normal
     Consider the standard multivariate normal distribution with $\boldsymbol{\mu} = \mathbf{0}$ and $\boldsymbol{\Sigma} = \mathbf{I}$:
     $$f(\mathbf{x}) = \frac{1}{(\sqrt{2\pi})^d} \exp\left\{ -\frac{\mathbf{x}^T \mathbf{x}}{2} \right\}$$
     The peak of the density is at the mean. Consider the set of points $\mathbf{x}$ with density at least an $\alpha$ fraction of the density at the mean:
     $$\frac{f(\mathbf{x})}{f(\mathbf{0})} \ge \alpha$$
     $$\exp\left\{ -\frac{\mathbf{x}^T \mathbf{x}}{2} \right\} \ge \alpha$$
     $$\mathbf{x}^T \mathbf{x} \le -2 \ln(\alpha)$$
     $$\sum_{i=1}^{d} (x_i)^2 \le -2 \ln(\alpha)$$
     The sum of squares of $d$ IID standard normal random variables follows a chi-squared distribution $\chi^2_d$ with $d$ degrees of freedom. Thus,
     $$P\left( \frac{f(\mathbf{x})}{f(\mathbf{0})} \ge \alpha \right) = F_{\chi^2_d}(-2 \ln(\alpha))$$
     where $F_{\chi^2_d}$ is the CDF of the $\chi^2_d$ distribution.
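The probability $F_{\chi^2_d}(-2\ln\alpha)$ can be evaluated with the standard library alone. The sketch below is one possible approach, not the authors' method: it starts from the closed-form CDFs for $d = 1$ and $d = 2$ and applies the standard recurrence $F_{k+2}(x) = F_k(x) - (x/2)^{k/2} e^{-x/2} / \Gamma(k/2 + 1)$. The function names are hypothetical; a library routine such as SciPy's `scipy.stats.chi2.cdf` would be the usual choice in practice.

```python
import math

def chi2_cdf(x, d):
    """CDF of the chi-squared distribution with integer d >= 1 degrees of freedom."""
    if x <= 0:
        return 0.0
    z = x / 2.0
    if d % 2 == 1:
        k, cdf = 1, math.erf(math.sqrt(z))   # closed form for d = 1
    else:
        k, cdf = 2, 1.0 - math.exp(-z)       # closed form for d = 2
    while k < d:
        # F_{k+2}(x) = F_k(x) - z^(k/2) * exp(-z) / Gamma(k/2 + 1)
        cdf -= math.exp((k / 2) * math.log(z) - z - math.lgamma(k / 2 + 1))
        k += 2
    return max(cdf, 0.0)  # guard against tiny negative round-off

def prob_density_at_least(alpha, d):
    """P(f(x) / f(0) >= alpha) for the standard d-dimensional normal."""
    return chi2_cdf(-2.0 * math.log(alpha), d)

for d in (1, 2, 3, 5, 10):
    p = prob_density_at_least(0.5, d)
    print(d, round(p, 4), "tail:", round(1 - p, 4))
```

For a fixed $\alpha$, the probability of landing in the high-density region around the mean shrinks quickly as $d$ grows, so almost all of the probability mass moves to the tails.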

  15. Density Contour for $\alpha$ Fraction of the Density at the Mean: One Dimension
     Let $\alpha = 0.5$, then $-2 \ln(0.5) = 1.386$ and $F_{\chi^2_1}(1.386) = 0.76$. Thus, 24% of the density is in the tail regions.
     [Figure: one-dimensional standard normal density for $x$ from $-4$ to $4$, with the $\alpha = 0.5$ threshold marked.]

  16. Density Contour for $\alpha$ Fraction of the Density at the Mean: Two Dimensions
     Let $\alpha = 0.5$, then $-2 \ln(0.5) = 1.386$ and $F_{\chi^2_2}(1.386) = 0.50$. Thus, 50% of the density is in the tail regions.
     [Figure: two-dimensional standard normal density $f(\mathbf{x})$ over $X_1$ and $X_2$, each from $-4$ to $4$, with the $\alpha = 0.5$ contour marked.]
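The one- and two-dimensional tail fractions can also be checked by simulation. The sketch below draws standard normal vectors and counts how many fall outside the region $\mathbf{x}^T\mathbf{x} \le -2\ln(\alpha)$. The sample size and seed are arbitrary assumptions, `tail_fraction_mc` is a hypothetical name, and the estimates carry Monte Carlo error of roughly half a percentage point.

```python
import math
import random

def tail_fraction_mc(d, alpha=0.5, n_samples=20000, seed=0):
    """Monte Carlo estimate of P(f(x) / f(0) < alpha) for the standard d-dim normal."""
    rng = random.Random(seed)
    threshold = -2.0 * math.log(alpha)
    inside = 0
    for _ in range(n_samples):
        # Squared norm of a d-dimensional standard normal sample.
        sq_norm = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d))
        if sq_norm <= threshold:
            inside += 1
    return 1.0 - inside / n_samples

for d in (1, 2, 5, 10):
    print(d, tail_fraction_mc(d))
```

The estimates for $d = 1$ and $d = 2$ should land close to the 24% and 50% figures above, and the tail fraction keeps growing for larger $d$.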
