topology and data: topological data analysis and manifold learning



slide-1
SLIDE 1

by Joshua Tan, for Ufora & NYU Capstone, 12/16/2014

topology and data

a library for topological data analysis and manifold learning

slide-2
SLIDE 2

what is…

dimensionality reduction

Given some input space X and a sample set S, dimensionality reduction seeks to find a lower-dimensional manifold M s.t. S ⊂ M ⊂ X.

  • Also known as manifold learning.
slide-3
SLIDE 3

examples

❖ Kernel PCA projects up into the feature space, projects down onto the principal components, and ranks them by eigenvalue

❖ Isomap (i.e. MDS applied to geodesic distances) embeds high-d points in a low-d space while preserving a dissimilarity (distance) matrix

❖ Projection pursuit projects to the most “interesting” components according to some objective function

❖ DBSCAN clusters points by considering not only distances but also a “density-reachability” relation from a cluster
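
To make “density-reachability” concrete, here is a minimal sketch of DBSCAN in plain Python. The function names (`region_query`, `dbscan`) and the parameter names `eps` (distance parameter) and `min_pts` (density parameter) are chosen for illustration and are not taken from the slides. A point is a core point if at least `min_pts` points lie within `eps` of it; a cluster is everything reachable from a core point through a chain of core points.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)          # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:       # not a core point
            labels[i] = -1                 # provisionally noise
            continue
        cluster += 1                       # start a new cluster at core point i
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == -1:            # noise reachable from a core point
                labels[j] = cluster        # becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbors)
    return labels
```

The result depends directly on the pair (`eps`, `min_pts`), which is the parameter sensitivity that Mapper is meant to reduce.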

slide-4
SLIDE 4

Mapper

❖ Like DBSCAN, Mapper is a clustering/dimensionality reduction algorithm based on varying both a distance parameter and a “density” parameter

❖ Unlike DBSCAN, Mapper is designed to be less dependent on the choice of parameters

slide-5
SLIDE 5

example: breast cancer

from Nicolau et al. 2011

slide-6
SLIDE 6

computing Mapper

  1. generate a sample data set as a DataFrame object
  2. compute a 1-d dissimilarity matrix of distances
  3. evaluate the points using a k-nearest-neighbor (knn) filter function
  4. define a covering of the resulting image
  5. use the pre-image of this covering to define a covering of the original data
  6. from the covering, generate a clustering of the data
  7. visualize the result as a graph

For more complicated filter functions, e.g. f : X → R^2, the result is no longer just a graph but a higher-dimensional simplicial complex.
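
The steps above can be sketched in plain Python for a real-valued filter. This is an illustrative sketch under several assumptions, not the Fora implementation described in this talk: points are plain tuples rather than a DataFrame, the distance matrix is stored as a full square list of lists, the filter is the distance to the k-th nearest neighbor, the covering consists of equal-length overlapping intervals, and the clustering is single linkage at a fixed threshold `eps`. All function and parameter names (`mapper`, `knn_filter`, `interval_cover`, `connected_components`, `n_intervals`, `overlap`, `eps`) are hypothetical.

```python
import math
from itertools import combinations

def distance_matrix(points):
    """Step 2: all pairwise Euclidean distances."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]

def knn_filter(dist, k):
    """Step 3: filter value of a point = distance to its k-th nearest neighbor."""
    return [sorted(row)[k] for row in dist]   # sorted(row)[0] is the point itself

def interval_cover(values, n_intervals, overlap):
    """Step 4: cover [min, max] with equal intervals, each padded on both
    sides by a fraction `overlap` of its length."""
    lo, hi = min(values), max(values)
    length = (hi - lo) / n_intervals
    pad = overlap * length
    return [(lo + i * length - pad, lo + (i + 1) * length + pad)
            for i in range(n_intervals)]

def connected_components(indices, dist, eps):
    """Step 6: single-linkage clusters, i.e. components of the graph that
    joins two points whenever their distance is at most eps."""
    remaining = set(indices)
    clusters = []
    while remaining:
        seed = remaining.pop()
        component, stack = {seed}, [seed]
        while stack:
            i = stack.pop()
            near = {j for j in remaining if dist[i][j] <= eps}
            remaining -= near
            component |= near
            stack.extend(near)
        clusters.append(frozenset(component))
    return clusters

def mapper(points, k, n_intervals, overlap, eps):
    """Return (nodes, edges): nodes are clusters of point indices, and an
    edge joins two nodes whenever the clusters share a point."""
    dist = distance_matrix(points)
    f = knn_filter(dist, k)
    nodes = []
    for a, b in interval_cover(f, n_intervals, overlap):
        preimage = [i for i, v in enumerate(f) if a <= v <= b]   # step 5
        nodes.extend(connected_components(preimage, dist, eps))
    edges = [(u, v) for u, v in combinations(range(len(nodes)), 2)
             if nodes[u] & nodes[v]]                             # step 7
    return nodes, edges
```

Because neighboring intervals overlap, a point can land in two clusters, and those shared points are what produce the edges. With a filter into R^2 the cover elements can overlap three at a time, and recording triple intersections as triangles gives the simplicial complex mentioned above.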

slide-7
SLIDE 7
slide-8
SLIDE 8

“connecting” the dots

figures borrowed from Michael Lesnick, on IAS eNews

slide-9
SLIDE 9

persistent homology

❖ Persistent homology is a technique (read: a technical tool) for computing the “shape” of data sets

❖ In some sense, the global counterpart to Mapper

slide-10
SLIDE 10
slide-11
SLIDE 11

computing persistent homology

❖ Take your point cloud S and turn it into a nested sequence of simplicial complexes, a.k.a. a filtration.

❖ Zomorodian and Carlsson (2004) specify a natural algorithm for computing the persistent homology of a filtered d-dimensional simplicial complex K, assuming we evaluate the homology over a field.

❖ This returns a “persistent bar code”: one bar per homology class, running from the filtration value at which the class appears to the value at which it disappears.
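
As an illustration of what such a computation looks like, the sketch below uses the standard column reduction of the boundary matrix over the field Z/2, which is a common matrix formulation of persistence; it is not the Fora library from this talk, and the function name `persistence_bars` and the input format are assumptions. The input is a filtration given as a list of `(value, simplex)` pairs in which every face appears before its cofaces.

```python
from itertools import combinations

def persistence_bars(filtration):
    """filtration: list of (value, simplex), simplex a tuple of vertex ids,
    ordered so that every face precedes its cofaces.
    Returns a list of bars (dimension, birth, death) with Z/2 coefficients."""
    index = {tuple(sorted(s)): i for i, (_, s) in enumerate(filtration)}

    # Column j of the boundary matrix: indices of the codimension-1 faces.
    columns = []
    for _, s in filtration:
        s = tuple(sorted(s))
        faces = combinations(s, len(s) - 1) if len(s) > 1 else []
        columns.append({index[f] for f in faces})

    pivot_owner = {}   # lowest nonzero row -> column that owns that pivot
    paired = set()
    bars = []
    for j, col in enumerate(columns):
        # Add earlier columns (mod 2) until the lowest entry is a new pivot.
        while col and max(col) in pivot_owner:
            col ^= columns[pivot_owner[max(col)]]
        if col:
            i = max(col)               # simplex i created the class killed by j
            pivot_owner[i] = j
            paired.update((i, j))
            dim = len(filtration[i][1]) - 1
            bars.append((dim, filtration[i][0], filtration[j][0]))

    # Simplices that are never paired create classes that live forever.
    for j, (value, s) in enumerate(filtration):
        if j not in paired:
            bars.append((len(s) - 1, value, float("inf")))
    return bars

# A triangle built up edge by edge, then filled in at value 4.
triangle = [
    (0, (0,)), (0, (1,)), (0, (2,)),
    (1, (0, 1)), (2, (1, 2)), (3, (0, 2)),
    (4, (0, 1, 2)),
]
bars = persistence_bars(triangle)
```

In the triangle example, two connected components die when the first two edges arrive, one component survives forever, and the loop closed by the third edge at value 3 dies when the triangle is filled in at value 4.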

slide-12
SLIDE 12
slide-13
SLIDE 13

example: natural image statistics

  • Data from Mumford et al.: 4167 images. Randomly sample 5000 3-pixel-by-3-pixel patches from each image, take the ones with highest contrast, and obtain 8,000,000 points in R^9.

  • Normalize w.r.t. mean intensity and project the high-contrast patches (those away from the origin) onto the unit sphere. Obtain a set M of points on S^7.

  • M[k,T] is the subset of M in the upper T percent of density as measured by δ_k (the k-nn distance; a smaller δ_k means a denser neighborhood).
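
The normalization and the density thresholding can be sketched as follows. This is a simplified illustration, not the original study's pipeline: it uses the ordinary Euclidean norm for the projection to the sphere, a brute-force neighbor search, and hypothetical function names (`normalize_patch`, `knn_distance`, `densest_subset`).

```python
import math

def normalize_patch(patch):
    """Subtract the mean intensity, then scale to unit Euclidean norm.
    Assumes the patch is not constant (i.e. it has nonzero contrast)."""
    mean = sum(patch) / len(patch)
    centered = [x - mean for x in patch]
    norm = math.sqrt(sum(x * x for x in centered))
    return [x / norm for x in centered]

def knn_distance(points, k):
    """delta_k for every point: distance to its k-th nearest neighbor."""
    result = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points)
        result.append(dists[k])        # dists[0] is the point itself
    return result

def densest_subset(points, k, T):
    """Sketch of M[k, T]: the T percent of points with the smallest delta_k,
    returned in their original order."""
    delta = knn_distance(points, k)
    keep = max(1, round(len(points) * T / 100))
    order = sorted(range(len(points)), key=lambda i: delta[i])
    return [points[i] for i in sorted(order[:keep])]
```

A mean-centered 9-vector lies in an 8-dimensional subspace of R^9, so after scaling to unit norm it lies on a 7-sphere, which is why the points end up on S^7. Small k makes δ_k a very local density estimate, while large k smooths it out.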

slide-14
SLIDE 14

Ufora

❖ Ufora is a data analytics startup based in NYC

❖ For my project, I implemented both the Mapper algorithm and a persistent homology library in their proprietary language, Fora

❖ https://dev.ufora.com/#/projects/mapper/HEAD/mapper

slide-15
SLIDE 15

future directions

slide-16
SLIDE 16

bibliography

❖ Carlsson, Gunnar. “Topology and data”.

❖ Zomorodian, Afra. “Computing persistent homology”.

❖ Ghrist, Robert. “Barcodes: the persistent homology of data”.

❖ Singh, Gurjeet. “Topological methods for the analysis of high dimensional data sets and 3D object recognition”.

❖ Mullner, Daniel. Python Mapper at danifold.net/mapper

❖ Blum, Avrim. “Thoughts on clustering”.