Unsupervised Learning, George Konidaris (gdk@cs.brown.edu), Fall 2019



slide-1
SLIDE 1

Unsupervised Learning

George Konidaris gdk@cs.brown.edu

Fall 2019

slide-2
SLIDE 2

Machine Learning

Subfield of AI concerned with learning from data. Broadly, using:

  • Experience
  • To Improve Performance
  • On Some Task

(Tom Mitchell, 1997)

slide-3
SLIDE 3

Unsupervised Learning

Input: X = {x1, …, xn}

Try to understand the structure of the data.

E.g., how many types of cars? How can they vary?

slide-4
SLIDE 4

Clustering

One particular type of unsupervised learning:

  • Split the data into discrete clusters.
  • Assign new data points to one of the clusters.
  • Clusters can be thought of as types.

Formal definition

Given:

  • Data points X = {x1, …, xn}.

Find:

  • Number of clusters k
  • Assignment function f : X → {1, …, k}
slide-5
SLIDE 5

Clustering

slide-6
SLIDE 6

k-Means

One approach:

  • Pick k
  • Place k points (“means”) in the data
  • Assign new point to ith cluster if nearest to ith “mean”.
slide-7
SLIDE 7

k-Means

slide-8
SLIDE 8

k-Means

Major question:

  • Where to put the “means”?

Very simple algorithm:

  • Place k “means” at random.
  • Assign each point in the data to its nearest “mean”.
  • Move each “mean” to mean of assigned data.

$\{\mu_1, \ldots, \mu_k\}$

$f(x_j) = i$ such that $d(x_j, \mu_i) \le d(x_j, \mu_l) \;\; \forall l \ne i$

$\mu_i = \frac{1}{|C_i|} \sum_{v \in C_i} x_v$
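The loop described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes Euclidean distance for d, starts the means at randomly chosen data points (one common way to "place k means at random"), and stops when assignments no longer change. The names `kmeans`, `iters`, and `seed` are illustrative.

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Hard-assignment k-means on a list of equal-length tuples."""
    rng = random.Random(seed)
    # Assumption: place the k "means" at randomly chosen data points.
    means = rng.sample(points, k)
    assign = None
    for _ in range(iters):
        # f(x_j) = index i of the nearest mean under Euclidean distance.
        new_assign = [min(range(k), key=lambda i: math.dist(p, means[i]))
                      for p in points]
        if new_assign == assign:
            break  # assignments stopped changing
        assign = new_assign
        # Move each mean to the average of the points assigned to it.
        for i in range(k):
            cluster = [p for p, a in zip(points, assign) if a == i]
            if cluster:  # an empty cluster keeps its previous mean
                means[i] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    return means, assign

data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
means, assign = kmeans(data, k=2)
print(means, assign)
```

On this toy data the two tight groups end up in different clusters; which group gets label 0 depends on the random start.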

slide-9
SLIDE 9

k-Means

slide-10
SLIDE 10

k-Means

slide-11
SLIDE 11

k-Means

slide-12
SLIDE 12

k-Means

slide-13
SLIDE 13

k-Means

Remaining questions …

  • How to choose k?
  • What about bad initializations?
  • How to measure distance?

Broadly:

  • Use a quality metric.
  • Loop through k.
  • Random restart initial position.
  • Use distance metric D.
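One possible reading of these four suggestions, sketched in Python. The quality metric used here (within-cluster sum of squared distances) is an assumption, since the slide does not name one, and the helper names `kmeans_once`, `quality`, and `best_of_restarts` are illustrative. The distance metric D is passed in as `dist` and defaults to Euclidean distance.

```python
import math
import random

def kmeans_once(points, k, rng, dist=math.dist, iters=50):
    """One k-means run from a random initial position."""
    means = rng.sample(points, k)

    def nearest(p):
        return min(range(k), key=lambda i: dist(p, means[i]))

    for _ in range(iters):
        assign = [nearest(p) for p in points]
        for i in range(k):
            members = [p for p, a in zip(points, assign) if a == i]
            if members:
                means[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return means, [nearest(p) for p in points]

def quality(points, means, assign, dist=math.dist):
    """Assumed quality metric: within-cluster sum of squares (lower is better)."""
    return sum(dist(p, means[a]) ** 2 for p, a in zip(points, assign))

def best_of_restarts(points, k, restarts=10, seed=0):
    """Random restarts: keep the run with the best quality score."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: quality(points, *run))

data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
# Loop through k and record the best score found for each value.
scores = {k: quality(data, *best_of_restarts(data, k)) for k in (1, 2, 3)}
print(scores)
```

This particular metric tends to keep shrinking as k grows, so in practice one would look for the k after which the improvement levels off, or use a criterion that penalizes complexity.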
slide-14
SLIDE 14

Density Estimation

Clustering can answer “which cluster?”, but not “does this point belong?”

slide-15
SLIDE 15

Density Estimation

Estimate the distribution the data is drawn from. This allows us to evaluate the probability that a new point is drawn from the same distribution as the old data.

Formal definition

Given:

  • Data points X = {x1, …, xn},

Find:

  • PDF P(X)
slide-16
SLIDE 16

GMM

Simple approach:

  • Model the data as a mixture of Gaussians.

Each Gaussian has its own mean and variance. Each has its own weight (sum to 1). Weighted sum of Gaussians still a PDF.

slide-17
SLIDE 17

GMM

slide-18
SLIDE 18

GMM

Algorithm - broadly as before:

  • Place k “means” at random.
  • Set variances to be high.
  • Assign all points to highest probability distribution.
  • Set mean, variance, weights to match assigned data.

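$\{\mu_1, \ldots, \mu_k\}$

$C_i = \{x_v \mid N(x_v \mid \mu_i, \sigma_i^2) > N(x_v \mid \mu_j, \sigma_j^2), \; \forall j \ne i\}$

$\mu_i = \frac{1}{|C_i|} \sum_{v \in C_i} x_v$

$\sigma_i^2 = \mathrm{variance}(C_i)$

$w_i = \frac{|C_i|}{\sum_j |C_j|}$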
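A rough one-dimensional sketch of the hard-assignment procedure on this slide, in Python. It is an illustration under stated assumptions rather than a standard GMM fit (which would normally use soft assignments via EM): means start at random data points, the initial "high" variance is taken to be the squared data range, and a small variance floor `min_var` is added to avoid division by zero. All function and parameter names are illustrative.

```python
import math
import random

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fit_gmm_hard(xs, k, iters=50, seed=0, min_var=1e-2):
    rng = random.Random(seed)
    mus = rng.sample(xs, k)                 # place k means at random
    spread = max(xs) - min(xs)
    variances = [spread ** 2] * k           # set variances to be high
    weights = [1.0 / k] * k
    for _ in range(iters):
        # Assign each point to the Gaussian giving it the highest density.
        clusters = [[] for _ in range(k)]
        for x in xs:
            i = max(range(k), key=lambda j: normal_pdf(x, mus[j], variances[j]))
            clusters[i].append(x)
        # Set mean, variance, and weight to match the assigned data.
        for i, c in enumerate(clusters):
            if c:  # an empty cluster keeps its parameters but gets weight 0
                mus[i] = sum(c) / len(c)
                variances[i] = max(sum((x - mus[i]) ** 2 for x in c) / len(c), min_var)
        weights = [len(c) / len(xs) for c in clusters]
    return mus, variances, weights

def gmm_pdf(x, mus, variances, weights):
    """Weighted sum of Gaussians, which is itself a PDF."""
    return sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, mus, variances))

xs = [0.0, 0.2, -0.1, 0.1, 10.0, 10.3, 9.8, 10.1]
mus, variances, weights = fit_gmm_hard(xs, k=2)
print(mus, variances, weights, gmm_pdf(0.05, mus, variances, weights))
```

Unlike k-means, the fitted model can score a new point: `gmm_pdf` returns a density that is low for points far from all components.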

slide-19
SLIDE 19

GMM

slide-20
SLIDE 20

GMM

slide-21
SLIDE 21

GMM

slide-22
SLIDE 22

GMM

Major issue:

  • How to decide between two GMMs?
  • How to choose k?

General statistical question: model selection. Several good answers for this.

Simple example: Bayesian information criterion (BIC). Trades off model complexity (k) with fit (likelihood):

$\mathrm{BIC} = -2 \log L + k \log n$

where L is the likelihood, k is the # parameters in model, and n is the # data points.
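As a small worked illustration of the BIC formula, the Python sketch below compares a single Gaussian against a two-component mixture on toy one-dimensional data. The data, the hand-made split at x = 5, and the parameter counts (2 for one Gaussian; 5 for two means, two variances, and one free weight) are assumptions made for the example. Lower BIC is better.

```python
import math

def normal_logpdf(x, mu, var):
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mu) ** 2 / (2.0 * var)

def fit_gaussian(xs):
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, var

def bic(log_likelihood, num_params, num_points):
    """BIC = -2 log L + k log n."""
    return -2.0 * log_likelihood + num_params * math.log(num_points)

xs = [0.0, 0.2, -0.1, 0.1, 10.0, 10.3, 9.8, 10.1]
n = len(xs)

# Model A: one Gaussian (2 parameters).
mu, var = fit_gaussian(xs)
loglik_a = sum(normal_logpdf(x, mu, var) for x in xs)
bic_a = bic(loglik_a, 2, n)

# Model B: two Gaussians fit to a hand-made split of the data (5 parameters).
low = [x for x in xs if x < 5.0]
high = [x for x in xs if x >= 5.0]
parts = [(len(c) / n, *fit_gaussian(c)) for c in (low, high)]
loglik_b = sum(
    math.log(sum(w * math.exp(normal_logpdf(x, m, v)) for w, m, v in parts))
    for x in xs
)
bic_b = bic(loglik_b, 5, n)

print(bic_a, bic_b)
```

Here the mixture fits so much better that it wins despite the larger complexity penalty.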

slide-23
SLIDE 23

Nonparametric Density Estimation

Parametric:

  • Define a parametrized model (e.g., a Gaussian)
  • Fit parameters
  • Done!

Key assumptions:

  • Data is distributed according to the parametrized form.
  • We know which parametrized form in advance.

What is the shape of the distribution over images representing flowers?
slide-24
SLIDE 24

Nonparametric Density Estimation

Nonparametric alternative:

  • Avoid fixed parametrized form.
  • Compute density estimate directly from the data.

Kernel density estimator:

$\mathrm{PDF}(x) = \frac{1}{nb} \sum_{i=1}^{n} D\left(\frac{x_i - x}{b}\right)$

where:

  • D is a special kind of distance metric called a kernel.
  • Falls away from zero, integrates to one.
  • b is bandwidth: controls how fast kernel falls away.
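The estimator above translates almost directly into code. The following Python sketch assumes a Gaussian kernel for D and one-dimensional data; the names `gaussian_kernel` and `kde` and the sample values are illustrative.

```python
import math

def gaussian_kernel(u):
    """A kernel D: largest at zero, falls away from zero, integrates to one."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, data, b, kernel=gaussian_kernel):
    """PDF(x) = 1/(n b) * sum_i D((x_i - x) / b)."""
    n = len(data)
    return sum(kernel((xi - x) / b) for xi in data) / (n * b)

data = [0.0, 0.5, 1.0, 4.0]
dense = kde(0.5, data, b=0.5)   # near three of the four data points
sparse = kde(2.5, data, b=0.5)  # in the gap between the groups
print(dense, sparse)
```

Because each kernel integrates to one and the sum is divided by n, the estimate integrates to one as well. Raising `b` spreads each point's contribution further, so the estimate in the gap rises and the peaks flatten.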

slide-25
SLIDE 25

Nonparametric Density Estimation

Kernel:

  • Lots of choices, Gaussian often works in practice.

Bandwidth:

  • High: distant points have higher “contribution” to sum.
  • Low: distant points have lower “contribution” to sum.

$\mathrm{PDF}(x) = \frac{1}{nb} \sum_{i=1}^{n} D\left(\frac{x_i - x}{b}\right)$

slide-26
SLIDE 26

Nonparametric Density Estimation

(wikipedia)

slide-27
SLIDE 27

Nonparametric Density Estimator

slide-28
SLIDE 28

Dimensionality Reduction

X = {x1, …, xn}, each xi has m dimensions: xi = [x1, …, xm]. If m is high, data can be hard to deal with.

  • High-dimensional decision boundary.
  • Need more data.
  • But data is often not really high-dimensional.

Dimensionality reduction:

  • Reduce or compress the data
  • Try not to lose too much!
  • Find intrinsic dimensionality
slide-29
SLIDE 29

Dimensionality Reduction

For example, imagine if x1 and x2 are meaningful features, and x3 … xm are random noise. What happens to k-nearest neighbors? What happens to a decision tree? What happens to the perceptron algorithm? What happens if you want to do clustering?

slide-30
SLIDE 30

Dimensionality Reduction

Often can be phrased as a projection:

$f : X \to X'$

where:

  • $|X'| \ll |X|$
  • our goal: retain as much sample variance as possible.

Variance captures what varies within the data.

slide-31
SLIDE 31

PCA

Principal Components Analysis. Project data into a new space:

  • Dimensions are linearly uncorrelated.
  • We have a measure of importance for each dimension.
slide-32
SLIDE 32

PCA

slide-33
SLIDE 33

PCA

  • Gather data x1, …, xn.
  • Adjust data to be zero-mean: $x_i = x_i - \frac{1}{n} \sum_j x_j$
  • Compute covariance matrix C (m x m).
  • Compute unit eigenvectors Vi and eigenvalues vi of C.

Each Vi is a direction, and each vi is its importance - the amount of the data’s variance it accounts for.

New data points: $\hat{x}_i = [V_1, \ldots, V_p] \, x_i$
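The PCA steps on this slide can be sketched in plain Python as follows. Two choices here are assumptions, not part of the slides: the covariance is normalized by n (n − 1 is also common), and the eigenvectors are found with power iteration plus deflation as a dependency-free stand-in for a library eigensolver. The function name `pca` and the toy data are illustrative.

```python
import math
import random

def pca(rows, p, iters=200, seed=0):
    n, m = len(rows), len(rows[0])
    # Adjust data to be zero-mean.
    mean = [sum(r[d] for r in rows) / n for d in range(m)]
    X = [[r[d] - mean[d] for d in range(m)] for r in rows]
    # Covariance matrix C (m x m), normalized by n (an assumption).
    C = [[sum(x[a] * x[b] for x in X) / n for b in range(m)] for a in range(m)]
    rng = random.Random(seed)
    V, importance = [], []
    for _component in range(p):
        v = [rng.gauss(0.0, 1.0) for _ in range(m)]
        norm = math.sqrt(sum(t * t for t in v))
        v = [t / norm for t in v]
        # Power iteration: repeatedly apply C and renormalize.
        for _step in range(iters):
            w = [sum(C[a][b] * v[b] for b in range(m)) for a in range(m)]
            norm = math.sqrt(sum(t * t for t in w))
            if norm < 1e-12:
                break
            v = [t / norm for t in w]
        lam = sum(v[a] * sum(C[a][b] * v[b] for b in range(m)) for a in range(m))
        V.append(v)
        importance.append(lam)
        # Deflation: remove the direction just found from C.
        for a in range(m):
            for b in range(m):
                C[a][b] -= lam * v[a] * v[b]
    # New data points: x_hat_i = [V_1, ..., V_p] x_i (rows V_j dotted with x_i).
    X_hat = [[sum(vj[d] * x[d] for d in range(m)) for vj in V] for x in X]
    return V, importance, X_hat, mean

rows = [(1.0, 1.0), (2.0, 2.1), (3.0, 2.9), (4.0, 4.2), (5.0, 4.8)]
V, importance, X_hat, mean = pca(rows, p=2)
print(V, importance)
```

For this data, which lies close to the line y = x, nearly all of the variance falls on the first direction. The sign of each eigenvector is arbitrary.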
slide-34
SLIDE 34

PCA

Let’s focus on this equation:

$\hat{x}_i = [V_1, \ldots, V_p] \, x_i$

  • $\hat{x}_i$: compressed data point (p x 1)
  • $[V_1, \ldots, V_p]$: compression matrix (p x m)
  • $x_i$: original data point (m x 1)
slide-35
SLIDE 35

PCA

If you want to recover the original data point:

$V = [V_1, \ldots, V_p]$

$\bar{x}_i = V^{-1} \hat{x}_i$

V is orthonormal:

$\bar{x}_i = V^{T} \hat{x}_i$

so:

$\bar{x}_i = V_1 \hat{x}_i^1 + V_2 \hat{x}_i^2 + \ldots + V_p \hat{x}_i^p$
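A short Python check of the recovery step, using a hand-picked orthonormal V in two dimensions (an assumption for illustration; in PCA the rows of V would be the eigenvectors). With all m directions kept, the point is recovered exactly; with only p = 1 direction kept, the result is the projection of the point onto that direction.

```python
import math

def compress(V, x):
    """x_hat = V x, where the rows of V are the unit directions V_1..V_p."""
    return [sum(vj[d] * x[d] for d in range(len(x))) for vj in V]

def reconstruct(V, x_hat):
    """x_bar = V^T x_hat = V_1 * x_hat^1 + ... + V_p * x_hat^p."""
    m = len(V[0])
    return [sum(vj[d] * c for vj, c in zip(V, x_hat)) for d in range(m)]

s = 1.0 / math.sqrt(2.0)
V_full = [[s, s], [-s, s]]   # orthonormal: p = m = 2
V_top = V_full[:1]           # keep only the first direction: p = 1

x = [3.0, 1.0]
exact = reconstruct(V_full, compress(V_full, x))
approx = reconstruct(V_top, compress(V_top, x))
print(exact, approx)
```

The gap between `x` and `approx` is the information lost by dropping the second direction.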
slide-36
SLIDE 36

PCA

Reconstruction:

$\bar{x}_i = V_1 \hat{x}_i^1 + V_2 \hat{x}_i^2 + \ldots + V_p \hat{x}_i^p$

  • $\hat{x}_i^1, \ldots, \hat{x}_i^p$: real valued numbers
  • $V_1, \ldots, V_p$: orthogonal axes

Every data point is expressed as a point in a new coordinate frame. Equivalently: weighted sum of basis (eigenvector) functions.
slide-37
SLIDE 37

Eigenfaces

[Figure: a face image expressed as a weighted sum (Σ) of eigenface basis images; example weights shown: 0.2, 1.3, 6.1, 0.]

slide-38
SLIDE 38

Eigenfaces

(40 basis functions) (Turk and Pentland, 1991)

slide-39
SLIDE 39

Eigenfaces

(40 basis functions) (Turk and Pentland, 1991)

slide-40
SLIDE 40

PCA for Supervised Learning

Given data x1, …, xn, labels y1, …, yn:

  • Compute compressor matrix V.
  • Compute compressed data $\hat{x}_1, \ldots, \hat{x}_n$.
  • Use compressed data to learn classifier: $f : \hat{X} \to Y$
  • Given a new data point x, run f on Vx.

Why?

  • Low amount of data relative to dimensionality.
  • Dimensions may be highly correlated.
  • Dimensions may be mostly noise/irrelevant/constant.
  • Not all data need be labelled.
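The pipeline on this slide might look like the following Python sketch. Several pieces are assumptions for illustration: V is taken as already computed (a single hand-picked unit direction), the classifier f is a 1-nearest-neighbour rule, and new points are centered with the training mean before compression, a step the slide leaves implicit.

```python
import math

def compress(V, x, mean):
    """Center x with the training mean, then apply the compressor matrix V."""
    centered = [a - b for a, b in zip(x, mean)]
    return [sum(vj[d] * centered[d] for d in range(len(centered))) for vj in V]

def learn_nearest_neighbor(X_hat, labels):
    """Learn f: X_hat -> Y as a 1-nearest-neighbour classifier (an assumption)."""
    def f(z):
        best = min(range(len(X_hat)), key=lambda i: math.dist(z, X_hat[i]))
        return labels[best]
    return f

X = [[1.0, 1.1], [1.2, 0.9], [4.0, 4.1], [3.8, 4.0]]
y = ["a", "a", "b", "b"]
mean = [sum(col) / len(X) for col in zip(*X)]

s = 1.0 / math.sqrt(2.0)
V = [[s, s]]  # hypothetical precomputed compressor: 2 dimensions -> 1

X_hat = [compress(V, x, mean) for x in X]
f = learn_nearest_neighbor(X_hat, y)

# Given a new data point x, run f on its compressed form.
print(f(compress(V, [1.1, 1.0], mean)), f(compress(V, [4.2, 3.9], mean)))
```

Only the compression step needs the unlabelled data, which is why V can be computed from more points than there are labels.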
slide-41
SLIDE 41

ISOMAP

Another approach:

  • Estimate intrinsic geometric dimensionality of data.
  • Recover natural distance metric
slide-42
SLIDE 42

ISOMAP

Core idea: distance metric locally Euclidean

  • Small radius r, connect each point to neighbors
  • Weight based on Euclidean distance
slide-43
SLIDE 43

ISOMAP

Solve all-pairs shortest paths:

  • Transforms local distance to global distance.
  • Compute embedding.
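The first two ISOMAP stages (neighbourhood graph, then all-pairs shortest paths) can be sketched in Python as below. The shortest-path step uses Floyd-Warshall, one standard choice; the final embedding step (typically multidimensional scaling on the resulting distance matrix) is not shown. The function name and the semicircle example are illustrative.

```python
import math

def geodesic_distances(points, r):
    """Approximate global distances by shortest paths through local neighbours."""
    n = len(points)
    D = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    # Connect each point to neighbours within radius r, weighted by
    # Euclidean distance (the metric is trusted only locally).
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            if d <= r:
                D[i][j] = D[j][i] = d
    # Floyd-Warshall: turn local distances into global ones.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if D[i][k] + D[k][j] < D[i][j]:
                    D[i][j] = D[i][k] + D[k][j]
    return D

# Eleven points along a unit semicircle: a 1-D curve embedded in 2-D.
points = [(math.cos(math.pi * t / 10), math.sin(math.pi * t / 10)) for t in range(11)]
geo = geodesic_distances(points, r=0.4)
straight = math.dist(points[0], points[10])
print(geo[0][10], straight)
```

The graph distance between the two endpoints follows the curve and comes out close to the arc length π, while the straight-line distance cuts across and is only 2.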
slide-44
SLIDE 44

ISOMAP

From Tenenbaum, de Silva, and Langford, Science 290:2319-2323, December 2000.

slide-45
SLIDE 45

Application: Novelty Detection

Intrusion detection - when is a user behaving unusually? First proposed by Prof. Dorothy Denning in 1986. (1995 ACM Fellow)