

SLIDE 1

Deconstructing Data Science

David Bamman, UC Berkeley
 
 Info 290
 Lecture 17: Distance models Mar 28, 2016

SLIDE 2

[Figure: 1853 map of Oakland]

SLIDE 3

[Figure: the 1853 map with two points marked, (5,6) and (12,10)]

SLIDE 4

Feature             x    y
Jefferson Square    5    6
Oakland Square      12   10
Lafayette Square    5    10
Harrison Square     12   6

$$\sum_{i=1}^{F} |x_i - y_i|$$

“Manhattan distance”
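The formula is one line of code; a minimal Python sketch (the function name is my own):

```python
def manhattan(x, y):
    """Manhattan (1-norm) distance: the sum of absolute coordinate differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

manhattan((5, 6), (12, 10))  # |5 - 12| + |6 - 10| = 11
```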

SLIDE 5

[Figure: the 1853 map again, with points (5,6) and (12,10)]

SLIDE 6

(5,6) and (12,10)

The straight-line distance is the hypotenuse of a right triangle whose legs are $|x_1 - y_1|$ and $|x_2 - y_2|$; by the Pythagorean theorem $a^2 + b^2 = c^2$, i.e. $c = \sqrt{a^2 + b^2}$:

$$\sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}$$
SLIDE 7

Euclidean distance

$$\left( \sum_{i=1}^{F} (x_i - y_i)^2 \right)^{1/2} = \sqrt{\sum_{i=1}^{F} (x_i - y_i)^2}$$

SLIDE 8

p-norm

$$\left( \sum_{i=1}^{F} |x_i - y_i|^p \right)^{1/p}$$

$$\left( \sum_{i=1}^{F} |x_i - y_i|^1 \right)^{1/1} \quad \text{1-norm (Manhattan)}$$

$$\left( \sum_{i=1}^{F} |x_i - y_i|^2 \right)^{1/2} \quad \text{2-norm (Euclidean)}$$

SLIDE 9

$$\left( \sum_{i=1}^{F} |x_i - y_i|^0 \right)^{1/0} = \sum_{i=1}^{F} \mathbb{I}[x_i \neq y_i] \quad \text{0-norm (Hamming)}$$

$$\left( \sum_{i=1}^{F} |x_i - y_i|^{\infty} \right)^{1/\infty} = \max_i |x_i - y_i| \quad \text{∞-norm (Chebyshev)}$$
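A Python sketch of the whole family (the function name and structure are my own), treating p = 0 and p = ∞ as the limiting cases above:

```python
import math

def p_norm_distance(x, y, p):
    """Minkowski (p-norm) distance between feature vectors x and y."""
    diffs = [abs(xi - yi) for xi, yi in zip(x, y)]
    if p == 0:                      # 0-"norm": Hamming, count of differing features
        return sum(d != 0 for d in diffs)
    if math.isinf(p):               # ∞-norm: Chebyshev, largest single coordinate gap
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1 / p)

x, y = (5, 6), (12, 10)
p_norm_distance(x, y, 1)            # 11 (Manhattan)
p_norm_distance(x, y, 2)            # sqrt(65) ≈ 8.06 (Euclidean)
p_norm_distance(x, y, math.inf)     # 7 (Chebyshev)
```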

SLIDE 10

Metrics

d(x, y) ≥ 0              distances are not negative
d(x, y) = 0 iff x = y    distances are positive, except for identity
d(x, y) = d(y, x)        distances are symmetric

SLIDE 11

Metrics

d(x, y) ≤ d(x, z) + d(z, y)    triangle inequality

A detour through another point z can’t shorten the “distance” between x and y.

[Figure: points x, y, z]
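Not in the deck, but the axioms are easy to spot-check numerically; a minimal sketch (all names are my own), using the Manhattan distance on random points:

```python
import itertools
import random

def check_metric(d, points, tol=1e-9):
    """Spot-check the metric axioms for a distance function d on sample points
    (a sanity check over a finite sample, not a proof)."""
    for x, y in itertools.product(points, repeat=2):
        assert d(x, y) >= -tol                      # distances are not negative
        assert abs(d(x, y) - d(y, x)) <= tol        # distances are symmetric
        assert (d(x, y) <= tol) == (x == y)         # zero distance iff identical
    for x, y, z in itertools.product(points, repeat=3):
        assert d(x, y) <= d(x, z) + d(z, y) + tol   # triangle inequality

manhattan = lambda x, y: sum(abs(a - b) for a, b in zip(x, y))
points = [(random.randint(0, 20), random.randint(0, 20)) for _ in range(8)]
check_metric(manhattan, points)
```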

SLIDE 12

[Table: binary feature vectors for three Twitter users x1, x2, x3. Features: follow clinton; follow trump; “benghazi”; negative sentiment + “benghazi”; “illegal immigrants”; “republican” in profile; “democrat” in profile; self-reported location = Berkeley]

SLIDE 13

K-nearest neighbors

  • Supervised classification/regression
  • Make a prediction by finding the closest k data points and:
  • predicting the majority label among those k points (classification)
  • predicting the average of those k points (regression)

SLIDE 14

KNN Classification

Let $N(x)$ be the K-nearest neighbors to $x$.

$$P(Y = j \mid x) = \frac{1}{K} \sum_{x_i \in N(x)} \mathbb{I}[y_i = j]$$

(Pick the value of Y with the highest probability.)
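A minimal NumPy sketch of this rule (the function name and signature are my own, not the course’s code):

```python
import numpy as np

def knn_classify(X_train, y_train, x, k=5):
    """Predict the majority label among the k training points closest to x
    (Euclidean distance here; any p-norm from earlier would work)."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance from x to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]              # majority vote
```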

SLIDE 15

KNN Regression

Let $N(x_i)$ be the K-nearest neighbors to $x_i$.

$$\hat{y}_i = \frac{1}{K} \sum_{x_j \in N(x_i)} y_j$$
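The regression variant differs only in the last step, averaging instead of voting; again a sketch with my own names:

```python
import numpy as np

def knn_regress(X_train, y_train, x, k=5):
    """Predict the mean target value of the k training points closest to x."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()
```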

SLIDE 16

Data

http://scott.fortmann-roe.com/docs/BiasVariance.html

SLIDE 17

K=1

http://scott.fortmann-roe.com/docs/BiasVariance.html

SLIDE 18

K=100

http://scott.fortmann-roe.com/docs/BiasVariance.html

SLIDE 19

K=12

http://scott.fortmann-roe.com/docs/BiasVariance.html

SLIDE 20

KNN

  • Properties:
  • Linear/Nonlinear?
  • Complexity of training/testing?
  • Overfitting?
  • How to choose the best K? (a cross-validation sketch follows this list)
  • Impact of data representation
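One standard answer to the best-K question is cross-validation; a rough sketch (names are my own; it reuses knn_classify from the earlier sketch):

```python
import numpy as np

def choose_k(X, y, candidate_ks, n_folds=5):
    """Pick K by cross-validated accuracy: hold out each fold in turn,
    classify its points with the remaining data, and keep the best K."""
    folds = np.array_split(np.random.permutation(len(X)), n_folds)
    accuracy = {}
    for k in candidate_ks:
        correct = 0
        for fold in folds:
            train = np.setdiff1d(np.arange(len(X)), fold)
            correct += sum(knn_classify(X[train], y[train], X[i], k) == y[i]
                           for i in fold)
        accuracy[k] = correct / len(X)
    return max(accuracy, key=accuracy.get)
```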
SLIDE 21

Similarity

task                         method   distance
classification/regression    KNN      Euclidean, etc.
classification/regression    SVM      kernel
duplicate detection
search

SLIDE 22

Relevance (IR)

  • Similarity as an end in itself is a different paradigm from what we’ve been considering so far (classification, regression, clustering).

task                             x           y
KNN classification/regression    documents   genres
duplicate detection              documents   documents

SLIDE 23

Duplicate detection

SLIDE 24

Duplicate document detection

  • What are the data points we’re comparing?
  • How do we represent each one?
  • How do we measure “similarity”?
  • Evaluation?
SLIDE 25

Computational concerns

  • Two sources of complexity:
  • Dimensionality of the feature space (every document is represented by a vocabulary of 1M words) [minhashing]
  • Number of documents in the collection to compare (4.64 billion web pages) [locality-sensitive hashing]
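The deck names minhashing without detail; purely as an illustration of the idea (the names, parameters, and hash family here are my own assumptions):

```python
import random

def minhash_signature(features, n_hashes=128, seed=0):
    """Compress a large, sparse feature set into n_hashes integers.
    Two signatures agree in roughly a Jaccard-similarity fraction of their
    positions, so short signatures approximate set similarity cheaply."""
    rng = random.Random(seed)
    p = 2**61 - 1                   # large prime for the hash family h(v) = (a*v + b) mod p
    family = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(n_hashes)]
    return [min((a * hash(f) + b) % p for f in features) for a, b in family]

sig1 = minhash_signature({"the", "and", "obama", "supreme", "court"})
sig2 = minhash_signature({"the", "and", "kansas", "ncaa", "four"})
estimate = sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)  # ≈ Jaccard similarity
```

Locality-sensitive hashing then addresses the second source of complexity: it bands these signatures so that only documents agreeing on some band are ever compared directly.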

SLIDE 26

[Table: binary word-feature vectors for three documents x1, x2, x3. Features: the, and, obama, supreme, court, kansas, ncaa, four]

SLIDE 27

Jaccard Similarity

[Table: the binary feature vectors x1, x2, x3 from the previous slide]

$$\frac{|X \cap Y|}{|X \cup Y|} = \frac{\text{number of features in both } X \text{ and } Y}{\text{number of features in either } X \text{ or } Y}$$
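In code, with Python sets standing in for the binary feature vectors (a sketch, names my own):

```python
def jaccard(x, y):
    """Jaccard similarity: |X ∩ Y| / |X ∪ Y|."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y) if (x or y) else 0.0

jaccard({"the", "and", "obama"}, {"the", "and", "kansas"})  # 2 shared / 4 total = 0.5
```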

SLIDE 28

Text Reuse

“We were many times weaker than his splendid, lacquered machine, so that I did not even attempt to outspeed him. O lente currite noctis equi! O softly run, nightmares!” — Nabokov, Lolita

SLIDE 29

Text reuse detection

  • What are the data points we’re comparing?
  • How do we represent each one? (one common choice is sketched below)
  • How do we measure “similarity”?
  • Evaluation?
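The deck leaves the representation question open here; one common choice (my illustration, not from the slides) is word n-gram “shingles,” which can then be scored with the Jaccard similarity above:

```python
def shingles(text, n=3):
    """Represent a document as its set of word n-grams ("shingles")."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

a = shingles("O lente currite noctis equi")
b = shingles("O softly run, nightmares! O lente currite noctis equi")
# jaccard(a, b) then measures how much text the two passages share.
```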
SLIDE 30

Information retrieval

SLIDE 31

Information retrieval

  • What are the data points we’re comparing?
  • How do we represent each one?
  • How do we measure “similarity”?
  • Evaluation?

SLIDE 32

Cosine Similarity

$$\cos(x, y) = \frac{\sum_{i=1}^{F} x_i y_i}{\sqrt{\sum_{i=1}^{F} x_i^2} \, \sqrt{\sum_{i=1}^{F} y_i^2}}$$

[Table: the binary feature vectors x1, x2, x3]

  • Euclidean distance measures the magnitude of the distance between two points
  • Cosine similarity measures their orientation
  • Often weighted by TF-IDF to discount the impact of frequent features
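A plain-Python sketch of the formula (names my own); with TF-IDF weighting, the binary 1s in the vectors above would be replaced by weights before this computation:

```python
import math

def cosine(x, y):
    """Cosine similarity: the dot product of x and y, normalized by their magnitudes."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

cosine([1, 1, 0, 1], [1, 1, 1, 0])  # 2 / (sqrt(3) * sqrt(3)) ≈ 0.667
```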

SLIDE 33

Modern IR

  • Modern IR accounts for much more information than document similarity:
  • Prominence/reliability of the document (PageRank)
  • Geographic location
  • Search query history
  • This can become a supervised problem: learning how to map these more elaborate features of a query/session to the search ranking. How do we represent our data?

SLIDE 34

Meme tracking

  • J. Leskovec et al. (2009), "Meme-tracking and the Dynamics of the News Cycle"

SLIDE 35

Meme tracking

  • J. Leskovec et al. (2009), "Meme-tracking and the Dynamics of the News Cycle"

SLIDE 36

Meme tracking

  • J. Leskovec et al. (2009), "Meme-tracking and the Dynamics of the News Cycle"

SLIDE 37

http://mybinder.org/repo/dbamman/dds