Deconstructing Data Science
David Bamman, UC Berkeley Info 290 Lecture 17: Distance models Mar 28, 2016
[Slide: map from 1853, with the points (5, 6) and (12, 10) marked]

Feature            x    y
Jefferson Square   5    6
Oakland Square    12   10
Lafayette Square   5   10
Harrison Square   12    6
"Manhattan distance" between two points x = (x1, x2) and y = (y1, y2) is the sum of the absolute coordinate differences: |x1 − y1| + |x2 − y2|. For (5, 6) and (12, 10), that is 7 + 4 = 11.

Euclidean distance is the straight-line distance, following the Pythagorean theorem a² + b² = c²: sqrt((x1 − y1)² + (x2 − y2)²).
p-norm (general family):

  d(x, y) = ( Σ_{i=1..F} |xi − yi|^p )^(1/p)

1-norm (Manhattan):   ( Σ_{i=1..F} |xi − yi|^1 )^(1/1)
2-norm (Euclidean):   ( Σ_{i=1..F} |xi − yi|^2 )^(1/2)
∞-norm (Chebyshev):   max_i |xi − yi|
0-norm (Hamming):     Σ_{i=1..F} I[xi ≠ yi]  (the number of features on which x and y differ)
d(x, y) ≥ 0                   (distances are not negative)
d(x, y) = 0 iff x = y         (distances are positive, except between a point and itself)
d(x, y) = d(y, x)             (distances are symmetric)
d(x, y) ≤ d(x, z) + d(z, y)   (triangle inequality: a detour through another point z can't shorten the "distance" between x and y)
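These four axioms can be spot-checked for a concrete metric; a minimal sketch, using Manhattan distance and a random sample of integer points (both choices are illustrative, not from the slides):

```python
import itertools
import random

def manhattan(x, y):
    """Manhattan (1-norm) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Spot-check the four metric axioms on all triples of a random sample.
random.seed(0)
pts = [(random.randint(0, 20), random.randint(0, 20)) for _ in range(25)]
for x, y, z in itertools.product(pts, repeat=3):
    assert manhattan(x, y) >= 0                                   # non-negativity
    assert (manhattan(x, y) == 0) == (x == y)                     # zero only for identical points
    assert manhattan(x, y) == manhattan(y, x)                     # symmetry
    assert manhattan(x, y) <= manhattan(x, z) + manhattan(z, y)   # triangle inequality
print("all four axioms hold on the sample")
```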
[Slide: binary feature table for three Twitter users x1, x2, x3, with a 0/1 value per feature. Features: follows clinton; follows trump; "benghazi"; negative sentiment + "benghazi"; "illegal immigrants"; "republican" in profile; "democrat" in profile; self-reported location = Berkeley]
K-nearest neighbors labels a new point from its closest training points: take a majority vote over the neighbors' labels (classification), picking the value of Y with the highest probability, or average the neighbors' values (regression).
Let N(xi) be the K-nearest neighbors to xi.
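A minimal KNN classifier can be sketched as follows; the training points, the labels "A"/"B", and k = 3 are made up for illustration, with Euclidean distance used to find N(xi):

```python
from collections import Counter

def euclidean(x, y):
    """Euclidean (2-norm) distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def knn_classify(train, query, k):
    """train: list of (vector, label) pairs.
    Finds N(x), the K nearest neighbors to the query point,
    and returns the majority label among them."""
    neighbors = sorted(train, key=lambda pair: euclidean(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((5, 6), "A"), ((12, 10), "A"), ((0, 0), "B"), ((1, 1), "B")]
print(knn_classify(train, (4, 5), k=3))  # "B": two of its three nearest neighbors are "B"
```

For regression, the same neighbor search applies, but the prediction is the mean of the neighbors' numeric values instead of a vote.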
http://scott.fortmann-roe.com/docs/BiasVariance.html
task                        method   distance
classification/regression   KNN      Euclidean, etc.
classification/regression   SVM      kernel
duplicate detection
search
This is a different paradigm from what we've been considering so far (classification, regression, clustering).
task                            x           y
KNN classification/regression   documents   genres
duplicate detection             documents
Each document is represented over a vocabulary of ~1M words [minhashing];
search must scale to the web (4.64 billion web pages) [locality sensitive hashing].
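A minimal sketch of minhashing, which compresses a large sparse feature set into a short signature whose per-position agreement estimates Jaccard similarity. The function names and toy documents are illustrative, and Python's built-in hash() salted with random integers stands in for a proper hash family:

```python
import random

def minhash_signature(features, num_hashes=100, seed=0):
    """For each of num_hashes salted hash functions, keep the minimum
    hash value over the feature set. Two sets agree at a signature
    position with probability equal to their Jaccard similarity."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(hash((salt, f)) for f in features) for salt in salts]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc1 = {"the", "and", "supreme", "court", "kansas"}
doc2 = {"the", "and", "supreme", "court", "ncaa", "four"}
sig1 = minhash_signature(doc1)
sig2 = minhash_signature(doc2)
print(estimate_jaccard(sig1, sig2))  # estimate of the true Jaccard similarity, 4/7 ≈ 0.57
```

The point is that comparing two 100-number signatures is far cheaper than intersecting two ~1M-dimensional feature vectors; locality sensitive hashing then buckets similar signatures so that not every pair need be compared at all.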
[Slide: binary word-feature table for three documents x1, x2, x3, with a 0/1 value per word. Vocabulary: the, and, supreme, court, kansas, ncaa, four]
Jaccard similarity = (number of features in both X and Y) / (number of features in either X or Y) = |X ∩ Y| / |X ∪ Y|
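The Jaccard definition above is a one-liner over Python sets; the example word sets are illustrative:

```python
def jaccard(x, y):
    """Jaccard similarity: features in both X and Y,
    over features in either X or Y."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)

print(jaccard({"the", "and", "supreme", "court"},
              {"the", "and", "ncaa", "four"}))  # 2/6 ≈ 0.33
```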
"We were many times weaker than his splendid, lacquered machine, so that I did not even attempt to run, nightmares!" (Nabokov, Lolita)
cos(x, y) = Σ_{i=1..F} xi·yi / ( sqrt(Σ_{i=1..F} xi²) · sqrt(Σ_{i=1..F} yi²) )
Cosine similarity measures the angle between two vectors rather than the magnitude of the distance between two points. A common refinement is to discount the impact of frequent features (words like "the" and "and"). Search ranking is a richer problem than document similarity: the challenge is how to map more elaborate features of a query/session to the search ranking. How do we represent our data?
http://mybinder.org/repo/dbamman/dds