

  1. Deconstructing Data Science
     David Bamman, UC Berkeley
     Info 290
     Lecture 17: Distance models
     Mar 28, 2016

  2. [Figure: map, 1853]

  3. [Figure: the 1853 map with two points marked, (5,6) and (12,10)]

  4. Feature   Jefferson Sq.   Oakland Sq.   Lafayette Sq.   Harrison Sq.
     x         5               12            5               12
     y         6               10            10              6

     "Manhattan distance": \sum_{i=1}^{F} |x_i - y_i|
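A minimal Python sketch of this computation (the function name and dictionary are mine; the coordinates come from the table above):

```python
# Manhattan distance: sum of absolute differences per coordinate.
def manhattan(x, y):
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

# (x, y) grid positions from the table above.
squares = {"Jefferson": (5, 6), "Oakland": (12, 10),
           "Lafayette": (5, 10), "Harrison": (12, 6)}

print(manhattan(squares["Jefferson"], squares["Oakland"]))  # |5-12| + |6-10| = 11
```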

  5. [Figure: the 1853 map with points (5,6) and (12,10)]

  6. [Figure: right triangle between (5,6) and (12,10), with legs |x_1 - y_1| and |x_2 - y_2|]
     a^2 + b^2 = c^2, so \sqrt{a^2 + b^2} = c
     Distance between the points: \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}

  7. Euclidean distance:
     \sqrt{\sum_{i=1}^{F} (x_i - y_i)^2} = \left( \sum_{i=1}^{F} (x_i - y_i)^2 \right)^{1/2}

  8. 1-norm (Manhattan): \left( \sum_{i=1}^{F} |x_i - y_i|^1 \right)^{1/1}
     2-norm (Euclidean): \left( \sum_{i=1}^{F} |x_i - y_i|^2 \right)^{1/2}
     p-norm: \left( \sum_{i=1}^{F} |x_i - y_i|^p \right)^{1/p}

  9. 0-norm (Hamming): \left( \sum_{i=1}^{F} |x_i - y_i|^0 \right)^{1/0} = \sum_{i=1}^{F} I[x_i \neq y_i]
     \infty-norm (Chebyshev): \left( \sum_{i=1}^{F} |x_i - y_i|^\infty \right)^{1/\infty} = \max_i |x_i - y_i|
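A hedged numpy sketch of the whole family (the function name is mine; Hamming and Chebyshev are computed directly as the p → 0 and p → ∞ limits rather than through the 1/p exponent):

```python
import numpy as np

def p_norm_distance(x, y, p):
    """Minkowski distance: (sum_i |x_i - y_i|^p)^(1/p)."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x, y = np.array([5, 6]), np.array([12, 10])
print(p_norm_distance(x, y, 1))  # 1-norm (Manhattan): 11.0
print(p_norm_distance(x, y, 2))  # 2-norm (Euclidean): sqrt(65) ~ 8.06
print(np.max(np.abs(x - y)))     # inf-norm (Chebyshev): 7
print(np.sum(x != y))            # 0-norm (Hamming): 2 differing features
```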

  10. Metrics
      d(x, y) ≥ 0 (distances are not negative)
      d(x, y) = 0 iff x = y (distances are positive, except for identity)
      d(x, y) = d(y, x) (distances are symmetric)

  11. Metrics
      d(x, y) ≤ d(x, z) + d(z, y) (triangle inequality: a detour to another point z can't shorten the "distance" between x and y)
      [Figure: points x, y, z illustrating the detour]
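These axioms are easy to spot-check numerically for a given distance function; a small sketch for Euclidean distance (the third point z is an arbitrary choice of mine):

```python
import numpy as np

def euclidean(a, b):
    return np.linalg.norm(np.asarray(a) - np.asarray(b))

x, y, z = (5, 6), (12, 10), (8, 2)
assert euclidean(x, y) >= 0                                  # non-negative
assert euclidean(x, x) == 0                                  # zero iff identical
assert euclidean(x, y) == euclidean(y, x)                    # symmetric
assert euclidean(x, y) <= euclidean(x, z) + euclidean(z, y)  # triangle inequality
```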

  12. Feature                              x1   x2   x3
      follow clinton                       1    0    0
      follow trump                         0    1    1
      "benghazi"                           0    0    1
      negative sentiment + "benghazi"      0    1    0
      "illegal immigrants"                 0    1    1
      "republican" in profile              0    0    0
      "democrat" in profile                0    0    0
      self-reported location = Berkeley    1    0    0

  13. K-nearest neighbors
      • Supervised classification/regression
      • Make a prediction by finding the closest k data points and
        • predicting the majority label among those k points (classification)
        • predicting the average of those k points (regression)

  14. KNN Classification
      Let N(x) be the K-nearest neighbors to x. Then
      P(Y = j \mid x) = \frac{1}{K} \sum_{x_i \in N(x)} I[y_i = j]
      (Pick the value of Y with the highest probability.)

  15. KNN Regression
      Let N(x_i) be the K-nearest neighbors to x_i. Then
      \hat{y}_i = \frac{1}{K} \sum_{x_j \in N(x_i)} y_j
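Both predictions fall out of the same neighbor lookup; a minimal numpy sketch (the data, names, and the Euclidean choice are illustrative, not from the slides):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, K, classify=True):
    """Predict for one query x from its K nearest neighbors (Euclidean)."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every training point
    neighbors = y_train[np.argsort(dists)[:K]]    # labels of the K closest points
    if classify:
        return Counter(neighbors).most_common(1)[0][0]  # majority label
    return neighbors.mean()                             # average response

X = np.array([[5, 6], [5, 10], [12, 10], [12, 6]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([6, 7]), K=3))  # -> 0
```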

  16. Data http://scott.fortmann-roe.com/docs/BiasVariance.html

  17. K=1 http://scott.fortmann-roe.com/docs/BiasVariance.html

  18. K=100 http://scott.fortmann-roe.com/docs/BiasVariance.html

  19. K=12 http://scott.fortmann-roe.com/docs/BiasVariance.html

  20. KNN
      • Properties:
        • Linear/nonlinear?
        • Complexity of training/testing?
        • Overfitting?
        • How to choose the best K? (see the cross-validation sketch below)
        • Impact of data representation
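For the "best K" question, the standard answer is cross-validation; a sketch using scikit-learn (it assumes a feature matrix X and label vector y are already loaded):

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumes X (features) and y (labels) are already loaded.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X, y, cv=5).mean()
          for k in [1, 3, 5, 12, 25, 50, 100]}
best_k = max(scores, key=scores.get)  # K with the best held-out accuracy
```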

  21. Similarity
      task                        method   distance
      classification/regression   KNN      Euclidean, etc.
      classification/regression   SVM      kernel
      duplicate detection
      search

  22. Relevance (IR)
      • Similarity as an end in itself is a different paradigm from what we've been considering so far (classification, regression, clustering).

      task                            x           y
      KNN classification/regression   documents   genres
      duplicate detection             documents

  23. Duplicate detection

  24. Duplicate document detection
      • What are the data points we're comparing?
      • How do we represent each one?
      • How do we measure "similarity"?
      • Evaluation?

  25. Computational concerns
      • Two sources of complexity:
        • Dimensionality of the feature space (every document is represented by a vocabulary of 1M words) [minhashing; see the sketch below]
        • Number of documents in the collection to compare (4.64 billion web pages) [locality-sensitive hashing]
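A toy sketch of the minhashing idea (the hash family and parameters are my own illustrative choices, not a production implementation): each document's feature set is compressed to a short signature, and the fraction of agreeing signature positions approximates the Jaccard similarity defined on slide 27.

```python
import random

def minhash_signature(features, num_hashes=128, seed=0):
    """Compress a feature set into num_hashes min-hash values."""
    rng = random.Random(seed)
    p = 2_147_483_647  # large prime for the hash family h(v) = (a*v + b) % p
    params = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_hashes)]
    return [min((a * hash(f) + b) % p for f in features) for a, b in params]

sig1 = minhash_signature({"the", "and", "obama", "supreme", "court", "four"})
sig2 = minhash_signature({"the", "and", "obama", "kansas", "ncaa", "four"})
print(sum(a == b for a, b in zip(sig1, sig2)) / len(sig1))  # ~ Jaccard = 0.5
```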

  26. Feature   x1   x2   x3
      the       1    1    1
      and       1    1    1
      obama     1    1    0
      supreme   1    0    0
      court     1    0    1
      kansas    0    1    1
      ncaa      0    1    1
      four      1    1    1

  27. Jaccard Similarity
      J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}
      • |X ∩ Y|: the number of features in both X and Y
      • |X ∪ Y|: the number of features in either X or Y
      (computed over the binary feature columns x1, x2, x3 above)
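In code this is one line over feature sets; a minimal sketch using columns x1 and x2 from the slide-26 table:

```python
def jaccard(X, Y):
    """Jaccard similarity of two feature sets."""
    return len(X & Y) / len(X | Y)

x1 = {"the", "and", "obama", "supreme", "court", "four"}
x2 = {"the", "and", "obama", "kansas", "ncaa", "four"}
print(jaccard(x1, x2))  # 4 shared / 8 total = 0.5
```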

  28. Text Reuse
      "We were many times weaker than his splendid, lacquered machine, so that I did not even attempt to outspeed him. O lente currite noctis equi! O softly run, nightmares!"
      (Nabokov, Lolita)

  29. Text reuse detection
      • What are the data points we're comparing?
      • How do we represent each one?
      • How do we measure "similarity"?
      • Evaluation?

  30. Information retrieval

  31. Information retrieval
      • What are the data points we're comparing?
      • How do we represent each one?
      • How do we measure "similarity"?
      • Evaluation?

  32. Cosine Similarity
      cos(x, y) = \frac{\sum_{i=1}^{F} x_i y_i}{\sqrt{\sum_{i=1}^{F} x_i^2} \sqrt{\sum_{i=1}^{F} y_i^2}}
      • Euclidean distance measures the magnitude of the distance between two points
      • Cosine similarity measures their orientation
      • Often weighted by TF-IDF to discount the impact of frequent features
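A minimal sketch over the binary vectors from the slide-26 table (in practice the vectors would carry TF-IDF weights rather than 0/1 counts):

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two feature vectors."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x1 = np.array([1, 1, 1, 1, 1, 0, 0, 1])  # column x1 above
x2 = np.array([1, 1, 1, 0, 0, 1, 1, 1])  # column x2 above
print(cosine_similarity(x1, x2))  # 4 / 6 ~ 0.67
```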

  33. Modern IR
      • Modern IR accounts for much more information than document similarity:
        • Prominence/reliability of the document (PageRank)
        • Geographic location
        • Search query history
      • This can become a supervised problem: learn how to map these more elaborate features of a query/session to the search ranking.
      • How do we represent our data?

  34. Meme tracking J. Leskovec et al. (2009), "Meme-tracking and the Dynamics of the News Cycle"

  35. Meme tracking J. Leskovec et al. (2009), "Meme-tracking and the Dynamics of the News Cycle"

  36. Meme tracking J. Leskovec et al. (2009), "Meme-tracking and the Dynamics of the News Cycle"

  37. http://mybinder.org/repo/dbamman/dds
