Learned about: LSH/Similarity search & recommender systems


SLIDE 1

SLIDE 2

- Learned about: LSH/Similarity search & recommender systems
- Search: "jaguar"
- Uncertainty about the user's information need
  - Don't put all eggs in one basket!
- Relevance isn't everything – need diversity!

5/28/20 Tim Althoff, UW CS547: Machine Learning for Big Data, http://www.cs.washington.edu/cse547 2

SLIDE 3

- Recommendation:
- Summarization:
- News Media:

[Figure: results for the query "Robert Downey Jr." in each setting]

SLIDE 4

- Goal: Timeline should express his relationships to other people through events (personal, collaboration, mentorship, etc.)
- Why timelines?
  - Easier: Wikipedia article is 18 pages long
  - Context: Through relationships & event descriptions
  - Exploration: Can "jump" to other people

[Figure: timeline of Robert Downey Jr. (1965–), spanning 1985–2015, with events such as Chaplin, Ally McBeal, Gothika, The Party's Over, Iron Man 1–3, The Avengers, and people such as Robert Downey Sr., Deborah Falconer, Fiona Apple, Ben Stiller, Susan Downey, Paramount Pictures]

[Althoff et al., KDD 2015]

SLIDE 5

- Given:
  - Relevant relationships
  - Events that each cover some relationships
- Goal: Given a large set of events, pick a small subset that explains most known relationships ("the timeline")

SLIDE 6

[Figure: timeline of Robert Downey Jr. (1965–), 1985–2015]

Demo available at: http://cs.stanford.edu/~althoff/timemachine/demo.html

SLIDE 7

- User studies: People hate redundancy!

[Figure: "Iron Man US Release" vs. a list of events: Iron Man EU Release, Iron Man Award Ceremony, Iron Man US Release, Rented Lips US Release, Chaplin Academy Award N.]

- Want to see a more diverse set of relationships

SLIDE 8

SLIDE 9

- Idea: Encode diversity as a coverage problem
- Example: Selecting events for the timeline
  - Try to cover all important relationships

SLIDE 10

- Q: What is being covered? A: Relationships
- Q: Who is doing the covering? A: Events

[Figure: relationships Captain America, Anthony Hopkins, Gwyneth Paltrow, Susan Downey; example event: "Downey Jr. starred in Chaplin together with Anthony Hopkins"]

SLIDE 11

- Suppose we are given a set of events E
  - Each event e covers a set of relationships X_e ⊆ U
- For a set of events S ⊆ E we define: F(S) = |⋃_{e∈S} X_e|
- Goal: We want to max_{|S|≤k} F(S)  (cardinality constraint)
- Note: F(S) is a set function: F(S) : 2^E → ℕ
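The coverage set function above can be sketched in a few lines of Python. The event names and relationship sets below are made-up toy data, not from the lecture.

```python
# F(S) = |union of X_e for e in S|: count the distinct relationships
# covered by a set of selected events. Toy events are illustrative only.

def coverage(S, X):
    """Coverage set function F(S)."""
    covered = set()
    for e in S:
        covered |= X[e]          # X[e]: relationships covered by event e
    return len(covered)

# hypothetical events and the relationships each covers
X = {
    "chaplin_film": {"Anthony Hopkins", "Robert Downey Sr."},
    "wedding_2005": {"Susan Downey"},
    "iron_man":     {"Gwyneth Paltrow", "Paramount Pictures"},
}

print(coverage({"chaplin_film", "iron_man"}, X))  # 4 distinct relationships
```

Note that F counts each relationship once no matter how many selected events cover it, which is exactly what makes the function a coverage function.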

SLIDE 12

- Given a universe of elements U = {u_1, ..., u_n} and sets X_1, ..., X_m, each X_i ⊆ U
- Goal: Find a set of k events X_1, ..., X_k covering most of U
  - More precisely: Find k events whose union has the largest size

[Figure: universe U with overlapping sets X_1, X_2, X_3, X_4]

- U: all relationships; X_i: relationships covered by event i

SLIDE 13

Simple Heuristic: Greedy Algorithm

- Start with S_0 = {}
- For i = 1, ..., k:
  - Take the event e that maximizes F(S_{i-1} ∪ {e})
  - Let S_i = S_{i-1} ∪ {e}
- Example:
  - Evaluate F({e_1}), ..., F({e_m}), pick the best (say e_1)
  - Evaluate F({e_1} ∪ {e_2}), ..., F({e_1} ∪ {e_m}), pick the best (say e_2)
  - Evaluate F({e_1, e_2} ∪ {e_3}), ..., F({e_1, e_2} ∪ {e_m}), pick the best
  - And so on...

where F(S) = |⋃_{e∈S} X_e|
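The greedy loop above can be written as a short sketch; the events e1–e3 and their relationships are assumptions for illustration.

```python
# Greedy heuristic from the slide: in each of k rounds, add the event
# that maximizes F(S ∪ {e}). Toy events e1-e3 are assumptions.

def greedy_max_coverage(X, k):
    S, covered = set(), set()
    for _ in range(k):
        candidates = [e for e in X if e not in S]
        if not candidates:
            break
        # event maximizing F(S ∪ {e}) = |covered ∪ X_e|; ties broken by order
        best = max(candidates, key=lambda e: len(covered | X[e]))
        S.add(best)
        covered |= X[best]
    return S, covered

X = {"e1": {"r1", "r2", "r3"}, "e2": {"r3", "r4"}, "e3": {"r5"}}
S, covered = greedy_max_coverage(X, k=2)
print(sorted(S), len(covered))  # ['e1', 'e2'] 4
```

Each round scans all remaining events, matching the F({e_1} ∪ {e_i}) evaluations listed on the slide.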

SLIDES 14–18

- Goal: Maximize the covered area

SLIDE 19

- Goal: Maximize the size of the covered area with two sets
- Greedy first picks A and then C
- But the optimal way would be to pick B and C

[Figure: three overlapping sets A, B, C]
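The suboptimality on this slide can be reproduced numerically. The concrete elements of A, B, and C below are assumptions chosen to mimic the figure: A is the largest single set, but B and C together cover more.

```python
# Greedy grabs the largest single set A first, but the optimal pair is
# B and C. Concrete elements are assumptions mimicking the figure.

def greedy(sets, k):
    chosen, covered = [], set()
    for _ in range(k):
        name = max((n for n in sets if n not in chosen),
                   key=lambda n: len(covered | sets[n]))
        chosen.append(name)
        covered |= sets[name]
    return chosen, covered

sets = {"A": {2, 3, 4, 5}, "B": {1, 2, 3}, "C": {4, 5, 6}}
chosen, covered = greedy(sets, k=2)
print(chosen[0], len(covered))     # A 5   (greedy covers 5 elements)
print(len(sets["B"] | sets["C"]))  # 6     (the optimal pair covers 6)
```

Greedy is misled because A overlaps both B and C; after taking A, no second set can recover the full universe.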

SLIDE 20

- Bad news: Maximum Coverage is NP-hard
- Good news: Good approximations exist
  - The problem has enough structure that even simple greedy algorithms perform reasonably well
  - Details in 2nd half of lecture
- Now: Generalize our objective for timeline generation

SLIDE 21

- Objective values all relationships equally
- Unrealistic: Some relationships are more important than others
  - Use different weights ("weighted coverage function")

Simple coverage: F(S) = |⋃_{e∈S} X_e| = Σ_{r∈R} 1, where R = ⋃_{e∈S} X_e

Weighted coverage: F(S) = Σ_{r∈R} w(r), with weights w : R → ℝ+
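The weighted variant replaces the count with a sum of weights over distinct covered relationships. A short sketch, with made-up events and weights:

```python
# Weighted coverage F(S) = Σ_{r∈R} w(r), where R is the set of distinct
# relationships covered by S. Events and weights are illustrative only.

def weighted_coverage(S, X, w):
    R = set().union(*(X[e] for e in S)) if S else set()
    return sum(w[r] for r in R)

X = {"wedding":  {"Susan Downey"},
     "premiere": {"Susan Downey", "Gwyneth Paltrow"}}
w = {"Susan Downey": 3.0, "Gwyneth Paltrow": 1.5}

# Susan Downey is counted once even though two events cover her
print(weighted_coverage({"wedding", "premiere"}, X, w))  # 4.5
```

Setting every weight to 1 recovers the simple coverage function from the previous slides.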

SLIDE 22

- Use global importance weights
- How much interest is there?
- Could be measured as:
  - w(X) = # search queries for person X
  - w(X) = # Wikipedia article views for X
  - w(X) = # news article mentions for X

[Figure: relationships Captain America, Anthony Hopkins, Gwyneth Paltrow, Susan Downey weighted by global importance]

SLIDE 23

- Some relationships are not (very) globally important but highly relevant to the timeline, and vice versa
- Need relevance to the timeline instead of global relevance: w(Susan Downey | RDJr) > w(Justin Bieber | RDJr)

[Figure: applying global importance weights to Captain America, Justin Bieber, Tim Althoff, Susan Downey]

SLIDE 24

- Can use co-occurrence statistics: w(X | RDJr) = #(X and RDJr) / (#(RDJr) * #(X))
  - Similar: Pointwise mutual information (PMI)
  - How often do X and Y occur together, compared to what you would expect if they were independent?
  - Accounts for popular entities (e.g., Justin Bieber)
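Plugging made-up counts into the co-occurrence formula shows why it down-weights merely popular entities. All counts below are invented for illustration.

```python
# w(X | anchor) = #(X and anchor) / (#(anchor) * #(X)), the PMI-style
# score from the slide (up to constants). All counts are invented.

def cooccurrence_weight(pair_count, count_x, count_anchor):
    # high when X and the anchor co-occur more than popularity alone predicts
    return pair_count / (count_anchor * count_x)

w_susan  = cooccurrence_weight(pair_count=900,   count_x=1_000,   count_anchor=50_000)
w_bieber = cooccurrence_weight(pair_count=1_200, count_x=900_000, count_anchor=50_000)

# Bieber co-occurs more often in absolute terms, but only because he is
# mentioned everywhere; the normalization ranks Susan Downey higher.
print(w_susan > w_bieber)  # True
```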

SLIDE 25

- How to differentiate between two events that cover the same relationships?
- Example: Robert and Susan Downey
  - Event 1: Wedding, August 27, 2005
  - Event 2: Minor charity event, Nov 11, 2006
- We need to be able to distinguish these!

SLIDE 26

- Further improvement when we not only score relationships but also score the event timestamps
- Again, use co-occurrences for the weights w_T

F(S) = Σ_{r∈R} w_R(r) + Σ_{e∈S} w_T(t_e), where R = ⋃_{e∈S} X_e
(first sum: relationships, as before; second sum: timestamps)

SLIDE 27

[Screenshot: marvel.com]

- "Robert Downey Jr" and "May 4, 2012" occur 173 times on 71 different webpages
- US release date of The Avengers
- Use MapReduce on 10B web pages (10k+ machines)

SLIDE 28

- Generalized the earlier coverage function to a linear combination of weighted coverage functions
- Goal: max_{|S|≤k} F(S)
- Still NP-hard (because it is a generalization of an NP-hard problem)

F(S) = Σ_{r∈R} w_R(r) + Σ_{e∈S} w_T(t_e), where R = ⋃_{e∈S} X_e
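A sketch of the combined objective with toy data; every name, date, and weight below is an assumption for illustration.

```python
# F(S) = Σ_{r∈R} w_R(r) + Σ_{e∈S} w_T(t_e): weighted relationship coverage
# plus a per-event timestamp score. All data below is toy.

def timeline_objective(S, X, t, w_R, w_T):
    R = set().union(*(X[e] for e in S)) if S else set()
    return sum(w_R[r] for r in R) + sum(w_T[t[e]] for e in S)

X   = {"wedding": {"Susan Downey"}, "premiere": {"Gwyneth Paltrow"}}
t   = {"wedding": "2005-08-27", "premiere": "2008-04-30"}
w_R = {"Susan Downey": 3.0, "Gwyneth Paltrow": 1.5}
w_T = {"2005-08-27": 2.0, "2008-04-30": 0.5}

print(timeline_objective({"wedding", "premiere"}, X, t, w_R, w_T))  # 7.0
```

Note the asymmetry: relationship weights are summed over *distinct* covered relationships, while timestamp weights are summed once per selected event.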

SLIDE 29

- How can we actually optimize this function?
- What structure is there that will help us do this efficiently?
- Any questions so far?

SLIDE 30

- For this optimization problem, Greedy produces a solution S s.t. F(S) ≥ (1 - 1/e)·OPT, i.e., F(S) ≥ 0.63·OPT [Nemhauser, Fisher, Wolsey '78]
- Claim holds for functions F(·) which are: submodular, monotone, normal, non-negative (discussed next)

SLIDE 31

Definition:
- A set function F(·) is called submodular if, for all P, Q ⊆ U:
  F(P) + F(Q) ≥ F(P ∪ Q) + F(P ∩ Q)

[Figure: Venn diagrams of P and Q illustrating P ∪ Q and P ∩ Q]

SLIDE 32

- Checking the previous definition is not easy in practice
- Substitute P = A ∪ {d} and Q = B, where A ⊆ B and d ∉ B, into the definition above:

From before: F(P) + F(Q) ≥ F(P ∪ Q) + F(P ∩ Q)
F(A ∪ {d}) + F(B) ≥ F(A ∪ {d} ∪ B) + F((A ∪ {d}) ∩ B)
F(A ∪ {d}) + F(B) ≥ F(B ∪ {d}) + F(A)    (since A ⊆ B and d ∉ B)
F(A ∪ {d}) − F(A) ≥ F(B ∪ {d}) − F(B)

This is the common definition of submodularity.
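For small ground sets, the diminishing-returns inequality derived above can be verified exhaustively. A brute-force sketch, applied to a toy coverage function (the sets are assumptions):

```python
# Brute-force check of F(A ∪ {d}) - F(A) >= F(B ∪ {d}) - F(B) for all
# A ⊆ B and d ∉ B. Only feasible for tiny ground sets; toy data below.

from itertools import chain, combinations

def subsets(s):
    s = list(s)
    return (set(c) for c in chain.from_iterable(
        combinations(s, r) for r in range(len(s) + 1)))

def is_submodular(F, ground):
    for B in subsets(ground):
        for A in subsets(B):                 # every A ⊆ B
            for d in ground - B:             # every d ∉ B
                if F(A | {d}) - F(A) < F(B | {d}) - F(B):
                    return False
    return True

X = {"e1": {1, 2}, "e2": {2, 3}, "e3": {3, 4}}
F = lambda S: len(set().union(*(X[e] for e in S))) if S else 0

print(is_submodular(F, set(X)))  # True: coverage has diminishing returns
```

By contrast, a function with *increasing* returns such as F(S) = |S|² fails the check, which is a handy sanity test.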

SLIDE 33

- Diminishing returns characterization:

F(A ∪ {d}) − F(A) ≥ F(B ∪ {d}) − F(B)

[Figure: adding d to the small set A gives a large improvement; adding d to the large set B gives a small improvement. Gain of adding d to a small set vs. gain of adding d to a large set]

SLIDE 34

For all A ⊆ B: F(A ∪ {d}) − F(A) ≥ F(B ∪ {d}) − F(B)

[Figure: F(·) plotted against solution size |A|, showing F(A), F(A ∪ {d}), F(B), F(B ∪ {d}). Adding d to B helps less than adding it to A!]

slide-35
SLIDE 35

Let F1 … FM be submodular functions and λ1 … λM ≥ 0 and let S denote some solution set, then the non-negative linear combination F(S) (defined below) of these functions is also submodular.

5/28/20 Tim Althoff, UW CS547: Machine Learning for Big Data, http://www.cs.washington.edu/cse547 35

SLIDE 36

- When maximizing a submodular function under a cardinality constraint, Greedy produces a solution S for which F(S) ≥ (1 - 1/e)·OPT, i.e., F(S) ≥ 0.63·OPT [Nemhauser, Fisher, Wolsey '78]
- Claim holds for functions F(·) which are, in addition to being submodular:
  - Monotone: if A ⊆ B then F(A) ≤ F(B)
  - Normal: F({}) = 0
  - Non-negative: for any A, F(A) ≥ 0

SLIDE 37

SLIDE 38

- Suppose we are given a set of events E
  - Each event e covers a set of relationships X_e ⊆ U
- For a set of events S ⊆ E we define: F(S) = |⋃_{e∈S} X_e|
- Goal: We want to max_{|S|≤k} F(S)  (cardinality constraint)
- Note: F(S) is a set function: F(S) : 2^E → ℕ

SLIDE 39

- Claim: F(S) = |⋃_{e∈S} X_e| is submodular.

For all A ⊆ B: F(A ∪ {e}) − F(A) ≥ F(B ∪ {e}) − F(B)

[Figure: gain of adding X_e to the smaller set A vs. gain of adding X_e to the larger set B]

SLIDE 40

- Claim: F(S) = |⋃_{e∈S} X_e| is normal & monotone.
- Normality: When S is empty, ⋃_{e∈S} X_e is empty, so F({}) = 0.
- Monotonicity: Adding a new event to S can never decrease the number of relationships covered by S.
- What about non-negativity?

(Monotone: if A ⊆ B then F(A) ≤ F(B); Normal: F({}) = 0; Non-negative: for any A, F(A) ≥ 0)

SLIDE 41

                                    Submodularity  Monotonicity  Normality
Simple Coverage                           ✓              ✓            ✓
Weighted Coverage (Relationships)
Weighted Coverage (Timestamps)
Complete Optimization Problem

SLIDE 42

- Claim: F(S) = Σ_{r∈R} w(r) is submodular, where R = ⋃_{e∈S} X_e and w : R → ℝ+.
  - Consider two sets A and B s.t. A ⊆ B ⊆ S, and consider an event e ∉ B
  - Three possibilities when we add e to A or B:
  - Case 1: e does not cover any new relationships w.r.t. both A and B:
    F(A ∪ {e}) − F(A) = 0 = F(B ∪ {e}) − F(B)

SLIDE 43

- Claim: F(S) = Σ_{r∈R} w(r) is submodular.
  - Case 2: e covers some new relationships w.r.t. A but not w.r.t. B:
    F(A ∪ {e}) − F(A) = v, where v ≥ 0
    F(B ∪ {e}) − F(B) = 0
    Therefore, F(A ∪ {e}) − F(A) ≥ F(B ∪ {e}) − F(B)

SLIDE 44

- Claim: F(S) = Σ_{r∈R} w(r) is submodular.
  - Case 3: e covers some new relationships w.r.t. both A and B:
    F(A ∪ {e}) − F(A) = v, where v ≥ 0
    F(B ∪ {e}) − F(B) = u, where u ≥ 0
    But v ≥ u, because the relationships that are new w.r.t. B are a subset of those that are new w.r.t. A

SLIDE 45

- Claim: F(S) = Σ_{r∈R} w(r) is monotone and normal, where R = ⋃_{e∈S} X_e and w : R → ℝ+.
- Normality: When S is empty, R = ⋃_{e∈S} X_e is empty, so F({}) = 0.
- Monotonicity: Adding a new event to S can never decrease the set of relationships covered by S, and with non-negative weights F(S) can therefore never decrease.

SLIDE 46

                                    Submodularity  Monotonicity  Normality
Simple Coverage                           ✓              ✓            ✓
Weighted Coverage (Relationships)         ✓              ✓            ✓
Weighted Coverage (Timestamps)
Complete Optimization Problem

SLIDE 47

- Claim: F(S) = Σ_{e∈S} w_T(t_e) is submodular, monotone, and normal
- Analogous arguments to those for weighted coverage (relationships) apply

SLIDE 48

                                    Submodularity  Monotonicity  Normality
Simple Coverage                           ✓              ✓            ✓
Weighted Coverage (Relationships)         ✓              ✓            ✓
Weighted Coverage (Timestamps)            ✓              ✓            ✓
Complete Optimization Problem

SLIDE 49

- Generalized the earlier coverage function to a non-negative linear combination of weighted coverage functions
- Goal: max_{|S|≤k} F(S)
- Claim: F(S) is submodular, monotone, and normal

F(S) = Σ_{r∈R} w_R(r) + Σ_{e∈S} w_T(t_e) = F_1(S) + F_2(S), where R = ⋃_{e∈S} X_e

SLIDE 50

- Submodularity: F(S) is a non-negative linear combination of two submodular functions. Therefore, it is submodular too.
- Normality: F_1({}) = 0 = F_2({}), so F({}) = F_1({}) + F_2({}) = 0.
- Monotonicity: Let A ⊆ B ⊆ S. Then F_1(A) ≤ F_1(B) and F_2(A) ≤ F_2(B), so F_1(A) + F_2(A) ≤ F_1(B) + F_2(B).

SLIDE 51

                                    Submodularity  Monotonicity  Normality
Simple Coverage                           ✓              ✓            ✓
Weighted Coverage (Relationships)         ✓              ✓            ✓
Weighted Coverage (Timestamps)            ✓              ✓            ✓
Complete Optimization Problem             ✓              ✓            ✓

SLIDE 52

SLIDE 53

- Greedy Algorithm is slow!
- At each iteration, we need to evaluate the marginal gains F(S ∪ {x}) − F(S) of all the remaining elements
- Runtime O(|U| · K) for selecting K elements out of the set U

[Figure: Greedy adds the element (among a, b, c, d, e) with the highest marginal gain]

SLIDE 54

- In round i:
  - So far we have S_{i-1} = {e_1, ..., e_{i-1}}
  - Now we pick an element e ∉ S_{i-1} which maximizes the marginal benefit Δ_i(e) = F(S_{i-1} ∪ {e}) − F(S_{i-1})
- Key observation:
  - The marginal gain of any element e can never increase!
  - For every element e: Δ_i(e) ≥ Δ_j(e) for all iterations i < j

SLIDE 55

- Idea:
  - Use Δ_i as an upper bound on Δ_j (j > i)
- Lazy Greedy:
  - Keep an ordered list of marginal benefits Δ_i from the previous iteration
  - Re-evaluate Δ_i only for the top element
  - Re-sort and prune

[Figure: elements a–e ordered by (upper bounds on) marginal gain Δ_1, with A_1 = {a}; justified by F(A ∪ {d}) − F(A) ≥ F(B ∪ {d}) − F(B) for A ⊆ B]

[Leskovec et al., KDD '07]
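The lazy-greedy bookkeeping above is naturally implemented with a max-heap of cached gains. A sketch for the coverage objective; the events are toy assumptions.

```python
# Lazy greedy: cached marginal gains are valid upper bounds because gains
# only shrink (submodularity). Only the heap's top entry is re-evaluated.

import heapq

def lazy_greedy(X, k):
    S, covered = [], set()
    # heap of (-upper_bound, element); initial bounds are the full |X_e|
    heap = [(-len(X[e]), e) for e in X]
    heapq.heapify(heap)
    while heap and len(S) < k:
        neg_bound, e = heapq.heappop(heap)
        gain = len(X[e] - covered)           # re-evaluate only the top element
        if not heap or gain >= -heap[0][0]:  # still beats the next upper bound
            S.append(e)
            covered |= X[e]
        else:
            heapq.heappush(heap, (-gain, e)) # stale bound: reinsert and retry
    return S, covered

X = {"e1": {1, 2, 3}, "e2": {3, 4}, "e3": {5}}
S, covered = lazy_greedy(X, k=2)
print(S, sorted(covered))  # ['e1', 'e2'] [1, 2, 3, 4]
```

If the re-evaluated gain still exceeds the next cached upper bound, the element must be the true maximizer, so most candidates are never re-scored.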


SLIDE 58

- Lazy greedy offers a significant speed-up over traditional greedy implementations in practice.

[Figure: running time (seconds; lower is better) vs. number of elements selected (1–10), comparing exhaustive search (all subsets), naive greedy, and lazy greedy] [Leskovec et al., KDD '07]

SLIDE 59

- Althoff et al., TimeMachine: Timeline Generation for Knowledge-Base Entities, KDD 2015
- Leskovec et al., Cost-effective Outbreak Detection in Networks, KDD 2007
- Andreas Krause and Daniel Golovin, Submodular Function Maximization
- ICML Tutorial: http://submodularity.org/submodularity-icml-part1-slides-prelim.pdf
- Learning and Testing Submodular Functions: http://grigory.us/cis625/lecture3.pdf
- UW research by Jeff Bilmes (ECE)