Learned about: LSH/Similarity search & recommender systems


SLIDE 1

SLIDE 2

- Learned about: LSH/Similarity search & recommender systems
- Search: "jaguar"
- Uncertainty about the user's information need
  - Don't put all eggs in one basket!
- Relevance isn't everything – need diversity!

5/28/20 Tim Althoff, UW CS547: Machine Learning for Big Data, http://www.cs.washington.edu/cse547 2

SLIDE 3

- Recommendation:
- Summarization:
- News Media:

[Figure: results for the query "Robert Downey Jr." in each setting]

SLIDE 4

- Goal: Timeline should express his relationships to other people through events (personal, collaboration, mentorship, etc.)
- Why timelines?
  - Easier: Wikipedia article is 18 pages long
  - Context: Through relationships & event descriptions
  - Exploration: Can "jump" to other people

[Figure: timeline of Robert Downey Jr. (1965–), spanning 1985–2015, with events such as Chaplin, Ally McBeal, Gothika, The Party's Over, Iron Man 1–3, The Avengers, and people such as Robert Downey Sr., Deborah Falconer, Fiona Apple, Ben Stiller, Susan Downey, Paramount Pictures]

[Althoff et al., KDD 2015]

SLIDE 5

- Given:
  - Relevant relationships
  - Events that each cover some relationships
- Goal: Given a large set of events, pick a small subset that explains most known relationships ("the timeline")

SLIDE 6

[Figure: timeline of Robert Downey Jr. (1965–), 1985–2015]

Demo available at: http://cs.stanford.edu/~althoff/timemachine/demo.html

SLIDE 7

- User studies: People hate redundancy!

[Figure: "Iron Man US Release" vs. a list of events: Iron Man EU Release, Iron Man Award Ceremony, Iron Man US Release, Rented Lips US Release, Chaplin Academy Award N.]

- Want to see a more diverse set of relationships

SLIDE 8

SLIDE 9

- Idea: Encode diversity as a coverage problem
- Example: Selecting events for the timeline
  - Try to cover all important relationships

SLIDE 10

- Q: What is being covered? A: Relationships
- Q: Who is doing the covering? A: Events

[Figure: relationships Captain America, Anthony Hopkins, Gwyneth Paltrow, Susan Downey; example event: "Downey Jr. starred in Chaplin together with Anthony Hopkins"]

SLIDE 11

- Suppose we are given a set of events E
  - Each event e covers a set of relationships X_e ⊆ U
- For a set of events S ⊆ E we define: F(S) = |⋃_{e∈S} X_e|
- Goal: We want to max_{|S|≤k} F(S)  (cardinality constraint)
- Note: F(S) is a set function: F(S) : 2^E → ℕ
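The coverage set function above can be sketched in a few lines of Python. The event names and relationship sets below are made-up toy data, not from the lecture.

```python
# F(S) = |union of X_e for e in S|: count the distinct relationships
# covered by a set of selected events. Toy events are illustrative only.

def coverage(S, X):
    """Coverage set function F(S)."""
    covered = set()
    for e in S:
        covered |= X[e]          # X[e]: relationships covered by event e
    return len(covered)

# hypothetical events and the relationships each covers
X = {
    "chaplin_film": {"Anthony Hopkins", "Robert Downey Sr."},
    "wedding_2005": {"Susan Downey"},
    "iron_man":     {"Gwyneth Paltrow", "Paramount Pictures"},
}

print(coverage({"chaplin_film", "iron_man"}, X))  # 4 distinct relationships
```

Note that F counts each relationship once no matter how many selected events cover it, which is exactly what makes the function a coverage function.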

SLIDE 12

- Given a universe of elements U = {u_1, ..., u_n} and sets X_1, ..., X_m, each X_i ⊆ U
- Goal: Find a set of k events X_1, ..., X_k covering most of U
  - More precisely: Find k events whose union has the largest size

[Figure: universe U with overlapping sets X_1, X_2, X_3, X_4]

- U: all relationships; X_i: relationships covered by event i

SLIDE 13

Simple Heuristic: Greedy Algorithm

- Start with S_0 = {}
- For i = 1, ..., k:
  - Take the event e that maximizes F(S_{i-1} ∪ {e})
  - Let S_i = S_{i-1} ∪ {e}
- Example:
  - Evaluate F({e_1}), ..., F({e_m}), pick the best (say e_1)
  - Evaluate F({e_1} ∪ {e_2}), ..., F({e_1} ∪ {e_m}), pick the best (say e_2)
  - Evaluate F({e_1, e_2} ∪ {e_3}), ..., F({e_1, e_2} ∪ {e_m}), pick the best
  - And so on...

where F(S) = |⋃_{e∈S} X_e|
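The greedy loop above can be written as a short sketch; the events e1–e3 and their relationships are assumptions for illustration.

```python
# Greedy heuristic from the slide: in each of k rounds, add the event
# that maximizes F(S ∪ {e}). Toy events e1-e3 are assumptions.

def greedy_max_coverage(X, k):
    S, covered = set(), set()
    for _ in range(k):
        candidates = [e for e in X if e not in S]
        if not candidates:
            break
        # event maximizing F(S ∪ {e}) = |covered ∪ X_e|; ties broken by order
        best = max(candidates, key=lambda e: len(covered | X[e]))
        S.add(best)
        covered |= X[best]
    return S, covered

X = {"e1": {"r1", "r2", "r3"}, "e2": {"r3", "r4"}, "e3": {"r5"}}
S, covered = greedy_max_coverage(X, k=2)
print(sorted(S), len(covered))  # ['e1', 'e2'] 4
```

Each round scans all remaining events, matching the F({e_1} ∪ {e_i}) evaluations listed on the slide.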

SLIDES 14–18

- Goal: Maximize the covered area

SLIDE 19

- Goal: Maximize the size of the covered area with two sets
- Greedy first picks A and then C
- But the optimal way would be to pick B and C

[Figure: three overlapping sets A, B, C]
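The suboptimality on this slide can be reproduced numerically. The concrete elements of A, B, and C below are assumptions chosen to mimic the figure: A is the largest single set, but B and C together cover more.

```python
# Greedy grabs the largest single set A first, but the optimal pair is
# B and C. Concrete elements are assumptions mimicking the figure.

def greedy(sets, k):
    chosen, covered = [], set()
    for _ in range(k):
        name = max((n for n in sets if n not in chosen),
                   key=lambda n: len(covered | sets[n]))
        chosen.append(name)
        covered |= sets[name]
    return chosen, covered

sets = {"A": {2, 3, 4, 5}, "B": {1, 2, 3}, "C": {4, 5, 6}}
chosen, covered = greedy(sets, k=2)
print(chosen[0], len(covered))     # A 5   (greedy covers 5 elements)
print(len(sets["B"] | sets["C"]))  # 6     (the optimal pair covers 6)
```

Greedy is misled because A overlaps both B and C; after taking A, no second set can recover the full universe.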

SLIDE 20

- Bad news: Maximum Coverage is NP-hard
- Good news: Good approximations exist
  - The problem has enough structure that even simple greedy algorithms perform reasonably well
  - Details in 2nd half of lecture
- Now: Generalize our objective for timeline generation

SLIDE 21

- Objective values all relationships equally
- Unrealistic: Some relationships are more important than others
  - Use different weights ("weighted coverage function")

Simple coverage: F(S) = |⋃_{e∈S} X_e| = Σ_{r∈R} 1, where R = ⋃_{e∈S} X_e

Weighted coverage: F(S) = Σ_{r∈R} w(r), with weights w : R → ℝ+
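The weighted variant replaces the count with a sum of weights over distinct covered relationships. A short sketch, with made-up events and weights:

```python
# Weighted coverage F(S) = Σ_{r∈R} w(r), where R is the set of distinct
# relationships covered by S. Events and weights are illustrative only.

def weighted_coverage(S, X, w):
    R = set().union(*(X[e] for e in S)) if S else set()
    return sum(w[r] for r in R)

X = {"wedding":  {"Susan Downey"},
     "premiere": {"Susan Downey", "Gwyneth Paltrow"}}
w = {"Susan Downey": 3.0, "Gwyneth Paltrow": 1.5}

# Susan Downey is counted once even though two events cover her
print(weighted_coverage({"wedding", "premiere"}, X, w))  # 4.5
```

Setting every weight to 1 recovers the simple coverage function from the previous slides.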

SLIDE 22

- Use global importance weights
- How much interest is there?
- Could be measured as:
  - w(X) = # search queries for person X
  - w(X) = # Wikipedia article views for X
  - w(X) = # news article mentions for X

[Figure: relationships Captain America, Anthony Hopkins, Gwyneth Paltrow, Susan Downey weighted by global importance]

SLIDE 23

- Some relationships are not (very) globally important but highly relevant to the timeline, and vice versa
- Need relevance to the timeline instead of global relevance: w(Susan Downey | RDJr) > w(Justin Bieber | RDJr)

[Figure: applying global importance weights to Captain America, Justin Bieber, Tim Althoff, Susan Downey]

SLIDE 24

- Can use co-occurrence statistics: w(X | RDJr) = #(X and RDJr) / (#(RDJr) * #(X))
  - Similar: Pointwise mutual information (PMI)
  - How often do X and Y occur together, compared to what you would expect if they were independent?
  - Accounts for popular entities (e.g., Justin Bieber)
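Plugging made-up counts into the co-occurrence formula shows why it down-weights merely popular entities. All counts below are invented for illustration.

```python
# w(X | anchor) = #(X and anchor) / (#(anchor) * #(X)), the PMI-style
# score from the slide (up to constants). All counts are invented.

def cooccurrence_weight(pair_count, count_x, count_anchor):
    # high when X and the anchor co-occur more than popularity alone predicts
    return pair_count / (count_anchor * count_x)

w_susan  = cooccurrence_weight(pair_count=900,   count_x=1_000,   count_anchor=50_000)
w_bieber = cooccurrence_weight(pair_count=1_200, count_x=900_000, count_anchor=50_000)

# Bieber co-occurs more often in absolute terms, but only because he is
# mentioned everywhere; the normalization ranks Susan Downey higher.
print(w_susan > w_bieber)  # True
```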

SLIDE 25

- How to differentiate between two events that cover the same relationships?
- Example: Robert and Susan Downey
  - Event 1: Wedding, August 27, 2005
  - Event 2: Minor charity event, Nov 11, 2006
- We need to be able to distinguish these!

SLIDE 26

- Further improvement when we not only score relationships but also score the event timestamps
- Again, use co-occurrences for the weights w_T

F(S) = Σ_{r∈R} w_R(r) + Σ_{e∈S} w_T(t_e), where R = ⋃_{e∈S} X_e
(first sum: relationships, as before; second sum: timestamps)

SLIDE 27

[Screenshot: marvel.com]

- "Robert Downey Jr" and "May 4, 2012" occur 173 times on 71 different webpages
- US release date of The Avengers
- Use MapReduce on 10B web pages (10k+ machines)

SLIDE 28

- Generalized the earlier coverage function to a linear combination of weighted coverage functions
- Goal: max_{|S|≤k} F(S)
- Still NP-hard (because it is a generalization of an NP-hard problem)

F(S) = Σ_{r∈R} w_R(r) + Σ_{e∈S} w_T(t_e), where R = ⋃_{e∈S} X_e
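A sketch of the combined objective with toy data; every name, date, and weight below is an assumption for illustration.

```python
# F(S) = Σ_{r∈R} w_R(r) + Σ_{e∈S} w_T(t_e): weighted relationship coverage
# plus a per-event timestamp score. All data below is toy.

def timeline_objective(S, X, t, w_R, w_T):
    R = set().union(*(X[e] for e in S)) if S else set()
    return sum(w_R[r] for r in R) + sum(w_T[t[e]] for e in S)

X   = {"wedding": {"Susan Downey"}, "premiere": {"Gwyneth Paltrow"}}
t   = {"wedding": "2005-08-27", "premiere": "2008-04-30"}
w_R = {"Susan Downey": 3.0, "Gwyneth Paltrow": 1.5}
w_T = {"2005-08-27": 2.0, "2008-04-30": 0.5}

print(timeline_objective({"wedding", "premiere"}, X, t, w_R, w_T))  # 7.0
```

Note the asymmetry: relationship weights are summed over *distinct* covered relationships, while timestamp weights are summed once per selected event.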

SLIDE 29

- How can we actually optimize this function?
- What structure is there that will help us do this efficiently?
- Any questions so far?

SLIDE 30

- For this optimization problem, Greedy produces a solution S s.t. F(S) ≥ (1 - 1/e)·OPT, i.e., F(S) ≥ 0.63·OPT [Nemhauser, Fisher, Wolsey '78]
- Claim holds for functions F(·) which are: submodular, monotone, normal, non-negative (discussed next)

SLIDE 31

Definition:
- A set function F(·) is called submodular if, for all P, Q ⊆ U:
  F(P) + F(Q) ≥ F(P ∪ Q) + F(P ∩ Q)

[Figure: Venn diagrams of P and Q illustrating P ∪ Q and P ∩ Q]

SLIDE 32

- Checking the previous definition is not easy in practice
- Substitute P = A ∪ {d} and Q = B, where A ⊆ B and d ∉ B, into the definition above:

From before: F(P) + F(Q) ≥ F(P ∪ Q) + F(P ∩ Q)
F(A ∪ {d}) + F(B) ≥ F(A ∪ {d} ∪ B) + F((A ∪ {d}) ∩ B)
F(A ∪ {d}) + F(B) ≥ F(B ∪ {d}) + F(A)    (since A ⊆ B and d ∉ B)
F(A ∪ {d}) − F(A) ≥ F(B ∪ {d}) − F(B)

This is the common definition of submodularity.
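For small ground sets, the diminishing-returns inequality derived above can be verified exhaustively. A brute-force sketch, applied to a toy coverage function (the sets are assumptions):

```python
# Brute-force check of F(A ∪ {d}) - F(A) >= F(B ∪ {d}) - F(B) for all
# A ⊆ B and d ∉ B. Only feasible for tiny ground sets; toy data below.

from itertools import chain, combinations

def subsets(s):
    s = list(s)
    return (set(c) for c in chain.from_iterable(
        combinations(s, r) for r in range(len(s) + 1)))

def is_submodular(F, ground):
    for B in subsets(ground):
        for A in subsets(B):                 # every A ⊆ B
            for d in ground - B:             # every d ∉ B
                if F(A | {d}) - F(A) < F(B | {d}) - F(B):
                    return False
    return True

X = {"e1": {1, 2}, "e2": {2, 3}, "e3": {3, 4}}
F = lambda S: len(set().union(*(X[e] for e in S))) if S else 0

print(is_submodular(F, set(X)))  # True: coverage has diminishing returns
```

By contrast, a function with *increasing* returns such as F(S) = |S|² fails the check, which is a handy sanity test.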

SLIDE 33

- Diminishing returns characterization:

F(A ∪ {d}) − F(A) ≥ F(B ∪ {d}) − F(B)

[Figure: adding d to the small set A gives a large improvement; adding d to the large set B gives a small improvement. Gain of adding d to a small set vs. gain of adding d to a large set]

SLIDE 34

For all A ⊆ B: F(A ∪ {d}) − F(A) ≥ F(B ∪ {d}) − F(B)

[Figure: F(·) plotted against solution size |A|, showing F(A), F(A ∪ {d}), F(B), F(B ∪ {d}). Adding d to B helps less than adding it to A!]

slide-35
SLIDE 35

Let F1 … FM be submodular functions and λ1 … λM ≥ 0 and let S denote some solution set, then the non-negative linear combination F(S) (defined below) of these functions is also submodular.

5/28/20 Tim Althoff, UW CS547: Machine Learning for Big Data, http://www.cs.washington.edu/cse547 35

SLIDE 36

- When maximizing a submodular function under a cardinality constraint, Greedy produces a solution S for which F(S) ≥ (1 - 1/e)·OPT, i.e., F(S) ≥ 0.63·OPT [Nemhauser, Fisher, Wolsey '78]
- Claim holds for functions F(·) which are, in addition to being submodular:
  - Monotone: if A ⊆ B then F(A) ≤ F(B)
  - Normal: F({}) = 0
  - Non-negative: for any A, F(A) ≥ 0

SLIDE 37

SLIDE 38

- Suppose we are given a set of events E
  - Each event e covers a set of relationships X_e ⊆ U
- For a set of events S ⊆ E we define: F(S) = |⋃_{e∈S} X_e|
- Goal: We want to max_{|S|≤k} F(S)  (cardinality constraint)
- Note: F(S) is a set function: F(S) : 2^E → ℕ

SLIDE 39

- Claim: F(S) = |⋃_{e∈S} X_e| is submodular.

For all A ⊆ B: F(A ∪ {e}) − F(A) ≥ F(B ∪ {e}) − F(B)

[Figure: gain of adding X_e to the smaller set A vs. gain of adding X_e to the larger set B]

SLIDE 40

- Claim: F(S) = |⋃_{e∈S} X_e| is normal & monotone.
- Normality: When S is empty, ⋃_{e∈S} X_e is empty, so F({}) = 0.
- Monotonicity: Adding a new event to S can never decrease the number of relationships covered by S.
- What about non-negativity?

(Monotone: if A ⊆ B then F(A) ≤ F(B); Normal: F({}) = 0; Non-negative: for any A, F(A) ≥ 0)

SLIDE 41

                                    Submodularity  Monotonicity  Normality
Simple Coverage                           ✓              ✓            ✓
Weighted Coverage (Relationships)
Weighted Coverage (Timestamps)
Complete Optimization Problem

SLIDE 42

- Claim: F(S) = Σ_{r∈R} w(r) is submodular, where R = ⋃_{e∈S} X_e and w : R → ℝ+.
  - Consider two sets A and B s.t. A ⊆ B ⊆ S, and consider an event e ∉ B
  - Three possibilities when we add e to A or B:
  - Case 1: e does not cover any new relationships w.r.t. both A and B:
    F(A ∪ {e}) − F(A) = 0 = F(B ∪ {e}) − F(B)

SLIDE 43

- Claim: F(S) = Σ_{r∈R} w(r) is submodular.
  - Case 2: e covers some new relationships w.r.t. A but not w.r.t. B:
    F(A ∪ {e}) − F(A) = v, where v ≥ 0
    F(B ∪ {e}) − F(B) = 0
    Therefore, F(A ∪ {e}) − F(A) ≥ F(B ∪ {e}) − F(B)

SLIDE 44

- Claim: F(S) = Σ_{r∈R} w(r) is submodular.
  - Case 3: e covers some new relationships w.r.t. both A and B:
    F(A ∪ {e}) − F(A) = v, where v ≥ 0
    F(B ∪ {e}) − F(B) = u, where u ≥ 0
    But v ≥ u, because the relationships that are new w.r.t. B are a subset of those that are new w.r.t. A

SLIDE 45

- Claim: F(S) = Σ_{r∈R} w(r) is monotone and normal, where R = ⋃_{e∈S} X_e and w : R → ℝ+.
- Normality: When S is empty, R = ⋃_{e∈S} X_e is empty, so F({}) = 0.
- Monotonicity: Adding a new event to S can never decrease the set of relationships covered by S, and with non-negative weights F(S) can therefore never decrease.

SLIDE 46

                                    Submodularity  Monotonicity  Normality
Simple Coverage                           ✓              ✓            ✓
Weighted Coverage (Relationships)         ✓              ✓            ✓
Weighted Coverage (Timestamps)
Complete Optimization Problem

SLIDE 47

- Claim: F(S) = Σ_{e∈S} w_T(t_e) is submodular, monotone, and normal
- Analogous arguments to those for weighted coverage (relationships) apply

SLIDE 48

                                    Submodularity  Monotonicity  Normality
Simple Coverage                           ✓              ✓            ✓
Weighted Coverage (Relationships)         ✓              ✓            ✓
Weighted Coverage (Timestamps)            ✓              ✓            ✓
Complete Optimization Problem

SLIDE 49

- Generalized the earlier coverage function to a non-negative linear combination of weighted coverage functions
- Goal: max_{|S|≤k} F(S)
- Claim: F(S) is submodular, monotone, and normal

F(S) = Σ_{r∈R} w_R(r) + Σ_{e∈S} w_T(t_e) = F_1(S) + F_2(S), where R = ⋃_{e∈S} X_e

SLIDE 50

- Submodularity: F(S) is a non-negative linear combination of two submodular functions. Therefore, it is submodular too.
- Normality: F_1({}) = 0 = F_2({}), so F({}) = F_1({}) + F_2({}) = 0.
- Monotonicity: Let A ⊆ B ⊆ S. Then F_1(A) ≤ F_1(B) and F_2(A) ≤ F_2(B), so F_1(A) + F_2(A) ≤ F_1(B) + F_2(B).

SLIDE 51

                                    Submodularity  Monotonicity  Normality
Simple Coverage                           ✓              ✓            ✓
Weighted Coverage (Relationships)         ✓              ✓            ✓
Weighted Coverage (Timestamps)            ✓              ✓            ✓
Complete Optimization Problem             ✓              ✓            ✓

SLIDE 52

SLIDE 53

- Greedy Algorithm is slow!
- At each iteration, we need to evaluate the marginal gains F(S ∪ {x}) − F(S) of all the remaining elements
- Runtime O(|U| · K) for selecting K elements out of the set U

[Figure: Greedy adds the element (among a, b, c, d, e) with the highest marginal gain]

SLIDE 54

- In round i:
  - So far we have S_{i-1} = {e_1, ..., e_{i-1}}
  - Now we pick an element e ∉ S_{i-1} which maximizes the marginal benefit Δ_i(e) = F(S_{i-1} ∪ {e}) − F(S_{i-1})
- Key observation:
  - The marginal gain of any element e can never increase!
  - For every element e: Δ_i(e) ≥ Δ_j(e) for all iterations i < j

SLIDE 55

- Idea:
  - Use Δ_i as an upper bound on Δ_j (j > i)
- Lazy Greedy:
  - Keep an ordered list of marginal benefits Δ_i from the previous iteration
  - Re-evaluate Δ_i only for the top element
  - Re-sort and prune

[Figure: elements a–e ordered by (upper bounds on) marginal gain Δ_1, with A_1 = {a}; justified by F(A ∪ {d}) − F(A) ≥ F(B ∪ {d}) − F(B) for A ⊆ B]

[Leskovec et al., KDD '07]
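The lazy-greedy bookkeeping above is naturally implemented with a max-heap of cached gains. A sketch for the coverage objective; the events are toy assumptions.

```python
# Lazy greedy: cached marginal gains are valid upper bounds because gains
# only shrink (submodularity). Only the heap's top entry is re-evaluated.

import heapq

def lazy_greedy(X, k):
    S, covered = [], set()
    # heap of (-upper_bound, element); initial bounds are the full |X_e|
    heap = [(-len(X[e]), e) for e in X]
    heapq.heapify(heap)
    while heap and len(S) < k:
        neg_bound, e = heapq.heappop(heap)
        gain = len(X[e] - covered)           # re-evaluate only the top element
        if not heap or gain >= -heap[0][0]:  # still beats the next upper bound
            S.append(e)
            covered |= X[e]
        else:
            heapq.heappush(heap, (-gain, e)) # stale bound: reinsert and retry
    return S, covered

X = {"e1": {1, 2, 3}, "e2": {3, 4}, "e3": {5}}
S, covered = lazy_greedy(X, k=2)
print(S, sorted(covered))  # ['e1', 'e2'] [1, 2, 3, 4]
```

If the re-evaluated gain still exceeds the next cached upper bound, the element must be the true maximizer, so most candidates are never re-scored.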


SLIDE 58

- Lazy greedy offers a significant speed-up over traditional greedy implementations in practice.

[Figure: running time (seconds; lower is better) vs. number of elements selected (1–10), comparing exhaustive search (all subsets), naive greedy, and lazy greedy] [Leskovec et al., KDD '07]

SLIDE 59

- Althoff et al., TimeMachine: Timeline Generation for Knowledge-Base Entities, KDD 2015
- Leskovec et al., Cost-effective Outbreak Detection in Networks, KDD 2007
- Andreas Krause and Daniel Golovin, Submodular Function Maximization
- ICML Tutorial: http://submodularity.org/submodularity-icml-part1-slides-prelim.pdf
- Learning and Testing Submodular Functions: http://grigory.us/cis625/lecture3.pdf
- UW research by Jeff Bilmes (ECE)