 
Note to other teachers and users of these slides: We would be delighted if you found our material useful for giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

CS246: Mining Massive Datasets
Jure Leskovec, Stanford University
http://cs246.stanford.edu
3/9/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2
• Date:
  - Thursday, March 19, 12:15-3:15 PM PDT
  - Location:
    - if SUNetID[0] in ['A', .., 'R'] then Cubberley Auditorium
    - if SUNetID[0] in ['S', .., 'Z'] then STLC114
• Alternate Date:
  - Wednesday, March 18, 6:00-9:00 PM PDT
  - Location: Gates 104
  - There is still SOME SPACE LEFT!
• TAs will NOT answer questions during the final
You may come to Stanford to take the exam, or…
• Date:
  - From Wed, Mar 18, 6 PM to Thu, Mar 19, 6 PM (PDT)
  - Agree with your exam monitor on the most convenient 3-hour slot in that window of time
• Exam monitors will receive an email from SCPD with the final exam, which they will in turn forward to you right before the beginning of your 3-hour slot
• Once you have completed the exam, make sure to send the file back to your exam monitor (high-quality scanned copy)
• Exam monitors will NOT answer questions during the final
• Final exam is open book and open notes
• A calculator or computer is REQUIRED
  - You may only use your computer to do arithmetic calculations (i.e., the buttons found on a standard scientific calculator)
  - You may also use your computer to read course notes or the textbook
  - But no Internet/Google/Python access is allowed
• Practice finals are posted on Piazza!
• We recommend bringing a power strip
• Redundancy leads to a bad user experience
  - Uncertainty around information need => don't put all eggs in one basket
• How do we optimize for diversity directly?
[Figure: news front page for Monday, January 14, 2013, with the headlines "France intervenes", "Chuck for Defense", "Argo wins big", and "Hagel expects fight"]
[Figure: news front page for Monday, January 14, 2013, with the headlines "France intervenes", "Chuck for Defense", "Argo wins big", and "New gun proposals"]
• Idea: Encode diversity as a coverage problem
• Example: Word cloud of news for a single day
  - Want to select articles so that most words are "covered"
• Q: What is being covered?
• A: Concepts (in our case: named entities)
  [Figure: named entities France, Mali, Hagel, Pentagon, Obama, Romney, Zero Dark Thirty, Argo, NFL, and the document "Hagel expects fight"]
• Q: Who is doing the covering?
• A: Documents
• Suppose we are given a set of documents $D$
  - Each document $d$ covers a set $X_d$ of words/topics/named entities $W$
• For a set of documents $A \subseteq D$ we define $F(A) = \left| \bigcup_{i \in A} X_i \right|$
• Goal: We want to $\max_{|A| \le k} F(A)$
• Note: $F(A)$ is a set function: $F(A): \text{Sets} \to \mathbb{N}$
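As a small illustration of the coverage function defined on this slide, the following Python sketch computes F(A) for a toy collection. The document names and the entity sets assigned to them are made up for illustration; only the definition of F (size of the union of the covered sets) comes from the slide.

```python
from typing import Dict, Iterable, Set

# Hypothetical toy data: each document d maps to the set X_d of named
# entities it covers. Names and contents are illustrative assumptions.
DOCS: Dict[str, Set[str]] = {
    "doc_hagel": {"Hagel", "Pentagon", "Obama"},
    "doc_mali": {"France", "Mali"},
    "doc_oscars": {"Argo", "Zero Dark Thirty"},
    "doc_defense": {"Hagel", "Pentagon"},
}


def coverage(selected: Iterable[str], docs: Dict[str, Set[str]] = DOCS) -> int:
    """F(A): size of the union of the entity sets of the selected documents."""
    covered: Set[str] = set()
    for d in selected:
        covered |= docs[d]
    return len(covered)


print(coverage(["doc_hagel", "doc_defense"]))  # two overlapping documents
print(coverage(["doc_hagel", "doc_mali"]))     # two disjoint documents
```

In this toy data the overlapping pair covers fewer entities than the disjoint pair, which is the redundancy effect the coverage objective is meant to penalize.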
• Given a universe of elements $W = \{w_1, \dots, w_n\}$ and sets $X_1, \dots, X_m \subseteq W$
  [Figure: sets $X_1$, $X_2$, $X_3$, $X_4$ drawn inside the universe $W$]
• Goal: Find $k$ sets $X_i$ that cover the most of $W$
  - More precisely: Find $k$ sets $X_i$ whose union has the largest size
  - Bad news: A known NP-hard problem (its decision version is NP-complete)
Simple Heuristic: Greedy Algorithm
• Start with $A_0 = \{\}$
• For $i = 1, \dots, k$:
  - Find the set $d$ that maximizes $F(A_{i-1} \cup \{d\})$
  - Let $A_i = A_{i-1} \cup \{d\}$
  where $F(A) = \left| \bigcup_{d \in A} X_d \right|$
• Example:
  - Eval. $F(\{d_1\}), \dots, F(\{d_m\})$, pick best (say $d_1$)
  - Eval. $F(\{d_1\} \cup \{d_2\}), \dots, F(\{d_1\} \cup \{d_m\})$, pick best (say $d_2$)
  - Eval. $F(\{d_1, d_2\} \cup \{d_3\}), \dots, F(\{d_1, d_2\} \cup \{d_m\})$, pick best
  - And so on…
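The greedy loop on this slide can be sketched in Python as follows. This is a minimal sketch, not the course's reference implementation: the function name `greedy_max_coverage`, the dictionary input format, and the early stop when no set adds new coverage are all assumptions made for the example. Maximizing F(A ∪ {d}) is done by maximizing the marginal gain, which picks the same set.

```python
from typing import Dict, Hashable, List, Set


def greedy_max_coverage(sets: Dict[Hashable, Set], k: int) -> List[Hashable]:
    """Pick up to k set ids greedily, each time maximizing F(A ∪ {d})."""
    chosen: List[Hashable] = []
    covered: Set = set()
    for _ in range(k):
        best_id, best_gain = None, 0
        for d, items in sets.items():
            if d in chosen:
                continue
            gain = len(items - covered)  # F(A ∪ {d}) - F(A)
            if gain > best_gain:
                best_id, best_gain = d, gain
        if best_id is None:  # assumption: stop early if nothing adds coverage
            break
        chosen.append(best_id)
        covered |= sets[best_id]
    return chosen


# Hypothetical toy instance.
toy_sets = {"d1": {1, 2, 3}, "d2": {3, 4}, "d3": {4, 5, 6, 7}}
print(greedy_max_coverage(toy_sets, k=2))
```

Each round scans all remaining sets, so the sketch evaluates on the order of k·m marginal gains for m sets.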
• Goal: Maximize the covered area
[Figure: three overlapping sets labeled A, B, and C]
• Goal: Maximize the size of the covered area
• Greedy first picks A and then C
• But the optimal way would be to pick B and C
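The gap between greedy and optimal can be reproduced on a tiny instance. In the sketch below the three sets are hypothetical numbers chosen so that A is the largest single set but overlaps heavily with C, while B and C are disjoint; they are not taken from the slide's figure. The brute-force search is only feasible because the instance is tiny, and `greedy` assumes k does not exceed the number of sets.

```python
from itertools import combinations
from typing import Dict, Set, Tuple

# Hypothetical sets (illustrative assumption, not the slide's figure).
SETS: Dict[str, Set[int]] = {
    "A": {2, 3, 4, 5, 6, 7, 8},
    "B": {1, 2, 3, 4},
    "C": {5, 6, 7, 8, 9, 10},
}


def F(chosen, sets=SETS) -> int:
    """Coverage: number of elements in the union of the chosen sets."""
    return len(set().union(*(sets[name] for name in chosen)))


def greedy(k: int, sets=SETS) -> Tuple[str, ...]:
    """Add, k times, the set that maximizes coverage of the current choice."""
    chosen: Tuple[str, ...] = ()
    for _ in range(k):
        remaining = [name for name in sets if name not in chosen]
        best = max(remaining, key=lambda name: F(chosen + (name,), sets))
        chosen += (best,)
    return chosen


def brute_force(k: int, sets=SETS) -> Tuple[str, ...]:
    """Try every k-subset and keep the one with the largest coverage."""
    return max(combinations(sets, k), key=lambda combo: F(combo, sets))


greedy_pick = greedy(2)
optimal_pick = brute_force(2)
print(greedy_pick, F(greedy_pick))
print(optimal_pick, F(optimal_pick))
```

On this instance greedy picks A and then C, while the exhaustive search finds B and C, which covers one more element.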
• Greedy produces a solution $A$ where: $F(A) \ge (1 - 1/e) \cdot \text{OPT}$ (that is, $F(A) \ge 0.63 \cdot \text{OPT}$) [Nemhauser, Fisher, Wolsey '78]
• Claim holds for functions $F(\cdot)$ with 2 properties:
  - $F$ is monotone (adding more docs doesn't decrease coverage): if $A \subseteq B$ then $F(A) \le F(B)$, and $F(\{\}) = 0$
  - $F$ is submodular: adding an element to a set gives no more improvement than adding it to one of its subsets
Definition:
• Set function $F(\cdot)$ is called submodular if, for all $A, B \subseteq W$:
  $F(A) + F(B) \ge F(A \cup B) + F(A \cap B)$
  [Figure: illustration of the inequality with the sets $A$, $B$, $A \cup B$, and $A \cap B$]
• Diminishing returns characterization
Equivalent definition:
• Set function $F(\cdot)$ is called submodular if, for all $A \subseteq B$:
  $F(A \cup \{d\}) - F(A) \ge F(B \cup \{d\}) - F(B)$
  - Left-hand side: gain of adding $d$ to a small set
  - Right-hand side: gain of adding $d$ to a large set
  [Figure: adding $d$ to the small set $A$ gives a large improvement; adding $d$ to the large set $B$ gives a small improvement]
• $F(\cdot)$ is submodular: for all $A \subseteq B$,
  $F(A \cup \{d\}) - F(A) \ge F(B \cup \{d\}) - F(B)$
  - Left-hand side: gain of adding $d$ to a small set
  - Right-hand side: gain of adding $d$ to a large set
• Natural example:
  - Sets $d_1, \dots, d_m$
  - $F(A) = \left| \bigcup_{i \in A} d_i \right|$ (size of the covered area)
  - Claim: $F(A)$ is submodular!
  [Figure: a new set $d$ added to a small collection $A$ and to a larger collection $B$]
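The claim that coverage is submodular can be sanity-checked by brute force on a small instance. The sketch below enumerates every pair A ⊆ B and every element d outside B and tests the diminishing-returns inequality. The four sets are hypothetical, and passing this check on one toy instance is evidence, not a proof.

```python
from itertools import combinations
from typing import Dict, Iterable, Set

# Hypothetical toy sets (illustrative assumption).
SETS: Dict[str, Set[int]] = {
    "d1": {1, 2, 3},
    "d2": {3, 4},
    "d3": {4, 5, 6},
    "d4": {1, 6, 7},
}


def F(chosen: Iterable[str]) -> int:
    """Coverage: size of the union of the chosen sets."""
    covered: Set[int] = set()
    for name in chosen:
        covered |= SETS[name]
    return len(covered)


def subsets(names):
    """Yield every subset of names as a frozenset."""
    names = list(names)
    for r in range(len(names) + 1):
        for combo in combinations(names, r):
            yield frozenset(combo)


def is_submodular() -> bool:
    """Check F(A ∪ {d}) - F(A) >= F(B ∪ {d}) - F(B) for all A ⊆ B, d ∉ B."""
    for B in subsets(SETS):
        for A in subsets(B):
            for d in SETS.keys() - B:
                gain_small = F(A | {d}) - F(A)
                gain_large = F(B | {d}) - F(B)
                if gain_small < gain_large:
                    return False
    return True


print(is_submodular())
```

The enumeration is exponential in the number of sets, so this kind of check is only practical for very small instances.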
• Submodularity is a discrete analogue of concavity
  [Figure: plot of $F(\cdot)$ against solution size $|A|$, marking $F(A)$, $F(A \cup \{d\})$, $F(B)$, and $F(B \cup \{d\})$]
• For all $A \subseteq B$: $F(A \cup \{d\}) - F(A) \ge F(B \cup \{d\}) - F(B)$
  - Left-hand side: gain of adding $X_d$ to a small set
  - Right-hand side: gain of adding $X_d$ to a large set
• Adding $d$ to $B$ helps less than adding it to $A$!
• Marginal gain: $\Delta_F(d \mid A) = F(A \cup \{d\}) - F(A)$
• Submodular: for $A \subseteq B$, $F(A \cup \{d\}) - F(A) \ge F(B \cup \{d\}) - F(B)$
• Concavity: for $a \le b$, $f(a + d) - f(a) \ge f(b + d) - f(b)$
  [Figure: plot of $F(A)$ against $|A|$]
• Let $F_1, \dots, F_m$ be submodular and $\lambda_1, \dots, \lambda_m > 0$; then $F(A) = \sum_{i=1}^{m} \lambda_i F_i(A)$ is submodular
  - Submodularity is closed under non-negative linear combinations!
• This is an extremely useful fact:
  - Average of submodular functions is submodular: $F(A) = \sum_i P(i) \cdot F_i(A)$
  - Multicriterion optimization: $F(A) = \sum_i \lambda_i F_i(A)$
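The closure property follows directly from the diminishing-returns inequality. A short derivation, writing the combination as $F(A) = \sum_{i=1}^{m} \lambda_i F_i(A)$ with non-negative weights $\lambda_i$ and assuming $A \subseteq B$ and $d \notin B$:

```latex
\begin{align*}
F(A \cup \{d\}) - F(A)
  &= \sum_{i=1}^{m} \lambda_i \bigl( F_i(A \cup \{d\}) - F_i(A) \bigr) \\
  &\ge \sum_{i=1}^{m} \lambda_i \bigl( F_i(B \cup \{d\}) - F_i(B) \bigr)
  && \text{(each $F_i$ is submodular and $\lambda_i \ge 0$)} \\
  &= F(B \cup \{d\}) - F(B).
\end{align*}
```

The non-negativity of the weights is what keeps the direction of the inequality; with a negative weight the corresponding term would flip.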
• Objective: pick $k$ docs that cover the most concepts
  [Figure: named entities France, Mali, Hagel, Pentagon, Obama, Romney, Zero Dark Thirty, Argo, NFL, and the documents "Enthusiasm for Inauguration wanes" and "Inauguration weekend"]
• $F(A)$: the number of concepts covered by $A$
  - Elements … concepts, Sets … concepts in docs
  - $F(A)$ is submodular and monotone!
  - We can use the greedy algorithm to optimize $F$