Note to other teachers and users of these slides: We would be delighted if you found our material useful for giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

CS246: Mining Massive Datasets
Jure Leskovec, Stanford University
http://cs246.stanford.edu

Course overview:
- High dim. data: Locality sensitive hashing; Clustering; Dimensionality reduction
- Graph data: PageRank, SimRank; Community Detection; Spam Detection
- Infinite data: Filtering data streams; Web advertising; Queries on streams
- Machine learning: SVM; Decision Trees; Parallel SGD
- Apps: Recommender systems; Association Rules; Duplicate document detection

3/3/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets

- Classic model of algorithms
  - You get to see the entire input, then compute some function of it
  - In this context, called an "offline algorithm"
- Online algorithms
  - You get to see the input one piece at a time, and need to make irrevocable decisions along the way
  - Similar to the data stream model

- Query-to-advertiser graph
  [Figure: bipartite graph linking queries to advertisers; from Andersen, Lang: Communities from seed sets, 2006]

[Figure: bipartite graph between advertisers and opportunities to show an ad, with nodes 1, 2, 3, 4 and a, b, c, d; the pairs (1,a), (2,b), (3,d) show which advertiser gets picked]

Advertiser X wants to show an ad for topic/query Y.

This is an online problem: We have to make decisions as queries/topics show up. We do not know what topics will show up in the future.

[Figure: bipartite graph with nodes 1, 2, 3, 4 and a, b, c, d, labeled Boys and Girls]

Nodes: Boys and Girls; Links: Preferences
Goal: Match boys to girls so that the most preferences are satisfied

[Figure: the Boys/Girls bipartite graph with the matched edges highlighted]

M = {(1,a), (2,b), (3,d)} is a matching
Cardinality of matching = |M| = 3

[Figure: the Boys/Girls bipartite graph with the matched edges highlighted]

M = {(1,c), (2,b), (3,d), (4,a)} is a perfect matching
- Perfect matching: all vertices of the graph are matched
- Maximum matching: a matching that contains the largest possible number of matches

- Problem: Find a maximum matching for a given bipartite graph
  - A perfect one if it exists
- There is a polynomial-time offline algorithm based on augmenting paths (Hopcroft & Karp 1973, see http://en.wikipedia.org/wiki/Hopcroft-Karp_algorithm)
- But what if we do not know the entire graph upfront?
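The augmenting-path idea can be sketched in a few lines. This is a minimal illustration only: it uses the simpler one-path-at-a-time search rather than the Hopcroft-Karp algorithm cited above, and the edge set is a hypothetical one chosen to be consistent with the matchings named on the earlier slides (the slides' figure edges are not recoverable from the text).

```python
from typing import Dict, Hashable, List, Set


def maximum_matching(adj: Dict[Hashable, List[Hashable]]) -> Dict[Hashable, Hashable]:
    """Return a maximum matching as a dict {right_node: left_node}.

    `adj` maps each left node to the right nodes it is linked to.
    """
    match_of_right: Dict[Hashable, Hashable] = {}

    def try_augment(left: Hashable, visited: Set[Hashable]) -> bool:
        # Depth-first search for an augmenting path starting at `left`.
        for right in adj[left]:
            if right in visited:
                continue
            visited.add(right)
            # Use `right` if it is free, or if its current partner
            # can be re-matched to some other right node.
            if right not in match_of_right or try_augment(match_of_right[right], visited):
                match_of_right[right] = left
                return True
        return False

    for left in adj:
        try_augment(left, set())
    return match_of_right


# Hypothetical edges (assumption): left nodes 1-4, right nodes a-d.
edges = {1: ["a", "c"], 2: ["b"], 3: ["b", "d"], 4: ["a"]}
matching = maximum_matching(edges)
pairs = sorted((left, right) for right, left in matching.items())
print(pairs)  # [(1, 'c'), (2, 'b'), (3, 'd'), (4, 'a')]
```

On this assumed graph the search first matches (1,a), then re-routes node 1 to c so that node 4 can take a, which yields the perfect matching.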

- Initially, we are given the set of boys
- In each round, one girl's choices are revealed
  - That is, the girl's edges are revealed
- At that time, we have to decide to either:
  - Pair the girl with a boy
  - Not pair the girl with any boy
- Example of application: Assigning tasks to servers

[Figure: online matching example on the bipartite graph with nodes 1, 2, 3, 4 and a, b, c, d; pairs chosen: (1,a), (2,b), (3,d)]

- Greedy algorithm for the online graph matching problem:
  - Pair the new girl with any eligible boy
  - If there is none, do not pair the girl
- How good is the algorithm?
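The greedy rule above can be sketched as follows. "Any eligible boy" is interpreted here as the first unmatched boy in the girl's list, which is one possible tie-breaking choice; the arrival order and edge lists are hypothetical, picked so that the result agrees with the pairs (1,a), (2,b), (3,d) shown earlier.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def greedy_online_matching(
    arrivals: Iterable[Tuple[Hashable, List[Hashable]]],
) -> Dict[Hashable, Hashable]:
    """Process girls one at a time; return {girl: boy} for the pairs made.

    Each arrival is (girl, boys she is linked to). Decisions are
    irrevocable: a boy, once matched, is never reassigned.
    """
    matched_boys = set()
    matching: Dict[Hashable, Hashable] = {}
    for girl, boys in arrivals:
        for boy in boys:
            if boy not in matched_boys:  # first eligible boy wins
                matched_boys.add(boy)
                matching[girl] = boy
                break
        # If no boy was eligible, the girl stays unmatched.
    return matching


# Hypothetical arrival sequence (assumption, not taken from the slides).
arrivals = [("a", [1, 4]), ("b", [2, 3]), ("c", [1]), ("d", [3])]
greedy = greedy_online_matching(arrivals)
print(greedy)  # {'a': 1, 'b': 2, 'd': 3}
```

Girl c is left unmatched because her only boy, 1, was already given to a; an offline algorithm that saw the whole graph could have paired a with 4 instead.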

- For input $I$, suppose greedy produces matching $M_{greedy}$ while an optimal matching is $M_{opt}$
- Competitive ratio $= \min_{\text{all possible inputs } I} \left( |M_{greedy}| / |M_{opt}| \right)$
  - That is, greedy's worst performance over all possible inputs $I$

[Figure: bipartite graph with nodes 1, 2, 3, 4 and a, b, c, d showing the edges of $M_{opt}$ and $M_{greedy}$, with the sets G = { } and B = { } left to fill in]

- Consider a case: $M_{greedy} \neq M_{opt}$
- Consider the set $G$ of girls matched in $M_{opt}$ but not in $M_{greedy}$
- (1) By definition of $G$: $|M_{opt}| \le |M_{greedy}| + |G|$
  - Every girl matched in $M_{opt}$ is either also matched in $M_{greedy}$ or belongs to $G$
- (2) Define the set $B$ of boys linked to girls in $G$
  - Notice that the boys in $B$ are already matched in $M_{greedy}$. Why?
  - If there existed a boy not matched by $M_{greedy}$ adjacent to a non-matched girl, then greedy would have matched them
  - So: $|M_{greedy}| \ge |B|$

- Summary so far:
  - Girls $G$ matched in $M_{opt}$ but not in $M_{greedy}$
  - Boys $B$ adjacent to girls in $G$
  - (1) $|M_{opt}| \le |M_{greedy}| + |G|$
  - (2) $|M_{greedy}| \ge |B|$
- Optimal matches all girls in $G$ to (some) boys in $B$
  - (3) $|G| \le |B|$
- Combining (2) and (3):
  - $|G| \le |B| \le |M_{greedy}|$

- So we have:
  - (1) $|M_{opt}| \le |M_{greedy}| + |G|$
  - (4) $|G| \le |B| \le |M_{greedy}|$
- Combining (1) and (4):
  - Worst case is when $|G| = |B| = |M_{greedy}|$
  - $|M_{opt}| \le |M_{greedy}| + |M_{greedy}|$
  - Then $|M_{greedy}| / |M_{opt}| \ge 1/2$
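The bound of 1/2 is reached on small inputs. The sketch below compares greedy (taking the first eligible boy, an assumed tie-breaking rule) against a brute-force optimum on a hypothetical two-girl instance that is not taken from the slides.

```python
from fractions import Fraction
from typing import FrozenSet, Hashable, List, Tuple

Arrival = Tuple[Hashable, List[Hashable]]  # (girl, boys she is linked to)


def greedy_size(arrivals: List[Arrival]) -> int:
    """Size of the matching when each girl takes the first unmatched boy."""
    matched_boys = set()
    size = 0
    for _girl, boys in arrivals:
        for boy in boys:
            if boy not in matched_boys:
                matched_boys.add(boy)
                size += 1
                break
    return size


def optimal_size(arrivals: List[Arrival], used: FrozenSet[Hashable] = frozenset()) -> int:
    """Size of a maximum matching, by exhaustive search (tiny inputs only)."""
    if not arrivals:
        return 0
    (_girl, boys), rest = arrivals[0], arrivals[1:]
    best = optimal_size(rest, used)  # option: leave this girl unmatched
    for boy in boys:
        if boy not in used:
            best = max(best, 1 + optimal_size(rest, used | {boy}))
    return best


# Hypothetical worst-case input: girl a could take boy 2,
# but greedy gives her boy 1, which is girl b's only option.
worst_case = [("a", [1, 2]), ("b", [1])]
ratio = Fraction(greedy_size(worst_case), optimal_size(worst_case))
print(ratio)  # 1/2
```

Here $G = \{b\}$ and $B = \{1\}$, so $|G| = |B| = |M_{greedy}| = 1$, which is exactly the worst case named in the derivation.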

[Figure: bipartite graph with nodes 1, 2, 3, 4 and a, b, c, d; matched pairs: (1,a), (2,b)]

- Banner ads (1995-2001)
  - Initial form of web advertising
  - Popular websites charged $X for every 1,000 "impressions" of the ad
    - Called the "CPM" rate: cost per mille (cost per thousand impressions); mille means thousand in Latin
    - Modeled similar to TV, magazine ads
  - From untargeted to demographically targeted
  - Low click-through rates
    - Low ROI for advertisers

- Performance-based advertising: introduced by Overture around 2000
  - Advertisers bid on search keywords
  - When someone searches for that keyword, the highest bidder's ad is shown
  - Advertiser is charged only if the ad is clicked on
- Similar model adopted by Google, with some changes, around 2002
  - Called AdWords

- Performance-based advertising works!
  - Multi-billion-dollar industry
- Interesting problem: Which ads to show for a given query?
  - (Today's lecture)
- If I am an advertiser, which search terms should I bid on and how much should I bid?
  - (Not the focus of today's lecture)

- A stream of queries arrives at the search engine: $q_1, q_2, \ldots$
- Several advertisers bid on each query
- When query $q_i$ arrives, the search engine must pick a subset of advertisers to show their ads
- Goal: Maximize the search engine's revenues
  - Simple solution: Instead of raw bids, use the "expected revenue per click" (i.e., Bid*CTR, the expected revenue each time the ad is shown)
- Clearly we need an online algorithm!

Advertiser   Bid     CTR (click-through rate)   Bid * CTR (expected revenue)
A            $1.00   1%                         1 cent
B            $0.75   2%                         1.5 cents
C            $0.50   2.5%                       1.25 cents

Instead of sorting advertisers by bid, sort by expected revenue:

Advertiser   Bid     CTR     Bid * CTR
B            $0.75   2%      1.5 cents
C            $0.50   2.5%    1.25 cents
A            $1.00   1%      1 cent
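The re-ranking in the table can be reproduced with a short sketch. The `Ad` record and the function name are illustrative assumptions; only the bids and CTRs come from the table.

```python
from typing import List, NamedTuple


class Ad(NamedTuple):
    advertiser: str
    bid_dollars: float
    ctr: float  # click-through rate, as a probability


def rank_by_expected_revenue(ads: List[Ad]) -> List[Ad]:
    """Sort ads by Bid * CTR, highest expected revenue first."""
    return sorted(ads, key=lambda ad: ad.bid_dollars * ad.ctr, reverse=True)


ads = [Ad("A", 1.00, 0.01), Ad("B", 0.75, 0.02), Ad("C", 0.50, 0.025)]
ranked = [ad.advertiser for ad in rank_by_expected_revenue(ads)]
print(ranked)  # ['B', 'C', 'A']
```

Sorting by raw bid would have put A first even though A has the lowest expected revenue of the three.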

Challenges when sorting by expected revenue:
- CTR of an ad is unknown
- Advertisers have limited budgets and bid on multiple queries

- Two complications:
  - Budget
  - CTR of an ad is unknown
- 1) Budget: Each advertiser has a limited budget
  - Search engine guarantees that the advertiser will not be charged more than their daily budget

- 2) CTR (Click-Through Rate): Each ad-query pair has a different likelihood of being clicked
  - Advertiser 1 bids $2 on query A, click probability = 0.1
  - Advertiser 2 bids $1 on query B, click probability = 0.5
- CTR is predicted or measured historically
  - Averaged over a time period
- Some complications we will not cover:
  - 1) CTR is position dependent:
    - Ad #1 is clicked more than Ad #2
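A historical CTR estimate and the resulting expected revenue can be sketched as below. The click and impression counts are hypothetical, chosen only so that the estimates equal the click probabilities 0.1 and 0.5 from the two bullets above; the function names are illustrative assumptions.

```python
import math


def estimate_ctr(clicks: int, impressions: int) -> float:
    """Historical CTR: clicks divided by impressions over a time window."""
    if impressions == 0:
        raise ValueError("no impressions observed in the time window")
    return clicks / impressions


def expected_revenue(bid_dollars: float, ctr: float) -> float:
    """Expected revenue, in dollars, each time the ad is shown."""
    return bid_dollars * ctr


# Hypothetical counts (assumption): 10 clicks in 100 impressions, 50 in 100.
ctr_1 = estimate_ctr(clicks=10, impressions=100)
ctr_2 = estimate_ctr(clicks=50, impressions=100)
revenue_1 = expected_revenue(2.00, ctr_1)  # Advertiser 1 on query A
revenue_2 = expected_revenue(1.00, ctr_2)  # Advertiser 2 on query B
print(revenue_1, revenue_2)
```

Under these assumed counts, the lower bid of Advertiser 2 has the higher expected revenue per showing (about $0.50 versus about $0.20), although the two bids are on different queries and so do not compete directly.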
