Algorithms for Big‐Data Management
CompSci 590.02 Instructor: Ashwin Machanavajjhala
Lecture 1: 590.02 Spring 13
Administrivia
http://www.cs.duke.edu/courses/spring13/compsci590.2/
– No exams!
– Every class based on 1 (or 2) assigned papers that students must read.
– Individual or groups of size 2-3
– Ideas will be posted in the coming weeks
– Literature review
– Some original research/implementation
– ≤ Feb 12: Choose project (ideas will be posted; new ideas welcome)
– Feb 21: Project proposal (1-4 pages describing the project)
– Mar 21: Mid-project review (2-3 page report on progress)
– Apr 18: Final presentations and submission (6-10 page conference-style paper + 20 minute talk)
– “What Next? A Half-Dozen Data Management Research Goals for Big Data and Cloud”, Surajit Chaudhuri, Microsoft Research
– “Big data: The next frontier for innovation, competition, and productivity”, McKinsey Global Institute Report, 2011
We will read papers in:
– Data Management
– Theory
– Machine Learning
– …
– Read scientific papers
– Formulate a problem
– Perform a scientific evaluation
http://visual.ly/what-big-data
– Recommended links: +250% clicks
– Personalized News Interests: +79% clicks
– Top Searches: +43% clicks
“… creatively and effectively to drive efficiency and quality, the sector could create more than $300 billion in value every year.”
McKinsey Global Institute Report
http://www.ccs.neu.edu/home/amislove/twittermood/
– Reservoir Sampling
– Sampling with indices
– Sampling from Joins
– Markov chain Monte Carlo sampling
– Graph Sampling & PageRank
– Sketches
– Online Aggregation
– Windowed queries
– Online learning
– PRAM
– Map Reduce
– Graph processing architectures: Bulk Synchronous Parallel and asynchronous models
– (Graph connectivity, Matrix Multiplication, Belief Propagation)
– Theta Joins: or how to optimally join two large datasets
– Clustering similar documents using minHash
– Identifying matching users across social networks
– Correlation Clustering
– Markov Logic Networks
– Processing the entire dataset takes too long. (How many tweets mention Obama?)
– The computation is intractable. (Number of satisfying assignments for a DNF formula.)
– We do not have access, or it is expensive to get access, to the entire data. (How many restaurants does Google know about? How many Facebook users have a birthday today? What fraction of the population has the flu?)
Input: A universe of items U (e.g., all tweets); a subset G (e.g., tweets mentioning Obama)
Goal: Estimate μ = |G|/|U|
Algorithm: Draw N items uniformly at random from U; output Y = the fraction of sampled items that lie in G.
Theorem: Let ε < 2. If N > (1/μ) · 4 ln(2/δ) / ε², then Pr[(1−ε)μ < Y < (1+ε)μ] > 1−δ.
Proof: Homework.
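A minimal Python sketch of this Monte Carlo estimator (the function name and toy universe are illustrative, not from the slides):

```python
import random

def monte_carlo_fraction(universe, in_g, n_samples, rng=None):
    """Estimate mu = |G|/|U| by drawing n_samples uniform samples from U
    and returning the fraction Y of samples that fall in G."""
    rng = rng or random.Random()
    hits = sum(1 for _ in range(n_samples) if in_g(rng.choice(universe)))
    return hits / n_samples

# Toy example: the true fraction of multiples of 3 in [0, 3000) is 1/3.
universe = list(range(3000))
est = monte_carlo_fraction(universe, lambda x: x % 3 == 0, 20_000, random.Random(0))
```

Per the theorem, taking N on the order of (1/μ) · ln(2/δ)/ε² samples makes Y a (1±ε)-approximation of μ with probability at least 1−δ.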
Goal: Draw a random sample of n rows from a table so that every subset of n rows is equally likely.
Highlights:
– One pass over the data; works even when the total number of rows is unknown.
– Invariant: after seeing t rows, the reservoir is a uniform random sample of the first t rows.
Algorithm R:
– Insert the first n rows of the table into the reservoir.
– For each subsequent row R (say the (t+1)st row):
  – Pick a random number m between 1 and t+1
  – If m ≤ n, then replace the mth row in the reservoir with R
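A sketch of Algorithm R in Python (0-indexed, so m is drawn from {0, …, t} rather than 1 to t+1):

```python
import random

def reservoir_sample(stream, n, rng=None):
    """Algorithm R: one-pass uniform sample of n rows from a stream
    whose length need not be known in advance."""
    rng = rng or random.Random()
    reservoir = []
    for t, row in enumerate(stream):   # t rows were seen before this one
        if t < n:
            reservoir.append(row)      # the first n rows fill the reservoir
        else:
            m = rng.randint(0, t)      # uniform over t+1 values
            if m < n:
                reservoir[m] = row     # replace the m-th row with the new row
    return reservoir

sample = reservoir_sample(range(1000), 10, random.Random(0))
```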
Claim: The reservoir is a uniform random sample of all the rows in the table.
That is, after t rows have been seen, each row has an n/t chance of appearing in the sample.
Proof (by induction on t):
– The (t+1)st row is included in the sample with probability n/(t+1).
– Any other row: P[row is in reservoir] = P[row is in reservoir after t steps] × P[row is not replaced] = (n/t) × (1 − 1/(t+1)) = n/(t+1).
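The n/(t+1) claim is easy to check empirically. This small simulation (instance sizes chosen arbitrarily) runs Algorithm R many times and measures each row's inclusion frequency, which should be close to n/N for every row:

```python
import random
from collections import Counter

rng = random.Random(42)
N, n, trials = 20, 5, 20_000     # small instance: every row should appear with prob n/N = 0.25
counts = Counter()
for _ in range(trials):
    reservoir = []
    for t, row in enumerate(range(N)):
        if t < n:
            reservoir.append(row)
        else:
            m = rng.randint(0, t)
            if m < n:
                reservoir[m] = row
    counts.update(reservoir)

freqs = [counts[row] / trials for row in range(N)]   # each entry should be ~0.25
```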
Expected number of insertions into the reservoir:
n + Σ_{t=n}^{N−1} n/(t+1) = n(1 + H_N − H_n) ≈ n(1 + ln(N/n))
Algorithm R.
– Involved O(S) time and O(S) calls to the random number generator.
Skipping ahead of Algorithm R: let S(n,t) be the number of rows skipped before the next insertion.

P[S(n,t) = s] = P[rows t+1, …, t+s are not inserted into the reservoir, but row t+s+1 is inserted]
              = (1 − n/(t+1)) × (1 − n/(t+2)) × … × (1 − n/(t+s)) × n/(t+s+1)

P[S(n,t) ≤ s] = 1 − (t/(t+s+1)) × ((t−1)/(t+s)) × ((t−2)/(t+s−1)) × … × ((t−n+1)/(t+s−n+2))
Algorithm X
– Pick a random number U uniformly between 0 and 1
– Find the minimum s such that P[S(n,t) ≤ s] ≥ 1 − U (i.e., invert the CDF)
– Skip s rows and insert row t+s+1 into the reservoir
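A Python sketch of this skip-based scheme; the CDF is inverted by growing the tail probability P[S > s] = ∏_{j=1..s+1} (1 − n/(t+j)) until it drops to U (the function name and loop structure are mine, not from the paper):

```python
import random

def algorithm_x(stream, n, rng=None):
    """Skip-based reservoir sampling: draw one uniform U per insertion,
    then invert the CDF of the skip length S(n, t) to decide how many
    rows to pass over before the next replacement."""
    rng = rng or random.Random()
    it = iter(stream)
    reservoir = [next(it) for _ in range(n)]   # assumes the stream has >= n rows
    t = n                                      # rows seen so far
    while True:
        u = rng.random()
        # Find the minimum s with P[S > s] <= u.
        s, tail = 0, 1.0 - n / (t + 1)         # tail = P[S > 0]
        while tail > u:
            s += 1
            tail *= 1.0 - n / (t + s + 1)
        try:
            for _ in range(s):                 # skip rows t+1, ..., t+s
                next(it)
            row = next(it)                     # row t+s+1 gets inserted
        except StopIteration:
            return reservoir                   # stream exhausted
        reservoir[rng.randrange(n)] = row      # replace a uniformly chosen slot
        t += s + 1

sample = algorithm_x(range(1000), 10, random.Random(1))
```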
– Each skip takes O(s) time to compute, so the total time is the sum of all the skips = O(N).
– Number of calls to the random number generator = 2 × the expected number of insertions into the reservoir (one call for the skip, one for the replaced slot) = O(n(1 + ln(N/n))): optimal!
– See the paper for an algorithm whose running time is also optimal.
– Sampling is useful when the data is too large, the computation is intractable, or access to the data is limited.
– Reservoir sampling draws a uniform random sample in a single pass, without knowledge of the size of the data.
– Can also do weighted sampling [Efraimidis, Spirakis IPL 2006]
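For the weighted case, the Efraimidis-Spirakis scheme keys each item with u^(1/w) and keeps the n largest keys; a sketch under that scheme (variable names are mine):

```python
import heapq
import random

def weighted_reservoir(stream, n, rng=None):
    """Efraimidis-Spirakis A-Res: one-pass weighted sampling without
    replacement. Item i with weight w_i gets key u_i ** (1 / w_i) with
    u_i uniform in (0, 1); the n items with the largest keys are kept."""
    rng = rng or random.Random()
    heap = []                                     # min-heap of (key, item)
    for item, weight in stream:
        key = rng.random() ** (1.0 / weight)
        if len(heap) < n:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))  # evict the smallest key
    return [item for _, item in heap]

pairs = [(i, float(i + 1)) for i in range(100)]   # item i has weight i+1
wsample = weighted_reservoir(pairs, 10, random.Random(0))
```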
– J. S. Vitter, “Random Sampling with a Reservoir”, ACM Transactions on Mathematical Software, 1985
– P. Efraimidis and P. Spirakis, “Weighted Random Sampling with a Reservoir”, Information Processing Letters, 97(5), 2006
– R. Karp, M. Luby, and N. Madras, “Monte-Carlo Approximation Algorithms for Enumeration Problems”, Journal of Algorithms, 1989