Big-Data Algorithms: Overview

Reference: http://www.sketchingbigdata.org/fall17/lec/lec1.pdf
What's the problem here?

- So far, linear (i.e., linear-cost) algorithms have been the "gold standard".
- What if linear algorithms aren't good enough?

Example: Search the web for pages of interest.
Topics of Interest

- Sketching: Compression of a data set that allows queries.
  - Compression $C(x)$ of some data set $x$ that allows us to query $f(x)$.
  - May want to compute $f(x, y)$ from $C(x)$ and $C(y)$.
  - May want composable compression: if $x = x_1 x_2 \dots x_n$, would like to compute $C(x_1 x_2 \dots x_n x_{n+1}) = C(x \circ x_{n+1})$ using just $C(x)$ and $x_{n+1}$.
- Streaming: May not be able to store a huge data set. Need to process a stream of data, coming in one chunk at a time, on the fly. Must answer queries with sublinear memory.
- Dimensionality reduction: For example, spam filtering with the bag-of-words model. Let $d$ be a dictionary of words and represent an email by the vector $v$, where $v_i$ is the number of times $d_i$ appears in the message; then $\dim v = |d|$.
- Large-scale matrix computation, such as least-squares regression: Suppose we want to learn $f : \mathbb{R}^n \to \mathbb{R}$, where $f = \langle b, \cdot \rangle$ for some $b \in \mathbb{R}^n$, with $\langle u, v \rangle = \sum_{j=1}^{n} u_j v_j$ for all $u, v \in \mathbb{R}^n$. Collect data $\{ (x_i \in \mathbb{R}^n, y_i \in \mathbb{R}) : 1 \le i \le m \}$. Want to compute $b$ minimizing
  $\|Xb - y\|_2 = \Big( \sum_{i=1}^{m} (y_i - \langle b, x_i \rangle)^2 \Big)^{1/2}$,
  where $X \in \mathbb{R}^{m \times n}$ has rows $x_1^T, \dots, x_m^T$ and $\| \cdot \|_2 = \sqrt{\langle \cdot, \cdot \rangle}$ is the $\ell_2$-norm (a small numerical sketch follows this list). Also, principal component analysis, given by the singular value decomposition of a matrix: which features are most important?
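To make the least-squares setup concrete, here is a minimal Python sketch (not from the slides; it assumes NumPy, and the data are synthetic, made up purely for illustration):

    import numpy as np

    # Synthetic data: m = 100 samples of dimension n = 5.
    rng = np.random.default_rng(0)
    m, n = 100, 5
    X = rng.normal(size=(m, n))                  # rows are the sample vectors x_i^T
    b_true = rng.normal(size=n)                  # the unknown linear functional b
    y = X @ b_true + 0.01 * rng.normal(size=m)   # noisy observations y_i

    # Least squares: find b_hat minimizing ||X b - y||_2.
    b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(np.allclose(b_hat, b_true, atol=0.1))  # recovers b up to the noise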
Approximate Counting

Problem: Monitor a sequence of events, allowing an approximate count of the number of events so far at any time.

Create a data structure maintaining a single integer $n$ (initialized to zero) and supporting the operations

- init(): set $n \leftarrow 0$.
- update(): increment $n$.
- query(): print (an estimate of) $n$.

Why approximation? If we want the exact value, then we can store $n$ via a counter, a sequence of $\lceil \log n \rceil$ bits ("log" means "$\log_2$").

Can't do better: If we use $f(n)$ bits to store $n$, then there are $2^{f(n)}$ configurations. To store the exact value of every integer up to $n$, we must have $2^{f(n)} \ge n \implies f(n) \ge \log n \implies f(n) \ge \lceil \log n \rceil$, since $f(n) \in \mathbb{Z}$.
If we want a sublinear-space algorithm, we need an estimate $\tilde n$ of $n$. Want to know that for some $\varepsilon, \delta \in (0, 1)$, we have $\mathbb{P}(|\tilde n - n| > \varepsilon n) < \delta$. Equivalently: $\mathbb{P}(|\tilde n - n| \le \varepsilon n) \ge 1 - \delta$.
Morris' algorithm: Uses an integer counter $X$, with data-structure operations

- init(): sets $X \leftarrow 0$.
- update(): increments $X$ with probability $2^{-X}$.
- query(): outputs $\tilde n = 2^X - 1$.

Intuitively, $X$ attempts to store a value approximately $\log n$. How good is this? Not so great; we'll see that $\mathbb{P}(|\tilde n - n| > \varepsilon n) < \frac{1}{2 \varepsilon^2}$. Since $\varepsilon < 1$, the RHS exceeds $\frac{1}{2}$, which means that the estimator might always be zero!
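A minimal Python sketch of Morris' counter, following the three operations above (the class name and the use of the random module are my own):

    import random

    class Morris:
        """Morris' approximate counter: stores only X, roughly log2(n) in size."""

        def __init__(self):
            self.X = 0                            # init(): X <- 0

        def update(self):
            if random.random() < 2.0 ** -self.X:  # increment with probability 2^{-X}
                self.X += 1

        def query(self):
            return 2 ** self.X - 1                # estimate n~ = 2^X - 1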
Improvement Morris+: Create $s$ independent copies of Morris, and average their outputs. Calling these estimators $\tilde n_1, \dots, \tilde n_s$, the output is

$\tilde n = \frac{1}{s} \sum_{i=1}^{s} \tilde n_i.$

Then $\mathbb{P}(|\tilde n - n| > \varepsilon n) < \frac{1}{2 s \varepsilon^2}$, so $\mathbb{P}(|\tilde n - n| > \varepsilon n) < \delta$ for $s \ge \frac{1}{2 \varepsilon^2 \delta} = \Theta(1/\delta)$ for fixed $\varepsilon$. Better!
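A sketch of Morris+ as described, reusing the hypothetical Morris class above:

    class MorrisPlus:
        """Morris+: average of s independent Morris counters."""

        def __init__(self, s):
            self.counters = [Morris() for _ in range(s)]

        def update(self):
            for c in self.counters:
                c.update()

        def query(self):
            # Average the s independent estimates n~_1, ..., n~_s.
            return sum(c.query() for c in self.counters) / len(self.counters)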
Improvement Morris++: Reduces the dependence on the failure probability $\delta$ from $\Theta(1/\delta)$ to $\Theta(\log(1/\delta))$. Run $t$ instances of Morris+, each with failure probability $\frac{1}{3}$, so $s = \Theta(1/\varepsilon^2)$ for each instance. Now output the median estimate of these $t$ Morris+ instances. Calling this output $\tilde n$, it turns out that $\mathbb{P}(|\tilde n - n| > \varepsilon n) < \delta$ for $t = \Theta(\log(1/\delta))$.
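And a sketch of Morris++, again building on the hypothetical classes above:

    import statistics

    class MorrisPlusPlus:
        """Morris++: median of t Morris+ instances, each with s = Theta(1/eps^2)."""

        def __init__(self, s, t):
            self.instances = [MorrisPlus(s) for _ in range(t)]

        def update(self):
            for inst in self.instances:
                inst.update()

        def query(self):
            # Median of the t Morris+ estimates.
            return statistics.median(inst.query() for inst in self.instances)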
Probability Review

Let $X$ be a random variable taking values in $S \subseteq \mathbb{R}$. The expected value of $X$ is $\mathbb{E} X = \sum_{j \in S} j \cdot \mathbb{P}(X = j)$. The variance of $X$ is $\mathrm{Var}[X] = \mathbb{E}\big[ (X - \mathbb{E} X)^2 \big]$.

Linearity of expectation: Let $X$ and $Y$ be random variables. Then $\mathbb{E}(aX + bY) = a\, \mathbb{E} X + b\, \mathbb{E} Y$ for all $a, b \in \mathbb{R}$.

Markov's inequality: If $X$ is a nonnegative random variable, then $\mathbb{P}(X > \lambda) < \frac{\mathbb{E} X}{\lambda}$ for all $\lambda > 0$.

Chebyshev's inequality: Let $X$ be any random variable. Then $\mathbb{P}(|X - \mathbb{E} X| > \lambda) < \frac{\mathbb{E}(X - \mathbb{E} X)^2}{\lambda^2} = \frac{\mathrm{Var}[X]}{\lambda^2}$ for all $\lambda > 0$. More generally, if $p \ge 1$, then $\mathbb{P}(|X - \mathbb{E} X| > \lambda) < \frac{\mathbb{E} |X - \mathbb{E} X|^p}{\lambda^p}$ for all $\lambda > 0$.

Chernoff's inequality: Suppose $X_1, X_2, \dots, X_n$ are independent random variables with $X_i \in [0, 1]$. Let $X = \sum_{i=1}^{n} X_i$. Then $\mathbb{P}(|X - \mathbb{E} X| > \varepsilon\, \mathbb{E} X) \le 2 e^{-\varepsilon^2 \mathbb{E} X / 3}$ for all $\varepsilon \in (0, 1)$.
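A quick Monte Carlo sanity check of Chernoff's inequality (my own simulation, not part of the slides): sum $n$ independent Bernoulli(1/2) variables and compare the empirical deviation probability to the bound.

    import math
    import random

    n, eps, trials = 1000, 0.1, 10000
    EX = n * 0.5                          # expectation of the sum
    deviations = 0
    for _ in range(trials):
        X = sum(random.random() < 0.5 for _ in range(n))
        if abs(X - EX) > eps * EX:
            deviations += 1
    print("empirical:", deviations / trials)                  # tiny in practice
    print("Chernoff bound:", 2 * math.exp(-eps**2 * EX / 3))  # about 0.38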
Analysis of Morris' algorithm

Let $X_n$ denote $X$ after $n$ updates. Claim: $\mathbb{E}\, 2^{X_n} = n + 1$ for $n \in \mathbb{N}_0$.

Proof of claim: By induction, the base case $n = 0$ being $\mathbb{E}\, 2^{X_0} = \mathbb{E}\, 1 = 1 = n + 1$.

Induction step: Suppose that $\mathbb{E}\, 2^{X_n} = n + 1$ for some $n \in \mathbb{N}_0$. Then

$\mathbb{E}\, 2^{X_{n+1}} = \sum_{j=0}^{\infty} \mathbb{P}(X_n = j) \cdot \mathbb{E}\big( 2^{X_{n+1}} \mid X_n = j \big)$
$\qquad = \sum_{j=0}^{\infty} \mathbb{P}(X_n = j) \cdot \Big( \big( 1 - \tfrac{1}{2^j} \big) 2^j + \tfrac{1}{2^j} \cdot 2^{j+1} \Big)$
$\qquad = \sum_{j=0}^{\infty} \mathbb{P}(X_n = j)\, 2^j + \sum_{j=0}^{\infty} \mathbb{P}(X_n = j)$
$\qquad = \mathbb{E}\, 2^{X_n} + 1 = (n + 1) + 1,$

as required.
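The claim is easy to check empirically with the Morris sketch above (a simulation I'm adding, not part of the proof):

    import statistics

    n, trials = 200, 20000
    samples = []
    for _ in range(trials):
        m = Morris()
        for _ in range(n):
            m.update()
        samples.append(2 ** m.X)       # one draw of 2^{X_n}
    print(statistics.mean(samples))    # should be close to n + 1 = 201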
So $\tilde n = 2^X - 1$ is an unbiased estimator of $n$. Need to find its variance. Using Chebyshev: $\mathbb{P}(|\tilde n - n| > \varepsilon n) < \frac{1}{\varepsilon^2 n^2} \cdot \mathbb{E}(\tilde n - n)^2 = \frac{1}{\varepsilon^2 n^2} \cdot \mathbb{E}(2^X - 1 - n)^2$.
Claim: $\mathbb{E}\, 2^{2 X_n} = \frac{3}{2} n^2 + \frac{3}{2} n + 1$ for $n \in \mathbb{N}_0$.

Proof: By induction, the base case $n = 0$ being $\mathbb{E}\, 2^{2 X_0} = \mathbb{E}\, 2^0 = 1 = \frac{3}{2} \cdot 0^2 + \frac{3}{2} \cdot 0 + 1$.

For the inductive step, suppose that $\mathbb{E}\, 2^{2 X_n} = \frac{3}{2} n^2 + \frac{3}{2} n + 1$ for some $n \in \mathbb{N}_0$. Then, summing over the possible values $j$ of $2^{X_n}$,

$\mathbb{E}\, 2^{2 X_{n+1}} = \sum_{j} \mathbb{P}(2^{X_n} = j) \cdot \mathbb{E}\big( 2^{2 X_{n+1}} \mid 2^{X_n} = j \big)$
$\qquad = \sum_{j} \mathbb{P}(2^{X_n} = j) \cdot \Big( \tfrac{1}{j} \cdot 4 j^2 + \big( 1 - \tfrac{1}{j} \big) \cdot j^2 \Big)$
$\qquad = \sum_{j} \mathbb{P}(2^{X_n} = j) \cdot (j^2 + 3j)$
$\qquad = \mathbb{E}\, 2^{2 X_n} + 3\, \mathbb{E}\, 2^{X_n}$
$\qquad = \frac{3}{2} n^2 + \frac{3}{2} n + 1 + 3(n + 1)$
$\qquad = \frac{3}{2} (n + 1)^2 + \frac{3}{2} (n + 1) + 1,$

as required.
Since $\mathrm{Var}[Z] = \mathbb{E}[Z^2] - (\mathbb{E}[Z])^2$ for any random variable $Z$, we get $\mathbb{E}(\tilde n - n)^2 = \mathrm{Var}[2^X] = \frac{3}{2} n^2 + \frac{3}{2} n + 1 - (n + 1)^2 = \frac{n^2 - n}{2} < \frac{n^2}{2}$, and hence $\mathbb{P}(|\tilde n - n| > \varepsilon n) < \frac{1}{\varepsilon^2 n^2} \cdot \frac{n^2}{2} = \frac{1}{2 \varepsilon^2}$, as claimed for (the original version of) Morris.
Morris+: As on an earlier slide.

Morris++: Run $t$ instances of Morris+, each with failure probability $\frac{1}{3}$, so $s = \Theta(1/\varepsilon^2)$ for each instance. Now output the median estimate of these $t$ Morris+ instances.

Expected number of unsuccessful Morris+ instantiations: at most $\frac{1}{3} t$.
Expected number of successful Morris+ instantiations: at least $\frac{2}{3} t$.

If the median is a bad estimate, then at most half of the Morris+ instantiations can have succeeded. Hence the number of succeeding instantiations deviated from its expectation by at least $\frac{2}{3} t - \frac{1}{2} t = \frac{1}{6} t$.
For $i \in \{1, \dots, t\}$, define the random variable

$Y_i = \begin{cases} 1 & \text{if the $i$th Morris+ instantiation succeeds,} \\ 0 & \text{if the $i$th Morris+ instantiation fails.} \end{cases}$

So

$\mathbb{P}\Big( \sum_{i=1}^{t} Y_i \le \frac{t}{2} \Big) \le \mathbb{P}\Big( \Big| \sum_{i=1}^{t} Y_i - \mathbb{E} \sum_{i=1}^{t} Y_i \Big| \ge \frac{t}{6} \Big) \le 2 e^{-t/3},$

the last by Chernoff's inequality. Now

$2 e^{-t/3} < \delta \iff t > 3 \ln \frac{2}{\delta} = \Theta\Big( \log \frac{1}{\delta} \Big).$

So $\mathbb{P}\Big( \sum_{i=1}^{t} Y_i \le \frac{t}{2} \Big) < \delta$ for $t = \Theta\Big( \log \frac{1}{\delta} \Big)$.
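Putting it all together, a hedged usage sketch that chooses $s$ and $t$ from $\varepsilon$ and $\delta$ using the constants above (the classes are the earlier hypothetical sketches):

    import math

    eps, delta = 0.25, 0.05
    s = math.ceil(3 / (2 * eps ** 2))        # makes 1/(2 s eps^2) <= 1/3
    t = math.ceil(3 * math.log(2 / delta))   # makes 2 e^{-t/3} <= delta

    counter = MorrisPlusPlus(s, t)
    for _ in range(10000):
        counter.update()
    print(counter.query())                   # estimate of n = 10000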