Dimension Independent Matrix Square using MapReduce - PowerPoint PPT Presentation



SLIDE 1

Dimension Independent Matrix Square - Reza Zadeh
[Sidebar: Introduction (The Problem, Why Bother, MapReduce) · First Pass (Naive, Analysis) · DIMSUM (Algorithm, Shuffle Size, Correctness, Singular values, Similarities) · Experiments (Large, Small) · More Results]

Dimension Independent Matrix Square using MapReduce

Reza Bosagh Zadeh STOC 2013

SLIDE 2

Outline

1. Introduction: The Problem, Why Bother, MapReduce
2. First Pass: Naive, Analysis
3. DIMSUM: Algorithm, Shuffle Size, Correctness, Singular values, Similarities
4. Experiments: Large, Small
5. More Results

SLIDE 3

Computing AᵀA

Given an m × n matrix A with entries in [0, 1] and m ≫ n, compute AᵀA.

$$A = \begin{pmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,n} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m,1} & a_{m,2} & \cdots & a_{m,n} \end{pmatrix}$$

A is tall and skinny; example values m = 10^12, n = 10^6. A has sparse rows: each row has at most L nonzeros. A is stored across thousands of machines and cannot be streamed through a single machine.
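The row-wise view of AᵀA that the later MapReduce passes exploit can be checked on a toy instance. This is a minimal numpy sketch; the sizes are illustrative and nothing here is distributed:

```python
import numpy as np

# Toy instance: a tall-and-skinny {0, 1} matrix (tiny; the talk's setting
# is m ~ 10^12 spread over thousands of machines).
rng = np.random.default_rng(0)
m, n = 1000, 5
A = (rng.random((m, n)) < 0.3).astype(float)  # entries in [0, 1]

# A^T A can be accumulated one row at a time as a sum of outer products:
# A^T A = sum_i r_i r_i^T, which is exactly what a row-wise pass exploits.
G = np.zeros((n, n))
for r in A:
    G += np.outer(r, r)

assert np.allclose(G, A.T @ A)
```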

SLIDE 4

Guarantees

Preserve the singular values of AᵀA with ε relative error, paying shuffle size O(n²/ε²) and reduce-key complexity O(n/ε²) - i.e., independent of m.

To preserve specific entries of AᵀA, we can reduce the shuffle size to O(n log(n)/s) and reduce-key complexity to O(log(n)/s), where s is the minimum similarity of the entries being estimated.

Similarity can be via Cosine, Dice, Overlap, or Jaccard.

SLIDE 5

Computing All Pairs of Cosine Similarities

We have to find dot products between all pairs of columns of A. We prove results for general matrices, but can do better for those entries with cos(i, j) ≥ s.

Cosine similarity, a widely used definition of "similarity" between two vectors:

$$\cos(i, j) = \frac{c_i^{\mathsf T} c_j}{\|c_i\|\,\|c_j\|}$$

where c_i is the i-th column of A.
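As a single-machine baseline (a small sketch, not part of the talk's algorithm), all-pairs column cosines follow directly from this formula:

```python
import numpy as np

def all_pairs_cosine(A):
    """Cosine similarity between every pair of columns of A:
    cos(i, j) = c_i^T c_j / (||c_i|| ||c_j||)."""
    norms = np.linalg.norm(A, axis=0)      # ||c_i|| for each column
    return (A.T @ A) / np.outer(norms, norms)

A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
S = all_pairs_cosine(A)
assert np.allclose(np.diag(S), 1.0)        # each column fully similar to itself
```

This is exactly the computation that is intractable on one machine at m = 10^12, which motivates the MapReduce formulations that follow.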

SLIDE 6

Ubiquitous problem

SLIDE 7

MapReduce

With such large datasets (e.g. m = 10^12), we must use many machines. The biggest compute clusters use MapReduce, the tool of choice in such distributed systems. With so many machines (around 1000), CPU power is abundant, but communication is expensive.

Two-minute description of MapReduce...

SLIDE 8

MapReduce

SLIDE 9

MapReduce

SLIDE 10

MapReduce

Input gets dished out to the mappers roughly equally. Two performance measures:

1) Shuffle size: shuffling the data output by the mappers to the correct reducer is expensive.
2) Largest reduce-key: we can't send too much of the data to a single reducer.

First pass at implementing cos(i, j) in MapReduce...
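The two measures can be made concrete with a minimal single-process model of a MapReduce job. This is an illustrative sketch, not a real framework; `map_reduce` and the word-count example are my own:

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Minimal single-process model of a MapReduce job, instrumented with
    the two cost measures from the slide: total shuffle size and the
    largest number of values sent to any single reduce key."""
    shuffled = defaultdict(list)
    for rec in records:
        for key, value in mapper(rec):    # mappers emit (key, value) pairs
            shuffled[key].append(value)   # "shuffle": group values by key
    shuffle_size = sum(len(v) for v in shuffled.values())
    largest_reduce_key = max((len(v) for v in shuffled.values()), default=0)
    output = {key: reducer(key, values) for key, values in shuffled.items()}
    return output, shuffle_size, largest_reduce_key

# Word count, the classic example:
docs = ["a b a", "b c"]
out, sh, big = map_reduce(
    docs,
    mapper=lambda doc: [(w, 1) for w in doc.split()],
    reducer=lambda key, values: sum(values),
)
assert out == {"a": 2, "b": 2, "c": 1}
assert sh == 5 and big == 2
```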

SLIDE 11

Naive Implementation

1. Given row r_i, map with NaiveMapper (Algorithm 1)
2. Reduce using NaiveReducer (Algorithm 2)

Algorithm 1 NaiveMapper(r_i)
  for all pairs (a_ij, a_ik) in r_i do
    Emit ((c_j, c_k) → a_ij a_ik)
  end for

Algorithm 2 NaiveReducer((c_i, c_j), v_1, . . . , v_R)
  Output c_i^T c_j → Σ_{i=1}^{R} v_i
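A sketch of Algorithms 1 and 2 under this emit model, assuming each row arrives as a dense list of entries (names and the tiny example matrix are illustrative):

```python
from collections import defaultdict
from itertools import combinations

def naive_mapper(row):
    """Algorithm 1: for each pair of nonzeros (a_ij, a_ik) in a row,
    emit ((j, k) -> a_ij * a_ik)."""
    nz = [(j, v) for j, v in enumerate(row) if v != 0]
    for (j, aij), (k, aik) in combinations(nz, 2):
        yield (j, k), aij * aik

def naive_reducer(values):
    """Algorithm 2: c_j^T c_k is the sum of all values for key (j, k)."""
    return sum(values)

rows = [[1.0, 1.0, 0.0],
        [1.0, 0.0, 1.0],
        [0.0, 1.0, 1.0]]
shuffled = defaultdict(list)
for r in rows:
    for key, v in naive_mapper(r):
        shuffled[key].append(v)
dots = {key: naive_reducer(vals) for key, vals in shuffled.items()}
assert dots == {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0}
```

Every within-row pair is shuffled, which is where the O(mL²) cost analyzed next comes from.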

SLIDE 12

Analysis for First Pass

Very easy analysis:

1) Shuffle size: O(mL²)
2) Largest reduce-key: O(m)

Both depend on m, the larger dimension, and are intractable for m = 10^12, L = 100. We'll bring both down via clever sampling.

SLIDE 13

DIMSUM Algorithm

Algorithm 3 DIMSUMMapper(r_i)
  for all pairs (a_ij, a_ik) in r_i do
    With probability min(1, γ / (‖c_j‖ ‖c_k‖)), emit ((c_j, c_k) → a_ij a_ik)
  end for

Algorithm 4 DIMSUMReducer((c_i, c_j), v_1, . . . , v_R)
  if γ / (‖c_i‖ ‖c_j‖) > 1 then
    Output b_ij → (1 / (‖c_i‖ ‖c_j‖)) Σ_{i=1}^{R} v_i
  else
    Output b_ij → (1/γ) Σ_{i=1}^{R} v_i
  end if
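A single-machine sketch of Algorithms 3 and 4. In this run γ is deliberately chosen large enough that no pair is down-sampled, so the output matches the exact cosines; all names and the example matrix are illustrative:

```python
import math
import random
from collections import defaultdict

def dimsum_mapper(row, col_norms, gamma, rng):
    """Algorithm 3 (DIMSUMMapper): keep each within-row product
    a_ij * a_ik with probability min(1, gamma / (||c_j|| ||c_k||))."""
    nz = [(j, v) for j, v in enumerate(row) if v != 0]
    for a in range(len(nz)):
        for b in range(a + 1, len(nz)):
            (j, aij), (k, aik) = nz[a], nz[b]
            if rng.random() < min(1.0, gamma / (col_norms[j] * col_norms[k])):
                yield (j, k), aij * aik

def dimsum_reducer(key, values, col_norms, gamma):
    """Algorithm 4 (DIMSUMReducer): rescale so b_jk estimates cos(j, k)."""
    j, k = key
    denom = col_norms[j] * col_norms[k]
    if gamma / denom > 1:              # this pair was never down-sampled
        return sum(values) / denom
    return sum(values) / gamma         # undo the sampling probability

rows = [[1, 1, 0], [1, 0, 1], [0, 1, 1]]
col_norms = [math.sqrt(sum(r[j] ** 2 for r in rows)) for j in range(3)]
rng = random.Random(0)
gamma = 100.0                          # large gamma => no sampling, exact answer
shuffled = defaultdict(list)
for r in rows:
    for key, v in dimsum_mapper(r, col_norms, gamma, rng):
        shuffled[key].append(v)
B = {k: dimsum_reducer(k, vals, col_norms, gamma) for k, vals in shuffled.items()}
assert all(abs(b - 0.5) < 1e-9 for b in B.values())  # every cosine here is 0.5
```

With a small γ the mapper would drop most products for high-norm column pairs, which is exactly what makes the shuffle size independent of m.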

SLIDE 14

Analysis for DIMSUM

Four things to prove:

1. Shuffle size: O(nLγ)
2. Largest reduce-key: O(γ)
3. The sampling scheme preserves similarities when γ = Ω(log(n)/s)
4. The sampling scheme preserves singular values when γ = Ω(n/ε²)

SLIDE 15

Analysis for DIMSUM

Some notation:

1. #(c_i, c_j) is the number of times columns i and j have a nonzero in the same dimension
2. #(c_i) is the number of nonzeros in the vector c_i
3. The theorem will be about {0, 1} matrices, but can be generalized

SLIDE 16

Shuffle size for DIMSUM

Theorem. For {0, 1} matrices, the expected shuffle size for DIMSUMMapper is O(nLγ).

Proof. The expected contribution from each pair of columns constitutes the shuffle size:

$$\sum_{i=1}^{n} \sum_{j=i+1}^{n} \sum_{k=1}^{\#(c_i, c_j)} \Pr[\text{DIMSUMSampleEmit}(c_i, c_j)] = \sum_{i=1}^{n} \sum_{j=i+1}^{n} \#(c_i, c_j)\,\Pr[\text{DIMSUMSampleEmit}(c_i, c_j)]$$

SLIDE 17

Shuffle size for DIMSUM

Proof (continued).

$$\le \sum_{i=1}^{n} \sum_{j=i+1}^{n} \gamma\,\frac{\#(c_i, c_j)}{\sqrt{\#(c_i)}\,\sqrt{\#(c_j)}}$$
slide-18
SLIDE 18

Dimension Independent Matrix Square Reza Zadeh Introduction

The Problem Why Bother MapReduce

First Pass

Naive Analysis

DIMSUM

Algorithm Shuffle Size Correctness Singular values Similarities

Experiments

Large Small

More Results

Shuffle size for DIMSUM

Proof (continued).

$$\le \sum_{i=1}^{n} \sum_{j=i+1}^{n} \gamma\,\frac{\#(c_i, c_j)}{\sqrt{\#(c_i)}\,\sqrt{\#(c_j)}} \le \gamma \sum_{i=1}^{n} \sum_{j=i+1}^{n} \#(c_i, c_j)\left(\frac{1}{\#(c_i)} + \frac{1}{\#(c_j)}\right) \quad \text{(by AM-GM)}$$

SLIDE 19

Shuffle size for DIMSUM

Proof (continued).

$$\le \sum_{i=1}^{n} \sum_{j=i+1}^{n} \gamma\,\frac{\#(c_i, c_j)}{\sqrt{\#(c_i)}\,\sqrt{\#(c_j)}} \le \gamma \sum_{i=1}^{n} \sum_{j=i+1}^{n} \#(c_i, c_j)\left(\frac{1}{\#(c_i)} + \frac{1}{\#(c_j)}\right) \quad \text{(by AM-GM)}$$

$$\le \gamma \sum_{i=1}^{n} \frac{1}{\#(c_i)} \sum_{j=1}^{n} \#(c_i, c_j)$$

SLIDE 20

Shuffle size for DIMSUM

Proof (continued).

$$\le \sum_{i=1}^{n} \sum_{j=i+1}^{n} \gamma\,\frac{\#(c_i, c_j)}{\sqrt{\#(c_i)}\,\sqrt{\#(c_j)}} \le \gamma \sum_{i=1}^{n} \sum_{j=i+1}^{n} \#(c_i, c_j)\left(\frac{1}{\#(c_i)} + \frac{1}{\#(c_j)}\right) \quad \text{(by AM-GM)}$$

$$\le \gamma \sum_{i=1}^{n} \frac{1}{\#(c_i)} \sum_{j=1}^{n} \#(c_i, c_j) \le \gamma \sum_{i=1}^{n} \frac{1}{\#(c_i)}\, L\,\#(c_i) = \gamma L n$$

SLIDE 21

Shuffle size for DIMSUM

It is easy to see via Chernoff bounds that the above shuffle size is achieved with high probability. O(nLγ) has no dependence on the dimension m; this is the heart of DIMSUM. It happens because higher-magnitude columns are sampled with lower probability: min(1, γ / (‖c₁‖ ‖c₂‖)).

SLIDE 22

Shuffle size for DIMSUM

For matrices with real entries, we can still get a bound. Let H be the smallest nonzero entry in magnitude, after all entries of A have been scaled to be in [0, 1]. E.g., for {0, 1} matrices, H = 1. The shuffle size is then bounded by O(nLγ/H²).

SLIDE 23

Largest reduce key for DIMSUM

Each reduce key receives at most γ values (γ is the oversampling parameter), so we immediately get that reduce-key complexity is O(γ). This is also independent of the dimension m, again because high-magnitude columns are sampled with lower probability.

SLIDE 24

Correctness

Since higher-magnitude columns are sampled with lower probability, are we still guaranteed to obtain correct results w.h.p.? Yes, but only by setting γ correctly:

Preserve similarities when γ = Ω(log(n)/s)
Preserve singular values when γ = Ω(n/ε²)

SLIDE 25

Correctness

Theorem. Let A be an m × n tall and skinny (m > n) matrix. If γ = Ω(n/ε²) and D is the diagonal matrix with entries d_ii = ‖c_i‖, then the matrix B output by DIMSUM satisfies

$$\frac{\|DBD - A^{\mathsf T}A\|_2}{\|A^{\mathsf T}A\|_2} \le \epsilon$$

with probability at least 1/2. Relative error is guaranteed to be low with high probability.
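The theorem's statement can be sanity-checked with a seeded single-machine simulation of the entrywise sampling. The sizes, γ, and the error tolerance below are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Seeded simulation of DIMSUM's entrywise sampling (not distributed).
rng = np.random.default_rng(42)
m, n = 20000, 8
A = (rng.random((m, n)) < 0.2).astype(float)    # tall, skinny, {0, 1}

norms = np.linalg.norm(A, axis=0)
gamma = 400.0                                   # oversampling parameter

B = np.zeros((n, n))
for j in range(n):
    for k in range(j, n):
        p = min(1.0, gamma / (norms[j] * norms[k]))
        contrib = A[:, j] * A[:, k]             # row-by-row products a_ij * a_ik
        kept = contrib * (rng.random(m) < p)    # each kept with probability p
        s = kept.sum()
        # DIMSUMReducer's rescaling: exact pairs vs. sampled pairs
        est = s / (norms[j] * norms[k]) if p == 1.0 else s / gamma
        B[j, k] = B[k, j] = est

D = np.diag(norms)
G = A.T @ A
rel_err = np.linalg.norm(D @ B @ D - G, 2) / np.linalg.norm(G, 2)
assert rel_err < 0.3   # DBD stays close to A^T A in spectral norm
```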

SLIDE 26

Proof

Uses Latala's theorem, bounding the 2nd and 4th central moments of the entries of B. We really need the extra power of the 4th moments.

SLIDE 27

Latala's Theorem

Theorem (Latala). Let X be a random matrix whose entries x_ij are independent centered random variables with finite fourth moment. Denoting by ‖X‖₂ the matrix spectral norm, we have

$$\mathbb{E}\,\|X\|_2 \le C\left[\max_i \Big(\sum_j \mathbb{E}\,x_{ij}^2\Big)^{1/2} + \max_j \Big(\sum_i \mathbb{E}\,x_{ij}^2\Big)^{1/2} + \Big(\sum_{i,j} \mathbb{E}\,x_{ij}^4\Big)^{1/4}\right].$$

SLIDE 28

Proof

Prove two things:

$$\mathbb{E}[(b_{ij} - \mathbb{E}\,b_{ij})^2] \le \frac{1}{\gamma} \quad \text{(easy)}$$

$$\mathbb{E}[(b_{ij} - \mathbb{E}\,b_{ij})^4] \le \frac{2}{\gamma^2} \quad \text{(not easy)}$$

Details in the paper.

SLIDE 29

Correctness

Theorem. For any two columns c_i and c_j having cos(c_i, c_j) ≥ s, let B be the output of DIMSUM with entries b_ij = (1/γ) Σ_{k=1}^{m} X_ijk, with X_ijk the indicator for the k-th coin in the call to DIMSUMMapper. Now if γ = Ω(α/s), then we have

$$\Pr\!\big[\|c_i\|\|c_j\|\,b_{ij} > (1+\delta)[A^{\mathsf T}A]_{ij}\big] \le \left(\frac{e^{\delta}}{(1+\delta)^{(1+\delta)}}\right)^{\alpha}$$

and

$$\Pr\!\big[\|c_i\|\|c_j\|\,b_{ij} < (1-\delta)[A^{\mathsf T}A]_{ij}\big] < \exp(-\alpha\delta^2/2).$$

Relative error is guaranteed to be low with high probability.

SLIDE 30

Correctness Proof. In the paper at http://reza-zadeh.com Uses standard concentration inequality for sums of indicator random variables. Ends up requiring that the oversampling parameter γ be set to γ = log(n2)/s = 2 log(n)/s.

SLIDE 31

Experiments

Large-scale experiment: live at twitter.com. Smaller-scale experiment with points as words and dimensions as tweets: m = 200M, n = 1000, L = 10.

SLIDE 32

Experiments

[Figure: DISCO Cosine Similarity - average relative error vs. similarity threshold s]

Figure: Average error for all pairs with similarity threshold s. DIMSUM-estimated Cosine error decreases for more similar pairs.

SLIDE 33

Experiments

[Figure: DISCO Cosine shuffle size vs. accuracy tradeoff - DISCO Shuffle / Naive Shuffle and average relative error, plotted against log(p/ε)]

Figure: As γ = p/ε increases, shuffle size increases and error decreases. There is no thresholding for highly similar pairs here.
SLIDE 34

Other Similarity Measures

This all works for many other similarity measures.

Similarity | Definition                        | Shuffle Size      | Reduce-key size
Cosine     | #(x,y) / (√#(x) √#(y))            | O(nL log(n)/s)    | O(log(n)/s)
Jaccard    | #(x,y) / (#(x) + #(y) − #(x,y))   | O((n/s) log(n/s)) | O(log(n/s)/s)
Overlap    | #(x,y) / min(#(x), #(y))          | O(nL log(n)/s)    | O(log(n)/s)
Dice       | 2#(x,y) / (#(x) + #(y))           | O(nL log(n)/s)    | O(log(n)/s)

Table: All sizes are independent of m, the dimension. These are bounds for shuffle size without combining; combining can only bring these sizes down.
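With columns represented as sets of nonzero row indices (so #(x) = |x| and #(x, y) = |x ∩ y|), the table's definitions read directly as code. A small sketch; the function names are mine:

```python
def cosine(x, y):
    """cos = #(x,y) / (sqrt(#(x)) sqrt(#(y))) for {0,1} columns."""
    return len(x & y) / (len(x) ** 0.5 * len(y) ** 0.5)

def jaccard(x, y):
    """#(x,y) / (#(x) + #(y) - #(x,y))."""
    return len(x & y) / (len(x) + len(y) - len(x & y))

def overlap(x, y):
    """#(x,y) / min(#(x), #(y))."""
    return len(x & y) / min(len(x), len(y))

def dice(x, y):
    """2 #(x,y) / (#(x) + #(y))."""
    return 2 * len(x & y) / (len(x) + len(y))

x, y = {0, 1, 2, 3}, {2, 3, 4}          # |x|=4, |y|=3, |x & y|=2
assert abs(cosine(x, y) - 2 / (2 * 3 ** 0.5)) < 1e-12
assert abs(jaccard(x, y) - 2 / 5) < 1e-12
assert abs(overlap(x, y) - 2 / 3) < 1e-12
assert abs(dice(x, y) - 4 / 7) < 1e-12
```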

SLIDE 35

Locality Sensitive Hashing

MinHash, from the Locality-Sensitive Hashing family, can have its vanilla implementation greatly improved by DIMSUM. Theorems for shuffle size and correctness are in the paper.

SLIDE 36

Fin.

Consider DIMSUM if you ever need to compute AᵀA for a large sparse A. Many more experiments and results at reza-zadeh.com. Thanks!