Deterministic Distributed and Streaming Algorithms for Linear Algebra Problems
Charlie Dickens
Joint work with Graham Cormode and David P. Woodruff
University of Warwick, Department of Computer Science
WPCCS, 30th June 2017
Motivation
◮ Large data can be abstracted as a matrix, A ∈ R^{n×d}
◮ Matrices = Linear Algebra!
◮ But there are also some problems...
◮ Storage - the data may be too large to store
◮ Time complexity - ‘efficient’ polynomial time algorithms might be too slow
◮ Instead of solving exactly, can we find ‘efficient’ algorithms which are more suitable for large scale data analysis, perhaps allowing for some approximate solution?
◮ Randomised methods have been proposed, but are they necessary?
Computation Models
Streaming Model
◮ See data one item at a time
◮ Cannot store all of the data
◮ Want to optimise storage - sublinear in n
◮ Need to keep a running ‘summary’ of the data
◮ Use the summary to compute an approximation to the original problem.
Distributed Model
◮ Coordinator sends small blocks of input to worker nodes
◮ Worker nodes report back a summary of the data to the coordinator
◮ Coordinator computes an approximation to the original problem using the summaries sent back (a minimal sketch of this pattern follows)
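To make the distributed pattern concrete, here is a minimal Python sketch of the coordinator/worker round trip. It illustrates the model only, not the algorithm from this talk: as a stand-in ‘summary’ each worker returns the Gram matrix of its block, which the coordinator combines exactly by summing. The helper names (worker_summary, coordinator) and the toy data are ours.

```python
import numpy as np

def worker_summary(block):
    """Each worker condenses its block of rows into a small d x d summary.
    Here the stand-in summary is the Gram matrix block^T block."""
    return block.T @ block

def coordinator(blocks):
    """The coordinator hands each block to a worker, collects the small
    summaries, and combines them. For the Gram-matrix summary the
    combination is a plain sum, which equals A^T A on the full data."""
    summaries = [worker_summary(b) for b in blocks]  # stand-in for remote calls
    return sum(summaries)

# Toy usage: 4 'workers', each holding a 250 x 3 block of A.
rng = np.random.default_rng(0)
blocks = [rng.standard_normal((250, 3)) for _ in range(4)]
gram = coordinator(blocks)
assert np.allclose(gram, np.vstack(blocks).T @ np.vstack(blocks))
```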
Summary of Results
Previous results are specific to p = 2. Our results are the first deterministic algorithms which generalise to arbitrary p-norms (where applicable).
Problem | Solution type | Time | Space
High leverage scores | 1/poly(d) additive | O(nd^2 + nd^5 log n) | poly(d)
ℓp-regression (p ≠ ∞) | poly(d) relative | poly(nd) | O(1/γ) n^γ d
ℓ∞-regression | ε‖b‖_p additive | poly(nd^5) | d^{O(p)}/ε^{O(1)}
ℓ1 low-rank (k) approximation | poly(k) relative | poly(nd) | O(1/γ) n^γ poly(d)
Main Algorithmic Techniques: well-conditioned basis
Much of the work relies on the notion of a well-conditioned basis (wcb). A matrix U is a wcb for the column space of A if:
◮ ‖U‖_p ≤ α
◮ for all z, ‖z‖_q ≤ β‖Uz‖_p, where q is the dual norm to p
◮ α and β are at most poly(d).
Mahoney et al. show that a change-of-basis matrix R can be computed in polynomial time such that AR is well-conditioned.
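As a concrete special case, for p = 2 a QR decomposition already gives such a change of basis: U = AR^{-1} has orthonormal columns, so the conditions hold with (α, β) = (√d, 1). The sketch below covers only this p = 2 case; the general-p construction referenced above is more involved (it typically goes through a Löwner-John ellipsoid) and is not reproduced here. The function name change_of_basis_p2 is ours.

```python
import numpy as np

def change_of_basis_p2(A):
    """p = 2 case only: QR gives U = A @ R_inv with orthonormal columns,
    so the wcb conditions hold with alpha = sqrt(d) (Frobenius norm of U)
    and beta = 1 (||z||_2 = ||U z||_2 for all z)."""
    Q, R = np.linalg.qr(A)       # A = Q R, Q has orthonormal columns
    R_inv = np.linalg.inv(R)     # change-of-basis matrix (A assumed full column rank)
    return Q, R_inv

# Toy usage.
A = np.random.default_rng(1).standard_normal((1000, 5))
U, R_inv = change_of_basis_p2(A)
assert np.allclose(A @ R_inv, U)
```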
Main Algorithmic Techniques: high leverage rows
Let U = AR for change-of-basis matrix R. Then the full ℓp-leverage scores are w_i = ‖(AR)_i‖_p^p. Local leverage scores are defined similarly on a block of A.
Problem: Can the rows of high leverage be found without reading the whole matrix?
Theory: Rows with high global leverage scores have high local leverage scores, up to poly(d) factors: ŵ_i ≥ w_i/poly(d).
Idea: Find local leverage scores and keep a superset of the heavy rows. Our algorithm computes a superset of the high leverage rows of the data matrix A by reading the matrix one block at a time and keeping every row whose local score exceeds a threshold (a sketch follows).
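A minimal sketch of that block-wise filter, again restricted to p = 2, where the local leverage scores are just the squared row norms of an orthonormal basis of the block. The helper names (local_leverage_scores_p2, high_leverage_rows), the toy data, and the particular threshold value are ours; choosing the threshold is exactly the space/accuracy tradeoff explored in the experiments later.

```python
import numpy as np

def local_leverage_scores_p2(block):
    """Local p = 2 leverage scores of a block: squared row norms of an
    orthonormal basis for the block's own column space."""
    Q, _ = np.linalg.qr(block)
    return np.sum(Q ** 2, axis=1)

def high_leverage_rows(blocks, threshold):
    """Stream over blocks of A and keep (global index, row) pairs whose
    local leverage score exceeds `threshold` -- by the theory above this
    is a superset of the globally high-leverage rows, up to poly(d) slack."""
    kept, offset = [], 0
    for block in blocks:
        scores = local_leverage_scores_p2(block)
        for i in np.flatnonzero(scores > threshold):
            kept.append((offset + i, block[i]))
        offset += block.shape[0]
    return kept

# Toy usage: 10 blocks of 100 x 5.
rng = np.random.default_rng(0)
blocks = [rng.standard_normal((100, 5)) for _ in range(10)]
heavy = high_leverage_rows(blocks, threshold=0.1)
```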
High Leverage Rows: toy example
Figure: Full Data
Figure: Blocks of Data
Application: ℓ∞-regression
Problem: Let A ∈ R^{n×d}, b ∈ R^n. Find an ε‖b‖_p additive error approximation to the ℓ∞-regression problem

min_{x ∈ R^d} ‖Ax − b‖_∞ = min_{x ∈ R^d} max_i |⟨A_i, x⟩ − b_i|

Idea: Store all rows of A of high leverage and set all others to zero; call this A′. Keep b on these indices, otherwise set to zero; call this b′. Solve the regression on A′ and b′ via linear programming (see the sketch below).
High leverage rows are kept in their entirety, so their ℓ∞ cost is the same as in the full problem. Low leverage rows are shown not to contribute much to the ℓ∞ cost.
Lower bound: a relative error approximation cannot be obtained in sublinear space, via a reduction from Indexing in communication complexity.
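For completeness, here is how the ℓ∞-regression on the retained rows can be written as a linear program: minimise t subject to −t ≤ ⟨A′_i, x⟩ − b′_i ≤ t for every row i. The sketch below uses scipy.optimize.linprog; the function name linf_regression and the toy data are ours, and the talk does not prescribe a particular LP solver.

```python
import numpy as np
from scipy.optimize import linprog

def linf_regression(A, b):
    """Solve min_x ||Ax - b||_inf as a linear program:
    minimise t subject to -t <= A_i x - b_i <= t for every row i.
    Decision variables are (x_1, ..., x_d, t)."""
    n, d = A.shape
    c = np.zeros(d + 1)
    c[-1] = 1.0                                   # objective: minimise t
    ones = np.ones((n, 1))
    A_ub = np.vstack([np.hstack([A, -ones]),      #  A x - t <=  b
                      np.hstack([-A, -ones])])    # -A x - t <= -b
    b_ub = np.concatenate([b, -b])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (d + 1))
    return res.x[:d], res.x[-1]                   # minimiser and its l_inf cost

# Toy usage on stand-in retained rows A', b'.
rng = np.random.default_rng(1)
A_prime = rng.standard_normal((50, 3))
b_prime = rng.standard_normal(50)
x_hat, cost = linf_regression(A_prime, b_prime)
```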
Empirical Evaluation
1 million row sample of US Census Data, blocks of size 1000. Want to keep the time and space costs as small as possible.
Figure: Comparing space and time costs against thresholds. One panel shows storage (number of rows kept) vs threshold on a log/log scale; the other shows total time in seconds (high leverage rows plus regression) vs threshold. Methods compared: Sparse Cauchy, Well-conditioned Basis, Orthonormal Basis, Brute Force.
Empirical Evaluation
1 million row sample of US Census Data, blocks of size 1000. Want accuracy to be high whilst keeping storage small.
Figure: Accuracy vs space tradeoff - relative error against storage (number of rows kept). Methods compared: Sparse Cauchy, Well-conditioned Basis, Orthonormal Basis.