
Deterministic Distributed and Streaming Algorithms for Linear Algebra Problems

Charlie Dickens

Joint work with Graham Cormode and David P. Woodruff

University of Warwick, Department of Computer Science

WPCCS, 30th June 2017

Motivation

◮ Large data can be abstracted as a matrix, A ∈ R^{n×d}
◮ Matrices = Linear Algebra!
◮ But there are also some problems...
◮ Storage: the data may be too large to store
◮ Time complexity: 'efficient' polynomial-time algorithms might be too slow
◮ Instead of solving exactly, can we find 'efficient' algorithms which are more suitable for large-scale data analysis, perhaps allowing for some approximate solution?
◮ Randomised methods have been proposed, but are they necessary?

Computation Models

Streaming Model

◮ See data one item at a time
◮ Cannot store all of the data
◮ Want to optimise storage: sublinear in n
◮ Need to keep a running 'summary' of the data
◮ Use the summary to compute an approximation to the original problem (sketched below)

Distributed Model

◮ Coordinator sends small blocks of input to worker nodes
◮ Worker nodes report back a summary of the data to the coordinator
◮ Coordinator computes an approximation to the original problem by using the summaries sent back
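To make the streaming pattern concrete, here is a minimal Python sketch (illustrative only, not one of the summaries from this work): a bounded O(d^2)-space summary is folded up one row at a time, and the answer is computed from the summary alone. The `StreamingSummary` class and its `update`/`query` interface are hypothetical; for p = 2 least-squares the Gram matrix A^T A happens to be an exact summary.

```python
import numpy as np

class StreamingSummary:
    """Hypothetical bounded-space summary of a row stream (illustrative only).

    The 'summary' here is the Gram matrix A^T A and the vector A^T b, which
    use O(d^2) space independent of the number of rows n, yet suffice to
    answer a least-squares (p = 2) query exactly.
    """
    def __init__(self, d):
        self.AtA = np.zeros((d, d))
        self.Atb = np.zeros(d)

    def update(self, row, target):
        # One item at a time: fold the new row into the running summary.
        self.AtA += np.outer(row, row)
        self.Atb += target * row

    def query(self):
        # Answer the original problem from the summary alone.
        return np.linalg.solve(self.AtA, self.Atb)

# Usage: stream n = 10,000 rows without ever storing them all.
rng = np.random.default_rng(0)
d = 5
summary = StreamingSummary(d)
for _ in range(10_000):
    a_i = rng.normal(size=d)
    b_i = a_i @ np.ones(d) + 0.01 * rng.normal()
    summary.update(a_i, b_i)
x_hat = summary.query()   # close to the all-ones vector
```

Summaries of this kind also cover the distributed model when they merge: each worker summarises its own block, and the coordinator simply adds the workers' summaries together.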

Summary of Results

Previous results are specific to p = 2. Our results are the first deterministic algorithms which generalise to arbitrary p-norms (where applicable).

Problem                       | Solution type      | Time                 | Space
High leverage scores          | 1/poly(d) additive | O(nd^2 + nd^5 log n) | poly(d)
ℓp-regression (p ≠ ∞)         | poly(d) relative   | poly(nd)             | O(1/γ) n^γ d
ℓ∞-regression                 | ε‖b‖_p additive    | poly(nd^5)           | d^{O(p)} / ε^{O(1)}
ℓ1 low-rank (k) approximation | poly(k) relative   | poly(nd)             | O(1/γ) n^γ poly(d)

Main Algorithmic Techniques: well-conditioned basis

Much of the work relies on the notion of a well-conditioned basis (wcb). A matrix U is a wcb for the column space of A if:

◮ ‖U‖_p ≤ α
◮ for all z, ‖z‖_q ≤ β ‖Uz‖_p, where q is the dual norm to p
◮ α and β are at most poly(d).

Mahoney et al. show that a change-of-basis matrix R can be computed in polynomial time such that AR is well-conditioned.
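For the special case p = 2 the change of basis is elementary: a QR factorisation A = QR gives the orthonormal U = AR^{-1} = Q, which is well-conditioned with α = √d and β = 1. A minimal numpy sketch of this one case follows; the general-p construction of Mahoney et al. goes through a rounding of the unit ball and is not shown here.

```python
import numpy as np

def change_of_basis_p2(A):
    """Return R' such that U = A @ R' is a well-conditioned basis for p = 2.

    With A = QR and U = Q orthonormal: the entrywise 2-norm of U is sqrt(d)
    (so alpha = sqrt(d)) and ||z||_2 = ||Uz||_2 for every z (so beta = 1).
    """
    Q, R = np.linalg.qr(A)     # reduced QR: Q is n x d with orthonormal columns
    return np.linalg.inv(R)    # A @ inv(R) = Q

# Usage
rng = np.random.default_rng(1)
A = rng.normal(size=(1000, 4))
U = A @ change_of_basis_p2(A)
assert np.isclose(np.linalg.norm(U), np.sqrt(4))   # alpha = sqrt(d) = 2
```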

Main Algorithmic Techniques: high leverage rows

Let U = AR for a change-of-basis matrix R. Then the full ℓp-leverage scores are w_i = ‖(AR)_i‖_p^p. A similar definition gives the local leverage scores, computed over a single block of rows rather than all of A.

Problem: Can the rows of high leverage be found without reading the whole matrix?

Theory: Rows with a high global leverage score also have a high local leverage score, up to poly(d) factors: ŵ_i ≥ w_i / poly(d).

Idea: Find local leverage scores and keep a superset of the heavy rows. Our algorithm computes a superset of the high leverage rows of a data matrix A by reading the matrix one block at a time and keeping the rows whose local score exceeds a threshold (a sketch follows below).

High Leverage Rows: toy example

[Figure: Full Data (a small two-column toy matrix in which a few rows have entries of much larger magnitude than the rest).]

[Figure: Blocks of Data (the same matrix partitioned into blocks of five rows, as the algorithm would read it).]

Application: ℓ∞-regression

Problem: Let A ∈ R^{n×d}, b ∈ R^n. Find an ε‖b‖_p additive error approximation to the ℓ∞-regression problem:

min_{x ∈ R^d} ‖Ax − b‖_∞ = min_{x ∈ R^d} max_i |⟨A_i, x⟩ − b_i|

Idea: Store all rows of A with high leverage and set all others to zero; call this A′. Keep b on these indices and set it to zero elsewhere; call this b′. Solve the regression on A′ and b′ via linear programming (sketched below).

High leverage rows are kept in their entirety, so their ℓ∞ cost is the same as in the full problem; low leverage rows are shown not to contribute much to the ℓ∞ cost.

Lower bound: a relative error approximation cannot be obtained in sublinear space, via a reduction from Indexing in communication complexity.
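The linear-programming step is the standard one: minimising ‖A′x − b′‖_∞ is the LP "min t subject to −t ≤ A′_i x − b′_i ≤ t for all i". A minimal scipy sketch (the stacked [x, t] variable layout is just one convenient encoding):

```python
import numpy as np
from scipy.optimize import linprog

def linf_regression(A, b):
    """Solve min_x ||Ax - b||_inf as an LP: min t s.t. -t <= A_i x - b_i <= t.

    Decision variables are z = [x_1, ..., x_d, t]; both one-sided constraint
    families are expressed as A_ub @ z <= b_ub.
    """
    n, d = A.shape
    c = np.zeros(d + 1)
    c[-1] = 1.0                                    # objective: minimise t
    ones = np.ones((n, 1))
    A_ub = np.vstack([np.hstack([A, -ones]),       #  A x - t <=  b
                      np.hstack([-A, -ones])])     # -A x - t <= -b
    b_ub = np.concatenate([b, -b])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1)) # x free; t >= 0 is implied
    return res.x[:d], res.x[-1]                    # minimiser, optimal cost

# Usage
rng = np.random.default_rng(3)
A = rng.normal(size=(200, 3))
b = A @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.uniform(-1, 1, size=200)
x_hat, cost = linf_regression(A, b)
print(x_hat, cost)    # cost is roughly the 0.1 noise level
```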

Empirical Evaluation

1 million row sample of US Census Data, blocks of size 1000. Want to keep the time and space costs as small as possible.

[Figure: Comparing space and time costs against thresholds. Left: Storage (number of rows kept) vs Threshold, log/log scale. Right: Total Time in seconds (high leverage rows plus regression) vs Threshold. Methods compared: Sparse Cauchy, Well-conditioned Basis, Orthonormal Basis, Brute Force.]
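A sketch of how such a storage-vs-threshold sweep might be scripted; heavy-tailed synthetic data stands in for the census sample, and only the per-block orthonormal-basis (QR) scoring variant is shown.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_t(df=2, size=(100_000, 10))    # heavy-tailed synthetic rows

# Local l2 leverage scores, computed block by block (blocks of 1,000 rows).
scores = np.concatenate([np.sum(np.linalg.qr(B)[0] ** 2, axis=1)
                         for B in np.array_split(A, 100)])

# Sweep the keep-threshold and record how many rows would be stored.
for tau in np.logspace(-5, -1, 9):
    print(f"threshold {tau:.0e}: kept {int(np.sum(scores > tau))} rows")
```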

Empirical Evaluation

1 million row sample of US Census Data, blocks of size 1000. Want accuracy to be high whilst keeping storage small.

[Figure: Comparing the accuracy-space tradeoff with varying threshold. Relative Error vs Storage (number of rows kept), with an inset over the smaller storage budgets (10,000 to 50,000 rows). Methods compared: Sparse Cauchy, Well-conditioned Basis, Orthonormal Basis.]


Thanks for listening!