slide-1
SLIDE 1

MapReduce and Streaming Algorithms for Diversity Maximization in Metric Spaces of Bounded Doubling Dimension

Andrea Pietracaprina

  • Dip. Ingegneria dell’Informazione, Università di Padova

Joint work with:

  • M. Ceccarello, G. Pucci (U. Padova), and E. Upfal (Brown U.)

VLDB’17

slide-2
SLIDE 2

Outline

◮ Problem definition and applications
◮ Background
◮ Summary of results
◮ Our approach:
  ◮ Core-set construction
  ◮ Streaming implementation
  ◮ MapReduce implementation
◮ Experiments
◮ Diversity maximization under matroid constraints
◮ Summary and future work

slide-3
SLIDE 3

Problem definition and applications

slide-5
SLIDE 5

Diversity maximization

Objective: for a given dataset, determine the most diverse subset of a given (small) size k

slide-6
SLIDE 6

Applications

slide-9
SLIDE 9

Applications

Aggregator websites (e.g., Google News)

◮ Documents (possibly clustered into categories)
◮ Diversified set of representative docs (from various clusters)

E-commerce

◮ Consideration set: products returned by a shopping portal in reply to a user query
◮ Diversity of the returned products (w.r.t. unspecified attributes) correlates with user satisfaction

Facility location

◮ Franchise location (noncompetition)
◮ Strategic facilities (dispersion against simultaneous attacks)

slide-10
SLIDE 10

Diversity maximization: formal definition

slide-13
SLIDE 13

Diversity maximization: formal definition

Given:

1. Set S of n points in a metric space ∆
2. Distance function d : ∆ × ∆ → R+ ∪ {0}
3. Integer k > 1
4. (Distance-based) diversity function div : 2^∆ → R+ ∪ {0}

Return:

S∗ ⊂ S, |S∗| = k, such that S∗ = argmax_{S′ ⊆ S, |S′| = k} div(S′)

The k-diversity of S is div_k(S) = div(S∗)

slide-19
SLIDE 19

Diversity functions studied in this work

  Problem             div(S′)
  Remote-edge         min_{p,q ∈ S′} d(p, q)
  Remote-clique       Σ_{p,q ∈ S′} d(p, q)
  Remote-star         w(MSTAR(S′))
  Remote-bipartition  w(MBIP(S′))
  Remote-tree         w(MST(S′))
  Remote-cycle        w(TSP(S′))

◮ MSTAR(S′): min-weight star in G(S′) (≡ complete graph over S′)
◮ MBIP(S′): min-weight balanced bipartition of S′ in G(S′)
◮ MST(S′): min-weight spanning tree of G(S′)
◮ TSP(S′): min-weight tour of S′ in G(S′)

Except for remote-clique, all problems are max-min optimizations.
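The table can be made concrete with a small brute-force sketch (illustrative code, not from the talk: the function names are my own, and `k_diversity` enumerates all size-k subsets, so it is only feasible on tiny inputs):

```python
import math
from itertools import combinations

def dist(p, q):
    # Euclidean distance; any metric would do.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def remote_edge(S):
    # min pairwise distance
    return min(dist(p, q) for p, q in combinations(S, 2))

def remote_clique(S):
    # sum of all pairwise distances
    return sum(dist(p, q) for p, q in combinations(S, 2))

def remote_tree(S):
    # weight of a minimum spanning tree of G(S), via Prim's algorithm
    S = list(S)
    in_tree, rest, w = {0}, set(range(1, len(S))), 0.0
    while rest:
        j, d = min(((j, min(dist(S[i], S[j]) for i in in_tree)) for j in rest),
                   key=lambda t: t[1])
        in_tree.add(j)
        rest.remove(j)
        w += d
    return w

def k_diversity(S, k, div):
    # div_k(S): exhaustive max over all size-k subsets
    return max(div(list(sub)) for sub in combinations(S, k))
```

For example, on four planar points with one outlier, `k_diversity(S, 3, remote_edge)` picks the three mutually farthest points.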

slide-20
SLIDE 20

Background

slide-21
SLIDE 21

Previous work

Sequential approximation and hardness results

  Problem             Seq. approx.   LB
  Remote-edge         2              ≥ 2
  Remote-clique       2              ≥ 2 − ǫ
  Remote-star         2              –
  Remote-bipartition  3              –
  Remote-tree         4              ≥ 2
  Remote-cycle        3              ≥ 2

Specialized results (hardness and better approximation ratios) exist for remote-clique and remote-edge under Euclidean distances.

slide-26
SLIDE 26

Previous work

β-core-set [Agarwal et al.’04]

◮ (Small) subset T ⊆ S such that div_k(T) ≥ (1/β) div_k(S)
◮ T filters out redundancy
◮ An approximate solution can be computed on T

slide-31
SLIDE 31

Previous work

β-composable core-set [Indyk et al.’14]

◮ Input partition S = S1 ∪ S2 ∪ · · · ∪ Sℓ
◮ β-composable core-sets Ti ⊂ Si ⇒ ∪Ti is a β-core-set
◮ Application to the MapReduce and Streaming frameworks

slide-35
SLIDE 35

Previous work

Known β-composable core-sets for k-diversity maximization ([Indyk et al.’14, Aghamolaei et al.’15])

  Problem             β      αseq   β · αseq
  Remote-edge         3      2      6
  Remote-clique       6 + ǫ  2      12 + ǫ
  Remote-star         12     2      24
  Remote-bipartition  18     3      54
  Remote-tree         4      4      16
  Remote-cycle        3      3      9

◮ αseq = best sequential approximation ratio
◮ General metric spaces
◮ Core-set size: k

slide-36
SLIDE 36

Computational Frameworks for Massive Data Analysis

slide-39
SLIDE 39

Computational Frameworks for Massive Data Analysis

MapReduce

◮ Targets distributed cluster-based architectures
◮ Computation: sequence of rounds where data are partitioned into (small) subsets, processed in parallel
◮ Goals: few rounds, sublinear local space, linear overall space

Streaming

◮ Data provided as a continuous stream and processed using limited local space (much smaller than the input size)
◮ Multiple passes over the data may be allowed
◮ Goals: few passes, sublinear local space

Fact: known composable core-sets for k-diversity maximization yield 1-pass/2-round Streaming/MapReduce algorithms with O(√(kn)) space.
slide-40
SLIDE 40

Doubling space

Metric space such that any ball of radius r is covered by ≤ 2^D balls of radius r/2 (⇒ ≤ (1/ǫ)^D balls of radius ǫr), for some D = O(1).

E.g.: Euclidean spaces of constant dimension (under the ℓ1 or ℓ2 norm); shortest-path distances of mildly expanding topologies

slide-41
SLIDE 41

Summary of Results

slide-46
SLIDE 46

Summary of results

For doubling spaces of dimension D, all div’s, and ǫ > 0:

◮ (1 + ǫ)-(composable) core-sets
◮ 1-pass Streaming and 2-round MapReduce (MR) algorithms
  – approximation ratio: αseq + ǫ
  – local space requirements:

                   Streaming    MR (det.)        MR (rand.)
    r-edge/cycle   O(k/ǫ^D)     O(√(kn/ǫ^D))
    other div’s    O(k²/ǫ^D)    O(k·√(n/ǫ^D))    O(√(kn·log n/ǫ^D))

◮ One extra pass/round brings the deterministic space bounds for the other div’s down to those for r-edge/cycle
slide-47
SLIDE 47

Our approach

slide-52
SLIDE 52

Core-set construction: algorithm

Input dataset: S
Optimal solution: OPT ⊂ S, with |OPT| = k

MAIN IDEA: compute a core-set T such that each o ∈ OPT has a (distinct) proxy p(o) ∈ T with “small” d(o, p(o))

1. Partition S into k′ > k clusters of small radius
2. T = {cluster centers}
3. If injectivity of p(·) is required:
   T = {cluster centers} ∪ {f(k) ≤ k − 1 delegates per cluster}
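A minimal sketch of these steps, assuming Gonzalez’s farthest-first traversal as the k′-center subroutine (illustrative code, not the authors’; `core_set`, `gonzalez`, and the Euclidean `dist` are my own names):

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def gonzalez(S, k_prime):
    """2-approximate k'-center: returns a list of center indices."""
    centers = [0]                        # arbitrary first center
    d = [dist(S[0], p) for p in S]       # distance to the closest center so far
    while len(centers) < k_prime:
        far = max(range(len(S)), key=lambda i: d[i])
        centers.append(far)
        d = [min(d[i], dist(S[far], S[i])) for i in range(len(S))]
    return centers

def core_set(S, k, k_prime):
    """Centers of a k'-clustering plus up to k-1 delegates per cluster."""
    centers = gonzalez(S, k_prime)
    clusters = {c: [] for c in centers}
    for i, p in enumerate(S):            # assign each point to its closest center
        c = min(centers, key=lambda c: dist(S[c], p))
        clusters[c].append(i)
    T = []
    for c, members in clusters.items():
        T.append(c)                                   # the center itself
        T.extend([i for i in members if i != c][:k - 1])  # up to k-1 delegates
    return [S[i] for i in T]
```

The resulting core-set has at most k′ · k points, matching the size bound stated later in the talk.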

slide-53
SLIDE 53

Core-set construction: algorithm

◮ k = 3, k′ = 8

slide-54
SLIDE 54

Core-set construction: algorithm

◮ Compute k′-center clustering

slide-55
SLIDE 55

Core-set construction: algorithm

◮ No injectivity required: T = {cluster centers} (|T| = k′)

slide-56
SLIDE 56

Core-set construction: algorithm

◮ Injectivity required: T = {1 center + 2 delegates per cluster}

slide-59
SLIDE 59

Core-set construction: analysis

Setting for the analysis:

◮ S from a doubling space of dimension D
◮ Focus on remote-clique (similar for the other div’s)
◮ Fix ǫ ∈ (0, 1/2)

Theorem
If k′ = (16/ǫ)^D · k and k − 1 delegates per cluster are taken, then T is a (1 + ǫ)-core-set for S, of size O(k²(1/ǫ)^D).
slide-63
SLIDE 63

Core-set construction: analysis

Proof.

◮ Let ρ = div(OPT) / (k choose 2).
  Claim: the radius of an optimal k′-clustering of S is r_{k′} ≤ (ǫ/8)ρ
◮ ∃ an injective p(·) such that for each o ∈ OPT, p(o) ∈ T and d(o, p(o)) ≤ 2r_{k′} ≤ (ǫ/4)ρ
◮ Hence:
  div_k(T) ≥ Σ_{o1,o2 ∈ OPT} d(p(o1), p(o2))   (injectivity!)
           ≥ div_k(S)/(1 + ǫ)                  (claim + triangle ineq.)
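The last inequality compresses a short chain; a sketch of the omitted step, using the claim and ǫ ∈ (0, 1/2):

```latex
% Triangle inequality, plus d(o, p(o)) <= (\epsilon/4)\rho for each o in OPT:
d(p(o_1), p(o_2)) \ge d(o_1, o_2) - d(o_1, p(o_1)) - d(o_2, p(o_2))
                  \ge d(o_1, o_2) - (\epsilon/2)\rho
% Summing over all \binom{k}{2} pairs, with div(OPT) = \binom{k}{2}\rho:
\sum_{o_1, o_2 \in OPT} d(p(o_1), p(o_2))
   \ge \operatorname{div}(OPT) - \binom{k}{2}(\epsilon/2)\rho
   = (1 - \epsilon/2)\operatorname{div}_k(S)
   \ge \operatorname{div}_k(S)/(1 + \epsilon)
% where the last step uses (1 - \epsilon/2)(1 + \epsilon) \ge 1 for \epsilon \in (0, 1).
```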

slide-69
SLIDE 69

Core-set construction: composability

Composability

◮ Input partition S = S1 ∪ S2 ∪ · · · ∪ Sℓ
◮ Extract a core-set Ti ⊂ Si as before
◮ ∪Ti is a (1 + ǫ)-core-set of size O(ℓk²(1/ǫ)^D)
⇒ the Ti’s are (1 + ǫ)-composable core-sets

Saving space

◮ Initial random partition
◮ Build each Ti picking Θ(max{k/ℓ, log n}) delegates per cluster (rather than k − 1)
◮ ∪Ti is a (1 + ǫ)-core-set of size O(k(k + ℓ log n)(1/ǫ)^D), w.h.p.

slide-72
SLIDE 72

Streaming implementation

◮ One pass
◮ Fix k′ = O(k/ǫ^D)
◮ Run an adaptation of the 8-approximate k′-center algorithm of [Charikar et al.’04] to compute a (1 + ǫ)-core-set of size O(k′ · k) = O(k²/ǫ^D)
◮ Run the sequential approximation algorithm on the core-set
◮ If D is unknown, it can be guessed in O(1) passes
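For intuition, here is a much-simplified one-pass center-maintenance sketch in the spirit of the doubling algorithm of [Charikar et al.’04] (illustrative only: it keeps at most k′ centers by repeatedly doubling a merge threshold, and omits the delegate bookkeeping that the talk’s (1 + ǫ)-core-set construction adds on top):

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def stream_k_center(stream, k_prime):
    """One pass over `stream`; returns at most k_prime centers."""
    stream = iter(stream)
    centers = [next(stream) for _ in range(k_prime)]
    # Initial lower bound on the optimal radius: half the closest center pair.
    tau = min(dist(p, q) for i, p in enumerate(centers)
              for q in centers[i + 1:]) / 2
    for p in stream:
        # Keep p as a new center only if it is far from every current center.
        if all(dist(p, c) > 2 * tau for c in centers):
            centers.append(p)
        while len(centers) > k_prime:
            tau *= 2  # raise the merge threshold and re-cluster the centers
            merged = []
            for c in centers:
                if all(dist(c, m) > 2 * tau for m in merged):
                    merged.append(c)
            centers = merged
    return centers
```

The doubling of `tau` is what bounds the approximation: each surviving center represents every point merged into it within radius O(tau).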

slide-76
SLIDE 76

MapReduce implementation

◮ Input partition S = S1 ∪ S2 ∪ · · · ∪ Sℓ
◮ Two rounds:
  ◮ Round 1: run a 2-approximate k′-center algorithm (e.g., Gonzalez’s) in each Si to find Ti
  ◮ Round 2: run the best sequential algorithm for k-diversity on ∪Ti
◮ Need not know D! Stop when the radius of the k′-clustering is a factor ǫ of the radius of the k-clustering
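The two rounds can be sketched as follows (illustrative: a plain Python loop stands in for the MapReduce/Spark job, and Gonzalez’s greedy, which is also a 2-approximation for remote-edge diversity, stands in for the round-2 sequential algorithm):

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def gonzalez(S, k):
    """Farthest-first traversal: 2-approx k-center (and remote-edge)."""
    centers = [S[0]]
    while len(centers) < min(k, len(S)):
        centers.append(max(S, key=lambda p: min(dist(p, c) for c in centers)))
    return centers

def two_round_diversity(partitions, k, k_prime):
    # Round 1: extract a core-set T_i from each partition S_i (in parallel
    # on a real cluster; sequentially here).
    core = [p for part in partitions for p in gonzalez(part, k_prime)]
    # Round 2: run the sequential approximation on the union of core-sets.
    return gonzalez(core, k)
```

On a real deployment, round 1 would be a `mapPartitions`-style job and only the O(ℓ · k′ · k) core-set points are shuffled to the round-2 reducer.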

slide-77
SLIDE 77

Experiments

slide-80
SLIDE 80

Experiments: setup

Datasets:

◮ Synthetic data: [4M ÷ 1.6G] points in R³
◮ Real data: musiXmatch dataset
  ◮ ≈ 250K songs
  ◮ bag-of-words model, cosine distance

Platform:

◮ 16-node cluster (Intel i7)
◮ 18 GB RAM / 256 GB SSD per node
◮ 10 Gbps Ethernet
◮ MapReduce: Apache Spark
◮ Streaming: simulation

slide-82
SLIDE 82

Experiments: effectiveness of approach

Streaming algorithm on the real dataset

◮ X-axis: k
◮ Y-axis: approximation ratio (w.r.t. the best empirical solution)

slide-83
SLIDE 83

Experiments: comparison with the state of the art

Our algorithm (CPPU) vs [Aghamolaei et al.’15] (AFZ)

◮ MapReduce using 16 machines
◮ 4M points in R³ (max feasible for AFZ)
◮ Our algorithm: k′ = 128

       approximation     time (s)
  k    AFZ     CPPU      AFZ        CPPU
  4    1.023   1.012     807.79     1.19
  6    1.052   1.018     1,052.39   1.29
  8    1.029   1.028     4,625.46   1.12

slide-84
SLIDE 84

Experiments: scalability

Scalability in MapReduce w.r.t. input size

◮ Synthetic data

slide-85
SLIDE 85

Experiments: scalability

Scalability in MapReduce w.r.t. number of processors

◮ Synthetic data: 100M points
◮ 1 processor ≡ streaming algorithm

slide-86
SLIDE 86

Diversity maximization under matroid constraints

slide-92
SLIDE 92

Diversity maximization under matroid constraints

◮ Consider diversity maximization problems where solutions are required to satisfy a given matroid constraint
◮ [Abbassi et al. KDD’13]: sequential (2 + ǫ)-approximation (based on local search) for remote-clique diversity under a matroid constraint
◮ Our approach can be generalized to provide (1 + ǫ)-composable core-sets for all div’s under partition/transversal matroid constraints in doubling spaces. Main adaptations required:
  ◮ Selection of k′
  ◮ Selection of the delegates in each cluster to include in the core-set

slide-93
SLIDE 93

Summary and future work

slide-95
SLIDE 95

Summary

◮ (1 + ǫ)-(composable) core-sets for diversity maximization in doubling spaces (also under partition/transversal matroid constraints)
◮ 1-pass/2-round Streaming/MapReduce algorithms
◮ Experiments on real and synthetic data show the effectiveness, efficiency and scalability of our approach

slide-97
SLIDE 97

Future Work

◮ Further exploit randomization to obtain lower space requirements in MapReduce
◮ Extend to broader classes of metric spaces
◮ Deal with general matroid constraints
◮ Incorporate other constraints (e.g., density)
