
MapReduce and Streaming Algorithms for Diversity Maximization in Metric Spaces of Bounded Doubling Dimension


  1. MapReduce and Streaming Algorithms for Diversity Maximization in Metric Spaces of Bounded Doubling Dimension
     Andrea Pietracaprina, Dip. Ingegneria dell’Informazione, Università di Padova
     Joint work with: M. Ceccarello, G. Pucci (U. Padova), and E. Upfal (Brown U.)
     VLDB’17

  2. Outline
     ◮ Problem definition and applications
     ◮ Background
     ◮ Summary of results
     ◮ Our approach:
       ◮ Core-set construction
       ◮ Streaming implementation
       ◮ MapReduce implementation
     ◮ Experiments
     ◮ Diversity maximization under matroid constraints
     ◮ Summary and future work

  3. Problem definition and applications

  5. Diversity maximization
     Objective: for a given dataset, determine the most diverse subset of a given (small) size k

  9. Applications
     Aggregator websites (e.g., Google News)
     ◮ Documents (possibly clustered into categories)
     ◮ Diversified set of representative docs (from various clusters)
     E-commerce
     ◮ Consideration set: products returned by shopping portal in reply to user query
     ◮ Diversity of returned products (w.r.t. unspecified attributes) correlates to user satisfaction
     Facility location
     ◮ Franchise location (noncompetition)
     ◮ Strategic facilities (dispersion against simultaneous attacks)

  13. Diversity maximization: formal definition
      Given:
      1. Set $S$ of $n$ points in a metric space $\Delta$
      2. Distance function $d : \Delta \times \Delta \to \mathbb{R}^+ \cup \{0\}$
      3. Integer $k > 1$
      4. (Distance-based) diversity function $\mathrm{div} : 2^{\Delta} \to \mathbb{R}^+ \cup \{0\}$
      Return $S^* \subset S$, $|S^*| = k$, s.t. $S^* = \operatorname{argmax}_{S' \subseteq S,\ |S'| = k} \mathrm{div}(S')$
      The $k$-diversity of $S$ is $\mathrm{div}_k(S) = \mathrm{div}(S^*)$
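The definition above can be read directly as an exhaustive search. The following is a minimal sketch, assuming 2D Euclidean points and the minimum pairwise distance as the diversity function; the names `remote_edge` and `k_diversity` are illustrative and not taken from the slides. It enumerates all k-subsets, so it is only practical for tiny inputs.

```python
from itertools import combinations
from math import dist

def remote_edge(points):
    """Example diversity function: minimum pairwise distance."""
    return min(dist(p, q) for p, q in combinations(points, 2))

def k_diversity(S, k, div):
    """Try every k-subset S' of S and return (S*, div_k(S))."""
    best = max(combinations(S, k), key=div)
    return list(best), div(best)

# Tiny illustrative dataset (an assumption, not data from the talk).
S = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5)]
subset, value = k_diversity(S, 3, remote_edge)
```

The search costs on the order of C(n, k) evaluations of `div`, which is why the rest of the deck focuses on approximation and on small core-sets.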

  19. Diversity functions studied in this work

      | Problem            | $\mathrm{div}(S')$               |
      |--------------------|----------------------------------|
      | Remote-edge        | $\min_{p,q \in S'} d(p,q)$       |
      | Remote-clique      | $\sum_{p,q \in S'} d(p,q)$       |
      | Remote-star        | $w(\mathrm{MSTAR}(S'))$          |
      | Remote-bipartition | $w(\mathrm{MBIP}(S'))$           |
      | Remote-tree        | $w(\mathrm{MST}(S'))$            |
      | Remote-cycle       | $w(\mathrm{TSP}(S'))$            |

      ◮ MSTAR($S'$): min-weight star in $G(S')$ ($\equiv$ complete graph over $S'$)
      ◮ MBIP($S'$): min-weight balanced bipartition of $S'$ in $G(S')$
      ◮ MST($S'$): min-weight spanning tree of $G(S')$
      ◮ TSP($S'$): min-weight tour of $S'$ in $G(S')$
      Except for remote-clique, all problems are max-min optimizations.
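The six diversity functions listed above can be sketched as follows. This is a brute-force illustration for small point sets, under some labeled assumptions: points are Euclidean tuples, the remote-clique sum ranges over unordered pairs, and a "balanced" bipartition puts floor(|S'|/2) points on one side. The function names are chosen here for illustration.

```python
from itertools import combinations, permutations
from math import dist

def remote_edge(P):
    return min(dist(p, q) for p, q in combinations(P, 2))

def remote_clique(P):
    # Assumption: each unordered pair is counted once.
    return sum(dist(p, q) for p, q in combinations(P, 2))

def remote_star(P):
    # w(MSTAR): cheapest star; dist(c, c) = 0, so summing over all q is safe.
    return min(sum(dist(c, q) for q in P) for c in P)

def remote_bipartition(P):
    # w(MBIP): cheapest cut between a side Q with floor(|P|/2) points
    # and the remaining points (the floor is an assumption).
    idx = range(len(P))
    best = float("inf")
    for Q in combinations(idx, len(P) // 2):
        side = set(Q)
        cut = sum(dist(P[i], P[j]) for i in Q for j in idx if j not in side)
        best = min(best, cut)
    return best

def remote_tree(P):
    # w(MST) via Prim's algorithm on the complete graph over P.
    P = list(P)
    key = [dist(P[0], p) for p in P]      # cheapest link into the tree
    in_tree = [False] * len(P)
    in_tree[0] = True
    total = 0.0
    for _ in range(len(P) - 1):
        j = min((i for i in range(len(P)) if not in_tree[i]),
                key=key.__getitem__)
        total += key[j]
        in_tree[j] = True
        for i in range(len(P)):
            if not in_tree[i]:
                key[i] = min(key[i], dist(P[j], P[i]))
    return total

def remote_cycle(P):
    # w(TSP) by brute force: fix the first point, try every visiting order.
    P = list(P)
    first, rest = P[0], P[1:]
    return min(
        sum(dist(a, b) for a, b in zip((first, *perm), (*perm, first)))
        for perm in permutations(rest)
    )

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
values = {f.__name__: f(square) for f in (
    remote_edge, remote_clique, remote_star,
    remote_bipartition, remote_tree, remote_cycle)}
```

Each function evaluates the diversity of one candidate subset; diversity maximization then asks for the k-subset that maximizes the chosen function.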

  20. Background

  21. Previous work
      Sequential approximation and hardness results

      | Problem            | Seq. approx. | LB                   |
      |--------------------|--------------|----------------------|
      | Remote-edge        | 2            | $\ge 2$              |
      | Remote-clique      | 2            | $\ge 2 - \epsilon$   |
      | Remote-star        | 2            | –                    |
      | Remote-bipartition | 3            | –                    |
      | Remote-tree        | 4            | $\ge 2$              |
      | Remote-cycle       | 3            | $\ge 2$              |

      Specialized results (hardness and better approx. ratios) for remote-clique and remote-edge under Euclidean distances
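The slide does not name the sequential algorithms behind these ratios. As an illustration for the remote-edge row, here is a sketch of the greedy farthest-first traversal (often called GMM), which is known in the literature to give a 2-approximation for remote-edge; treating it as the algorithm behind the table entry is an assumption. The demo compares it with a brute-force optimum on a small synthetic dataset.

```python
from itertools import combinations
from math import dist

def remote_edge(P):
    return min(dist(p, q) for p, q in combinations(P, 2))

def gmm(S, k):
    """Farthest-first traversal: start from S[0], then repeatedly add
    the point whose distance to the already chosen points is largest."""
    k = min(k, len(S))
    chosen = [S[0]]
    gap = [dist(S[0], p) for p in S]      # distance to nearest chosen point
    while len(chosen) < k:
        i = max(range(len(S)), key=gap.__getitem__)
        chosen.append(S[i])
        gap = [min(g, dist(S[i], p)) for g, p in zip(gap, S)]
    return chosen

# Synthetic points (illustrative only, not data from the talk).
S = [((i * 37) % 101, (i * 61) % 103) for i in range(25)]
k = 4
greedy_value = remote_edge(gmm(S, k))
opt_value = max(remote_edge(c) for c in combinations(S, k))
ratio = opt_value / greedy_value          # expected to be at most 2
```

The greedy pass needs O(nk) distance computations, whereas the brute-force optimum used for comparison enumerates all C(n, k) subsets.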

  26. Previous work
      $\beta$-core-set [Agarwal et al.’04]
      ◮ (Small) subset $T \subseteq S$ such that $\mathrm{div}_k(T) \ge (1/\beta)\,\mathrm{div}_k(S)$
      ◮ $T$ filters out redundancy
      ◮ Approximate solution can be computed on $T$
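To make the core-set inequality concrete, the sketch below measures the empirical ratio $\beta = \mathrm{div}_k(S) / \mathrm{div}_k(T)$ for one candidate subset $T$. The filter used to build $T$ (keep one representative per grid cell) is a hypothetical stand-in chosen only to illustrate "filtering out redundancy"; it is not the construction from the cited papers and carries no guarantee on $\beta$.

```python
from itertools import combinations
from math import dist, floor

def remote_edge(P):
    return min(dist(p, q) for p, q in combinations(P, 2))

def div_k(P, k):
    """Exact k-diversity by exhaustive search (small inputs only)."""
    return max(remote_edge(c) for c in combinations(P, k))

def grid_filter(S, cell):
    """Hypothetical redundancy filter: keep the first point per grid cell."""
    reps = {}
    for p in S:
        reps.setdefault((floor(p[0] / cell), floor(p[1] / cell)), p)
    return list(reps.values())

# Synthetic points in the unit square (illustrative only).
S = [((i * 0.37) % 1, (i * 0.61) % 1) for i in range(30)]
k = 4
T = grid_filter(S, cell=0.25)
beta = div_k(S, k) / div_k(T, k)   # T is a beta-core-set for this S and k
```

Because $T \subseteq S$, the measured `beta` is always at least 1; a good core-set keeps it close to 1 while making $T$ much smaller than $S$.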

  31. Previous work
      $\beta$-composable core-set [Indyk et al.’14]
      ◮ Input partition $S = S_1 \cup S_2 \cup \cdots \cup S_\ell$
      ◮ $\beta$-composable core-sets $T_i \subset S_i$ $\Rightarrow$ $\bigcup_i T_i$ is a $\beta$-core-set
      ◮ Application to MapReduce and Streaming frameworks

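  35. Previous work
      Known $\beta$-composable core-sets for $k$-diversity maximization ([Indyk et al.’14, Aghamolaei et al.’15])

      | Problem            | $\beta$        | $\alpha_{seq}$ | $\beta \cdot \alpha_{seq}$ |
      |--------------------|----------------|----------------|----------------------------|
      | Remote-edge        | 3              | 2              | 6                          |
      | Remote-clique      | $6 + \epsilon$ | 2              | $12 + \epsilon$            |
      | Remote-star        | 12             | 2              | 24                         |
      | Remote-bipartition | 18             | 3              | 54                         |
      | Remote-tree        | 4              | 4              | 16                         |
      | Remote-cycle       | 3              | 3              | 9                          |

      ◮ $\alpha_{seq}$ = best sequential approximation ratio
      ◮ General metric spaces
      ◮ Core-set size: $k$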

  37. Computational Frameworks for Massive Data Analysis
      MapReduce
      ◮ Targets distributed cluster-based architectures
      ◮ Computation: sequence of rounds where data are partitioned into (small) subsets, processed in parallel
      ◮ Goals: few rounds, sublinear local space, linear overall space
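The round structure can be simulated on a single machine. The sketch below is a two-round scheme for remote-edge in the spirit of the composable core-sets discussed earlier: round 1 partitions the input and extracts k points per partition, round 2 solves the problem on the union. Using a farthest-first traversal as both the per-partition core-set and the final solver is an assumption made for illustration, as are the function names and the requirement that `ell` does not exceed the input size.

```python
from math import dist

def gmm(S, k):
    """Farthest-first traversal returning up to k spread-out points."""
    k = min(k, len(S))
    chosen = [S[0]]
    gap = [dist(S[0], p) for p in S]      # distance to nearest chosen point
    while len(chosen) < k:
        i = max(range(len(S)), key=gap.__getitem__)
        chosen.append(S[i])
        gap = [min(g, dist(S[i], p)) for g, p in zip(gap, S)]
    return chosen

def two_round_remote_edge(S, k, ell):
    # Round 1: split S into ell parts; each "reducer" sees one part only
    # and emits a core-set of k points.
    parts = [S[i::ell] for i in range(ell)]
    coresets = [gmm(part, k) for part in parts]
    # Round 2: a single reducer receives the union (ell * k points)
    # and runs the sequential algorithm on it.
    union = [p for T in coresets for p in T]
    solution = gmm(union, k)
    # Largest number of points any single reducer had to hold.
    local_space = max(max(len(part) for part in parts), len(union))
    return solution, local_space

# Synthetic input of n = 1000 distinct points (illustrative only).
S = [((i * 37) % 101, (i * 61) % 103) for i in range(1000)]
solution, local_space = two_round_remote_edge(S, k=5, ell=10)
```

With n = 1000, k = 5 and ell = 10, each round-1 reducer holds 100 points and the round-2 reducer holds 50, so local space stays well below n while the total space across reducers remains linear in n.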
