slide-1
SLIDE 1

MapReduce and Streaming Algorithms for Diversity Maximization in Metric Spaces of Bounded Doubling Dimension

Andrea Pietracaprina

  • Dip. Ingegneria dell’Informazione, Università di Padova

Joint work with:

  • M. Ceccarello, G. Pucci (U. Padova), and E. Upfal (Brown U.)

VLDB’17

slide-2
SLIDE 2

Outline

◮ Problem definition and applications
◮ Background
◮ Summary of results
◮ Our approach:
  ◮ Core-set construction
  ◮ Streaming implementation
  ◮ MapReduce implementation
◮ Experiments
◮ Diversity maximization under matroid constraints
◮ Summary and future work

slide-3
SLIDE 3

Problem definition and applications

slide-5
SLIDE 5

Diversity maximization

Objective: for a given dataset, determine the most diverse subset of a given (small) size k

slide-6
SLIDE 6

Applications

slide-9
SLIDE 9

Applications

Aggregator websites (e.g., Google News)

◮ Documents (possibly clustered into categories)
◮ Diversified set of representative docs (from various clusters)

E-commerce

◮ Consideration set: products returned by a shopping portal in reply to a user query
◮ Diversity of the returned products (w.r.t. unspecified attributes) correlates with user satisfaction

Facility location

◮ Franchise location (noncompetition)
◮ Strategic facilities (dispersion against simultaneous attacks)

slide-10
SLIDE 10

Diversity maximization: formal definition

slide-13
SLIDE 13

Diversity maximization: formal definition

Given:

1. Set S of n points in a metric space ∆
2. Distance function d : ∆ × ∆ → R+ ∪ {0}
3. Integer k > 1
4. (Distance-based) diversity function div : 2^∆ → R+ ∪ {0}

Return:

S∗ ⊂ S, |S∗| = k, such that S∗ = argmax_{S′ ⊆ S, |S′| = k} div(S′)

The k-diversity of S is div_k(S) = div(S∗)

slide-19
SLIDE 19

Diversity functions studied in this work

  Problem             div(S′)
  Remote-edge         min_{p,q ∈ S′} d(p, q)
  Remote-clique       Σ_{p,q ∈ S′} d(p, q)
  Remote-star         w(MSTAR(S′))
  Remote-bipartition  w(MBIP(S′))
  Remote-tree         w(MST(S′))
  Remote-cycle        w(TSP(S′))

◮ MSTAR(S′): min-weight star in G(S′) (≡ complete graph over S′)
◮ MBIP(S′): min-weight balanced bipartition of S′ in G(S′)
◮ MST(S′): min-weight spanning tree of G(S′)
◮ TSP(S′): min-weight tour of S′ in G(S′)

Except for remote-clique, all problems are max-min optimizations.
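The table can be made concrete with a small brute-force sketch (illustrative code, not from the talk: the function names are my own, and `k_diversity` enumerates all size-k subsets, so it is only feasible on tiny inputs):

```python
import math
from itertools import combinations

def dist(p, q):
    # Euclidean distance; any metric would do.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def remote_edge(S):
    # min pairwise distance
    return min(dist(p, q) for p, q in combinations(S, 2))

def remote_clique(S):
    # sum of all pairwise distances
    return sum(dist(p, q) for p, q in combinations(S, 2))

def remote_tree(S):
    # weight of a minimum spanning tree of G(S), via Prim's algorithm
    S = list(S)
    in_tree, rest, w = {0}, set(range(1, len(S))), 0.0
    while rest:
        j, d = min(((j, min(dist(S[i], S[j]) for i in in_tree)) for j in rest),
                   key=lambda t: t[1])
        in_tree.add(j)
        rest.remove(j)
        w += d
    return w

def k_diversity(S, k, div):
    # div_k(S): exhaustive max over all size-k subsets
    return max(div(list(sub)) for sub in combinations(S, k))
```

For example, on four planar points with one outlier, `k_diversity(S, 3, remote_edge)` picks the three mutually farthest points.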

slide-20
SLIDE 20

Background

slide-21
SLIDE 21

Previous work

Sequential approximation and hardness results

  Problem             Seq. approx.   LB
  Remote-edge         2              ≥ 2
  Remote-clique       2              ≥ 2 − ǫ
  Remote-star         2              –
  Remote-bipartition  3              –
  Remote-tree         4              ≥ 2
  Remote-cycle        3              ≥ 2

Specialized results (hardness and better approximation ratios) exist for remote-clique and remote-edge under Euclidean distances.

slide-26
SLIDE 26

Previous work

β-core-set [Agarwal et al.’04]

◮ (Small) subset T ⊆ S such that div_k(T) ≥ (1/β) div_k(S)
◮ T filters out redundancy
◮ An approximate solution can be computed on T

slide-31
SLIDE 31

Previous work

β-composable core-set [Indyk et al.’14]

◮ Input partition S = S1 ∪ S2 ∪ · · · ∪ Sℓ
◮ β-composable core-sets Ti ⊂ Si ⇒ ∪Ti is a β-core-set
◮ Application to the MapReduce and Streaming frameworks

slide-35
SLIDE 35

Previous work

Known β-composable core-sets for k-diversity maximization ([Indyk et al.’14, Aghamolaei et al.’15])

  Problem             β      αseq   β · αseq
  Remote-edge         3      2      6
  Remote-clique       6 + ǫ  2      12 + ǫ
  Remote-star         12     2      24
  Remote-bipartition  18     3      54
  Remote-tree         4      4      16
  Remote-cycle        3      3      9

◮ αseq = best sequential approximation ratio
◮ General metric spaces
◮ Core-set size: k

slide-36
SLIDE 36

Computational Frameworks for Massive Data Analysis

slide-39
SLIDE 39

Computational Frameworks for Massive Data Analysis

MapReduce

◮ Targets distributed cluster-based architectures
◮ Computation: sequence of rounds where data are partitioned into (small) subsets, processed in parallel
◮ Goals: few rounds, sublinear local space, linear overall space

Streaming

◮ Data provided as a continuous stream and processed using limited local space (much smaller than the input size)
◮ Multiple passes over the data may be allowed
◮ Goals: few passes, sublinear local space

Fact: known composable core-sets for k-diversity maximization yield 1-pass/2-round Streaming/MapReduce algorithms with O(√(kn)) space.
slide-40
SLIDE 40

Doubling space

Metric space such that any ball of radius r is covered by ≤ 2^D balls of radius r/2 (⇒ ≤ (1/ǫ)^D balls of radius ǫr), for some D = O(1).

E.g.: Euclidean spaces of constant dimension (under the ℓ1 or ℓ2 norm); shortest-path distances of mildly expanding topologies

slide-41
SLIDE 41

Summary of Results

slide-46
SLIDE 46

Summary of results

For doubling spaces of dimension D, all div’s, and ǫ > 0:

◮ (1 + ǫ)-(composable) core-sets
◮ 1-pass Streaming and 2-round MapReduce (MR) algorithms
  – approximation ratio: αseq + ǫ
  – local space requirements:

                   Streaming    MR (det.)        MR (rand.)
    r-edge/cycle   O(k/ǫ^D)     O(√(kn/ǫ^D))
    other div’s    O(k²/ǫ^D)    O(k·√(n/ǫ^D))    O(√(kn·log n/ǫ^D))

◮ One extra pass/round brings the deterministic space bounds for the other div’s down to those for r-edge/cycle
slide-47
SLIDE 47

Our approach

slide-52
SLIDE 52

Core-set construction: algorithm

Input dataset: S
Optimal solution: OPT ⊂ S, with |OPT| = k

MAIN IDEA: compute a core-set T such that each o ∈ OPT has a (distinct) proxy p(o) ∈ T with “small” d(o, p(o))

1. Partition S into k′ > k clusters of small radius
2. T = {cluster centers}
3. If injectivity of p(·) is required:
   T = {cluster centers} ∪ {f(k) ≤ k − 1 delegates per cluster}
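A minimal sketch of these steps, assuming Gonzalez’s farthest-first traversal as the k′-center subroutine (illustrative code, not the authors’; `core_set`, `gonzalez`, and the Euclidean `dist` are my own names):

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def gonzalez(S, k_prime):
    """2-approximate k'-center: returns a list of center indices."""
    centers = [0]                        # arbitrary first center
    d = [dist(S[0], p) for p in S]       # distance to the closest center so far
    while len(centers) < k_prime:
        far = max(range(len(S)), key=lambda i: d[i])
        centers.append(far)
        d = [min(d[i], dist(S[far], S[i])) for i in range(len(S))]
    return centers

def core_set(S, k, k_prime):
    """Centers of a k'-clustering plus up to k-1 delegates per cluster."""
    centers = gonzalez(S, k_prime)
    clusters = {c: [] for c in centers}
    for i, p in enumerate(S):            # assign each point to its closest center
        c = min(centers, key=lambda c: dist(S[c], p))
        clusters[c].append(i)
    T = []
    for c, members in clusters.items():
        T.append(c)                                   # the center itself
        T.extend([i for i in members if i != c][:k - 1])  # up to k-1 delegates
    return [S[i] for i in T]
```

The resulting core-set has at most k′ · k points, matching the size bound stated later in the talk.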

slide-53
SLIDE 53

Core-set construction: algorithm

◮ k = 3, k′ = 8

slide-54
SLIDE 54

Core-set construction: algorithm

◮ Compute k′-center clustering

slide-55
SLIDE 55

Core-set construction: algorithm

◮ No injectivity required: T = {cluster centers} (|T| = k′)

slide-56
SLIDE 56

Core-set construction: algorithm

◮ Injectivity required: T = {1 center + 2 delegates per cluster}

slide-59
SLIDE 59

Core-set construction: analysis

Setting for the analysis:

◮ S from a doubling space of dimension D
◮ Focus on remote-clique (similar for the other div’s)
◮ Fix ǫ ∈ (0, 1/2)

Theorem
If k′ = (16/ǫ)^D · k and k − 1 delegates per cluster are taken, then T is a (1 + ǫ)-core-set for S, of size O(k²(1/ǫ)^D).
slide-63
SLIDE 63

Core-set construction: analysis

Proof.

◮ Let ρ = div(OPT) / (k choose 2).
  Claim: the radius of an optimal k′-clustering of S is r_{k′} ≤ (ǫ/8)ρ
◮ ∃ an injective p(·) such that for each o ∈ OPT, p(o) ∈ T and d(o, p(o)) ≤ 2r_{k′} ≤ (ǫ/4)ρ
◮ Hence:
  div_k(T) ≥ Σ_{o1,o2 ∈ OPT} d(p(o1), p(o2))   (injectivity!)
           ≥ div_k(S)/(1 + ǫ)                  (claim + triangle ineq.)
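The last inequality compresses a short chain; a sketch of the omitted step, using the claim and ǫ ∈ (0, 1/2):

```latex
% Triangle inequality, plus d(o, p(o)) <= (\epsilon/4)\rho for each o in OPT:
d(p(o_1), p(o_2)) \ge d(o_1, o_2) - d(o_1, p(o_1)) - d(o_2, p(o_2))
                  \ge d(o_1, o_2) - (\epsilon/2)\rho
% Summing over all \binom{k}{2} pairs, with div(OPT) = \binom{k}{2}\rho:
\sum_{o_1, o_2 \in OPT} d(p(o_1), p(o_2))
   \ge \operatorname{div}(OPT) - \binom{k}{2}(\epsilon/2)\rho
   = (1 - \epsilon/2)\operatorname{div}_k(S)
   \ge \operatorname{div}_k(S)/(1 + \epsilon)
% where the last step uses (1 - \epsilon/2)(1 + \epsilon) \ge 1 for \epsilon \in (0, 1).
```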

slide-69
SLIDE 69

Core-set construction: composability

Composability

◮ Input partition S = S1 ∪ S2 ∪ · · · ∪ Sℓ
◮ Extract a core-set Ti ⊂ Si as before
◮ ∪Ti is a (1 + ǫ)-core-set of size O(ℓk²(1/ǫ)^D)
⇒ the Ti’s are (1 + ǫ)-composable core-sets

Saving space

◮ Initial random partition
◮ Build each Ti picking Θ(max{k/ℓ, log n}) delegates per cluster (rather than k − 1)
◮ ∪Ti is a (1 + ǫ)-core-set of size O(k(k + ℓ log n)(1/ǫ)^D), w.h.p.

slide-72
SLIDE 72

Streaming implementation

◮ One pass
◮ Fix k′ = O(k/ǫ^D)
◮ Run an adaptation of the 8-approximate k′-center algorithm of [Charikar et al.’04] to compute a (1 + ǫ)-core-set of size O(k′ · k) = O(k²/ǫ^D)
◮ Run the sequential approximation algorithm on the core-set
◮ If D is unknown, it can be guessed in O(1) passes
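For intuition, here is a much-simplified one-pass center-maintenance sketch in the spirit of the doubling algorithm of [Charikar et al.’04] (illustrative only: it keeps at most k′ centers by repeatedly doubling a merge threshold, and omits the delegate bookkeeping that the talk’s (1 + ǫ)-core-set construction adds on top):

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def stream_k_center(stream, k_prime):
    """One pass over `stream`; returns at most k_prime centers."""
    stream = iter(stream)
    centers = [next(stream) for _ in range(k_prime)]
    # Initial lower bound on the optimal radius: half the closest center pair.
    tau = min(dist(p, q) for i, p in enumerate(centers)
              for q in centers[i + 1:]) / 2
    for p in stream:
        # Keep p as a new center only if it is far from every current center.
        if all(dist(p, c) > 2 * tau for c in centers):
            centers.append(p)
        while len(centers) > k_prime:
            tau *= 2  # raise the merge threshold and re-cluster the centers
            merged = []
            for c in centers:
                if all(dist(c, m) > 2 * tau for m in merged):
                    merged.append(c)
            centers = merged
    return centers
```

The doubling of `tau` is what bounds the approximation: each surviving center represents every point merged into it within radius O(tau).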

slide-76
SLIDE 76

MapReduce implementation

◮ Input partition S = S1 ∪ S2 ∪ · · · ∪ Sℓ
◮ Two rounds:
  ◮ Round 1: run a 2-approximate k′-center algorithm (e.g., Gonzalez’s) in each Si to find Ti
  ◮ Round 2: run the best sequential algorithm for k-diversity on ∪Ti
◮ Need not know D! Stop when the radius of the k′-clustering is a factor ǫ of the radius of the k-clustering
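The two rounds can be sketched as follows (illustrative: a plain Python loop stands in for the MapReduce/Spark job, and Gonzalez’s greedy, which is also a 2-approximation for remote-edge diversity, stands in for the round-2 sequential algorithm):

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def gonzalez(S, k):
    """Farthest-first traversal: 2-approx k-center (and remote-edge)."""
    centers = [S[0]]
    while len(centers) < min(k, len(S)):
        centers.append(max(S, key=lambda p: min(dist(p, c) for c in centers)))
    return centers

def two_round_diversity(partitions, k, k_prime):
    # Round 1: extract a core-set T_i from each partition S_i (in parallel
    # on a real cluster; sequentially here).
    core = [p for part in partitions for p in gonzalez(part, k_prime)]
    # Round 2: run the sequential approximation on the union of core-sets.
    return gonzalez(core, k)
```

On a real deployment, round 1 would be a `mapPartitions`-style job and only the O(ℓ · k′ · k) core-set points are shuffled to the round-2 reducer.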

slide-77
SLIDE 77

Experiments

slide-80
SLIDE 80

Experiments: setup

Datasets:

◮ Synthetic data: [4M ÷ 1.6G] points in R³
◮ Real data: musiXmatch dataset
  ◮ ≈ 250K songs
  ◮ bag-of-words model, cosine distance

Platform:

◮ 16-node cluster (Intel i7)
◮ 18 GB RAM / 256 GB SSD per node
◮ 10 Gbps Ethernet
◮ MapReduce: Apache Spark
◮ Streaming: simulation

slide-82
SLIDE 82

Experiments: effectiveness of approach

Streaming algorithm on the real dataset

◮ X-axis: k
◮ Y-axis: approximation ratio (w.r.t. the best empirical solution)

slide-83
SLIDE 83

Experiments: comparison with the state of the art

Our algorithm (CPPU) vs [Aghamolaei et al.’15] (AFZ)

◮ MapReduce using 16 machines
◮ 4M points in R³ (max feasible for AFZ)
◮ Our algorithm: k′ = 128

       approximation     time (s)
  k    AFZ     CPPU      AFZ        CPPU
  4    1.023   1.012     807.79     1.19
  6    1.052   1.018     1,052.39   1.29
  8    1.029   1.028     4,625.46   1.12

slide-84
SLIDE 84

Experiments: scalability

Scalability in MapReduce w.r.t. input size

◮ Synthetic data

slide-85
SLIDE 85

Experiments: scalability

Scalability in MapReduce w.r.t. number of processors

◮ Synthetic data: 100M points
◮ 1 processor ≡ streaming algorithm

slide-86
SLIDE 86

Diversity maximization under matroid constraints

slide-92
SLIDE 92

Diversity maximization under matroid constraints

◮ Consider diversity maximization problems where solutions are required to satisfy a given matroid constraint
◮ [Abbassi et al. KDD’13]: sequential (2 + ǫ)-approximation (based on local search) for remote-clique diversity under a matroid constraint
◮ Our approach can be generalized to provide (1 + ǫ)-composable core-sets for all div’s under partition/transversal matroid constraints in doubling spaces. Main adaptations required:
  ◮ Selection of k′
  ◮ Selection of the delegates in each cluster to include in the core-set

slide-93
SLIDE 93

Summary and future work

slide-95
SLIDE 95

Summary

◮ (1 + ǫ)-(composable) core-sets for diversity maximization in doubling spaces (also under partition/transversal matroid constraints)
◮ 1-pass/2-round Streaming/MapReduce algorithms
◮ Experiments on real and synthetic data show the effectiveness, efficiency and scalability of our approach

slide-97
SLIDE 97

Future Work

◮ Further exploit randomization to obtain lower space requirements in MapReduce
◮ Extend to broader classes of metric spaces
◮ Deal with general matroid constraints
◮ Incorporate other constraints (e.g., density)
