Set Cover Algorithms For Very Large Datasets
Graham Cormode, Howard Karloff (AT&T Labs-Research), Tony Wirth (University of Melbourne)
Set Cover?
Given a collection of sets over a universe of items, find the smallest subcollection of sets that also covers all the items.
Why Set Cover?
The set cover problem arises in many contexts:
– Facility location: a facility covers a set of sites
– Machine learning: a labeled example covers some items
– Information retrieval: each document covers a set of topics
– Data mining: finding a minimal ‘explanation’ for patterns
– Data quality: finding a collection of rules to describe structure
How to solve it?
Set Cover is NP-hard!
Simple greedy algorithm:
– Repeatedly select the set with the most uncovered items
– Logarithmic factor guarantee: 1 + ln n
– No factor better than (1 - o(1)) ln n is possible
In practice, greedy is very useful:
– Better than other approximation algorithms
– Often within 10% of optimal
Existing Algorithms
Greedy algorithm: 1 + ln n approximation
– Until all n elements of X are in C (initially empty):
  – Choose (one of) the set(s) Si* with maximum value of |Si - C|
  – Let C = C ∪ Si*
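As a concrete reference point, here is a minimal in-memory sketch of this greedy loop in Python (the dict-of-sets input format and the function name are illustrative, not from the slides):

```python
def greedy_set_cover(sets):
    """Greedy set cover: repeatedly take the set covering the most
    uncovered items.  `sets` maps a set id to a Python set of items."""
    universe = set().union(*sets.values())
    covered, chosen = set(), []        # `covered` plays the role of C
    while covered != universe:
        # choose (one of) the set(s) Si* maximizing |Si - C|
        best = max(sets, key=lambda i: len(sets[i] - covered))
        if not sets[best] - covered:   # no set can help any more: stop
            break
        chosen.append(best)
        covered |= sets[best]          # C = C ∪ Si*
    return chosen
```

For instance, greedy_set_cover({1: {'A', 'B'}, 2: {'B', 'C'}, 3: {'C'}}) returns [1, 2].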
Naïve algorithm: no guaranteed approximation
– Sort the sets by their (initial) sizes |Si|, descending
– Single pass through the sorted list:
  – If a set has an uncovered item, select it and update C
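A matching sketch of the naïve single-pass algorithm, under the same assumptions as the greedy sketch above; note that sizes are computed once up front, so sets are never re-ranked, which is exactly why there is no approximation guarantee:

```python
def naive_set_cover(sets):
    """Naive heuristic: one pass over the sets in decreasing order of
    initial size, taking any set that still has an uncovered item."""
    covered, chosen = set(), []
    for i in sorted(sets, key=lambda i: len(sets[i]), reverse=True):
        if sets[i] - covered:          # the set has an uncovered item
            chosen.append(i)
            covered |= sets[i]         # update C
    return chosen
```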
Example greedy
[Figure: an instance with sets ABCDE, ABDFG, BCG, AFG, EH, AI, GH, CI, E; the cover chosen by greedy is highlighted.]
Optimum
[Figure: the same instance, with an optimal cover highlighted.]
What’s wrong?
Try implementing greedy on a large dataset:
– It scales very poorly
– Millions of sets over a universe of many millions of items?
– Dataset growth exceeds fast-memory growth
– If forced to use disk, selecting the “largest” set requires updating set sizes to account for covered items
– Even a 30 MB instance required over 1 minute to run on disk
Implementing greedy
Main step: find the set with the largest |Si - C| value
Inverted-index solution:
– Maintain updated set sizes in a priority queue
– An inverted index records which sets each item is in
– Costly to build the index; no locality of reference
Multipass solution:
– Loop through all sets, calculating |Si - C| on the fly
– Good locality of reference, but many passes!
– If |Si* - C| drops below a threshold, the loop adds all sets with that specific |Si* - C| value
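One common way to cut the cost of the priority-queue approach, shown below as an assumed alternative rather than the implementation from the slides, is a “lazy” variant: instead of eagerly updating every affected size through the inverted index, a set is re-scored only when it reaches the front of the queue. This is correct because |Si - C| can only shrink as C grows, so a stale queue entry is always an upper bound:

```python
import heapq

def greedy_lazy(sets):
    """Greedy with lazy priority-queue updates: stale entries are
    re-scored and pushed back when they surface."""
    covered, chosen = set(), []
    heap = [(-len(s), i) for i, s in sets.items()]  # max-heap via negation
    heapq.heapify(heap)
    while heap:
        stale, i = heapq.heappop(heap)
        gain = len(sets[i] - covered)               # recompute |Si - C|
        if gain == 0:
            continue                                # set is now useless
        if gain < -stale:                           # entry was out of date
            heapq.heappush(heap, (-gain, i))
            continue
        chosen.append(i)                            # score exact, hence maximal
        covered |= sets[i]
    return chosen
```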
Idea for our algorithm
Huge effort goes into finding the exact max |Si - C|
Instead, find a set close to the maximum uncovered size
If the chosen set is always at least a factor α × maximum:
– We get a 1 + (ln n)/α approximation algorithm
– Proof similar to that for greedy
We call it Disk-Friendly Greedy (DFG)
How to achieve this
Select a parameter p > 1: it governs both the approximation factor and the running time
Partition the sets into subcollections:
– Si goes in Zk if p^k ≤ |Si| < p^(k+1)
For k ← K down to 1:
– For each set Si in Zk:
  – If |Si - C| ≥ p^k: select Si and update C
  – Else: let Si ← Si - C and add it to the Zk' with p^k' ≤ |Si| < p^(k'+1)
For each Si in Z0: if it still has an uncovered item, select Si and update C
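A minimal in-memory sketch of this procedure in Python; the real algorithm streams each subcollection sequentially from disk, and the data layout and names here are illustrative:

```python
import math
from collections import defaultdict

def dfg(sets, p=1.05):
    """Disk-Friendly Greedy sketch: bucket sets into Zk with
    p**k <= |Si| < p**(k+1), then sweep buckets from largest to smallest."""
    covered, chosen = set(), []
    buckets = defaultdict(list)
    for i, s in sets.items():
        if s:
            buckets[int(math.log(len(s), p))].append((i, set(s)))
    for k in range(max(buckets, default=0), 0, -1):      # K down to 1
        for i, s in buckets.pop(k, []):
            s -= covered                                 # Si <- Si - C
            if len(s) >= p ** k:                         # still large: select
                chosen.append(i)
                covered |= s
            elif s:                                      # knock down to Zk'
                k2 = min(k - 1, int(math.log(len(s), p)))  # guard fp rounding
                buckets[k2].append((i, s))
    for i, s in buckets.pop(0, []):                      # Z0: smallest sets
        if s - covered:                                  # has uncovered item
            chosen.append(i)
            covered |= s
    return chosen
```

Smaller p makes the selection test stricter (closer to true greedy) at the price of more knock-down passes; larger p means fewer buckets and less work but a looser guarantee.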
Example DFG run
[Figure: the example instance processed by DFG with p = 2, using buckets for size ranges 4–7, 2–3, and 1; sets are either selected or knocked down into smaller buckets as their items become covered.]
In-memory Cost analysis
Each set Si is either selected or knocked down into a lower subcollection
A surviving set is guaranteed to shrink by a factor p every other pass
The total number of items a set contributes over all iterations is (1 + 1/(p-1))|Si|
So the running time is about 1 + 1/(p-1) times the input read time
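To make the factor concrete (an illustrative calculation, not on the slide): each knock-down moves a set into a bucket at least a factor p smaller, so the item counts one set contributes across iterations are bounded by a geometric series:

|Si| (1 + 1/p + 1/p^2 + ...) = |Si| · p/(p-1) = (1 + 1/(p-1)) |Si|

For p = 2 the factor is 2; even for p = 1.05 it is only 21.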
Disk model analysis
All file accesses are sequential!
Initial sweep through the input
Two passes for each subcollection:
– One when sets knocked down from higher subcollections are appended
– One to select or knock down its sets
With block size B and K subcollections:
– Disk accesses just to read the input: D = Σi |Si| / B
– DFG requires 2D[1 + 1/(p-1)] + 2K disk reads
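As a rough illustration (the specific numbers are assumed, not from the slides): with p = 2 the bound becomes 2D·2 + 2K = 4D + 2K, i.e. about four sequential sweeps of the input. Since K ≈ log_p(max |Si|) grows only logarithmically in the largest set size, the 2K term is negligible next to 4D on large inputs.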
Disk-based results
Tested on data from the Frequent Itemset Mining Dataset Repository
Results shown for kosarak (31 MB) and webdocs (1.4 GB)

kosarak.dat    time (s)    |Solution|
naive              8.51         20664
multipass        331.66         17746
greedy            98.66         17750
DFG                2.61         17748

webdocs.dat    time (s)    |Solution|
naive             91.21        433412
multipass             —             —
greedy                —             —
DFG               86.28        406440
Memory-based results
kosarak.dat    time (s)    |Solution|
naive              2.20         20664
multipass          4.21         17746
greedy             2.99         17750
DFG                1.97         17741

webdocs.dat    time (s)    |Solution|
naive            100.98        433412
multipass       8049.08        406381
greedy           199.02        406351
DFG               93.38        406338
Impact of p
RAM-based results for webdocs.dat:
– Improving the guaranteed accuracy (taking p closer to 1) only increases the running time by 50% (30 s)
– The observed solution size also improves, though not by as much
Summary
Noted the poor performance of greedy, especially on disk
Introduced an alternative algorithm to greedy:
– It has an approximation bound similar to greedy's
On each disk-resident dataset, our algorithm is 10× faster
On the largest instance, it is over 400× faster
The solution is essentially as good as greedy's
The disk version is almost as fast as the RAM version:
– Not disk-bound!