Data Driven Algorithm Design Maria-Florina (Nina) Balcan Carnegie - PowerPoint PPT Presentation

Data Driven Algorithm Design Maria-Florina (Nina) Balcan Carnegie Mellon University

Analysis and Design of Algorithms
Classic algo design: solve a worst-case instance.
• Easy domains have optimal poly-time algos. E.g., sorting, shortest paths.
• Most domains are hard. E.g., clustering, partitioning, subset selection, auction design, …
Data-driven algo design: use learning & data for algo design.
• Suited to settings where we repeatedly solve instances of the same algorithmic problem.

Data Driven Algorithm Design
Data-driven algo design: use learning & data for algo design.
• Different methods work better in different settings.
• Large family of methods: what’s best in our application?
Prior work: largely empirical.
• Artificial Intelligence: E.g., [Horvitz-Ruan-Gomes-Kautz-Selman-Chickering, UAI 2001] [Xu-Hutter-Hoos-LeytonBrown, JAIR 2008]
• Computational Biology: E.g., [DeBlasio-Kececioglu, 2018]
• Game Theory: E.g., [Likhodedov and Sandholm, 2004]

Data Driven Algorithm Design
Data-driven algo design: use learning & data for algo design.
• Different methods work better in different settings.
• Large family of methods: what’s best in our application?
Prior work: largely empirical.
Our work: data-driven algos with formal guarantees.
• Several case studies of widely used algo families.
• General principles: push boundaries of algo design and ML.
Related to: hyperparameter tuning, AutoML, meta-learning; program synthesis (Sumit Gulwani’s talk on Mon).

Structure of the Talk
• Data-driven algo design as batch learning.
  • A formal framework.
  • Case studies: clustering, partitioning pbs, auction pbs.
  • General sample complexity theorem.
• Data-driven algo design as online learning.

Example: Clustering Problems
Clustering: given a set of objects, organize them into natural groups.
• E.g., cluster news articles, or web pages, or search results by topic.
• Or, cluster customers according to purchase history.
• Or, cluster images by who is in them.
Often need to solve such problems repeatedly.
• E.g., clustering news articles (Google News).

Example: Clustering Problems
Clustering: given a set of objects, organize them into natural groups.
Objective-based clustering
• k-means. Input: set of objects S, distance d. Output: centers {c_1, c_2, …, c_k} to minimize $\sum_{p} \min_{i} d^2(p, c_i)$.
• k-median: minimize $\sum_{p} \min_{i} d(p, c_i)$.
• k-center/facility location: minimize the maximum radius.
Finding OPT is NP-hard, so there is no universal efficient algo that works on all domains.
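The three objectives above can be made concrete with a small sketch. It assumes points in the plane with Euclidean distance for d; the function names and the toy data are illustrative and do not come from the slides.

```python
import math

def nearest_distances(points, centers):
    """Distance from each point to its closest center (Euclidean d is an assumption)."""
    return [min(math.dist(p, c) for c in centers) for p in points]

def kmeans_cost(points, centers):
    # Sum over points of the squared distance to the nearest center.
    return sum(d * d for d in nearest_distances(points, centers))

def kmedian_cost(points, centers):
    # Sum over points of the distance to the nearest center.
    return sum(nearest_distances(points, centers))

def kcenter_cost(points, centers):
    # Largest distance from any point to its nearest center (the maximum radius).
    return max(nearest_distances(points, centers))

# Toy instance (hypothetical data): two well-separated groups, k = 2.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 4.0)]
centers = [(0.0, 1.0), (10.0, 2.0)]

print(kmeans_cost(points, centers))   # 1 + 1 + 4 + 4
print(kmedian_cost(points, centers))  # 1 + 1 + 2 + 2
print(kcenter_cost(points, centers))  # largest of 1, 1, 2, 2
```

The same set of centers scores differently under each objective, which is one reason no single clustering algorithm is best across applications.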

Algorithm Design as Distributional Learning
Goal: given a family of algos F and a sample of typical instances from the domain (unknown distr. D), find an algo that performs well on new instances from D.
[Figure: a large family F of algorithms (e.g., MST + dynamic programming, greedy + farthest location) next to a sample of typical inputs (Input 1, Input 2, …, Input N), shown for clustering and for facility location.]

Sample Complexity of Algorithm Selection
Goal: given a family of algos F and a sample of typical instances from the domain (unknown distr. D), find an algo that performs well on new instances from D.
Approach: ERM; find Â, a near-optimal algorithm over the set of samples.
Key question: will Â do well on future instances?
Sample complexity: how large should our sample of typical instances be in order to guarantee good performance on new instances?
[Figure: seen instances vs. a new instance.]
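A minimal sketch of the ERM step, assuming the family is indexed by a single real parameter searched over a finite grid, and that lower cost is better. The cost function here is a toy stand-in and is not one of the algorithm families from the talk.

```python
from statistics import mean

def erm_select(params, instances, cost):
    """Return the parameter whose average cost over the sample is smallest."""
    return min(params, key=lambda a: mean(cost(a, x) for x in instances))

def toy_cost(a, instance):
    # Hypothetical cost: how far parameter a is from the instance's ideal value.
    return abs(a - instance)

sample = [0.25, 0.28, 0.30, 0.31, 0.35]   # hypothetical "typical instances"
grid = [i / 10 for i in range(11)]        # candidate parameters 0.0, 0.1, ..., 1.0

a_hat = erm_select(grid, sample, toy_cost)
print(a_hat)
```

Whether a_hat also does well on unseen instances is the sample complexity question: it depends on how the sample size compares with the complexity of the family.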

Sample Complexity of Algorithm Selection
Goal: given a family of algos F and a sample of typical instances from the domain (unknown distr. D), find an algo that performs well on new instances from D.
Approach: ERM; find Â, a near-optimal algorithm over the set of samples.
Key tools from learning theory
• Uniform convergence: for any algo in F, average performance over samples is “close” to its expected performance.
• This implies that Â has high expected performance.
• $N = O(\dim(F)/\epsilon^2)$ instances suffice for $\epsilon$-close.
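The step from uniform convergence to the guarantee for Â is the standard ERM argument, spelled out here. It is written for costs (lower is better), which is an assumption since the slide speaks only of "performance". The symbols cost_S, cost_D and A* are notation introduced for this sketch.

```latex
% Notation (assumed for this sketch):
%   cost_S(A) = (1/N) \sum_{x \in S} cost(A, x)   average cost on the sample S
%   cost_D(A) = E_{x \sim D}[cost(A, x)]          expected cost on new instances
%   A^* = argmin_{A \in F} cost_D(A),  \hat{A} = argmin_{A \in F} cost_S(A)
% Uniform convergence: |cost_S(A) - cost_D(A)| \le \epsilon for every A in F.
\[
\mathrm{cost}_D(\hat{A})
\;\le\; \mathrm{cost}_S(\hat{A}) + \epsilon
\;\le\; \mathrm{cost}_S(A^*) + \epsilon
\;\le\; \mathrm{cost}_D(A^*) + 2\epsilon .
\]
```

The first and last inequalities use uniform convergence, and the middle one uses the fact that Â minimizes the average cost on the sample.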

Sample Complexity of Algorithm Selection
Goal: given a family of algos F and a sample of typical instances from the domain (unknown distr. D), find an algo that performs well on new instances from D.
Key tools from learning theory
• $N = O(\dim(F)/\epsilon^2)$ instances suffice for $\epsilon$-close.
• dim(F) (e.g., pseudo-dimension): ability of fns in F to fit complex patterns.
• The more complex the patterns F can fit, the more samples are needed for uniform convergence (UC) and generalization.

Sample Complexity of Algorithm Selection
Goal: given a family of algos F and a sample of typical instances from the domain (unknown distr. D), find an algo that performs well on new instances from D.
Key tools from learning theory
• $N = O(\dim(F)/\epsilon^2)$ instances suffice for $\epsilon$-close.
• dim(F) (e.g., pseudo-dimension): ability of fns in F to fit complex patterns.
[Figure: overfitting on a training set $x_1, \dots, x_7$ (vertical axis y).]

Statistical Learning Approach to AAD
Challenge: “nearby” algos can have drastically different behavior.
[Figure: example plots of performance as a function of a parameter: IQP objective value vs. s; a plot over α ∈ ℝ; revenue vs. reserve r, with the 2nd highest bid and the highest bid marked; revenue vs. price.]
Challenge: design a computationally efficient meta-algorithm.
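The reserve-price example shows how a tiny change in a parameter can change performance drastically. The sketch below assumes a single-item second-price auction with at least one bid. The function name and bid values are illustrative.

```python
def second_price_revenue(bids, reserve):
    """Revenue of a single-item second-price auction with a reserve price.

    Assumes `bids` is non-empty. The winner pays the larger of the reserve and
    the second-highest bid; if no bid meets the reserve, the item goes unsold.
    """
    ordered = sorted(bids, reverse=True)
    highest = ordered[0]
    second = ordered[1] if len(ordered) > 1 else 0.0
    if highest < reserve:
        return 0.0
    return max(second, reserve)

bids = [10, 6, 3]  # hypothetical bids
for r in (5, 8, 10, 10.01):
    print(r, second_price_revenue(bids, r))
```

Revenue is flat at the second-highest bid, then grows with the reserve, then drops to zero as soon as the reserve exceeds the highest bid. That jump is why nearby parameters cannot be assumed to perform similarly.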

Algorithm Design as Distributional Learning
Prior work: [Gupta-Roughgarden, ITCS’16 & SICOMP’17] proposed the model; analyzed greedy algos for subset selection pbs (knapsack & independent set).
Our results: new algorithm classes for a wide range of problems.
• Clustering: parametrized linkage [Balcan-Nagarajan-Vitercik-White, COLT 2017] [Balcan-Dick-Lang, 2019]; dim(F) = O(log n).
  DATA → linkage (single linkage, complete linkage, α-weighted comb, …, Ward’s alg) → pruning (DP for k-means, DP for k-median, DP for k-center) → CLUSTERING
• Clustering: parametrized Lloyd’s [Balcan-Dick-White, NeurIPS 2018]; dim(F) = O(k log n).
  DATA → seeding (random seeding, farthest-first traversal, …, kmeans++, $D^{\alpha}$ sampling) → local search ($L_2$-local search, β-local search) → CLUSTERING
• Alignment pbs (e.g., string alignment): parametrized dynamic prog. [Balcan-DeBlasio-Dick-Kingsford-Sandholm-Vitercik, 2019]

Algorithm Design as Distributional Learning
Our results: new algo classes applicable to a wide range of pbs.
• Partitioning pbs via IQPs: SDP + rounding [Balcan-Nagarajan-Vitercik-White, COLT 2017]; dim(F) = O(log n). E.g., Max-Cut, Max-2SAT, Correlation Clustering.
  Integer Quadratic Programming (IQP) → Semidefinite Programming Relaxation (SDP) → rounding (GW rounding, 1-linear rounding, …, s-linear rounding, …) → feasible solution to IQP
• Automated mechanism design [Balcan-Sandholm-Vitercik, EC 2018]: generalized parametrized VCG auctions, posted prices, lotteries.

Algorithm Design as Distributional Learning
Our results: new algo classes applicable to a wide range of pbs.
• Branch and bound techniques for solving MIPs [Balcan-Dick-Sandholm-Vitercik, ICML’18]
MIP: max $c \cdot x$ s.t. $Ax = b$, $x_i \in \{0,1\}$ for all $i \in I$.
MIP instance: max $(40, 60, 10, 10, 30, 20, 60) \cdot x$ s.t. $(40, 50, 30, 10, 10, 40, 30) \cdot x \le 100$, $x \in \{0,1\}^7$.
Steps of branch and bound:
• Choose a leaf of the search tree: best-bound, depth-first.
• Choose a variable to branch on: α-linear, product, most fractional.
• Fathom if possible and terminate if possible.
[Figure: branch-and-bound search tree for the instance above, with the LP relaxation solution and bound shown at each node.]

Clustering Problems
Clustering: given a set of objects (news articles, customer surveys, web pages, …), organize them into natural groups.
Objective-based clustering
• k-means. Input: set of objects S, distance d. Output: centers {c_1, c_2, …, c_k} to minimize $\sum_{p} \min_{i} d^2(p, c_i)$.
• Or minimize distance to ground truth.

Clustering: Linkage + Post-processing
Family of poly-time 2-stage algorithms [Balcan-Nagarajan-Vitercik-White, COLT 2017]:
1. Greedy linkage-based algo to get a hierarchy (tree) of clusters.
2. Fixed algo (e.g., DP or last k merges) to select a good pruning.
[Figure: cluster tree over points A, B, C, D, E, F with internal nodes AB, DE, ABC, DEF, shown alongside a pruning of the tree.]

Clustering: Linkage + Post-processing
1. Linkage-based algo to get a hierarchy.
2. Post-processing to identify a good pruning.
Both steps can be done efficiently.
DATA → linkage (single linkage, complete linkage, α-weighted comb, …, Ward’s algo) → pruning (DP for k-means, DP for k-median, DP for k-center) → CLUSTERING

Linkage Procedures for Hierarchical Clustering
Bottom-up (agglomerative)
• Start with every point in its own cluster.
• Repeatedly merge the “closest” two clusters.
Different defs of “closest” give different algorithms.
[Figure: topic hierarchy: All topics → sports (tennis, soccer), fashion (Lacoste, Gucci).]
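A minimal sketch of the bottom-up procedure, assuming Euclidean distance between points and stopping when k clusters remain (the slides build a full hierarchy and prune it afterwards). The helper names and the toy data are illustrative.

```python
import math
from itertools import combinations

def single_link(a, b):
    # "Closest" = smallest distance between any pair of points across clusters.
    return min(math.dist(x, y) for x in a for y in b)

def complete_link(a, b):
    # "Closest" = smallest value of the largest cross-cluster distance.
    return max(math.dist(x, y) for x in a for y in b)

def agglomerate(points, linkage, k):
    """Start from singletons and merge the closest pair until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)] + [merged]
    return clusters

# Hypothetical 1-D data on which the two definitions of "closest" disagree.
points = [(0.0,), (1.0,), (2.5,), (4.5,)]
print(agglomerate(points, single_link, 2))
print(agglomerate(points, complete_link, 2))
```

On this data single linkage chains 2.5 onto {0, 1}, while complete linkage pairs 2.5 with 4.5, so the choice of linkage changes the final clustering.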

Linkage Procedures for Hierarchical Clustering
Have a distance measure on pairs of objects: d(x, y) is the distance between x and y. E.g., # keywords in common, edit distance, etc.
• Single linkage: $\mathrm{dist}(A, B) = \min_{x \in A,\, x' \in B} d(x, x')$
• Complete linkage: $\mathrm{dist}(A, B) = \max_{x \in A,\, x' \in B} d(x, x')$
• Parametrized family, α-weighted linkage: $\mathrm{dist}_{\alpha}(A, B) = (1 - \alpha) \min_{x \in A,\, x' \in B} d(x, x') + \alpha \max_{x \in A,\, x' \in B} d(x, x')$
[Figure: topic hierarchy: All topics → sports (tennis, soccer), fashion (Lacoste, Gucci).]
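The α-weighted linkage can be written directly from its definition. This sketch assumes Euclidean distance for d; the function name and test clusters are illustrative.

```python
import math

def alpha_linkage(a, b, alpha):
    """(1 - alpha) * min cross-cluster distance + alpha * max cross-cluster distance."""
    dists = [math.dist(x, y) for x in a for y in b]
    return (1 - alpha) * min(dists) + alpha * max(dists)

# Hypothetical clusters on a line: closest pair is 2 apart, farthest pair is 5 apart.
A = [(0.0,), (1.0,)]
B = [(3.0,), (5.0,)]

print(alpha_linkage(A, B, 0.0))  # alpha = 0 recovers single linkage
print(alpha_linkage(A, B, 1.0))  # alpha = 1 recovers complete linkage
print(alpha_linkage(A, B, 0.5))  # halfway between the two
```

Sweeping α from 0 to 1 interpolates between single and complete linkage, which is what makes the family a natural target for data-driven parameter selection.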
