Statistical Algorithmic Profiling for Randomized Approximate Programs
Keyur Joshi, Vimuth Fernando, Sasa Misailovic University of Illinois at Urbana-Champaign ICSE 2019
CCF-1629431 CCF-1703637
Randomized Approximate Algorithms
Modern applications deal with large amounts of data. Obtaining exact answers for such applications is resource intensive. Approximate algorithms give a “good enough” answer in a much more efficient manner.
Randomized approximate algorithms have attracted significant attention from researchers, but developers still struggle to properly test implementations of these algorithms.
Example: Locality Sensitive Hashing (LSH) finds vectors near a given vector in high-dimensional space. LSH randomly chooses some locality-sensitive hash functions in every run. Locality sensitive means that nearby vectors are more likely to have the same hash. Since every run uses different hash functions, the output can vary.
[Figure: two example image vectors hashed with functions h_1, h_2, h_3; both hash to the same values, so LSH considers them neighbors]
Suppose, over 100 runs, an LSH implementation considered the images similar 90 times. Is this the expected behavior? Usually, algorithm designers state the expected behavior by providing an accuracy specification. We wish to ensure that the implementation satisfies the accuracy specification.
Correct LSH implementations consider two vectors a and b to be neighbors, over runs, with probability

  p_sim = 1 - (1 - p_ab^k)^l

p_sim depends on:
- the algorithm parameters k and l
- p_ab, the collision probability for a single hash function, which depends on the distance between a and b (part of the specification)
*P. Indyk and R. Motwani, “Approximate nearest neighbors: Towards removing the curse of dimensionality,” in STOC 1998
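This neighbor-probability formula can be computed directly. The helper below is a minimal sketch, where p_ab, k, and l stand for the quantities in the specification:

```python
def expected_p_sim(p_ab, k, l):
    """Probability that LSH reports a pair as neighbors:
    a pair collides in one hash table iff all k hashes match (p_ab**k),
    and it is reported iff it collides in at least one of l tables."""
    return 1 - (1 - p_ab**k) ** l

# For example, with a single-hash collision probability of 0.5:
print(expected_p_sim(0.5, k=4, l=10))
```

This is the reference value an observed neighbor frequency must be tested against.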
The output can vary in every run due to different hash functions, so we need to run LSH multiple times to observe the value of p_sim. We then need to compare the expected and observed values of p_sim. The values may not be exactly the same; how close must they be? We need an appropriate statistical test for such a comparison.
To test manually, the developer must provide:
- Algorithm Parameters (for LSH: range of k, l values)
- Appropriate Statistical Test
- Multiple Test Inputs
- Implementation Runner
- Number of Times to Run LSH
- Visualization Script

To test with AxProf, the developer must provide only:
- Accuracy / Performance Specification (math notation)
- Input and Output Types (for LSH: list of vectors; in general: vectors / matrices / maps)
AxProf automatically handles the rest: algorithm parameters, statistical tests, test input generation, implementation runs, number of samples (runs / inputs), and visualization.
Math Specification: A vector pair (a, b) appears in the output if LSH considers them neighbors. This should occur with probability p_sim = 1 - (1 - p_ab^k)^l.
AxProf specification:
Input list of (vector of real);
Output list of (pair of (vector of real));
forall a in Input, b in Input :
  Probability over runs [ [a, b] in Output ] == 1 - (1 - (p_ab(a, b)) ^ k) ^ l
p_ab is a helper function that calculates p_ab, the probability that vectors a and b collide under a single hash function.
TarsosLSH is a popular (150 stars) LSH implementation in Java available on GitHub*. It includes a (faulty) benchmark which runs LSH once and reports accuracy. AxProf found a fault not detected by the benchmark; the fault is present in one hash function for the ℓ1 distance metric.
*https://github.com/JorenSix/TarsosLSH
[Chart (AxProf): each point represents a pair of neighboring vectors and should ideally lie along the diagonal; observed probabilities are obtained by running TarsosLSH multiple times, expected probabilities are obtained from the specification]
We found and fixed 3 faults and ran AxProf again.
[Chart (AxProf): the implementation still contains 1 subtle fault; visual analysis is not sufficient!]
The AxProf specification language handles a wide variety of algorithm specifications, and AxProf specifications appear very similar to their mathematical counterparts.
Expressive: supports probability and expectation properties over inputs, runs, and input items, with universal quantification.
Unambiguous: each construct fixes exactly how samples are drawn, as the following examples illustrate.
Probability over inputs [Output > 25] == 0.1
  Multiple inputs: input_1, input_2, input_3, ..., input_n
  Algorithm, one run: seed_1
  Multiple outputs: output_1, output_2, output_3, ..., output_n
  10% of the outputs must be > 25
Probability over runs [Output > 25] == 0.1
  One input: input_1
  Algorithm, multiple runs: seed_1, seed_2, seed_3, ..., seed_n
  Multiple outputs: output_1, output_2, output_3, ..., output_n
  10% of the outputs must be > 25
Probability over i in Input [Output[i] > 25] == 0.1
  One input with multiple items: i_1, i_2, i_3, ..., i_k
  Algorithm, one run: seed_1
  One output with multiple items: output[i_1], output[i_2], output[i_3], ..., output[i_k]
  10% of the output items must be > 25
Expectation over inputs [Output] == 100
Expectation over runs [Output] == 100
Expectation over i in Input [Output[i]] == 100
forall i in Input: Probability over runs [Output[i] > 25] == 0.1
  One input with multiple items: i_1, i_2, ..., i_k
  Algorithm, multiple runs: seed_1, seed_2, ..., seed_n
  Multiple outputs with multiple items: output_{1..n}[i_1], output_{1..n}[i_2], ..., output_{1..n}[i_k]
  For each item, the outputs across runs (output_1[i_1], ..., output_n[i_1]) are grouped together
  10% of the outputs for every input item must be > 25
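To make the "Probability over runs" interpretation concrete, here is a toy estimator (not AxProf code): it fixes one input, reruns a hypothetical randomized program with fresh randomness, and measures the fraction of outputs satisfying the predicate. The noisy_algorithm stand-in is an assumption for illustration only.

```python
import random

def noisy_algorithm(x, rng):
    # Hypothetical randomized program: output is the input plus
    # Gaussian noise with standard deviation 5.
    return x + rng.gauss(0, 5)

def probability_over_runs(algorithm, x, predicate, runs=10_000, seed=0):
    """Estimate  Probability over runs [predicate(Output)]  by rerunning
    the algorithm on one fixed input, one seed per run."""
    rng = random.Random(seed)
    hits = sum(predicate(algorithm(x, rng)) for _ in range(runs))
    return hits / runs

# Estimate Probability over runs [Output > 25] for input 20.
est = probability_over_runs(noisy_algorithm, 20, lambda out: out > 25)
print(est)  # close to Pr[N(20, 5) > 25] = Pr[Z > 1] ≈ 0.16
```

Swapping the loop to iterate over inputs (one run each) or over input items yields the other two interpretations.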
AxProf generates code to fully automate specification testing:
1. Generate inputs with varying properties
2. Gather outputs of the program from multiple runs/inputs
3. Test the outputs against the specification with a statistical test
4. Combine the results of multiple statistical tests, if required
5. Interpret the final combined result (PASS/FAIL)
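The five steps can be sketched as a small driver loop. Everything here (the callbacks, the toy coin-flip program, the tolerance) is a hypothetical stand-in, not AxProf's actual generated code:

```python
import random
import statistics

def driver(generate_input, run_implementation, check_spec,
           combine, n_inputs=5, n_runs=100):
    """Toy version of the generated test driver:
    1. generate inputs, 2. gather outputs over runs,
    3. test each input's outputs against the spec,
    4-5. combine per-input results into a final PASS/FAIL."""
    results = []
    for i in range(n_inputs):
        inp = generate_input(random.Random(i))                   # step 1
        outs = [run_implementation(inp) for _ in range(n_runs)]  # step 2
        results.append(check_spec(inp, outs))                    # step 3
    return "PASS" if combine(results) else "FAIL"                # steps 4-5

# Toy usage: a "coin flip" program whose (hypothetical) spec says it
# outputs True with probability 0.5, regardless of the input.
random.seed(42)
verdict = driver(
    generate_input=lambda rng: rng.random(),
    run_implementation=lambda inp: random.random() < 0.5,
    check_spec=lambda inp, outs: abs(statistics.mean(outs) - 0.5) <= 0.2,
    combine=all)
print(verdict)
```

In the real tool, check_spec and combine are proper statistical tests rather than a fixed tolerance.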
AxProf accuracy specification for LSH:
forall a in Input, b in Input :
  Probability over runs [[a, b] in Output] == 1 - (1 - (p_ab(a,b))^k)^l
AxProf must compare the expected and observed neighbor probabilities for every pair a, b in the input, then combine the results of each comparison into a single result. AxProf uses the non-parametric binomial test for each probability comparison.
For forall, AxProf combines individual statistical tests using Fisher’s method
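A self-contained sketch of this combination step is below. It uses the normal approximation to the binomial test (AxProf itself uses an exact binomial test) and the closed form of the chi-square survival function for even degrees of freedom, so it needs only the standard library:

```python
import math

def binomial_test_p(successes, n, p0):
    """Two-sided test of observed successes/n against probability p0,
    via the normal approximation to the binomial (an approximation of
    the exact binomial test used by AxProf)."""
    mean = n * p0
    sd = math.sqrt(n * p0 * (1 - p0))
    z = (successes - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))

def fishers_method(p_values):
    """Combine k independent p-values: X = -2*sum(ln p) follows a
    chi-square distribution with 2k degrees of freedom; for even
    degrees of freedom the survival function has a closed form."""
    x = -2 * sum(math.log(p) for p in p_values)
    half = x / 2
    term, total = 1.0, 1.0
    for i in range(1, len(p_values)):  # sum_{i=0}^{k-1} half**i / i!
        term *= half / i
        total += term
    return math.exp(-half) * total

# E.g. one pair observed as neighbors 90/100 runs and another 78/100,
# both tested against an expected probability of 0.8:
p1 = binomial_test_p(90, 100, 0.8)
p2 = binomial_test_p(78, 100, 0.8)
combined = fishers_method([p1, p2])
print(p1, p2, combined)
```

A small combined p-value indicates that at least one pair deviates from its specified probability.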
Number of runs for the binomial test depends on desired level of confidence:
Formula for calculating the number of runs:

  n = ( ( z_{1-α} * sqrt(p_0 * (1 - p_0)) + z_{1-β} * sqrt(p_a * (1 - p_a)) ) / δ )^2

where p_0 is the probability stated in the specification, p_a = p_0 ± δ is the alternative probability to be distinguished from it, z is the standard normal quantile, α is the significance level, β is the allowed type II error rate, and δ is the minimum difference to detect.
We choose α = 0.05, β = 0.2, δ = 0.1 (commonly used values).
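A minimal sketch of this run-count calculation, using statistics.NormalDist for the normal quantile; the clamping of p_a at 1 is an assumption of this sketch:

```python
from math import ceil, sqrt
from statistics import NormalDist

def runs_needed(p0, alpha=0.05, beta=0.2, delta=0.1):
    """Number of runs so a binomial test at significance alpha can
    detect a deviation of delta from the specified probability p0
    with power 1 - beta."""
    pa = min(p0 + delta, 1.0)  # alternative hypothesis probability
    z_a = NormalDist().inv_cdf(1 - alpha)
    z_b = NormalDist().inv_cdf(1 - beta)
    n = ((z_a * sqrt(p0 * (1 - p0)) + z_b * sqrt(pa * (1 - pa))) / delta) ** 2
    return ceil(n)

# With the chosen values, a specification probability of 0.5 needs:
print(runs_needed(0.5))
```

As expected, shrinking delta (a finer difference to detect) sharply increases the required number of runs.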
Input list of (vector of real);
forall a in Input, b in Input :
  Probability over runs [[a, b] in Output] == 1 - (1 - (p_ab(a,b))^k)^l
There is an implicit requirement that this specification should be satisfied for every input. AxProf provides flexible input generators for various input types.
For LSH, AxProf can generate a list of input vectors with adjustable properties.
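For example, a flexible generator for LSH-style inputs might look like the following sketch. The property knobs here (count, dimensionality, cluster count, spread) are illustrative assumptions, not AxProf's actual generator API:

```python
import random

def generate_vectors(rng, count=100, dim=16, clusters=1, spread=1.0):
    """Generate `count` vectors of dimension `dim`. With clusters > 1,
    vectors are drawn around random cluster centers so that some pairs
    are near neighbors, which exercises the p_sim specification."""
    centers = [[rng.uniform(-10, 10) for _ in range(dim)]
               for _ in range(clusters)]
    vectors = []
    for _ in range(count):
        center = rng.choice(centers)
        vectors.append([x + rng.gauss(0, spread) for x in center])
    return vectors

vecs = generate_vectors(random.Random(0), count=10, dim=4, clusters=2)
print(len(vecs), len(vecs[0]))
```

Varying such properties across generated inputs is what lets the tool relate input characteristics to accuracy.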
AxProf determines which input properties affect the accuracy of the algorithm using the Maximal Information Coefficient (MIC)*.
*See paper for more details
The AxProf language also supports time and memory specifications.
Time specification for LSH: asymptotic notation O(k*l*n); in AxProf: k*l*size(Input)
Memory specification for LSH: asymptotic notation O(l*n); in AxProf: l*size(Input)
Like accuracy specifications, AxProf tests performance specifications via statistical tests.
AxProf gathers performance data across multiple runs and algorithm parameter values. AxProf fits a curve and compares it to the specification (like algorithmic profilers*). To check for conformance it uses the R² metric: if R² is lower than a threshold, AxProf reports a failure.
[Chart: expected time complexity O(log n) vs. the fitted curve]
*D. Zaparanuks and M. Hauswirth, “Algorithmic profiling,” and E. Coppa et al., “Input-sensitive profiling,” both in PLDI 2012.
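A toy version of this conformance check can be written with ordinary least squares: fit the curve shape implied by the specification to the measurements and compute R². The synthetic timing data below is an assumption for illustration:

```python
import math

def fit_and_r2(ns, times, basis):
    """Least-squares fit of times ≈ a*basis(n) + b, returning R².
    High R² means the measurements conform to the curve shape implied
    by the specification; low R² signals a performance failure."""
    xs = [basis(n) for n in ns]
    mean_x = sum(xs) / len(xs)
    mean_t = sum(times) / len(times)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxt = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, times))
    a = sxt / sxx
    b = mean_t - a * mean_x
    ss_res = sum((t - (a * x + b)) ** 2 for x, t in zip(xs, times))
    ss_tot = sum((t - mean_t) ** 2 for t in times)
    return 1 - ss_res / ss_tot

# Synthetic timings that really do grow as log n:
ns = [2 ** i for i in range(4, 14)]
times = [3.0 * math.log(n) + 1.0 for n in ns]
r2 = fit_and_r2(ns, times, math.log)
print(r2)
```

Fitting the wrong basis (say, a linear one) to the same data yields a visibly lower R², which is exactly the signal used to report a failure.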
Evaluation questions:
- Can AxProf detect accuracy faults in algorithm implementations?
- Which input properties affect algorithm accuracy?
- Can AxProf detect performance faults in algorithm implementations?
Algorithms evaluated:
- 5 big data algorithms: Locality Sensitive Hashing (LSH), Bloom Filter, Count-Min Sketch, HyperLogLog, Reservoir Sampling
- 1 approximate numerical computation algorithm: Approximate Matrix Multiply
- 3 algorithms running on imprecise hardware: Chisel/blackscholes, Chisel/sor, Chisel/scale
Each parameter can take multiple values. We chose the ranges of parameter values to test based on algorithm author recommendations. A particular combination of parameter values is an algorithm configuration.
Algorithm | Algorithm Parameters | Accuracy Specification Type
Locality Sensitive Hashing (LSH) | k and l values | Probability over runs with universal quantification
Bloom Filter | Capacity and maximum false positive probability | Probability over input items
Count-Min Sketch | Error factor and error probability | Probability over input items
HyperLogLog | Number of hash values | Probability over inputs
Reservoir Sampling | Reservoir size | Probability over runs with universal quantification
Approximate Matrix Multiply | Sampling rate | Probability over runs
Chisel/blackscholes | Reliability factor | Probability over runs
Chisel/sor | Reliability factor and no. of iterations | Probability over runs
Chisel/scale | Reliability factor and scale factor | Expectation over runs
Implementations were selected from GitHub (except Chisel):

Algorithm | Implementations
Locality Sensitive Hashing | TarsosLSH, java-LSH
Bloom Filter | libbf, BloomFilter
Count-Min Sketch | alabid, awnystrom
HyperLogLog | yahoo, ekzhu
Reservoir Sampling | yahoo, sample
Matrix Multiplication | RandMatrix, mscs
blackscholes | Chisel
sor | Chisel
scale | Chisel
AxProf detected statistical test failures in six implementations. After manual inspection, we found faults in five implementations, and one false positive (*) for ekzhu HyperLogLog.
Implementation | Tested Configurations | Configurations w/ Accuracy Failures
TarsosLSH | 12 | 12
java-LSH | 4 | 4
libbf | 60 | 0
BloomFilter | 60 | 0
alabid | 90 | 90
awnystrom | 90 | 81
yahoo (HyperLogLog) | 40 | 0
ekzhu | 40 | 2*
yahoo (Reservoir) | 100 | 0
sample | 100 | 0
RandMatrix | 243 | 30
mscs | 16 | 0
Chisel (blackscholes) | 3 | 0
Chisel (sor) | 108 | 0
Chisel (scale) | 20 | 0
We submitted a pull request for each faulty implementation. Four pull requests were accepted; one is still pending.
Developer feedback: “Hi, I am the creator of TarsosLSH and I have just seen your paper, especially the parts relevant to TarsosLSH… I would like to thank you for your work and for the well documented merge requests.”
The correctness of AxProf depends on the correctness of the specification. Some specifications fail to capture fine details, which may cause failures in AxProf’s statistical tests for specific inputs. For example, HyperLogLog applies error correction if the output is below a certain threshold; AxProf found failures when the output size is very close to the threshold.
Algorithm | Implementation | Tested Configurations | Configurations w/ Accuracy Failures | Time/Memory
Locality Sensitive Hashing | TarsosLSH | 12 | 12 | Pass
Locality Sensitive Hashing | java-LSH | 4 | 4 | Pass
Bloom Filter | libbf | 60 | 0 | Fail
Bloom Filter | BloomFilter | 60 | 0 | Fail
Count-Min Sketch | alabid | 90 | 90 | Pass
Count-Min Sketch | awnystrom | 90 | 81 | Fail†
HyperLogLog | yahoo | 40 | 0 | Fail
HyperLogLog | ekzhu | 40 | 2* | Pass
Reservoir Sampling | yahoo | 100 | 0 | Fail
Reservoir Sampling | sample | 100 | 0 | Fail†
Matrix Multiplication | RandMatrix | 243 | 30 | Pass
Matrix Multiplication | mscs | 16 | 0 | Pass
blackscholes | Chisel | 3 | 0 | Pass
sor | Chisel | 108 | 0 | Pass
scale | Chisel | 20 | 0 | Pass
†False positives: measurement noise
Manually testing implementations of randomized approximate algorithms is tedious and error-prone; AxProf alleviates these difficulties via an easy-to-use framework.
AxProf is a tool for accuracy and performance profiling. It automates many tasks for testing the implementations of emerging randomized and approximate algorithms. With AxProf, we found five faulty implementations from a set of 15 implementations. Check out AxProf at axprof.org.