
SparkFuzz: Searching Correctness Regressions in Modern Query Engines
Bogdan Ghit, Nicolas Poggi, Josh Rosen, Reynold Xin, and Peter Boncz*
June 19 - DBTest 2020

Unified Data Analytics Platform
▪ Users: data engineers, data scientists, ML engineers, data analysts
▪ Layers: Data Science Workspace, Unified Data Service, Enterprise Cloud Service

Introduction: Apache Spark
▪ Fast and expressive data processing engine
  ▪ distributed computing
  ▪ rich APIs, including SQL
  ▪ large community
▪ Started at UC Berkeley in 2009
  ▪ 2010 - open sourced
  ▪ 2014 - top level project
  ▪ 2020 - v3 released (10 years!)
▪ June 2020: v3.0.0 released, 3500+ resolved tickets

SparkFuzz proposal
1. Leverage fuzz testing techniques
   a. to complement SQL query testing
   b. to automate bug discovery
2. Design of a toolkit for SQL engines
   a. A model for randomized DDL, data, and queries
   b. A runner and evaluator
3. Applicability of coverage metrics
   a. as test stop gaps
   b. reducing time (and costs)
   c. enabling more testing dimensions
Diagram: SparkFuzz runs against the SUT (dev) and a test oracle (stable).

DDL and data generation
Automated dataset generation
▪ by randomly sampling supported data types and parameter ranges
▪ Producing valid schemas
▪ Populating datasets
Randomized dimensions
▪ Random number of tables
▪ Random number of columns
▪ Random partition columns
▪ Random number of rows
▪ Choose a data type: String, Boolean, BigInt, Decimal, SmallInt, Integer, Float, Timestamp, ...
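The schema generation step can be illustrated with a small sketch. This is not the SparkFuzz implementation: the type catalog, the column naming scheme, the limits (`max_columns`, at most two partition columns) and the `USING parquet` clause are all assumptions made for the example.

```python
import random

# Hypothetical type catalog; the names follow the slide, the ranges are assumed.
DATA_TYPES = ["STRING", "BOOLEAN", "BIGINT", "DECIMAL",
              "SMALLINT", "INT", "FLOAT", "TIMESTAMP"]


def random_type(rng):
    """Pick a data type; DECIMAL also gets a random precision and scale."""
    name = rng.choice(DATA_TYPES)
    if name == "DECIMAL":
        precision = rng.randint(1, 38)
        scale = rng.randint(0, precision)
        return f"DECIMAL({precision},{scale})"
    return name


def random_table(rng, index, max_columns=8):
    """Return (table_name, columns, partition_columns) for one random table."""
    n_cols = rng.randint(1, max_columns)
    columns = []
    for i in range(n_cols):
        col_type = random_type(rng)
        base = col_type.split("(")[0].lower()
        columns.append((f"{base}_col_{i}", col_type))
    # Keep at least one non-partition column so the schema stays valid.
    n_part = rng.randint(0, min(2, n_cols - 1))
    partition_columns = [name for name, _ in rng.sample(columns, n_part)]
    return f"table_{index}", columns, partition_columns


def to_ddl(table):
    """Render a table description as a CREATE TABLE statement."""
    name, columns, partition_columns = table
    col_list = ", ".join(f"{col} {typ}" for col, typ in columns)
    ddl = f"CREATE TABLE {name} ({col_list}) USING parquet"
    if partition_columns:
        ddl += f" PARTITIONED BY ({', '.join(partition_columns)})"
    return ddl


rng = random.Random(42)  # fixed seed so a failing schema can be reproduced
tables = [random_table(rng, i) for i in range(rng.randint(1, 5))]
ddl_statements = [to_ddl(t) for t in tables]
```

Seeding the generator is the main design choice here: a run that exposes a bug can be replayed exactly by reusing the seed.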

Recursive query model w/ a probabilistic profile
Operators and features annotated with:
▪ Independent weights
  ▪ Optional clauses
▪ Inter-dependent weights
  ▪ Join types
  ▪ Select functions
Diagram: SQL Query → Query → Clause (WITH, SELECT, FROM, JOIN, WHERE, GROUP BY, ORDER BY, UNION; some clauses labeled with weights of 10% or 50%) → Expression (Functions, Constant, Alias, Column, Table)
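A recursive generator driven by a probabilistic profile might look like the sketch below. The weights, function list, column names and depth limit are illustrative assumptions, not the values used in the talk, and only a few clauses are modeled.

```python
import random

# Hypothetical profile. Independent weights: each optional clause is emitted
# with its own probability, regardless of the others.
CLAUSE_WEIGHTS = {"WHERE": 0.5, "ORDER BY": 0.1}
# Inter-dependent weights: relative odds among functions, used only once the
# generator has already decided to emit a function call.
FUNCTION_WEIGHTS = {"COALESCE": 3, "ABS": 1}
COLUMNS = ["t1.int_col_1", "t1.int_col_2"]


def gen_expression(rng, depth=0, max_depth=3):
    """Recursively build an expression: a column, a constant, or a function."""
    if depth >= max_depth or rng.random() < 0.5:
        if rng.random() < 0.7:
            return rng.choice(COLUMNS)
        return str(rng.randint(-1000, 1000))
    names = list(FUNCTION_WEIGHTS)
    func = rng.choices(names, weights=[FUNCTION_WEIGHTS[n] for n in names])[0]
    n_args = 1 if func == "ABS" else rng.randint(2, 3)
    args = ", ".join(
        gen_expression(rng, depth + 1, max_depth) for _ in range(n_args)
    )
    return f"{func}({args})"


def gen_query(rng):
    """Build one SELECT statement, adding optional clauses by weight."""
    n_items = rng.randint(1, 3)
    items = [f"{gen_expression(rng)} AS col_{i}" for i in range(n_items)]
    sql = f"SELECT {', '.join(items)} FROM table_1 t1"
    if rng.random() < CLAUSE_WEIGHTS["WHERE"]:
        sql += f" WHERE ({gen_expression(rng)}) > ({gen_expression(rng)})"
    if rng.random() < CLAUSE_WEIGHTS["ORDER BY"]:
        sql += " ORDER BY 1"
    return sql


rng = random.Random(2020)
queries = [gen_query(rng) for _ in range(5)]
```

The depth limit guarantees termination; without it a recursive grammar with a high function weight could nest indefinitely.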

Query and regression example
Query produced on a small dataset with 2 tables of 5x5 size

    SELECT COALESCE(t2.smallint_col_3, t1.smallint_col_3, t2.smallint_col_3) AS int_col,
           IF(NULL,
              VARIANCE(COALESCE(t2.smallint_col_3, t1.smallint_col_3, t2.smallint_col_3)),
              COALESCE(t2.smallint_col_3, t1.smallint_col_3, t2.smallint_col_3)) AS int_col_1,
           STDDEV(t2.double_col_2) AS float_col,
           COALESCE(MIN((t1.smallint_col_3) - (COALESCE(t2.smallint_col_3, t1.smallint_col_3, t2.smallint_col_3))),
                    COALESCE(t2.smallint_col_3, t1.smallint_col_3, t2.smallint_col_3),
                    COALESCE(t2.smallint_col_3, t1.smallint_col_3, t2.smallint_col_3)) AS int_col_2
    FROM table_4 t1
    INNER JOIN table_4 t2 ON (t2.timestamp_col_7) = (t1.timestamp_col_7)
    WHERE (t1.smallint_col_3) IN (CAST('0.04' AS DECIMAL(10,10)), t1.smallint_col_3)
    GROUP BY COALESCE(t2.smallint_col_3, t1.smallint_col_3, t2.smallint_col_3)

▪ Within 10 queries, this query triggered an exception
▪ Related to COALESCE flattening

Correctness regression example
[SPARK-16633] Using constant input values breaks the LEAD function

    SELECT (t1.decimal0803_col_3) / (t1.decimal0803_col_3) AS decimal_col,
           CAST(696 AS STRING) AS char_col,
           t1.decimal0803_col_3,
           (COALESCE(CAST('0.02' AS DECIMAL(10,10)), CAST('0.47' AS DECIMAL(10,10)), CAST('-0.53' AS DECIMAL(10,10))))
             + (LEAD(-65, 4) OVER (ORDER BY (t1.decimal0803_col_3) / (t1.decimal0803_col_3), CAST(696 AS STRING))) AS decimal_col_1,
           CAST(-349 AS STRING) AS char_col_1
    FROM table_16 t1
    WHERE (943) > (889)

▪ Spark: [1.0, 696, -871.81, -64.98, -349]
▪ PostgreSQL: [1.0, 696, -871.81, NULL, -349]
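An evaluator for this kind of mismatch could compare the two result sets as multisets of rows. The sketch below is an assumption about how such a check might be written, not the tool's actual evaluator; `normalize`, `diff_results` and the rounding precision are hypothetical, and the sample rows are the single row from the slide with the two `CAST(... AS STRING)` columns held as strings.

```python
import math
from collections import Counter


def normalize(value, digits=6):
    """Map a cell to a comparable form: None stays None, floats are rounded."""
    if value is None:
        return None
    if isinstance(value, float):
        if math.isnan(value):
            return "NaN"  # NaN != NaN, so give it a stable stand-in
        return round(value, digits)
    return value


def diff_results(sut_rows, oracle_rows):
    """Compare two result sets as multisets of rows (order-insensitive).

    Returns (only_in_sut, only_in_oracle); two empty lists mean agreement.
    """
    sut = Counter(tuple(normalize(v) for v in row) for row in sut_rows)
    oracle = Counter(tuple(normalize(v) for v in row) for row in oracle_rows)
    return list((sut - oracle).elements()), list((oracle - sut).elements())


# The row reported on the slide for each engine.
spark_rows = [(1.0, "696", -871.81, -64.98, "-349")]
postgres_rows = [(1.0, "696", -871.81, None, "-349")]

only_spark, only_postgres = diff_results(spark_rows, postgres_rows)
mismatch = bool(only_spark or only_postgres)
```

Comparing multisets rather than ordered lists avoids false alarms when a query has no total ordering; a query with a full ORDER BY could be compared positionally instead.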

Query operator coverage analysis
▪ In 15 minutes (500 queries), operator coverage reaches near its maximum
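One way to turn this saturation behaviour into a stopping rule is to end a run once no new operator has been seen for a fixed number of queries. The sketch below assumes each executed query reports the set of operators it exercised; the function name, the `patience` parameter and the operator names are hypothetical.

```python
def run_until_saturated(operator_sets, patience=100):
    """Consume per-query operator sets until coverage stops growing.

    operator_sets: iterable of sets, one per executed query, naming the
    operators that query exercised.
    patience: number of consecutive queries without a new operator after
    which the run stops.
    Returns (queries_run, covered_operators).
    """
    covered = set()
    stale = 0
    queries_run = 0
    for ops in operator_sets:
        queries_run += 1
        new_ops = set(ops) - covered
        if new_ops:
            covered |= new_ops
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break
    return queries_run, covered


# Tiny illustration with made-up operator names.
observed = [{"Project"}, {"Filter"}, {"Project"}, {"Project"}, {"Join"}]
queries_run, covered = run_until_saturated(observed, patience=2)
```

With `patience=2` the run above stops after the fourth query and never reaches the query that would have added `Join`, which shows the trade-off: a small patience saves time but can miss rarely generated operators.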

Continuous Integration pipeline
Diagram: SparkFuzz emits failure events (correctness, performance) that pass through the stages Classify, Root-cause, Alert, Re-test, and Regression.
Step labels shown in the diagram: Minimize, Impact, Drill-down, Scope, Profile, Correlation, Compare, Confirm?, Validate.

Conclusion and future work
▪ Prevented SQL correctness errors from reaching production
  ▪ complementing the testing practices
▪ Runtime operator coverage metrics found applicable
  ▪ For testing code changes rapidly
  ▪ With a degree of coverage
▪ Future work
  ▪ Improve the coverage metric to include operator chaining
  ▪ Update the model generation to use the Spark AST grammar directly

SparkFuzz: Searching Correctness Regressions
Thanks, questions?
Bogdan Ghit, Nicolas Poggi, Josh Rosen, Reynold Xin, and Peter Boncz
Feedback: Nicolas.Poggi@databricks.com
