
Learning Data Systems Components
Tim Kraska <kraska@mit.edu>

Work partially done at Google. [Disclaimer: I am NOT talking on behalf of Google]
Comments on social media. Sorting, Joins, Tree, Bloom Filter, HashMaps. Machine Learning Just Ate


  1. Does It Work?
  200M records of map data (e.g., restaurant locations); index on longitude.
  Intel E5 CPU with 32GB RAM, without GPU/TPUs; no special SIMD optimization (there is a lot of potential).

  Type           Config                   Lookup   Speedup vs. BTree   Size (MB)   Size vs. BTree
  BTree          page size: 128           260 ns   1.0X                12.98 MB    1.0X
  Learned index  2nd stage size: 10000    222 ns   1.17X               0.15 MB     0.01X
  Learned index  2nd stage size: 50000    162 ns   1.60X               0.76 MB     0.05X
  Learned index  2nd stage size: 100000   144 ns   1.67X               1.53 MB     0.12X
  Learned index  2nd stage size: 200000   126 ns   2.06X               3.05 MB     0.23X

  60% faster at 1/20th the space, or 17% faster at 1/100th the space.
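
  The "2nd stage size" column counts the models in the second stage of a recursive model index (RMI). As a rough illustration of the mechanics rather than the paper's implementation, a minimal Python sketch with linear models and an error-bounded "last mile" search could look like this (all names are hypothetical):

    import bisect

    def fit_linear(xs, ys):
        # closed-form least squares for y ~ a*x + b
        n, sx, sy = len(xs), sum(xs), sum(ys)
        sxx = sum(x * x for x in xs)
        sxy = sum(x * y for x, y in zip(xs, ys))
        denom = n * sxx - sx * sx
        a = (n * sxy - sx * sy) / denom if denom else 0.0
        b = (sy - a * sx) / n
        return a, b

    class TwoStageRMI:
        def __init__(self, sorted_keys, second_stage_size):
            self.keys, self.m = sorted_keys, second_stage_size
            n = len(sorted_keys)
            # stage 1: one model routes each key to a second-stage model
            self.root = fit_linear(sorted_keys, [i * self.m / n for i in range(n)])
            buckets = [[] for _ in range(self.m)]
            for i, k in enumerate(sorted_keys):
                buckets[self._route(k)].append((k, i))
            # stage 2: per-model linear fit plus its min/max prediction error
            self.leaf = []
            for pts in buckets:
                if not pts:
                    self.leaf.append((0.0, 0.0, -n, n))
                    continue
                a, b = fit_linear([k for k, _ in pts], [i for _, i in pts])
                errs = [i - (a * k + b) for k, i in pts]
                self.leaf.append((a, b, int(min(errs)) - 1, int(max(errs)) + 1))

        def _route(self, key):
            a, b = self.root
            return min(self.m - 1, max(0, int(a * key + b)))

        def lookup(self, key):
            a, b, emin, emax = self.leaf[self._route(key)]
            pos = int(a * key + b)
            lo = max(0, pos + emin)
            hi = min(len(self.keys), pos + emax + 2)
            # "last mile": binary search only within the model's error bounds
            i = bisect.bisect_left(self.keys, key, lo, hi)
            return i if i < len(self.keys) and self.keys[i] == key else None

    # usage: TwoStageRMI(keys, second_stage_size=10000).lookup(some_key)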

  2. You Might Have Seen Certain Blog Posts

  3. [Plot: index size (MB, 0.5 to 256, log scale) vs. lookup time (ns, 0 to 350) for FAST, a lookup table, a fixed-size read-optimized B-Tree with interpolation search, and the learned index; lower left ("better") is smaller and faster.]

  4. My Own Comparison

  5. A Comparison To ARTful Indexes (Radix-Tree)
  Viktor Leis, Alfons Kemper, Thomas Neumann: The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases. ICDE 2013
  Experimental setup (both key sets sketched below):
  • Dense: continuous keys from 0 to 256M
  • Sparse: 256M keys where each bit is equally likely 0 or 1
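
  For concreteness, the two key sets could be generated as follows; this is a sketch, and the 64-bit width of the sparse keys is an assumption:

    import random

    N = 256_000_000  # 256M keys; use a much smaller N to experiment

    # Dense: continuous keys 0 .. N-1
    dense = range(N)

    # Sparse: each bit of a 64-bit key is equally likely 0 or 1
    random.seed(0)
    sparse = sorted(random.getrandbits(64) for _ in range(N))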

  6. A Comparison To ARTful Indexes (Radix-Tree)
  Viktor Leis, Alfons Kemper, Thomas Neumann: The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases. ICDE 2013
  Experimental setup: continuous keys from 0 to 256M
  Reported lookup throughput: 10M/s ≈ 100 ns (1)
  Size: not measured, but the paper reports an overhead of ≈ 8 bytes per key (dense, best case): 256M * 8 bytes ≈ 1953 MB
  (1) Numbers from the paper

  7. Learned Index. Generated code: Record lookup(key) { return data[0 + 1 * key]; }

  8. Learned Index. Generated code: Record lookup(key) { return data[key]; }

  9. Learned Index. Generated code: Record lookup(key) { return data[key]; }
  Lookup latency: 10 ns (learned index) vs. 100 ns* (ARTful), i.e., one order of magnitude better.
  Space: 0 MB vs. 1953 MB. Infinitely better :)

  10. ?

  11. What about Updates and Inserts?

  12. What about Updates and Inserts?
  Alex Galakatos, Michael Markovitch, Carsten Binnig, Rodrigo Fonseca, Tim Kraska: A-Tree: A Bounded Approximate Index Structure. https://arxiv.org/abs/1801.10207

  13. The Simple Approach: Delta Indexing for Updates
  Retraining a simple multi-variate regression model can be done in one pass over the data, as sketched below.
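
  Why one pass suffices: for a sorted array the model fits key -> position, and the least-squares line has a closed form over four running sums. The delta-index wrapper below is a hypothetical sketch, not the talk's implementation:

    import bisect

    def fit_linear_one_pass(sorted_keys):
        # accumulate the four sums needed for closed-form least squares
        n = sx = sy = sxx = sxy = 0
        for pos, key in enumerate(sorted_keys):
            n += 1
            sx += key; sy += pos
            sxx += key * key; sxy += key * pos
        denom = n * sxx - sx * sx
        a = (n * sxy - sx * sy) / denom if denom else 0.0
        b = (sy - a * sx) / n if n else 0.0
        return a, b  # model: position ~ a * key + b

    class DeltaIndex:
        # new keys go to a small sorted delta; a merge retrains the model
        def __init__(self, sorted_keys):
            self.main, self.delta = list(sorted_keys), []
            self.model = fit_linear_one_pass(self.main)

        def insert(self, key):
            bisect.insort(self.delta, key)
            if len(self.delta) * 10 > len(self.main):      # merge threshold
                self.main = sorted(self.main + self.delta)  # or a linear merge
                self.delta = []
                self.model = fit_linear_one_pass(self.main)

        def contains(self, key):
            # the model guides the search on main (as in the RMI sketch
            # above); plain binary search stands in for that here
            i = bisect.bisect_left(self.main, key)
            return ((i < len(self.main) and self.main[i] == key)
                    or key in self.delta)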

  14. Leverage the Distribution

  15. Leverage the Distribution for Appends
  New inserts (e.g., timestamps). [Plot: inserts over time.]
  If the learned model can generalize to inserts, insert complexity is O(1), not O(log N).
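
  A sketch of the append case, assuming keys (e.g., timestamps) arrive at a roughly constant rate so a fixed linear model keeps predicting the right position; the class and its parameters are hypothetical:

    class AppendOnlyLearnedIndex:
        def __init__(self, t0, keys_per_unit_time):
            self.keys = []
            self.t0, self.rate = t0, keys_per_unit_time  # learned arrival model

        def predict(self, key):
            # model: position ~ (key - t0) * rate; lookups clamp this to the
            # array and do a short local search around the prediction
            return int((key - self.t0) * self.rate)

        def insert(self, key):
            # O(1) append, no retraining and no rebalancing: as long as the
            # arrival rate holds, the model generalizes to the new key
            assert not self.keys or key >= self.keys[-1]
            self.keys.append(key)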

  16. Updates/Inserts
  • Less beneficial, as the data still has to be stored sorted
  • Idea: leave space in the array where more updates/inserts are expected (see the sketch after this list)
  • Can also be done with traditional trees
  • But the error of learned indexes should increase with O(·) per node in the RMI, whereas traditional indexes grow with O(·)
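
  A minimal sketch of the "leave space" idea: allocate the array with gaps, place a new key at the model-predicted slot, and shift only up to the nearest gap. The model argument and the fill factor are assumptions:

    class GappedArray:
        def __init__(self, sorted_keys, model, fill=0.5):
            # model(key) -> estimated rank; gaps are spread uniformly here,
            # but could follow where more inserts are expected
            self.model, self.fill = model, fill
            self.slots = [None] * (int(len(sorted_keys) / fill) + 1)
            for rank, key in enumerate(sorted_keys):
                self.slots[int(rank / fill)] = key

        def insert(self, key):
            pos = min(len(self.slots) - 1,
                      max(0, int(self.model(key) / self.fill)))
            # walk to the true insertion point (model error assumed small)
            while (pos > 0 and self.slots[pos - 1] is not None
                   and self.slots[pos - 1] > key):
                pos -= 1
            while (pos < len(self.slots) and self.slots[pos] is not None
                   and self.slots[pos] < key):
                pos += 1
            # shift right only until the first gap: O(1) expected, not O(N)
            gap = pos
            while gap < len(self.slots) and self.slots[gap] is not None:
                gap += 1
            if gap == len(self.slots):
                self.slots.append(None)
            self.slots[pos + 1:gap + 1] = self.slots[pos:gap]
            self.slots[pos] = key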

  17. Still at the Beginning!
  • Can we provide bounds for inserts?
  • When to retrain?
  • How to retrain models on the fly?
  • …

  18. Fundamental Algorithms & Data Structures: Join, Tree, Sorting, Hash-Map, Bloom-Filter, Cache Policy, Scheduling, Range-Filter, Priority Queue, …

  19. Fundamental Algorithms & Data Structures: Join, Tree, Sorting, Hash-Map, Bloom-Filter, Cache Policy, Scheduling, Range-Filter, Priority Queue, …

  20. Hash Map. [Diagram: Key → Hash Function vs. Key → Model.] Goal: reduce conflicts.
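
  The idea in one sketch: if model(key) approximates the CDF P(X ≤ key), scaling it by the table size spreads the observed keys almost uniformly over the slots, which is what reduces conflicts. The example model here is hypothetical:

    def learned_hash(cdf_model, num_slots):
        # use a model of the key distribution's CDF as the hash function
        def h(key):
            p = min(max(cdf_model(key), 0.0), 1.0)  # clamp to [0, 1]
            return min(num_slots - 1, int(p * num_slots))
        return h

    # e.g., for keys uniform in [0, 1e6) the CDF is just key / 1e6
    h = learned_hash(lambda k: k / 1e6, num_slots=1024)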

  21. Hash Map - Results: 25% - 70% reduction in hash-map conflicts

  22. You Might Have Seen Certain Blog Posts

  23. Independent of Hash-Map Architecture

  24. Hash Map – Example Results

  Type                                                            Time (ns)   Utilization
  Stanford AVX Cuckoo, 4 Byte value                               31 ns       99%
  Stanford AVX Cuckoo, 20 Byte record - Standard Hash             43 ns       99%
  Commercial Cuckoo, 20 Byte record - Standard Hash               90 ns       95%
  In-place chained Hash-Map, 20 Byte record, learned hash fns     35 ns       100%

  25. Fundamental Algorithms & Data Structures: Join, Tree, Sorting, Hash-Map, Bloom-Filter, Cache Policy, Scheduling, Range-Filter, Priority Queue, …

  26. Fundamental Algorithms & Data Structures: Join, Tree, Sorting, Hash-Map, Bloom-Filter, Cache Policy, Scheduling, Range-Filter, Priority Queue, …

  27. Bloom Filter - Approach 1
  [Diagram: "Is this key in my set?" answered by a model that returns Maybe/No, backed by a Bloom filter.]
  36% space improvement over a Bloom filter at the same false positive rate
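
  A minimal sketch of Approach 1, assuming a pre-trained classifier model(key) -> score: every set key the model scores below the threshold at build time goes into a small backup Bloom filter, so the combined structure keeps the no-false-negatives guarantee while the main filter can shrink:

    import hashlib

    class BloomFilter:
        def __init__(self, num_bits, num_hashes):
            self.bits = bytearray((num_bits + 7) // 8)
            self.m, self.k = num_bits, num_hashes

        def _positions(self, key):
            for i in range(self.k):
                d = hashlib.blake2b(f"{i}:{key}".encode()).digest()
                yield int.from_bytes(d[:8], "little") % self.m

        def add(self, key):
            for p in self._positions(key):
                self.bits[p // 8] |= 1 << (p % 8)

        def maybe_contains(self, key):
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(key))

    class LearnedBloomFilter:
        def __init__(self, keys, model, threshold, backup_bits=1 << 16):
            self.model, self.t = model, threshold
            self.backup = BloomFilter(backup_bits, num_hashes=4)
            for key in keys:          # model false negatives -> backup filter
                if model(key) < threshold:
                    self.backup.add(key)

        def maybe_contains(self, key):
            # "Maybe" if the model says so, else ask the small backup filter
            return self.model(key) >= self.t or self.backup.maybe_contains(key)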

  28. Bloom Filter - Approach 2 (Future Work)
  [Diagram: Key → Model alongside Hash Function 1, Hash Function 2, Hash Function 3.]
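
  A speculative reading of the diagram, sketched: the model's discretized output serves as one of the Bloom filter's hash functions next to conventional ones. Everything here is an assumption about the future-work design:

    class ModelHashBloomFilter:
        def __init__(self, num_bits, hash_fns, model):
            self.bits = [False] * num_bits
            self.m = num_bits
            # conventional hash functions plus one learned "hash function":
            # a good model concentrates set members on few distinct bits
            self.fns = list(hash_fns) + [lambda k: int(model(k) * num_bits)]

        def add(self, key):
            for f in self.fns:
                self.bits[f(key) % self.m] = True

        def maybe_contains(self, key):
            return all(self.bits[f(key) % self.m] for f in self.fns)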

  29. Fundamental Algorithms & Data Structures: Join, Tree, Sorting, Hash-Map, Bloom-Filter, Cache Policy, Scheduling, Range-Filter, Priority Queue, …

  30. Future Work. [Plot: CDF.] How Would You Design Your Algorithms/Data Structure If You Have a Model for the Empirical Data Distribution?

  31. Future Work: Join, Tree, Sorting, Hash-Map, Bloom-Filter, Cache Policy, Scheduling, Range-Filter, Priority Queue, …

  32. Future work: Multi-Dim Indexes

  33. Future work: Data Cubes

  34. How Would You Design Your Algorithms/Data Structure If You Have a Model for the Empirical Data Distribution?
  Other Database Components:
  • Cardinality Estimation
  • Cost Model
  • Query Scheduling
  • Storage Layout
  • Query Optimizer
  • …

  35. Related Work
  • Succinct Data Structures → most related, but succinct data structures are usually carefully, manually tuned for each use case
  • B-Trees with Interpolation Search → arbitrary worst-case performance
  • Perfect Hashing → connected to our hash-map approach, but perfect hash functions usually increase in size with N
  • Mixture-of-Experts Models → used as part of our solution
  • Adaptive Data Structures / Cracking → orthogonal problem
  • Locality-Sensitive Hashing (LSH) (e.g., learned by NN) → has nothing to do with learned structures

  36. Locality-Sensitive Hashing (LSH). Thanks to Alkis for the analogy.

  37. Summary. [Plot: CDF.] How Would You Design Your Algorithms/Data Structure If You Have a Model for the Empirical Data Distribution?

  38. Adapts To Your Data

  39. Big Potential For TPUs/GPUs

  40. Can Lower the Complexity Class
  [Plot: time or space vs. N, with curves for O(N^2), O(N), O(log N), O(1).]
  data_array[lookup_key - 900]

  41. Warning Not An Almighty Solution
