Speeding Up Data Science: From a Data Management Perspective

Speeding Up Data Science: From a Data Management Perspective Jiannan Wang Database System Lab (DSL) Simon Fraser University NWDS Meeting, Jan 5, 2018 1

Simon Fraser University 2

SFU DB/DM Group
Ke Wang (Joined SFU in 2000)
• Privacy-Preserving Data Publishing
• Secure Query Answering for Outsourced Databases
Martin Ester (Joined SFU in 2001)
• Recommendation in Social Media
• Biological Data Mining
Jiannan Wang (Joined SFU in 2016)
• Data Cleaning for Machine Learning
• Data Enrichment with Deep Web
• Interactive Analytics Over Big Data
Jian Pei (Joined SFU in 2004)
• Interpretable Machine Learning and Deep Learning
• Computational Fraud Investigation
• Robust AI Models Against Adversarial Attacks
3

My Lab’s Mission Speeding Up Data Science 4

Computer Science vs. Data Science
What              When    Who                Goal
Computer Science  1950-   Software Engineer  Write software to make computers work
                  Plan → Design → Develop → Test → Deploy → Maintain
Data Science      2010-   Data Scientist     Extract insights from data to answer questions
                  Collect → Clean → Integrate → Analyze → Visualize → Communicate
5

Lab Members Collect → Clean → Integrate → Analyze → Visualize → Communicate 6

Today’s Talk Deeper Collect → Clean → Integrate → Analyze → Visualize → Communicate AQP++ 7

Deeper (2016 - ) Leverage Deep Web To Speed Up Data Enrichment & Cleaning Pei Wang, Yongjun He, Ryan Shea, Jiannan Wang, Eugene Wu. Deeper: A Data Enrichment System Powered by Deep Web. SIGMOD 2018 Demo (in submission) 8

Deep Web Hidden Database: Invaluable External Resource ◦ Big: Consisting of a substantial number of entities ◦ Rich: Having rich information about each entity ◦ High-quality: Being trustworthy and up-to-date 9

Data Enrichment & Cleaning: Leverage Deep Web
Name   City       Zip Code  Tel
Fable  Burnaby    V6J 1MS   (604)732-1322
How?
Name   City       Zip Code  Tel            Category        Rating
Fable  Vancouver  V6J 1MS   (604)732-1322  Canadian (New)  4.5
10

NaïveCrawl Match one record at a time OpenRefine is doing this! 11

Limitations Limited Query Budget ◦ Google Maps API allows 2,500 free requests per day Dirty Data ◦ Users’ data is usually messy. Naïve queries will miss results 12
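A back-of-the-envelope calculation shows why the query budget matters. The sketch below assumes one request per local record (the NaïveCrawl pattern) and a hypothetical table of 100,000 records; only the 2,500-requests-per-day figure comes from the slide.

```python
import math

FREE_REQUESTS_PER_DAY = 2_500   # quota quoted on the slide
local_records = 100_000         # hypothetical table size, for illustration only

# One query per record means crawl time grows linearly with the table size.
days_needed = math.ceil(local_records / FREE_REQUESTS_PER_DAY)
print(days_needed)
```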

SmartCrawl 1. Generate a query pool $R$ 2. Select at most $c$ queries from $R$ such that $|I_{\text{crawled}} \cap E|$ is maximized 3. Perform entity resolution between $I_{\text{crawled}}$ and $E$ 13
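The selection step (step 2) can be read as a budgeted maximum-coverage problem. The following is a minimal greedy sketch, not the Deeper implementation. It assumes we already know which local records each candidate query would retrieve, whereas in practice that benefit has to be estimated. The function `select_queries`, the `coverage` mapping, and the sample queries are all hypothetical.

```python
from typing import Dict, List, Set


def select_queries(coverage: Dict[str, Set[int]], budget: int) -> List[str]:
    """Greedily pick at most `budget` queries covering the most local records."""
    covered: Set[int] = set()
    chosen: List[str] = []
    for _ in range(budget):
        best_query, best_gain = None, 0
        for query, records in coverage.items():
            if query in chosen:
                continue
            gain = len(records - covered)  # newly covered local records
            if gain > best_gain:
                best_query, best_gain = query, gain
        if best_query is None:  # no remaining query adds coverage
            break
        chosen.append(best_query)
        covered |= coverage[best_query]
    return chosen


# Hypothetical query pool: query text -> ids of local records it would match.
coverage = {
    "thai vancouver": {1, 2, 3},
    "sushi burnaby": {3, 4},
    "fable": {5},
    "noodle house": {1, 2},
}
selected = select_queries(coverage, budget=2)
print(selected)
```

Greedy selection is a standard heuristic for coverage problems; it is used here only to make the objective concrete.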

Challenges 1. Query Benefit Estimation 2. Efficient Implementations 3. Inadequate Sample Size 4. Fuzzy Matching 14

Demo: https://deeper.sfucloud.ca Video: https://youtu.be/QHYgLIqqjWY 15

Today’s Talk Deeper Collect → Clean → Integrate → Analyze → Visualize → Communicate AQP++ 16

Interactive Analytics How to enable interactive analytics over Big Data? 17

Two Separate Ideas Idea 1. Approximate Query Processing (AQP) SELECT SUM(salary) WHERE id in [6, 10000] 1GB sample 1TB data 18
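The idea behind AQP can be sketched as follows: run the aggregate on a uniform random sample and scale the result by the inverse sampling rate. This is a minimal illustration under stated assumptions. The synthetic table, the 1% sampling rate, and the function `estimate_sum` are hypothetical; the slide's setting is a 1GB sample of 1TB of data.

```python
import random


def estimate_sum(sample, population_size, lo, hi):
    """Estimate SUM(salary) WHERE lo <= id <= hi from a uniform sample of (id, salary) rows."""
    scale = population_size / len(sample)  # inverse sampling rate
    return scale * sum(salary for row_id, salary in sample if lo <= row_id <= hi)


random.seed(0)
# Synthetic base table of (id, salary) rows, for illustration only.
table = [(i, 50_000 + (i * 37) % 70_000) for i in range(1, 100_001)]
sample = random.sample(table, 1_000)  # 1% uniform sample

approx = estimate_sum(sample, len(table), 6, 10_000)
exact = sum(s for i, s in table if 6 <= i <= 10_000)
print(f"relative error: {abs(approx - exact) / exact:.2%}")
```

The answer is fast because only the sample is scanned, but it carries sampling error that depends on how many sampled rows fall inside the query range.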

Two Separate Ideas
Idea 2. Aggregation Precomputation (AggPre)
SELECT SUM(salary) WHERE id in [6, 10000]
Base Table           Prefix-Sum Cube [1]
ID      Salary       ID         Salary
1       50,000       ≤ 1        50,000
2       62,492       ≤ 2        112,492
3       78,212       ≤ 3        190,704
4       120,242      ≤ 4        310,946
5       98,341       ≤ 5        409,287
6       75,453       ≤ 6        484,740
7       60,000       ≤ 7        544,740
8       72,492       ≤ 8        617,232
9       88,212       ≤ 9        705,444
…       …            …          …
10000   86,798       ≤ 10000    9.3*10^8
[1] Ho, Ching-Tien, et al. Range queries in OLAP data cubes. (1997)
19
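With a prefix-sum cube, a range SUM reduces to one subtraction: the prefix sum at the upper bound minus the prefix sum just below the lower bound. The sketch below uses the nine salaries listed on the slide (ids 1 to 9); the helper name `range_sum` is an illustrative choice, not something from the talk.

```python
from itertools import accumulate

# Salaries for ids 1..9, copied from the slide's base table.
salaries = [50_000, 62_492, 78_212, 120_242, 98_341,
            75_453, 60_000, 72_492, 88_212]

# prefix[k] = SUM(salary) WHERE id <= k, with prefix[0] = 0 as a sentinel.
prefix = [0] + list(accumulate(salaries))


def range_sum(lo, hi):
    """Exact SUM(salary) WHERE lo <= id <= hi, using two cube lookups."""
    return prefix[hi] - prefix[lo - 1]


print(range_sum(6, 9))  # uses prefix[9] and prefix[5] only
```

The query cost is constant regardless of the range width, which is why AggPre answers so quickly; the price is the space and time needed to precompute the cube.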

Trade-Off [Figure: AggPre, AQP++, and AQP positioned along three axes: Response Time, Preprocessing Cost, and Query Error.] 20

AQP++ (2016 - ) Connecting Approximate Query Processing With Aggregate Precomputation Jinglin Peng, Dongxiang Zhang, Jiannan Wang, Jian Pei. AQP++: Connecting Approximate Query Processing with Aggregate Precomputation for Interactive Analytics. SIGMOD 2018 (to appear) 21

How Does AQP++ Work?
User query:                            SELECT SUM(salary) WHERE id in [6, 10000]
= Precomputed (Blocked Prefix-Sum Cube): SELECT SUM(salary) WHERE id in [0, 10000]
− Estimated (1GB sample):                SELECT SUM(salary) WHERE id in [0, 5]
Blocked Prefix-Sum Cube
ID         Salary
≤ 1000     1.2 * 10^8
≤ 2000     1.8 * 10^8
≤ 3000     2.9 * 10^8
≤ 4000     3.1 * 10^8
≤ 5000     4.0 * 10^8
≤ 6000     4.8 * 10^8
≤ 7000     5.4 * 10^8
≤ 8000     6.1 * 10^8
≤ 9000     8.1 * 10^8
≤ 10000    9.3 * 10^8
22
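The combination can be sketched as follows: answer the nearest block-aligned range exactly from the blocked prefix sums, and use the sample only to estimate the difference between the user's range and that aligned range. This is a simplified illustration, not the authors' implementation. It assumes a one-dimensional table with ids 1..N, N a multiple of the block size, and nearest-boundary snapping; all names and the synthetic data are hypothetical.

```python
import random


def build_blocked_prefix(table, block):
    """Exact prefix sums, stored only at ids that are multiples of `block`."""
    cube, running = {0: 0}, 0
    for row_id, salary in table:  # table is sorted by id
        running += salary
        if row_id % block == 0:
            cube[row_id] = running
    return cube


def sample_sum(sample, scale, lo, hi):
    """Scaled sample estimate of SUM(salary) WHERE lo <= id <= hi."""
    return scale * sum(s for i, s in sample if lo <= i <= hi)


def aqp_pp_sum(cube, block, sample, scale, lo, hi):
    """Estimate SUM(salary) WHERE lo <= id <= hi."""
    # Snap both ends to the nearest precomputed boundary: range (pre_lo, pre_hi].
    pre_lo = round((lo - 1) / block) * block
    pre_hi = round(hi / block) * block
    exact_part = cube[pre_hi] - cube[pre_lo]
    # Only the difference between the two ranges is estimated from the sample.
    diff = (sample_sum(sample, scale, lo, hi)
            - sample_sum(sample, scale, pre_lo + 1, pre_hi))
    return exact_part + diff


random.seed(0)
N, BLOCK = 10_000, 1_000
table = [(i, 50_000 + (i * 37) % 70_000) for i in range(1, N + 1)]  # synthetic
cube = build_blocked_prefix(table, BLOCK)
sample = random.sample(table, 100)  # 1% uniform sample
scale = N / len(sample)

exact = sum(s for i, s in table if 6 <= i <= N)
approx = aqp_pp_sum(cube, BLOCK, sample, scale, 6, N)
print(f"relative error: {abs(approx - exact) / exact:.2%}")
```

For the slide's query, the aligned range is the whole table, so the sample only has to account for ids 1 to 5. The estimated part is small relative to the exact part, which is the intuition for the lower error at a similar preprocessing cost.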

Experimental Result
TPCD (Laptop, 100GB) ◦ 0.05% sample, skew = 2
Method   Preprocessing Cost   Preprocessing Cost   Response Time   Answer Quality
         (Space)              (Time)                               (Avg Err.)
AQP      51.2 MB              4.3 min              0.6 sec         2.67%
AggPre   > 10 TB              > 1 day              < 0.01 sec      0.00%
AQP++    51.9 MB              9.8 min              0.64 sec        0.28%
23

3 Posters From SFU
1. Deeper (Pei Wang)
2. AQP++ (Jinglin Peng)
3. DTLR: An Interpretation of Deep Neural Network (Xia Hu)
[Figure: approximating the local decision boundary of a deep model using a linear model. Panel labels: Decision boundary of a deep model; Local decision boundary of a deep model.]
24

Take-away Messages
Our Mission ◦ Speeding Up Data Science
Deeper ◦ Leverage Deep Web to speed up data cleaning and enrichment
AQP++ ◦ Connect AQP with AggPre to speed up data analysis
https://github.com/sfu-db
Thanks!
25
