Ask “what,” not “how”
Kostas Tzoumas

Data is an important asset
Video & audio streams, sensor data, RFID, GPS, user online behavior, scientific simulations, web archives, ...
Volume: Handle petabytes of data
Velocity: Handle high data arrival rates
Variety: Handle many heterogeneous data sources
Veracity: Handle inherent uncertainty of data

Data Analysis

Four “I”s for Big Analysis
Text mining, interactive and ad hoc analysis, machine learning, graph analysis, statistical algorithms
Iterative: Model the data, do not just describe it
Incremental: Maintain the model under high arrival rates
Interactive: Step-by-step data exploration on very large data
Integrative: Fluent unified interfaces for different data models

MapReduce and Hadoop
Input 1: “Romeo, Romeo, wherefore art thou Romeo?”
Map output 1: (Romeo, 1) (Romeo, 1) (wherefore, 1) (art, 1) (thou, 1) (Romeo, 1)
Input 2: “What, art thou hurt?”
Map output 2: (What, 1) (art, 1) (thou, 1) (hurt, 1)
Reduce 1: (Romeo, (1,1,1)) gives (Romeo, 3); (art, (1,1)) gives (art, 2); (thou, (1,1)) gives (thou, 2)
Reduce 2: (wherefore, 1) gives (wherefore, 1); (What, 1) gives (What, 1); (hurt, 1) gives (hurt, 1)
Data written to disk
Data shuffled over network
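The word-count flow on this slide can be sketched with plain Scala collections. This is a single-machine illustration of the map, shuffle, and reduce phases, not the Hadoop API; the object and function names (`WordCountSketch`, `mapPhase`, `shuffle`, `reducePhase`) are hypothetical and chosen only for this example.

```scala
object WordCountSketch {
  // Map phase: emit (word, 1) for every word in one input split.
  def mapPhase(line: String): Seq[(String, Int)] =
    line.split("[^A-Za-z]+").toSeq.filter(_.nonEmpty).map(word => (word, 1))

  // Shuffle phase: group the emitted pairs by key, which a MapReduce
  // framework does between the map and reduce stages.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, group) => (word, group.map(_._2)) }

  // Reduce phase: sum the list of counts collected for one key.
  def reducePhase(word: String, counts: Seq[Int]): (String, Int) =
    (word, counts.sum)

  // Full pipeline over several input splits.
  def wordCount(splits: Seq[String]): Map[String, Int] =
    shuffle(splits.flatMap(mapPhase)).map { case (w, cs) => reducePhase(w, cs) }
}
```

Calling `WordCountSketch.wordCount` on the two quoted inputs from the slide groups the three `(Romeo, 1)` pairs under one key before summing them, which mirrors the `(Romeo, (1,1,1))` step in the diagram.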

SQL analytics with Hadoop
Pitfalls:
• Lacking in declarativity
• HDFS-based data exchange
• Sort the only grouping operator
• Hadoop engine tailored to simple aggregations
[Figure: a chain of Map and Reduce stages]

SQL MapReduce BigAnalytics BigSQL NoMapReduce

Advanced Analytics
Analytics that model the data to reveal hidden relationships, not just describe the data. E.g., machine learning, statistics, graph analysis.
Increasingly important from a market perspective.
Very different from SQL analytics: different languages and access patterns (iterative vs. one-pass programs).
The Hadoop toolchain is poor; R, Matlab, etc. are not parallel.

Use case in all verticals
Media and Communications
Example: Risk management, analytics on phone call logs, risk management, sentiment analysis, clickstream and call analysis
Manufacturing
Example: Data-driven quality control and assurance, demand forecasting, sales and operation planning, process optimization
Travel and tourism
Example: Improve personalized customer experience in hotels, estimate no-show in flights, route planning
Retail
Example: Improve campaign ROI by optimizing advertising channels, market basket analysis, fraud detection, product recommendation
Social and e-commerce
Example: Targeted customer experience, explore new business models, real-time recommendations, social graph analysis, social trend analysis, game analytics

Big data lives in Hadoop.
Hadoop clusters offer very low effective storage cost, and are becoming a data vortex, attracting cross-departmental data.
Companies want to perform advanced and predictive analytics to maximize ROI of their data assets by modeling the data, not just describing it.
How do we bring advanced analytics to the world of big data?

What, not how
Recipe for success: declarativity
User specifies what information to extract out of the data, not how the system extracts the information.
This is what relational databases pioneered in the 70s, resulting in a vibrant research community and a billion dollar industry.
Big data consumers now: systems programming experts
Big data consumers in the future: people with data analysis skills
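As a small illustration of the "what, not how" distinction, the sketch below computes the total debit per customer twice. The first version spells out the iteration and state updates; the second only states the grouping and the aggregate and leaves the execution to the library. The `Account` type and both function names are hypothetical, and plain Scala collections stand in for a real data platform.

```scala
object WhatNotHow {
  final case class Account(customerId: String, debit: Double)

  // "How": the program dictates iteration order and manages mutable state.
  def totalsImperative(accounts: Seq[Account]): Map[String, Double] = {
    val totals = scala.collection.mutable.Map.empty[String, Double]
    for (a <- accounts) {
      totals(a.customerId) = totals.getOrElse(a.customerId, 0.0) + a.debit
    }
    totals.toMap
  }

  // "What": the program states the grouping key and the aggregate only.
  def totalsDeclarative(accounts: Seq[Account]): Map[String, Double] =
    accounts.groupBy(_.customerId).map { case (id, rows) => (id, rows.map(_.debit).sum) }
}
```

Because the second form does not fix an execution order, a system is free to choose one, which is the property the slide attributes to relational databases.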

Desiderata for next-gen big data platforms: Usability
10 million Excel users
3 million R users
70,000 Hadoop users
“the market faces certain challenges such as unavailability of qualified and experienced work professionals, who can effectively handle the Hadoop architecture.”

Desiderata for next-gen big data platforms: Performance
[Bar chart: Stratosphere vs. Hadoop, axis from 0 to 700]
Performance difference from days to minutes enables real-time decision making and widespread use of data within the organization.

How to lift declarativity from the closed world of relational algebra to the open world of advanced analytics.

Step 1: Specify
Unify data and programming models in a declarative abstraction.
SQL for extracting enterprise data from databases.
General-purpose programming for feature extraction and normalization.
Statistical libraries for advanced analysis.

    // get the customers with their debit
    val debits: (String, Double) = sql(
        "SELECT customerId, debit FROM customer_accounts;")

    // get the number of warned invoices in the last
    // 12 and 6 months
    val warnings: (String, Int, Int) = sql("""
        SELECT R12.customerId, R12.cnt, R6.cnt
        FROM (…) R12 LEFT OUTER JOIN (…) R6
          ON (R6.customerId = R12.customerId);""")

    // number of contracts a customer has
    val numContracts: (String, Int) = sql(
        "SELECT customerId, numContracts FROM customers;")

    // join the data into one data point
    case class DataPoint(x: Vector, y: Double)
    val dataPoints = numContracts
      join warnings
      where {_._1} isEqualTo {_._1}
      join debits
      where {_._1} isEqualTo {_._1}
      map { (x, y, z) => DataPoint(Vector(x._2, y._2, y._3),
                                   if (z._2 > X) 1 else 0) }

    // run regression with dimensionality 3 for 40 iterations
    val weights: Vector = logRegression(3, dataPoints, 40)
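The slide calls `logRegression(3, dataPoints, 40)` without showing what the library does internally. The following is a minimal single-machine sketch of one possible implementation, assuming batch gradient descent on the logistic loss. The algorithm choice, the `learningRate` parameter, and the use of `Array[Double]` in place of the slide's `Vector` type are all assumptions made for this example.

```scala
object LogRegSketch {
  // Stand-in for the slide's DataPoint; x holds the features, y is 0.0 or 1.0.
  final case class DataPoint(x: Array[Double], y: Double)

  private def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  private def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  // Batch gradient descent; learningRate is an assumed tuning parameter
  // that does not appear on the slide.
  def logRegression(dim: Int, points: Seq[DataPoint], iterations: Int,
                    learningRate: Double = 0.1): Array[Double] = {
    require(points.nonEmpty, "need at least one data point")
    var weights = Array.fill(dim)(0.0)
    for (_ <- 1 to iterations) {
      // Accumulate the gradient of the logistic loss over all points.
      val gradient = Array.fill(dim)(0.0)
      for (p <- points) {
        val error = sigmoid(dot(weights, p.x)) - p.y
        for (j <- 0 until dim) gradient(j) += error * p.x(j)
      }
      // Step against the averaged gradient.
      weights = weights.zip(gradient).map { case (w, g) =>
        w - learningRate * g / points.size
      }
    }
    weights
  }
}
```

Each iteration reads every data point once, which is the iterative access pattern the deck contrasts with one-pass SQL analytics.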

First step for declarative analytics
Scala: functional and object-oriented JVM language, excellent basis for domain-specific language development. Coolest kid on the block ☺
Feels like a scripting language, but is not restricted to a fixed data model like Pig, Hive, etc.
Scala’s extensible compiler architecture is a good match for implementing optimizers.

Step 2: Optimize
Query optimizers: the enabling technology for SQL data warehousing and BI.
Successful industrial application of artificial intelligence.
Currently, no other system can optimize non-relational data analysis programs.
[Figure: Complex Plan Diagram. Each color is a differently written program that produces the same result but has very different performance depending on small changes in the data set and the analysis requirements. Both axes: data characteristics change.]
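The idea that differently written programs produce the same result at different cost can be sketched with two hand-written plans for one query. Both return customers joined with their debits at or above a threshold; one joins first, the other filters first. The types, names, and the "intermediate tuples" cost measure are hypothetical, and the sketch picks nothing automatically; it only shows the kind of alternatives an optimizer would weigh.

```scala
object PlanChoiceSketch {
  final case class Customer(id: String, numContracts: Int)
  final case class Debit(customerId: String, amount: Double)

  type Row = (String, Int, Double)

  // Plan A: join everything, then filter. Returns (result, tuples produced by the join).
  def joinThenFilter(cs: Seq[Customer], ds: Seq[Debit], minAmount: Double): (Set[Row], Int) = {
    val joined = for (c <- cs; d <- ds if c.id == d.customerId)
      yield (c.id, c.numContracts, d.amount)
    (joined.filter(_._3 >= minAmount).toSet, joined.size)
  }

  // Plan B: filter the debits, then join. Same result, smaller join output.
  def filterThenJoin(cs: Seq[Customer], ds: Seq[Debit], minAmount: Double): (Set[Row], Int) = {
    val kept = ds.filter(_.amount >= minAmount)
    val joined = for (c <- cs; d <- kept if c.id == d.customerId)
      yield (c.id, c.numContracts, d.amount)
    (joined.toSet, joined.size)
  }
}
```

How much Plan B saves depends on how selective the filter is for the data at hand, which is one reason the best plan can change when data characteristics change.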
