MAD Skills: New Analysis Practices for Big Data

Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M. Hellerstein, Caleb Welton
Presented by Ritwika Ghosh, November 10, 2015

mad (adj.): an adjective used to enhance a noun.
1. dude, you got skills.
2. dude, you got mad skills.
- UrbanDictionary.com

If you are looking for a career where your services will be in high demand, you should find something where you provide a scarce, complementary service to something that is getting ubiquitous and cheap. So what's getting ubiquitous and cheap? Data. And what is complementary to data? Analysis.
- Prof. Hal Varian, UC Berkeley, Chief Economist at Google

A bit of History
∙ The Enterprise Data Warehouse (EDW) is queried by Business Intelligence (BI) software.
∙ A carefully constructed EDW was key.
∙ "Mission Critical, expensive resource, used for serving data intensive reports targeted at executive decision makers".

What has changed
∙ Super cheap storage.
∙ Massive-scale data sources in the enterprise have grown remarkably: everything is data.
∙ Grassroots move to collect and leverage data in multiple organizational units: rise of a data-driven culture espoused by Google, Wired, etc.
∙ Sophisticated data analysis leads to cost savings and even direct revenue.

MAD skills
∙ New requirements: MAD skills.
∙ M: Magnetic (attract data and analysts)
∙ A: Agile (rapid iteration)
∙ D: Deep (sophisticated analytics in Big Data)
∙ Analysts with MAD skills need to be complemented by MAD approaches to design and infrastructure.

This paper
∙ MAD analytics for Fox Interactive Media, using Greenplum.
∙ Data-parallel statistical algorithms for modeling and comparing the densities of distributions.
∙ Critical database system features that enable agile design and flexible algorithm development.
∙ Challenging data warehousing orthodoxy: "Model Less, Iterate More".

Fox Audience Network
∙ Serves ads across several Fox online publishers (huge ad network).
∙ Greenplum Database system on 42 nodes:
    ∙ 40 Sun X4500s for query processing,
    ∙ 2 dual-core Opteron master nodes (one for failover).
∙ Big and growing:
    ∙ 200 TB of mirrored data. Fact table of 1.5T rows (2009).
    ∙ 5 TB growth per day.
∙ Variety of data: ad logs, CRM, user data.
∙ Diverse user set.
∙ Extensive use of R and Hadoop.

Fox Audience Network: Contd.
Diverse user base
∙ Different needs, variety of reporting and statistical tools, command line access: a dynamic query ecosystem.
Dealing with ad-hoc questions
∙ Question: How many female WWF enthusiasts under the age of 30 visited the Toyota community over the last four days and saw a medium rectangle?
∙ Problem: No set of pre-defined aggregates can possibly cover every question combining various variables.

Magnetic: Attracting users and Methods
Central Design Principle: Get data into the warehouse ASAP
∙ Analysts > DBAs: they like all data, they tolerate dirty data, they attract data, they produce data.
∙ Sandboxing allows analysts to feed datasets directly from the main warehouse.
∙ Encourage novel data sources.
∙ Business > application.

Agile: Analytics to adjust, react and learn from business
Case Study: Audience Forecasting
∙ 3 million users log in to IMDb.
∙ 2 million shared enough personal information to be able to attach 1 out of 2k attributes of behavior.
∙ 3 billion ads serving as tracking devices.
∙ Number of decisions: 1.2 × 10^16
Business cycle: acquire this data, strategically sub-sample, determine scaling, change practices to suit; rinse and repeat.

Deep: learning from data
∙ Infinite cycles of drill down and roll up: no single number is the answer.
∙ Anomaly detection, longitudinal variance, distribution functions.
∙ Statistical modeling: curves and models, as opposed to points!

MAD Modeling
Intelligently staging cleaning and integration of data
∙ Staging schema: raw fact tables/logs
∙ Production Data Warehouse schema: aggregates for reporting tools and casual users.

Data Parallel statistics
∙ Abstraction levels: Scalar → Vector → Function → Functional.
∙ A hierarchy of mathematical concepts in SQL (MapReduce as well).
∙ Encapsulated as stored procedures and UDFs.
∙ Need to be able to use statistical vocabulary.

Vectors and Matrices
Let A and B be two matrices of identical dimensions.
Matrix Addition:
    SELECT A.row_number, A.vector + B.vector
    FROM A, B
    WHERE A.row_number = B.row_number;
Multiplication of matrix and a vector Av:
    SELECT 1, array_accum(row_number, vector*v)
    FROM A;
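The two queries above store a matrix as one table row per matrix row, `(row_number, vector)`. As a rough illustration of what they compute, here is a minimal Python sketch under that assumption; the sample values and the helper names `matrix_add`, `dot` and `matrix_vector` are made up for this example and are not from the paper.

```python
# Each matrix is stored like the slide's tables: row_number -> row vector.
# The sample values below are illustrative only.
A = {1: [1.0, 2.0], 2: [3.0, 4.0]}
B = {1: [10.0, 20.0], 2: [30.0, 40.0]}
v = [1.0, -1.0]


def matrix_add(a, b):
    """Join the two tables on row_number and add the row vectors element-wise."""
    return {r: [x + y for x, y in zip(a[r], b[r])] for r in a if r in b}


def dot(x, y):
    """Dot product of two equal-length vectors (the slide's vector*v)."""
    return sum(p * q for p, q in zip(x, y))


def matrix_vector(a, vec):
    """One dot product per row, accumulated in row_number order."""
    return [dot(a[r], vec) for r in sorted(a)]


print(matrix_add(A, B))     # {1: [11.0, 22.0], 2: [33.0, 44.0]}
print(matrix_vector(A, v))  # [-1.0, -1.0]
```

Each row is processed independently of the others, which is what makes this layout easy to spread across the nodes of a parallel database.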

Vectors and Matrices: Contd.
Matrix transpose of an m × n matrix:
    SELECT S.col_number,
           array_accum(A.row_number, A.vector[S.col_number])
    FROM A, generate_series(1,3) AS S(col_number)
    GROUP BY S.col_number;
Matrix Multiplication:
    SELECT A.row_number, B.column_number, SUM(A.value * B.value)
    FROM A, B
    WHERE A.column_number = B.row_number
    GROUP BY A.row_number, B.column_number
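The multiplication query above uses a sparse layout, one `(row_number, column_number, value)` triple per matrix entry, and it runs unchanged on most SQL engines. The sketch below tries it on an in-memory SQLite database from Python. The sample matrices are invented, and an `ORDER BY` is added only to make the output deterministic. SQLite has no `array_accum`, so the transpose is shown in the same sparse layout (swap the two index columns) rather than in the slide's dense-vector form; that substitution is an assumption of this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (row_number INTEGER, column_number INTEGER, value REAL);
    CREATE TABLE B (row_number INTEGER, column_number INTEGER, value REAL);
    -- Illustrative data: A = [[1, 2], [3, 4]], B = [[5, 6], [7, 8]]
    INSERT INTO A VALUES (1, 1, 1), (1, 2, 2), (2, 1, 3), (2, 2, 4);
    INSERT INTO B VALUES (1, 1, 5), (1, 2, 6), (2, 1, 7), (2, 2, 8);
""")

# Matrix multiplication: join A's columns to B's rows, then sum the products.
product = conn.execute("""
    SELECT A.row_number, B.column_number, SUM(A.value * B.value)
    FROM A, B
    WHERE A.column_number = B.row_number
    GROUP BY A.row_number, B.column_number
    ORDER BY A.row_number, B.column_number
""").fetchall()

# Sparse transpose: swap the row and column indices of every entry.
transpose = conn.execute("""
    SELECT column_number AS i, row_number AS j, value
    FROM A
    ORDER BY i, j
""").fetchall()

print(product)    # [(1, 1, 19.0), (1, 2, 22.0), (2, 1, 43.0), (2, 2, 50.0)]
print(transpose)  # [(1, 1, 1.0), (1, 2, 3.0), (2, 1, 2.0), (2, 2, 4.0)]
conn.close()
```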

Example: tf-idf and Cosine similarity
Document similarity: Fraud detection
∙ Create triples of (document, term, count).
∙ Create marginals along document and term using group by queries.
∙ Expand each triple with a tf-idf score.
∙ Obtain cosine similarity of two document vectors x, y: $\theta = \frac{x \cdot y}{\|x\|_2 \, \|y\|_2}$
Let A have one row per document vector.
    SELECT a1.row_id AS document_i, a2.row_id AS document_j,
           (a1.row_v * a2.row_v) /
           sqrt((a1.row_v * a1.row_v) * (a2.row_v * a2.row_v)) AS theta
    FROM a AS a1, a AS a2
    WHERE a1.row_id > a2.row_id
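The steps above (triples, marginals, tf-idf scores, cosine similarity) can be sketched in plain Python as follows. This is only an illustration: the three toy documents are invented, and the weighting used here, term frequency divided by document length times log(N / document frequency), is one common tf-idf variant chosen as an assumption; the slides do not say which variant the paper uses.

```python
import math
from collections import Counter

# Toy corpus (hypothetical data for illustration).
docs = {
    "d1": "big data needs deep analysis".split(),
    "d2": "big data needs agile analysis".split(),
    "d3": "magnetic warehouses attract analysts".split(),
}

# Step 1: (document, term, count) triples.
triples = [(d, t, c) for d, words in docs.items()
           for t, c in Counter(words).items()]

# Step 2: marginals along document (total terms) and term (documents containing it).
doc_len = Counter()
doc_freq = Counter()
for d, t, c in triples:
    doc_len[d] += c
    doc_freq[t] += 1

# Step 3: expand each triple with a tf-idf score (assumed variant, see lead-in).
n_docs = len(docs)
tfidf = {d: {} for d in docs}
for d, t, c in triples:
    tfidf[d][t] = (c / doc_len[d]) * math.log(n_docs / doc_freq[t])


def cosine(x, y):
    """Cosine similarity x.y / (||x||_2 ||y||_2) of two sparse vectors (dicts)."""
    dot = sum(w * y.get(t, 0.0) for t, w in x.items())
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


print(cosine(tfidf["d1"], tfidf["d2"]))  # shares four of five terms: between 0 and 1
print(cosine(tfidf["d1"], tfidf["d3"]))  # no shared terms: 0.0
```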

Matrix based analytical methods: Ordinary Least Squares
Large dense matrices: distance matrix D, covariance matrices.
∙ OLS: modeling seasonal trends.
∙ Statistical estimate of $\beta^*$ best satisfying $Y = X\beta$.
∙ $X$ is $n \times k$, $Y = \{o_1, \dots, o_n\}$ (the observations, written $y_i$ below), $\beta^* = (X'X)^{-1} X'y$.
∙ Coefficient of determination, with $b = X'y$:
    $\mathrm{SSR} = b'\beta^* - \frac{1}{n}\left(\sum_i y_i\right)^2$
    $\mathrm{TSS} = \sum_i y_i^2 - \frac{1}{n}\left(\sum_i y_i\right)^2$
    $R^2 = \mathrm{SSR} / \mathrm{TSS}$
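For readers wondering where the SSR and TSS expressions come from, the following is a short derivation sketch. It assumes that $X$ contains a constant (intercept) column, so that the fitted values $\hat{y} = X\beta^*$ have the same mean $\bar{y}$ as the observations; the slides do not state this assumption explicitly.

```latex
\begin{aligned}
X'X\beta^* &= X'y = b
  && \text{(normal equations)} \\
\hat{y}'\hat{y} &= \beta^{*\prime} X'X \beta^* = \beta^{*\prime} b = b'\beta^*
  && \text{(with } \hat{y} = X\beta^* \text{)} \\
\mathrm{SSR} &= \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2
  = \hat{y}'\hat{y} - n\bar{y}^2
  = b'\beta^* - \frac{1}{n}\Big(\sum_{i=1}^{n} y_i\Big)^2 \\
\mathrm{TSS} &= \sum_{i=1}^{n} (y_i - \bar{y})^2
  = \sum_{i=1}^{n} y_i^2 - \frac{1}{n}\Big(\sum_{i=1}^{n} y_i\Big)^2 \\
R^2 &= \frac{\mathrm{SSR}}{\mathrm{TSS}}
\end{aligned}
```

Every quantity on the right-hand side is a sum over rows ($X'X$, $X'y$, $\sum y_i$, $\sum y_i^2$, $n$), which is why the whole computation fits into a single aggregation pass.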

Routine to compute OLS
    CREATE VIEW ols AS
    SELECT pseudo_inverse(A) * b AS beta_star,
           (transpose(b) * (pseudo_inverse(A) * b) - sum_y2/n)  -- SSR
           / (sum_yy - sum_y2/n)                                 -- TSS
           AS r_squared
    FROM (
      SELECT sum(transpose(d.vector) * d.vector) AS A,
             sum(d.vector * y) AS b,
             sum(y)^2 AS sum_y2,
             sum(y^2) AS sum_yy,
             count(*) AS n
      FROM design d
    ) ols_aggs;
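The view above gathers five aggregates in one pass over `design` and then does a small amount of matrix algebra. Below is a minimal Python sketch of the same flow for the special case of two coefficients (intercept and slope). The sample data are invented, and the 2 × 2 system is solved with Cramer's rule instead of the `pseudo_inverse` function used in the view; both are assumptions made to keep the example self-contained.

```python
# Each design row is (vector, y); vector = [1, x] gives an intercept and a slope.
# Illustrative data lying exactly on y = 1 + 2x.
design = [([1.0, 1.0], 3.0), ([1.0, 2.0], 5.0),
          ([1.0, 3.0], 7.0), ([1.0, 4.0], 9.0)]


def ols_2x2(rows):
    """Single pass of aggregates, then beta* and R^2 for k = 2 coefficients."""
    k = 2
    A = [[0.0] * k for _ in range(k)]   # sum of vector' * vector  (X'X)
    b = [0.0] * k                       # sum of vector * y        (X'y)
    sum_y = sum_yy = 0.0
    n = 0
    for vec, y in rows:
        for i in range(k):
            b[i] += vec[i] * y
            for j in range(k):
                A[i][j] += vec[i] * vec[j]
        sum_y += y
        sum_yy += y * y
        n += 1

    # Solve A * beta = b with Cramer's rule (stand-in for pseudo_inverse(A) * b).
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    beta = [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

    ssr = sum(bi * be for bi, be in zip(b, beta)) - sum_y ** 2 / n
    tss = sum_yy - sum_y ** 2 / n
    return beta, ssr / tss


beta_star, r_squared = ols_2x2(design)
print(beta_star, r_squared)  # [1.0, 2.0] 1.0
```

Because the aggregates are plain sums, partial sums computed on separate data segments can be added together before the final solve; a real implementation would replace Cramer's rule with a general solver for larger k.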

MAD DBMS
∙ Agile: physical storage evolution easy and efficient.
∙ Magnetic: painless and efficient data insertion.
∙ Deep: powerful, flexible programming environment.

Conclusions
∙ A database is not proprietary hardware: it is a parallel computation engine.
∙ Storage is not expensive, math is not hard.
∙ SQL is flexible and highly extensible.
Issues with Paper
∙ How are queries parallelized? If we write in R, it's not automatic.
∙ MapReduce here vs Hadoop?
∙ Ad for Greenplum :)
