 
Greenplum Database 4.0, Greenplum Chorus, and Advanced Analytics
Presentation, June 2010
Luke Lonergan, CTO and Co-Founder, Greenplum
7/26/2010 Confidential

Greenplum: Demonstrating Market Momentum and Leadership
• 2009 was a breakout year for GP
• Surpassed 100 global enterprise customers
• 100% year-over-year growth
• Projecting the same for 2010
• Open systems model is winning: significant leverage with Dell, EMC, Cisco and others
• Acquiring net new customers at a faster pace than Netezza and Teradata
• Growth is enabling us to innovate beyond our competitors, not only within the core database but also in new product initiatives

Announcing Greenplum Database 4.0: Critical Mass Innovation
• 4.0 represents industry-leading innovations in:
  – Workload Management
  – Fault-Tolerance
  – Advanced Analytics
• Culmination of more than 7 years of research and development
• First emerging SW vendor to achieve critical mass and maturity across all necessary aspects of enterprise-class DBMS platforms:
  – Complex query optimization
  – Data loading
  – Workload Management
  – Fault-Tolerance
  – Embedded languages/analytics
  – 3rd-party ISV certification
  – Administration and Monitoring
• Genuine floor-sweep replacement option for Teradata, Oracle, DB2, and SQL Server

The Need for Change: State of Play – Data in a Typical Enterprise
• Data is everywhere – corporate EDW, 100s of data marts, ‘shadow’ databases and spreadsheets
• The goal of centralizing all data in a single EDW has proven untenable
[Figure: EDW ~10% of data; Data Marts and ‘Personal Databases’ ~90% of data]

Introducing Greenplum Chorus: The World’s First Enterprise Data Cloud Platform
• New software product
• World’s first Enterprise Data Cloud (EDC) platform, enabling:
  – Self-service provisioning
  – Data services
  – Data collaboration
• Customers deploy Chorus along with GP Database to create a net new, self-service analytic infrastructure
• Chorus can significantly accelerate the time and ease with which companies extract value and insight from their data

Greenplum Chorus: Core Design Philosophies
• Secure – Provide comprehensive and granular access control over who is authorized to view and subscribe to data within Chorus
• Collaborative – Facilitate the publishing, discovery, and sharing of data and insight using a social computing model that appears familiar and easy to use
• Data-centric – Focus on the necessary tooling to manage the flow and provenance of data sets as they are created and shared within a company
• MAD Skills in Action – Build a platform capable of supporting the magnetic, agile, and deep principles of MAD Skills

Greenplum Chorus: Core Technologies
• Greenplum Chorus has unique technical requirements that demand a new kind of core infrastructure
• Cloud platforms have focused on scalable processing, not on scalable and dynamic flow of data
• Chorus needs both:
  – DATA REQUIREMENTS: Coordinate complex dataflow and data lifecycle across 10s or 100s of distinct databases
  – CLOUD REQUIREMENTS: Low TCO provisioning and control of distributed processing and storage

Greenplum Data Hypervisor™
• Greenplum Data Hypervisor™ is the execution and operational framework for all ‘outside the database’ activities
  – Coordinate complex cross-database data movement
  – Manage all Chorus state and in-flight activities
  – Orchestrate database instance provisioning, expansion, and other operational activities
  – Respond to events and failures with compensating actions and escalation
  – Execute arbitrary programs and process flow in a strongly fault-tolerant manner
• Greenplum Data Hypervisor™ is built for the cloud
  – Underlying consensus/replication model is similar to Google’s core
  – Handles and recovers from failures mid-operation, even within complex multi-step flows
  – Scales to 10,000s of nodes across geographic boundaries and WAN links
  – Runs unnoticed within every Chorus and GPDB server, and elsewhere as needed
[Figure: layered stack, top to bottom: Developer API; Schedule + Dependency Management; App & Process Flow Runtime; Distributed State Management; State Replication; Consensus Protocol]

Greenplum Chorus: Customer Example, Telecom
[Figure labels: 100 TB EDW; GP Database + Chorus; 1 Petabyte EDC]
Customer Challenge:
  – 100TB Teradata EDW focused on operational reporting and financial consolidation
  – EDW is single source of truth, under heavy governance and control
  – Unable to support all of the critical initiatives around data surrounding the business
  – Customer loyalty and churn the #1 business initiative from the CEO on down
Greenplum Database + Chorus:
  – Extracted data from the EDW and other source systems to quickly assemble a new analytic mart
  – Generated a social graph from call detail records and subscriber data
  – Within 2 weeks uncovered behavior where “connected” subscribers were 7X more likely to churn than the average user
  – Now deploying 1PB production EDC with GP to power their analytic initiatives

Greenplum Database + Chorus: Platform for Data and Analytic Solutions
[Figure: layered diagram, top to bottom]
• Solutions: Operational Data Store; Data Mart Consolidation; Analytics Lab; Offer/Campaign Management; Value-Added Services
• Greenplum Chorus: Self-Service Provisioning, Data Virtualization, Collaboration
• Greenplum Database: Massively Scalable, Reliable, and Flexible Data Platform

Advanced Analytics: An Overview of MAD Skills

Magnetic
• Attract data…
  – Parallel data loads
  – MapReduce
  – External tables
  – Data Cart
  – Web tables
  – ETL/ELT
• Attract practitioners…
  – Parallel query execution
  – Analytics libraries
  – OLAP, window functions
  – Sandboxes
  – Built-in analytics
  – Collaboration

Agile
[Figure: cycle of three steps]
• Get data into the cloud
• Analyze and model in the cloud
• Push results back into the cloud

Agile
EDC PLATFORM (Sandbox)
• Ingest raw data into staging tables
• Generate useful features by a sequence of aggregations
• Export a sample, plus hold-out, for model design
• Design models in R and test for fitness on out-of-sample data
• Iterate on feature selection and model form to improve fit
• Go back and ingest more data as required
[Figure labels: Staging tables; Sample tables; Transformed Data; Models and Results; model <- function(x1, x2)]
EDC PLATFORM (Staging)
• Implement robust mapping scripts, and use more comprehensive data for training and testing
• Implement transformations as parallelized SQL statements
• Implement models as scalable Greenplum Analytics functions
[Figure labels: Staging tables; SQL fragments, truncated on the slide:]
    CREATE TABLE temp1 AS SELECT customerID, max(
    DELETE FROM temp1 WHERE num <
    CREATE VIEW ols AS SELECT pseudo_inverse( FROM (SELECT sum(trans

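The sandbox loop above (export a sample plus hold-out, fit a model, test fitness on out-of-sample data) can be sketched in a few lines. This is a minimal illustration in plain Python, not the R or SQL shown on the slide. The data is synthetic and made up for the example, and the helper names `fit_ols` and `mean_squared_error` are hypothetical.

```python
import random


def fit_ols(xs, ys):
    """Closed-form least squares for y = intercept + slope * x (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def mean_squared_error(model, xs, ys):
    """Average squared prediction error of the fitted line on the given rows."""
    intercept, slope = model
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic data (an assumption for illustration): y = 2 + 3x plus small noise.
rng = random.Random(0)
rows = [(float(x), 2.0 + 3.0 * x + rng.gauss(0.0, 0.1)) for x in range(100)]

# "Export a sample, plus hold-out": 80 rows to design the model, 20 held out.
rng.shuffle(rows)
train, holdout = rows[:80], rows[80:]

model = fit_ols([x for x, _ in train], [y for _, y in train])

# "Test for fitness on out-of-sample data": score only on the held-out rows.
holdout_mse = mean_squared_error(
    model, [x for x, _ in holdout], [y for _, y in holdout]
)
```

If the hold-out error is much worse than the training error, that is the cue to iterate on feature selection or model form, or to go back and ingest more data.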
Deep
• What happened where and when?
• How and why did it happen?
• What will happen?
• How can we do better?

Deep
• Window functions
  – Cume_Dist
  – Lag, Lead
  – Rank
  – etc…
• Built-in analytics
  – Matrix operations
  – Multiple linear regression
  – Naïve Bayes
  – etc…
• Libraries
  – Mann-Whitney U Test
  – Chi-Square Test
  – PL/R
  – NLTK
  – etc…
• Methods
  – Log likelihood
  – Conjugate gradient
  – Re-sampling
  – etc…

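For readers less familiar with the window functions listed above, the following plain-Python sketch shows what Rank, Cume_Dist, Lag, and Lead compute over a single ordered column. It follows the standard SQL definitions as I understand them and is not Greenplum's implementation. The function names and the sample values are illustrative only.

```python
from bisect import bisect_left, bisect_right


def rank(values):
    """RANK() in ascending order: ties share a rank and leave gaps after them."""
    ordered = sorted(values)
    return [bisect_left(ordered, v) + 1 for v in values]


def cume_dist(values):
    """CUME_DIST(): fraction of rows whose value is <= the current row's value."""
    ordered = sorted(values)
    return [bisect_right(ordered, v) / len(ordered) for v in values]


def lag(values, offset=1):
    """LAG(): value from `offset` rows earlier, or None when there is no such row."""
    return [values[i - offset] if i - offset >= 0 else None
            for i in range(len(values))]


def lead(values, offset=1):
    """LEAD(): value from `offset` rows later, or None when there is no such row."""
    return [values[i + offset] if i + offset < len(values) else None
            for i in range(len(values))]


# Made-up example column, already in window order.
call_minutes = [10, 20, 20, 40]

ranks = rank(call_minutes)
distribution = cume_dist(call_minutes)
previous = lag(call_minutes)
following = lead(call_minutes)
```

In the database these run per partition and in parallel. The sketch only shows the per-row arithmetic.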
Vector and Matrix Types and Operators

Vectors and Matrices
• Types
  – Sparse and dense vectors and matrices
  – All numerical formats, byte through double and complex double
  – Run-length compression of duplicate values
• Operations
  – Native scalar operations (natural arithmetic)
  – Dot product, vector triple product, vector/matrix algebra
  – Set operations (contains, AND/OR, etc.)
• Library interfaces
  – Solvers from LAPACK and Sparspak

Example: Basic Vector Arithmetic

Enter a sparse vector:
    greenplumdb=# select '{1,4,1,3,1,7}:{1.1,0,2.2,0,3.3,0}'::svec;

Cast a sparse vector to a dense vector:
    greenplumdb=# select '{1,4,1,3,1,7}:{1.1,0,2.2,0,3.3,0}'::svec::float8[];
                      float8
    -------------------------------------------
     {1.1,0,0,0,0,2.2,0,0,0,3.3,0,0,0,0,0,0,0}

Scalar multiply two vectors:
    greenplumdb=# select '{1,10,20}:{1,2,3}'*'{10,20,1}:{1,2,3}'::svec;
             ?column?
    --------------------------
     {1,9,1,19,1}:{1,2,4,6,9}

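The `{counts}:{values}` literals above are run-length encoded: each value is repeated by the count in the same position. The sketch below is plain Python, not the svec implementation. It decodes that notation and multiplies two encoded vectors element by element without expanding them first. The helper names `parse_svec`, `to_dense`, and `multiply` are hypothetical, and the inputs are the literals from the slide.

```python
def parse_svec(text):
    """Parse '{counts}:{values}' into a list of (count, value) runs."""
    counts_part, values_part = text.split(":")
    counts = [int(c) for c in counts_part.strip("{}").split(",")]
    values = [float(v) for v in values_part.strip("{}").split(",")]
    if len(counts) != len(values) or any(c < 1 for c in counts):
        raise ValueError("need one positive count per value")
    return list(zip(counts, values))


def to_dense(runs):
    """Expand runs into a plain list, repeating each value `count` times."""
    dense = []
    for count, value in runs:
        dense.extend([value] * count)
    return dense


def multiply(a, b):
    """Element-wise product of two run-length encoded vectors of equal length."""
    if sum(c for c, _ in a) != sum(c for c, _ in b):
        raise ValueError("vectors must have the same length")
    a = [list(run) for run in a]  # mutable copies: [remaining count, value]
    b = [list(run) for run in b]
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        step = min(a[i][0], b[j][0])  # positions covered by both current runs
        product = a[i][1] * b[j][1]
        if result and result[-1][1] == product:
            result[-1][0] += step     # merge with the previous run
        else:
            result.append([step, product])
        a[i][0] -= step
        b[j][0] -= step
        if a[i][0] == 0:
            i += 1
        if b[j][0] == 0:
            j += 1
    return [tuple(run) for run in result]


dense = to_dense(parse_svec("{1,4,1,3,1,7}:{1.1,0,2.2,0,3.3,0}"))
product = multiply(parse_svec("{1,10,20}:{1,2,3}"),
                   parse_svec("{10,20,1}:{1,2,3}"))
```

Working on runs rather than on individual elements keeps the cost proportional to the number of runs, which is the point of compressing duplicate values.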
Example: Iterative Logistic Regression

Weight calculation:
    CREATE TABLE latest_coefficient (
        coefs svec NOT NULL
    );

    INSERT INTO weight (rownumber, logistic, weight)
    (SELECT rownumber, logistic, logistic * (1 - logistic)
     FROM (SELECT a.rownumber rownumber,
                  1 / (1 + exp(-vec_sum(c.coefs * ARRAY[1, a.gre, a.topnotch, a.gpa]))) logistic
           FROM admission a, latest_coefficient c) foo);

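The weight calculation above reduces to two formulas per row: logistic = 1 / (1 + exp(-coefs · [1, gre, topnotch, gpa])) and weight = logistic × (1 − logistic). The sketch below restates that arithmetic in plain Python so it can be checked by hand. The column names come from the slide. The function name `logistic_weights`, the coefficient values, and the sample rows are made up for illustration. The slide does not name the fitting algorithm, so reading this as one step of an iteratively reweighted scheme is an assumption based on the title.

```python
import math


def logistic_weights(coefs, rows):
    """Return (rownumber, logistic, weight) for each admission row.

    coefs: four coefficients, applied to [1, gre, topnotch, gpa].
    rows:  iterable of (rownumber, gre, topnotch, gpa) tuples.
    """
    results = []
    for rownumber, gre, topnotch, gpa in rows:
        features = [1.0, gre, topnotch, gpa]  # leading 1 is the intercept term
        linear = sum(c * x for c, x in zip(coefs, features))
        # Note: math.exp can overflow for very large negative `linear`;
        # this sketch does not guard against that.
        logistic = 1.0 / (1.0 + math.exp(-linear))
        results.append((rownumber, logistic, logistic * (1.0 - logistic)))
    return results


# Hypothetical rows and starting coefficients, not data from the presentation.
sample_rows = [(1, 380.0, 0.0, 3.61), (2, 660.0, 1.0, 3.67)]
initial = logistic_weights([0.0, 0.0, 0.0, 0.0], sample_rows)
updated = logistic_weights([-4.0, 0.002, 0.4, 0.7], sample_rows)
```

With all coefficients at zero, every row gets a logistic value of 0.5 and a weight of 0.25. The weight is largest there and shrinks as a prediction approaches 0 or 1.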