Greenplum Database 4.0, Greenplum Chorus, and Advanced Analytics


SLIDE 1

Greenplum Database 4.0, Greenplum Chorus, and Advanced Analytics

Presentation, June 2010

Luke Lonergan, CTO and Co-Founder, Greenplum

SLIDE 2

Greenplum: Demonstrating Market Momentum and Leadership

  • 2009 was a breakout year for GP
  • Surpassed 100 global enterprise customers
  • 100% year-over-year growth
  • Projecting the same for 2010
  • Open systems model is winning: significant leverage with Dell, EMC, Cisco, and others
  • Acquiring net new customers at a faster pace than Netezza and Teradata
  • Growth is enabling us to innovate beyond our competitors, not only within the core database but in new product initiatives

SLIDE 3

Announcing Greenplum Database 4.0: Critical Mass Innovation

  • 4.0 represents industry-leading innovations in:
    – Workload Management
    – Fault Tolerance
    – Advanced Analytics
  • Culmination of more than 7 years of research and development
  • First emerging SW vendor to achieve critical mass and maturity across all necessary aspects of enterprise-class DBMS platforms:
    – Complex query optimization
    – Data loading
    – Workload Management
    – Fault Tolerance
    – Embedded languages/analytics
    – 3rd-party ISV certification
    – Administration and Monitoring
  • Genuine floor-sweep replacement option for Teradata, Oracle, DB2, and SQL Server

SLIDE 4

The Need for Change: State of Play – Data in a Typical Enterprise

  • Data is everywhere – corporate EDW, 100s of data marts, ‘shadow’ databases and spreadsheets
  • The goal of centralizing all data in a single EDW has proven untenable

[Diagram: the EDW holds ~10% of the data; data marts and ‘personal databases’ hold ~90%]

SLIDE 5

Introducing Greenplum Chorus: The World’s First Enterprise Data Cloud Platform

  • New software product
  • World’s first Enterprise Data Cloud (EDC) Platform, enabling:
    – Self-service provisioning
    – Data services
    – Data collaboration
  • Customers deploy Chorus along with GP Database to create a net new, self-service analytic infrastructure
  • Chorus significantly reduces the time and effort required for companies to extract value and insight from their data

SLIDE 6

Greenplum Chorus: Core Design Philosophies

  • Secure
    – Provide comprehensive and granular access control over who is authorized to view and subscribe to data within Chorus
  • Collaborative
    – Facilitate the publishing, discovery, and sharing of data and insight using a familiar, easy-to-use social computing model
  • Data-centric
    – Focus on the tooling necessary to manage the flow and provenance of data sets as they are created and shared within a company
  • MAD Skills in Action
    – Build a platform capable of supporting the magnetic, agile, and deep principles of MAD Skills

SLIDE 7

Greenplum Chorus: Core Technologies

  • Greenplum Chorus has unique technical requirements that demand a new kind of core infrastructure
  • Cloud platforms have focused on scalable processing – not scalable and dynamic flow of data
  • Chorus needs both:
    – DATA REQUIREMENTS: Coordinate complex dataflow and data lifecycle across 10s or 100s of distinct databases
    – CLOUD REQUIREMENTS: Low-TCO provisioning and control of distributed processing and storage

SLIDE 8

Greenplum Data Hypervisor™

  • Greenplum Data Hypervisor™ is the execution and operational framework for all ‘outside the database’ activities:
    – Coordinate complex cross-database data movement
    – Manage all Chorus state and in-flight activities
    – Orchestrate database instance provisioning, expansion, and other operational activities
    – Respond to events and failures with compensating actions and escalation
    – Execute arbitrary programs and process flows in a strongly fault-tolerant manner
  • Greenplum Data Hypervisor™ is built for the cloud
    – Underlying consensus/replication model is similar to Google’s core
    – Handles and recovers from failures mid-operation – even within complex multi-step flows
    – Scales to 10,000s of nodes across geographic boundaries and WAN links
    – Runs unnoticed within every Chorus and GPDB server, and elsewhere as needed

[Architecture stack: Consensus Protocol · State Replication · Distributed State Management · App & Process Flow Runtime · Schedule + Dependency Management · Developer API]

SLIDE 9

Greenplum Chorus: Customer Example, Telecom

[Diagram: 100 TB Teradata EDW alongside a 1-petabyte EDC built on GP Database + Chorus]

Customer Challenge:
– 100TB Teradata EDW focused on operational reporting and financial consolidation
– EDW is the single source of truth, under heavy governance and control
– Unable to support all of the critical data initiatives surrounding the business
– Customer loyalty and churn the #1 business initiative from the CEO on down

Greenplum Database + Chorus:
– Extracted data from the EDW and other source systems to quickly assemble a new analytic mart
– Generated a social graph from call detail records and subscriber data (see the sketch below this slide)
– Within 2 weeks uncovered behavior where “connected” subscribers were 7X more likely to churn than the average user
– Now deploying a 1PB production EDC with GP to power their analytic initiatives
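A minimal sketch of the social-graph construction in SQL, assuming a hypothetical call-detail table cdr(caller_id, callee_id, call_ts); this is an illustration, not the customer’s actual schema:

SELECT caller_id, callee_id, count(*) AS call_count
FROM cdr
GROUP BY caller_id, callee_id
HAVING count(*) >= 3;   -- keep repeated contacts as edges of the graph

Joining edges like these to subscriber churn flags is what surfaces patterns such as “connected” subscribers churning 7X more often.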

SLIDE 10

Greenplum Database + Chorus: Platform for Data and Analytic Solutions

[Layer diagram:]

Solutions: Offer/Campaign Management · Value-Added Services · Data Mart Consolidation · Operational Data Store · Analytics Lab

Greenplum Chorus: Self-Service Provisioning, Data Virtualization, Collaboration

Greenplum Database: Massively Scalable, Reliable, and Flexible Data Platform

SLIDE 11

Advanced Analytics: An Overview of MAD Skills

SLIDE 12

Magnetic

  • Attract data…
    – Parallel data loads
    – External tables (see the sketch after this list)
    – Web tables
  • Attract practitioners…
    – Parallel query execution
    – OLAP, window functions
    – Built-in analytics
    – MapReduce
    – Data Cart
    – ETL/ELT
    – Analytics libraries
    – Sandboxes
    – Collaboration
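For example, a minimal sketch of a Greenplum external table served in parallel by gpfdist (the host name and file paths here are hypothetical):

CREATE EXTERNAL TABLE ext_clicks (
    click_ts  timestamp,
    user_id   bigint,
    url       text
)
LOCATION ('gpfdist://etl-host:8081/clicks/*.txt')
FORMAT 'TEXT' (DELIMITER '|');

-- Queried like any table; the segments read the files in parallel:
-- INSERT INTO clicks SELECT * FROM ext_clicks;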
SLIDE 13

Agile

[Cycle diagram: get data into the cloud → analyze and model in the cloud → push results back into the cloud]

SLIDE 14

Agile

[Workflow diagram across the EDC platform, from staging tables through sample tables and transformed data to models and results. The recoverable steps:]

EDC PLATFORM (Staging):
  • Ingest raw data into staging tables
  • Generate useful features by a sequence of aggregations
  • Go back and ingest more data as required
  • Implement robust mapping scripts, and use more comprehensive data for training and testing

EDC PLATFORM (Sandbox):
  • Export a sample, plus hold-out, for model design
  • Design models in R and test for fitness on out-of-sample data (e.g., model <- function(x1, x2) …)
  • Implement transformations as parallelized SQL statements (e.g., CREATE TABLE temp1 AS SELECT customerID, max(… and DELETE FROM temp1 WHERE num < …)
  • Iterate on feature selection and model form to improve fit
  • Implement models as scalable Greenplum Analytics functions (e.g., CREATE VIEW ols AS SELECT pseudo_inverse(… FROM (SELECT sum(trans…)

SLIDE 15

Deep

[Quadrant diagram: What happened, where and when? · How and why did it happen? · What will happen? · How can we do better?]

SLIDE 16

Deep

  • Window functions (example below)
    – Cume_Dist
    – Lag, Lead
    – Rank
    – etc…
  • Libraries
    – Mann-Whitney U Test
    – Chi-Square Test
    – PL/R
    – NLTK
    – etc…
  • Built-in analytics
    – Matrix operations
    – Multiple linear regression
    – Naïve Bayes
    – etc…
  • Methods
    – Log likelihood
    – Conjugate gradient
    – Re-sampling
    – etc…
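A small illustration of the window-function flavor listed above, over a hypothetical sales(region, month, revenue) table:

SELECT region,
       month,
       revenue,
       rank()       OVER (PARTITION BY region ORDER BY revenue DESC) AS revenue_rank,
       lag(revenue) OVER (PARTITION BY region ORDER BY month)        AS prev_month_revenue,
       cume_dist()  OVER (ORDER BY revenue)                          AS revenue_percentile
FROM sales;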

SLIDE 17

Vector and Matrix Types and Operators

SLIDE 18

Vectors and Matrices

  • Types
    – Sparse and dense vectors and matrices
    – All numerical formats, byte through double and complex double
    – Run-length-code compression of duplicate values
  • Operations
    – Native scalar operations (natural arithmetic)
    – Dot product, vector triple product, vector/matrix algebra
    – Set operations (contains, AND/OR, etc.)
  • Library interfaces
    – Solvers from LAPACK and SPARSPAK

SLIDE 19

Example: Basic Vector Arithmetic

Enter a sparse vector:

greenplumdb=# select '{1,4,1,3,1,7}:{1.1,0,2.2,0,3.3,0}'::svec;

Cast a sparse vector to a dense vector:

greenplumdb=# select '{1,4,1,3,1,7}:{1.1,0,2.2,0,3.3,0}'::svec::float8[];
                   float8
-------------------------------------------
 {1.1,0,0,0,0,2.2,0,0,0,3.3,0,0,0,0,0,0,0}

Multiply two vectors element-wise:

greenplumdb=# select '{1,10,20}:{1,2,3}'*'{10,20,1}:{1,2,3}'::svec;
         ?column?
--------------------------
 {1,9,1,19,1}:{1,2,4,6,9}
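In the same vein, a sketch of the %*% dot-product operator, which the clustering example later in this deck relies on. Summing the element-wise products above gives 1·1 + 9·2 + 1·4 + 19·6 + 1·9 = 146:

greenplumdb=# select '{1,10,20}:{1,2,3}'::svec %*% '{10,20,1}:{1,2,3}'::svec;
 ?column?
----------
      146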

SLIDE 20

Example: Iterative Logistic Regression

Weight calculation:

CREATE TABLE latest_coefficient ( coefs svec NOT NULL );

INSERT INTO weight (rownumber, logistic, weight)
(SELECT rownumber, logistic, logistic * (1 - logistic)
   FROM (SELECT a.rownumber rownumber,
                1 / (1 + exp(vec_sum(c.coefs * ARRAY[1, a.gre, a.topnotch, a.gpa]))) logistic
           FROM admission a, latest_coefficient c) foo);
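For context, the update step that this iteration loops over (not shown on the slide) is the standard Newton/IRLS step for logistic regression, where W is the diagonal matrix of the weights logistic·(1 − logistic) computed above, X is the design matrix [1, gre, topnotch, gpa], and μ the fitted logistic values:

β_{t+1} = β_t + (Xᵀ W X)⁻¹ Xᵀ (y − μ)

Each pass recomputes the weights from the latest coefficients and solves this system for the next coefficient vector.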

SLIDE 21

Example: Plotting Residual of Iterative Method

Residual calculation using l2norm():

SELECT iteration, abs(residual)
FROM (
    SELECT iteration,
           l2norm - last_value(l2norm) OVER (w1) AS residual
    FROM ( SELECT iteration, l2norm(coefs) AS l2norm
           FROM coefficient ORDER BY 1 ) bar
    WINDOW w1 AS (ORDER BY iteration ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING)
    OFFSET 1
) foo;

SLIDE 22

Example: Plotting Residual of Iterative Method

Residual plot using gnuplot:

\o | gnuplot -e "set logscale y; plot '-' with lines title 'l2norm of residual'"
SELECT <data>
\o

SLIDE 23

Example

What’s the right price for my products?

SLIDE 24

What’s the right price for my products?

SLIDE 25

What’s the right price for my products?

Get the raw data…

DROP TABLE IF EXISTS misc.price_promo;
CREATE TABLE misc.price_promo (
    dt                     date
   ,base_price             numeric
   ,display_price          numeric
   ,feature_price          numeric
   ,feature_display_price  numeric
   ,tpr                    numeric
   ,volume                 numeric
) DISTRIBUTED BY (dt);
\copy misc.price_promo from data.csv with delimiter ','

Date        Base Price  Display Price  Feature Price  Feature/Display Price  TPR   Volume
2009-02-24  7.33        6.67           7.33           7.33                   7.20  20484.52
2009-03-10  7.47        5.94           5.72           7.00                   5.72  34313.94
2009-03-24  7.75        6.74           5.74           7.26                   5.82  25477.33
2009-04-07  7.40        7.19           7.40           7.40                   7.23  18772.57
2009-04-21  7.75        7.36           6.74           7.75                   6.22  20743.68
2009-05-05  7.43        6.56           6.98           7.43                   5.70  28244.82
2009-05-19  7.70        6.57           7.70           7.70                   6.23  20234.74
2009-06-02  6.87        6.67           6.87           6.87                   6.64  23262.60
2009-06-16  7.36        7.00           7.36           7.36                   7.44  19290.87
2009-06-30  6.92        6.72           6.92           6.92                   6.73  23617.61
2009-07-14  7.49        7.32           7.49           7.49                   7.58  18017.58
2009-07-28  7.69        7.44           5.69           7.69                   5.70  29193.44
2009-08-11  7.19        6.24           7.19           7.19                   6.72  23863.13
2009-08-25  7.72        6.74           7.72           7.72                   5.72  25138.34

SLIDE 26

What’s the right price for my products?

Build the model…

CREATE TABLE misc.price_promo_coefs AS
SELECT coefs[1] AS intercept_beta
      ,coefs[2] AS base_price_beta
      ,coefs[3] AS display_price_beta
      ,coefs[4] AS feature_display_price_beta
      ,coefs[5] AS tpr_beta
      ,r2
FROM (
    SELECT mregr_coef(volume, array[1::int, base_price, display_price,
                                    feature_display_price, tpr]) AS coefs
          ,mregr_r2(volume, array[1::int, base_price, display_price,
                                  feature_display_price, tpr]) AS r2
    FROM misc.price_promo
) AS a
DISTRIBUTED RANDOMLY;

intercept_beta  base_price_beta  display_price_beta  feature_display_price_beta  tpr_beta      r2
72804.48332     5049.03841       -1388.842417        -6203.882026                -4801.114351  0.883172235
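As a hedged illustration (the price points below are hypothetical, not from the deck), the stored coefficients can score a candidate pricing scenario directly:

SELECT intercept_beta
       + 7.00 * base_price_beta
       + 6.50 * display_price_beta
       + 7.00 * feature_display_price_beta
       + 6.50 * tpr_beta AS predicted_volume
FROM misc.price_promo_coefs;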

SLIDE 27

What’s the right price for my products?

Execute the model…

CREATE OR REPLACE VIEW misc.v_price_promo_fitted AS
SELECT volume
      ,volume_fitted
      ,100 * abs(volume - volume_fitted)::numeric / volume AS ape
FROM (
    SELECT p.volume
          ,c.intercept_beta
           + p.base_price * c.base_price_beta
           + p.display_price * c.display_price_beta
           + p.feature_display_price * c.feature_display_price_beta
           + p.tpr * c.tpr_beta AS volume_fitted
    FROM misc.price_promo_coefs c
        ,misc.price_promo p
) AS a;

volume    volume_fitted  ape
20484.52  20507.88       0.1140
34313.94  31381.52       8.5458
25477.33  29591.06       16.1466
18772.57  19560.80       4.1988
20743.68  23769.63       14.5873
28244.82  27746.83       1.7630
20234.74  24876.55       22.9398
23262.60  23727.72       1.9994
19290.87  18862.64       2.2198
23617.61  23168.44       1.9018
18017.58  17595.93       2.3402
29193.44  26224.39       10.1702
23863.13  23571.29       1.2229
25138.34  27065.91       7.6678
24307.88  23945.45       1.4909
…         …              …

SLIDE 28

What’s the right price for my products?

R² = 0.883, MAPE = 5.965
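A one-line sketch of where the MAPE figure plausibly comes from, assuming it is the plain average of the ape column in the fitted view two slides back:

SELECT avg(ape) AS mape FROM misc.v_price_promo_fitted;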

SLIDE 29

[Image-only slide]

SLIDE 30

Example

What are our customers saying about us?

SLIDE 31

What are our customers saying about us?

  • How do you discern trends and categories within thousands of on-line conversations?
    – Search for relevant blogs
    – Construct a ‘fingerprint’ for each document based on word frequencies
    – Use this to define what it means for documents to be similar, or ‘close’
    – Identify ‘clusters’ of documents

SLIDE 32

What are our customers saying about us?

Method

  • Construct document histograms
  • Transform histograms into document “fingerprints”
  • Use k-means clustering to classify documents

SLIDE 33

What are our customers saying about us?

Constructing document histograms

  • Parsing HTML files and extracting relevant data
  • Using natural language processing to perform text tokenization and stemming
  • Cleansing the data of inconsistencies
  • Transforming unstructured data into structured data

SLIDE 34

What are our customers saying about us?

Constructing a document “fingerprint”

  • Measure the term frequency of words within a document vs. the frequency that those words occur in all documents
  • This is a TFxIDF weight: term frequency–inverse document frequency
  • Easily calculated based on formulas over the document histograms (see the formula below)
  • The result is a vector in n-space
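For reference, the standard formula being computed (a sketch; the deck computes the log-IDF factor in SQL on a later slide). For term t in document d over a corpus of N documents, with tf(t, d) the count of t in d and df(t) the number of documents containing t:

tfxidf(t, d) = tf(t, d) × log( N / df(t) )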
SLIDE 35

What are our customers saying about us?

K-means clustering

  • An iterative algorithm for finding items that are similar within an n-dimensional space
  • Two fundamental steps (sketched below):
    1. Measuring distance to a centroid (the center of a cluster)
    2. Moving the center of a cluster to minimize the sum distance to all the members of the cluster
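In symbols (a sketch of the two steps, with x_i the document fingerprint vectors, μ_k the centroids, and S_k the set of points assigned to cluster k):

1. Assignment:  c(i) = argmin_k ‖x_i − μ_k‖
2. Update:      μ_k = (1 / |S_k|) Σ_{i ∈ S_k} x_i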

SLIDES 36–48

What are our customers saying about us?

[Thirteen image-only slides; the title repeats on each and no other text is recoverable]

SLIDE 49

What are our customers saying about us?

  • innovation
  • leader
  • cost
  • competitor
  • technology
  • speed
  • bug
  • installation
  • download
SLIDE 50

Accessing the data

  • Build the directory list into a set of files that we will access:

  id  | path                                  | body
------+---------------------------------------+------------------------------------
 2482 | /Users/demo/blogsplog/model/2482.html | <!DOCTYPE html PUBLIC "...
    1 | /Users/demo/blogsplog/model/1.html    | <!DOCTYPE html PUBLIC "...
   10 | /Users/demo/blogsplog/model/1000.html | <!DOCTYPE html PUBLIC "...
 2484 | /Users/demo/blogsplog/model/2484.html | <!DOCTYPE html PUBLIC "...
 ...

  • For each record in the list, "open()" the file and read it in its entirety:

INPUT:
  NAME: filelist
  FILE:
    - maple:/Users/demo/blogsplog/filelist1
    - maple:/Users/demo/blogsplog/filelist2
  COLUMNS:
    - path text

MAP:
  NAME: read_data
  PARAMETERS: [path text]
  RETURNS: [id int, path text, body text]
  LANGUAGE: python
  FUNCTION: |
    (_, fname) = path.rsplit('/', 1)
    (id, _) = fname.split('.')
    body = open(path).read()
    …

SLIDE 51

Parse the documents into word lists

Convert HTML documents into parsed, tokenized, stemmed term lists with stop-word removal:

MAP:
  NAME: extract_terms
  PARAMETERS: [id integer, body text]
  RETURNS: [id int, title text, doc _text]
  FUNCTION: |
    if 'parser' not in SD:
        import ...
        class MyHTMLParser(HTMLParser):
            ...
        SD['parser'] = MyHTMLParser()
    parser = SD['parser']
    parser.reset()
    parser.feed(body)
    yield (id, parser.title, '{"' + '","'.join(parser.doc) + '"}')

SLIDE 52

Parse the documents into word lists

Use the HTMLParser library to parse the HTML documents and extract titles and body contents:

if 'parser' not in SD:
    from HTMLParser import HTMLParser
    ...
    class MyHTMLParser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            ...
        def handle_data(self, data):
            data = data.strip()
            if self.inhead:
                if self.tag == 'title':
                    self.title = data
            if self.inbody:
                ...
parser = SD['parser']
parser.reset()
...

SLIDE 53

Parse the documents into word lists

Use nltk to tokenize, stem, and remove common terms:

if 'parser' not in SD:
    from nltk import WordTokenizer, PorterStemmer, corpus
    ...
    class MyHTMLParser(HTMLParser):
        def __init__(self):
            ...
            self.tokenizer = WordTokenizer()
            self.stemmer = PorterStemmer()
            self.stopwords = dict(map(lambda x: (x, True), corpus.stopwords.words()))
        def handle_data(self, data):
            ...
            if self.inbody:
                tokens = self.tokenizer.tokenize(data)
                stems = map(self.stemmer.stem, tokens)
                for x in stems:
                    if len(x) < 4: continue
                    x = x.lower()
                    if x in self.stopwords: continue
                    self.doc.append(x)
...
parser = SD['parser']
parser.reset()
...

SLIDE 54

Parse the documents into word lists

Run the job (the extract_terms function from the previous two slides) and inspect the output:

shell$ gpmapreduce -f blog-terms.yml
mapreduce_75643_run_1 DONE

sql# SELECT id, title, doc FROM blog_terms LIMIT 5;
  id  | title         | doc
------+---------------+-----------------------------------------------------------------
 2482 | noodlepie     | {noodlepi,from,gutter,grub,gourmet,tabl,noodlepi,blog,scoff,...
    1 | Bhootakannadi | {bhootakannadi,2005,unifi,feed,gener,comment,final,integr,...
   10 | Tea Set       | {novelti,dish,goldilock,bear,bowl,lide,contain,august,...
 ...

SLIDE 55

Create histograms of word frequencies

Extract a term dictionary of terms that show up in more than ten blogs:

sql# SELECT term, sum(c) AS freq, count(*) AS num_blogs
     FROM ( SELECT id, term, count(*) AS c
            FROM ( SELECT id, unnest(doc) AS term FROM blog_terms ) term_unnest
            GROUP BY id, term ) doc_terms
     WHERE term IS NOT NULL
     GROUP BY term
     HAVING count(*) > 10;

  term   | freq | num_blogs
---------+------+-----------
 sturdi  |   19 |        13
 canon   |   97 |        40
 group   |   48 |        17
 skin    |  510 |       152
 linger  |   19 |        17
 blunt   |   20 |        17

SLIDE 56

Create histograms of word frequencies

Use the term frequencies to construct the term dictionary…

sql# SELECT array(SELECT term FROM blog_term_freq) dictionary;
 dictionary
------------------------------------------------------------------------
 {sturdi,canon,group,skin,linger,blunt,detect,giver,annoy,telephon,...

…then use the term dictionary to construct feature vectors for every document, mapping document terms to the features in the dictionary:

sql# SELECT id, gp_extract_feature_histogram(dictionary, doc)
     FROM blog_terms, blog_features;
  id  | term_count
------+----------------------------------------------------------------
 2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,2,0,4,0,1,0,1,0,1,...}
    1 | {41,1,34,1,22,1,125,1,387,...}:{0,9,0,1,0,1,0,1,0,3,0,2,...}
   10 | {3,1,4,1,30,1,18,1,13,1,4,...}:{0,2,0,6,0,12,0,3,0,1,0,1,...}
 ...

SLIDE 57

Create histograms of word frequencies

Note the use of sparse vectors to store the term counts as a histogram:

 id | term_count
----+----------------------------------------------------------------------
 ...
 10 | {3,1,4,1,30,1,18,1,13,1,4,...}:{0,2,0,6,0,12,0,3,0,1,0,1,...}
 ...

The same row expanded as a dense array, aligned against the dictionary:

 id | term_count
----+----------------------------------------------------------------------
 ...
 10 | {0,0,0,2,0,0,0,0,6,0,0,0,...}
 ...

 dictionary
------------------------------------------------------------------------
 {sturdi,canon,group,skin,linger,blunt,detect,giver,annoy,telephon,...
SLIDE 58

Transform the blog terms into statistically useful measures

Use the feature vectors to construct TFxIDF vectors (term frequency–inverse document frequency). These are a measure of the importance of terms:

sql# SELECT id, (term_count * logidf) tfxidf
     FROM blog_histogram,
          ( SELECT log(count(*)/count_vec(term_count)) logidf
            FROM blog_histogram ) blog_logidf;

  id  | tfxidf
------+-------------------------------------------------------------------
 2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.34311110...}
    1 | {41,1,34,1,22,1,125,1,387,...}:{0,0.771999985977529,0,1.999427...}
   10 | {3,1,4,1,30,1,18,1,13,1,4,...}:{0,2.95439664949608,0,3.2006935...}
 ...

SLIDE 59

Create document clusters around iteratively defined centroids

Now that we have TFxIDF vectors we have a statistically meaningful metric, which enables all sorts of real analytics. The current example is k-means clustering, which requires two operations. First, compute a distance metric between the documents and a random selection of centroids, for instance cosine similarity:

sql# SELECT id, tfxidf, cid,
            ACOS( (tfxidf %*% centroid) /
                  (svec_l2norm(tfxidf) * svec_l2norm(centroid)) ) AS distance
     FROM blog_tfxidf, blog_centroids;

  id  | tfxidf                                                             | cid | distance
------+--------------------------------------------------------------------+-----+------------
 2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...}  |   1 | 1.53672977
 2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...}  |   2 | 1.55720354
 2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...}  |   3 | 1.55040145

SLIDE 60

Create document clusters around iteratively defined centroids

Next, use an averaging metric to re-center the mean of each cluster:

sql# SELECT cid, sum(tfxidf)/count(*) AS centroid
     FROM ( SELECT id, tfxidf, cid,
                   row_number() OVER (PARTITION BY id ORDER BY distance, cid) rank
            FROM blog_distance ) blog_rank
     WHERE rank = 1
     GROUP BY cid;

 cid | centroid
-----+------------------------------------------------------------------------
   3 | {1,1,1,1,1,1,1,1,1,...}:{0.157556041103536,0.0635233900749665,0.050...}
   2 | {1,1,1,1,1,1,3,1,1,...}:{0.0671131209568817,0.332220028552986,0,0.0...}
   1 | {1,1,1,1,1,1,1,1,1,...}:{0.103874521481016,0.158213686890834,0.0540...}

Repeat the previous two operations until the centroids converge, and you have k-means clustering.
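A hedged sketch of one possible convergence test between successive centroid sets (the table names blog_centroids_prev and blog_centroids_next are hypothetical, not from the slides; svec_l2norm and element-wise svec subtraction come from the sparse-vector module used above):

sql# SELECT max(svec_l2norm(n.centroid - p.centroid)) AS max_shift
     FROM blog_centroids_next n
     JOIN blog_centroids_prev p USING (cid);
     -- stop iterating once max_shift falls below a small epsilon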

SLIDE 61

Summary

  • Accessing the data (MapReduce)
  • Parse the documents into word lists (MapReduce)
  • Create histograms of word frequencies (SQL)
  • Transform the blog terms into statistically useful measures (SQL)
  • Create document clusters around iteratively defined centroids (SQL window functions)

One sample row (id 2482) at each stage of the pipeline:

  id  | path                                  | body
------+---------------------------------------+------------------------------------
 2482 | /Users/demo/blogsplog/model/2482.html | <!DOCTYPE html PUBLIC "...

  id  | title     | doc
------+-----------+-----------------------------------------------------------------
 2482 | noodlepie | {noodlepi,from,gutter,grub,gourmet,tabl,noodlepi,blog,scoff,...

  id  | term_count
------+----------------------------------------------------------------
 2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,2,0,4,0,1,0,1,0,1,...}

  id  | tfxidf
------+-------------------------------------------------------------------
 2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.34311110...}

  id  | tfxidf                                                             | cid | distance
------+--------------------------------------------------------------------+-----+------------
 2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...}  |   1 | 1.53672977

SLIDE 62

[Image-only slide]

SLIDE 63

MAD Skills in practice

SLIDE 64

MAD Skills in practice

– Extracted data from the EDW and other source systems into a new analytic sandbox
– Generated a social graph from call detail records and subscriber data
– Within 2 weeks uncovered behavior where “connected” subscribers were seven times more likely to churn than the average user

SLIDE 65

MAD Skills in practice

– Wanted to identify unusual on-line campaigns as part of traffic-quality monitoring
– Collected data on clicks and impressions for campaigns
– Built cumulative distribution functions for clicks and volume
– Computed pairwise distances
– Identified outliers – e.g., too much revenue for the number of clicks – using simple SQL over very large datasets