Data Analytics A (Short) Tour Venkatesh-Prasad Ranganath - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Data Analytics

A (Short) Tour

Venkatesh-Prasad Ranganath http://about.me/rvprasad

slide-2
SLIDE 2

Is it Analytics or Analysis?

Analytics uses analysis to recommend actions or make decisions.
slide-3
SLIDE 3

Why Data Analysis?

  • Confirm a hypothesis: Confirmatory
  • Explore the data: Exploratory (EDA)

slide-4
SLIDE 4

Word of Caution – Case of Killer Potatoes?

This is figure 1.5 in the book “Exploring Data” by Ronald K Pearson.

slide-5
SLIDE 5

Word of Caution – Case of Killer Potatoes?

This is figure 1.6 in the book “Exploring Data” by Ronald K Pearson.

slide-6
SLIDE 6

Typical Data Analytics Work Flow

  • 1. Identify Issue
  • 2. Data Collection, Storage, Representation, and Access
  • 3. Data Cleansing
  • 4. Data Transformation
  • 5. Data Analysis (Processing)
  • 6. Result Validation
  • 7. Result Presentation (Visual Validation)
  • 8. Recommend Action / Make Decision
slide-7
SLIDE 7

Data Collection – Approaches

Observation Monitoring Interviews Surveys

slide-8
SLIDE 8

Data Collection – Comparing Approaches

                     Observation   Interviews     Surveys         Monitoring
Technique            Shadowing     Conversation   Questionnaire   Logging
Interactive          No            Yes            No              No
Simple               No            No             Yes             Yes
Automatable          No            No             Yes             Yes
Scalable             No            No             Yes             Yes
Data Size            Small         Small          Medium          Huge
Data Format          Flexible      Flexible       Rigid           Rigid
Data Type            Qualitative   Qualitative    Qualitative     Quantitative
Real Time Analysis   No            No             No              Yes
Expensive            Yes           Yes            No              No

slide-9
SLIDE 9

Data Collection – Comparing Approaches

                     Observation           Interviews            Surveys               Monitoring
What to capture?     Flexible              Flexible              Fixed                 Fixed
How to capture?      Flexible              Flexible              Fixed                 Fixed
Human Subjects       Yes                   Yes                   Yes                   No
Transcription        Yes                   Yes                   Yes/No                No
SnR                  High                  High                  High                  Low
Involves NLP         Unlikely              Unlikely              Likely                Likely
Kind of Analysis     Confirmatory          Confirmatory          Confirmatory          Exploratory
Kind of Techniques   Statistical Testing   Statistical Testing   Statistical Testing   Machine Learning

slide-10
SLIDE 10

Data Storage – Choices

  • Flat Files
  • Databases
  • Streaming Data (but there is no storage)
slide-11
SLIDE 11

Data Storage – Flat Files

  • Simple
  • Common / Universal
  • Inexpensive
  • Independent of specific technology
  • Compression friendly
  • Very few choices
  • Plain text, CSV, XML, and JSON
  • Well established
  • Low level data access APIs
  • No support for automatic scale out / parallel access

  • Unoptimized data access
  • Indices
  • Columnar storage
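
As a minimal sketch of the "low level data access APIs" and "unoptimized data access" points, the following reads a CSV flat file with Python's standard csv module. The column names and sample rows are made up for illustration. Note that the filter has to scan every row, because a flat file carries no index.

```python
import csv
import io

# Hypothetical CSV content standing in for a flat file on disk.
SAMPLE = "name,weight_kg\nasha,78\nravi,88\nmeena,62\n"

def rows_heavier_than(text, threshold):
    """Return names whose weight exceeds the threshold.

    Every row is parsed and checked; there is no index to skip ahead.
    """
    reader = csv.DictReader(io.StringIO(text))
    return [row["name"] for row in reader if float(row["weight_kg"]) > threshold]

print(rows_heavier_than(SAMPLE, 70))
```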
slide-12
SLIDE 12

Data Storage – Databases

  • High level data access API
  • Support for automatic scale out / parallel access

  • Optimized data access
  • Indices
  • Columnar storage
  • Well established
  • Complex
  • Niche / Requires experts
  • Optimization
  • Distribution
  • Expensive
  • Dependent on specific technology
  • DB controlled compression
  • Lots of choices
  • SQL, MySQL, PostgreSQL, Maria, Raven, Couch, Redis, Neo4j, …
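
To illustrate the "high level data access API" and "indices" points, here is a small sketch using SQLite through Python's standard sqlite3 module. SQLite is chosen only because it ships with Python; the table, column, and index names are hypothetical.

```python
import sqlite3

def build_db():
    """Create an in-memory SQLite database with a hypothetical table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (name TEXT, weight_kg REAL)")
    conn.executemany(
        "INSERT INTO people VALUES (?, ?)",
        [("asha", 78), ("ravi", 88), ("meena", 62)],
    )
    # An index gives the engine a way to avoid scanning every row.
    conn.execute("CREATE INDEX idx_weight ON people (weight_kg)")
    return conn

def names_heavier_than(conn, threshold):
    """Declarative query: we say what we want, not how to scan for it."""
    cur = conn.execute(
        "SELECT name FROM people WHERE weight_kg > ? ORDER BY name",
        (threshold,),
    )
    return [name for (name,) in cur]

conn = build_db()
print(names_heavier_than(conn, 70))
```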

slide-13
SLIDE 13

Data Storage – Streaming

  • Well, there is no storage
  • Novel
  • Many streaming data sources
  • Breaks traditional data analysis algorithms

  • No access to the entire data set
  • Too many unknowns
  • Expertise
  • Cost
  • Best practices
  • Accuracy
  • Benefits
  • Deficiencies
  • Ease of use
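
One way to cope with "no access to the entire data set" is an online algorithm that updates its answer one item at a time. The sketch below uses Welford's algorithm for a running mean and variance; it is an illustrative choice, not something the slides prescribe, and the sample readings are made up.

```python
class RunningStats:
    """Welford's online algorithm: one pass, constant memory."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the running statistics."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; this sketch returns 0.0 for fewer than 2 items."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0

stats = RunningStats()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # hypothetical stream
    stats.update(reading)
print(stats.mean, stats.variance)
```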
slide-14
SLIDE 14

Data Storage – Algorithms and Necessity

  • Offline
  • Online
  • Streaming
  • Real-time
  • Flat Files
  • Databases
  • Streaming Data
  • Do we need fast?
  • How fast is quick enough?
  • How often do we need fast?
  • Is it worth the cost?
  • Is it worth the loss of accuracy?
slide-15
SLIDE 15

Data Representation – Structured

  • Easy to process
  • One time schema setup cost
  • Common schema types
  • CSV, XML, JSON, …
  • You can cook up your schema
  • Eases data exploration & analysis
  • Off-the-shelf techniques to handle data

  • Requires very little expertise
  • Ideal with automatic data collection
  • Ideal for storing quantitative data
  • Rigid
  • Changing schema can be hard
  • Upfront cost to define the schema
slide-16
SLIDE 16

Data Representation – Unstructured

  • Flexible
  • Off-the-shelf techniques to preprocess data but requires expertise

  • Ideal for manual data collection
  • Requires lots of preprocessing
  • Complicates data exploration and analysis

  • Requires domain expertise
  • Extracting data semantics is hard
  • Requires schema recovery *
slide-17
SLIDE 17

Data Access – Security

  • Who has access to what parts of the data?
  • What is the access control policy?
  • How do we enforce these policies?
  • What techniques do we employ to enforce these policies?
  • How do we ensure the policies have been enforced?
slide-18
SLIDE 18

Data Access – Privacy

  • Who has access to what parts of the data?
  • Who has access to what aspects of the data?
  • How do you ensure the privacy of the source?
  • What are the access control and anonymization policies?
  • How do we enforce these policies?
  • What techniques do we employ to enforce these policies?
  • How do we ensure the policies have been enforced?
  • How strong is the anonymization policy?
  • Is it possible to recover the anonymized information? If so, how hard?
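
The last two questions can be made concrete with a small sketch. It assumes, purely for illustration, that identifiers are "anonymized" by hashing, and shows that an unsalted hash can be reversed by guessing, while a keyed hash (HMAC) cannot be matched without the key. The names and the key are hypothetical.

```python
import hashlib
import hmac

def naive_pseudonym(value):
    """Unsalted hash: anyone can recompute it for guessed values."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def keyed_pseudonym(value, secret_key):
    """Keyed hash (HMAC): recomputing it requires the secret key."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()

def dictionary_attack(pseudonym, candidates):
    """Try to recover the original value by hashing each guess."""
    for guess in candidates:
        if naive_pseudonym(guess) == pseudonym:
            return guess
    return None

guesses = ["alice", "bob", "carol"]  # hypothetical candidate names
leaked = naive_pseudonym("bob")
protected = keyed_pseudonym("bob", b"hypothetical-secret")
print(dictionary_attack(leaked, guesses))     # the naive scheme is reversible
print(dictionary_attack(protected, guesses))  # guesses no longer match
```

Keyed hashing only raises the bar; whether it is strong enough depends on how the key is protected and on what else in the data could identify the source.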
slide-19
SLIDE 19

Data Scale

  • Nominal
  • Male, Female
  • Equality operation
  • Ordinal
  • Very satisfied, satisfied, dissatisfied, and very dissatisfied
  • Inequalities operations
  • Interval
  • Temperature, dates
  • Addition and subtraction operations
  • Ratio
  • Mass, length, duration
  • Multiplication and division operations
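
The four scales and their permitted operations can be sketched as follows. The function names and sample values are illustrative assumptions; the point is which summary each scale allows.

```python
import statistics

LEVELS = ["very dissatisfied", "dissatisfied", "satisfied", "very satisfied"]

def nominal_summary(values):
    """Nominal: only equality is meaningful, so report the most common value."""
    return statistics.mode(values)

def ordinal_median(values, levels=LEVELS):
    """Ordinal: order is meaningful, so a median by rank is allowed."""
    ranks = sorted(levels.index(v) for v in values)
    return levels[ranks[len(ranks) // 2]]  # upper median for even counts

def interval_difference(a, b):
    """Interval: differences are meaningful; ratios are not (no true zero)."""
    return b - a

def ratio_of(a, b):
    """Ratio: a true zero exists, so 'twice as much' is meaningful."""
    return a / b

print(nominal_summary(["Male", "Female", "Female"]))
print(ordinal_median(["satisfied", "very satisfied", "dissatisfied"]))
print(interval_difference(10.0, 20.0))  # 10 degrees warmer, not "twice as hot"
print(ratio_of(4.0, 2.0))               # twice the mass
```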
slide-21
SLIDE 21

Data Cleansing

Let’s get our hands dirty!!

The data set is from UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/).

slide-22
SLIDE 22

Data Cleansing – Common Issues

  • Missing values
  • Extra values
  • Incorrect format
  • Encoding
  • File corruption
  • Incorrect units
  • Too much data
  • Outliers
  • Inliers
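
A few of these issues (missing values, incorrect format, incorrect units, outliers) can be handled in one pass, as in this minimal sketch. The "lb"/"kg" suffix convention and the plausible range of 20 to 300 kg are assumptions made for the example, not rules from the slides.

```python
def clean_weights(raw_values, low=20.0, high=300.0):
    """Return weights in kg, dropping missing, malformed, and implausible entries.

    Assumptions for illustration: values suffixed with "lb" are pounds,
    and anything outside [low, high] kg is treated as an outlier.
    """
    cleaned = []
    for raw in raw_values:
        text = (raw or "").strip().lower()
        if not text:                      # missing value
            continue
        factor = 1.0
        if text.endswith("lb"):           # incorrect units: convert to kg
            text, factor = text[:-2], 0.45359237
        elif text.endswith("kg"):
            text = text[:-2]
        try:
            value = float(text) * factor
        except ValueError:                # incorrect format
            continue
        if low <= value <= high:          # crude range-based outlier filter
            cleaned.append(round(value, 1))
    return cleaned

print(clean_weights(["78", " 88kg", "", None, "n/a", "150lb", "7800"]))
```

Dropping records is only one policy; depending on the analysis, imputing missing values or flagging outliers for review may be more appropriate.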
slide-24
SLIDE 24

Data Transformation (Feature Engineering)

  • Analyze specific aspects of the data
  • Coarsening data
  • Discretization
  • Changing Scale
  • Normalization
slide-25
SLIDE 25

Data Transformation (Feature Engineering)

  • Analyze specific aspects of the data
  • Coarsening data
  • Discretization
  • Changing Scale
  • Normalization

BMI           BMI Category
< 18.5        Underweight
18.5 – 24.9   Normal Weight
25 – 29.9     Overweight
≥ 30          Obesity
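
The BMI table above is an example of discretization: a continuous value is coarsened into a handful of categories. A minimal sketch follows; it uses half-open intervals to close the gaps between the printed ranges and treats 30 and above as Obesity, both of which are assumptions at the boundaries.

```python
def bmi_category(bmi):
    """Map a continuous BMI value to one of four categories."""
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25:
        return "Normal Weight"
    if bmi < 30:
        return "Overweight"
    return "Obesity"

print([bmi_category(b) for b in (17.0, 22.4, 27.9, 31.5)])
```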

slide-26
SLIDE 26

Data Transformation (Feature Engineering)

  • Analyze specific aspects of the data
  • Coarsening data
  • Discretization
  • Changing Scale
  • Normalization

Actual Weight   Normalized
78              0.285
88              0.322
62              0.227
45              0.164
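
The normalized values in the table appear to be each weight divided by the sum of all four weights (273), truncated to three decimals. This is an inference from the numbers, not something the slide states; other normalizations (min-max, z-score) are equally common. A sketch under that assumption:

```python
import math

def normalize_by_sum(values):
    """Scale values so that they add up to 1."""
    total = sum(values)
    return [v / total for v in values]

def truncate(x, digits=3):
    """Cut off (not round) after the given number of decimals."""
    scale = 10 ** digits
    return math.floor(x * scale) / scale

weights = [78, 88, 62, 45]
print([truncate(v) for v in normalize_by_sum(weights)])
```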

slide-27
SLIDE 27

Data Transformation (Feature Engineering)

  • Analyze relations between features of the data
  • Synthesize new features
  • Relating existing features
  • Combining existing features
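
As a small illustration of combining existing features into a new one, the sketch below derives BMI from weight and height. The record layout and field names are hypothetical.

```python
def add_bmi(records):
    """Return copies of the records with a synthesized 'bmi' feature.

    BMI is weight in kilograms divided by height in metres squared.
    """
    return [
        {**r, "bmi": round(r["weight_kg"] / r["height_m"] ** 2, 1)}
        for r in records
    ]

people = [{"weight_kg": 78, "height_m": 1.75}]  # hypothetical record
print(add_bmi(people))
```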
slide-28
SLIDE 28

Data Transformation

Let’s get our hands dirty!!

slide-29
SLIDE 29

Data Transformation (Feature Engineering)

Keep in mind the following:

  • Scales
  • What are the permitted operations?
  • Data Collection
  • What are the trade-offs in data collection?
  • Parsimony
  • Can we get away with simple scales?
slide-31
SLIDE 31

Data Analysis

  • Features
  • Attributes of each datum
  • Labels
  • Expert’s input about datum
  • Data sets
  • Training
  • Validation
  • Test
  • Work flow
  • Model building (training)
  • Model tuning and selection (validation)
  • Error reporting (test)
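
The three data sets can be produced by shuffling the data once and cutting it into pieces. This is a minimal sketch; the 60/20/20 proportions and the fixed seed are assumptions for the example, not recommendations from the slides.

```python
import random

def split_dataset(items, train=0.6, validation=0.2, seed=0):
    """Shuffle and split items into training, validation, and test subsets.

    Whatever is left after the training and validation cuts becomes the test set.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for repeatability
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (
        shuffled[:n_train],                 # model building
        shuffled[n_train:n_train + n_val],  # model tuning and selection
        shuffled[n_train + n_val:],         # error reporting
    )

train_set, validation_set, test_set = split_dataset(range(10))
print(len(train_set), len(validation_set), len(test_set))
```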
slide-32
SLIDE 32

Data Analysis – Models

The figure is from the book “Modern Multivariate Statistical Techniques” by Alan Julian Izenman.

slide-34
SLIDE 34

Result Validation – Approaches

  • Expert Inputs
  • Cross Validation
  • K-fold cross validation
  • 5x2 cross validation
  • Bootstrapping
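
K-fold cross validation splits the data into k parts and uses each part once as the test set while training on the rest. The sketch below only generates the index splits; it does not shuffle, which a real run would usually do first. The function name is illustrative.

```python
def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) for each of k folds.

    Fold sizes differ by at most one when n_items is not divisible by k.
    """
    indices = list(range(n_items))
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

for train, test in k_fold_indices(10, 3):
    print(test)
```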
slide-35
SLIDE 35

Result Validation – Basic Terms

Consider a 2-class classification problem.

             Classified as X   Classified as Y
Actual X     True X (tx)       False Y (fy)      p = tx + fy
Actual Y     False X (fx)      True Y (ty)       n = fx + ty
             p’ = tx + fx                        N = p + n

slide-36
SLIDE 36

Result Validation – Basic Terms

Now, consider X as positive evidence and Y as negative evidence.

             Classified as X       Classified as Y
Actual X     True Positive (tp)    False Negative (fn)   p = tp + fn
Actual Y     False Positive (fp)   True Negative (tn)    n = fp + tn
             p’ = tp + fp                                N = p + n

slide-37
SLIDE 37

Result Validation – Measures

error       = (fp + fn) / N
accuracy    = (tp + tn) / N
tp-rate     = tp / p
fp-rate     = fp / n
sensitivity = tp / p = tp-rate
specificity = tn / n = 1 – fp-rate
precision   = tp / p’
recall      = tp / p = tp-rate

             Classified as X       Classified as Y
Actual X     True Positive (tp)    False Negative (fn)   p = tp + fn
Actual Y     False Positive (fp)   True Negative (tn)    n = fp + tn
             p’ = tp + fp                                N = p + n
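
The measures above translate directly into code. In this sketch the four counts are hypothetical, and the function assumes p, n, and p’ are all non-zero; a robust version would guard against division by zero.

```python
def classification_measures(tp, fn, fp, tn):
    """Compute the measures defined above from the four confusion-matrix counts."""
    p = tp + fn        # actual positives
    n = fp + tn        # actual negatives
    p_pred = tp + fp   # predicted positives (p' on the slide)
    total = p + n      # N on the slide
    return {
        "error": (fp + fn) / total,
        "accuracy": (tp + tn) / total,
        "tp_rate": tp / p,       # also sensitivity and recall
        "fp_rate": fp / n,
        "specificity": tn / n,   # equals 1 - fp_rate
        "precision": tp / p_pred,
    }

measures = classification_measures(tp=40, fn=10, fp=5, tn=45)  # hypothetical counts
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```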

slide-38
SLIDE 38

Result Validation – ROC (Receiver Operating Characteristics)

The figure is from the Wikipedia page about “Receiver Operating Characteristics”.

slide-39
SLIDE 39

Result Validation – Class Confusion Matrix

2 class problem

      A              B
A     True A (ta)    False B (fb)
B     False A (fa)   True B (tb)

4 class problem

      A              B              C              D
A     True A (ta)    False B (fb)   False C (fc)   False D (fd)
B     False A (fa)   True B (tb)    False C (fc)   False D (fd)
C     False A (fa)   False B (fb)   True C (tc)    False D (fd)
D     False A (fa)   False B (fb)   False C (fc)   True D (td)
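
A class confusion matrix for any number of classes can be built by counting (actual, predicted) pairs. The sketch assumes rows are actual classes and columns are predicted classes, matching the earlier 2-class table; the sample labels are made up.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Return a matrix where cell [i][j] counts items of actual class
    labels[i] that were classified as labels[j]."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

actual = list("AABBCD")      # hypothetical ground truth
predicted = list("ABBBCA")   # hypothetical classifier output
matrix = confusion_matrix(actual, predicted, "ABCD")
for label, row in zip("ABCD", matrix):
    print(label, row)
```

The diagonal holds the correct classifications; every off-diagonal cell is one kind of confusion.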

slide-40
SLIDE 40

Result Validation – Bias and Variance

The figure is from http://scott.fortmann-roe.com/docs/BiasVariance.html.

slide-41
SLIDE 41

Result Validation – Underfitting & Overfitting

The figure is from the book “Modern Multivariate Statistical Techniques” by Alan Julian Izenman.

slide-42
SLIDE 42

Result Validation

Let’s get our hands dirty!!

slide-44
SLIDE 44

Result Presentation (Visual Validation)

  • Numbers
  • Central tendencies – mode, median, and mean
  • Dispersion – range, standard deviation
  • Five number summary
  • min, 1st quartile, median (mean), 3rd quartile, max
  • Margin of error
  • (Confidence) Interval
  • Tables
  • Charts
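
The five number summary can be computed with Python's standard statistics module, as in this sketch. Quartile conventions differ between tools; the "inclusive" method used here is one of them, chosen as an assumption for the example.

```python
import statistics

def five_number_summary(values):
    """Return (min, 1st quartile, median, 3rd quartile, max)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

print(five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9]))
```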
slide-45
SLIDE 45

Result Presentation – Box-and-Whisker Plots

The figure is from the Wikipedia page about “Box Plot”.

slide-46
SLIDE 46

Result Presentation – Histograms

The figure is from the book “Data Visualization: A Successful Design Process” by Andy Kirk.

slide-47
SLIDE 47

Result Presentation – Line Graphs

The figure is from the book “Data Visualization: A Successful Design Process” by Andy Kirk.

slide-48
SLIDE 48

Result Presentation – Scatter Plots

The figure is from the book “Data Visualization: A Successful Design Process” by Andy Kirk.

slide-49
SLIDE 49

Result Presentation – Heatmaps

The figure is from the book “Data Visualization: A Successful Design Process” by Andy Kirk.

slide-50
SLIDE 50

Result Presentation – Bubble Chart

The figure is from the book “Data Visualization: A Successful Design Process” by Andy Kirk.

slide-51
SLIDE 51

Result Presentation – Word Cloud

The figure is from the book “Data Visualization: A Successful Design Process” by Andy Kirk.

slide-53
SLIDE 53

Now, are you thinking...

  • What about identifying the issue/question?
  • Know where you are going before you start
  • What about recommending action / making the decision?
  • Information and knowledge aren’t the same
  • Are data X tasks really that important and hard?
  • Garbage in, Garbage out
  • Aren’t data analysis techniques the most important?
  • Smart data (structures) and dumb code works better than the other way around

slide-54
SLIDE 54

Some Personal Observations

  • Domain Knowledge is crucial
  • Optimizing analysis
  • Improving relevance of results
  • Always prefer
  • Simple solutions over complex solutions
  • Fast solutions over slow solutions
  • Most correct solutions over fully correct solutions (even no solution!!)
  • If you hear “I want it now”, then say “Really? Please Explain”
  • Visualization helps only if the results are relevant
  • Limited data sets can only get you so far
  • Security and privacy are more important than you think
slide-55
SLIDE 55

Got any stories from the trenches?