Data Science 101 Arik Pelkey Pentaho Senior Director Product - - PowerPoint PPT Presentation


slide-1
SLIDE 1

Data Science 101

Arik Pelkey Pentaho Senior Director – Product Marketing, Hitachi Vantara Scott Cooley Pentaho Data Scientist, Hitachi Vantara

slide-2
SLIDE 2

Agenda

This session will provide an introduction to data science fundamentals.

  • What is Data Science?
  • Common Use Cases and Algorithms
  • The Data Science Process
  • Building a Data Science Team
  • The Future
slide-3
SLIDE 3

AI, Machine Learning, and Deep Learning

Image from https://blogs.nvidia.com/blog/2016/07/29/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/.

  • AI: Getting machines to do what humans are good at
  • Machine Learning: Feeding an algorithm data to learn and predict something
  • Deep Learning: A type of machine learning

slide-4
SLIDE 4

Data Science: Solving Problems with Data

Diagram from Drew Conway: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram.

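  • HACKING SKILLS: Computer science, data engineering and wrangling, coding
  • MATH AND STATISTICS KNOWLEDGE: Algorithms and numerical techniques to derive insights
  • SUBSTANTIVE EXPERIENCE: Domain knowledge, business acumen, experience, value to the business
  • Understanding of the underlying assumptions
  • Overlaps: Machine Learning (hacking skills + math and statistics), Traditional Research (math and statistics + substantive experience), Danger Zone! (hacking skills + substantive experience)
  • Where all three overlap: DATA SCIENCE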

slide-5
SLIDE 5

What’s all the fuss?

This stuff was created many, many years ago:

  • Bayes' Theorem: Thomas Bayes, mid 1700s
  • Regression: Legendre, Gauss and Galton, early 1800s
  • Neural Networks: McCulloch and Pitts, early 1940s
slide-6
SLIDE 6

Think about All Our Data and Compute

https://www.computerworld.com.au/article/392735/ska_telescope_generate_more_data_than_entire_internet_2020/.

SKA - 2020 (Square Kilometer Array Telescope) Will generate as much data in a day as the entire PLANET does in a year!

It is still GROWING!

slide-7
SLIDE 7

Types of Machine Learning

  • Regression – Looking for a statistical relationship across variables that may give us an estimate of a particular outcome.
  • Classification – Similar to regression, but looking for separations in the data given predefined classes. (Supervised)
  • Clustering – No predefined classes; trying to find groups or sets based upon the data at hand. (Unsupervised)
  • Anomaly Detection – Identification of outliers based upon expected ranges of data.

[Figure: scatter plots of ✕, △, ◇ and ? markers illustrating the four types]
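As a rough illustration of the last type above, anomaly detection against an expected range, here is a minimal sketch. It assumes the "expected range" is the mean plus or minus three standard deviations of some known-good baseline; that rule, the function names, and the sensor readings are all hypothetical and not from the presentation.

```python
from statistics import mean, stdev


def expected_range(baseline, k=3.0):
    """Return (low, high) as mean +/- k sample standard deviations of the baseline."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    return mu - k * sigma, mu + k * sigma


def find_anomalies(readings, low, high):
    """Return the readings that fall outside the expected range."""
    return [x for x in readings if x < low or x > high]


# Hypothetical known-good sensor readings used to establish the expected range.
baseline = [70.1, 69.8, 70.4, 70.0, 69.9, 70.2, 70.3, 69.7]
low, high = expected_range(baseline)

# Hypothetical new readings to screen for outliers.
new_readings = [70.0, 70.5, 85.2, 69.9, 55.0]
anomalies = find_anomalies(new_readings, low, high)
print(anomalies)
```

A fixed three-sigma band is only one possible choice; real data with trends or seasonality would likely need a different notion of "expected".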

slide-8
SLIDE 8

Labelled vs Unlabelled

Let's say we want to classify houses by size.

Given features (feature set) and label:

FullBath  HalfBath  Bedrooms  Home Age  Size (label)
1                   2         56        M
1         1         3         59        L
2         1         3         20        M
2         1         3         19        S

Supervised Learning: Use the labels to build a model. The model is then used to classify new house size based ONLY on the known feature set.

Unsupervised: SIZE is missing! We need to look for similarities in the data and group them into clusters.
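The contrast between labelled and unlabelled data can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the house records below are hypothetical (not the slide's table), a 1-nearest-neighbour rule stands in for "a model", and a tiny k-means with a fixed starting point stands in for "grouping into clusters". The function names are made up for this sketch.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify_1nn(labelled, features):
    """Supervised: return the label of the closest labelled example."""
    _, label = min(labelled, key=lambda row: sq_dist(row[0], features))
    return label


def kmeans(points, k, iterations=10):
    """Unsupervised: assign each point to one of k clusters (no labels used)."""
    centroids = list(points[:k])  # simple deterministic start: the first k points
    assignments = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda i: sq_dist(p, centroids[i])) for p in points
        ]
        # Move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                centroids[i] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignments


# Hypothetical houses: (FullBath, HalfBath, Bedrooms) with a Size label.
labelled_houses = [
    ((1, 0, 2), "S"),
    ((1, 1, 3), "M"),
    ((2, 1, 3), "M"),
    ((3, 1, 5), "L"),
    ((3, 2, 5), "L"),
    ((1, 0, 1), "S"),
]

# Supervised: the labels are used to classify a new, unlabelled house.
new_house = (2, 1, 4)
predicted_size = classify_1nn(labelled_houses, new_house)

# Unsupervised: drop the labels and look only for similar groups.
unlabelled = [features for features, _ in labelled_houses]
cluster_ids = kmeans(unlabelled, k=2)

print(predicted_size, cluster_ids)
```

Note that the cluster ids are arbitrary group numbers; unlike the supervised result, nothing in the unsupervised output says which group is "small" or "large".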

slide-9
SLIDE 9

More on Machine Learning

Machine Learning is a methodology to create a model based on sample data and use the model to make a prediction or strategy using a more algorithmic approach.

SUPERVISED LEARNING MODEL

  • Historical records that contain square feet, number of bathrooms, zip code…
  • Records that contain the price the house sold for
  • Iterate the algorithm over the combined data to train the model
  • Use the trained model to predict the outcome on new records
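The supervised learning steps above (historical features plus sale prices, train, then predict on new records) might look like the following minimal sketch. It assumes a single feature (square feet) and fits a straight line with closed-form least squares rather than an iterative training loop; the records and function names are hypothetical.

```python
def train_linear_model(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the trained model to a new record."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical historical records: square feet and the price each house sold for.
square_feet = [1000, 1500, 2000, 2500]
sold_price = [200_000, 290_000, 410_000, 500_000]

# Train on the combined data, then predict the outcome for a new record.
model = train_linear_model(square_feet, sold_price)
estimate = predict(model, 1800)
print(round(estimate))
```

A realistic model would use more of the features listed above (bathrooms, zip code) and would be evaluated on records held out from training.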

slide-10
SLIDE 10

The Data Science Process: Getting from Raw Data to Outcomes

The Data Science Workflow: created by Joe Blitzstein and Hanspeter Pfister for the Harvard data science course.

Formal Framework: CRISP–DM (Cross Industry Standard Process for Data Mining)

slide-11
SLIDE 11

Specialist Traditional Data Science Team

Data Scientist (DS)

– Prepares data, engineers features, most valuable skill: training models.

Data Engineer (DE)

– Data acquisition focus. Build data pipelines. Not uncommon to have 5:1 ratio DE:DS

Data Analyst (DA)

– Assist DS with data prep

Application Architect (AA)

– Design complete solution; deploy and maintain models in production

slide-12
SLIDE 12

Mythical Creatures

slide-13
SLIDE 13

Trends

  • Automation
  • Tools for Citizen Data Scientists
  • Pre-trained models in the cloud

slide-14
SLIDE 14

Hiring Guidance

slide-15
SLIDE 15

Defining Success

  • Easy for the tangible
      – Search order optimization
      – Recommendation engine or CTR
  • Hard for others
      – Lead scoring
      – Attrition

  • Try to measure direct outcomes
  • Rarely a silver bullet
  • Think ROI
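To make "measure direct outcomes" and "think ROI" concrete, here is a small worked calculation for the CTR case. Every number in it (impressions, click-through rates, value per click, project cost) is hypothetical, and ROI is assumed to mean (gain - cost) / cost.

```python
def extra_clicks(impressions, baseline_ctr, model_ctr):
    """Additional clicks attributable to the model's higher click-through rate."""
    return impressions * (model_ctr - baseline_ctr)


def roi(gain, cost):
    """Return on investment as a fraction: (gain - cost) / cost."""
    return (gain - cost) / cost


# Hypothetical figures for a recommendation engine measured by CTR.
impressions = 1_000_000
baseline_ctr = 0.020      # CTR before the model
model_ctr = 0.025         # CTR with the model
value_per_click = 0.40    # assumed revenue per click
project_cost = 1500.0     # assumed cost of building and running the model

gain = extra_clicks(impressions, baseline_ctr, model_ctr) * value_per_click
project_roi = roi(gain, project_cost)
print(f"gain={gain:.2f} roi={project_roi:.1%}")
```

The same arithmetic is much harder to fill in for lead scoring or attrition, where the outcome is delayed and harder to attribute to the model.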

slide-16
SLIDE 16

Typical Data Science Project

  • Understand business objectives: DS
  • ID and procure training data: AA, DE, DS
  • Prepare data and build new features: DA, DS
  • Train model: DS
  • Deploy models: AA, DS
  • Update models: AA

slide-17
SLIDE 17

Preventive Maintenance: Caterpillar

slide-18
SLIDE 18

Marine Asset Intelligence

  • Business User (COO): Reporting on Operations and Efficiency; Dashboards and Reports on Machine Performance (Onboard and Onshore)
  • Data Scientist: Data Mining and Predictive Maintenance
  • Data Marts
  • Data Integration
  • Data sources: Local Equipment Sensor and Server Data; Fleet Data via Satellite; Cross Department Operations Data; Scheduling/ERP

slide-19
SLIDE 19

The Future

  • Scaling up / enabling more data scientists
  • Model management
  • Improved productivity
  • Support for containerized applications.

slide-20
SLIDE 20

Pentaho ML Orchestration

  • Makes data science teams more productive
  • Broad support for open source libraries in various languages

slide-21
SLIDE 21

Summary

  • What is Data Science?
  • Common Use Cases and Algorithms
  • The Data Science Process
  • Building a Data Science Team
  • The Future
slide-22
SLIDE 22

Next Steps

Want to learn more?

  • Schedule a Meet the Expert
  • Read Mark Hall’s Machine Learning with Pentaho Blog
slide-23
SLIDE 23