Data Science 101
Arik Pelkey, Pentaho Senior Director Product Marketing, Hitachi Vantara
Scott Cooley, Pentaho Data Scientist, Hitachi Vantara
Agenda
This session will provide an introduction to data science fundamentals.
- What is Data Science?
- Common Use Cases and Algorithms
- The Data Science Process
- Building a Data Science Team
- The Future
AI, Machine Learning, and Deep Learning
Image from https://blogs.nvidia.com/blog/2016/07/29/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/.
- AI: Getting machines to do what humans are good at
- Machine Learning: Feeding an algorithm data to learn and predict something
- Deep Learning: A type of machine learning
Data Science: Solving Problems with Data
Diagram from Drew Conway: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram.
- HACKING SKILLS: Computer science, data engineering and wrangling, coding
- MATH AND STATISTICS KNOWLEDGE: Understanding of the underlying assumptions; algorithms and numerical techniques to derive insights
- SUBSTANTIVE EXPERIENCE: Domain knowledge, business acumen, experience, value to the business
- Overlaps: Machine Learning (hacking skills + math and statistics), Traditional Research (math and statistics + substantive experience), Danger Zone! (hacking skills + substantive experience), DATA SCIENCE (all three)
What’s all the fuss?
The underlying techniques were created many years ago:
- Bayes' Theorem: Thomas Bayes, mid 1700s
- Regression: Legendre and Gauss, early 1800s; Galton, late 1800s
- Neural Networks: McCulloch and Pitts, early 1940s
Think about All Our Data and Compute
https://www.computerworld.com.au/article/392735/ska_telescope_generate_more_data_than_entire_internet_2020/.
SKA (Square Kilometre Array telescope), 2020: will generate as much data in a day as the entire PLANET does in a year!
It is still GROWING!
Regression – Looking for a statistical relationship across variables that may give us an estimate of a particular outcome.
Classification – Similar to regression, but looking for separations in the data given predefined classes. (Supervised)
Clustering – No predefined classes; trying to find groups or sets based upon the data at hand. (Unsupervised)
Anomaly Detection – Identification of outliers based upon expected ranges of data.
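As one illustration of the anomaly detection idea above, here is a minimal sketch that derives an "expected range" from known-good readings and flags new values outside it. The readings, the function names, and the choice of mean ± 3 standard deviations are assumptions made for this example, not something the session prescribes.

```python
from statistics import mean, stdev

def expected_range(baseline, k=3.0):
    """Return (low, high) as mean +/- k sample standard deviations."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    return mu - k * sigma, mu + k * sigma

def find_anomalies(values, low, high):
    """Return the values that fall outside the expected range."""
    return [v for v in values if not (low <= v <= high)]

# Hypothetical sensor readings observed during normal operation.
baseline = [70.1, 69.8, 70.4, 70.0, 69.9, 70.2, 70.3]
low, high = expected_range(baseline)

# Hypothetical new readings to check against the expected range.
new_readings = [70.0, 95.0, 69.7, 40.2]
anomalies = find_anomalies(new_readings, low, high)
print(anomalies)
```

A fixed multiple of the standard deviation is only one way to define the range; percentiles or domain limits would fit the same pattern.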
Types of Machine Learning
Labelled vs Unlabelled
Let's say we want to classify houses by size.
Unsupervised Learning: SIZE is missing! We need to look for similarities in the data and group them into clusters.
Given: features (the feature set) and a label
FullBath  HalfBath  Bedrooms  Home Age  Size
1                   2         56        M
1         1         3         59        L
2         1         3         20        M
2         1         3         19        S
Supervised Learning: Use the labels to build a model. The model is then used to classify a new house's size based ONLY on the known feature set.
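A minimal sketch of the two approaches, assuming hypothetical house records (not the rows from the slide). The supervised part uses a nearest-neighbour rule and the unsupervised part uses a small k-means loop; both are stand-ins chosen for brevity, and all names here are made up for the example.

```python
import math

# Hypothetical houses: (full_baths, half_baths, bedrooms, home_age_years)
features = [
    (1, 0, 2, 56),
    (1, 1, 3, 59),
    (2, 1, 3, 20),
    (2, 1, 4, 19),
]
labels = ["S", "M", "M", "L"]  # hypothetical size labels

def classify(new_house, features, labels):
    """Supervised: return the label of the closest labelled house."""
    nearest = min(range(len(features)),
                  key=lambda i: math.dist(new_house, features[i]))
    return labels[nearest]

def k_means(points, k=2, iterations=10):
    """Unsupervised: group points into k clusters without using labels."""
    centroids = [tuple(p) for p in points[:k]]  # simple deterministic start
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return assignments

predicted_size = classify((2, 1, 3, 22), features, labels)
clusters = k_means(features)
print(predicted_size, clusters)
```

The features are left unscaled, so home age dominates the distance; in practice features are usually normalized first.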
More on Machine Learning
Machine learning is a methodology for creating a model from sample data and then using that model to make a prediction or choose a strategy algorithmically.
SUPERVISED LEARNING MODEL
- Historical records that contain square feet, number of bathrooms, zip code…
- Records that contain the price the house sold for
- Iterate the algorithm over the combined data to train the model
- Use the trained model to predict the outcome on new records
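The train-then-predict flow can be sketched as follows, assuming a single feature (square feet) and made-up sale prices. Gradient descent on a straight-line model is used here only to make the "iterate over the data" step visible; the records, units, and hyperparameters are hypothetical.

```python
# Hypothetical historical records:
# (square feet in thousands, sale price in units of $100k)
history = [(1.0, 2.5), (1.5, 3.5), (2.0, 4.5), (2.5, 5.5)]

def train(records, learning_rate=0.1, epochs=5000):
    """Fit price ~ weight * sqft + bias by gradient descent on squared error."""
    weight, bias = 0.0, 0.0
    n = len(records)
    for _ in range(epochs):
        # Gradients of the mean squared error over all records.
        grad_w = sum(2 * (weight * x + bias - y) * x for x, y in records) / n
        grad_b = sum(2 * (weight * x + bias - y) for x, y in records) / n
        weight -= learning_rate * grad_w
        bias -= learning_rate * grad_b
    return weight, bias

def predict(model, sqft):
    """Apply the trained model to a new record that has no sale price yet."""
    weight, bias = model
    return weight * sqft + bias

model = train(history)
estimate = predict(model, 1.8)
print(model, estimate)
```

A real model would use more features (bathrooms, zip code) and hold back some records to check how well the predictions generalize.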
The Data Science Process: Getting from Raw Data to Outcomes
Created by Joe Blitzstein and Hanspeter Pfister for the Harvard Data Science course.
Formal Framework: CRISP-DM
Cross-Industry Standard Process for Data Mining
The Data Science Workflow
Specialist Traditional Data Science Team
Data Scientist (DS)
– Prepares data and engineers features; most valuable skill: training models
Data Engineer (DE)
– Focuses on data acquisition; builds data pipelines. A 5:1 DE:DS ratio is not uncommon
Data Analyst (DA)
– Assists the DS with data prep
Application Architect (AA)
– Designs the complete solution; deploys and maintains models in production
Mythical Creatures
Trends
- Automation
- Tools for Citizen Data Scientists
- Pre-trained models in the cloud
Hiring Guidance
Defining Success
- Easy for the tangible
– Search order optimization
– Recommendation engine or CTR
- Hard for others
– Lead scoring
– Attrition
- Try to measure direct outcomes
- Rarely a silver bullet
- Think ROI
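For the tangible cases, the arithmetic behind "measure direct outcomes" and "think ROI" can be sketched as below. All figures are invented for illustration, and the helper names are hypothetical.

```python
def click_through_rate(clicks, impressions):
    """CTR = clicks / impressions."""
    return clicks / impressions

def roi(gain, cost):
    """Return on investment as a fraction: (gain - cost) / cost."""
    return (gain - cost) / cost

# Hypothetical A/B comparison: existing ranking vs. model-driven ranking.
baseline_ctr = click_through_rate(200, 10_000)
model_ctr = click_through_rate(260, 10_000)
lift = (model_ctr - baseline_ctr) / baseline_ctr  # relative CTR improvement

# Hypothetical project economics: value attributed to the model vs. its cost.
project_roi = roi(gain=150_000, cost=100_000)
print(f"CTR lift: {lift:.0%}, ROI: {project_roi:.0%}")
```

The hard part in practice is attributing the gain to the model, which is why lead scoring and attrition are listed as harder to measure.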
Typical Data Science Project
- Understand business objectives (DS)
- Identify and procure training data (AA, DE, DS)
- Prepare data and build new features (DA, DS)
- Train model (DS)
- Deploy models (AA, DS)
- Update models (AA)
Preventive Maintenance: Caterpillar
Marine Asset Intelligence
- Business User (COO): Reporting on operations and efficiency; dashboards and reports on machine performance (onboard and onshore)
- Data Scientist: Data mining and predictive maintenance
- Data Marts
- Data Integration
- Local equipment sensor and server data
- Fleet data via satellite
- Cross-department operations data
- Scheduling/ERP
The Future
- Scaling up / enabling more data scientists
- Model management
- Improved productivity
- Support for containerized applications.
Pentaho ML Orchestration
- Makes data science teams more productive
- Broad support for open source libraries in various languages
Summary
- What is Data Science?
- Common Use Cases and Algorithms
- The Data Science Process
- Building a Data Science Team
- The Future
Next Steps
Want to learn more?
- Schedule a Meet the Expert
- Read Mark Hall’s Machine Learning with Pentaho Blog