Data Science 101 Arik Pelkey Pentaho Senior Director Product - PowerPoint PPT Presentation


  1. Data Science 101. Arik Pelkey, Pentaho Senior Director, Product Marketing, Hitachi Vantara; Scott Cooley, Pentaho Data Scientist, Hitachi Vantara

  2. Agenda This session will provide an introduction to data science fundamentals. • What is Data Science? • Common Use Cases and Algorithms • The Data Science Process • Building a Data Science Team • The Future

  3. AI, Machine Learning, and Deep Learning • AI: Getting machines to do what humans are good at • Machine Learning: Feeding an algorithm data to learn and predict something • Deep Learning: A type of machine learning. Image from https://blogs.nvidia.com/blog/2016/07/29/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/

  4. Data Science: Solving Problems with Data. Venn diagram of three skill areas: • HACKING SKILLS: Computer science, data engineering and wrangling, coding • MATH AND STATISTICS KNOWLEDGE: Algorithms and numerical techniques to derive insights • SUBSTANTIVE EXPERIENCE: Domain knowledge, business acumen, experience, value to the business. Overlaps: Machine Learning (hacking skills plus math and statistics), Traditional Research (math and statistics plus substantive experience), Danger Zone! (hacking skills plus substantive experience), with DATA SCIENCE at the center. Also noted on the diagram: Understanding of the underlying assumptions. Diagram from Drew Conway: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

  5. What's all the fuss? This stuff was created many, many years ago • Bayes' Theorem: Thomas Bayes, mid 1700s • Regression: Legendre and Gauss, early 1800s; Galton, late 1800s • Neural Networks: McCulloch and Pitts, early 1940s

  6. Think about All Our Data and Compute. It is still GROWING! SKA, 2020 (Square Kilometer Array Telescope): will generate as much data in a day as the entire PLANET does in a year! https://www.computerworld.com.au/article/392735/ska_telescope_generate_more_data_than_entire_internet_2020/

  7. Types of Machine Learning • Regression: Looking for a statistical relationship across variables that may give us an estimate of a particular outcome. • Classification: Similar to regression but looking for separations in the data given predefined classes. (Supervised) • Clustering: Do not have predefined classes but trying to find groups or sets based upon data at hand. (Unsupervised) • Anomaly Detection: Identification of outliers based upon expected ranges of data.
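To make the last two categories concrete, here is a minimal sketch in plain Python. It is not taken from the slides: the sensor readings, the "mean plus or minus three standard deviations" rule for the expected range, and the tiny one-dimensional k-means routine are all illustrative assumptions, and the function names are hypothetical.

```python
from statistics import mean, stdev


def expected_range(baseline, k=3.0):
    """Return (low, high) as mean +/- k sample standard deviations of baseline."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    return mu - k * sigma, mu + k * sigma


def find_anomalies(readings, low, high):
    """Return the readings that fall outside the expected range [low, high]."""
    return [r for r in readings if r < low or r > high]


def kmeans_1d(values, centers, iterations=10):
    """Tiny 1-D k-means: assign each value to its nearest center, then move
    each center to the mean of its group; repeat a fixed number of times."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center if a group ends up empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Anomaly detection: hypothetical sensor readings (not from the slides).
baseline = [20.1, 19.8, 20.3, 20.0, 19.9]
low, high = expected_range(baseline)
anomalies = find_anomalies([20.2, 19.7, 35.0, 20.4], low, high)

# Clustering: hypothetical unlabelled values with no predefined classes.
centers, groups = kmeans_1d([1.0, 1.2, 0.8, 5.0, 5.2, 4.8], centers=[0.0, 6.0])
```

With this data the reading 35.0 is the only one outside the expected range, and the unlabelled values split into one group around 1 and one around 5. Real projects would normally use a library and multi-dimensional data; the point here is only the shape of each task.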

  8. Labelled vs Unlabelled. Let's say we want to classify houses by size.
     Supervised Learning: Use the labels to build a model. The model is used to classify new house size based ONLY on the known feature set.

     Given Features or Feature Set               Label
     FullBath  HalfBath  Bedrooms  Home Age      Size
     1         0         2         56            M
     1         1         3         59            L
     2         1         3         20            M
     2         1         3         19            S

     Unsupervised: SIZE is missing! We need to look for similarities in the data and group them into clusters.
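As a rough illustration of the supervised case, the four labelled houses from this slide can drive a very simple classifier. The slide does not name an algorithm; the sketch below assumes 1-nearest-neighbour with unscaled Euclidean distance, and `classify_size` and the new house record are hypothetical.

```python
from math import dist

# Rows from the slide's house table:
# (FullBath, HalfBath, Bedrooms, HomeAge) -> Size label
LABELLED_HOUSES = [
    ((1, 0, 2, 56), "M"),
    ((1, 1, 3, 59), "L"),
    ((2, 1, 3, 20), "M"),
    ((2, 1, 3, 19), "S"),
]


def classify_size(features, training=LABELLED_HOUSES):
    """Return the Size label of the closest labelled house (1-nearest neighbour).

    Distance is plain Euclidean on the raw features, so Home Age dominates
    because its values are much larger; a real model would scale features.
    """
    _, nearest_label = min(training, key=lambda row: dist(row[0], features))
    return nearest_label


# Hypothetical new house: only the feature set is known, not the size.
new_house = (2, 1, 3, 18)
predicted = classify_size(new_house)
```

Here `predicted` is "S", because the new record sits closest to the 19-year-old house. With only four training rows this is a toy; it shows how labels plus a feature set become a model that classifies a new record.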

  9. More on Machine Learning. Machine Learning is a methodology to create a model based on sample data and use the model to make a prediction or inform a strategy using a more algorithmic approach. SUPERVISED LEARNING MODEL: • Historical records that contain square feet, number of bathrooms, zip code… • Records that contain the price the house sold for • Iterate the algorithm over the combined data to train the model • Use the trained model to predict outcome on new records
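The train-then-predict flow on this slide can be sketched with a single feature. The records below are made up for illustration, and closed-form least squares stands in for whatever iterative training algorithm a real project would use; `fit_line` and `predict_price` are hypothetical names.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_price(square_feet, slope, intercept):
    """Apply the trained model to a new record."""
    return slope * square_feet + intercept


# Hypothetical historical records: square feet and the price each house sold for.
square_feet = [1000, 1500, 2000, 2500]
sold_price = [200_000, 300_000, 400_000, 500_000]

# Train on the combined data, then predict the outcome for a new record.
slope, intercept = fit_line(square_feet, sold_price)
estimate = predict_price(1800, slope, intercept)
```

Because the made-up data lies exactly on a line, the fitted slope is 200 per square foot and the estimate for an 1,800 square foot house is 360,000. Real records are noisy and have many features (bathrooms, zip code, and so on), which is where library implementations come in.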

  10. The Data Science Process: Getting from Raw Data to Outcomes • Formal Framework: CRISP-DM (Cross Industry Standard Process for Data Mining) • The Data Science Workflow: created by Joe Blitzstein and Hanspeter Pfister for the Harvard Data Science course.

  11. Specialist Traditional Data Science Team • Data Scientist (DS): Prepares data, engineers features; most valuable skill: training models. • Data Engineer (DE): Data acquisition focus. Builds data pipelines. Not uncommon to have a 5:1 ratio of DE to DS. • Data Analyst (DA): Assists DS with data prep. • Application Architect (AA): Designs the complete solution; deploys and maintains models in production.

  12. Mythical Creatures

  13. Trends • Automation • Tools for Citizen Data Scientists • Pre-trained models in the cloud

  14. Hiring Guidance

  15. Defining Success • Easy for the tangible: search order optimization; recommendation engine or CTR • Hard for others: lead scoring; attrition • Try to measure direct outcomes • Rarely a silver bullet • Think ROI

  16. Typical Data Science Project. Steps, in order: • Understand business objectives • ID and procure training data • Prepare data and build new features • Train model • Deploy models • Update models. Roles shown against the steps: DS, DE, DA, AA.

  17. Preventive Maintenance: Caterpillar

  18. Marine Asset Intelligence. Architecture diagram components: • Fleet Data via Satellite • Local Sensor and Server Data • Operations Data • Scheduling/ERP • Data Integration • Data Marts • Data Scientist: Data Mining and Predictive Maintenance • Business User (COO): Reporting on Operations and Equipment Efficiency • Cross Department (Onboard and Onshore): Dashboards and Reports on Machine Performance

  19. The Future • Scaling up / enabling more data scientists • Model management • Improved productivity • Support for containerized applications

  20. Pentaho ML Orchestration • Makes data science teams more productive • Broad support for open source libraries in various languages

  21. Summary • What is Data Science? • Common Use Cases and Algorithms • The Data Science Process • Building a Data Science Team • The Future

  22. Next Steps Want to learn more? • Schedule a Meet the Expert • Read Mark Hall’s Machine Learning with Pentaho Blog
