Lecture #1: Introduction to CS109A aka STAT121A, AC209A, CSCIE-109A


  1. Lecture #1: Introduction to CS109A aka STAT121A, AC209A, CSCIE-109A. CS109A: Introduction to Data Science. Pavlos Protopapas, Kevin Rader and Chris Tanner

  2. Lecture Outline • Why data science? Why take CS109A? • What is data science? • What is this class, and what is it not? • The data science process • Example. CS109A, Protopapas, Rader, Tanner

  3. Why? Jobs!

  4. Why? Jobs!

  5. Why? Why do I love data science? Why are you here?

  6. Memes!

  7. Why? Why are you here?

  8. What is data science?

  9. A little bit of history

  10. History. A long time ago (thousands of years), science was only empirical, and people counted stars

  11. History (cont). A long time ago (thousands of years), science was only empirical, and people counted stars or crops

  12. History (cont). A long time ago (thousands of years), science was only empirical, and people counted stars or crops and used the data to create machines to describe the phenomena

  13. History (cont). A few hundred years ago: theoretical approaches, trying to derive equations to describe general phenomena.

  14. History (cont). About a hundred years ago: computational approaches

  15. History (cont). And then … data science. In both data science and machine learning, we extract patterns and insights from data. • Inter-disciplinary • Data and task focused • Resource aware • Adaptable to changes in the environment and needs

  16. The Potential of Data Science • Urban Planning: predicting and planning for resource needs • Disease Diagnosis: detecting malaria from blood smears • Agriculture: precision agriculture • Drug Discovery: quickly discovering new drugs for COVID

  17. The Potential of Data Science • Racial Bias: risk models used in US courts have been shown to be biased against non-white defendants • Gender Bias: some DS models for evaluating job applications show bias in favor of male candidates

  18. What? The Data Science Process: Ask an interesting question; Get the Data; Explore the Data; Model the Data; Communicate/Visualize the Results

  19. What? The Data Science Process: Ask an interesting question (What is the scientific goal? What would you do if you had all of the data? What do you want to predict or estimate?); Get the Data; Explore the Data; Model the Data; Communicate/Visualize the Results

  20. What? The Data Science Process: Ask an interesting question; Get the Data (How were the data sampled? Which data are relevant? Are there privacy issues?); Explore the Data; Model the Data; Communicate/Visualize the Results

  21. What? The Data Science Process: Ask an interesting question; Get the Data; Explore the Data (Plot the data. Are there anomalies or egregious issues? Are there patterns?); Model the Data; Communicate/Visualize the Results

  22. What? The Data Science Process: Ask an interesting question; Get the Data; Explore the Data; Model the Data (Build a model. Fit the model. Validate the model.); Communicate/Visualize the Results

  23. What? The Data Science Process: Ask an interesting question; Get the Data; Explore the Data; Model the Data; Communicate/Visualize the Results (What did we learn? Do the results make sense? Can we effectively tell a story?)

  24. What? The material of the course will integrate the five key facets of an investigation using data: 1. data collection: data wrangling, cleaning, and sampling to get a suitable data set. 2. data management: accessing data quickly and reliably. 3. exploratory data analysis: generating hypotheses and building intuition. 4. prediction or statistical learning. 5. communication: summarizing results through visualization, stories, and interpretable summaries.

  25. Goal of the course • Theory: 1. Key machine learning concepts 2. Important metrics for evaluation 3. Handling different kinds of data 4. Extracting insights from analysis of the models • Practice: 1. Implementing ML and deep learning models using Python libraries 2. Using free online tools and resources for data science • Impact: 1. Solving real-life problems using DS 2. Evaluating the social impact of DS

  26. • Weeks 1-2: Data (Data Formats + Web Scraping, Pandas) • Weeks 3-5: Regression (kNN Regression, Linear Regression, Multi and Poly Regression, Model Selection and Cross Validation, Inference, Bootstrap, Ridge and Lasso Regularization) • Weeks 6-7: Classification (kNN Classification, Logistic Regression, Multi-class Classification) • Week 8: Data (Data Imputation, PCA) • Weeks 9-10: Trees (Decision Trees, Bagging, Random Forest, Boosting Methods) • Weeks 11-12: Neural Networks (Multi-Layer Perceptron, Architecture of NN, Fitting NN with backprop and SGD, Regularization of NN) • Weeks 13-14: Ethics, Model Interpretation

  27. After CS109A • CS109B: A. Neural Networks (CNNs, RNNs, Generative models) B. Unsupervised Clustering C. Piecewise Linear Regression D. Bayesian Modeling • AC295: A. Production Data Science, from notebooks to the cloud B. Big models, transfer learning and architecture learning C. Visualization tools for interpreting models

  28. Not an exclusive list • CS171 (Visualization) • CS182 (AI) • CS181 (ML) • CS 187 (NLP) • Stat 110 (Probability) • Stat 111 (Inference) • Stat 139 (Linear Models) • Stat 149 (generalized linear models) • Stat 131 (Time Series) • Stat 171 (Stochastic Processes) • Stat 195 (Statistical Machine Learning).

  29. Who? Instructors • Pavlos Protopapas: Scientific director, Institute of Applied Computational Science • Kevin Rader: Senior preceptor in Statistics • Chris Tanner: Lecturer at Institute of Applied Computational Science

  30. Who? • Marios Mattheakis: Section Leader; Post-doctoral Fellow, IACS • Chris Gumb: Head TF; Graduate student of Data Science at Harvard Extension School • Eleni Kaxiras: Supportive Instructor; Assistant Director for Data Science and Computation at SEAS

  31. Who? Teaching Fellows

  32. Course Components

  33. Lectures, Advanced Sections, Sections and Office Hours. During lecture we will cover the material you will need to complete the homework and to survive the rest of your life in CS109A. We will use a mix of notes and exercises via edstem. 1. Lecture notes and associated notebooks will be posted before lecture on GitHub and on edstem. 2. Lectures will be videotaped (and live streamed) and posted within approximately 24 hours on the web page. We will have two ‘shows’, morning and matinee (A, B). A: Morning Mon/Wed/Fri 9:00-10:15am @Zoom. B: Matinee Mon/Wed/Fri 3:00-4:15pm @Zoom.

  34. Lecture format. ASYNCHRONOUS: • Quiz • Finish exercises from previous lecture • Reading or video watching for next lecture. SYNCHRONOUS: • Questions from asynchronous material, review of quiz and homework • Live Lecture • Q&A • Hands-on exercises in breakout rooms • Discussion about the exercises • … Repeat • Summary and conclusions. The “hands-on exercises” part will be longer during the Friday lecture as opposed to Mon/Wed.

  35. Advanced Sections, Sections and Office Hours. Lectures are supplemented by 1.25-hour sections led by teaching fellows. There are two types of sections: • Standard Sections will be a mix of review of material and practice problems similar to the homework. Friday 1:30-2:45 pm and Mon 8:30-9:45 pm @zoom • Advanced Sections (A-Sections) will cover advanced topics like the mathematical underpinnings of the methods seen in lectures and labs. Weds 12:01 pm - 1:15 pm @zoom. A-sections are required for AC209 students. Note: Sections are not held every week. Consult the course calendar for exact dates.
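
The "Explore the Data" and "Model the Data" steps of the data science process (slides 21 and 22) can be illustrated with a small sketch. This is not taken from the course materials: the toy data, the 3-standard-deviation anomaly rule, the 80/20 split, and all function names below are assumptions chosen for illustration, and only the Python standard library is used. It flags possible anomalies, fits a simple linear regression on a training subset, and validates it on held-out points.

```python
import random
import statistics


def make_toy_data(n=200, seed=0):
    """Hypothetical data (an assumption, not course data): y = 3 + 2x + noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [3.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys


def flag_anomalies(values, z_threshold=3.0):
    """Explore: indices of values more than z_threshold sample std devs from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [i for i, v in enumerate(values) if abs(v - mean) > z_threshold * sd]


def fit_line(xs, ys):
    """Fit: ordinary least squares for y = intercept + slope * x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def mse(xs, ys, intercept, slope):
    """Validate: mean squared error of the fitted line on the given points."""
    return statistics.fmean(
        (y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)
    )


def main():
    xs, ys = make_toy_data()
    print("possible anomalies at indices:", flag_anomalies(ys))

    # Hold out 20% of the points so validation uses data the fit never saw.
    idx = list(range(len(xs)))
    random.Random(1).shuffle(idx)
    cut = int(0.8 * len(idx))
    train, valid = idx[:cut], idx[cut:]

    intercept, slope = fit_line([xs[i] for i in train], [ys[i] for i in train])
    valid_mse = mse([xs[i] for i in valid], [ys[i] for i in valid], intercept, slope)
    print(f"intercept={intercept:.2f} slope={slope:.2f} validation MSE={valid_mse:.2f}")


if __name__ == "__main__":
    main()
```

Validating on held-out points rather than on the training points is the design choice that matters here: it estimates how the model behaves on data it was not fitted to.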
