Topic 4: ML Data Preparation and Model Selection
Chapter 8, 8.1, 8.2, 8.3, 8.4 of MLSys Book
Arun Kumar
DSC 102 Systems for Scalable Analytics
CrowdFlower Data Science Report 2016 https://visit.figure-eight.com/rs/416-ZBE-142/images/CrowdFlower_DataScienceReport_2016.pdf
Kaggle State of ML and Data Science Survey 2018
IDC-Alteryx State of Data Science and Analytics Report 2019
Raw data sources/repos
Analytics/ML-ready data
https://www.tensorflow.org/tfx/guide
https://eng.uber.com/michelangelo/
FullName        Age  City       State
Aisha Williams  27   San Diego  CA

LastName  FirstName  MI  Age  Zipcode
Williams  Aisha      R   27   92122
https://www.snorkel.org/blog/weak-supervision
http://cidrdb.org/cidr2019/papers/p58-ratner-cidr19.pdf
https://medium.com/the-official-integrate-ai-blog/transfer-learning-explained-7d275c1e34e2
https://adalabucsd.github.io/papers/2015_MSMS_SIGMODRecord.pdf
UserID  State  Date     Upvotes  Comment              Label
143     CA     4/3/19   1539     “This restaurant is
        NY     11/7/19  5020     “Not too bad!”       +
98      WI     2/8/20   402      “Pretty rad”         +
…       …      …        …        …                    …
UserID  State  Date  Upvotes  Comment  Label
143     CA     …     …        …
        NY     …     …        …        +
143     CA     …     …        …        +
…       …      …     …        …        …

UserID  Age  Name
304     40   …
23      25   …
143     33   …
…       …    …
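The two tables above share a UserID column, so one plausible reading is that the second table supplies extra per-user features through a join on that key. The sketch below works under that assumption. It uses only the UserID, State, and Age values that survive in the tables; the function name `join_on_user_id` and the plain-dictionary layout are illustrative choices, not something the slides prescribe.

```python
# Rows of the labeled table (only the cells that survive in the source).
reviews = [
    {"UserID": 143, "State": "CA"},
    {"UserID": 143, "State": "CA"},
]

# The per-user table, indexed by UserID for constant-time lookup.
users = {
    304: {"Age": 40},
    23: {"Age": 25},
    143: {"Age": 33},
}


def join_on_user_id(review_rows, user_table):
    """Left join: keep every review row and attach user columns when the key matches."""
    joined = []
    for row in review_rows:
        extra = user_table.get(row["UserID"], {})  # empty dict if the user is unknown
        joined.append({**row, **extra})
    return joined


joined = join_on_user_id(reviews, users)
```

Each joined row now carries Age alongside State, so a user-level attribute becomes a candidate feature for the labeled examples.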
F1  F2  F3  Label
3   2   …
    20  …   +
5   10  …   +
…   …   …   …

F1  F2  F3  F11  F12  F13  F22  F23  F33  Label
3   2   …   9    6    …    4    …    …
    20  …   16   80   …    400  …    …    +
5   10  …   25   50   …    100  …    …    +
…   …   …   …    …    …    …    …    …    …
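In the expanded table, each new column Fij holds the product of Fi and Fj (for example, F11 = 3 × 3 = 9 and F12 = 3 × 2 = 6 in the first row). A minimal sketch of that expansion follows. The function name `add_pairwise_products` is hypothetical, and the F3 value of 1 is a placeholder because the source elides F3.

```python
from itertools import combinations_with_replacement


def add_pairwise_products(row):
    """Return the original features followed by all products row[i] * row[j], i <= j.

    For three features the product order is F11, F12, F13, F22, F23, F33,
    which matches the column order of the expanded table.
    """
    index_pairs = combinations_with_replacement(range(len(row)), 2)
    products = [row[i] * row[j] for i, j in index_pairs]
    return list(row) + products


# F1 and F2 come from the table; F3 = 1 is an assumed placeholder value.
expanded = add_pairwise_products([5, 10, 1])
```

With d original features this adds d(d + 1)/2 columns, so the feature count grows quadratically.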
UserID  State  Date  Upvotes  Comment  Label
…       …      …     …        …        …

State  Upvotes  Comment  Label
…      …        …        …

Upvotes  Comment  Label
…        …        …
UserID  State  Date  Upvotes  Comment  Label
…       …      …     …        …        …

F1   F2   F3  Label
0.3  4.2
…
…  Comment                        Label
…  “This restaurant is not good”
   “Good good!”                   +
…  “Pretty rad”                   +
…  …                              …

…  sucks  good  …  Label
…  1      1     …
          2     …  +
…               …  +
…  …      …     …  …
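The second table above replaces the Comment text with one count column per vocabulary word, which is the usual bag-of-words encoding. The sketch below illustrates the idea under some assumptions: the helper name `bag_of_words`, the lowercase regex tokenizer, and the two-word vocabulary are illustrative choices. The counts are computed from the comment text as it survives here, so they need not match every cell of the extracted table.

```python
import re
from collections import Counter


def bag_of_words(comments, vocabulary):
    """Turn each comment into a list of word counts, one entry per vocabulary word."""
    rows = []
    for text in comments:
        # Lowercase and split on anything that is not a letter or apostrophe.
        tokens = re.findall(r"[a-z']+", text.lower())
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])  # missing words count as 0
    return rows


comments = ["This restaurant is not good", "Good good!", "Pretty rad"]
vocabulary = ["sucks", "good"]
counts = bag_of_words(comments, vocabulary)
```

Word order is discarded in this encoding, so "not good" and "good" contribute the same count to the `good` column.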
https://medium.com/@karpathy/software-2-0-a64152b37c35
http://dmlc.cs.washington.edu/data/pdf/XGBoostArxiv.pdf