Mason Chen Black Belt, Stanford OHS
1st Place Best Contributed Paper, 2018 JMP Discovery Summit, Cary, NC
THE ANTI-HEART DISEASE WARRIORS
Project Scope and Presentation Flow
Many people like chocolate, but have some concerns
STEAMS
Literature Research
& Science
Types
Imputation of Cocoa%
Neural Setting
People with heart disease might need to eat chocolate, but do not know which one to eat.
STEAMS
Mason C., “STEAMS” Methodology [...] Conducting Chocolate Science Research”, submitted to NSTA STEM Expo
Ingredients and Nutritions
Chocolate Products
Chocolate Process (Technology & Engineering)
Anti-Oxidant Science
Clustering Products (Artificial Intelligence)
Clustering Nutritions (Artificial Intelligence)
Linkage Choices: Single, Complete, Centroid
Clustering Algorithm (Math)
DSD Optimization (Statistics)
Neural Imputation (Artificial Intelligence)
STEAMS
Develop Math & Science Foundation (Stanford OHS):
Math, Physics, Biology, Chemistry, JAVA, Statistics, Literature Research/Writing
Certify 6 Professional Certificates:
IBM SPSS Statistics
IBM Modeler DA/DM (IBM000129876)
IASSC YB/GB/BB (GR764000541MC)
JMP Statistical Thinking (2018 Goal)
JMP DOE (2019 Goal)
JMP Script Specialist (2019 Goal)
Enhance STEAMS Skills:
JMP/Pro, Python, LaTeX, Paper Proceedings, Oral/Poster Presentations, Team Building
Learning “STEAMS” techniques helps motivate school learning in a project-based and practical way.
STEAMS
Fun, Real Hands-On
2017-2018 Conferences
IEOM/ASQ STEM, Palo Alto
AQI Six Sigma, Nashville, TN
ASA JSM, Baltimore, MD
IEOM, Bogota, Colombia
IEOM, Rabat, Morocco
IWSM, Groningen, Netherlands
JMP, Prague, Czech Republic
FSDM, Hualien, Taiwan
IEOM, Bandung, Indonesia
ISF, Cairns, Australia
ASA JSM, Vancouver, CAN
ISF, Boulder, CO
ASA SDSS, Reston
IBC, Barcelona, Spain
IWSM, Bristol, UK
ISCB, Melbourne, Australia
JMP, Cary, NC
FSDM, Bangkok, Thailand
IEOM, DC
IEOM/ASQ STEM, Santa Clara, CA
IEOM, Paris, France
JMP 13 >> Analyze >> Fit Y by X >> Nonpar Density
Mason C. (2018 July), “Multivariate Statistics of Anti-Oxidant Chocolate”, SMS IWSM Bristol Proceedings, Vol 2, 37-40
Chocolate has not been proven harmful.
Life Expectancy (2015 Estimate)
Median: 74.75
#9: 82.50, #32: 80.57, #31: 80.68, #33: 80.54, #20: 81.70, #15: 81.98, #13: 82.15, #24: 81.23, #43: 79.68, #19: 81.75
JMP 13 >> Analyze >> Fit Y by X >> Nonpar Density
Dark chocolate is a powerful source of
equal to that of an apple, it has the highest amount of antioxidant.
ENGINEERING
Anti-Oxidant Capacity/gram
Lower Cardiovascular Heart Disease (CHD) risk if taking 2 chocolate servings per week (1 serving = 30 g)
disease)
AF associated cardiovascular disease
https://heart.bmj.com/content/103/15/1163 https://www.bmj.com/content/343/bmj.d4488
TECHNOLOGY
properties.
▪ Consists of two phenyl rings (A and B) and a heterocyclic ring (C).
▪ Flavones, flavanol, flavanones, isoflavones, anthocyanidins, chalcones, catechins
to toe.
SCIENCE
▪ Free radicals are atoms or molecules with an unpaired electron
▪ Antioxidants reduce free radical formation
▪ Reactive free radicals cause cells to malfunction
▪ Excess free radicals damage blood vessels
▪ After oxidation by free radicals, LDL (low-density lipoprotein) can cause CVD (cardiovascular disease)
▪ The oxidized components attract macrophages, which absorb and deposit cholesterol
SCIENCE
Benefits:
– A lot of soluble fiber
– A lot of minerals: iron, magnesium, copper, manganese, potassium, phosphorus, zinc, selenium
– Powerful source of antioxidants
– Improves blood flow and lowers blood pressure
– Increases HDL (good cholesterol) and decreases LDL (bad cholesterol)
– Lowers risk of cardiovascular disease (CVD)
– Improves brain function
Concerns:
– Causes migraines
– Increases chance of kidney stones
– Side effects from caffeine, such as irregular heartbeat
TECHNOLOGY
(1) JMP 13 >> Analyze >> Distribution
(2) JMP 13 >> Analyze >> Multivariate Methods >> Multivariate
(3) JMP 13 >> Analyze >> Clustering >> Cluster Variables
(4) JMP 13 >> Analyze >> Multivariate Methods >> Principal Components >> Bi-Plot
[...] “Choose Healthy Chocolate”, IEOM Europe Paris Proceedings, 434-441
Most Dark Chocolate (Qualitative Clustering Criteria) has:
Chocolate product nutrition data indicate that Dark Chocolate is healthier than Milk and White Chocolate.
JMP 13 >> Analyze >> Distribution
STATISTICS
negative correlation of -0.9162.
positive correlation of 0.7722. Is there a better way to cluster nutritions?
JMP 13 >> Analyze >> Multivariate Methods >> Multivariate
SCIENCE
Pair-Wise Pearson Correlation
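The pair-wise Pearson correlation reported on this slide can be sketched in plain Python. This is a minimal illustration only: the three short columns below are hypothetical values, not the deck's chocolate data, and the function name `pearson` is an assumption for this sketch rather than anything from JMP.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of cross-products and sums of squares around the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values for illustration only (not the deck's data).
cocoa_pct = [30.0, 45.0, 60.0, 72.0, 85.0]
sugar_g = [22.0, 18.0, 13.0, 10.0, 5.0]
fiber_g = [1.0, 2.0, 3.0, 4.0, 5.0]

r_sugar = pearson(cocoa_pct, sugar_g)  # strongly negative for these values
r_fiber = pearson(cocoa_pct, fiber_g)  # strongly positive for these values
```

A coefficient near -1 or +1 indicates a strong linear relationship, which is how values such as -0.9162 and 0.7722 on this slide are read.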
Clustering Nutritions can interpret the relevant Chocolate Science insight well:
Cluster 1: the higher the saturated fat, the higher the total fat, and the higher the calories.
Cluster 2: Calcium/Cholesterol and Cocoa percent have a negative correlation.
Cluster 3: the higher the sugar, the higher the carbohydrates.
Cluster 4: Iron and dietary fiber are positively correlated.
JMP 13 >> Analyze >> Clustering >> Cluster Variables
SCIENCE & AI
Common Sense
Signal, Noise, S-N Ratio
1st Cluster: Cocoa Percent, Dietary Fiber, and Iron are near each other (higher for Dark Chocolate)
2nd Cluster: Total Fat, Saturated Fat, and Calories
3rd Cluster: Calcium, Sugar, and Cholesterol are near each other (higher for Milk/White Chocolate)
JMP 13 >> Analyze >> Multivariate Methods >> Principal Components >> Bi-Plot
MATH & AI
Visualization
ENGINEERING
Technology” Literature Research well (STEAMS).
JMP 13 >> Analyze >> Clustering >> Hierarchical Cluster
JMP 13 >> Analyze >> Distribution
JMP 13 >> Analyze >> Clustering >> Hierarchical Cluster >> Column Summary
JMP 13 >> Analyze >> Clustering >> Hierarchical Cluster >> Clustering Distance Method
JMP 13 >> Analyze >> Clustering >> Hierarchical Cluster >> Constellation Plot
JMP 13 >> Analyze >> Multivariate Methods >> Principal Components >> Eigenvalues
JMP 13 >> Analyze >> Distribution
[...] “Statistics Application [...] Chocolate Science with Heart Disease”, ASA SDSS Proceedings
Objective: find a way to identify healthy chocolate products for Heart Disease patients.
dark chocolate split between the first and second cluster.
JMP 13 >> Analyze >> Clustering >> Hierarchical Cluster
JMP 13 >> Analyze >> Distribution
STATISTICS & AI
Cluster too small, missing 39 points
Healthy?
JMP 13 >> Analyze >> Clustering >> Hierarchical Cluster >> Column Summary
AI & TECHNOLOGY
Clustering patterns depend on the number of observations per cluster, the cluster variance, and outliers.
JMP 13 >> Analyze >> Clustering >> Hierarchical Cluster
Linkage Choices
Single, Complete, Centroid
MATH
Average, ANOVA (MS), Min, Center-Center, Maximum
MATH
Depending on the data distribution, selecting an appropriate Clustering Distance algorithm is critical to Clustering Pattern Analysis
Centroid
Single (Join Larger Variances) Ward (Join Smaller Observations)
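The linkage choices named on these slides differ only in how the distance between two clusters is measured. A minimal sketch of the standard textbook definitions follows; the two small clusters are hypothetical points, the function names are assumptions for this illustration, and the Ward formula shown is the usual merge cost (increase in within-cluster sum of squares), which may not match JMP's implementation detail for detail.

```python
import math

def single_linkage(a, b):
    """Smallest point-to-point distance between the two clusters (Min)."""
    return min(math.dist(p, q) for p in a for q in b)

def complete_linkage(a, b):
    """Largest point-to-point distance between the two clusters (Maximum)."""
    return max(math.dist(p, q) for p in a for q in b)

def average_linkage(a, b):
    """Mean of all point-to-point distances between the two clusters."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def centroid_linkage(a, b):
    """Distance between the two cluster centers (Center-Center)."""
    return math.dist(centroid(a), centroid(b))

def ward_linkage(a, b):
    """Increase in within-cluster sum of squares if a and b are merged."""
    na, nb = len(a), len(b)
    return (na * nb) / (na + nb) * math.dist(centroid(a), centroid(b)) ** 2

# Hypothetical 2-D clusters for illustration only.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (5.0, 2.0)]

d_single = single_linkage(cluster_a, cluster_b)
d_complete = complete_linkage(cluster_a, cluster_b)
d_centroid = centroid_linkage(cluster_a, cluster_b)
d_ward = ward_linkage(cluster_a, cluster_b)
```

Because single linkage looks only at the closest pair while complete linkage looks at the farthest pair, the same data can produce different cluster shapes, which is why the choice of distance method matters for the pattern analysis.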
JMP 13 >> Analyze >> Clustering >> Hierarchical Cluster >> Constellation Plot
AI & MATH
Risk of Misclassification of Healthy Chocolate
From both the scree plot and PCA eigenvalues (80% Pareto), we can pick 4 clusters
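The 80% Pareto rule on the eigenvalues can be sketched as a cumulative-variance cutoff. This is an illustration under stated assumptions: the eigenvalues below are hypothetical (chosen to sum to 15, as they would for a 15-variable correlation matrix), and `components_for_coverage` is a name invented for this sketch.

```python
def components_for_coverage(eigenvalues, threshold=0.80):
    """Smallest number of leading eigenvalues whose share of the total reaches threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= threshold:
            return k
    return len(ordered)

# Hypothetical eigenvalues for illustration only (not the deck's PCA output).
eigs = [5.1, 3.2, 2.4, 1.6, 0.9, 0.6, 0.4, 0.3, 0.2, 0.1, 0.1, 0.05, 0.03, 0.01, 0.01]

n_components = components_for_coverage(eigs)
```

With these made-up values the first four eigenvalues cover 82% of the total variance, so the rule selects four, mirroring the way the slide arrives at 4 clusters.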
JMP 13 >> Analyze >> Multivariate Methods >> Principal Components >> Eigenvalues
MATH
Clustering pattern result is highly dependent on the number of clusters
JMP 13 >> Analyze >> Distribution
STATISTICS
Better Choice / Worse Choice
JMP 13 >> Analyze >> Clustering >> Hierarchical Cluster >> Missing Value Imputation
JMP 13 >> Analyze >> Predictive Modeling >> Partition
JMP 13 >> Analyze >> Predictive Modeling >> Neural
JMP 13 >> Analyze >> Predictive Modeling >> Neural >> Diagram
JMP 13 >> Analyze >> Screening >> Explore Missing Values
[...] “Neural Network Algorithm [...] Missing Value Imputation [...] for Chocolate Science Research“, submitted to SIAM SDM19
[...] information (most are Milk Chocolates)
JMP 13 >> Analyze >> Screening >> Explore Missing Values
AI & MATH
[...] better imputation method
[...] perceptron (hidden nodes) with one layer.
Main advantage: can efficiently model different response surfaces given enough hidden nodes and layers.
Main disadvantage: results are not easily interpretable, since there are intermediate layers (Black Box).
Standard JMP Edition:
Only TanH activation function
[...] with one hidden layer.
AI
TanH: More Powerful Exponential Transformation than PCA Linear Transformation
Perceptron
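A single hidden layer of TanH nodes feeding a linear output, as described above, can be sketched in a few lines. This is a generic forward pass and not JMP's internal implementation; all weights, biases, and the function name `neural_predict` are hypothetical values chosen for illustration.

```python
import math

def neural_predict(x, hidden_weights, hidden_biases, out_weights, out_bias):
    """One hidden layer of TanH nodes followed by a linear output node."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(weights, x)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sum(v * h for v, h in zip(out_weights, hidden)) + out_bias

# Hypothetical weights: 3 inputs feeding 4 hidden nodes (illustration only).
hidden_weights = [
    [0.5, -0.2, 0.1],
    [-0.3, 0.4, 0.0],
    [0.0, 0.0, 0.0],
    [0.2, 0.2, -0.6],
]
hidden_biases = [0.0, 0.1, 0.0, -0.1]
out_weights = [10.0, -5.0, 3.0, 8.0]
out_bias = 50.0

prediction = neural_predict([1.0, 2.0, 0.5], hidden_weights, hidden_biases,
                            out_weights, out_bias)
```

Each hidden node squashes a weighted sum of the inputs into the range (-1, 1), so the output is a nonlinear function of the inputs even though the final step is linear. That nonlinearity is the source of both the flexibility and the "Black Box" interpretability concern noted on the slide.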
Validation are above 0.7.
(typical Over-fit concern for Neural).
top three predictors for predicting the Cocoa%
JMP 13 >> Analyze >> Predictive Modeling >> Neural
AI & STATISTICS
Stronger
JMP 13 >> Analyze >> Predictive Modeling >> Neural >> Diagram
AI
Higher Sensitivity:
3rd Node
Chocolate Type
Neural TanH
JMP 13 >> DOE >> Definitive Screening Design
JMP 13 >> DOE >> Design Diagnostic >> Evaluate Design
JMP 13 >> Analyze >> Fit Model
JMP 13 >> Analyze >> Fit Y by X
JMP 13 >> Analyze >> Distribution
JMP 13 >> Save Script >> To Data Table or To Script Window >> Edit/Save/Run Script
[...] Optimize Neural Network Algorithm of [...] Imputation [...] submitted to 2019 ASA ENAR Spring Meeting
Four DOE Input Variables:
[...] under Validation Method
Two DOE Output Responses:
[...] Important - Neural Over-fit)
Objective: optimize Neural settings to resolve [...] and Validation for Cocoa Missing Imputation
JMP Neural Validation Methods:
Holdback: randomly divides the original data into the training and validation (holdback portion) sets.
KFold: divides the data into K subsets. Each of the K sets is used in turn to validate the model fit on the rest of the data.
The model giving the best validation statistic is chosen as the final model.
Best for small data sets (makes efficient use of limited data)
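The two validation methods can be contrasted with a small index-splitting sketch. It assumes 24 rows, matching the "24/5" arithmetic that appears later in the deck; the holdback fraction, the seed, and the function names are assumptions for illustration and do not reproduce JMP's random assignment.

```python
import random

def holdback_split(n, holdback=1 / 3, seed=5):
    """Randomly hold back a fraction of the n rows for validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_val = round(n * holdback)
    return idx[n_val:], idx[:n_val]  # (training rows, validation rows)

def kfold_splits(n, k=5, seed=5):
    """Yield (training rows, validation rows) for each of the k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal subsets
    for i, validation in enumerate(folds):
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield training, validation

train_rows, val_rows = holdback_split(24)
fold_list = list(kfold_splits(24, k=5))
fold_sizes = sorted(len(val) for _, val in fold_list)
```

With 24 rows and K = 5 the validation folds hold 4 or 5 rows each, and every row is used for validation exactly once, which is why KFold makes more efficient use of a small data set than a single holdback split.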
JMP 13 >> DOE >> Definitive Screening Design
ENGINEERING
14 DSD Runs
DSD Runs is safer on Power
Power Test of Sign (> 90%)
Correlation of Confounding (< 0.3)
Uniformity of Prediction
Power
JMP 13 >> DOE >> Design Diagnostic >> Evaluate Design
ENGINEERING
Add Four Random Corner Points
Construct Model Effects:
[...] “Nested” under Validation Method
[...] Response Surface (RS)
JMP 13 >> Analyze >> Fit Model
STATISTICS
Resolve Neural Over-Fit
Optimal Neural Setting:
[...] sample size and favor validation)
5 KFold numbers (24/5 ≈ 5 data points per validation set)
[...] Seed = 5 to improve reproducibility
[...] Hidden Nodes is best (constrained by 15 input variables for one layer)
[...] R-Square fit on predicting Cocoa%
JMP 13 >> Analyze >> Fit Y by X
STATISTICS
Nested
Future Work:
[...] R-Square > 100%, not following Normal Distribution?
Wider Confidence Interval: Small Validation Dataset, or Outliers?
STATISTICS
Holdback Portion
KFold
K = 5, select the best among 5 models
Consider Neural Over-fit (lower Validation R-Square)
If K is large, each K subset is small, making the validation Over-Fit concern worse
If K is small, the advantage of using KFold over Holdback is lost
When total sample size is smaller, may prefer KFold with a smaller K number
Coincidence with Four Hidden Nodes?
[...] transforming the 15 Input Nutrition Variables
[...] related to PCA Eigen algorithm (TanH ~ Linear)?
Why KFold over Holdback?
Original Neural Setting / Optimal DSD Neural Setting
Indicating 4th Chocolate Type: Fruit Chocolate (Vitamin C)
ENGINEERING
Validation R-Square improved by 20%
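The R-Square responses tracked for the training and validation sets can be sketched with the common definition 1 - SSE/SST. This is an assumption about the statistic: JMP's Neural report may use a different or generalized variant. The cocoa values below are hypothetical, not the deck's data.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_y) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical validation rows for illustration only.
actual_cocoa = [35.0, 55.0, 70.0, 85.0]
predicted_cocoa = [40.0, 50.0, 72.0, 80.0]

validation_r2 = r_squared(actual_cocoa, predicted_cocoa)
```

A training R-Square that is much higher than the validation R-Square is the usual sign of the over-fit concern discussed in this deck.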
JMP 13 >> Analyze >> Predictive Modeling >> Neural
Fruit
Default: Missing Value Imputation / Optimal Neural (Predicted Cocoa %)
ENGINEERING
Reduce Risk of Misclassification of Unhealthy Dark Chocolates
Achievements:
✓ Adopted and integrated “STEAMS” methodology successfully
✓ Learned Chocolate Products, Nutrition, Anti-Oxidant Science
✓ Applied Multivariate Statistics, Clustering and Neural Algorithms
✓ Conducted DSD optimization on resolving Neural Overfit
Future JMP Research:
Investigate “Fruit” Chocolate Type, Outlier Effect
JMP Pro Partition: Bootstrap Forest, Boosted Tree, K Nearest Neighbors, Naïve Bayes
JMP Pro Neural: Deep Learning, Hidden Layer Structure, Fitting Options
Certify JMP Script Specialist
JAVA/LaTeX Advisor: Dr. [...]ing Huang
Biology Writing Advisor: [...]OE Patrick Giuliano
Biology Advisor: Dr. [...] Chen Chen
Robotics Advisor: [...]RE Roland Jones
Statistics/MATH/AI Advisor: Dr. Charles Chen
STEAMS Advisor: ASQ Fellow Dr. John Flaig
STEAMS Advisors: Stanford OHS
STEAMS Advisor: JMP [...] Chuck Boiler
STEAMS Advisor: ASA Dr. Chris Barker
STEAMS Advisor: IEOM Dr. [...]i Ahad, IEOM Dr. Don Reimer
ENGINEERING
Optimize the Number of Hidden Nodes:
Little room for further improvement on setting the relative importance between Training Set and Validation Set
Conduct a 2-factor, 3-level full factorial to compare the relative importance in the (1, 2) range
0.9)
3 hidden nodes favor training set and 4 hidden nodes favor validation set
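A 2-factor, 3-level full factorial simply enumerates every combination of the levels, giving 3 × 3 = 9 runs. The sketch below is illustrative only: the factor names and the levels 1.0, 1.5, 2.0 (spanning the (1, 2) range mentioned above) are hypothetical placeholders, not the settings used in the deck.

```python
from itertools import product

def full_factorial(levels_by_factor):
    """Return one dict per run, covering every combination of factor levels."""
    names = list(levels_by_factor)
    level_lists = [levels_by_factor[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*level_lists)]

# Hypothetical factor names and levels for illustration only.
design = full_factorial({
    "factor_a": [1.0, 1.5, 2.0],
    "factor_b": [1.0, 1.5, 2.0],
})
n_runs = len(design)
```

Each run in `design` would then be fitted once and its training and validation R-Square recorded as the two responses.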
// Partition (decision tree) model predicting cocoa percent from type and nutrition columns
Partition(
    Y( :Cocoa_Percent ),
    X( :Type, :Name( "Calories (g)" ), :Name( "Calories_from_Fat (g)" ), :Name( "Total_Fat (g)" ), :Name( "Saturated_Fat (g)" ), :Trans_Fat, :Name( "Cholesterol (mg)" ), :Name( "Sodium (mg)" ), :Name( "Carbs (g)" ), :Name( "Dietary_Fiber (g)" ), :Name( "Sugar (g)" ), :Name( "Protein (g)" ), :Vitamin_A, :Vitamin_C, :Calcium, :Iron ),
    Minimum Size Split( 3 ),
    Validation Portion( 0.6 ),
    Split History( 1 ),
    Informative Missing( 1 ),
    Column Contributions( 1 ),
    Initial Splits( :Name( "Cholesterol (mg)" ) >= 5 ),
    SendToReport( Dispatch( {}, "Partition", FrameBox, {Frame Size( 480, 56 )} ) )
);

// Neural model: one hidden layer of 4 TanH nodes, validated with 5-fold KFold
Neural(
    Y( :Cocoa_Percent ),
    X( :Name( "Calories (g)" ), :Name( "Calories_from_Fat (g)" ), :Name( "Total_Fat (g)" ), :Name( "Saturated_Fat (g)" ), :Trans_Fat, :Name( "Cholesterol (mg)" ), :Name( "Sodium (mg)" ), :Name( "Carbs (g)" ), :Name( "Dietary_Fiber (g)" ), :Name( "Sugar (g)" ), :Name( "Protein (g)" ), :Vitamin_A, :Vitamin_C, :Calcium, :Iron, :Type ),
    Informative Missing( 0 ),
    Validation Method( "KFold", 5 ),
    Fit( NTanH( 4 ), Diagram( 1 ) )
);

// Least-squares model of the two DOE responses; column names are kept exactly as spelled in the source script
Fit Model(
    Y( :Name( "R-Square of Training Set" ), :Name( "R-Square of Validaiton Set" ) ),
    Effects( :Validation Setting[:Vaidation Method], :Vaidation Method, :Random Seed, :Hidden Nodes & RS, :Vaidation Method * :Random Seed, :Vaidation Method * :Hidden Nodes, :Random Seed * :Hidden Nodes, :Hidden Nodes * :Hidden Nodes ),
    Personality( "Standard Least Squares" ),
    Emphasis( "Effect Screening" ),
    :Name( "R-Square of Training Set" ) << {Summary of Fit( 0 ), Analysis of Variance( 0 ), Lack of Fit( 0 ), Sorted Estimates( 0 ), Plot Actual by Predicted( 1 ), Plot Regression( 0 ), Plot Residual by Predicted( 1 ), Plot Studentized Residuals( 1 ), Plot Effect Leverage( 0 ), Box Cox Y Transformation( 1 )},
    :Name( "R-Square of Validaiton Set" ) << {Summary of Fit( 0 ), Analysis of Variance( 0 ), Lack of Fit( 0 ), Sorted Estimates( 0 ), Plot Actual by Predicted( 1 ), Plot Regression( 0 ), Plot Residual by Predicted( 1 ), Plot Studentized Residuals( 1 ), Plot Effect Leverage( 0 ), Box Cox Y Transformation( 1 )}
);
JMP 13 >> Save Script >> To Data Table
To Script Window >> Edit/Save/Run Script
AI