 
Course Summary
Introduction:
  – Basic problems and questions in machine learning
Linear Classifiers
  – Naïve Bayes
  – Logistic Regression
  – LMS
Five Popular Algorithms
  – Decision trees (C4.5)
  – Neural networks (backpropagation)
  – Probabilistic networks (Naïve Bayes; Mixture models)
  – Support Vector Machines (SVMs)
  – Nearest Neighbor Method
Theories of Learning:
  – PAC, Bayesian, Bias-Variance analysis
Optimizing Test Set Performance:
  – Overfitting, Penalty methods, Holdout Methods, Ensembles
Sequential Data
  – Hidden Markov models, Conditional Random Fields; Hidden Markov SVMs

Course Summary
Goal of Learning
Loss Functions
Optimization Algorithms
Learning Algorithms
Learning Theory
Overfitting and the Triple Tradeoff
Controlling Overfitting
Sequential Learning
Statistical Evaluation

Goal of Learning
Classifier: ŷ = f(x)   “Do the right thing!”
Conditional probability estimator: P(y | x)
Joint probability estimator: P(x, y)
  – compute conditional probability at classification time

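The last bullet, computing the conditional probability from a joint estimate at classification time, can be sketched as follows. This is a minimal illustration, assuming the joint model is a small lookup table; the function name `conditional_from_joint` and the weather data are hypothetical and not from the course.

```python
def conditional_from_joint(joint, x):
    """Return P(y | x) as a dict, given joint[(x, y)] = P(x, y)."""
    # Marginal P(x) is the sum of P(x, y) over all labels y.
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)
    if p_x == 0:
        raise ValueError("x has zero probability under the joint model")
    # Bayes rule: P(y | x) = P(x, y) / P(x).
    return {y: p / p_x for (xi, y), p in joint.items() if xi == x}


# Hypothetical joint distribution over (weather, decision).
joint = {
    ("sunny", "play"): 0.4, ("sunny", "stay"): 0.1,
    ("rainy", "play"): 0.1, ("rainy", "stay"): 0.4,
}

cond = conditional_from_joint(joint, "sunny")
# Turning the probability estimate into a classifier: pick the most probable label.
y_hat = max(cond, key=cond.get)
```

The same estimate can serve all three goals on the slide: the table is the joint estimator, `cond` is the conditional estimator, and `y_hat` is the classifier output.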
Loss Functions
Cost matrices and Bayesian decision theory
  – Minimize expected loss
  – Reject option
Log Likelihood: ∑_k −I(y=k) log P(y=k | x, h)
0/1 loss: need to approximate
  – squared error
  – mutual information
  – margin slack (“hinge loss”)

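The losses named on this slide can be written out in a few lines. The sketch below is illustrative only: the function names, the `{-1, +1}` label convention for the hinge loss, and the cost matrix with a constant-cost "reject" action are assumptions, not definitions taken from the course.

```python
import math


def log_loss(probs, y):
    # probs[k] = P(y=k | x, h). The indicator I(y=k) keeps only the true-class term.
    return -math.log(probs[y])


def zero_one_loss(y_hat, y):
    return 0 if y_hat == y else 1


def squared_error(score, y):
    return (y - score) ** 2


def hinge_loss(score, y):
    # Margin slack: y is in {-1, +1}, score is the real-valued output f(x).
    return max(0.0, 1.0 - y * score)


def bayes_decision(probs, cost):
    # cost[action][k] = loss of taking `action` when the true class is k.
    # Choose the action with minimum expected loss under probs.
    expected = {a: sum(cost[a][k] * probs[k] for k in probs) for a in cost}
    return min(expected, key=expected.get)


# Hypothetical cost matrix: errors cost 1, rejecting costs 0.3 regardless of class.
cost = {
    "pos": {"pos": 0.0, "neg": 1.0},
    "neg": {"pos": 1.0, "neg": 0.0},
    "reject": {"pos": 0.3, "neg": 0.3},
}
uncertain = bayes_decision({"pos": 0.55, "neg": 0.45}, cost)
confident = bayes_decision({"pos": 0.9, "neg": 0.1}, cost)
```

With these assumed costs, the reject action wins whenever the most probable class has probability below 0.7, which is how a reject option falls out of minimizing expected loss.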
Optimization Algorithms
None: direct estimation of µ, Σ, P(y), P(x | y)
Gradient Descent: LMS, logistic regression, neural networks, CRFs
Greedy Construction: Decision trees
Boosting
None: nearest neighbor

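As a reminder of what the gradient descent entry looks like in its simplest case, here is a minimal sketch of the LMS update for a linear model with a bias term. The function name `lms_fit`, the learning rate, the epoch count, and the toy data are assumptions chosen for illustration.

```python
def lms_fit(xs, ys, lr=0.05, epochs=500):
    """Fit y ~ w.x + b by stochastic gradient descent on squared error."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            # Prediction error on this example.
            err = y - (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # LMS rule: move the weights along the negative gradient of err^2 / 2.
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b


# Noise-free toy data generated from y = 2x + 1.
xs = [[0.0], [1.0], [2.0], [3.0]]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = lms_fit(xs, ys)
```

Logistic regression, neural networks, and CRFs follow the same loop with a different loss and gradient; the learning rate has to be small relative to the feature scale for the updates to stay stable.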
Learning Algorithms
LMS
Logistic Regression
Multivariate Gaussian and LDA
Naïve Bayes (gaussian, discrete, kernel density estimation)
Decision Trees
Neural Networks (squared error and softmax)
k-nearest neighbors
SVMs (dot product, gaussian, and polynomial kernels)
HMMs/CRFs/averaged perceptron

The Statistical Problem: Overfitting
Goal: choose h to optimize test set performance
Triple tradeoff: sample size, test set accuracy, complexity
  – For fixed sample size, there is an accuracy/complexity tradeoff
Measures of complexity:
  – |H|, VC dimension, log P(h), ||w||, number of nodes in tree
Bias/Variance analysis
  – Bias: systematic error in h
  – Variance: high disagreement between different h’s
  – test error = Bias² + variance + noise (square loss, log loss)
  – test error = Bias + unbiased-variance − biased-variance (0/1 loss)
Most accurate hypothesis on training data is not usually most accurate on test data
Most accurate hypothesis on test data may be deliberately wrong (i.e., biased)

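For the square-loss case, the decomposition on this slide can be worked out in two lines. The sketch below assumes the standard regression setup, which the slide does not spell out: y = f(x) + ε with zero-mean noise of variance σ² independent of the training sample S, and h̄(x) denoting the average prediction over training samples. The symbols f, ε, σ, S, and h̄ are introduced here for illustration.

```latex
% Assumptions (not stated on the slide): y = f(x) + \varepsilon,
% E[\varepsilon] = 0, Var[\varepsilon] = \sigma^2, \varepsilon independent of S,
% and \bar h(x) = E_S[h_S(x)].
\begin{align*}
\mathbb{E}_{S,\varepsilon}\big[(y - h_S(x))^2\big]
  &= \mathbb{E}_{S,\varepsilon}\big[\big(\varepsilon + (f(x) - \bar h(x))
     + (\bar h(x) - h_S(x))\big)^2\big] \\
  &= \underbrace{(f(x) - \bar h(x))^2}_{\text{Bias}^2}
   + \underbrace{\mathbb{E}_S\big[(h_S(x) - \bar h(x))^2\big]}_{\text{variance}}
   + \underbrace{\sigma^2}_{\text{noise}}
\end{align*}
```

The cross terms vanish because ε has zero mean and is independent of S, and because h_S(x) − h̄(x) has zero mean over S by the definition of h̄.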
Controlling Overfitting
Penalty Methods
  – Pessimistic pruning of decision trees
  – Weight decay
  – Weight elimination
  – Maximum Margin
Holdout Methods
  – Early stopping for neural networks
  – Reduce-error pruning
Combined Methods (use CV to set penalty level)
  – Cost-complexity pruning
  – CV to choose pruning confidence, weight decay level, SVM parameters C and σ
Ensemble Methods
  – Bagging
  – Boosting

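The combined approach, using cross-validation to set a penalty level, can be sketched with a one-dimensional ridge (weight decay) regression. Everything here is an illustrative assumption: the model has a single weight and no intercept, the folds are interleaved rather than shuffled, and the names `ridge_fit_1d`, `cv_error`, and `choose_penalty` are hypothetical.

```python
def ridge_fit_1d(xs, ys, lam):
    # Closed-form minimizer of sum (y - w*x)^2 + lam * w^2 for a single weight w.
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def cv_error(xs, ys, lam, k=5):
    """Mean squared error of the penalized fit, estimated by k-fold CV."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        # Interleaved folds: every k-th example is held out.
        test_idx = set(range(fold, n, k))
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        w = ridge_fit_1d(train_x, train_y, lam)
        total += sum((ys[i] - w * xs[i]) ** 2 for i in test_idx)
    return total / n


def choose_penalty(xs, ys, candidates, k=5):
    # Pick the penalty level with the lowest cross-validated error.
    return min(candidates, key=lambda lam: cv_error(xs, ys, lam, k))


# Noise-free toy data from y = 2x, so no penalty should be preferred.
xs = [float(i) for i in range(10)]
ys = [2.0 * x for x in xs]
best_lam = choose_penalty(xs, ys, [0.0, 1.0, 10.0])
```

On noisy data with few examples, a nonzero penalty would typically win instead; the same selection loop applies to pruning confidence or the SVM parameters C and σ by swapping the model.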
Off-The-Shelf Criteria

Criterion            LMS   Logistic  LDA   Trees  Nets  NNbr  SVM   NB    Boosted Trees
Mixed data           no    no        no    yes    no    no    no    yes   yes
Missing values       no    no        yes   yes    no    some  no    yes   yes
Outliers             no    yes       no    yes    yes   yes   yes   disc  yes
Monotone transforms  no    no        no    yes    some  no    no    disc  yes
Scalability          yes   yes       yes   yes    yes   no    no    yes   yes
Irrelevant inputs    no    no        no    some   no    no    some  some  yes
Linear combinations  yes   yes       yes   no     yes   some  yes   yes   some
Interpretable        yes   yes       yes   yes    no    no    some  yes   no
Accurate             yes   yes       yes   no     yes   no    yes   yes   yes

What We’ve Skipped
Unsupervised Learning
  – Given examples X_i
  – Find: P(X)
  – Clustering
  – Dimensionality Reduction

What We Skipped (2)
Reinforcement Learning: Agent interacting with an environment
  – At each time step t
      Agent perceives current state s of environment
      Agent chooses action to perform according to a policy: a = π(s)
      Action is executed, environment moves to new state s’ and returns reward r
  – Goal: Find π to maximize long term sum of rewards

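The interaction loop on this slide can be sketched in a few lines. This is only an illustration of the agent/environment protocol, not a learning algorithm: the chain environment, its reward of 1 for being in the last state, and the names `run_episode`, `chain_step`, `always_right`, and `always_left` are all hypothetical.

```python
def run_episode(policy, step, s0, horizon):
    """Run the perceive/act/reward loop and return the sum of rewards."""
    s, total = s0, 0.0
    for _ in range(horizon):
        a = policy(s)        # agent chooses an action: a = pi(s)
        s, r = step(s, a)    # environment moves to s' and returns reward r
        total += r
    return total


def chain_step(s, a):
    # Toy environment: states 0..4 on a line, actions -1 (left) or +1 (right).
    s_next = min(max(s + a, 0), 4)
    reward = 1.0 if s_next == 4 else 0.0
    return s_next, reward


def always_right(s):
    return +1


def always_left(s):
    return -1


good = run_episode(always_right, chain_step, s0=0, horizon=6)
bad = run_episode(always_left, chain_step, s0=0, horizon=6)
```

A reinforcement learning method would search over policies to make this return as large as possible; here the two fixed policies only show how the return depends on π.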
What We Skipped (3): Semi-Supervised Learning
Learning from a mixture of supervised and unsupervised data
In many applications, unlabeled data is very cheap
  – BodyMedia
  – Task Tracer
  – Natural Language Processing
  – Computer Vision
How can we use this?

Research Frontier
More complex data objects
  – sequences, images, networks, relational databases
More complex runtime tasks
  – planning, scheduling, diagnosis, configuration
Learning in changing environments
Learning online
Combining supervised and unsupervised learning
Multi-agent reinforcement learning
Cost-sensitive learning; imbalanced classes
Learning with prior knowledge