Pattern Recognition
Bertrand Thirion and John Ashburner

Outline: Introduction (Definitions; Classification and Regression; Curse of Dimensionality), Generalization, Overview of the main methods, Resources.
Supervised learning: the data come with additional attributes that we want to predict ⇒ classification and regression.
Unsupervised learning: no target values. Discover groups of similar examples within the data (clustering), determine the distribution of data within the input space (density estimation), or project the data down to two or three dimensions for visualization.
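These two settings can be sketched in a few lines of numpy (the toy data, the nearest-centroid rule, and the 2-means loop are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two Gaussian blobs in 2-D: class 0 around (0,0), class 1 around (4,4).
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Supervised: use the labels to fit a nearest-centroid classifier.
centroids = np.array([X[y == k].mean(axis=0) for k in (0, 1)])
def predict(x_new):
    return int(np.argmin(np.linalg.norm(centroids - x_new, axis=1)))

# Unsupervised: ignore the labels and run a few iterations of 2-means.
means = X[[0, -1]].copy()
for _ in range(10):
    assign = np.argmin(((X[:, None, :] - means[None]) ** 2).sum(-1), axis=1)
    means = np.array([X[assign == k].mean(axis=0) for k in (0, 1)])
```

With labels, the per-class centroids are given directly; without them, 2-means has to discover a similar grouping on its own.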
We have a training dataset of n observations, each consisting of an input xi and a target yi. Each input xi consists of a vector of p features:
D = {(xi, yi) | i = 1, …, n}
The aim is to predict the target for a new input x∗.
Targets (y) are categorical labels. Train with D and use the result to make a best guess of the label for a new input x∗.
[Figure: classification example showing two classes of points in a two-dimensional feature space.]
Targets (y) are categorical labels. Train with D and compute P(y∗ = k|x∗, D).
[Figure: probabilistic classification example; shading indicates P(y∗ = k|x∗, D) across a two-dimensional feature space.]
Targets (y) are continuous real variables. Train with D and compute p(y∗|x∗, D).
[Figure: regression example with continuous target values plotted over a two-dimensional feature space.]
Multi-class classification: when there are more than two possible categories.
Ordinal regression: for classification when there is a natural ordering among the categories.
Chu, Wei, and Zoubin Ghahramani. "Gaussian processes for ordinal regression." Journal of Machine Learning Research 6 (2005): 1019-1041.
Multi-task learning: when there are multiple targets to predict, which may be related. Etc.
Multinomial logistic regression: theoretically optimal, but an expensive optimization.
One-versus-all classification [SVMs]: among several hyperplanes, choose the one with maximal margin ⇒ recommended.
One-versus-one classification: vote across each pair of classes. Expensive, and not optimal.
[Figure: decision boundaries in a two-dimensional feature space, showing sharp corners.]
The separations are not nice and smooth: there are lots of sharp corners. This may be improved with K-nearest neighbours.
[Figure: volume of a hyper-sphere of radius 1/2 versus number of dimensions (circle area = πr²; sphere volume = (4/3)πr³). The enclosed volume falls rapidly towards zero as the dimensionality grows.]
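The shrinking volume in the figure follows from the closed-form volume of a p-dimensional ball, V = π^(p/2) r^p / Γ(p/2 + 1), which generalizes the circle-area and sphere-volume formulas above; a quick check:

```python
from math import pi, gamma

def ball_volume(p, r=0.5):
    """Volume of a p-dimensional ball of radius r."""
    return pi ** (p / 2) * r ** p / gamma(p / 2 + 1)

# Fraction of the unit cube occupied by the inscribed ball (r = 1/2):
# it collapses towards zero as the number of dimensions grows.
fractions = {p: ball_volume(p) for p in (2, 3, 10, 20)}
```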
Generalization: Assessing Generalizability; Accuracy Measures
“Everything should be kept as simple as possible, but no simpler.”
— Einstein (allegedly)
Complex models (with many estimated parameters) usually explain training data better than simpler models. However, simpler models often generalise better to new data than more complex models. We need to find the model with the optimal bias/variance tradeoff.
Real Bayesians don't cross-validate (except when they need to).
P(M|D) = p(D|M) P(M) / p(D)
The Bayes factor allows the plausibility of two models (M1 and M2) to be compared:
K = p(D|M1) / p(D|M2) = ∫ p(D|θ1, M1) p(θ1|M1) dθ1 / ∫ p(D|θ2, M2) p(θ2|M2) dθ2
This is usually too costly in practice, so approximations are used.
Some approximations/alternatives to the Bayesian approach:
Laplace approximations: find the MAP/ML solution and use a Gaussian approximation to the parameter uncertainty.
Minimum Message Length (MML): an information theoretic approach.
Minimum Description Length (MDL): an information theoretic approach based on how well the model compresses the data.
Akaike Information Criterion (AIC): −2 log p(D|θ) + 2k, where k is the number of estimated parameters.
Bayesian Information Criterion (BIC): −2 log p(D|θ) + k log q, where q is the number of observations.
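As a small illustration of the AIC/BIC formulas (the toy Gaussian models and data below are made up for this sketch), compare a model with a fixed mean (k = 1 free parameter) against one that also fits the mean (k = 2):

```python
import math, random

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(200)]
q = len(data)  # number of observations

def gauss_loglik(xs, mu, var):
    return sum(-0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
               for x in xs)

# Model 1: mu fixed at 0, fit the variance (k = 1).
# Model 2: fit both mu and the variance (k = 2).
mu_hat = sum(data) / q
var0 = sum(x * x for x in data) / q
var1 = sum((x - mu_hat) ** 2 for x in data) / q

ll1 = gauss_loglik(data, 0.0, var0)
ll2 = gauss_loglik(data, mu_hat, var1)

aic1, aic2 = -2 * ll1 + 2 * 1, -2 * ll2 + 2 * 2
bic1, bic2 = -2 * ll1 + 1 * math.log(q), -2 * ll2 + 2 * math.log(q)
```

The richer model always achieves a log-likelihood at least as high; the 2k and k log q terms penalize it for the extra parameter.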
An inner cross-validation loop is used to set hyper-parameters, with an outer loop used to evaluate the model's performance.
Safe, but costly. Supported by some libraries (e.g. scikit-learn).
Some estimators have a path of models, allowing faster evaluation (e.g. LASSO).
Randomized techniques also exist, and are sometimes more efficient.
Caveat: the inner cross-validation loop (for parameter selection) must be kept distinct from the outer cross-validation loop (for performance evaluation).
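A pure-numpy sketch of the nested scheme (ridge regression and the λ grid are illustrative stand-ins; libraries such as scikit-learn provide ready-made tools for this):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))
w_true = np.array([1.0, -2.0, 0.0, 0.5, 0.0])
y = X @ w_true + 0.1 * rng.normal(size=60)

def ridge_fit(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

lambdas = [0.01, 0.1, 1.0, 10.0]
outer_folds = np.array_split(np.arange(60), 3)
outer_scores = []
for test_idx in outer_folds:
    train_idx = np.setdiff1d(np.arange(60), test_idx)
    Xtr, ytr = X[train_idx], y[train_idx]
    # Inner loop: choose lambda by 2-fold CV on the training part only.
    inner = np.array_split(np.arange(len(train_idx)), 2)
    def inner_score(lam):
        s = 0.0
        for val in inner:
            fit = np.setdiff1d(np.arange(len(train_idx)), val)
            w = ridge_fit(Xtr[fit], ytr[fit], lam)
            s += mse(Xtr[val], ytr[val], w)
        return s
    best_lam = min(lambdas, key=inner_score)
    # Outer loop: test error of the refit model with the chosen lambda.
    w = ridge_fit(Xtr, ytr, best_lam)
    outer_scores.append(mse(X[test_idx], y[test_idx], w))
```

The outer test folds never influence the choice of λ, which is the point of the caveat above.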
Root-mean-square error: for point predictions.
Correlation coefficient: for point predictions.
Log predictive probability: for probabilistic predictions.
Expected loss/risk: for point predictions used in decision making.
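A minimal sketch of the first three measures (the numbers are made up for illustration):

```python
import math

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
n = len(y_true)

# Root-mean-square error for point predictions.
rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

# Pearson correlation coefficient between targets and predictions.
mt = sum(y_true) / n
mp = sum(y_pred) / n
cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
corr = cov / math.sqrt(sum((t - mt) ** 2 for t in y_true)
                       * sum((p - mp) ** 2 for p in y_pred))

# Mean log predictive probability for probabilistic binary predictions.
labels = [1, 0, 1]
probs = [0.9, 0.2, 0.7]          # predicted P(y = 1)
logp = sum(math.log2(p if t == 1 else 1 - p)
           for t, p in zip(labels, probs)) / len(labels)
```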
Wikipedia contributors, "Sensitivity and specificity," Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/index.php?title=Sensitivity_and_specificity&oldid=655245669 (accessed April 9, 2015).
The Receiver operating characteristic (ROC) curve is a plot of true-positive rate (sensitivity) versus false-positive rate (1-specificity) over the full range of possible thresholds. The area under the curve (AUC) is the integral under the ROC curve.
[Figure: ROC curve (AUC = 0.9769), sensitivity versus 1 − specificity.]
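The threshold sweep and trapezoidal AUC can be sketched directly (labels and scores below are made up for illustration):

```python
# Labels and classifier scores (higher score = more likely positive).
labels = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]

pos = sum(labels)
neg = len(labels) - pos

# Sweep the threshold over every score: each threshold gives one ROC point.
points = []
for thr in sorted(set(scores), reverse=True):
    tp = sum(1 for l, s in zip(labels, scores) if l == 1 and s >= thr)
    fp = sum(1 for l, s in zip(labels, scores) if l == 0 and s >= thr)
    points.append((fp / neg, tp / pos))      # (1 - specificity, sensitivity)
points = [(0.0, 0.0)] + points

# AUC by the trapezoidal rule along the swept curve.
auc = sum((x1 - x0) * (y0 + y1) / 2
          for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

For these scores the AUC is 15/16 = 0.9375, matching the rank-based reading of AUC as the fraction of positive/negative pairs ranked correctly.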
Some data are more easily classified than others. Probabilistic classifiers provide a level of confidence for each prediction: p(y∗|x∗, y, X, θ). The quality of predictions can be assessed using the test log predictive probability:
(1/m) Σ_{i=1..m} log2 p(y∗i = ti | x∗i, y, X, θ)
After subtracting a baseline measure, this shows the average number of bits of information that each prediction conveys about the true label.
Rasmussen & Williams. "Gaussian Processes for Machine Learning", MIT Press (2006). http://www.gaussianprocess.org/gpml/
Overview of the main methods: Simple Generative Models (Naive Bayes, Linear Discriminant Analysis); Simple Discriminative Models (Gaussian Processes, Support Vector Machines); Model Averaging
Only one rule: No tool wins in all situations.
P(y=k|x) = P(y=k) p(x|y=k) / p(x)
[Figure: ground-truth class labels in a two-dimensional feature space.]
P(y=k|x) = P(y=k) p(x|y=k) / p(x)
Assumes a shared covariance: p(x|y=k) = N(x|μk, Σ)
[Figure panels: joint densities p(x,y=0) = p(x|y=0)p(y=0) and p(x,y=1) = p(x|y=1)p(y=1), the evidence p(x) = p(x,y=0) + p(x,y=1), and the posterior p(y=0|x) = p(x,y=0)/p(x), over a two-dimensional feature space.]
Model has 2p + p(p+1)/2 parameters to estimate (two means and a single symmetric covariance). Number of observations is pn (size of inputs).
P(y=k|x) = P(y=k) p(x|y=k) / p(x)
Assumes different covariances: p(x|y=k) = N(x|μk, Σk)
[Figure panels: joint densities p(x,y=0) and p(x,y=1), the evidence p(x), and the posterior p(y=0|x), over a two-dimensional feature space, now with class-specific covariances.]
Model has 2p + p(p+1) parameters to estimate (two means and two symmetric covariances). Number of observations is pn.
P(y=k|x) = P(y=k) p(x|y=k) / p(x)
Assumes that the features are independent: p(x|y=k) = ∏_{i=1..p} p(xi|y=k)
[Figure panels: joint densities p(x,y=0) and p(x,y=1), the evidence p(x), and the posterior p(y=0|x), over a two-dimensional feature space, under the independence assumption.]
Model has a variable number of parameters to estimate, but the above example has 3p. Number of observations is pn.
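The naive-Bayes factorization above can be sketched directly with per-class, per-feature Gaussians (toy data and parameter choices are illustrative, not the slides' example):

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy 2-class data with p = 2 features per class.
X0 = rng.normal([0, 0], [1.0, 0.5], size=(100, 2))
X1 = rng.normal([3, 2], [1.0, 0.5], size=(100, 2))

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Per-class, per-feature means and variances (here each class has its own
# variances; sharing them across classes gives the 3p-style count).
stats = [(Xk.mean(0), Xk.var(0)) for Xk in (X0, X1)]
prior = [0.5, 0.5]

def posterior(x):
    # p(x, y=k) = P(y=k) * prod_i p(x_i | y=k), then normalize by p(x).
    joint = np.array([prior[k] * np.prod(gauss(np.asarray(x), *stats[k]))
                      for k in (0, 1)])
    return joint / joint.sum()

p_at_class1_centre = posterior([3.0, 2.0])
```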
A simple way to do regression is: f(x∗) = wᵀx∗
Assuming Gaussian noise on y, the ML estimate of w is given by: ŵ = (XᵀX)⁻¹Xᵀy
where X = [x1 x2 … xn]ᵀ and y = [y1 y2 … yn]ᵀ.
Model has p parameters to estimate. Number of observations is n (number of targets). Usually needs dimensionality reduction, with (e.g.) SVD.
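A minimal numpy sketch of the ML estimate (the toy data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 3
X = rng.normal(size=(n, p))                  # rows of X are the inputs x_i
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.05 * rng.normal(size=n)   # Gaussian noise on the targets

# ML estimate w_hat = (X^T X)^{-1} X^T y; solve the normal equations
# rather than forming the inverse explicitly.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

def f(x_star):
    return float(w_hat @ x_star)
```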
We may have prior knowledge about various distributions:
p(y∗|x∗, w) = N(wᵀx∗, σ²)
p(w) = N(0, Σ0)
Therefore, p(w|y, X) = N(σ⁻²B⁻¹Xᵀy, B⁻¹), where B = σ⁻²XᵀX + Σ0⁻¹
The maximum a posteriori (MAP) estimate of w is: ŵ = σ⁻²B⁻¹Xᵀy
We may have prior knowledge about various distributions:
p(y∗|x∗, w) = N(wᵀx∗, σ²)
p(w) = N(0, Σ0)
Therefore, p(w|y, X) = N(σ⁻²B⁻¹Xᵀy, B⁻¹), where B = σ⁻²XᵀX + Σ0⁻¹
Predictions are made by integrating out the uncertainty of the weights, rather than estimating them:
p(y∗|x∗, y, X) = ∫ p(y∗|x∗, w) p(w|y, X) dw = N(σ⁻²x∗ᵀB⁻¹Xᵀy, x∗ᵀB⁻¹x∗ + σ²)
Estimated parameters may be σ2, and parameters encoding Σ0.
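The posterior and the weight-integrated prediction can be sketched as follows (toy data; the noise variance is treated as known for simplicity):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 40, 2
X = rng.normal(size=(n, p))
w_true = np.array([1.0, -1.0])
sigma2 = 0.1 ** 2                         # noise variance, assumed known
y = X @ w_true + np.sqrt(sigma2) * rng.normal(size=n)

Sigma0 = np.eye(p)                        # prior covariance of the weights
B = X.T @ X / sigma2 + np.linalg.inv(Sigma0)
B_inv = np.linalg.inv(B)
w_map = B_inv @ X.T @ y / sigma2          # posterior mean = MAP estimate

def predict(x_star):
    """Predictive mean and variance with the weights integrated out."""
    mean = float(x_star @ w_map)
    var = float(x_star @ B_inv @ x_star + sigma2)
    return mean, var

m, v = predict(np.array([1.0, 1.0]))
```

The predictive variance always exceeds the noise floor σ², and grows for test inputs poorly covered by the training data.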
B⁻¹ = (σ⁻²XᵀX + Σ0⁻¹)⁻¹ (invert a p × p matrix)
    = Σ0 − Σ0Xᵀ(Iσ² + XΣ0Xᵀ)⁻¹XΣ0 (invert an n × n matrix)
Wikipedia contributors, "Woodbury matrix identity," Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/index.php?title=Woodbury_matrix_identity&oldid=638370219 (accessed April 1, 2015). (A + UCV)⁻¹ = A⁻¹ − A⁻¹U(C⁻¹ + VA⁻¹U)⁻¹VA⁻¹.
The predicted distribution is:
p(y∗|x∗, y, X) = N(kᵀC⁻¹y, c − kᵀC⁻¹k)
where:
C = XΣ0Xᵀ + Iσ²
k = XΣ0x∗
c = x∗ᵀΣ0x∗ + σ²
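A numpy sketch of these predictive equations, with a squared-exponential kernel playing the role of XΣ0Xᵀ (an illustrative choice; toy 1-D data):

```python
import numpy as np

rng = np.random.default_rng(5)
# 1-D toy training data from a smooth function, with a little noise.
X = np.linspace(-3, 3, 20).reshape(-1, 1)
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=20)
sigma2 = 0.05 ** 2

def kern(A, B, ell=1.0):
    """Squared-exponential kernel standing in for X Sigma0 X^T."""
    d2 = (A[:, None, 0] - B[None, :, 0]) ** 2
    return np.exp(-0.5 * d2 / ell ** 2)

C = kern(X, X) + sigma2 * np.eye(len(X))      # C = K + I sigma^2
alpha = np.linalg.solve(C, y)                 # C^{-1} y

def predict(x_star):
    """Predictive mean k^T C^{-1} y and variance c - k^T C^{-1} k."""
    k = kern(X, np.array([[x_star]]))[:, 0]
    c = 1.0 + sigma2                          # k(x*, x*) + sigma^2
    return float(k @ alpha), float(c - k @ np.linalg.solve(C, k))

m_in, v_in = predict(0.0)     # inside the range of the data
m_far, v_far = predict(10.0)  # far from the data: variance grows
```

Far from the training inputs the mean reverts to the prior and the variance approaches c, which is the behaviour that makes the predictions usefully probabilistic.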
Sometimes, we want alternatives to C = XΣ0Xᵀ + Iσ². Nonlinearity is achieved by replacing the matrix K = XΣ0Xᵀ with some function of the data that gives a positive definite matrix encoding similarities, e.g.:
k(xi, xj) = θ1 + θ2 xiᵀxj + θ3 exp(−‖xi − xj‖² / (2θ4²))
Non-linear methods are useful in low-dimensional settings to adapt the shape of decision boundaries. For large-p, small-n problems, nonlinear methods do not seem to help much. Nonlinearity also reduces interpretability.
Regression: continuous targets, y ∈ R. Usually assume a Gaussian distribution: p(y|x, w) = N(f(x, w), σ²), where σ² is a variance.
Binary classification: categorical targets, y ∈ {0, 1}. Usually assume a binomial distribution: p(y|x, w) = σ(f(x, w))^y (1 − σ(f(x, w)))^(1−y), where σ is a squashing function.
For binary classification: p(y∗ = 1|x∗, w) = σ(f (x∗, w)) where σ is some squashing function, eg: Logistic sigmoid function (inverse of Logit). Normal CDF (inverse of Probit).
[Figure: the logistic sigmoid and the inverse probit (normal CDF) squashing functions σ(f∗).]
Integrating over the uncertainty allows probabilistic predictions further from the training data. This is not usually done for methods such as the relevance vector machine (RVM).
Rasmussen, Carl Edward, and Joaquin Quinonero-Candela. “Healing the relevance vector machine through augmentation.” In Proceedings of the 22nd international conference on Machine learning, pp. 689-696. ACM, 2005.
[Figure: decision boundaries and hyperplane uncertainty for simple logistic regression versus Bayesian logistic regression in a two-dimensional feature space.]
Making probabilistic predictions involves:
1. Computing the distribution of a latent variable corresponding to the test data (cf. regression):
p(f∗|x∗, y, X) = ∫ p(f∗|x∗, f) p(f|y, X) df
2. Using this distribution to give a probabilistic prediction:
P(y∗ = 1|x∗, y, X) = ∫ σ(f∗) p(f∗|x∗, y, X) df∗
Unfortunately, these integrals are analytically intractable, so approximations are needed.
Approximate methods for probabilistic classification include:
The Laplace Approximation (LA): fastest, but less accurate.
Expectation Propagation (EP): more accurate than the Laplace approximation, but slightly slower.
MCMC methods: the "gold standard", but very slow because many random samples must be drawn.
Nickisch, Hannes, and Carl Edward Rasmussen. "Approximations for Binary Gaussian Process Classification." Journal of Machine Learning Research 9 (2008): 2035-2078.
t = σ(f (x∗)) where σ is some squashing function, eg: Logistic function (inverse of Logit). Normal CDF (inverse of Probit). Hinge loss (support vector machines)
In practice, the hinge and logistic losses yield a convex estimation problem and are preferred:
min_w Σ_{i=1..n} L(yi, xi, w) + λR(w)  (M-estimators framework)
L is the loss function (hinge, logistic, quadratic, ...)
R is the regularizer (typically a norm on w)
λ > 0 balances the two terms L and R
Convex → unique minimizer (SVMs, ℓ2-logistic, ℓ1-logistic).
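The M-estimator template can be sketched for the ℓ2-regularized logistic loss (toy data; plain gradient descent is an illustrative solver, not the slides' method):

```python
import numpy as np

rng = np.random.default_rng(6)
# Toy, nearly separable two-class data.
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# M-estimator: min_w (1/n) sum_i L_logistic(y_i, x_i, w) + (lam/2)||w||^2.
# The objective is convex, so gradient descent reaches the unique minimizer.
lam = 0.1
w = np.zeros(2)
for _ in range(500):
    grad = X.T @ (sigmoid(X @ w) - y) / len(y) + lam * w
    w -= 0.1 * grad

accuracy = float(np.mean((sigmoid(X @ w) > 0.5) == (y == 1)))
```

Swapping the logistic loss for the hinge loss (and the solver accordingly) gives an SVM-style estimator within the same template.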
SVMs are reasonably fast, accurate, and easy to tune (C = 10³ is a good default, with no dramatic failure). Multi-class: one-versus-one or one-versus-all.
Combining predictions from weak learners:
Bootstrap aggregating (bagging): train several weak classifiers, with different models or randomly drawn subsets of the data, then average their predictions with equal weight.
Boosting: a family of approaches in which models are weighted according to their accuracy. AdaBoost is popular, but has problems with target noise.
Bayesian model averaging: really a model selection method; relatively ineffective for combining models.
Bayesian model combination: shows promise.
Monteith, et al. "Turning Bayesian model averaging into Bayesian model combination." Neural Networks (IJCNN), The 2011 International Joint Conference on. IEEE, 2011.
Boosting: sequentially reduce the bias of the combined estimator. Examples: AdaBoost, Gradient Tree Boosting, ...
Averaging: build several estimators independently and average their predictions. Examples: bagging methods, forests of randomized trees, ...
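A bagging sketch with a deliberately weak base learner (nearest-centroid on bootstrap resamples; the data and learner are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

def fit_centroid(Xb, yb):
    """Weak learner: nearest-centroid classifier on one bootstrap sample."""
    c = np.array([Xb[yb == k].mean(axis=0) for k in (0, 1)])
    return lambda x: np.argmin(np.linalg.norm(c - x, axis=1))

# Bagging: train each learner on a random bootstrap resample of the data,
# then average their predictions with equal weight (a majority vote).
learners = []
for _ in range(25):
    idx = rng.integers(0, len(y), len(y))
    learners.append(fit_centroid(X[idx], y[idx]))

def predict(x):
    votes = [clf(x) for clf in learners]
    return int(round(sum(votes) / len(votes)))
```

Averaging over resamples reduces the variance of the weak learner without changing its bias, which is why it pairs well with high-variance base models such as deep decision trees.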
Resources
The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Trevor Hastie, Robert Tibshirani, Jerome Friedman (2009). http://statweb.stanford.edu/~tibs/ElemStatLearn/
An Introduction to Statistical Learning with Applications in R. Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani (2013). http://www-bcf.usc.edu/%7Egareth/ISL/
Introduction to Machine Learning. Amnon Shashua (2008). http://arxiv.org/pdf/0904.3664.pdf
Bayesian Reasoning and Machine Learning. David Barber (2014). http://www.cs.ucl.ac.uk/staff/d.barber/brml/
Gaussian Processes for Machine Learning. Carl Edward Rasmussen and Christopher K. I. Williams (2006). http://www.gaussianprocess.org/gpml/chapters/
Information Theory, Inference, and Learning Algorithms. David J.C. MacKay (2003). http://www.inference.phy.cam.ac.uk/itila/book.html
Kernel Machines. http://www.kernel-machines.org/
The Gaussian Processes Web Site includes links to software.
SVM - Support Vector Machines includes links to software. http://www.support-vector-machines.org/
Pascal Video Lectures. http://videolectures.net/pascal
Spider: object-oriented environment for machine learning in MATLAB.
GPML: Gaussian processes for supervised learning.
Pronto: MATLAB machine learning toolbox for neuroimaging, with a GUI. Implements many ML concepts; continuity with SPM.
Scikit-learn: generic ML in Python. Complete, high-quality, well-documented reference implementations.
Nilearn: Python interface to Scikit-learn for neuroimaging. Easy to use and install; good visualization.
PyMVPA: Python tool for ML, with advanced features (pipelines, hyperalignment).