Data Science in the Wild, Spring 2019
Learning and Imbalanced Data
January 28, 2019 David Rimshnick
What is data imbalance?
Unequal distribution of data towards a certain characteristic
Target variable
Classification: Certain
negative!
instead of learning underlying pattern
until dataset is balanced
until dataset is balanced
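Assuming the fragments above refer to random resampling (duplicating minority-class rows, or dropping majority-class rows, until the classes are the same size), a minimal sketch could look like the following. The function names `random_oversample` and `random_undersample` and the toy data are hypothetical, chosen only for illustration.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until every
    class has as many rows as the largest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        # Sample with replacement: minority rows are repeated.
        extra = rng.choice(idx, size=target - count, replace=True)
        keep.append(np.concatenate([idx, extra]))
    keep = np.concatenate(keep)
    return X[keep], y[keep]

def random_undersample(X, y, seed=0):
    """Keep a random subset of each class so that every class has as many
    rows as the smallest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.min()
    keep = np.concatenate([
        # Sample without replacement: majority rows are discarded.
        rng.choice(np.flatnonzero(y == cls), size=target, replace=False)
        for cls in classes
    ])
    return X[keep], y[keep]

# Toy data (hypothetical): 8 rows of class 0, 2 rows of class 1.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
```

Oversampling keeps every original row but repeats minority rows; undersampling throws majority rows away. Either way, resampling would normally be applied to the training split only.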
dataset is balanced
dataset is balanced
learning scalable pattern
different than the majority class
smooth
artificial or synthetic data
sampling Technique)
the features of the original instance and the nearest neighbor
Chawla, Nitesh V., et al. "SMOTE: synthetic minority over-sampling technique." Journal of artificial intelligence research 16 (2002): 321-357.
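The interpolation idea behind SMOTE (a synthetic point built from the features of an original minority instance and one of its nearest neighbors) can be sketched roughly as below. This is an illustrative simplification, not the reference implementation: the name `smote_sketch`, the brute-force neighbor search, and the use of a single random interpolation factor per synthetic point are all assumptions made for the example.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic minority points by interpolating between a
    minority instance and one of its k nearest minority neighbors.
    Assumes k < len(X_min)."""
    rng = np.random.default_rng(seed)

    # Brute-force pairwise Euclidean distances within the minority class.
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor

    # Indices of the k nearest neighbors of each minority point.
    neighbors = np.argsort(dist, axis=1)[:, :k]

    # Pick a base point and one of its neighbors for each synthetic sample.
    base = rng.integers(0, len(X_min), size=n_new)
    nbr = neighbors[base, rng.integers(0, k, size=n_new)]

    # Random position in [0, 1) along the segment from base to neighbor.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[nbr] - X_min[base])

# Toy minority class (hypothetical): corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(X_min, n_new=5, k=2)
```

Because each synthetic point lies on a segment between two real minority points, the new data fills in the region the minority class already occupies rather than repeating existing rows.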
Image Source: Beckmann, Marcelo, Nelson FF Ebecken, and Beatriz SL Pires de Lima. "A KNN undersampling approach for data balancing." Journal of Intelligent Learning Systems and Applications 7.04 (2015): 104.
$\frac{1}{q_j \cdot D}$, where $q_j$ is the sample proportion of class $j$ and $D$ is the number of classes
are mishandled by model
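As a small worked example of the weight 1 / (q_j * D), the sketch below computes it from a label vector. The function name `inverse_frequency_weights` and the 90/10 toy labels are hypothetical.

```python
import numpy as np

def inverse_frequency_weights(y):
    """Return {class: 1 / (q_j * D)}, where q_j is the sample proportion
    of class j and D is the number of classes."""
    classes, counts = np.unique(y, return_counts=True)
    q = counts / counts.sum()   # sample proportion of each class
    D = len(classes)            # number of classes
    weights = 1.0 / (q * D)
    return dict(zip(classes.tolist(), weights.tolist()))

# Toy labels (hypothetical): 90% class 0, 10% class 1.
y = np.array([0] * 90 + [1] * 10)
weights = inverse_frequency_weights(y)
```

With a 90% / 10% split and D = 2, the weights are 1 / (0.9 * 2) ≈ 0.56 for the majority class and 1 / (0.1 * 2) = 5 for the minority class, so a perfectly balanced dataset would give every class a weight of 1. This is the same quantity scikit-learn uses for `class_weight="balanced"`, which it documents as `n_samples / (n_classes * np.bincount(y))`.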
Garcia, “Learning from Imbalanced Data,” IEEE Trans. Knowledge and Data Engineering, vol. 21, issue 9, pp. 1263-1284, 2009