Watch the intro video
Note: if you can't see the video, you might need to allow cookies or disable your ad blocker.
Soledad Galli, PhD
Instructor
Sole is a lead data scientist, instructor, and open-source software developer. She created and maintains Feature-engine, a Python library for feature engineering that allows us to impute data, encode categorical variables, and transform, create, and select features. Sole is also the author of the book "Python Feature Engineering Cookbook", published by Packt.
Course Description
Welcome to Machine Learning with Imbalanced Datasets. In this course, you will learn multiple techniques which you can use with imbalanced datasets to improve the performance of your machine learning models.
If you are working with imbalanced datasets right now and want to improve the performance of your models, or you simply want to learn more about how to tackle data imbalance, this course will show you how.
We'll take you step-by-step through engaging video tutorials and teach you everything you need to know about working with imbalanced datasets. Throughout this comprehensive course, we cover almost every available methodology for working with imbalanced datasets, discussing their logic, their implementation in Python, their advantages and shortcomings, and the considerations to keep in mind when using each technique. Specifically, you will learn:
- Under-sampling methods, both random and those focused on specific sample populations
- Over-sampling methods, both random and those which create new examples based on existing observations
- Ensemble methods that leverage the power of multiple weak learners in conjunction with sampling techniques to boost model performance
- Cost-sensitive methods, which penalize wrong decisions more severely for the minority classes
- The appropriate metrics to evaluate model performance on imbalanced datasets
By the end of the course, you will be able to decide which technique is suitable for your dataset, apply the different methods to multiple datasets, and compare the performance improvements they deliver.
This comprehensive machine learning course includes over 50 lectures spanning more than 10 hours of video, and ALL topics include hands-on Python code examples which you can use for reference and for practice, and re-use in your own projects.
In addition, the code is updated regularly to keep up with new trends and new Python library releases.
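The cost-sensitive approach mentioned above can be previewed with scikit-learn alone. This is a hedged sketch, again on a synthetic dataset, using the `class_weight` parameter to penalize minority-class errors in inverse proportion to class frequency:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic binary dataset with roughly a 9:1 class imbalance.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Plain model vs. one where misclassifying the minority class
# costs more, weighted inversely to class frequency.
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(
    X_train, y_train
)

# Recall on the minority (positive) class typically improves
# once its errors are penalized more heavily.
print("plain recall:   ", recall_score(y_test, plain.predict(X_test)))
print("weighted recall:", recall_score(y_test, weighted.predict(X_test)))
```

The course goes well beyond this built-in shortcut, covering how to obtain the costs, MetaCost, and Bayes conditional risk.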
Example Curriculum
- Introduction to Performance Metrics (3:22)
- Accuracy (4:21)
- Accuracy - Demo (5:39)
- Precision, Recall and F-measure (13:32)
- Install Yellowbrick
- Precision, Recall and F-measure - Demo (10:04)
- Confusion tables, FPR and FNR (6:03)
- Confusion tables, FPR and FNR - Demo (7:32)
- Balanced Accuracy (3:49)
- Balanced accuracy - Demo (2:43)
- Geometric Mean, Dominance, Index of Imbalanced Accuracy (4:29)
- Geometric Mean, Dominance, Index of Imbalanced Accuracy - Demo (9:28)
- ROC-AUC (7:26)
- ROC-AUC - Demo (4:46)
- Precision-Recall Curve (7:08)
- Precision-Recall Curve - Demo (2:47)
- Comparison of ROC and PR curves - Optional
- Additional reading resources (Optional)
- Probability (4:32)
- Metrics for Multiclass (11:04)
- Metrics for Multiclass - Demo (8:55)
- PR and ROC Curves for Multiclass (5:16)
- PR Curves in Multiclass - Demo (8:40)
- ROC Curve in Multiclass - Demo (7:13)
- Under-Sampling Methods - Introduction (5:21)
- Random Under-Sampling - Intro (4:23)
- Random Under-Sampling - Demo (10:11)
- Condensed Nearest Neighbours - Intro (8:03)
- Condensed Nearest Neighbours - Demo (7:25)
- Tomek Links - Intro (4:43)
- Tomek Links - Demo (3:05)
- One Sided Selection - Intro (4:38)
- One Sided Selection - Demo (3:00)
- Edited Nearest Neighbours - Intro (5:01)
- Edited Nearest Neighbours - Demo (4:02)
- Repeated Edited Nearest Neighbours - Intro (4:39)
- Repeated Edited Nearest Neighbours - Demo (3:00)
- All KNN - Intro (6:16)
- All KNN - Demo (5:50)
- Neighbourhood Cleaning Rule - Intro (6:14)
- Neighbourhood Cleaning Rule - Demo (1:55)
- NearMiss - Intro (3:47)
- NearMiss - Demo (3:53)
- Instance Hardness Threshold - Intro (9:20)
- Instance Hardness Threshold - Demo (16:21)
- Instance Hardness Threshold - Multiclass Demo (7:44)
- Undersampling Method Comparison (7:44)
- Wrapping up the section (5:18)
- Setting up a classifier with under-sampling and cross-validation (10:54)
- Summary Table
- Over-Sampling Methods - Introduction (3:41)
- Random Over-Sampling (5:00)
- Random Over-Sampling - Demo (4:55)
- ROS with smoothing - Intro (6:39)
- ROS with smoothing - Demo (4:36)
- SMOTE (9:26)
- SMOTE - Demo (2:35)
- SMOTE-NC (9:02)
- SMOTE-NC - Demo (2:56)
- SMOTE-N (19:25)
- SMOTE-N - Demo (7:20)
- ADASYN (7:11)
- ADASYN - Demo (3:17)
- Borderline SMOTE (7:47)
- Borderline SMOTE - Demo (3:13)
- SVM SMOTE (16:40)
- Resources on SVMs
- SVM SMOTE - Demo (4:32)
- K-Means SMOTE (13:01)
- K-Means SMOTE - Demo (3:29)
- Over-Sampling Method Comparison (5:50)
- Wrapping up the section (9:30)
- How to Correctly Set Up a Classifier with Over-sampling (5:24)
- Setting Up a Classifier - Demo (4:13)
- Summary Table
- Cost-sensitive Learning - Intro (7:27)
- Types of Cost (10:55)
- Obtaining the Cost (4:28)
- Cost Sensitive Approaches (1:52)
- Misclassification Cost in Logistic Regression (3:35)
- Misclassification Cost in Decision Trees (4:02)
- Cost Sensitive Learning with Scikit-learn (7:13)
- Find Optimal Cost with hyperparameter tuning (3:33)
- Bayes Conditional Risk (13:44)
- MetaCost (8:03)
- MetaCost - Demo (3:40)
- Optional: MetaCost Base Code (6:39)
- Additional Reading Resources
- Probability Calibration (6:41)
- Probability Calibration Curves (5:56)
- Probability Calibration Curves - Demo (9:37)
- Brier Score (3:06)
- Brier Score - Demo (7:07)
- Under- and Over-sampling and Cost-sensitive learning on Probability Calibration (5:10)
- Calibrating a Classifier (5:25)
- Calibrating a Classifier - Demo (6:20)
- Calibrating a Classifier after SMOTE or Under-sampling (8:05)
- Calibrating a Classifier with Cost-sensitive Learning (3:31)
- Probability: Additional reading resources