Welcome to Machine Learning with Imbalanced Datasets. In this course, you will learn multiple techniques to improve the performance of machine learning models trained with imbalanced datasets.
What are imbalanced datasets?
Imbalanced datasets are datasets, typically found in classification problems, in which one or more of the target classes is heavily under-represented. When this happens, we talk about a class imbalance. The class with a small number of samples is called the minority class, and the class or classes with plenty of data are called the majority class or classes.
Imbalanced datasets are a common occurrence in data science. Examples of imbalanced datasets are those used for fraud detection or medical diagnosis.
Why is class imbalance a problem?
Most machine learning algorithms assume roughly balanced class distributions. Thus, training classifiers on imbalanced data tends to bias the model towards the majority class.
In addition, because the number of samples for the minority class is small, rules to accurately predict this class are hard to find. Thus, observations belonging to the minority class are the ones most often misclassified by the models.
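To make the problem concrete, here is a minimal sketch with made-up labels (990 majority samples and 10 minority samples, both numbers chosen only for illustration). A "model" that always predicts the majority class reaches a high accuracy while detecting none of the minority observations.

```python
import numpy as np

# Hypothetical target: 990 majority-class samples (0) and 10 minority-class samples (1).
y_true = np.array([0] * 990 + [1] * 10)

# A "model" that ignores the features and always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
minority_total = int((y_true == 1).sum())
minority_found = int(((y_pred == 1) & (y_true == 1)).sum())

print(f"accuracy: {accuracy:.2f}")  # accuracy: 0.99
print(f"minority samples detected: {minority_found} of {minority_total}")  # 0 of 10
```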
Fortunately, there are various ways in which we can improve the performance of classifiers trained on data with imbalanced classes, including resampling, cost-sensitive learning, and ensemble methods.
What will you learn in this online course?
In this course, you will learn multiple methods to improve the performance of machine learning models trained on imbalanced data and decrease the misclassification of the minority class or classes.
The course is divided into the following sections:
- Evaluation metrics
- Resampling methods
- Cost-sensitive learning
- Ensemble algorithms
You will learn suitable metrics to assess classification models trained on imbalanced datasets. You will learn about the ROC curve and the ROC-AUC. You will create a confusion matrix, find true positives, true negatives, false positives, and false negatives, and then use them to calculate other metrics like precision, recall, and the F1-score. You will also learn about performance metrics designed specifically for imbalanced classification, like the balanced accuracy, among others.
Some of these metrics are geared toward binary classification problems. Other metrics can handle multi-class targets out-of-the-box. You will learn when you can use each metric and why in your classification tasks.
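As a small illustration of how these metrics follow from the confusion matrix, the sketch below counts true and false positives and negatives for a binary target and derives precision, recall, the F1-score, and the balanced accuracy (the mean of the per-class recalls). The labels and predictions are made up, and the variable names are only illustrative.

```python
import numpy as np

# Hypothetical labels and predictions; 1 is the minority (positive) class.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

# The four cells of the binary confusion matrix.
tp = int(((y_true == 1) & (y_pred == 1)).sum())
tn = int(((y_true == 0) & (y_pred == 0)).sum())
fp = int(((y_true == 0) & (y_pred == 1)).sum())
fn = int(((y_true == 1) & (y_pred == 0)).sum())

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)            # recall of the positive (minority) class
specificity = tn / (tn + fp)       # recall of the negative (majority) class
f1 = 2 * precision * recall / (precision + recall)
balanced_accuracy = (recall + specificity) / 2

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")                     # TP=1 TN=7 FP=1 FN=1
print(f"accuracy={accuracy:.2f}")                             # accuracy=0.80
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")  # all 0.50
print(f"balanced accuracy={balanced_accuracy:.4f}")           # 0.6875
```

Note how the accuracy (0.80) looks better than the balanced accuracy (0.6875), because it is dominated by the majority class.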
Next, you will learn about resampling methods, including under-sampling and over-sampling.
Among the under-sampling methods, you will learn random under-sampling and methods based on nearest neighbors, like Tomek links and NearMiss.
Among the over-sampling techniques, you will learn random over-sampling and methods that create new data points, like the synthetic minority over-sampling technique (SMOTE) and its variations. SMOTE creates synthetic data, that is, new data, and therefore avoids the mere duplication of samples introduced by random over-sampling.
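The idea behind SMOTE can be sketched in a few lines of NumPy: each synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbors. This is only an illustrative sketch, not the imbalanced-learn implementation; the function name `smote_like_oversample`, its parameters, and the toy data are hypothetical.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples
    and one of their k nearest minority neighbors (illustrative sketch only)."""
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)             # starting samples
    picked = neighbors[base, rng.integers(0, k, size=n_new)]   # one neighbor each
    gap = rng.random((n_new, 1))                               # position on the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Hypothetical minority-class observations with two features.
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.7]])
X_new = smote_like_oversample(X_min, n_new=10)

print(X_new.shape)  # (10, 2)
```

Because every new point is an interpolation of two existing minority samples, the synthetic data stays within the region already covered by the minority class, while random over-sampling would only return exact copies.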
Resampling methods are usually classified as data preprocessing methods because they change the distribution of the training dataset. In particular, the aim of resampling techniques is to create more balanced datasets, with a similar number of observations in each class.
You will learn how to set up the resampling strategy correctly: you modify only the training dataset and leave the test set untouched, with its original class distribution. This way, the model is validated in a setting similar to the one in which it will be used in the real world.
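A minimal sketch of this setup follows, assuming a made-up dataset with a 95/5 class split. The data is first divided with scikit-learn's `train_test_split`, and only the training portion is then resampled. The random over-sampling here is written by hand with NumPy as a stand-in for whichever resampling method you choose.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical dataset: 950 majority samples (0) and 50 minority samples (1).
X = rng.normal(size=(1000, 2))
y = np.array([0] * 950 + [1] * 50)

# 1. Split first, keeping the original class ratio in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 2. Resample the training set only (here: random over-sampling of the minority).
majority_idx = np.flatnonzero(y_train == 0)
minority_idx = np.flatnonzero(y_train == 1)
extra = rng.choice(
    minority_idx, size=len(majority_idx) - len(minority_idx), replace=True
)
keep = np.concatenate([majority_idx, minority_idx, extra])
X_train_res, y_train_res = X_train[keep], y_train[keep]

# 3. The test set is left as it is and is used only for evaluation.
print("train, resampled:", np.bincount(y_train_res))  # equal counts per class
print("test, untouched: ", np.bincount(y_test))       # original imbalance preserved
```

Resampling before the split would leak copies of the same minority observations into both partitions and would evaluate the model on a class distribution it will never see in practice.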
Next, you will learn how to introduce class weights to perform cost-sensitive learning. Cost-sensitive learning uses the original dataset to train the models, without changing the class distribution. It aims to reduce the misclassification of the minority class by penalizing more heavily the mistakes the classifier makes on these observations.
Finally, you will learn specific bagging and boosting algorithms designed to handle imbalanced data.
By the end of the course, you will be able to decide which technique is suitable for your dataset, and to apply the different methods and compare the boost in performance they return on multiple datasets.
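As a hedged example of class weights in Scikit-learn, the sketch below trains two logistic regressions on made-up data: one with default settings and one with `class_weight="balanced"`, which weights each class by `n_samples / (n_classes * class_count)`. The dataset and the helper `minority_recall` are hypothetical, and the recall is measured on the training data purely to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical data: 950 majority samples around 0 and 50 minority samples around 1.5.
X = np.vstack([
    rng.normal(0.0, 1.0, size=(950, 2)),
    rng.normal(1.5, 1.0, size=(50, 2)),
])
y = np.array([0] * 950 + [1] * 50)

# The "balanced" heuristic: rarer classes receive larger weights.
counts = np.bincount(y)
weights = len(y) / (len(counts) * counts)
print(f"class weights: {weights[0]:.3f}, {weights[1]:.1f}")  # class weights: 0.526, 10.0

# Same data and same algorithm; only the misclassification cost differs.
plain = LogisticRegression().fit(X, y)
weighted = LogisticRegression(class_weight="balanced").fit(X, y)

def minority_recall(model):
    """Fraction of minority samples that the model labels as minority."""
    return (model.predict(X[y == 1]) == 1).mean()

print(f"minority recall, no weights:       {minority_recall(plain):.2f}")
print(f"minority recall, balanced weights: {minority_recall(weighted):.2f}")
```

The weighted model typically recovers more minority observations, at the price of more false positives, so the weights should be chosen with the metrics discussed earlier in mind.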
Working with imbalanced data in Python
Throughout the tutorials, we will use Python as the main language. We will implement the resampling methods with the open-source library imbalanced-learn (imblearn) and the cost-sensitive techniques with Scikit-learn (sklearn).
Who is this course for?
If you are working with imbalanced datasets right now and want to boost the performance of your classifiers, or you simply want to learn more about how to handle imbalanced data, this course will show you how.
To get the most out of this course, you need to have basic knowledge of machine learning and familiarity with the most common predictive models, like linear and logistic regression, decision trees, and random forests. You also need to be familiar with the Python open-source libraries pandas, NumPy, and Scikit-learn.
This comprehensive machine learning course includes over 50 lectures spanning more than 10 hours of video, and all topics include hands-on Python code examples that you can use for reference and practice, and re-use in your own projects.