Machine learning with imbalanced data


What you'll learn

► Random under- and over-sampling.

► Cleaning under-sampling methods.

► Create synthetic data using SMOTE.

► Cost-sensitive learning.

► Ensemble methods for imbalanced data.

► Performance evaluation metrics for imbalanced data.

► Apply the methods with Python open-source libraries.

► More than 5k students enrolled.

► More than 450 student reviews.

► Average course rating: 4.7 out of 5.

What you'll get

11+ hours of video lectures.

Presentations, quizzes and assignments.

Jupyter notebooks with code.

Instructor support through Q&A.

Access on PC and mobile.

Lifetime access to content.

30-day money-back guarantee

So you can buy with confidence.

Course description

Welcome to Machine Learning with Imbalanced Datasets. In this course, you will learn multiple techniques to improve the performance of machine learning models trained with imbalanced datasets.

What are imbalanced datasets?

Imbalanced datasets arise in classification problems where one of the target classes is heavily under-represented. When this happens, we talk about a class imbalance. The class with a small number of samples is called the minority class, and the class or classes with plenty of data are called the majority class or classes.

Imbalanced datasets are a common occurrence in data science. Examples of imbalanced datasets are those used for fraud detection or medical diagnosis.

Why is class imbalance a problem?

Most machine learning algorithms assume balanced class distributions. Thus, training classifiers on imbalanced data will naturally bias the model towards the majority class.

In addition, because the number of samples for the minority class is small, rules to accurately predict these classes are hard to find. Thus, observations belonging to the minority class most often end up being misclassified by the classification models.

Fortunately, there are various ways in which we can improve the performance of classifiers trained on data with imbalanced classes, including resampling, cost-sensitive learning, and ensemble methods.
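To make the problem concrete, here is a minimal sketch with made-up labels (990 majority and 10 minority observations are an assumption for illustration). A model that always predicts the majority class looks excellent on accuracy while missing every minority observation:

```python
# Hypothetical toy labels: 990 majority (0) and 10 minority (1) observations.
y_true = [0] * 990 + [1] * 10

# A "classifier" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

# Plain accuracy: fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: fraction of true minority samples that were found.
minority_total = sum(t == 1 for t in y_true)
minority_found = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = minority_found / minority_total

print(f"accuracy: {accuracy:.2f}")
print(f"minority recall: {minority_recall:.2f}")
```

This is why accuracy alone can be misleading on imbalanced data, and why the course starts with evaluation metrics.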

What will you learn in this online course?

In this course, you will learn multiple methods to improve the performance of machine learning models trained on imbalanced data and decrease the misclassification of the minority class or classes.

The course is divided into the following sections:

  • Evaluation metrics
  • Resampling methods
  • Cost-sensitive learning
  • Ensemble algorithms

Evaluation metrics

You will learn suitable metrics to assess classification models trained on imbalanced datasets. You will learn about the ROC curve and the ROC-AUC. You will create a confusion matrix, find true positives, true negatives, false positives, and false negatives, and then use them to calculate other metrics like precision, recall, and the F1-score. You will also learn about performance metrics designed specifically for imbalanced classification, like the balanced accuracy, among others.
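As an illustration of how these metrics relate to each other, the following sketch computes them by hand on a small made-up binary example. The labels, scores, 0.5 threshold, and helper names are all assumptions for illustration; in practice, Scikit-learn's `sklearn.metrics` module offers ready-made functions such as `confusion_matrix`, `precision_score`, `recall_score`, `f1_score`, and `roc_auc_score`.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Derive precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def roc_auc(y_true, scores, positive=1):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win. This pairwise form equals the area under
    the ROC curve, but it is quadratic in the number of samples.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and model scores (7 negatives, 3 positives).
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.45, 0.8, 0.9]
y_pred = [int(s >= 0.5) for s in scores]  # assumed decision threshold of 0.5

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
auc = roc_auc(y_true, scores)

print(f"tp={tp} fp={fp} fn={fn} tn={tn}")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} auc={auc:.3f}")
```

Note that precision, recall, and the F1-score depend on the chosen threshold, whereas the ROC-AUC is computed from the scores directly.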

Some of these metrics are geared toward binary classification problems. Other metrics can handle multi-class targets out-of-the-box. You will learn when you can use each metric and why in your classification tasks.
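As one example of a metric that extends naturally to multi-class targets, balanced accuracy can be computed as the average of the per-class recall. The sketch below uses made-up three-class labels and a hypothetical helper; Scikit-learn exposes the same idea as `sklearn.metrics.balanced_accuracy_score`.

```python
from collections import Counter


def balanced_accuracy(y_true, y_pred):
    """Average of per-class recall; every class counts equally."""
    support = Counter(y_true)  # number of true samples per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per class
    recalls = [hits[c] / n for c, n in support.items()]
    return sum(recalls) / len(recalls)


# Hypothetical 3-class problem: "a" dominates, "b" and "c" are rare.
y_true = ["a"] * 8 + ["b"] + ["c"]
y_pred = ["a"] * 8 + ["a"] + ["c"]  # the single "b" sample is misclassified

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_accuracy = balanced_accuracy(y_true, y_pred)

print(f"accuracy: {plain_accuracy:.3f}")
print(f"balanced accuracy: {bal_accuracy:.3f}")
```

Because each class contributes equally, missing the only "b" sample lowers the balanced accuracy far more than it lowers the plain accuracy.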

Resampling techniques

Next, you will learn about resampling methods, including under-sampling and over-sampling.

Among the under-sampling methods, you will learn random under-sampling and cleaning methods based on k-nearest neighbors, like Tomek links and NearMiss.
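Random under-sampling is the simplest of these methods. A minimal sketch, assuming a toy dataset and a hypothetical helper name, could look as follows. It keeps every minority observation and randomly discards majority observations until the classes match. It does not implement Tomek links or NearMiss, which select samples based on neighbor distances instead of at random.

```python
import random
from collections import Counter


def random_under_sample(X, y, seed=0):
    """Randomly drop samples so every class has as many as the smallest one."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)

    n_min = min(len(ix) for ix in by_class.values())
    keep = []
    for ix in by_class.values():
        keep.extend(rng.sample(ix, n_min))  # sampling without replacement
    keep.sort()  # preserve the original row order
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical data: 15 majority rows (0) and 5 minority rows (1).
X = [[i] for i in range(20)]
y = [0] * 15 + [1] * 5

X_res, y_res = random_under_sample(X, y)
print(Counter(y), "->", Counter(y_res))
```

The imbalanced-learn library provides this behavior through `RandomUnderSampler` in `imblearn.under_sampling`.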

Among the over-sampling techniques, you will learn random over-sampling and methods that create new data points, like the synthetic minority over-sampling technique (SMOTE) and its variations. SMOTE creates synthetic data, that is, new data, and therefore avoids the mere duplication of samples introduced by random over-sampling.
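The core idea behind SMOTE can be sketched in a few lines: pick a minority observation, pick one of its nearest minority neighbors, and create a new point somewhere on the line segment between them. The code below is a simplified illustration with made-up points and a hypothetical function name, not the implementation found in imbalanced-learn (whose `SMOTE` class lives in `imblearn.over_sampling`).

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbors of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical minority-class observations with two features.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.8)]

new_points = smote_like(minority, n_new=4)
for point in new_points:
    print(tuple(round(v, 3) for v in point))
```

Unlike random over-sampling, every generated point is new, yet it stays within the region already occupied by the minority class.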

Resampling methods are usually classified as data preprocessing methods because they change the distribution of the training dataset. In particular, the aim of resampling techniques is to create balanced datasets with a similar distribution across the different classes.

You will learn how to correctly set up the resampling strategy, modifying the training dataset and leaving a test set untouched with the original class distribution, to correctly perform the model validation in a similar setting to how it will be used in the real world.
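The order of operations matters here: split first, then resample only the training portion. The following sketch illustrates that workflow with made-up labels and hypothetical helper functions, using random over-sampling as the resampling step.

```python
import random
from collections import Counter


def stratified_split(y, test_fraction=0.25, seed=0):
    """Split indices into train/test, preserving the class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)

    train_idx, test_idx = [], []
    for ix in by_class.values():
        rng.shuffle(ix)
        n_test = max(1, round(len(ix) * test_fraction))
        test_idx.extend(ix[:n_test])
        train_idx.extend(ix[n_test:])
    return train_idx, test_idx


def random_over_sample(idx, y, seed=0):
    """Duplicate indices of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for i in idx:
        by_class.setdefault(y[i], []).append(i)

    n_max = max(len(ix) for ix in by_class.values())
    out = []
    for ix in by_class.values():
        out.extend(ix)
        out.extend(rng.choices(ix, k=n_max - len(ix)))  # sampling with replacement
    return out


# Hypothetical labels: 92 majority (0) and 8 minority (1) observations.
y = [0] * 92 + [1] * 8

train_idx, test_idx = stratified_split(y)
train_balanced = random_over_sample(train_idx, y)  # the test set is never resampled

print("test: ", Counter(y[i] for i in test_idx))
print("train:", Counter(y[i] for i in train_balanced))
```

The test set keeps the original class distribution, so the evaluation reflects the conditions the model will face in production, and no duplicated training observation can leak into it.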

Cost-sensitive learning

Next, you will learn how to introduce class weights to perform cost-sensitive learning. Cost-sensitive learning uses the original dataset to train the models, without changing the class distribution. It aims to compensate for the misclassification of the minority class by penalizing more heavily the mistakes the classifier makes on these observations.
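One common way to choose class weights is to make them inversely proportional to the class frequencies. The sketch below uses the formula `n_samples / (n_classes * class_count)`, which is the heuristic Scikit-learn applies when an estimator is given `class_weight="balanced"`. The labels, predictions, and helper names are assumptions for illustration.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


def weighted_error(y_true, y_pred, weights):
    """Total misclassification cost; each error is weighted by its true class."""
    return sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)


# Hypothetical labels: 90 majority (0) and 10 minority (1) observations.
y = [0] * 90 + [1] * 10
weights = balanced_class_weights(y)
print(weights)

# Two hypothetical models that each make exactly one mistake.
missed_minority = [0] * 90 + [1] * 9 + [0]  # one minority sample predicted as 0
false_alarm = [1] + [0] * 89 + [1] * 10     # one majority sample predicted as 1

print("cost of missing a minority sample:", weighted_error(y, missed_minority, weights))
print("cost of a false alarm:", round(weighted_error(y, false_alarm, weights), 3))
```

With these weights, a mistake on a minority observation costs nine times as much as a mistake on a majority observation, which mirrors the 90:10 class ratio.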

Ensemble methods

Finally, you will learn to apply bagging and boosting algorithms specifically designed to handle imbalanced data.

By the end of the course, you will be able to decide which technique suits your dataset, apply the different methods, and compare the performance gains they deliver across multiple datasets.

Working with imbalanced data in Python

Throughout the tutorials, we will use Python as the main language. We will implement the resampling methods with the open-source library imbalanced-learn (imblearn) and the cost-sensitive techniques with Scikit-learn (sklearn).

Who is this course for?

If you are working with imbalanced datasets right now and want to boost the performance of your classifiers, or you simply want to learn more about how to handle imbalanced data, this course will show you how.

Course prerequisites

To get the most out of this course, you need to have basic knowledge of machine learning and familiarity with the most common predictive models, like linear and logistic regression, decision trees, and random forests. You also need to be familiar with the Python open-source libraries pandas, NumPy, and Scikit-learn.

To wrap up

This comprehensive machine learning course includes over 50 lectures spanning more than 10 hours of video. All topics include hands-on Python code examples that you can use for reference and practice, and re-use in your own projects.

Soledad Galli, PhD


Sole is a lead data scientist, instructor and developer of open-source software. She created and maintains Feature-engine, a Python library for feature engineering that allows us to impute data, encode categorical variables, and transform, create and select features. Sole is also the author of the book "Python Feature Engineering Cookbook", published by Packt.

Course Curriculum

  • Machine Learning with Imbalanced Data: Overview
  • Evaluation Metrics
  • Over and Undersampling
  • Ensemble Methods
  • Cost Sensitive Learning
  • Probability Calibration
  • Putting it all together
  • Next steps

Frequently Asked Questions

When does the course begin and end?

You can start taking the course from the moment you enroll. The course is self-paced, so you can watch the tutorials and apply what you learn whenever you find it most convenient.

For how long can I access the course?

The course has lifetime access. Once you enroll, you will have unlimited access to it for as long as you like.

What if I don't like the course?

There is a 30-day money-back guarantee. If you don't find the course useful, contact us within the first 30 days of purchase and you will get a full refund.