Imbalanced Data

Myths, Mistakes and Modern Solutions

eBook

Class imbalance isn’t a problem—poor methodology is.

SMOTE, once a go-to solution, is frequently misapplied, introducing bias rather than solving it.

This book challenges outdated practices and provides rigorous, data-driven alternatives. We focus on selecting the right tools—threshold tuning, real costs (not class frequencies), and strategic evaluation metrics—to build models that work.

Ready to learn from mistakes, move beyond myths and master modern solutions? Let’s begin.

Book Description

Class imbalance isn't a problem.

Contrary to popular belief, imbalanced data does not inherently harm model performance—poor methodology does. SMOTE, once a go-to solution, is frequently misapplied, introducing bias rather than solving it.

Similarly, the default 0.5 probability threshold persists despite its misalignment with real-world costs and decision requirements.
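To make the threshold point concrete, here is a minimal sketch, not taken from the book. It assumes calibrated probabilities, zero cost for correct decisions, and hypothetical costs of 1 for a false positive and 10 for a false negative. Under those assumptions, the cost-minimising threshold is C_FP / (C_FP + C_FN), which is about 0.09 rather than 0.5. The data is synthetic and the function name `average_cost` is illustrative only.

```python
import numpy as np


def average_cost(y_true, y_prob, threshold, cost_fp, cost_fn):
    """Average misclassification cost per observation at a given threshold."""
    y_pred = y_prob >= threshold
    false_pos = np.sum(y_pred & (y_true == 0))
    false_neg = np.sum(~y_pred & (y_true == 1))
    return (false_pos * cost_fp + false_neg * cost_fn) / len(y_true)


# Synthetic, imbalanced data (roughly 10% positives). Labels are drawn from
# the probabilities themselves, so the scores are calibrated by construction.
rng = np.random.default_rng(0)
y_prob = rng.beta(1, 9, size=20_000)
y_true = (rng.uniform(size=y_prob.size) < y_prob).astype(int)

# Hypothetical costs: a missed positive is 10 times worse than a false alarm.
COST_FP, COST_FN = 1.0, 10.0

# Scan candidate thresholds and keep the one with the lowest average cost.
thresholds = np.linspace(0.01, 0.99, 99)
costs = np.array(
    [average_cost(y_true, y_prob, t, COST_FP, COST_FN) for t in thresholds]
)
best_threshold = thresholds[np.argmin(costs)]
best_cost = costs.min()

# Reference points: the default 0.5 cut-off and the theoretical optimum.
default_cost = average_cost(y_true, y_prob, 0.5, COST_FP, COST_FN)
theoretical_threshold = COST_FP / (COST_FP + COST_FN)

print(f"average cost at 0.5:        {default_cost:.3f}")
print(f"average cost at best cut:   {best_cost:.3f}")
print(f"empirical best threshold:   {best_threshold:.2f}")
print(f"theoretical best threshold: {theoretical_threshold:.2f}")
```

With real models the probabilities are rarely perfectly calibrated, so the empirical scan on held-out data is usually the safer guide than the closed-form value.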

This book isn’t about "fixing imbalance". It’s about choosing the right strategy—or none at all—based on data, domain knowledge, and suitable performance metrics.

We focus not on balancing datasets for the sake of it, but on selecting the right tools—threshold tuning, real costs, and strategic evaluation metrics—to build models that work in practice, not just in theory.

If you’re done with oversimplified rules and ready to engineer solutions that actually work, let’s begin.

Table of Contents

1 Introduction to Imbalanced Data
1.1 Imbalanced Datasets: What Are They?
When Should We Consider a Dataset Imbalanced?
Why Are Imbalanced Datasets Different?
1.2 What Factors Influence the Classification of Imbalanced Datasets?
Problem 1: Using Misleading Evaluation Metrics
Problem 2: Not Enough Minority-Class Examples
Problem 3: Poor Class Separability
Problem 4: Choosing the Wrong Model
1.3 The Downside of Resampling
1.4 Prediction is Not Classification
1.5 How to Approach Imbalanced Learning
1.6 Myths, Mistakes and Modern Solutions
1.7 References
2 Metrics that Matter (and Pitfalls to Avoid)
2.1 Understanding the Output of Machine Learning Models
2.2 Understanding What Metrics Measure
2.3 Classification is Not Prediction (Again)
2.4 The Damage of Using Classification Metrics
2.5 Choosing the Right Metric
2.6 Classification Metrics
Confusion Matrix
Accuracy and Balanced Accuracy
Precision, Recall and F1-score
Matthews Correlation Coefficient (MCC)
False Positive and False Negative Rates
Choosing the Right Threshold
Model and Threshold Optimisation in Python
2.7 Threshold Independent Metrics for Ranking
ROC and ROC-AUC
The ROC Curve Myth for Imbalanced Datasets
Precision-Recall Curve and PR-AUC
The PR Curve Myth for Imbalanced Datasets
Model Selection With Ranking Metrics in Python
2.8 Myths, Mistakes and Modern Solutions
2.9 References
3 Probability Calibration: When 70% Means 70%
3.1 Calibrated Probabilities: What Are They?
3.2 Assessing Probability Calibration: Reliability Diagrams
3.3 What Makes Calibration Assessment Hard
Data Size
Class Separability
Class Imbalance
3.4 What Breaks Probability Calibration
Model Choice Affects Calibration
Cost-Sensitive Learning Has a Cost on Calibration
Resampling Breaks Calibration
3.5 Scoring Functions: Training Models to Be Calibrated
The Brier Score
Negative Log-Likelihood (Log Loss)
Brier Score or Log Loss?
3.6 Calibration: Correcting Biased Probabilities
Platt Scaling: Sigmoid-Based Recalibration
Isotonic Regression: Flexible Monotonic Recalibration
Platt Scaling vs Isotonic Regression
3.7 Recalibrating Models in Python
3-way Data Split
Recalibration with Cross-validation
3.8 Myths, Mistakes and Modern Solutions
3.9 References
4 Oversampling and SMOTE: A False Promise
Coming soon
5 Undersampling and Cleaning Methods: The Illusion of Better Models
Coming soon
6 Cost Sensitive Learning: Costs Are Business, Not Statistics
Coming soon

👉 epub and pdf copies

👉 Paperback with our partners

👉 200 pages

👉 English

Author

Soledad Galli, PhD

Sole is a lead data scientist, instructor, and developer of open source software. She created and maintains the Python library Feature-engine, which supports imputing data, encoding categorical variables, and transforming, creating, and selecting features. Sole is also the author of the "Python Feature Engineering Cookbook," published by Packt.

What our readers say

eBook Pricing

$27.99

PRE-ORDER BOOK

Pre-order today to lock in your copy and get full access the moment it's released. We're targeting the end of July for the launch.

Can't afford it? Get in touch.

Paperback

Get a Paperback copy with our Partner Lulu Press.
