Imbalanced Data
Myths, Mistakes and Modern Solutions
eBook
Class imbalance isn’t a problem—poor methodology is.
SMOTE, once a go-to solution, is frequently misapplied, introducing bias rather than solving it.
This book challenges outdated practices and provides rigorous, data-driven alternatives. We focus on selecting the right tools—threshold tuning, real costs (not class frequencies), and strategic evaluation metrics—to build models that work.
Ready to learn from mistakes, move beyond myths and master modern solutions? Let’s begin.
Book Description
Class imbalance isn't a problem.
Contrary to popular belief, imbalanced data does not inherently harm model performance—poor methodology does. SMOTE, once a go-to solution, is frequently misapplied, introducing bias rather than solving it.
Similarly, the default 0.5 probability threshold persists despite its misalignment with real-world costs and decision requirements.
This book isn’t about "fixing imbalance". It’s about choosing the right strategy—or none at all—based on data, domain knowledge, and suitable performance metrics.
We focus not on balancing datasets for the sake of it, but on selecting the right tools—threshold tuning, real costs, and strategic evaluation metrics—to build models that work in practice, not just in theory.
If you’re done with oversimplified rules and ready to engineer solutions that actually work, let’s begin.
Table of Contents
- 1 Introduction to Imbalanced Data
- 1.1 Imbalanced Datasets: What Are They?
- When Should We Consider a Dataset Imbalanced?
- Why Are Imbalanced Datasets Different?
- 1.2 What Factors Influence the Classification of Imbalanced Datasets?
- Problem 1: Using Misleading Evaluation Metrics
- Problem 2: Not Enough Minority-Class Examples
- Problem 3: Poor Class Separability
- Problem 4: Choosing the Wrong Model
- 1.3 The Downside of Resampling
- 1.4 Prediction is Not Classification
- 1.5 How to Approach Imbalanced Learning
- 1.6 Myths, Mistakes and Modern Solutions
- 1.7 References
- 2 Metrics that Matter (and Pitfalls to Avoid)
- 2.1 Understanding the Output of Machine Learning Models
- 2.2 Understanding What Metrics Measure
- 2.3 Classification is Not Prediction (Again)
- 2.4 The Damage of Using Classification Metrics
- 2.5 Choosing the Right Metric
- 2.6 Classification Metrics
- Confusion Matrix
- Accuracy and Balanced Accuracy
- Precision, Recall and F1-score
- Matthews Correlation Coefficient (MCC)
- False Positive and False Negative Rates
- Choosing the Right Threshold
- Model and Threshold Optimisation in Python
- 2.7 Threshold-Independent Metrics for Ranking
- ROC and ROC-AUC
- The ROC Curve Myth for Imbalanced Datasets
- Precision-Recall Curve and PR-AUC
- The PR Curve Myth for Imbalanced Datasets
- Model Selection With Ranking Metrics in Python
- 2.8 Myths, Mistakes and Modern Solutions
- 2.9 References
- 3 Probability Calibration: When 70% Means 70%
- 3.1 Calibrated Probabilities: What are They?
- 3.2 Assessing Probability Calibration: Reliability Diagrams
- 3.3 What Makes Calibration Assessment Hard
- Data Size
- Class Separability
- Class Imbalance
- 3.4 What Breaks Probability Calibration
- Model Choice Affects Calibration
- Cost-Sensitive Learning Has a Cost on Calibration
- Resampling Breaks Calibration
- 3.5 Scoring Functions: Training Models to Be Calibrated
- The Brier Score
- Negative Log-Likelihood (Log Loss)
- Brier Score or Log Loss?
- 3.6 Calibration: Correcting Biased Probabilities
- Platt Scaling: Sigmoid-Based Recalibration
- Isotonic Regression: Flexible Monotonic Recalibration
- Platt Scaling vs Isotonic Regression
- 3.7 Recalibrating Models in Python
- 3-way Data Split
- Recalibration with Cross-validation
- 3.8 Myths, Mistakes and Modern Solutions
- 3.9 References
- 4 Oversampling and SMOTE: A False Promise
- Coming soon
- 5 Undersampling and Cleaning Methods: The Illusion of Better Models
- Coming soon
- 6 Cost-Sensitive Learning: Costs Are Business, Not Statistics
- Coming soon
👉 EPUB and PDF copies
👉 Paperback with our partners
👉 200 pages
👉 English
Author
Soledad Galli, PhD
Sole is a lead data scientist, instructor, and developer of open source software. She created and maintains the Python library Feature-engine, which allows us to impute data, encode categorical variables, and transform, create, and select features. Sole is also the author of the "Python Feature Engineering Cookbook," published by Packt.
More about Sole on LinkedIn.