Soledad Galli

Best Resources to Learn Feature Engineering for Machine Learning

Updated: Feb 15


Data in its raw format is almost never ready to be used in machine learning. But we can transform this data to build features that are suitable for training machine learning models. Data pre-processing and feature engineering are typically the stages where data scientists devote most of their effort in a machine learning project.


As Pedro Domingos said in the article “A few useful things to know about machine learning”: “At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used”.


"At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used."

Feature engineering and data pre-processing are also, for many of us, the most interesting parts of a data science project, where we can combine our creativity and intuition with domain knowledge to create meaningful features.


Some aspects of feature engineering are domain-specific: we need to know a few things about the data and the business area, or the organisation’s purpose, to derive useful features. But a big chunk of feature engineering is quite repetitive and can be automated. Many of the techniques for the more repetitive aspects of feature engineering are used across organisations and in many data science competitions, and they include procedures to handle missing data, to encode categorical variables or to extract features from text, to name a few. More and more, feature engineering practices are being consolidated, and many organisations adopt similar procedures to clean and prepare their data.



Feature engineering is the stage where data scientists devote most of their effort in a machine learning project


Surprisingly, even though feature engineering is a crucial part of any machine learning pipeline, and also the most time-consuming, it is barely covered in the extensive catalogue of online machine learning courses. Only recently have online courses and books been released that cover this topic specifically.



Should I learn feature engineering?


Once you’ve made a start with data science and machine learning courses, you are familiar with off-the-shelf machine learning algorithms like regression, decision trees and random forests, and you are relatively comfortable programming in either Python or R, one of the next logical steps is to gain exposure to feature engineering techniques.


Once you’ve started a few data science projects, either in your organisation or on a data competition website, you will soon realise how much needs to be done before the data can be used to train an algorithm. So sooner or later, you will need to become familiar with various feature engineering techniques.



What exactly do I need to learn?


There are a few fundamentals of feature engineering. The first one is missing data imputation. Libraries like Scikit-learn do not support missing data as inputs, therefore we need to replace missing values with a number. If you are an R user, many R machine learning packages will allow you to pass data with missing values, but certainly not all of them, so learning a few missing data imputation techniques is quite handy.
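
For illustration, here is a minimal sketch of median imputation with Scikit-learn's SimpleImputer; the toy dataframe and column names are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# toy data with missing values (hypothetical example)
df = pd.DataFrame({
    "age": [25, np.nan, 40, 33, np.nan],
    "income": [1500, 2200, np.nan, 1800, 2000],
})

# learn the median of each column and use it to replace the NaNs
imputer = SimpleImputer(strategy="median")
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(df_imputed)
```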

In categorical variables, the values are strings instead of numbers. Some libraries, like Scikit-learn, do not support strings as inputs, so it is also useful to have a few categorical encoding techniques in your tool belt.
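
For example, a categorical column can be one-hot encoded with pandas; the toy column below is hypothetical:

```python
import pandas as pd

# toy categorical variable (hypothetical example)
df = pd.DataFrame({"city": ["London", "Paris", "London", "Berlin"]})

# one-hot encoding: one binary column per category
df_encoded = pd.get_dummies(df, columns=["city"])

print(df_encoded)
```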


Some machine learning models make assumptions about the distribution of the data, for example that the variables are normally distributed. Therefore, we often apply mathematical transformations or discretisation to obtain a Gaussian distribution or a more homogeneous value spread.
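
As an illustration, a hypothetical right-skewed variable can be made more Gaussian-like with Scikit-learn's PowerTransformer (Yeo-Johnson); this is just a sketch on simulated data:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# simulate a right-skewed variable (hypothetical example)
df = pd.DataFrame({"price": np.random.lognormal(mean=3, sigma=1, size=1000)})

# Yeo-Johnson transformation to obtain a more Gaussian-looking distribution
transformer = PowerTransformer(method="yeo-johnson")
df["price_transformed"] = transformer.fit_transform(df[["price"]]).flatten()

# the skewness should be much closer to 0 after the transformation
print(df[["price", "price_transformed"]].skew())
```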


Some machine learning algorithms are sensitive to feature magnitude, for example linear models, support vector machines, neural networks and distance-based algorithms like PCA and k-means clustering. In these cases, we tend to scale the variables as well.
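
For example, a minimal sketch of standardisation with Scikit-learn's StandardScaler, on hypothetical toy variables:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# toy variables on very different scales (hypothetical example)
df = pd.DataFrame({
    "age": [25, 40, 33, 52],
    "income": [15000, 82000, 31000, 47000],
})

# standardisation: subtract the mean and divide by the standard deviation
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print(df_scaled)
```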


Many datasets contain dates as variables. Date or datetime variables are not fed as such into machine learning models; instead, we derive more useful information by extracting new features from them, like, for example, the time elapsed between two dates.
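
A minimal sketch with pandas, assuming two hypothetical date columns:

```python
import pandas as pd

# hypothetical dataframe with two date columns
df = pd.DataFrame({
    "signup_date": pd.to_datetime(["2019-01-10", "2019-03-05", "2019-06-21"]),
    "last_purchase": pd.to_datetime(["2019-02-01", "2019-03-30", "2019-12-24"]),
})

# extract simple datetime features
df["signup_month"] = df["signup_date"].dt.month
df["signup_dayofweek"] = df["signup_date"].dt.dayofweek

# time elapsed between the two dates, in days
df["days_to_purchase"] = (df["last_purchase"] - df["signup_date"]).dt.days

print(df)
```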


In some specific cases, some of the variables are GIS coordinates, which, again, provide more information if we pre-process them than if we use them in their raw format. Finally, our data are often texts or images, and we can also create features from them to use in machine learning.
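
As an illustration of extracting features from text, here is a minimal bag-of-words sketch with Scikit-learn's CountVectorizer; the toy sentences are made up:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# toy text data (hypothetical example)
texts = [
    "the customer was happy with the product",
    "the product arrived late",
    "happy customer, fast delivery",
]

# bag of words: one column per word, counting its occurrences in each text
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(texts)

# in older Scikit-learn versions use get_feature_names() instead
print(pd.DataFrame(bow.toarray(), columns=vectorizer.get_feature_names_out()))
```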


So in a nutshell, feature engineering refers to techniques to perform:

  1. Missing data imputation

  2. Categorical variable encoding

  3. Numerical variable transformation

  4. Discretisation

  5. Engineering of datetime variables

  6. Engineering of coordinates - GIS data

  7. Feature extraction from text

  8. Feature extraction from images

  9. Feature extraction from time series

  10. New feature creation by combining existing variables


That sounds like a lot to take on, so…



Where can I learn about feature engineering?


In this post, I describe the best, and almost the only, available resources to learn feature engineering for machine learning. Some of these resources are very comprehensive, covering almost every widely used technique and therefore providing the student with a wide repertoire of techniques suitable for different scenarios, algorithms and datasets. Other resources focus on the more mainstream techniques, aiming to quickly empower the reader to crack on with data pre-processing. Let’s dive in…



Disclaimer:
Two of the recommendations in this article are our Udemy course "Feature Engineering for Machine Learning" and our Packt book "Python Feature Engineering Cookbook". Although there are no affiliate links for these resources in the article, I do receive royalties for sales.
Opinions in this article are my own and I do not receive financial compensation from any of the links included in this article (except for the resources mentioned in the preceding paragraph). The article does not contain affiliate links.



Contents


Online Courses

  1. Feature Engineering for Machine Learning, Udemy​​

  2. How to Win a Data Science Competition: Learn from Top Kagglers, Coursera

  3. Feature Engineering for Machine Learning in Python, Datacamp

  4. Feature Engineering, Coursera

Articles and other free resources

  1. The 2009 Knowledge Discovery in Data Competition (KDD Cup 2009)

  2. ​Beating Kaggle the Easy Way


Books

  1. Python Feature Engineering Cookbook, Packt

  2. Feature Engineering for Machine Learning Models, O’Reilly​

  3. Feature Engineering Made Easy, Packt

  4. Feature Engineering and Selection: A Practical Approach for Predictive Models


Feature Engineering libraries

  1. Scikit-learn

  2. Category Encoders

  3. Feature-Engine

  4. Featuretools


Comprehensive Blogs about Feature Engineering




Online Courses


Top Recommendation ✔️

1. Feature Engineering for Machine Learning, Udemy​​


Feature Engineering for Machine Learning is the most comprehensive online course on feature engineering, covering almost every feature engineering technique known and widely used today. In the course Feature Engineering for Machine Learning you will learn multiple procedures for:


  1. Missing data imputation: mean, median, mode, arbitrary, end of tail and random sample imputation.

  2. Categorical variable encoding: one-hot, ordinal, mean encoding, weight of evidence

  3. Numerical variable transformation: log, reciprocal, exponential, Box-Cox and Yeo-Johnson

  4. Variable discretisation: equal-width, equal-frequency, discretisation with trees

  5. Outlier handling: removal, capping, Winsorisation

  6. Feature Scaling: standardisation, MinMax scaling, robust scaling, norm scaling and more

  7. Engineering of datetime variables

  8. Engineering of mixed numerical and categorical variables

  9. Real-life examples


Feature Engineering for Machine Learning teaches multiple techniques for each of the topics in the list above, discussing the assumptions made by each technique and its advantages and limitations, and providing full Python code to implement the techniques using open-source libraries like pandas, NumPy and Scikit-learn, as well as the newly released library Feature-Engine, which was created as part of the course.


Feature Engineering for Machine Learning starts by addressing the characteristics of variables and how these may affect the performance of different machine learning algorithms. It then discusses the assumptions made by various machine learning algorithms, and how feature transformations can improve model performance.


Feature Engineering for Machine Learning then moves on to introduce the multiple feature engineering techniques. Finally, the course provides real-life examples with end-to-end feature transformation pipelines.


The lectures in Feature Engineering for Machine Learning include videos in which the instructor discusses the advantages, limitations and implications of each technique, each accompanied by a Jupyter notebook with the code to implement the technique in Python.


Courses on Udemy are not free, but you can get them at a discounted price using frequently released vouchers.



2. How to Win a Data Science Competition: Learn from Top Kagglers, Coursera

How to Win a Data Science Competition: Learn from Top Kagglers is tailored to students seeking to enter and win data science competitions. The authors themselves have won various Kaggle competitions, and in the course they explain several of the techniques they used to engineer and select their variables, and to build and tune their machine learning models.


How to Win a Data Science Competition: Learn from Top Kagglers includes 3 sections on feature engineering. In the first section, they describe basic techniques to impute missing data, transform numerical variables, encode categorical variables and work with dates and coordinates. They also teach how to create bag of words from text and how to create features with Word2vec and CNNs.


In the second section, the authors cover in depth a categorical encoding technique called mean (or target) encoding. In the final section on feature engineering, they describe how to capture feature interactions, and how to create new statistics- and distance-based features.
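
As a simple illustration of the idea behind mean (target) encoding, each category is replaced by the mean of the target within that category; below is a minimal pandas sketch on hypothetical data (in practice the encoding should be learned on the training data only, and regularised, to avoid overfitting):

```python
import pandas as pd

# hypothetical training data with a categorical feature and a binary target
df = pd.DataFrame({
    "city": ["London", "Paris", "London", "Berlin", "Paris", "London"],
    "target": [1, 0, 1, 0, 1, 0],
})

# mean (target) encoding: replace each category by the mean target in that category
encoding = df.groupby("city")["target"].mean()
df["city_encoded"] = df["city"].map(encoding)

print(df)
```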


Courses on Coursera can be audited for free, or you can pay a fee if you want a certificate and access to the full material and exercises.




3. Feature Engineering for Machine Learning in Python, Datacamp


Feature Engineering for Machine Learning in Python is a hands-on course that teaches many aspects of feature engineering for categorical and continuous variables, and for text data. The course discusses some techniques for variable discretisation, missing data imputation and categorical variable encoding. It also discusses procedures for variable transformation, feature scaling and outlier removal.


Feature Engineering for Machine Learning in Python is composed of 4 chapters. The first chapter is available for free, but the remaining 3 require payment of a fee.




4. Feature Engineering, Coursera


The course Feature Engineering on Coursera introduces a few feature engineering techniques, focusing mainly on how to implement these techniques using the Google Cloud Platform, how to select good features, and how to do feature pre-processing at scale. Some students have complained that some of the notebooks do not run as expected; however, the overall rating of the course is very good.


Courses on Coursera can be audited for free, or you can pay a fee if you want a certificate and access to the full material and exercises.




​Articles and other free resources

Excellent Resource ✔️

1. The 2009 Knowledge Discovery in Data Competition (KDD Cup 2009)


The 2009 Knowledge Discovery in Data Competition (KDD Cup 2009) is a series of articles published after the 2009 KDD competition, in which the winners and runners-up described the data pre-processing techniques they used to prepare the data and build their machine learning models.


The competition aimed to predict a highly imbalanced target, and the data contained a multitude of categorical variables, many with high cardinality, as well as features with missing values. Therefore, the different articles describe multiple creative solutions to tackle these data issues.


To impute missing data, the authors used mean and median imputation together with a binary missing indicator. The authors also discuss several ways of coping with the encoding and high cardinality of the categorical variables. A few of the solutions used discretisation, including one that sorted the data into buckets using decision trees, and created new features by combining variables, also with decision trees, to capture feature interactions.
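
For illustration, here is a minimal sketch of this idea with Scikit-learn's SimpleImputer, which can add a binary missing indicator alongside mean imputation; the toy data is hypothetical and the add_indicator parameter is available in recent Scikit-learn versions:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# toy variable with missing values (hypothetical example)
df = pd.DataFrame({"balance": [100.0, np.nan, 250.0, np.nan, 80.0]})

# mean imputation plus a binary flag marking which rows were originally missing
imputer = SimpleImputer(strategy="mean", add_indicator=True)
imputed = imputer.fit_transform(df)

result = pd.DataFrame(imputed, columns=["balance", "balance_was_missing"])
print(result)
```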


Certainly a very interesting series of articles for those dealing with datasets with thousands of variables, with a mix of categorical and numerical features with missing information.


​2. Beating Kaggle the Easy Way


Beating Kaggle the Easy Way is the master's thesis of a student at the Technische Universitaet Darmstadt, in which the student explores multiple feature engineering techniques across various data science competitions available on Kaggle.


The goal of the thesis was to get the best possible results, with minimal effort, across various data competitions, by re-using the feature engineering pipeline built for the first competition. This may sound a bit cheeky; however, in practice we do use the same, or very similar, techniques across projects to make our machine learning models more predictive.


In the thesis, the student describes various data pre-processing and data cleaning techniques, and feature transformations that they used across competitions.


Although reading a thesis may sound a bit daunting, this work is actually quite approachable and easy to follow, and quite enlightening as well if you are just starting out as a data scientist, so I highly recommend you give it a go.



Books


1. Python Feature Engineering Cookbook, Packt


In the book Python Feature Engineering Cookbook, I provide the most extensive battery of feature engineering techniques, focusing on the practical implementation in Python and leveraging the power of pandas, Scikit-learn's newer tools for feature transformation, the open-source package Feature-engine that I created, and other powerful Python packages for feature engineering like Category Encoders and Featuretools. The book is based on the "Feature Engineering for Machine Learning" course we teach on Udemy, and expands the battery of techniques to cover feature creation by combining variables, and feature extraction from time series, transaction data and text. Unlike our Udemy course, which discusses the pros and cons of each technique and the considerations around its use, Python Feature Engineering Cookbook dives straight into the implementation of the techniques, in a manner that is compatible with training models and creating deployment-ready machine learning pipelines. Specifically, the book covers:

  1. Missing data imputation

  2. Categorical variable encoding: the widest battery of encoding techniques, including rare label handling

  3. Numerical variable transformation: discretisation, scaling, log and power transformation techniques

  4. Text: creation of features to capture text complexity, bag-of-words and TF-IDF with or without n-grams, and text cleaning techniques

  5. Time series and transaction data: extracting features that capture signal complexity

  6. Feature creation: creating features with mathematical combinations, PCA and polynomial expansion



2. Feature Engineering for Machine Learning Models, O’Reilly​


In the book Feature Engineering for Machine Learning Models, the authors teach various feature engineering techniques, focusing on the practical application with exercises in Python using pandas, NumPy, Scikit-learn and Matplotlib. Specifically, the book covers:

  1. Numerical variables: discretisation, scaling, log and power transforms

  2. Categorical variables: one hot encoding, feature hashing and bin-counting

  3. Text: bag-of-words, n-grams, and phrase detection

  4. PCA

  5. Creating features with k-means

  6. Extracting features from images



3. Feature Engineering Made Easy, Packt


Feature Engineering Made Easy covers various aspects of feature engineering, including imputation of missing data, categorical encoding, numerical feature transformation, extraction of features from text and images, and feature creation with PCA. Feature Engineering Made Easy includes examples that guide the reader through the implementation of these techniques in Python.


Feature Engineering Made Easy capitalises on understanding the data at hand, so it includes a few chapters on data exploration as well.




4. Feature Engineering and Selection: A Practical Approach for Predictive Models


For R users, the book Feature Engineering and Selection: A Practical Approach for Predictive Models, is a good alternative. The book covers many aspects of feature engineering, including imputing missing data, categorical encoding, and numerical feature transformation.


I personally find the book a bit text-heavy, as the authors try to share their experience of using these methods, at the expense of code examples showing how to implement the techniques.




Feature Engineering libraries


1. Scikit-learn


Scikit-learn, the industry standard Python library for machine learning, has recently released multiple transformers or classes for feature engineering, including transformers for missing data imputation, categorical encoding, discretisation and variable transformation.


With the SimpleImputer class we can perform mean, median, mode and arbitrary imputation, while the IterativeImputer class (still experimental) allows us to do multivariate imputation. Scikit-learn also includes the OneHotEncoder for one-hot encoding and the LabelEncoder to replace categories with integers. With the KBinsDiscretizer we can discretise our variables, and with the PowerTransformer we can apply Yeo-Johnson and Box-Cox transformations.


As with any other Scikit-learn transformer, the feature engineering classes do not allow us to select which variables we want to process with each technique. But the developers have also released the ColumnTransformer class, which can be used to do exactly that.
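
A minimal sketch of how ColumnTransformer can route different columns to different transformers; the column names and data are hypothetical:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# hypothetical dataframe with a numerical and a categorical column
df = pd.DataFrame({
    "age": [25, None, 40, 33],
    "city": ["London", "Paris", "London", "Berlin"],
})

# impute and scale the numerical column, one-hot encode the categorical one
preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

X = preprocessor.fit_transform(df)
print(X)
```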


These Scikit-learn classes were released only recently, so you may not have heard of them yet, but you certainly will in the coming months as the community starts to adopt them.



2. Category Encoders


Category Encoders is the most extensive Python package for categorical variable encoding, including common procedures like one-hot encoding and weight of evidence, as well as more complex ways of encoding variables like BaseN and feature hashing. If you want to learn more about these categorical encoding techniques, there is a good explanation in the blog post Smarter Ways to Encode Categorical Data for Machine Learning.
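
A minimal sketch of the package's interface, assuming it is installed as category_encoders; the toy data is hypothetical:

```python
import pandas as pd
import category_encoders as ce

# hypothetical categorical feature and binary target
X = pd.DataFrame({"city": ["London", "Paris", "London", "Berlin", "Paris"]})
y = pd.Series([1, 0, 1, 0, 1])

# target (mean) encoding: replace each category with a smoothed mean of the target
encoder = ce.TargetEncoder(cols=["city"])
X_encoded = encoder.fit_transform(X, y)

print(X_encoded)
```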



3. Feature-Engine


Feature-Engine is an open-source Python package that was created as part of the Udemy course Feature Engineering for Machine Learning.



Feature Engine - Python Package for Feature Engineering


Feature-Engine contains multiple transformers or classes for missing data imputation, categorical encoding, discretisation and variable transformation. At the moment, Feature-Engine’s battery of transformers is a bit more extensive than the one offered by Scikit-learn, and it has a few nice perks:

  1. It returns a pandas dataframe, so you can easily do a feature transformation and continue with your data exploration.

  2. It is very user-friendly, allowing you to specify within the transformer which variables you want to pre-process.

  3. It can be integrated into the Scikit-learn pipeline with a single line of code.

Note that the package is at a very early stage and has only one maintainer, so bug fixes, adoption and extension of its functionality might be slow.
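
For illustration, a minimal sketch of the Feature-engine interface, assuming a recent version of the package; module paths and parameter names have changed across releases, so the import below may differ in older versions:

```python
import numpy as np
import pandas as pd
from feature_engine.imputation import MeanMedianImputer

# hypothetical dataframe with missing values in selected columns
df = pd.DataFrame({
    "age": [25, np.nan, 40, 33],
    "income": [1500, 2200, np.nan, 1800],
    "city": ["London", "Paris", "London", "Berlin"],
})

# impute only the variables we specify, and get a pandas dataframe back
imputer = MeanMedianImputer(imputation_method="median", variables=["age", "income"])
df_imputed = imputer.fit_transform(df)

print(df_imputed)
```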



4. Featuretools


Featuretools has become the Python library for pre-processing transaction or time series data. With Featuretools, we only need to define a time window, and the package will derive new features by aggregating the time or transaction data, for example computing the maximum, minimum, mean, median and standard deviation, among others.
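
For illustration, a hedged sketch of deep feature synthesis with Featuretools on hypothetical transaction data; it assumes a recent (1.x) release, while older versions used entity_from_dataframe and target_entity instead of the names below:

```python
import pandas as pd
import featuretools as ft

# hypothetical transaction data: several purchases per customer
transactions = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4, 5],
    "customer_id": [101, 101, 102, 102, 101],
    "amount": [20.0, 35.5, 12.0, 80.0, 5.5],
    "timestamp": pd.to_datetime(
        ["2020-01-01", "2020-01-03", "2020-01-02", "2020-01-05", "2020-01-07"]
    ),
})

# build an entity set and derive a customer-level view of the transactions
es = ft.EntitySet(id="retail")
es = es.add_dataframe(dataframe_name="transactions", dataframe=transactions,
                      index="transaction_id", time_index="timestamp")
es = es.normalize_dataframe(base_dataframe_name="transactions",
                            new_dataframe_name="customers", index="customer_id")

# deep feature synthesis: aggregate the transactions of each customer
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_dataframe_name="customers",
                                      agg_primitives=["mean", "max", "min", "std"])
print(feature_matrix.head())
```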


Featuretools works alongside pandas and Scikit-learn, so the library can be easily incorporated in traditional data science workflows.



Comprehensive Blogs about Feature Engineering

