Feature Engineering for Machine Learning: A Comprehensive Overview
Updated: Jan 2
Feature engineering is the process of using domain knowledge of the data to transform existing features or to create new variables from existing ones, for use in machine learning.
Data in its raw format is almost never suitable for use to train machine learning algorithms. Instead, data scientists devote a substantial amount of time to pre-process the variables to use them in machine learning.
Why do we need to engineer features?
There are various reasons why we engineer features:
Some machine learning libraries do not support missing values or strings as inputs, for example Scikit-learn.
Some machine learning models make assumptions about the distributions of the variables, for example linear models.
Some machine learning models are sensitive to the magnitude of the features, for example linear models, SVMs, neural networks, and distance- or variance-based algorithms like nearest neighbours and PCA.
Some algorithms are sensitive to outliers, for example linear models and AdaBoost.
Some variables provide almost no information in their raw format, for example dates.
Often, variable pre-processing allows us to capture more information, which can boost algorithm performance, for example target mean encoding of categorical variables.
Frequently variable combinations are more predictive than variables in isolation, for example the sum or the mean of a group of variables.
Some variables contain information about transactions, providing time-stamped data, and we may want to aggregate them into a static view.
As you can see, feature engineering is an umbrella term that includes multiple techniques to perform everything from filling missing values, to encoding categorical variables, to variable transformation, to creating new variables from existing ones.
In this post, I highlight the main feature engineering techniques to process the data and leave it ready to use for machine learning. I describe what each technique entails, and say a few words about when we should use each technique.
For code, step-by-step tutorials, additional information and real-life examples of feature engineering, you might be interested in the online course “Feature Engineering for Machine Learning”.
Table of Contents
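Missing Data Imputation
Categorical Encoding
Variable Transformation
Discretisation
Outlier Handling
Feature Scaling
Date and Time Engineering
Feature Creation
Aggregating Transaction Data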
1. Missing Data Imputation
Imputation is the act of replacing missing data with statistical estimates of the missing values. The goal of any imputation technique is to produce a complete dataset that can be used to train machine learning models.
There are multiple techniques for missing data imputation:
Complete Case Analysis
Mean / Median / Mode Imputation
Random Sample Imputation
Replacement by Arbitrary Value
End of Distribution Imputation
Missing Value Indicator
1.1 Complete case analysis
Complete case analysis implies analysing only those observations in the dataset that contain values in all the variables. In other words, in complete case analysis we remove all observations with missing values. This procedure is suitable when there are few observations with missing data in the dataset. However, if the dataset contains missing data across multiple variables, or some variables contain a high proportion of missing observations, we can easily remove a big chunk of the dataset, which is undesirable.
1.2 Mean / Median / Mode Imputation
We can replace missing values with the mean, the median or the mode of the variable. Mean / median / mode imputation is widely adopted in organisations and data competitions. Although in practice this technique is used in almost every situation, the procedure is suitable if data is missing at random and in small proportions. If there are a lot of missing observations, however, we will distort the distribution of the variable, as well as its relationship with other variables in the dataset. Distortion in the variable distribution may affect the performance of linear models.
For categorical variables, replacement by the mode is also known as replacement by the most frequent category.
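As a minimal sketch of median and mode imputation with pandas: the toy data and the column names `age` and `colour` are made up for illustration. The replacement values are learned from the training set only and then reused on the test set, which is a common precaution rather than something this post prescribes.

```python
import numpy as np
import pandas as pd

# Toy data; column names are hypothetical.
train = pd.DataFrame({
    "age": [22.0, 35.0, np.nan, 58.0, 41.0],
    "colour": ["blue", "red", "blue", None, "blue"],
})
test = pd.DataFrame({"age": [np.nan, 30.0], "colour": [None, "red"]})

# Learn the replacement values from the training set only.
age_median = train["age"].median()
colour_mode = train["colour"].mode()[0]  # most frequent category

# Apply the same learned values to both sets.
for df in (train, test):
    df["age"] = df["age"].fillna(age_median)
    df["colour"] = df["colour"].fillna(colour_mode)
```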
1.3 Random Sample imputation
Random sample imputation refers to randomly selecting values from the variable to replace the missing data. This technique preserves the variable distribution, and is well suited for data missing at random. However, we need to account for randomness by setting a seed. Otherwise, the same missing observation could be replaced by different values in different code runs, and therefore lead to different model predictions. This is not desired when using our models within an organisation.
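One way to sketch random sample imputation is to draw from the observed values with pandas' `sample`. The seed value and the toy series are arbitrary choices; fixing `random_state` is what makes the replacement reproducible between runs.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0, np.nan, 6.0], name="x")

missing = s.isna()
observed = s.dropna()

# Draw one observed value per missing entry, with a fixed seed.
drawn = observed.sample(n=int(missing.sum()), replace=True, random_state=0)

imputed = s.copy()
imputed[missing] = drawn.to_numpy()
```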
1.4 Replacement by Arbitrary Value
Replacement by an arbitrary value, as its name indicates, refers to replacing missing data with an arbitrarily chosen value, using the same value for all missing data. Replacement by an arbitrary value is suitable if data is not missing at random, or if there is a huge proportion of missing values. If all values are positive, a typical replacement is -1. Alternatively, replacing by 999 or -999 is common practice. We need to make sure that the arbitrary value does not already occur commonly in the variable. Replacement by arbitrary values, however, may not be suited to linear models, as it will most likely distort the distribution of the variables, and therefore model assumptions may not be met.
For categorical variables, this is the equivalent of replacing missing observations with the label “Missing” which is a widely adopted procedure.
1.5 End of Distribution Imputation
End of distribution imputation involves replacing missing values with a value at the far end of the tail of the variable distribution. This technique is similar in essence to imputing by an arbitrary value. However, because the value is derived from the distribution itself, we need not look at each variable distribution individually; the procedure determines it automatically. This imputation technique tends to work well with tree-based algorithms, but it may affect the performance of linear models, as it distorts the variable distribution.
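A minimal sketch of the idea, assuming the "far end" is taken as the mean plus 3 standard deviations. That rule is one common choice and an assumption here; for skewed variables a rule based on the inter-quartile range may be preferable.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, 12.0, np.nan, 11.0, 13.0, np.nan, 9.0])

# Value at the right tail of the distribution (assumed rule: mean + 3 * std).
tail_value = s.mean() + 3 * s.std()

imputed = s.fillna(tail_value)
```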
1.6 Missing indicator
The missing indicator technique involves adding a binary variable that flags whether the value is missing for a given observation. This variable takes the value 1 if the value is missing, and 0 otherwise. Note that we still need to replace the missing values in the original variable, which we tend to do with mean or median imputation. When we use these 2 techniques together, any predictive power in the missingness is captured by the missing indicator, and if there is none, the missing values are masked by the mean / median imputation. These 2 techniques in combination tend to work well with linear models. However, adding a missing indicator expands the feature space and, as multiple variables tend to have missing values for the same observations, many of the newly created binary variables could be identical or highly correlated.
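The combination of indicator plus median imputation could be sketched as follows. The column names `income` and `income_na` are hypothetical. The indicator has to be created before the original variable is imputed, otherwise the information about missingness is lost.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [1000.0, np.nan, 3000.0, np.nan, 2000.0]})

# 1) Flag the missing values first.
df["income_na"] = df["income"].isna().astype(int)

# 2) Then impute the original variable with its median.
df["income"] = df["income"].fillna(df["income"].median())
```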
There are, in addition, multivariate techniques for missing data imputation, like MICE (Multivariate Imputation with Chained Equations) and hot deck imputation, that I will not cover in the post, but will be covered in future releases of the course “Feature Engineering for Machine Learning”.
2. Categorical Encoding
Categorical variable encoding is an umbrella term for techniques used to transform the strings or labels of categorical variables into numbers. There are multiple techniques available to us:
One hot encoding
Count and Frequency encoding
Target encoding / Mean encoding
Ordinal encoding
Weight of Evidence
Rare label encoding
2.1 One hot encoding
One hot encoding (OHE) creates a binary variable for each one of the different categories present in a variable. These binary variables take 1 if the observation shows a certain category or 0 otherwise. OHE is suitable for linear models. However, OHE expands the feature space quite dramatically if the categorical variables have high cardinality, or if there are many categorical variables. In addition, many of the derived dummy variables could be highly correlated.
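With pandas, one hot encoding can be sketched with `get_dummies`. The `colour` variable is made up. The second call drops one category (k-1 dummies), which is one common way to reduce the redundancy among the dummy variables; whether that is appropriate depends on the model.

```python
import pandas as pd

df = pd.DataFrame({"colour": ["blue", "red", "green", "blue"]})

# One binary column per category.
dummies = pd.get_dummies(df["colour"], prefix="colour", dtype=int)

# Variant with k-1 columns: the first category becomes the implicit baseline.
dummies_k1 = pd.get_dummies(df["colour"], prefix="colour", drop_first=True, dtype=int)
```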
2.2 Count and Frequency Encoding
In count encoding we replace each category with the number of observations that show that category in the dataset. Similarly, we can replace the category with the frequency, or percentage, of observations in the dataset. That is, if 10 of our 100 observations show the colour blue, we would replace blue by 10 if doing count encoding, or by 0.1 if replacing by the frequency. These techniques capture the representation of each label in a dataset, but the encoding may not necessarily be predictive of the outcome. These are, however, very popular encoding methods in Kaggle competitions.
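The blue example above can be reproduced with `value_counts`. The other two colours and their counts are invented to fill the 100 observations.

```python
import pandas as pd

# 10 of 100 observations are blue, as in the example in the text.
colours = pd.Series(["blue"] * 10 + ["red"] * 60 + ["green"] * 30, name="colour")

counts = colours.value_counts()                # observations per category
freqs = colours.value_counts(normalize=True)   # fraction per category

count_encoded = colours.map(counts)
freq_encoded = colours.map(freqs)
```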
2.3 Target / Mean Encoding
In target encoding, also called mean encoding, we replace each category of a variable with the mean value of the target for the observations that show that category. For example, suppose we have the categorical variable “city”, and we want to predict whether a customer will buy a TV if we send them a letter. If 30 percent of the people in the city “London” buy the TV, we would replace London by 0.3.
This technique has 3 advantages:
it does not expand the feature space,
it captures some information regarding the target at the time of encoding the category, and
it creates a monotonic relationship between the variable and the target.
Monotonic relationships between variable and target tend to improve linear model performance.
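The London example can be sketched with a `groupby`. The second city and the column name `bought_tv` are hypothetical. In practice the category means would be learned on the training set only, because they use the target.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["London"] * 10 + ["Leeds"] * 5,
    # 3 of 10 Londoners buy the TV; 4 of 5 people in Leeds do.
    "bought_tv": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0] + [1, 1, 1, 1, 0],
})

# Mean of the target per category.
city_means = df.groupby("city")["bought_tv"].mean()

df["city_encoded"] = df["city"].map(city_means)
```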
2.4 Ordinal encoding
In ordinal encoding we replace the categories by digits, either arbitrarily or in an informed manner. If we encode categories arbitrarily, we assign an integer per category from 1 to n, where n is the number of unique categories. If instead, we assign the integers in an informed manner, we observe the target distribution: we order the categories from 1 to n, assigning 1 to the category for which the observations show the highest mean of target value, and n to the category with the lowest target mean value.
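A sketch of the informed variant, where integers follow the target mean: 1 goes to the category with the highest mean and n to the one with the lowest, as described above. The data and column names are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Leeds"] * 5 + ["London"] * 10 + ["York"] * 4,
    "target": [1, 1, 1, 1, 0] + [1, 1, 1] + [0] * 7 + [1, 1, 0, 0],
})

# Order the categories by target mean, highest first.
target_means = df.groupby("city")["target"].mean()
ordered = target_means.sort_values(ascending=False).index

# Assign 1..n following that order.
mapping = {category: rank for rank, category in enumerate(ordered, start=1)}
df["city_ordinal"] = df["city"].map(mapping)
```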
2.5 Weight of evidence
Weight of evidence (WOE) is a technique used to encode categorical variables for classification. WOE is the natural logarithm of the probability of the target being 1 divided by the probability of the target being 0. WOE has the property that its value will be 0 if the phenomenon is random; it will be greater than 0 if the probability of the target being 1 is greater, and it will be smaller than 0 if the probability of the target being 0 is greater.
WOE transformation creates a nice visual representation of the variable, because by looking at the WOE encoded variable, we can see, category by category, whether it favours the outcome of 0, or of 1. In addition, WOE creates a monotonic relationship between variable and target, and leaves all the variables within the same value range.
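Taking the definition used in this post, ln(P(target = 1) / P(target = 0)) within each category, a rough sketch looks like this. The categories A, B and C are invented. Note that the value is undefined for a category in which the target is always 0 or always 1, so real data may need smoothing or grouping first.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["A"] * 10 + ["B"] * 10 + ["C"] * 10,
    "target": [1] * 5 + [0] * 5 + [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8,
})

# Probability of target = 1 and target = 0 within each category.
p1 = df.groupby("city")["target"].mean()
p0 = 1 - p1

# A: 50/50 gives 0; B: mostly 1 gives a positive value; C: mostly 0 gives a negative value.
woe = np.log(p1 / p0)

df["city_woe"] = df["city"].map(woe)
```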
2.6 Rare Label encoding
Categories that are present in only a small proportion of the observations tend to be grouped into an umbrella category like “Other” or “Rare”. This procedure tends to improve machine learning model generalisation, in particular for tree based methods, and also the operationalisation of the models in production.
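As an illustration, grouping infrequent labels could look like the sketch below. The 5% threshold is an arbitrary assumption and the city names are invented.

```python
import pandas as pd

cities = pd.Series(["London"] * 50 + ["Leeds"] * 45 + ["York"] * 3 + ["Bath"] * 2)

# Fraction of observations per category.
freq = cities.value_counts(normalize=True)

# Categories below the (assumed) 5% threshold are considered rare.
rare = freq[freq < 0.05].index

# Keep frequent labels, replace rare ones with a single umbrella label.
grouped = cities.where(~cities.isin(rare), "Rare")
```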
There are additional methods of categorical encoding, like Binary Encoding and Feature Hashing, which I will not cover in this post, but are covered in the course “Feature Engineering for Machine Learning”. For more information on these techniques you can also visit Will McGinnis' blog.
3. Variable transformation
Some machine learning models assume that the variables are normally distributed. Other models may benefit from a more homogeneous spread of values across the value range. If variables are not normally distributed, we can apply a mathematical transformation to enforce this distribution. Typically used mathematical transformations are:
Logarithm transformation - log(x)
Reciprocal transformation - 1 / x
Square root transformation - sqrt(x)
Exponential transformation - exp(x)
Box-Cox and Yeo-Johnson are families of power transformations that evaluate several exponents and keep the one that works best, and are therefore more likely to achieve the desired result. You can find the transformation formulas in this article.
When applying mathematical transformations we need to be mindful of the variable values. For example, the logarithm is only defined for strictly positive values, the square root for non-negative values, and the reciprocal transformation is not defined for 0.
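A small sketch of the basic transformations with NumPy, on an invented right-skewed variable. The checks before each call mirror the domain restrictions just mentioned; comparing the skewness before and after is just one simple way to see the effect.

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 10.0, 100.0, 1000.0])  # strongly right-skewed toy variable

# The logarithm needs strictly positive values.
assert (x > 0).all()
log_x = np.log(x)

# The reciprocal is not defined for 0.
assert (x != 0).all()
reciprocal_x = 1 / x

# The square root needs non-negative values.
assert (x >= 0).all()
sqrt_x = np.sqrt(x)

skew_before = x.skew()
skew_after = log_x.skew()
```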
4. Discretisation
Discretisation refers to sorting the values of the variable into bins or intervals, also called buckets. There are multiple ways to discretise variables:
Equal width discretisation
Equal Frequency discretisation
Discretisation using decision trees
4.1 Equal width discretisation
In equal width discretisation, the bins or interval limits are determined so that each interval is of the same width. This is accomplished by subtracting the minimum value from the maximum value of the variable, and dividing that range by the number of bins desired, say 10. Next, we sort the observations into those bins. Note, however, that if the distribution is skewed, this technique does not improve the spread of the values.
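With pandas, equal width binning can be sketched with `cut`. The toy variable is deliberately skewed to show the caveat above: most observations land in the first bin.

```python
import pandas as pd

x = pd.Series([0, 1, 2, 3, 4, 5, 50, 100])

n_bins = 4
width = (x.max() - x.min()) / n_bins  # each interval spans 25 units

# labels=False returns the bin index (0 .. n_bins - 1) for each observation.
codes = pd.cut(x, bins=n_bins, labels=False)
```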
4.2 Equal frequency discretisation
In equal frequency discretisation, the boundaries of the intervals are determined so that each bin contains the same number of observations. This is a better solution if we want to spread the values evenly across all bins. The usual approach is to use the quantiles of the variable, for example the quartiles or deciles, to determine the intervals.
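The quantile-based version can be sketched with `qcut`, here using quartiles on the same kind of skewed toy variable. Each bin ends up with the same number of observations, while the interval widths differ.

```python
import pandas as pd

x = pd.Series([0, 1, 2, 3, 4, 5, 50, 100])

# q=4 uses the quartiles as interval boundaries.
codes = pd.qcut(x, q=4, labels=False)

per_bin = codes.value_counts().sort_index()
```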
4.3 Discretisation with decision trees
Discretisation with decision trees involves training a decision tree and then sorting the observations into its end leaves. Different leaves will contain different numbers of observations, so the method does not preserve frequency as equal frequency discretisation does. In addition, each leaf is represented not by an interval but by a prediction value. However, discretisation with decision trees can improve model performance by creating monotonic relationships that already capture some of the predictive power of the variable.
5. Outlier handling
Outliers are values that are unusually high or unusually low with respect to the rest of the observations of the variable. There are a few techniques for outlier handling:
Outlier removal
Treating outliers as missing values
Top / bottom / zero coding
Discretisation
5.1 Outlier removal
Outlier removal refers to removing outlier observations from the dataset. Outliers, by nature, are not abundant, so this procedure should not distort the dataset dramatically. However, if there are outliers across multiple variables, we may end up removing a big portion of the dataset.
5.2 Treating outliers as missing values
We can treat outliers as missing information, and carry out any of the imputation methods described earlier in the post.
5.3 Top / bottom / zero coding
Top and bottom coding are also known as Winsorisation or outlier capping. The procedure involves capping the maximum and minimum values at a predefined value. This predefined value can be arbitrary, or it can be derived from the variable distribution.
How can we derive the maximum and minimum values? If the variable is normally distributed we can cap the maximum and minimum values at the mean plus or minus 3 times the standard deviation. If the variable is skewed, we can use the inter-quartile range proximity rule or cap at the top and bottom percentiles.
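For a skewed variable, capping with the inter-quartile range proximity rule might be sketched as below. The factor 1.5 is the conventional choice and an assumption here; a larger factor such as 3 only flags more extreme values.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1

# Boundaries beyond which values are treated as outliers (assumed factor: 1.5).
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Cap values outside the boundaries instead of removing them.
capped = x.clip(lower=lower, upper=upper)
```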
5.4 Discretisation
Discretisation handles outliers automatically, as outliers are sorted into the terminal bins, together with the other higher or lower value observations. The best approaches are equal frequency and tree based discretisation.
6. Feature Scaling
Many machine learning algorithms are sensitive to the magnitude of the variables, therefore it is common practice to set all features within the same scale. There are multiple ways of feature scaling:
Standardisation
Min-Max Scaling
Maximum Absolute Scaling
Robust Scaling
Mean normalisation
Scaling to unit length
6.1 Standardisation
Feature standardisation involves subtracting the mean from each value and dividing by the standard deviation. Feature standardisation gives the variables zero mean and unit variance, and it is suitable if the variables are normally distributed.
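Standardisation written out by hand, as a sketch on toy values. The population standard deviation (`ddof=0`) is used here, which matches what Scikit-learn's `StandardScaler` does; pandas defaults to the sample standard deviation, so the two differ slightly on small datasets.

```python
import pandas as pd

x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Subtract the mean and divide by the (population) standard deviation.
z = (x - x.mean()) / x.std(ddof=0)
```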
6.2 Min-Max Scaling
Min-Max Scaling, or Min-Max normalisation, consists in re-scaling the variable to 0-1, which is achieved by subtracting the minimum from each value and dividing by the value range. The value range is calculated as the maximum minus the minimum value of the variable. Min-Max Scaling offers a good alternative to Standardisation when variables are skewed.
6.3 Maximum Absolute Scaling
Maximum Absolute Scaling involves scaling the features between -1 and 1, by dividing each value of the variable by its maximum absolute value.
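A quick sketch with a toy variable containing negative values shows why the divisor is the maximum absolute value: the result then lies within -1 and 1, and zeros stay at zero.

```python
import pandas as pd

x = pd.Series([-4.0, -1.0, 0.0, 2.0])

# Divide by the largest absolute value (here 4, coming from -4).
scaled = x / x.abs().max()
```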
6.4 Robust Scaling
Robust Scaling involves subtracting the median from each value and dividing by the inter-quartile range, which is given by the difference between the 75th and 25th percentiles. The procedure is similar in essence to Min-Max Scaling, but offers a better value spread for highly skewed variables.
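Robust scaling can be sketched the same way. Because the median and the quartiles barely move when an extreme value is present, the single large value in this toy series does not compress the rest of the data.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

median = x.median()
iqr = x.quantile(0.75) - x.quantile(0.25)

# Subtract the median and divide by the inter-quartile range.
scaled = (x - median) / iqr
```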
6.5 Mean normalisation
In mean normalisation, we remove from each value the mean value, and divide by the value range, that is the difference between the maximum and minimum value.
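To make the contrast with Min-Max Scaling concrete, this sketch applies both to the same toy variable. The two share the divisor (the value range) and differ only in what is subtracted, so mean normalisation is centred at 0 while Min-Max starts at 0.

```python
import pandas as pd

x = pd.Series([10.0, 20.0, 30.0, 40.0, 100.0])

value_range = x.max() - x.min()

# Min-Max Scaling: subtract the minimum.
min_max = (x - x.min()) / value_range

# Mean normalisation: subtract the mean instead.
mean_norm = (x - x.mean()) / value_range
```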
6.6 Scaling to unit length
Scaling to unit length refers to transforming the values of the variable so that the complete variable vector has length one. In scaling to unit length, we divide each value of the variable by the Euclidean length, or norm, of the variable.
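Following the description above, a minimal sketch divides the vector by its Euclidean norm. Be aware that Scikit-learn's `Normalizer` applies the same operation to each observation (row) rather than to each variable, so it is worth checking which vector is being normalised in a given tool.

```python
import numpy as np
import pandas as pd

x = pd.Series([3.0, 4.0])

# Euclidean length of the vector: sqrt(3^2 + 4^2) = 5.
norm = np.linalg.norm(x)

unit = x / norm
```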
7. Date and Time Engineering
In dealing with date and time variables, normally we extract information like year, month, day, day of the week, is weekend, time of the day, is morning, is afternoon, among others. In addition, we normally extract information from multiple date time variables in combination, for example age from date of birth and date of transaction, or time elapsed between 2 dates, just to name a few.
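Several of these features can be derived with the pandas `.dt` accessor. The column names, the noon cut-off for "morning" and the age approximation (days divided by 365.25) are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "date_of_birth": pd.to_datetime(["1990-05-17", "1985-12-01"]),
    "transaction_time": pd.to_datetime(["2020-01-04 09:30", "2020-01-06 15:45"]),
})

t = df["transaction_time"]
df["year"] = t.dt.year
df["month"] = t.dt.month
df["day_of_week"] = t.dt.dayofweek                     # Monday = 0, Sunday = 6
df["is_weekend"] = (t.dt.dayofweek >= 5).astype(int)
df["is_morning"] = (t.dt.hour < 12).astype(int)

# Combining two date variables: approximate age in whole years at transaction time.
elapsed_days = (t - df["date_of_birth"]).dt.days
df["age_years"] = (elapsed_days / 365.25).astype(int)
```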
8. Feature Creation
Feature creation refers to creating new features from existing ones. This can generally be done by aggregating features using the mean, maximum and minimum values, sums and differences. We could also perform polynomial and other non-linear combinations of the features.
Much of feature creation involves knowledge of the variables at hand to derive new features that are meaningful to people, if they are to be used in organisations. For data competitions, any brute force approach to create variables that are not necessarily comprehensible may give us an edge in the competition.
Feature creation is more commonly seen in Natural Language Processing, when creating bag of words or frequency tables from the words that appear in the text.
9. Aggregating Transaction Data
Transaction data refers to information recorded from transactions. For example, we can keep records of every sale done in a shop, or the balances in our bank and credit accounts throughout the months of the year. In order to use transaction data to predict static outcomes, we normally aggregate these variables into a static view. Common ways of aggregating these variables include determining a time window, for example the last 6 months, and finding the maximum value transaction, the minimum value transaction, the mean, the sum, the standard deviation, among others.
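A sketch of such an aggregation with pandas: transactions within the 6 months before a snapshot date are summarised into one row per customer. The table layout, the column names and the snapshot date are all hypothetical.

```python
import pandas as pd

tx = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(
        ["2020-01-15", "2020-04-10", "2020-06-20", "2019-11-01", "2020-05-05"]
    ),
    "amount": [100.0, 50.0, 150.0, 999.0, 20.0],
})

# Time window: the 6 months before the (assumed) snapshot date.
snapshot = pd.Timestamp("2020-07-01")
window_start = snapshot - pd.DateOffset(months=6)
recent = tx[(tx["date"] >= window_start) & (tx["date"] < snapshot)]

# One row per customer: the static view.
static_view = recent.groupby("customer_id")["amount"].agg(
    ["max", "min", "mean", "sum", "std"]
)
```

Customer 2's transaction from 2019 falls outside the window and is ignored; with a single transaction left, the standard deviation for that customer is undefined and comes out as missing, which then needs one of the imputation techniques from section 1.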
I have gathered many of these techniques for feature engineering in the course “Feature Engineering for Machine Learning”, which is available on Udemy. In this course, you will find more in-depth explanations of the feature engineering techniques, with demonstrations of their impact on variables, as well as code to implement the procedures using open source packages like pandas, NumPy, Scikit-learn and the recently released library Feature Engine.
You can find more resources to learn about feature engineering in the article Best Resources to Learn Feature Engineering for Machine Learning.
I hope you enjoyed the post and thank you for sharing it!