Soledad Galli

Feature-engine: A new open source Python package for feature engineering

Feature-engine is an open source Python library with the most exhaustive battery of transformers to engineer features for use in machine learning models. Feature-engine simplifies and streamlines the implementation of an end-to-end feature engineering pipeline by allowing the selection of feature subsets within its transformers, and by returning dataframes for easy data exploration. Feature-engine's transformers preserve Scikit-learn functionality, with the methods fit() and transform() to learn parameters from the data and then transform it.




Feature engineering


Feature engineering is the process of using domain knowledge of the data to transform existing features or to create new variables from existing ones, for use in machine learning. Feature engineering includes procedures to impute missing data, encode categorical variables, transform or discretise numerical variables, put features on the same scale, combine features into new variables, extract information from dates, aggregate transactional data, or derive features from time series, text or even images. There are many techniques that we can use at each of these feature engineering steps, and our choice depends on the characteristics of the variables in our data set, as well as on the algorithms we intend to use.




Challenges of deploying a machine learning pipeline

Machine learning models take in a set of input variables and output a prediction. Yet, the raw data collected and stored by organisations is almost never suitable to be fed directly into a machine learning model. Instead, we perform an extensive amount of transformations to leave the variables in a shape that these algorithms can understand. This collection of variable transformations is commonly referred to as feature engineering.


There are well-established Python libraries that contain off-the-shelf machine learning algorithms for supervised and unsupervised learning, like Scikit-learn, pyearth, TensorFlow and Keras. But, until recently, there were no libraries that could support the feature engineering transformations required to feed the data to those algorithms. This meant that the code for the feature engineering steps needed to be written manually, and then often re-written to make it suitable for production, in case we decided to deploy the pipeline to, say, score live data. This process is neither efficient nor reproducible.


By using well-established open source Python libraries, we can make model development and deployment more efficient and reproducible. Established open source packages provide quality tools whose use removes the task of coding the transformations from our hands, improving team performance and collaboration. In addition, open source packages tend to be extensively tested, which helps prevent the introduction of bugs and promotes reproducibility.


In the last few years, open source Python libraries began to support the implementation of feature engineering techniques as part of the machine learning pipeline. The library Featuretools supports an exhaustive array of functions to work with transactional data and time series; the library Category encoders supports various alternative ways to encode categorical variables, beyond the popular one hot encoding; while Scikit-learn supports a wide array of transformations for imputation, categorical encoding, discretisation and variable transformation. Feature-engine is the newest of these open source Python libraries, yet it supports the most exhaustive battery of transformations, while allowing the selection of feature subsets directly at the transformer, thus making engineering pipelines much easier to code and deploy.



Feature-engine


Feature-engine is an open source Python library that simplifies and streamlines the implementation of an end-to-end feature engineering pipeline. Feature-engine preserves Scikit-learn functionality with the methods fit() and transform() to learn parameters from the data and then transform it. Many feature engineering techniques need to learn parameters from the data, like statistical values or encoding mappings, to transform incoming data. The Scikit-learn functionality with the fit and transform methods makes Feature-engine easy to use and easy to learn. Feature-engine’s transformers also store the learned parameters, and can be used within the Scikit-learn Pipeline.


Feature-engine supports multiple transformers for missing data imputation, categorical variable encoding, discretisation, variable transformation and outlier handling, thus providing the most exhaustive array of techniques for feature engineering. More specifically, Feature-engine supports the following techniques for each engineering aspect of a variable:


Imputation methods

  • Mean and median imputation: MeanMedianImputer

  • Random sample imputation: RandomSampleImputer (Exclusive)

  • Imputation with arbitrary values: ArbitraryNumberImputer

  • Imputation with values at the end of the distribution: EndTailImputer (Exclusive)

  • Imputation with the most frequent category: CategoricalVariableImputer

  • Imputation with the string ‘Missing’: CategoricalVariableImputer

  • Addition of binary missing indicators: AddMissingIndicator

Categorical encoding methods

  • One hot encoding: OneHotCategoricalEncoder

  • One hot encoding of frequent categories: OneHotCategoricalEncoder (Exclusive)

  • Frequency or count encoding: CountFrequencyCategoricalEncoder (Exclusive)

  • Ordinal encoding: OrdinalCategoricalEncoder

  • Monotonic ordinal encoding: OrdinalCategoricalEncoder (Exclusive)

  • Target mean encoding: MeanCategoricalEncoder

  • Weight of evidence: WoERatioCategoricalEncoder

  • Grouping of rare labels: RareLabelCategoricalEncoder (Exclusive)


Discretisation methods

  • Discretisation in equal frequency intervals: EqualFrequencyDiscretiser

  • Discretisation in equal width intervals: EqualWidthDiscretiser

  • Discretisation with Decision trees: DecisionTreeDiscretiser (Exclusive)


Variable transformation methods

  • Logarithmic transformation: LogTransformer

  • Reciprocal transformation: ReciprocalTransformer

  • Exponential transformations: PowerTransformer

  • Box-Cox transformation: BoxCoxTransformer

  • Yeo-Johnson transformation: YeoJohnsonTransformer


Outlier handling methods (Exclusive)

  • Outlier removal: OutlierTrimmer (Exclusive)

  • Outlier capping or censoring: Winsorizer, ArbitraryOutlierCapper (Exclusive)


What is unique about Feature-engine?


Feature-engine has the following characteristics that differentiate it from other available open source packages:

  1. Feature-engine contains the most exhaustive battery of feature engineering transformations

  2. Feature-engine allows the selection of variables to transform directly at the transformer

  3. Feature-engine takes in a dataframe and returns a dataframe suitable both for data exploration and production or deployment

  4. Feature-engine is compatible with the Scikit-learn pipeline, thus all engineering transformations can be stored in a single Python pickle

  5. Feature-engine automatically recognizes numerical and categorical variables

  6. Feature-engine alerts us when transformations are not possible, for example, when applying the logarithm to variables with negative values or reciprocal transformations to variables with zeros as values


1) Feature-engine’s exhaustive variable transformation toolkit


Feature-engine hosts all-round transformations to leave the data ready for machine learning. In addition to the widely used imputation techniques like mean, median, mode and arbitrary imputation, which are also supported by Scikit-learn, Feature-engine also supports imputation with values at the end of the distribution, and imputation by random sampling.
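
As a brief illustration, the sketch below shows end-of-distribution imputation; it is only a sketch, assuming a pandas dataframe X_train with hypothetical numerical columns var_a and var_b, and relying on EndTailImputer's default behaviour, which places the imputation value at the right tail of the distribution (the mean plus three times the standard deviation).


from feature_engine.missing_data_imputers import EndTailImputer
 
# impute NA with a value at the far end of each variable's distribution
# (by default, mean + 3 standard deviations); var_a and var_b are hypothetical columns
imputer = EndTailImputer(variables=['var_a', 'var_b'])
 
# learn the imputation values from the train set, then replace the missing values
imputer.fit(X_train)
X_train = imputer.transform(X_train)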


Feature-engine also offers a variety of exclusive techniques for categorical variable encoding. On top of the widely used one hot encoding and ordinal encoding, supported by Scikit-learn, and of target mean encoding and weight of evidence, supported by category encoders, Feature-engine also offers count and frequency encoding, monotonic ordinal encoding and probability ratio encoding.


Feature-engine also offers functionality to handle rare labels, like one hot encoding of frequent categories or grouping infrequent categories under a common new label defined by the user.
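
For instance, here is a minimal sketch of rare label grouping, mirroring the parameters used in the pipeline example later in this post; it assumes X_train contains the categorical column cabin, as in the titanic example below, and that infrequent categories are grouped under the default replacement label.


from feature_engine import categorical_encoders as ce
 
# group categories present in less than 3% of the observations
# under a single new label (by default, 'Rare')
rare_encoder = ce.RareLabelCategoricalEncoder(tol=0.03, n_categories=1,
                                              variables=['cabin'])
 
rare_encoder.fit(X_train)
X_train = rare_encoder.transform(X_train)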


Feature-engine hosts most mathematical transformations and discretisation techniques available in Scikit-learn, and it has the additional functionality to use decision trees to transform a variable into discrete numbers. Finally, Feature-engine is, to the best of our knowledge, the only open source library with functionality to remove or censor outliers.



2) Feature-engine allows the selection of variables directly at the transformer


One of the reasons why Feature-engine’s transformers are so convenient is that they allow us to select which variables we wish to transform with each technique, directly at the transformer. This way, we can specify the group of variables that we want to impute with the mean, for example, and the group of variables to impute with the mode, directly within these transformers, without the need to slice the dataframe manually or use alternative transformers. Code examples follow later in the blog.



3) Feature-engine returns a dataframe


All Feature-engine transformers return dataframes as outputs. This means that after transforming our dataset, we do not need to worry about variable names and column order as we would do with the NumPy arrays returned by Scikit-learn. With Feature-engine, we can continue to leverage the power of pandas for data analysis and visualisation even after transforming our dataset, allowing for data exploration before and after transforming the variables.
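
As a quick sketch, where transformer stands for any fitted Feature-engine transformer and X_train for the dataframe being transformed:


# transform() returns a pandas DataFrame, not a NumPy array
X_train_t = transformer.transform(X_train)
print(type(X_train_t))   # <class 'pandas.core.frame.DataFrame'>
X_train_t.describe()     # pandas exploration remains available after the transformation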



4) Feature-engine is compatible with the Scikit-learn pipeline


Feature-engine transformers are compatible with the Scikit-learn pipeline. This allows the implementation of many feature engineering steps within a single Scikit-learn pipeline prior to training a machine learning algorithm, or to obtaining its predictions from raw data. With Feature-engine, we can store an entire series of machine learning transformations in a single object that can be saved and retrieved at a later stage, or placed in memory, for live scoring. Code examples follow later in the blog.



5) Feature-engine automatically recognizes numerical and categorical variables


Feature-engine automatically recognizes numerical and categorical variables, thus, preventing the risk of inadvertently applying categorical encoding to numerical variables or numerical imputation techniques to categorical variables.


This functionality also allows us to run the transformers without indicating which variables to transform; Feature-engine's transformers are intelligent enough to apply numerical transformations to numerical variables and categorical transformations to categorical variables, so that we can obtain, very quickly and without much data manipulation, a benchmark machine learning pipeline on a given dataset.
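
For example, in the minimal sketch below the imputer is set up without the variables argument, so it finds and imputes all numerical variables in the dataframe on its own; the categorical transformers behave analogously for categorical variables.


from feature_engine.missing_data_imputers import MeanMedianImputer
 
# no 'variables' argument: the imputer selects all numerical variables by itself
imputer = MeanMedianImputer(imputation_method='median')
 
imputer.fit(X_train)
X_train = imputer.transform(X_train)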



6) Feature-engine alerts when transformations are not possible for certain variables


Feature-engine will alert us when transformations are not possible. For categorical encoding, for example, Feature-engine will signal the unexpected or unintended introduction of missing values. For variable transformations, Feature-engine will alert us when the logarithm is applied to variables with negative values, or when reciprocal transformations are applied to variables with zeros as values. This way, Feature-engine helps identify issues with the variables early during the development of a machine learning pipeline, so that we can choose a more suitable technique.
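
As a hedged sketch of this behaviour, the snippet below applies the LogTransformer to a toy dataframe with a hypothetical column that contains a negative value; instead of silently producing NaNs, the transformer raises an error (caught here as a ValueError, although the exact error may vary with the library version).


import pandas as pd
from feature_engine import variable_transformers as vt
 
# toy dataframe with a negative value in the hypothetical column 'balance'
df = pd.DataFrame({'balance': [100.0, 250.0, -30.0]})
 
try:
    vt.LogTransformer(variables=['balance']).fit(df)
except ValueError as err:
    print(err)  # Feature-engine flags that the logarithm cannot be applied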


How to use Feature-engine


In the rest of the blog, we will show examples of how to use Feature-engine transformers for missing data imputation, categorical encoding, discretisation and variable transformation. Let’s begin with missing data imputation, which is typically the first step of a machine learning pipeline.


Feature-engine transformers learn parameters from the data when the method fit() is used, and store these parameters within their attributes. These values can then be retrieved to transform new data. In the following sections, we will show how to instantiate and fit a transformer, and how to use a trained transformer to transform a train and a test set. For more details, please refer to the documentation.


Missing data imputation


Missing data imputation refers to replacing missing observations by a statistical parameter derived from the available values of the variable. As an example of Feature-engine’s imputation capabilities, we will perform median imputation. Feature-engine’s MeanMedianImputer automatically selects all numerical variables in the dataset for imputation, ignoring the categorical variables. The transformer also offers the option to select the variables to impute, as we will show below.


In the walk-through below, you can see the implementation of the imputer using the median as the imputation_method on predictor variables in both the train and test datasets. Mean imputation can be implemented similarly by simply replacing “median” with “mean” as the imputation_method. If you wish to run the code below, first download and prepare the dataset as indicated here.


import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.missing_data_imputers import MeanMedianImputer

# Load dataset
data = pd.read_csv('creditApprovalUCI.csv')
 
# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
data.drop('A16', axis=1), data['A16'], test_size=0.3, random_state=0)
 
# Set up the imputer
median_imputer = MeanMedianImputer(imputation_method='median',
                        variables=['A2', 'A3', 'A8', 'A11', 'A15'])
# fit the imputer
median_imputer.fit(X_train)
 
# transform the data
X_train= median_imputer.transform(X_train)
X_test= median_imputer.transform(X_test)

After running the above code, the train set will no longer contain missing values in the variables A2, A3, A8, A11 and A15, and the output will be a dataframe that allows us to continue with data exploration, to, for example, understand the effect of this transformation on the variable distributions.
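
The medians learned during fit are stored inside the transformer, in its imputer_dict_ attribute (per Feature-engine's documentation), so we can inspect them and verify that the imputed variables no longer contain missing values:


# medians learned from the train set, one per variable
print(median_imputer.imputer_dict_)
 
# confirm that the imputed variables contain no more NA
print(X_train[['A2', 'A3', 'A8', 'A11', 'A15']].isnull().sum())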


Categorical encoding


Categorical encoding includes techniques to transform variables whose values are strings into numerical variables. To demonstrate how to use Feature-engine’s categorical encoders, we will perform count or frequency encoding, that is, we will replace the categories by the number of times, or the fraction of observations in which, they appear in the train set; the code below uses the frequency variant. We will use the titanic dataset, which is publicly available in OpenML.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine import categorical_encoders as ce
 
# Load dataset
def load_titanic():
  data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
  data = data.replace('?', np.nan)
  data['cabin'] = data['cabin'].astype(str).str[0]
  data['pclass'] = data['pclass'].astype('O')
  data['embarked'].fillna('C', inplace=True)
  return data
 
data = load_titanic()
 
# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
  data.drop(['survived', 'name', 'ticket'], axis=1),
  data['survived'], test_size=0.3, random_state=0)
 
# set up the encoder
encoder = ce.CountFrequencyCategoricalEncoder(
        encoding_method='frequency',
        variables=['cabin', 'pclass', 'embarked'])
 
# fit the encoder
encoder.fit(X_train)
 
# transform the data
train_t= encoder.transform(X_train)
test_t= encoder.transform(X_test)

Feature-engine learns the category-to-number mappings from the train set, and stores them in the attribute encoder_dict_. The output is a dataframe, where the variables cabin, pclass and embarked are now numbers instead of strings.
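
For example, we can inspect the learned mappings and the encoded output:


# category-to-frequency mappings learned from the train set
print(encoder.encoder_dict_)
 
# the encoded variables are now numeric
print(train_t[['cabin', 'pclass', 'embarked']].head())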


Discretisation


Discretisation involves sorting the values of continuous variables into discrete intervals, also called bins or buckets. Here, we will show how to perform discretisation using decision trees, a technique supported exclusively by Feature-engine. We will use the house prices dataset, which is available on Kaggle.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine import discretisers as dsc
 
# Load dataset
data = pd.read_csv('houseprice.csv')
 
# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
 data.drop(['Id', 'SalePrice'], axis=1),
 data['SalePrice'], test_size=0.3, random_state=0)
 
# set up the discretisation transformer
disc = dsc.DecisionTreeDiscretiser(
    cv=3,
    scoring='neg_mean_squared_error',
    variables=['LotArea', 'GrLivArea'],
    regression=True)
 
# fit the transformer
disc.fit(X_train, y_train)
 
# transform the data
train_t= disc.transform(X_train)
test_t= disc.transform(X_test)

The output of this transformation is a discrete variable, where each of the discrete values is the prediction returned by the decision tree based on the variable’s original value.
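
We can verify this by inspecting the transformed variables; each one now contains only a small number of distinct values, the predictions of the fitted tree:


# the discretised variables contain only the tree's output values
print(train_t['GrLivArea'].unique())
print(train_t['LotArea'].nunique())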



Mathematical Transformation


Mathematical transformations refer to the transformation of the original variable by applying a mathematical function, typically with the aim of obtaining a more Gaussian-looking distribution of the values. Here, we will demonstrate how to implement the Box-Cox transformation with Feature-engine:


import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine import variable_transformers as vt
 
# Load dataset
data = pd.read_csv('houseprice.csv')
 
# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
  data.drop(['Id', 'SalePrice'], axis=1),
 data['SalePrice'], test_size=0.3, random_state=0)
 
# set up the variable transformer
tf = vt.BoxCoxTransformer(variables = ['LotArea', 'GrLivArea'])
 
# fit the transformer
tf.fit(X_train)
 
# transform the data
train_t= tf.transform(X_train)
test_t= tf.transform(X_test)

Outlier Handling


Outliers are those values of a variable that are extremely unusual given the rest of the values of that variable. Among its functionality, Feature-engine allows us to remove or censor outliers based on the Gaussian approximation, the inter-quartile range proximity rule or the percentiles. Here, we will demonstrate how to censor outliers at the right tail of the distribution, finding the variable limits with the Gaussian approximation, that is, the mean plus three times the standard deviation:



import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine import outlier_removers as outr
 
# Load dataset
def load_titanic():
   data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
   data = data.replace('?', np.nan)
   data['cabin'] = data['cabin'].astype(str).str[0]
   data['pclass'] = data['pclass'].astype('O')
   data['embarked'].fillna('C', inplace=True)
   data['fare'] = data['fare'].astype('float')
   data['fare'].fillna(data['fare'].median(), inplace=True)
   data['age'] = data['age'].astype('float')
   data['age'].fillna(data['age'].median(), inplace=True)
   return data
 
data = load_titanic()
 
# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
  data.drop(['survived', 'name', 'ticket'], axis=1),
  data['survived'], test_size=0.3, random_state=0)
 
# set up the capper
capper = outr.Winsorizer(
  distribution='gaussian', tail='right', fold=3,
  variables=['age', 'fare'])
 
# fit the capper
capper.fit(X_train)
 
# transform the data
train_t = capper.transform(X_train)
test_t = capper.transform(X_test)

The output is a dataframe where the values of the variables age and fare that were beyond the upper boundaries of the distribution, determined as the mean plus three times the standard deviation, are now replaced by those boundaries.
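
The capping values learned during fit are stored in the transformer, in its right_tail_caps_ attribute (per Feature-engine's documentation), so they can be inspected or audited later:


# maximum allowed values for age and fare, learned from the train set
print(capper.right_tail_caps_)
 
# no value in the transformed data exceeds the learned caps
print(train_t[['age', 'fare']].max())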



Assembling Feature-engine transformers into the Scikit-learn pipeline


In the preceding sections, we showed how to implement each technique individually. When we build machine learning models, we usually apply several transformations to the variables. We can place all Feature-engine transformers within a Scikit-learn pipeline, to streamline data transformation and algorithm training, as well as to easily score new raw data. In the following code snippet, we apply a complete feature engineering pipeline to the house prices dataset, and then build a Lasso regression to predict house price, leveraging the power of the Scikit-learn pipeline:


import pandas as pd
import numpy as np
 
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline as pipe
from sklearn.preprocessing import MinMaxScaler
 
from feature_engine import categorical_encoders as ce
from feature_engine import discretisers as dsc
from feature_engine import missing_data_imputers as mdi
 
# load dataset
data = pd.read_csv('houseprice.csv')
 
# drop some variables
data.drop(labels=['YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'Id'],
           axis=1, inplace=True)
 
# make a list of categorical variables
categorical = [var for var in data.columns if data[var].dtype == 'O']
 
# make a list of numerical variables
numerical = [var for var in data.columns if data[var].dtype != 'O']
 
# make a list of discrete variables
discrete = [ var for var in numerical if len(data[var].unique()) < 20]
 
# categorical encoders work only with object type variables
# to treat numerical variables as categorical, we need to re-cast them
data[discrete]= data[discrete].astype('O')
 
# continuous variables
numerical = [
 var for var in numerical if var not in discrete
 and var not in ['Id', 'SalePrice']
 ]
 
# separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['SalePrice'], axis=1),
    data.SalePrice,
    test_size=0.1,
    random_state=0)
 
# set up the pipeline
price_pipe = pipe([
# add a binary missing indicator
 ('continuous_var_imputer', mdi.AddMissingIndicator(variables = ['LotFrontage'])),
 
 # replace NA by the median
 ('continuous_var_median_imputer', mdi.MeanMedianImputer(
 imputation_method='median', variables = ['LotFrontage', 'MasVnrArea'])),
 
 # replace NA by adding the label "Missing"
 ('categorical_imputer', mdi.CategoricalVariableImputer(variables = categorical)),
 
 # discretise continuous variables using trees
 ('numerical_tree_discretiser', dsc.DecisionTreeDiscretiser(
 cv = 3, scoring='neg_mean_squared_error', variables = numerical, regression=True)),
 
 # remove rare labels in categorical and discrete variables
 ('rare_label_encoder', ce.RareLabelCategoricalEncoder(
 tol = 0.03, n_categories=1, variables = categorical+discrete)),
 
 # encode categorical and discrete variables using the target mean
 ('categorical_encoder', ce.MeanCategoricalEncoder(variables = categorical+discrete)),
 
 # scale features
 ('scaler', MinMaxScaler()),
 
 # Lasso
 ('lasso', Lasso(random_state=2909, alpha=0.005))
 ])
 
# train the feature engineering transformers and the Lasso
price_pipe.fit(X_train, np.log(y_train))
 
# predict
pred_train = price_pipe.predict(X_train)
pred_test = price_pipe.predict(X_test)

Note in the code above how we indicate which variables to transform within each of Feature-engine’s transformers. Also note how easy it is to train the algorithm and to obtain predictions once all the transformers are assembled within a pipeline. If we want to deploy this pipeline, we only need to place one Python object in memory to do the job, or save and retrieve a single Python pickle that contains the entire, pre-trained machine learning pipeline.
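
A minimal sketch of this persistence step, using joblib (any Python pickling tool would do; the file name is arbitrary):


import joblib
 
# save the entire pipeline: imputation, encoding, discretisation, scaling and Lasso
joblib.dump(price_pipe, 'price_pipe.joblib')
 
# later, or in the scoring service, reload it and predict from raw data
price_pipe = joblib.load('price_pipe.joblib')
preds = price_pipe.predict(X_test)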



Bonus: Scikit-learn wrapper


Scikit-learn transformers, like the SimpleImputer or any of the variable scalers such as the StandardScaler or the MinMaxScaler, transform the entire input dataset and return a NumPy array. If we want to apply these transformers to a subset of features, we can use the Scikit-learn wrapper available in Feature-engine. Here is how to do it:



import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from feature_engine.wrappers import SklearnTransformerWrapper
 
# Load dataset
data = pd.read_csv('houseprice.csv')
 
# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
 data.drop(['Id', 'SalePrice'], axis=1),
 data['SalePrice'], test_size=0.3, random_state=0)
 
# set up the wrapper with the SimpleImputer
imputer = SklearnTransformerWrapper(
    transformer = SimpleImputer(strategy='mean'),
    variables = ['LotFrontage', 'MasVnrArea'])

# fit the wrapper + SimpleImputer
imputer.fit(X_train)
 
# transform the data
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

Feature-engine’s Scikit-learn wrapper allows the application of most Scikit-learn transformers to a selected feature subspace, returning a dataframe.


Closing remarks


Feature engineering is the process of taking a dataset and constructing explanatory variables, or predictor features, that are then used to train a machine learning model. It is a crucial step in all machine learning projects, yet it can be challenging and time consuming if you are not already deeply familiar with the knowledge domain.


Open source libraries with off-the-shelf algorithms for feature engineering and data transformation have a major edge over manually coding the transformation steps, as they enhance reproducibility while minimising the amount of coding required of the data scientist.


There is a growing number of open source libraries for variable transformation, which focus on different types of raw data, or engineering techniques, like Featuretools, Category encoders, Scikit-learn and Feature-engine. All of these libraries will help you streamline your data preparation pipelines.


In this blog, we explored the main features of Feature-engine and its exhaustive battery of techniques for missing data imputation, categorical variable encoding, variable transformation, discretisation and outlier handling, and provided a few examples that show how easy it is to use.


To learn more about Feature-engine, visit its dedicated documentation. To stay alert of new Feature-engine releases, register at trainindata. For an overview of the feature engineering techniques included in Feature-engine, visit the blog “Feature engineering: A comprehensive overview”. For code implementations of feature engineering with Feature-engine and other libraries, check the book “Python Feature Engineering Cookbook”. Finally, for an in-depth understanding of each engineering technique, its advantages and shortcomings, its effect on the variables and the dataset, and when to apply each transformation, visit the course “Feature Engineering for Machine Learning”.


Thanks for reading!
