Feature-engine: A new open source Python package for feature engineering
Feature-engine is an open source Python library with the most exhaustive battery of transformers to engineer features for use in machine learning models. Feature-engine simplifies and streamlines the implementation of an end-to-end feature engineering pipeline by allowing the selection of feature subsets within its transformers and by returning dataframes for easy data exploration. Feature-engine's transformers preserve Scikit-learn functionality, with the methods fit() and transform() to learn parameters from data and then transform it.
Feature engineering is the process of using domain knowledge of the data to transform existing features or to create new variables from existing ones, for use in machine learning. Feature engineering includes procedures to impute missing data, encode categorical variables, transform or discretise numerical variables, put features in the same scale, combine features into new variables, extract information from dates, aggregate transactional data, or derive features from time series, text or even images. There are many techniques that we can use at each of these feature engineering steps, and our choice depends on the characteristics of the variables in our data set, as well as on the algorithms we intend to use.
For more details about feature engineering techniques, visit:
our previous blog “Feature engineering for machine learning: a comprehensive overview”
the course “Feature Engineering for Machine Learning”
the book “Python Feature Engineering Cookbook”
Challenges of deploying a machine learning pipeline
Machine learning models take in a bunch of input variables and output a prediction. Yet, the raw data collected and stored by organisations is almost never suitable to be fed directly into a machine learning model. Instead, we perform an extensive number of transformations to leave the variables in a shape that can be understood by these algorithms. This collection of variable transformations is commonly referred to as feature engineering.
There are well established Python libraries that contain off-the-shelf machine learning algorithms for supervised and unsupervised learning, like Scikit-learn, pyearth, TensorFlow and Keras. But until recently, there were no libraries that could support the feature engineering transformations required to feed the data to those algorithms. This meant that code for the feature engineering steps needed to be written manually, and then often re-written to make it suitable for production, in case we decided to deploy the pipeline to, say, score live data. This process is neither efficient nor reproducible.
By using well-established open source Python libraries, we can make model development and deployment more efficient and reproducible. Established open source packages provide quality tools whose use takes much of the coding off our hands, improving team performance and collaboration. In addition, open source packages tend to be extensively tested, which reduces the risk of introducing bugs and helps reproducibility.
In the last few years, open source Python libraries began to support the implementation of feature engineering techniques as part of the machine learning pipeline. The library Featuretools supports an exhaustive array of functions to work with transactional data and time series; the library Category encoders supports various alternative ways to encode categorical variables, beyond the popular one hot encoding; while Scikit-learn supports a wide array of transformations for imputation, categorical encoding, discretisation and variable transformation. Feature-engine is the newest of these open source Python libraries, yet it supports the most exhaustive battery of transformations, while allowing the selection of feature subsets directly at the transformer, thus making engineering pipelines much easier to code and deploy.
Feature-engine is an open source Python library that simplifies and streamlines the implementation of an end-to-end feature engineering pipeline. Feature-engine preserves Scikit-learn functionality, with the methods fit() and transform() to learn parameters from data and then transform it. Many feature engineering techniques need to learn parameters from the data, like statistical values or encoding mappings, in order to transform incoming data. The Scikit-learn functionality with the fit and transform methods makes Feature-engine easy to use and easy to learn. Feature-engine’s transformers also store the learned parameters, and can be used within the Scikit-learn Pipeline.
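To make the fit/transform idea concrete, here is a minimal sketch of the pattern written with plain pandas. The class name MedianImputerSketch and the attribute imputer_dict_ are hypothetical names chosen for illustration; this is not Feature-engine's implementation, only an approximation of the behaviour described above (learn parameters in fit(), store them, reuse them in transform(), return a dataframe).

```python
import pandas as pd


class MedianImputerSketch:
    """Hypothetical stand-in for a fit/transform imputer (not Feature-engine code)."""

    def __init__(self, variables):
        # the subset of columns this transformer is allowed to touch
        self.variables = variables

    def fit(self, X, y=None):
        # learn one median per selected variable from the training data
        self.imputer_dict_ = X[self.variables].median().to_dict()
        return self

    def transform(self, X):
        # work on a copy so the input is left untouched, and return a dataframe
        X = X.copy()
        X[self.variables] = X[self.variables].fillna(self.imputer_dict_)
        return X


train = pd.DataFrame({"age": [20.0, 30.0, None, 40.0], "city": ["a", "b", "a", None]})
test = pd.DataFrame({"age": [None, 50.0], "city": ["b", "a"]})

imputer = MedianImputerSketch(variables=["age"]).fit(train)
test_t = imputer.transform(test)  # the test set is filled with the train median
```

The important design point is that the median is learned once from the train set and then reused on any new data, which is what makes the transformation reproducible at scoring time.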
Feature-engine supports multiple transformers for missing data imputation, categorical variable encoding, discretisation, variable transformation and outlier handling, thus providing the most exhaustive array of techniques for feature engineering. More specifically, Feature-engine supports the following techniques for each engineering aspect of a variable:
Missing data imputation methods
Mean and median imputation: MeanMedianImputer
Random sample imputation: RandomSampleImputer (Exclusive)
Imputation with arbitrary values: ArbitraryNumberImputer
Imputation with values at the end of the distribution: EndTailImputer (Exclusive)
Imputation with the most frequent category: CategoricalVariableImputer
Imputation with the string ‘Missing’: CategoricalVariableImputer
Addition of binary missing indicators: AddMissingIndicator
Categorical encoding methods
One hot encoding: OneHotCategoricalEncoder
One hot encoding of frequent categories: OneHotCategoricalEncoder (Exclusive)
Frequency or count encoding: CountFrequencyCategoricalEncoder (Exclusive)
Ordinal encoding: OrdinalCategoricalEncoder
Monotonic ordinal encoding: OrdinalCategoricalEncoder (Exclusive)
Target mean encoding: MeanCategoricalEncoder
Weight of evidence: WoERatioCategoricalEncoder
Grouping of rare labels: RareLabelCategoricalEncoder (Exclusive)
Discretisation methods
Discretisation in equal frequency intervals: EqualFrequencyDiscretiser
Discretisation in equal width intervals: EqualWidthDiscretiser
Discretisation with Decision trees: DecisionTreeDiscretiser (Exclusive)
Variable transformation methods
Logarithmic transformation: LogTransformer
Reciprocal transformation: ReciprocalTransformer
Exponential transformations: PowerTransformer
Box-Cox transformation: BoxCoxTransformer
Yeo-Johnson transformation: YeoJohnsonTransformer
Outlier handling methods (Exclusive)
Outlier removal: OutlierTrimmer (Exclusive)
Outlier capping or censoring: Winsorizer, ArbitraryOutlierCapper (Exclusive)
What is unique about Feature-engine?
Feature-engine has the following characteristics that differentiate it from other available open source packages:
Feature-engine contains the most exhaustive battery of feature engineering transformations
Feature-engine allows the selection of variables to transform directly at the transformer
Feature-engine takes in a dataframe and returns a dataframe suitable both for data exploration and production or deployment
Feature-engine is compatible with the Scikit-learn pipeline, thus all engineering transformations can be stored in a single Python pickle
Feature-engine automatically recognizes numerical and categorical variables
Feature-engine alerts when transformations are not possible, for example when applying the logarithm to variables with negative values or dividing by variables that contain 0s
1) Feature-engine’s exhaustive variable transformation toolkit
Feature-engine hosts all-round transformations to leave the data ready for machine learning. In addition to the widely used imputation techniques like mean, median, mode and arbitrary imputation, which are also supported by Scikit-learn, Feature-engine also supports imputation with values at the end of the distribution, and imputation by random sampling.
Feature-engine also offers a variety of exclusive techniques for categorical variable encoding. On top of the widely used one hot encoding and ordinal encoding, supported by Scikit-learn, and of target mean encoding and weight of evidence, supported by Category encoders, Feature-engine also offers count and frequency encoding, monotonic ordinal encoding and probability ratio encoding.
Feature-engine hosts most mathematical transformations and discretisation techniques available in Scikit-learn, and it has the additional functionality to use decision trees to transform a variable into discrete numbers. Finally, Feature-engine is, to the best of our knowledge, the only open source library with functionality to remove or censor outliers.
2) Feature-engine allows the selection of variables directly at the transformer
One of the reasons why Feature-engine’s transformers are so convenient is that they allow us to select which variables we wish to transform with each technique, directly at the transformer. This way, we can specify the group of variables that we want to impute with the mean, for example, and the group of variables to impute with the mode, directly within these transformers, without the need to slice the dataframe manually or use alternative transformers. Code examples will follow later on in the blog.
3) Feature-engine returns a dataframe
All Feature-engine transformers return dataframes as outputs. This means that after transforming our dataset, we do not need to worry about variable names and column order as we would do with the NumPy arrays returned by Scikit-learn. With Feature-engine, we can continue to leverage the power of pandas for data analysis and visualisation even after transforming our dataset, allowing for data exploration before and after transforming the variables.
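As a small illustration of the difference, the sketch below runs Scikit-learn's SimpleImputer on a toy dataframe (the column names and values are made up). It assumes Scikit-learn's default output configuration, in which transformers return NumPy arrays; recent Scikit-learn versions can also be configured to return dataframes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# toy data, invented for illustration
df = pd.DataFrame({"A2": [1.0, np.nan, 3.0], "A3": [10.0, 20.0, np.nan]})

# with default settings the result is a plain NumPy array:
# column names and index are gone
arr = SimpleImputer(strategy="median").fit_transform(df)

# to keep exploring with pandas we have to rebuild the dataframe by hand
restored = pd.DataFrame(arr, columns=df.columns, index=df.index)
```

A transformer that returns a dataframe directly removes this bookkeeping step.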
4) Feature-engine is compatible with the Scikit-learn pipeline
Feature-engine transformers are compatible with the Scikit-learn pipeline. This allows the implementation of many feature engineering steps within a single Scikit-learn pipeline prior to training a machine learning algorithm, or obtaining its predictions from raw data. With Feature-engine, we can store an entire series of feature engineering transformations in a single object that can be saved and retrieved at a later stage, or placed in memory, for live scoring. Code examples follow later on in the blog.
5) Feature-engine automatically recognizes numerical and categorical variables
Feature-engine automatically recognizes numerical and categorical variables, thus preventing the risk of inadvertently applying categorical encoding to numerical variables or numerical imputation techniques to categorical variables.
This functionality also allows us to run the transformers without indicating which variables to transform: Feature-engine transformers apply numerical transformations to numerical variables and categorical transformations to categorical variables. As a result, we can obtain a benchmark machine learning pipeline on a given dataset very quickly, without a lot of data manipulation.
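One plausible way to separate variable types, sketched here with pandas on an invented toy dataframe, is to inspect the column dtypes. This is only an assumption about how such a check could work, not a description of Feature-engine's internals.

```python
import pandas as pd

# toy data; pclass is numeric in nature but stored as object,
# so it is treated as categorical
df = pd.DataFrame({
    "age": [22.0, 38.0],
    "fare": [7.25, 71.28],
    "sex": ["male", "female"],
    "pclass": pd.Series([3, 1], dtype="object"),
})

# numerical variables: any numeric dtype
numerical = df.select_dtypes(include="number").columns.tolist()

# categorical variables: everything that is not numeric
categorical = df.select_dtypes(exclude="number").columns.tolist()
```

Note that a number stored as object ends up in the categorical list, which is why re-casting discrete variables matters when we want them encoded as categories.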
6) Feature-engine alerts when transformations are not possible for certain variables
Feature-engine will alert when transformations are not possible. For categorical encoding, for example, Feature-engine will signal the unintended introduction of missing values. For variable transformations, Feature-engine will alert when the logarithm is applied to variables with negative values or when reciprocal transformations are applied to variables that contain 0s. This way, Feature-engine helps identify issues with the variables early on during the development of a machine learning engineering pipeline, so that we can choose a more suitable technique.
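The kind of check described above can be sketched in a few lines. The function name log_transform_checked and the error message are hypothetical, and the check also rejects zeros because the logarithm is undefined there too; Feature-engine's own messages and conditions may differ.

```python
import numpy as np
import pandas as pd


def log_transform_checked(X, variables):
    """Apply the natural log to the selected variables, failing loudly if impossible."""
    # the logarithm is undefined for zero and negative values
    if (X[variables] <= 0).any().any():
        raise ValueError(
            "Some variables contain zero or negative values; "
            "the logarithm cannot be applied."
        )
    X = X.copy()
    X[variables] = np.log(X[variables])
    return X


df = pd.DataFrame({"income": [10.0, 100.0, 1000.0], "balance": [5.0, 0.0, -3.0]})

ok = log_transform_checked(df, ["income"])  # works: all values are positive

try:
    log_transform_checked(df, ["balance"])  # fails: contains 0 and a negative value
    raised = False
except ValueError:
    raised = True
```

Failing at transformation time, rather than silently producing NaN or infinite values, surfaces the problem before it reaches the model.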
How to use Feature-engine
In the rest of the blog, we will show examples of how to use Feature-engine transformers for missing data imputation, categorical encoding, discretisation and variable transformation. Let’s begin with missing data imputation, which is typically the first step of a machine learning pipeline.
Feature-engine transformers learn parameters from data when the method fit() is used, and store these parameters within their attributes. These values can then be retrieved to transform new data. In the following sections, we will show how to instantiate and fit a transformer, and how to use a trained transformer to transform a train and a test set. For more details, please refer to Feature-engine’s documentation.
Missing data imputation refers to replacing missing observations by a statistical parameter derived from the available values of the variable. As an example of Feature-engine’s imputation capabilities, we will perform median imputation. Feature-engine’s MeanMedianImputer automatically selects all numerical variables in the dataset for imputation, ignoring the categorical variables. The transformer also offers the option to select the variables to impute, as we will show below.
In the walk-through below, you can see the imputer in action, using the median as the imputation_method, applied to the predictor variables of both the train and test sets. Mean imputation can be implemented similarly by simply replacing “median” with “mean” for imputation_method. If you wish to run the code below, first download and prepare the dataset as indicated here.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.missing_data_imputers import MeanMedianImputer

# Load dataset
data = pd.read_csv('creditApprovalUCI.csv')

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('A16', axis=1), data['A16'], test_size=0.3, random_state=0)

# Set up the imputer
median_imputer = MeanMedianImputer(
    imputation_method='median',
    variables=['A2', 'A3', 'A8', 'A11', 'A15'])

# fit the imputer: learns the median of each variable from the train set
median_imputer.fit(X_train)

# transform the data
X_train = median_imputer.transform(X_train)
X_test = median_imputer.transform(X_test)
After running the above code, the training set will not contain missing values in the variables A2, A3, A8, A11 and A15, and the output will be a dataframe, which allows us to continue with data exploration to, for example, understand the effect of this transformation on the distribution of the variables.
Categorical encoding includes techniques to transform variables that contain strings as values into numerical variables. To demonstrate how to use Feature-engine’s categorical encoders, we will perform count encoding, that is, we will replace the categories by the number of times they appear in the train set. We will use the Titanic dataset, which is publicly available on OpenML.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine import categorical_encoders as ce

# Load dataset
def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    data = data.replace('?', np.nan)
    # keep only the first character of the cabin (the deck letter)
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    data['embarked'] = data['embarked'].fillna('C')
    return data

data = load_titanic()

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'name', 'ticket'], axis=1),
    data['survived'], test_size=0.3, random_state=0)

# set up the encoder
# 'count' replaces each category by its number of observations;
# 'frequency' would use the fraction of observations instead
encoder = ce.CountFrequencyCategoricalEncoder(
    encoding_method='count',
    variables=['cabin', 'pclass', 'embarked'])

# fit the encoder
encoder.fit(X_train)

# transform the data
train_t = encoder.transform(X_train)
test_t = encoder.transform(X_test)
Feature-engine learns the category-to-number mappings from the train set, and stores them in the attribute encoder_dict_. The output is a dataframe, where the variables cabin, pclass and embarked are now numbers instead of strings.
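For intuition, the same idea can be sketched with plain pandas on a toy column (the values are invented). This is an illustration of the technique, not the encoder's actual code.

```python
import pandas as pd

train = pd.DataFrame({"embarked": ["S", "C", "S", "Q", "S"]})
test = pd.DataFrame({"embarked": ["C", "S", "X"]})

# learn the category-to-count mapping from the train set only
counts = train["embarked"].value_counts().to_dict()

# apply the learned mapping to new data; the unseen category "X"
# has no entry in the mapping and therefore becomes NaN
test_t = test["embarked"].map(counts)
```

The NaN produced for the unseen category is exactly the kind of unintended missing value that an encoder should flag.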
Discretisation involves sorting the values of continuous variables into discrete intervals, also called bins or buckets. Here, we will show how to perform discretisation using decision trees, a technique supported exclusively by Feature-engine. We will use the house prices dataset, which is available on Kaggle.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine import discretisers as dsc

# Load dataset
data = pd.read_csv('houseprice.csv')

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'], test_size=0.3, random_state=0)

# set up the discretisation transformer
disc = dsc.DecisionTreeDiscretiser(
    cv=3,
    scoring='neg_mean_squared_error',
    variables=['LotArea', 'GrLivArea'],
    regression=True)

# fit the transformer (the trees need the target)
disc.fit(X_train, y_train)

# transform the data
train_t = disc.transform(X_train)
test_t = disc.transform(X_test)
The output of the transformation is a discrete variable, where each discrete value is the prediction returned by the decision tree for the original value of the variable.
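The mechanics can be sketched with a single Scikit-learn tree on synthetic data. The data, the variable names and the fixed max_depth=2 are assumptions made for illustration; judging by its cv and scoring parameters, the Feature-engine transformer selects the tree through cross-validation rather than using a fixed depth.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# synthetic data: price grows roughly linearly with area
rng = np.random.default_rng(0)
area = rng.uniform(50, 300, size=200)
price = 1000 * area + rng.normal(0, 5000, size=200)
X = pd.DataFrame({"area": area})

# a shallow tree fitted on one variable splits it into a few intervals
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, price)

# each observation is replaced by the prediction of the leaf it falls into
area_disc = tree.predict(X)

# a tree of depth 2 has at most 4 leaves, hence at most 4 distinct values
n_bins = len(np.unique(area_disc))
```

Because the bin values are target means within each leaf, the discretised variable is monotonically related to the target within the limits of the tree.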
Mathematical transformations refer to the transformation of the original variable by applying any mathematical function, typically to try and obtain a Gaussian distribution. Here, we will demonstrate how to implement the Box-Cox transformation with Feature-engine:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine import variable_transformers as vt

# Load dataset
data = pd.read_csv('houseprice.csv')

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'], test_size=0.3, random_state=0)

# set up the variable transformer
tf = vt.BoxCoxTransformer(variables=['LotArea', 'GrLivArea'])

# fit the transformer: learns the Box-Cox lambda of each variable
tf.fit(X_train)

# transform the data
train_t = tf.transform(X_train)
test_t = tf.transform(X_test)
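To see what is being learned, here is a short sketch with SciPy on synthetic, strictly positive, right-skewed data (the data is invented). It shows the fit-then-transform logic of Box-Cox: lambda is estimated on the train set and reused on the test set. It is an illustration under these assumptions, not the transformer's internals.

```python
import numpy as np
from scipy import stats

# synthetic, strictly positive, right-skewed data
rng = np.random.default_rng(0)
train = rng.lognormal(mean=0.0, sigma=1.0, size=500)
test = rng.lognormal(mean=0.0, sigma=1.0, size=100)

# "fit": estimate the lambda that makes the train data most Gaussian-like
train_t, lam = stats.boxcox(train)

# "transform": reuse the lambda learned on the train set for new data
test_t = stats.boxcox(test, lmbda=lam)

# skewness before and after the transformation
skew_before = stats.skew(train)
skew_after = stats.skew(train_t)
```

Box-Cox is only defined for positive values, which is one of the situations where an alert at transformation time is useful.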
Outliers are those values of a variable that are extremely unusual given the rest of the values of that variable. Among its functionality, Feature-engine allows us to remove or censor outliers, based on the Gaussian approximation, the inter-quartile range (IQR) proximity rule or the percentiles. Here, we will demonstrate how to censor outliers by finding the variable limits using the IQR:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine import outlier_removers as outr

# Load dataset
def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    data = data.replace('?', np.nan)
    # keep only the first character of the cabin (the deck letter)
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    data['embarked'] = data['embarked'].fillna('C')
    data['fare'] = data['fare'].astype('float')
    data['fare'] = data['fare'].fillna(data['fare'].median())
    data['age'] = data['age'].astype('float')
    data['age'] = data['age'].fillna(data['age'].median())
    return data

data = load_titanic()

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'name', 'ticket'], axis=1),
    data['survived'], test_size=0.3, random_state=0)

# set up the capper
# distribution='skewed' finds the limits with the IQR proximity rule
# ('gaussian' would use the mean and standard deviation instead)
capper = outr.Winsorizer(
    distribution='skewed', tail='right', fold=3,
    variables=['age', 'fare'])

# fit the capper
capper.fit(X_train)

# transform the data
train_t = capper.transform(X_train)
test_t = capper.transform(X_test)
The output is a dataframe, where the values of the variables age and fare that were beyond the boundaries determined by the IQR are now replaced by those boundaries.
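The IQR proximity rule itself is simple enough to sketch with pandas. The toy values are invented, and the multiplier of 3 mirrors the fold parameter in the snippet above; only the right tail is capped, as in that example.

```python
import pandas as pd

# toy variable with one extreme value on the right tail
train = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# learn the boundary from the train set
q1, q3 = train.quantile(0.25), train.quantile(0.75)
iqr = q3 - q1
fold = 3
upper = q3 + fold * iqr

# censor (cap) values above the upper boundary
capped = train.clip(upper=upper)
```

Here the quartiles are 3.25 and 7.75, so the upper limit is 21.25 and the value 100 is replaced by it, while all other values are left as they were.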
Assembling Feature-engine transformers into the Scikit-learn pipeline
In the preceding sections, we showed how to implement each technique individually. When we build machine learning models, we usually perform various transformations on the variables. We can place all Feature-engine transformers within a Scikit-learn pipeline, to smooth data transformation and algorithm training, as well as to easily score new raw data. In the following code snippet, we apply a complete feature engineering pipeline to the house prices dataset, and then build a Lasso regression to predict house price, leveraging the power of the Scikit-learn pipeline:
import pandas as pd
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline as pipe
from sklearn.preprocessing import MinMaxScaler
from feature_engine import categorical_encoders as ce
from feature_engine import discretisers as dsc
from feature_engine import missing_data_imputers as mdi

# load dataset
data = pd.read_csv('houseprice.csv')

# drop some variables
data.drop(labels=['YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'Id'],
          axis=1, inplace=True)

# make a list of categorical variables
categorical = [var for var in data.columns if data[var].dtype == 'O']

# make a list of numerical variables
numerical = [var for var in data.columns if data[var].dtype != 'O']

# make a list of discrete variables
discrete = [var for var in numerical if len(data[var].unique()) < 20]

# categorical encoders work only with object type variables
# to treat numerical variables as categorical, we need to re-cast them
data[discrete] = data[discrete].astype('O')

# continuous variables
numerical = [
    var for var in numerical
    if var not in discrete and var not in ['Id', 'SalePrice']
]

# separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['SalePrice'], axis=1),
    data.SalePrice, test_size=0.1, random_state=0)

# set up the pipeline
price_pipe = pipe([
    # add a binary missing indicator
    ('continuous_var_imputer',
     mdi.AddMissingIndicator(variables=['LotFrontage'])),

    # replace NA by the median
    ('continuous_var_median_imputer',
     mdi.MeanMedianImputer(
         imputation_method='median',
         variables=['LotFrontage', 'MasVnrArea'])),

    # replace NA by adding the label "Missing"
    ('categorical_imputer',
     mdi.CategoricalVariableImputer(variables=categorical)),

    # discretise continuous variables using trees
    ('numerical_tree_discretiser',
     dsc.DecisionTreeDiscretiser(
         cv=3, scoring='neg_mean_squared_error',
         variables=numerical, regression=True)),

    # remove rare labels in categorical and discrete variables
    ('rare_label_encoder',
     ce.RareLabelCategoricalEncoder(
         tol=0.03, n_categories=1,
         variables=categorical + discrete)),

    # encode categorical and discrete variables using the target mean
    ('categorical_encoder',
     ce.MeanCategoricalEncoder(variables=categorical + discrete)),

    # scale features
    ('scaler', MinMaxScaler()),

    # Lasso
    ('lasso', Lasso(random_state=2909, alpha=0.005))
])

# train feature engineering transformers and Lasso
price_pipe.fit(X_train, np.log(y_train))

# predict
pred_train = price_pipe.predict(X_train)
pred_test = price_pipe.predict(X_test)
Note in the code above how we indicate which variables to transform within each of the Feature-engine transformers. Note also how easy it is to train the algorithm, and to obtain predictions, once all transformers are assembled within a pipeline. If we want to deploy this pipeline, we need only place one Python object in memory to do the job, or save and retrieve only one Python pickle, which contains the entire, pre-trained machine learning pipeline.
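Saving and restoring a fitted pipeline as a single object can be sketched as follows. To keep the example self-contained it uses a small pipeline with Scikit-learn steps only, on synthetic data, and serialises to bytes in memory; a pipeline containing Feature-engine transformers would be handled the same way, assuming all its steps are picklable.

```python
import pickle

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# synthetic regression data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(scale=0.1, size=50)

# fit a small pipeline: scaling followed by Lasso
fitted_pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("lasso", Lasso(alpha=0.005, random_state=0)),
]).fit(X, y)

# serialise the whole fitted pipeline into one object
# (pickle.dump with a file handle would write it to disk instead)
blob = pickle.dumps(fitted_pipe)

# later, or in another process: restore it and score raw data directly
restored_pipe = pickle.loads(blob)
same_predictions = np.allclose(fitted_pipe.predict(X), restored_pipe.predict(X))
```

As with any pickle, the restored object should be loaded with the same library versions that produced it, and only from trusted sources.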
Bonus: Scikit-learn wrapper
Scikit-learn transformers like the SimpleImputer or any of the variable scalers like the StandardScaler or the MinMaxScaler, transform the entire input dataset and return a NumPy array. If we want to apply these transformers to a subset of features, we can use the Scikit-learn wrapper available in Feature-engine. Here is how to do it:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from feature_engine.wrappers import SklearnTransformerWrapper

# Load dataset
data = pd.read_csv('houseprice.csv')

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'], test_size=0.3, random_state=0)

# set up the wrapper with the SimpleImputer
imputer = SklearnTransformerWrapper(
    transformer=SimpleImputer(strategy='mean'),
    variables=['LotFrontage', 'MasVnrArea'])

# fit the wrapper + SimpleImputer
imputer.fit(X_train)

# transform the data
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)
Feature-engine’s Scikit-learn wrapper allows the application of most Scikit-learn transformers to a selected feature subspace, returning a dataframe.
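The idea behind such a wrapper can be sketched in a few lines. The class SubsetWrapperSketch is a hypothetical, simplified illustration and not the code of SklearnTransformerWrapper; it assumes the wrapped transformer returns as many columns as it receives, and the toy data is invented.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


class SubsetWrapperSketch:
    """Hypothetical wrapper: apply a Scikit-learn transformer to selected columns."""

    def __init__(self, transformer, variables):
        self.transformer = transformer
        self.variables = variables

    def fit(self, X, y=None):
        # fit the wrapped transformer on the selected columns only
        self.transformer.fit(X[self.variables], y)
        return self

    def transform(self, X):
        # transform the selected columns and write them back into a copy,
        # so the output keeps column names, order and the other variables
        X = X.copy()
        X[self.variables] = np.asarray(self.transformer.transform(X[self.variables]))
        return X


train = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0],
    "MasVnrArea": [0.0, 100.0, np.nan],
    "Street": ["Pave", "Grvl", "Pave"],
})

wrapper = SubsetWrapperSketch(
    transformer=SimpleImputer(strategy="mean"),
    variables=["LotFrontage", "MasVnrArea"],
)
train_t = wrapper.fit(train).transform(train)
```

The categorical column Street passes through untouched, which is the behaviour we want when an imputer or scaler should only see part of the dataframe.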
Feature engineering is the process of taking a dataset and constructing explanatory variables, or predictor features, that are then passed onto the prediction model to train a machine learning algorithm. It is a crucial step in all machine learning models, but can be challenging and time consuming if you aren’t already deeply familiar with the knowledge domain.
Open source libraries with off-the-shelf algorithms for feature engineering and data transformation have a major edge over manually coding the transformation steps, as they enhance reproducibility while minimising the amount of coding required by the data scientist.
There is a growing number of open source libraries for variable transformation, which focus on different types of raw data, or engineering techniques, like Featuretools, Category encoders, Scikit-learn and Feature-engine. All of these libraries will help you streamline your data preparation pipelines.
In this blog, we explored the salient features of Feature-engine and its exhaustive battery of techniques for missing data imputation, categorical variable encoding, variable transformation, discretisation and outlier handling, and provided a few examples that show how easy it is to use.
To know more about Feature-engine, visit its dedicated documentation. To be notified of new Feature-engine releases, register at trainindata. For an overview of the feature engineering techniques included in Feature-engine, visit the blog “Feature engineering: A comprehensive overview”. For code implementations of feature engineering with Feature-engine and other libraries, check the book “Python Feature Engineering Cookbook”. Finally, for an in-depth understanding of each engineering technique, its advantages and shortcomings, its effect on the variables and the dataset, and when to apply each transformation, visit the course “Feature Engineering for Machine Learning”.
Thanks for reading!