AN OPEN SOURCE PYTHON PACKAGE TO CREATE REPRODUCIBLE FEATURE ENGINEERING STEPS AND SMOOTH MODEL DEPLOYMENT
Feature-engine allows you to design and store a feature engineering pipeline with bespoke procedures for different variable groups.
Missing Data Imputation
Feature-engine includes widely used techniques for missing data imputation, such as mean and median imputation, frequent category imputation, random sample imputation, and adding a missing indicator. Feature-engine also includes alternative techniques, like end of tail imputation.
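As a rough illustration of the idea behind median imputation combined with a missing indicator, here is a minimal sketch in plain Python. It is not Feature-engine's API: the function names `fit_median` and `impute_with_indicator` and the sample data are assumptions made up for this example.

```python
from statistics import median


def fit_median(train_values):
    """Learn the median from the observed (non-missing) training values."""
    observed = [v for v in train_values if v is not None]
    return median(observed)


def impute_with_indicator(values, fill_value):
    """Return the imputed column plus a 0/1 flag marking what was missing."""
    indicators = [1 if v is None else 0 for v in values]
    imputed = [fill_value if v is None else v for v in values]
    return imputed, indicators


# Hypothetical data: None marks a missing value.
train_age = [22, None, 35, 41, None, 29]
test_age = [None, 50]

# The median is learned from the train data only, then reused on test data.
age_median = fit_median(train_age)
train_imputed, train_missing = impute_with_indicator(train_age, age_median)
test_imputed, test_missing = impute_with_indicator(test_age, age_median)
```

The indicator column preserves the information that a value was originally missing, which the imputed column alone would hide.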
Feature-engine offers one of the most extensive collections of categorical variable encoding techniques, including one hot encoding, ordinal numbering, and count or frequency encoding, as well as more powerful techniques like target encoding and weight of evidence. Feature-engine also handles rare labels automatically.
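To make frequency encoding and rare label handling concrete, the following plain-Python sketch groups infrequent categories under a single label and then replaces each category with its training frequency. The function names, the `tol` threshold and the `"Rare"` replacement label are assumptions chosen for this illustration, not Feature-engine's actual interface.

```python
from collections import Counter


def fit_frequent_labels(train_labels, tol=0.2):
    """Return the labels whose share of the training data is at least `tol`."""
    counts = Counter(train_labels)
    n = len(train_labels)
    return {label for label, count in counts.items() if count / n >= tol}


def group_rare(labels, frequent, replacement="Rare"):
    """Replace every label not seen as frequent in training with `replacement`."""
    return [label if label in frequent else replacement for label in labels]


def fit_frequency_encoding(train_labels):
    """Map each label to its relative frequency in the training data."""
    counts = Counter(train_labels)
    n = len(train_labels)
    return {label: count / n for label, count in counts.items()}


# Hypothetical training column: two common cities and two rare ones.
train_city = ["London"] * 5 + ["Paris"] * 3 + ["Oslo", "Rome"]
test_city = ["Paris", "Madrid"]  # "Madrid" never appeared in training

frequent = fit_frequent_labels(train_city, tol=0.2)
train_grouped = group_rare(train_city, frequent)
encoding = fit_frequency_encoding(train_grouped)

# Unseen and rare test labels fall into the "Rare" group before encoding.
test_encoded = [encoding[label] for label in group_rare(test_city, frequent)]
```

Grouping rare labels first means that a category unseen during training still receives a valid encoding at prediction time.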
Feature-engine comes with the most popular methods of variable discretisation: equal width and equal frequency discretisation. Feature-engine also includes a method developed during the 2009 KDD data science competition which uses decision trees to automatically find the buckets for each variable.
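The difference between equal width and equal frequency discretisation can be sketched in a few lines of plain Python. The helper names and the skewed sample data below are assumptions for illustration only; the decision tree based method is not shown.

```python
from bisect import bisect_right


def equal_width_edges(values, bins):
    """Interior bin edges that split the value range into equally wide bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(1, bins)]


def equal_frequency_edges(values, bins):
    """Interior bin edges chosen so each bin holds roughly the same count."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // bins] for i in range(1, bins)]


def assign_bins(values, edges):
    """Return the bin index of each value; a value equal to an edge goes up."""
    return [bisect_right(edges, v) for v in values]


# Hypothetical skewed variable with one very large value.
train = [1, 2, 3, 4, 5, 6, 7, 8, 100]

width_edges = equal_width_edges(train, bins=3)
freq_edges = equal_frequency_edges(train, bins=3)

width_bins = assign_bins(train, width_edges)
freq_bins = assign_bins(train, freq_edges)
```

On skewed data like this, equal width bins leave almost every observation in the first bin, while equal frequency bins spread the observations evenly.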
Feature-engine allows you to cap variables at specific arbitrary values, or it can determine the capping values for you automatically, using the inter-quartile range proximity rule.
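A common form of the inter-quartile range (IQR) proximity rule places the limits at Q1 - 1.5 × IQR and Q3 + 1.5 × IQR. The sketch below assumes that form, with a `fold` factor of 1.5 and the "inclusive" quantile method from Python's `statistics` module; the function names and data are hypothetical, and Feature-engine's own defaults may differ.

```python
from statistics import quantiles


def fit_iqr_limits(train_values, fold=1.5):
    """Learn lower and upper capping limits from the training data."""
    q1, _, q3 = quantiles(train_values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - fold * iqr, q3 + fold * iqr


def cap(values, lower, upper):
    """Clip every value into the [lower, upper] interval."""
    return [min(max(v, lower), upper) for v in values]


# Hypothetical variable with one obvious outlier (90).
train = [10, 12, 14, 15, 16, 18, 20, 22, 90]

lower, upper = fit_iqr_limits(train)
train_capped = cap(train, lower, upper)

# New data is capped with the limits learned from the training data.
test_capped = cap([2, 17, 40], lower, upper)
```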
Engineer Individual Feature Groups
Feature-engine allows you to select a subset of variables for each engineering step. You can apply mean imputation to certain variables and random sample imputation to others. All engineering steps can be integrated into a machine learning pipeline to smooth model deployment.
Why Use Feature-engine?
LEVERAGE THE POWER OF WELL-ESTABLISHED TECHNIQUES
Feature-engine includes feature engineering techniques used extensively in industry and in data science competitions. Most of these techniques were gathered from the series of books released after the 2009 KDD data science competition.
SIMPLIFY YOUR MACHINE LEARNING PIPELINES
Feature-engine offers Scikit-learn-like functionality to create and store feature engineering steps that learn from train data and then transform test data. Each Feature-engine transformer learns and stores parameters from the train data through the fit() method, and transforms new data using these stored parameters with the transform() method.
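The fit() and transform() pattern described above can be illustrated with a small stand-alone class. This is a sketch of the pattern only: `MeanImputerSketch`, its `variables` argument and its `means_` attribute are hypothetical names, and rows are modelled as plain dictionaries rather than pandas dataframes.

```python
from statistics import mean


class MeanImputerSketch:
    """Toy transformer: learns column means in fit(), applies them in transform()."""

    def __init__(self, variables):
        # Only the listed variables are imputed; all others pass through.
        self.variables = variables

    def fit(self, rows):
        # Learn and store one mean per selected variable from the train data.
        self.means_ = {
            var: mean(row[var] for row in rows if row[var] is not None)
            for var in self.variables
        }
        return self

    def transform(self, rows):
        # Fill missing values using the stored means; never re-learn here.
        return [
            {
                key: self.means_[key] if key in self.means_ and value is None else value
                for key, value in row.items()
            }
            for row in rows
        ]


train = [
    {"age": 20, "income": 100},
    {"age": None, "income": 300},
    {"age": 40, "income": None},
]
test = [{"age": None, "income": None}]

imputer = MeanImputerSketch(variables=["age"]).fit(train)
test_transformed = imputer.transform(test)
```

Because the mean is stored on the object during fit(), the same value is applied to any later data, and variables outside the selected subset are left untouched.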
SMOOTH MODEL DEPLOYMENT
Feature-engine transformers are compatible with the Scikit-learn pipeline, allowing you to build and deploy one single Python object with all the required feature engineering, feature scaling and model training and scoring steps. You will only need to create, store and retrieve one pickle object in your APIs.
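To show why a single pipeline object simplifies deployment, here is a minimal plain-Python sketch of steps chained behind one fit() and one transform() call. It does not use Scikit-learn's Pipeline; `PipelineSketch`, `MedianImputerSketch` and `MaxCapperSketch` are hypothetical classes written for this illustration.

```python
from statistics import median


class MedianImputerSketch:
    """Fill missing values (None) with the median learned in fit()."""

    def fit(self, values):
        self.median_ = median(v for v in values if v is not None)
        return self

    def transform(self, values):
        return [self.median_ if v is None else v for v in values]


class MaxCapperSketch:
    """Cap values at the maximum seen in the training data."""

    def fit(self, values):
        self.cap_ = max(values)
        return self

    def transform(self, values):
        return [min(v, self.cap_) for v in values]


class PipelineSketch:
    """Run a list of steps in order; each step sees the previous step's output."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, values):
        for step in self.steps:
            values = step.fit(values).transform(values)
        return self

    def transform(self, values):
        for step in self.steps:
            values = step.transform(values)
        return values


# One object holds every learned parameter needed at prediction time.
pipe = PipelineSketch([MedianImputerSketch(), MaxCapperSketch()]).fit([1, None, 3, 10])
result = pipe.transform([None, 4, 50])
```

Since all learned parameters live inside `pipe`, it is the only object that needs to be stored and reloaded, for example with Python's pickle module.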
Feature-engine is built on top of Scikit-learn, pandas, NumPy and SciPy. Feature-engine is able to take in and return pandas dataframes to smooth the research phase of your data science project. Feature-engine also integrates well with the Scikit-learn pipeline, allowing you to build simplified machine learning pipelines and reduce the overhead of model deployment.
Feature-engine is available on PyPI and GitHub, and it can be easily installed with pip. Feature-engine’s documentation is growing, and the GitHub repository contains several Jupyter notebooks with examples of how to use it. Getting started with Feature-engine should be fairly easy.
Feature-engine’s feature engineering and variable encoding functionality is inspired by a series of articles with the winning solutions of the 2009 KDD competition.
The functionality, assumptions, advantages and limitations of each feature engineering step in Feature-engine are extensively covered in the course Feature Engineering for Machine Learning.