Testing and Monitoring Machine Learning Model Deployments
Updated: Jun 4
For years, businesses and developers have understood the importance of testing software before deployment. Before it can interface with customers in real time, a business naturally wants the software to function as expected. With the increasing demand for machine learning implemented in business, it’s reasonable to expect that machine learning models deployed into production need to be tested just as rigorously.
However, for many businesses, machine learning model deployments are relatively new, and some don’t have sufficient knowledge or a foundation in place to test them as rigorously as they test software. Though extensive testing of these models needs to happen in research and development, many other problems can also occur once live data enters the model.
In this blogpost, I’ll give a brief overview of what machine learning model deployment means and entails, along with some of the differences between testing these models in deployment as opposed to standard software. Next, I’ll discuss why testing machine learning models is important and the challenges these models might face after deployment. Finally, I’ll discuss methods for testing in order to address these challenges.
For details on the technical implementation of testing and monitoring machine learning model deployments, visit our online course.
Reproducibility in Model Deployment
The deployment of a machine learning model occurs when the business makes the model available in the production environment. Here, the model can take in live data as an input and give the most updated results to other software systems and customers.
Before this happens, ideally some testing occurs to ensure that the same machine learning model is reproducible between the research environment and the production environment. In other words, we’ve confirmed that given the same input, the model will return the same output in both environments.
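As a rough illustration of the "same input, same output" check, here is a minimal Python sketch. Everything in it is an assumption made for the example: the function name find_divergences, the two stand-in models, and the tolerance are hypothetical, and a real check would call the actual research and production pipelines.

```python
import math

def find_divergences(inputs, research_predict, production_predict, tol=1e-9):
    """Return (index, research_output, production_output) for rows where outputs differ."""
    divergences = []
    for i, row in enumerate(inputs):
        r = research_predict(row)
        p = production_predict(row)
        # Compare with a small tolerance, since floating point results can
        # differ in the last digits between environments.
        if not math.isclose(r, p, rel_tol=tol, abs_tol=tol):
            divergences.append((i, r, p))
    return divergences

# Hypothetical stand-ins for the two environments.
def research_model(row):
    return 0.3 * row["income"] + 0.7 * row["debt"]

def production_model(row):
    # Simulates a preprocessing mismatch: production rounds income first.
    return 0.3 * round(row["income"]) + 0.7 * row["debt"]

sample = [{"income": 10.0, "debt": 2.0}, {"income": 10.4, "debt": 2.0}]
divergences = find_divergences(sample, research_model, production_model)
for index, research_out, production_out in divergences:
    print(f"row {index}: research={research_out:.4f} production={production_out:.4f}")
```

Returning the divergent rows, rather than a single pass or fail flag, makes it easier to see where the two environments disagree.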
Reproducibility ensures that the business value generated by the model will translate from research into production. Significant time and resources go into optimizing the model in research to make sure that it maximizes the business value, but if the model isn’t reproducible, the value may not hold.
In practice, it’s hard to make the model truly reproducible in both environments for a variety of reasons, though I won’t dive too deeply into that here. However, if the models aren’t completely reproducible, testing hopefully allows us to understand where the models diverge at the very least, and we can proceed effectively enough with that knowledge.
In addition to testing the models between research and production, we can also test again after deployment; here, we’ll use the live data instead of historical data. This is a good starting place for testing the model, but reproducibility doesn’t necessarily mean the model maximizes business value. Many more tests should occur to assess the effectiveness of the model, which is what I’ll dive into in the remainder of this blogpost.
Shadow deployments are critical to effectively testing and monitoring a machine learning model. A shadow deployment describes the process of running production traffic through the model but not serving its predictions to customers. In the meantime, whatever old model or version in place serves predictions, and the results from the new model are stored for testing and analysis.
Even if a model behaves as it should in a staging environment, that won’t necessarily be the case in production, for a variety of reasons that I’ll discuss in this blogpost. Shadow deployments ensure that the machine learning model handles the live inputs and incoming load properly before customers actually utilize it. This way, the stakes remain low, but the business can assess the model under the actual conditions it will experience when deployed. Therefore, all of the testing I describe in this blog post should occur in shadow mode, if possible.
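To make the idea concrete, the sketch below shows one way a shadow deployment could be wired up in Python. It is only a sketch under assumptions: serve_with_shadow, the two toy models, and the in-memory log are hypothetical names for this example, and a production system would typically write to durable storage and run the shadow call off the request path.

```python
def serve_with_shadow(request, live_model, shadow_model, shadow_log):
    """Serve the live model's prediction; record the shadow model's output for analysis."""
    live_prediction = live_model(request)
    try:
        shadow_prediction = shadow_model(request)
        shadow_log.append(
            {"request": request, "live": live_prediction, "shadow": shadow_prediction}
        )
    except Exception as exc:
        # A failure in the shadow model must never affect what the customer receives.
        shadow_log.append(
            {"request": request, "live": live_prediction, "shadow_error": repr(exc)}
        )
    return live_prediction

# Hypothetical models: the current one in place and the new candidate.
def current_model(request):
    return 0.2

def candidate_model(request):
    return 1.0 / request["tenure_months"]  # fails when tenure_months is 0

log = []
requests = [{"tenure_months": 4}, {"tenure_months": 0}]
served = [serve_with_shadow(r, current_model, candidate_model, log) for r in requests]
```

Customers only ever see the values in served, while log holds the side-by-side results, including the error the candidate model raised on the second request.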
Machine Learning Model Testing vs. Software Testing
Naturally, a business wants to test any software before serving it to customers, so that it releases a functional product to the public. Software testing before deployment mainly consists of unit testing and integration testing.
Unit testing usually occurs first. This describes the evaluation of individual components of a software system. The focus here is internal consistency; each test has a narrow scope and shouldn’t depend on outside systems. It can also be referred to as component or module testing.
Integration testing happens afterwards. Once all the individual pieces function properly on their own, we need to ensure they interact together correctly as well. Common issues in the interface between two modules might include data exchange, function calling, or hardware issues.
Once all the individual modules are tested, we can combine them in progressively larger groups to test their interactions, building up until we test the entire software system. This process is more difficult than unit testing, as identifying the source of a problem is generally more complicated.
Machine learning model testing requires the unit testing and integration testing of standard software, but also requires much more. A machine learning model is built from a combination of code and data, and the data necessitates additional tests at various levels to ensure reliability.
Machine Learning Model Testing
Various monitoring steps occur throughout the entire machine learning pipeline. The image below offers an end-to-end view of a machine learning system, giving a sense of when different types of testing need to occur.
As you can see, it includes the typical software testing fundamentals as mentioned before such as unit testing, integration testing, and system monitoring. I won’t go into much detail about those here, as they are not unique to machine learning model deployments, but it’s worth acknowledging their presence.
In addition, the figure depicts tests implemented throughout the development of the model such as data quality tests, model performance tests, and machine learning infrastructure tests. None of these tests, however, will be the focus of this blog post, though they are very important as well.
Instead, I will discuss the testing and monitoring that needs to occur after the model deployment, ideally in shadow mode. This includes skew tests, data monitoring, and prediction monitoring.
Once satisfied with the results of these tests, we can fully deploy the models to start serving customers, but this doesn’t mean that testing should stop here. The testing and monitoring processes need to continue running even after full deployment to ensure that the model continues providing value even after some time has passed.
Compared to software testing, testing machine learning models in deployment is an underdeveloped area of exploration. Most organizations don’t understand what to test exactly in order to truly validate a model. It may be difficult to pin down what to address. To remedy this, I’ve identified the key reasons as to why testing after deployment is important to help guide the process.
Importance of Testing and Monitoring
As mentioned previously, machine learning models depend not only on code, but on data. A model may perform admirably in research, but if discrepancies in the data occur in the live environment, the same model will provide little use, despite its performance in research.
Representativeness of training data needs to be assessed in order to understand the viability of the model in the live environment. If the training data doesn’t represent the live data well, the model provides little business value during deployment.
Feature dependencies also need to be identified. In other words, we want to evaluate whether features are changing over time and if we’re getting the features we think we’re getting. Sometimes other teams within a business or even third parties create a feature, and perhaps there’s a misalignment regarding what the feature represents.
A relevant example involves affordability measures, which illustrate debt to income ratios. In some cases, the measure considers all debt, but in others, the definition changes slightly to capture only secured debt, such as car loans and mortgages. If this were a feature in a model, different definitions would change the distribution of this input dramatically. If this feature definition did not translate properly between research and production, the deployed model would falter.
In other instances, market changes alter the meaning of a feature, as is the case with the development of electric cars. Insurers need to provide pricing and servicing to these customers, even though no historic data for electric cars exists. They need to change their systems to accommodate this new fueling option. Businesses need to ensure they’re able to tackle changes like this and adapt to an ever-changing market.
Data dependencies need to be monitored. If there is a data outage, for example, we need to pick it up immediately, otherwise models will continue serving customers without appropriately accounting for the missing data. Data from third parties also may not always be available.
Model performance drift needs monitoring as well. A model may perform well when initially deployed, but perhaps its performance deteriorates over time. The business needs to monitor the model’s accuracy throughout its entire life in production so that they’ll know if it’s fallen below a predefined standard. If this occurs, the business can begin identifying why the model is worsening and what they can do to improve it.
These four challenges illustrate the importance of testing and monitoring machine learning models in deployment. Even if we test models thoroughly in the research environment, the model will always face these risks and hindrances while deployed in the live environment. The following tests will provide a means to assess the effect these problems may have on the deployed model and help ensure that the model provides maximal business value in the live environment.
Methods for Testing and Monitoring
Addressing all of the obstacles described above will require many different assessments, including different types of live data checks, skew tests, and model prediction and performance monitoring. The combination of all of these will address all the aforementioned challenges deployed models face.
Live Data Checks
The live data checks give a means to determine if we’re actually getting the data that we expect in the live environment. We will need to monitor the data inputs to make sure they resemble what we anticipate.
These tests involve checking that the inputs for each variable match what the model expects, though this looks slightly different for different variable types.
Categorical variables often have a small set of permitted values they can take on. For example, if we have a variable describing a person’s marital status, there are probably only four different values we should see as inputs: single, married, divorced, and widowed.
In instances like this, it’s relatively easy to code checks to see if all the actual inputs fall into one of these options.
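A check like this might look as follows in Python. This is a minimal sketch: the constant PERMITTED_MARITAL_STATUS and the helper invalid_values are names invented for the example, and the permitted set simply mirrors the four values mentioned above.

```python
# Permitted values for the marital status variable, hard coded as described above.
PERMITTED_MARITAL_STATUS = {"single", "married", "divorced", "widowed"}

def invalid_values(values, permitted):
    """Return (position, value) pairs for inputs outside the permitted set."""
    return [(i, v) for i, v in enumerate(values) if v not in permitted]

live_inputs = ["single", "married", "Married", None, "widowed"]
problems = invalid_values(live_inputs, PERMITTED_MARITAL_STATUS)
print(problems)  # [(2, 'Married'), (3, None)]
```

Note that the comparison is case sensitive, so "Married" is flagged. Whether to normalize case before checking is a design decision that should match what the model itself expects.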
On other occasions, however, categorical variables can take on a much larger range of values. For example, a variable describing the bank of a customer will have too many values to hard code validations of all of them. In these instances, one would need to use the training data to learn a list of reasonable values and then record values in the live data that haven’t been seen before.
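One possible way to learn the list from training data and record unseen live values is sketched below. The helper names, the min_count parameter, and the bank names are all hypothetical and chosen only for illustration.

```python
from collections import Counter

def learn_vocabulary(training_values, min_count=1):
    """Collect the values seen at least min_count times in the training data."""
    counts = Counter(training_values)
    return {value for value, count in counts.items() if count >= min_count}

def unseen_value_counts(live_values, vocabulary):
    """Count how often each previously unseen value appears in the live data."""
    return Counter(v for v in live_values if v not in vocabulary)

training_banks = ["Bank A", "Bank B", "Bank A", "Bank C"]
live_banks = ["Bank A", "Bank D", "Bank D", "Bank E"]

vocabulary = learn_vocabulary(training_banks)
unseen = unseen_value_counts(live_banks, vocabulary)
print(unseen.most_common())  # [('Bank D', 2), ('Bank E', 1)]
```

Recording counts rather than raising an error on the first unseen value lets us distinguish a rare new bank from a systematic change in how the field is populated.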
Numerical variables often have certain characteristics presumed of them that one can check. Most commonly, a numerical value might have an expected range it falls in. Perhaps users are asked to rate their level of agreement from one to seven. We know that the variable inputs should naturally be within this range and could have an error flagged if an input is outside it.
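For the one to seven rating example, a range check could be sketched like this in Python. The function name out_of_range is an assumption for the example, and treating non-numeric or NaN inputs as errors is one choice among several.

```python
import math

def out_of_range(values, low, high):
    """Return (position, value) pairs that are non-numeric, NaN, or outside [low, high]."""
    flagged = []
    for i, v in enumerate(values):
        is_number = isinstance(v, (int, float)) and not isinstance(v, bool)
        if not is_number or math.isnan(v) or not (low <= v <= high):
            flagged.append((i, v))
    return flagged

ratings = [1, 7, 0, 4, 8, float("nan")]
flagged = out_of_range(ratings, low=1, high=7)
print([position for position, _ in flagged])  # [2, 4, 5]
```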
Additionally, one can also use business knowledge about a variable to check if the inputs’ mean, median, or other statistics match what we might expect of them. This touches on the concept of evaluating distributions of data to assess them, which I’ll cover in more detail in the following section on skew tests.
What are we hoping to pick up with these tests? These live data checks will ultimately reveal problems with feature dependencies in the live model. They will confirm that the features arriving in the live model are, in fact, the features we expect.
Another set of tests is called the skew tests. These will help give an idea of how representative the training data is of the live data.
One of the most common and the simplest forms of this involves monitoring the percentage of missing data we’re seeing in the live data compared to the training data. Another common concern is the percentage of non-zero values, in cases where the variables are very skewed or sparse.
Missing data and non-zero values can be assessed with the Chi-squared test. I won’t go into too many of the specifics here, but the Chi-squared test essentially determines whether the difference between two proportions is larger than we would expect from chance alone.
Proportions here could mean, for example, the proportion of missing data in the training data compared to the proportion of missing data in the live data. The exact proportions will rarely be identical, but this test will give an idea if the proportions are different enough to suggest that the training data isn’t quite representative of the live data.
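The sketch below shows how this comparison could be computed for missing data using only the Python standard library. It assumes the plain Pearson Chi-squared statistic on a 2x2 table (missing versus present, training versus live) with no continuity correction; the function name and the counts are made up for the example. With one degree of freedom, the p-value can be written in terms of the complementary error function.

```python
import math

def chi_squared_two_proportions(missing_train, n_train, missing_live, n_live):
    """Pearson Chi-squared test (1 degree of freedom) on a 2x2 table of missing vs present."""
    a, b = missing_train, n_train - missing_train
    c, d = missing_live, n_live - missing_live
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        # A whole row or column is empty, so there is nothing to compare.
        return 0.0, 1.0
    statistic = n * (a * d - b * c) ** 2 / denominator
    # For one degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Hypothetical counts: 5% missing in training, 5.5% on a normal live day,
# and 40% missing during a simulated data outage.
stat_normal, p_normal = chi_squared_two_proportions(50, 1000, 55, 1000)
stat_outage, p_outage = chi_squared_two_proportions(50, 1000, 400, 1000)
print(f"normal day: chi2={stat_normal:.2f}, p={p_normal:.3f}")
print(f"outage:     chi2={stat_outage:.2f}, p={p_outage:.3g}")
```

A small p-value suggests the live proportion differs from the training proportion by more than chance would explain. The threshold for raising an alert is a business decision and is not fixed by the test itself.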
Distributions can be compared for numerical variables with either the Kolmogorov-Smirnov test or, again, the Chi-squared test. If the variables are continuous and not heavily skewed, the Kolmogorov-Smirnov test applies. Alternatively, if we divide the variable into bins, then the proportions of the data in each bin could be compared with the Chi-squared test, similarly to the missing or non-zero data case.
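For the continuous case, a two-sample Kolmogorov-Smirnov comparison could be sketched as below, again with only the standard library. The statistic is the largest gap between the two empirical distribution functions. The p-value uses the basic asymptotic series, which is an approximation chosen for this example and is less accurate for small samples; statistical libraries generally use more refined methods. The function names and the synthetic data are hypothetical.

```python
import math

def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical cumulative distributions."""
    a, b = sorted(sample_a), sorted(sample_b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Advance past every value equal to x in both samples, so ties are handled together.
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d

def ks_pvalue_asymptotic(d, n, m, terms=100):
    """Approximate p-value from the asymptotic Kolmogorov series (large samples assumed)."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:
        return 1.0  # the series converges slowly here, and the true value is essentially 1
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return max(0.0, min(1.0, 2.0 * total))

# Synthetic data: a training sample, a live sample from the same range, and a shifted one.
training = [i / 100 for i in range(100)]
live_similar = [(i + 0.5) / 100 for i in range(100)]
live_shifted = [0.305 + i / 100 for i in range(100)]

d_similar = ks_statistic(training, live_similar)
d_shifted = ks_statistic(training, live_shifted)
p_similar = ks_pvalue_asymptotic(d_similar, len(training), len(live_similar))
p_shifted = ks_pvalue_asymptotic(d_shifted, len(training), len(live_shifted))
```

Here the shifted live sample produces a much larger statistic and a much smaller p-value than the similar one, which is the kind of signal a skew test is meant to raise.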
What are we hoping to pick up with these tests? These tests will show how similar the training data is to the live data. This can give us a sense of how biased the training data was, or how much the current market has changed from the time the training data was gathered.
Perhaps a finance company used to be very conservative, but recently, they have expanded to take on more risk. In this case, the historic data used to train the models would no longer represent the current population accurately.
In another type of scenario, say we release a television ad at four o’clock in the afternoon, and consequently, a large wave of customers enters the system. There may be particular characteristics of people who watch television in the afternoon which could differ from the general spread of characteristics.
Skew tests would pick up instances like these, therefore giving insight into market changes or bias. Additionally, these can test data dependencies; for example, the Chi-squared test for missing values will capture events such as a data outage.
Model Prediction and Performance Monitoring
The final step for effective testing of deployed machine learning models is to monitor model performance. This process should begin by comparing the distribution of predictions on live data with the distribution of predictions on the training data, and it should continue by monitoring model performance throughout the model’s entire lifespan serving customers.
Prediction distributions can be compared again with either the Kolmogorov-Smirnov test or the Chi-squared test, depending on how we wish to go about grouping the predictions. Most predictions will be a continuous output, so the Kolmogorov-Smirnov test will apply; however, if we wanted to group predictions into classes, we could again compare the frequency of each of the groups with the Chi-squared test.
If the model predictions using the live data are vastly different from the predictions from the training data, this is likely a good indicator that some discrepancies in data still exist in the live environment.
Model performance monitoring might be a little trickier, depending on the nature of the target. If we’re predicting something that will happen fairly quickly, we can test the performance well, as we’ll be able to compare the predictions with the actual outcomes.
If the outcome won’t occur until a few months or even longer, then we won’t be able to compare with the predictions while still in shadow mode. However, the business could still run quarterly performance reviews while the model is live later on.
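Once actual outcomes are available, a periodic review against a predefined standard could be sketched as follows. This assumes a classification model scored by accuracy; the period labels, the record layout, and the 0.75 threshold are hypothetical values chosen for the example.

```python
def accuracy_by_period(records):
    """records: iterable of (period, predicted_label, actual_label). Returns {period: accuracy}."""
    totals, correct = {}, {}
    for period, predicted, actual in records:
        totals[period] = totals.get(period, 0) + 1
        correct[period] = correct.get(period, 0) + (predicted == actual)
    return {period: correct[period] / totals[period] for period in totals}

def periods_below_standard(accuracies, minimum_accuracy):
    """Return the periods whose accuracy fell below the predefined standard."""
    return sorted(p for p, acc in accuracies.items() if acc < minimum_accuracy)

# Hypothetical matched predictions and outcomes for two quarters.
records = [
    ("Q1", 1, 1), ("Q1", 0, 0), ("Q1", 1, 1), ("Q1", 0, 0),
    ("Q2", 1, 0), ("Q2", 0, 0), ("Q2", 1, 1), ("Q2", 0, 1),
]
accuracies = accuracy_by_period(records)
alerts = periods_below_standard(accuracies, minimum_accuracy=0.75)
print(accuracies)  # {'Q1': 1.0, 'Q2': 0.5}
print(alerts)      # ['Q2']
```

Accuracy is used here only because it is simple to illustrate. The same structure would work with whatever metric the business has agreed on as its standard.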
What are we hoping to pick up with these tests? With these tests, we can determine the drift in performance of the model. These will give an idea of how similarly the model performs in the live versus training environments, as well as how the model performs overall compared with the actual outcomes.
With these three groups of tests, we cover all the previously mentioned priority areas for assessment in machine learning model deployments. Data monitoring helps identify feature dependencies, skew tests assess representativeness of training data and data dependencies, and prediction monitoring evaluates model performance drift.
The market will never stop changing, and an imperfect world will always give us imperfect data. Though a model will never be perfect either, by employing these testing techniques, a business can understand any deficiencies in a model and work to alleviate them, ensuring that the business gives its customers the most accurate results it can.