The 5 Stages of Machine Learning Validation

TL;DR

  • Machine learning systems cannot be fully tested with traditional software testing techniques alone, because their behaviour is learned from data rather than explicitly programmed.
  • Machine learning validation is the process of assessing the quality of the machine learning system.
  • 5 different types of machine learning validations have been identified:
    – ML data validations: to assess the quality of the ML data
    – Training validations: to assess models trained with different data or parameters
    – Pre-deployment validations: final quality measures before deployment
    – Post-deployment validations: ongoing performance assessment in production
    – Governance & compliance validations: to meet government and organisational requirements
  • Implementing a machine learning validation process helps ensure ML systems are built to a high quality, remain compliant, and are trusted by the business, increasing adoption.

Contents

  • Introduction
  • What is machine learning validation?
  • The 5 stages of machine learning validation
    – ML data validations
    – Training validations
    – Pre-deployment validations
    – Post-deployment validations
    – Governance & compliance validations
  • Benefits of having an ML validation policy

Introduction

Figure 1: Graph comparing the barrier to entry for AI/ML with the associated risk over the past 8 years; as the barrier to entry decreases, the risk increases.

What is machine learning validation?

The 5 stages of machine learning validation

As shown below in Figure 2, 5 key stages of machine learning validation have been identified:

  1. ML Data validations
  2. Training validations
  3. Pre-deployment validations
  4. Post-deployment validations
  5. Governance & compliance validations

The remainder of this article breaks down each stage further, outlining what it is, the types of validations it includes, and examples for each category.

Figure 2: Diagram showing where the 5 stages of machine learning validation fit into the typical machine learning lifecycle.

1. ML data validations

  • Data Engineering validations — Identify general issues within the dataset, based on basic understanding and rules. This might include checking for null columns and NaN values throughout the data, as well as enforcing known ranges, for example confirming that values for an “Age” feature fall between 0 and 100 (see the sketch after this list).
  • ML-based data validations — Assess the quality of the data specifically for training a machine learning model. For example, ensuring the dataset is evenly distributed so the model is not biased towards, or disproportionately better at, a particular feature or value.
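As a rough illustration of the Data Engineering checks above, the sketch below turns a couple of them into simple pass/fail rules with pandas. The column names and the 0–100 age range are just the examples used in this article, not a fixed schema.

```python
import pandas as pd

def validate_raw_data(df: pd.DataFrame) -> list:
    """Run basic data engineering checks and return a list of failure messages."""
    failures = []

    # Check for completely empty (null) columns
    empty_cols = [col for col in df.columns if df[col].isna().all()]
    if empty_cols:
        failures.append(f"Null columns found: {empty_cols}")

    # Check for NaN values anywhere in the dataset
    if df.isna().any().any():
        failures.append("NaN values present in the dataset")

    # Known-range check for the example 'Age' feature
    if "Age" in df.columns and not df["Age"].between(0, 100).all():
        failures.append("'Age' values fall outside the expected 0-100 range")

    return failures

# Example usage: the NaN and range checks both fail here
df = pd.DataFrame({"Age": [25, 42, 105], "Income": [30_000, None, 55_000]})
print(validate_raw_data(df))
```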

As shown in Figure 3, below, it is best practice for the Data Engineering validations to be completed prior to your machine learning pipeline. Therefore, only the ML-based data validations should be performed within the machine learning pipeline itself.

Figure 3: Diagram showing where the two types of data validations sit in the data process. The Data Engineering validations run in the data stream before the data is pushed to the data lake, while the ML data validations run in the data pipeline after the data is pulled out of the data lake and into the machine learning workflows.

2. Training validations

Figure 4: An example showing how cross validation can be translated into a simple, actionable validation, by checking that the performance of each model is within a given range of the others.
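To make Figure 4 concrete, here is a minimal sketch using scikit-learn: the cross-validation scores become an actionable test by checking that every fold's score sits within a chosen spread of the others. The dataset, model and threshold are illustrative only.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
scores = cross_val_score(RandomForestRegressor(random_state=0), X, y, cv=5, scoring="r2")

# Actionable test: the fold scores should all sit within a given range of each other,
# otherwise the model's performance is unstable across data splits.
MAX_SPREAD = 0.15  # example threshold, tuned per use case
spread = scores.max() - scores.min()
if spread <= MAX_SPREAD:
    print(f"PASS: fold scores within {MAX_SPREAD} of each other (spread={spread:.2f})")
else:
    print(f"FAIL: fold scores vary by {spread:.2f}, exceeding {MAX_SPREAD}")
```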

Feature selection validations — Understanding how important or influential certain features are should also be a continuous process throughout the model’s lifecycle. Examples include removing features from the training set, or adding random noise features, and validating the impact this has on metrics such as performance and feature importance.
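One way to sketch this, assuming scikit-learn and the random-noise trick mentioned above, is to append a pure noise column and flag any real feature whose permutation importance is no higher than the noise:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

# Add a random noise column; any real feature ranked at or below it is suspect.
X_noisy = np.column_stack([X, rng.normal(size=len(X))])
X_train, X_test, y_train, y_test = train_test_split(X_noisy, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
importances = permutation_importance(model, X_test, y_test, random_state=0).importances_mean

noise_importance = importances[-1]        # importance of the noise column
weak_features = np.where(importances[:-1] <= noise_importance)[0]
print(f"Features no more important than random noise: {weak_features}")
```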

3. Pre-deployment validations

After model training is complete and a model is selected, the final model’s performance and behaviour should be validated outside of the training validation process. This involves creating actionable tests around measurable metrics. For example, this might include reconfirming the performance metrics are above a certain threshold.
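In practice these checks often end up as ordinary tests in the deployment pipeline. Below is a minimal sketch, assuming a trained classifier and a held-out dataset are available; the threshold is illustrative.

```python
from sklearn.metrics import accuracy_score

MIN_ACCURACY = 0.85  # agreed "go / no go" threshold for this example

def test_model_meets_accuracy_threshold(model, X_holdout, y_holdout):
    """Pre-deployment gate: fail the pipeline if accuracy drops below the threshold."""
    accuracy = accuracy_score(y_holdout, model.predict(X_holdout))
    assert accuracy >= MIN_ACCURACY, (
        f"Model accuracy {accuracy:.3f} is below the required {MIN_ACCURACY}"
    )
```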

When assessing the performance of a model, it is common practice to look at metrics such as accuracy, precision, recall, F1 score or a custom evaluation metric. However, we can take this a step further by assessing these metrics across different data slices within a dataset. For example, for a simple house price regression model, how does the model’s performance compare when predicting the price of a 2-bedroom property versus a 5-bedroom property? This information is rarely shared with users of the model, but it can be highly informative for understanding a model’s strengths and weaknesses, and therefore helps to grow trust in the model.

Figure 5: Example showing how a simple model performance metric, accuracy, can be broken down further to validate performance across different data slices, using the house price prediction example.
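A rough sketch of this per-slice check for the house price example is shown below; the column names, error metric and threshold are placeholders chosen for illustration.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_percentage_error

MAX_SLICE_MAPE = 0.20  # illustrative per-slice error limit

def validate_slices(df: pd.DataFrame, y_true: str, y_pred: str, slice_col: str):
    """Check the model's error separately for each data slice (e.g. bedroom count)."""
    for slice_value, group in df.groupby(slice_col):
        mape = mean_absolute_percentage_error(group[y_true], group[y_pred])
        status = "OK" if mape <= MAX_SLICE_MAPE else "FAIL"
        print(f"{slice_col}={slice_value}: MAPE={mape:.2%} [{status}]")

# Example: the model is noticeably weaker on 5-bedroom properties
predictions = pd.DataFrame({
    "bedrooms": [2, 2, 5, 5],
    "actual_price": [250_000, 280_000, 750_000, 820_000],
    "predicted_price": [245_000, 290_000, 600_000, 640_000],
})
validate_slices(predictions, "actual_price", "predicted_price", "bedrooms")
```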

Additional performance validations may include comparing the model to a random baseline model, to ensure the model is actually fitting the data, or testing that the model’s inference time is below a certain threshold when developing a low-latency use case.
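Both of these can also be written as pass/fail tests. The sketch below uses scikit-learn's DummyRegressor as a simple baseline and an illustrative latency budget; the trained model and data are assumed to already exist.

```python
import time
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

def validate_against_baseline(model, X_train, y_train, X_test, y_test):
    """The trained model should clearly beat a naive baseline that predicts the mean."""
    baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
    assert r2_score(y_test, model.predict(X_test)) > r2_score(y_test, baseline.predict(X_test)), (
        "Model does not outperform the naive baseline"
    )

def validate_inference_latency(model, X_sample, max_seconds=0.05):
    """Single-batch prediction latency should stay under the agreed budget."""
    start = time.perf_counter()
    model.predict(X_sample)
    assert time.perf_counter() - start <= max_seconds, "Inference is too slow"
```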

Other validations outside of performance can also be included. For example, the robustness of a model should be validated by checking single edge cases, or by checking that the model predicts accurately on a minimum set of data. Additionally, explainability metrics can be translated into validations, for example checking that an expected feature appears within the top N most important features.
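As a sketch, an edge-case robustness check might simply assert that the model returns finite, plausible predictions for a handful of hand-picked inputs; the bounds below are placeholders for the house price example.

```python
import numpy as np

def validate_edge_cases(model, edge_cases, min_price=0, max_price=5_000_000):
    """Robustness check: predictions for known edge cases must be finite and plausible."""
    predictions = np.asarray(model.predict(edge_cases))
    assert np.all(np.isfinite(predictions)), "Model produced NaN or infinite predictions"
    assert np.all((predictions >= min_price) & (predictions <= max_price)), (
        "Edge-case predictions fall outside the plausible price range"
    )
```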

It is important to reiterate that all of these pre-deployment validations take a measurable metric and build it into a pass/fail test. The validations act as a final “go / no go” before the model is used in production. Therefore, these validations act as a preventative measure to ensure that a high quality and transparent model is about to be used to make the business decisions it was built for.

4. Post-deployment validations (model monitoring)

Once the model has passed the pre-deployment stage, it is promoted into production. As the model is then making live decisions, post-deployment validations are used to continuously check the health of the model, to confirm it is still fit for production. Therefore, post-deployment validations act as a reactive measure.

As a machine learning model predicts an outcome based on the historical data it has been trained on, even a small change in the environment around the model can result in dramatically incorrect predictions. Model monitoring has become a widely adopted practice within the industry to calculate live model metrics. This might include rolling performance metrics, or a comparison of the distribution of the live and training data.

Similar to pre-deployment validations, post-deployment validation is the practice of taking these model monitoring metrics and turning them into actionable tests. Typically, this involves alerting. For example, if the live accuracy metric drops below a certain threshold, an alert is sent, triggering some sort of action, such as a notification to the Data Science team, or an API call to start a retraining pipeline.
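A minimal sketch of that alerting logic is shown below. The threshold, window size and alert action are all placeholders; in a real system the alert would hook into a monitoring tool, a notification channel or a retraining pipeline.

```python
from collections import deque

class RollingAccuracyMonitor:
    """Track live accuracy over a sliding window and alert when it drops too far."""

    def __init__(self, threshold=0.80, window=500):
        self.threshold = threshold
        self.outcomes = deque(maxlen=window)  # 1 = correct prediction, 0 = incorrect

    def record(self, prediction, actual):
        self.outcomes.append(int(prediction == actual))
        accuracy = sum(self.outcomes) / len(self.outcomes)
        if accuracy < self.threshold:
            self.alert(accuracy)

    def alert(self, accuracy):
        # Placeholder: notify the Data Science team or trigger a retraining pipeline here.
        print(f"ALERT: rolling accuracy {accuracy:.2%} dropped below {self.threshold:.0%}")
```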

Figure 6: Diagram showing the performance of a model decaying over time, and where a threshold can be added to trigger action before the model has a negative impact.

Post-deployment validations include:

  • Rolling performance calculations — If the machine learning system is able to gather feedback on whether a prediction was correct or not, performance metrics can be calculated on the fly. The live performance can then be compared to the training performance, to ensure the two are within a certain threshold of each other and not declining.
  • Outlier detection — By taking the distribution of the model’s training data, anomalies can be detected in real-time requests, by checking whether a data point sits within a certain range of the training data distribution. Going back to our Age example, if a new request contained “Age=105”, it would be flagged as an outlier, as it falls outside the distribution of the training data (which we previously defined as ranging from 0 to 100).
  • Drift detection — To identify when the environment around a model has changed. A common technique is to compare the distribution of the live data with the distribution of the training data, and check that the difference is within a certain threshold (see the sketch below). Using the “Age” example again, if the live inputs suddenly included a large number of requests with Age > 100, the distribution of the live data would change and have a higher median than the training data. If this difference exceeds the chosen threshold, drift is identified.
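As a sketch of the drift check described above, a two-sample Kolmogorov–Smirnov test from SciPy can compare the live and training distributions of a feature; the significance level and data are illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(training_values, live_values, alpha=0.05) -> bool:
    """Flag drift when the live distribution differs significantly from training."""
    statistic, p_value = ks_2samp(training_values, live_values)
    return p_value < alpha

# Example with the 'Age' feature: live traffic suddenly skews much older
training_age = np.random.default_rng(0).uniform(0, 100, size=5_000)
live_age = np.random.default_rng(1).uniform(60, 110, size=500)
print(detect_drift(training_age, live_age))  # True: drift detected
```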

A/B testing — Before promoting a new model version into production, or to find the best performing model on live data, A/B testing can be used. A/B testing sends a subset of traffic to model A, and a different subset of traffic to model B. By assessing the performance for each model with a chosen performance metric, the higher performing model can be selected and promoted to production.
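A simple sketch of the traffic-splitting side of A/B testing: hash-based routing means the same user is always served by the same model, which keeps each model's metrics comparable. The 50/50 split is illustrative.

```python
import hashlib

def route_request(user_id: str, split: float = 0.5) -> str:
    """Deterministically assign a user to model A or model B based on their ID."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "model_a" if bucket < split * 100 else "model_b"

# The same user always lands on the same model, so their outcomes can be
# attributed to that model when comparing performance metrics.
print(route_request("user-123"))
print(route_request("user-456"))
```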

5. Governance & compliance validations

Having a model up and running in production, and making sure it generates high quality predictions, is important. However, it is just as important, if not more so, to ensure that the model is making predictions in a fair and compliant manner. This includes meeting regulations set out by governing bodies, as well as aligning with your organisation’s specific company values.

As discussed in the introduction, recent news articles have shown some of the world’s largest organisations getting this very wrong, releasing biased and discriminatory machine learning models into the real world.

Regulations such as GDPR, the EU Artificial Intelligence Act and GxP are putting policies in place to ensure organisations use machine learning in a safe and fair manner.

These policies include things such as:

  • Understanding and identifying the risk of an AI-system (broken down into unacceptable risk, high risk and limited & minimal risk)
  • Ensuring PII data is not stored or used inappropriately
  • Ensuring protected features such as gender, race or religion are not used
  • Confirming the freshness of the data a model is trained on
  • Confirming a model is frequently retrained and up to date, and there are sufficient retraining processes in place

Organisations should define their own AI/ML compliance policy that aligns with these official regulations and with their company values. This will ensure they have the necessary processes and safeguards in place when developing any machine learning system.

This stage of the validation process spans all of the other validation stages discussed above. Having an appropriate ML validation process in place provides a framework for reporting on how a model has been validated at every stage, and hence for meeting the compliance requirements.

Benefits of having an ML validation policy

Having a suitable validation process implemented across all five stages of the machine learning pipeline will ensure:

  1. Machine learning systems are built to, and maintain, a high level of quality,
  2. The systems are fully compliant and safe to use,
  3. All stakeholders have visibility of how a model is validated, and of the value machine learning delivers.

Businesses should ensure they have the right processes and policies in place to validate the machine learning their technical teams are delivering. Additionally, Data Science teams should include validation design in the scoping phase of their machine learning systems. This will determine the tests a machine learning model must pass to move into, and remain in, production.

This will not only ensure businesses generate significant value from their machine learning systems, but also allow non-technical business users and stakeholders to trust the machine learning applications being delivered, thereby increasing the adoption of machine learning across organisations.
