ML Test Score
Introduction
As machine learning (ML) systems take on ever more central roles in real-world production settings, ML reliability has become increasingly critical. ML reliability involves a host of issues not found in small toy examples or even large offline experiments, and these issues can lead to surprisingly large amounts of technical debt. Testing and monitoring are important strategies for improving reliability, reducing technical debt, and lowering long-term maintenance costs.
However, testing an ML system is a more complex challenge than testing a manually coded system, because ML system behavior depends strongly on data and models that cannot be strongly specified a priori. One way to see this is to consider ML training as analogous to compilation, where the source is both code and training data. By that analogy, training data needs testing like code, and a trained ML model needs production practices like a binary does, such as debuggability, rollbacks, and monitoring.
Why it's different
Why do we need testing or quality assurance anyway?
The “subtle” difference between a production system and an offline or R&D example
ML systems are continuously evolving: from collecting and aggregating more data, to retraining models and improving their accuracy.
Quality control and assurance should be performed BEFORE consumption by users, to increase reliability and reduce bias in our systems.
Where do unit tests fit in software?
How is an ML system different?
Where do we test?
Where we should focus
Your Score
Data and Features
Machine learning systems differ from traditional software-based systems in that the behavior of ML systems is not specified directly in code but is learned from data. Therefore, while traditional software can rely on unit tests and integration tests of the code, here we attempt to add a sufficient set of tests of the data.
Feature expectations are captured in a schema
It is useful to encode intuitions about the data in a schema so they can be automatically checked. For example, an adult human is surely between one and ten feet in height. The most common word in English text is probably ‘the’, with other word frequencies following a power-law distribution. Such expectations can be used for tests on input data during training and serving.
How?
To construct the schema, one approach is to start with calculating statistics from training data, and then adjusting them as appropriate based on domain knowledge. It may also be useful to start by writing down expectations and then compare them to the data to avoid an anchoring bias. Visualization tools such as Facets can be very useful for analyzing the data to produce the schema. Invariants to capture in a schema can also be inferred automatically from your system’s behavior.
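As a minimal sketch of this approach, the snippet below derives simple per-column bounds from training data with pandas and checks a new batch against them. The column name height_ft and the bounds are illustrative assumptions, not part of any particular schema tool.

```python
# Minimal sketch of deriving and checking a data schema with pandas.
# The column name "height_ft" and its bounds are illustrative assumptions.
import pandas as pd

def build_schema(df: pd.DataFrame) -> dict:
    """Derive simple per-column expectations from training data statistics."""
    return {
        col: {"min": df[col].min(), "max": df[col].max()}
        for col in df.select_dtypes("number").columns
    }

def check_schema(df: pd.DataFrame, schema: dict) -> list:
    """Return a list of violations found in a new batch of data."""
    violations = []
    for col, expected in schema.items():
        if col not in df.columns:
            violations.append(f"missing column: {col}")
            continue
        if df[col].min() < expected["min"] or df[col].max() > expected["max"]:
            violations.append(f"{col}: values outside [{expected['min']}, {expected['max']}]")
    return violations

# Example: an adult height feature should fall between 1 and 10 feet.
train = pd.DataFrame({"height_ft": [5.1, 5.9, 6.3]})
schema = build_schema(train)
schema["height_ft"].update({"min": 1.0, "max": 10.0})  # adjust with domain knowledge
serving = pd.DataFrame({"height_ft": [5.7, 42.0]})     # 42 ft is clearly bad data
print(check_schema(serving, schema))
```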
All features are beneficial
A kitchen-sink approach to features can be tempting, but every feature added has a software engineering cost. Hence, it’s important to understand the value each feature provides in additional predictive power (independent of other features).
How?
Some ways to run this test are by computing correlation coefficients, by training models with one or two features, or by training a set of models that each have one of k features individually removed.
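For illustration, here is a hedged sketch of two such checks using scikit-learn on synthetic data: per-feature correlation with the label, and a leave-one-feature-out ablation. The data, feature count, and model choice are assumptions made only for the example.

```python
# Sketch of two quick feature-utility checks on synthetic data:
# 1) correlation of each feature with the label, 2) leave-one-feature-out ablation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))                 # 4 candidate features
y = (X[:, 0] + 0.1 * X[:, 1] > 0).astype(int)  # label mostly driven by feature 0

# 1. Correlation of each feature with the label.
for i in range(X.shape[1]):
    print(f"feature {i}: corr with label = {np.corrcoef(X[:, i], y)[0, 1]:+.3f}")

# 2. Leave-one-out ablation: how much does accuracy drop without feature i?
baseline = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
for i in range(X.shape[1]):
    ablated = np.delete(X, i, axis=1)
    score = cross_val_score(LogisticRegression(), ablated, y, cv=5).mean()
    print(f"without feature {i}: accuracy {score:.3f} (baseline {baseline:.3f})")
```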
No feature’s cost is too much
Including features that add only minimal predictive benefit is not only a waste of computing resources, but also an ongoing maintenance burden.
How?
To measure the costs of a feature, consider not only added inference latency and RAM usage, but also more upstream data dependencies, and additional expected instability incurred by relying on that feature.
Features adhere to meta-level requirements.
Your project may impose requirements on the data coming into the system. It might prohibit features derived from user data, prohibit the use of specific features such as age, or simply prohibit any feature that is deprecated. It might require that all features be available from a single source. However, during model development and experimentation, it is typical to try out a wide variety of potential features to improve prediction quality.
How?
Programmatically enforce these requirements, so that all models in production properly adhere to them.
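A minimal sketch of such an enforcement check is shown below; the prohibited feature names and the deprecated_ prefix convention are hypothetical examples of project-specific policy.

```python
# Sketch of programmatically enforcing meta-level feature requirements before a
# model spec is accepted. The prohibited list and naming convention are hypothetical.
PROHIBITED_FEATURES = {"age", "user_raw_query"}   # e.g. policy-restricted features
DEPRECATED_PREFIX = "deprecated_"

def validate_feature_list(features: list) -> None:
    """Raise if the model uses a prohibited or deprecated feature."""
    for name in features:
        if name in PROHIBITED_FEATURES:
            raise ValueError(f"feature '{name}' is prohibited by policy")
        if name.startswith(DEPRECATED_PREFIX):
            raise ValueError(f"feature '{name}' is deprecated and must not be used")

# Run as part of CI or a pre-deployment check for every production model spec.
validate_feature_list(["country", "num_clicks_7d"])   # passes
# validate_feature_list(["age", "country"])           # would raise
```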
The data pipeline has appropriate privacy controls.
Training data, validation data, and vocabulary files all have the potential to contain sensitive user data. While teams are often aware of the need to remove personally identifiable information (PII), during this kind of exporting and transformation, programming errors and system changes can lead to inadvertent PII leaks that may have serious consequences.
How?
Make sure to budget sufficient time during the development of new features that depend on sensitive data to allow for proper handling. Test that access to pipeline data is controlled as tightly as access to raw user data, especially for data sources that haven’t previously been used in ML. Finally, test that any user-requested data deletion propagates to the data in the ML training pipeline and to any learned models.
New features can be added quickly.
The faster a team can go from a feature idea to the feature running in production, the faster it can both improve the system and respond to external changes. For highly efficient teams, this can be as little as one to two months even for global-scale, high-traffic ML systems. Note that this can be in tension with Data 5, but privacy should always take precedence.
All input feature code is tested.
Feature creation code may appear simple enough to not need unit tests, but this code is crucial for correct behavior and so its continued quality is vital. Bugs in features may be almost impossible to detect once they have entered the data generation process, especially if they are represented in both training and test data.
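As a sketch, the pytest-style tests below exercise a hypothetical feature function normalize_country; the function and its mapping are invented purely to illustrate the kind of unit test meant here.

```python
# Sketch of a unit test for feature-creation code, using pytest-style asserts.
# normalize_country is a hypothetical feature function shown only for illustration.
def normalize_country(raw: str) -> str:
    """Feature code under test: map free-form country strings to ISO-like codes."""
    mapping = {"united states": "US", "usa": "US", "united kingdom": "GB"}
    return mapping.get(raw.strip().lower(), "UNKNOWN")

def test_normalize_country_handles_common_variants():
    assert normalize_country(" USA ") == "US"
    assert normalize_country("United Kingdom") == "GB"

def test_normalize_country_never_crashes_on_garbage():
    assert normalize_country("") == "UNKNOWN"
    assert normalize_country("???") == "UNKNOWN"
```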
Model Development
While the field of software engineering has developed a full range of best practices for developing reliable software systems, similar best-practices for ML model development are still emerging.
Model specs are reviewed and submitted to version control
It can be tempting to avoid code review out of expediency, and run experiments based on one’s own personal modifications. In addition, when responding to production incidents, it’s crucial to know the exact code that was run to produce a given learned model. For example, a responder might need to re-run training with corrected input data, or compare the result of a particular modification. Proper version control of the model specification can help make training auditable and improve reproducibility.
Offline and online metrics correlate
A user-facing production system’s impact is judged by metrics of engagement, user happiness, revenue, and so forth. A machine learning system is trained to optimize loss metrics such as log-loss or squared error. A strong understanding of the relationship between these offline proxy metrics and the actual impact metrics is needed to ensure that a better scoring model will result in a better production system.
How?
The offline/online metric relationship can be measured in one or more small scale A/B experiments using an intentionally degraded model.
All hyperparameters have been tuned.
An ML model can often have multiple hyperparameters, such as learning rates, number of layers, layer sizes, and regularization coefficients. The choice of hyperparameter values can have a dramatic impact on prediction quality.
How?
Methods such as grid search or a more sophisticated hyperparameter search strategy not only improve prediction quality, but can also uncover hidden reliability issues. Substantial performance improvements have been realized in many ML systems through the use of an internal hyperparameter tuning service.
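The following is a minimal grid-search sketch using scikit-learn on synthetic data; the estimator, parameter grid, and dataset are illustrative assumptions rather than a reference to any internal tuning service.

```python
# Minimal grid-search sketch over a single regularization hyperparameter.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # regularization strength
    cv=5,
)
grid.fit(X, y)
print("best params:", grid.best_params_, "best CV score:", round(grid.best_score_, 3))
```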
The impact of model staleness is known.
Many production ML systems encounter rapidly changing, non-stationary data. Examples include content recommendation systems and financial ML applications. For such systems, if the pipeline fails to train and deploy sufficiently up-to-date models, we say the model is stale. Understanding how model staleness affects the quality of predictions is necessary to determine how frequently to update the model. If predictions are based on a model trained yesterday versus last week versus last year, what is the impact on the live metrics of interest? Most models need to be updated eventually to account for changes in the external world; a careful assessment is important to decide how often to perform the updates.
How?
One way of testing the impact of staleness is with a small A/B experiment with older models. Testing a range of ages can provide an age-versus-quality curve to help understand what amount of staleness is tolerable.
A simpler model is not better.
Regularly testing against a very simple baseline model, such as a linear model with very few features, is an effective strategy both for confirming the functionality of the larger pipeline and for helping to assess the cost-benefit tradeoffs of more sophisticated techniques.
Model quality is sufficient on important data slices.
Slicing a data set along certain dimensions of interest can improve fine-grained understanding of model quality. Slices should distinguish subsets of the data that might behave qualitatively differently, for example, users by country, users by frequency of use, or movies by genre. Examining sliced data avoids having fine-grained quality issues masked by a global summary metric, e.g. global accuracy improved by 1% but accuracy for one country dropped by 50%. This class of problems often arises from a fault in the collection of training data that caused an important set of training data to be lost or late.
How?
Consider including these tests in your release process: release tests for models can impose absolute thresholds (e.g., error for slice x must be <5%) to catch large drops in quality, as well as incremental thresholds (e.g., the change in error for slice x must be <1% compared to the previously released model).
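A minimal sketch of such a release check appears below; the per-country error rates and both thresholds are made-up numbers chosen only to illustrate the absolute and incremental tests.

```python
# Sketch of a release-time check over per-slice error rates.
# Slice names, thresholds, and the previous model's numbers are hypothetical.
current_error = {"US": 0.031, "BR": 0.048, "JP": 0.044}   # error per country slice
previous_error = {"US": 0.030, "BR": 0.047, "JP": 0.038}

ABSOLUTE_LIMIT = 0.05     # e.g. error for any slice must be < 5%
REGRESSION_LIMIT = 0.01   # and must not worsen by more than 1 point vs. last release

failures = []
for slice_name, err in current_error.items():
    delta = err - previous_error[slice_name]
    if err >= ABSOLUTE_LIMIT:
        failures.append(f"{slice_name}: error {err:.3f} exceeds absolute limit")
    if delta >= REGRESSION_LIMIT:
        failures.append(f"{slice_name}: error regressed by {delta:.3f}")

if failures:
    raise SystemExit("release blocked:\n" + "\n".join(failures))
print("all slices within thresholds")
```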
The model is tested for considerations of inclusion.
There have been a number of recent studies on the issue of ML fairness, which may arise inadvertently due to factors such as the choice of training data. For example, Bolukbasi et al. found that a word embedding trained on news articles had learned some striking associations between gender and occupation that may have reflected the content of the news articles but may have been inappropriate for use in a predictive modeling context. This form of potentially overlooked bias in training data sets may then influence the larger system behavior.
How?
Diagnosing such issues is an important step for creating robust modeling systems that serve all users well. Tests that can be run include examining input features to determine if they correlate strongly with protected user categories, and slicing predictions to determine if prediction outputs differ materially when conditioned on different user groups.
Bolukbasi et al. propose one method for ameliorating such effects by projecting embeddings to spaces that collapse differences along certain protected dimensions. Hardt et al. propose a post-processing step in model creation to minimize disproportionate loss for certain groups. Finally, the approach of collecting more data to ensure representation of potentially under-represented categories or subgroups can be effective in many cases.
Infrastructure
An ML system often relies on a complex pipeline rather than a single running binary.
Training is reproducible.
Ideally, training twice on the same data should produce two identical models. Deterministic training dramatically simplifies reasoning about the whole system and can aid auditability and debugging. For example, optimizing feature generation code is a delicate process but verifying that the old and new feature generation code will train to an identical model can provide more confidence that the refactoring was correct. This sort of diff-testing relies entirely on deterministic training.
Unfortunately, model training is often not reproducible in practice, especially when working with non-convex methods such as deep learning or even random forests. This can manifest as a change in aggregate metrics across an entire dataset, or, even if the aggregate performance appears the same from run to run, as changes on individual examples.
Random number generation is an obvious source of nondeterminism, which can be alleviated with seeding. But even with proper seeding, initialization order can be underspecified so that different portions of the model will be initialized at different times on different runs leading to non-determinism. Furthermore, even when initialization is fully deterministic, multiple threads of execution on a single machine or across a distributed system may be subject to unpredictable orderings of training data, which is another source of non-determinism.
How?
Besides working to remove nondeterminism as discussed above, ensembling models can help.
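A small sketch of the seeding side of this, assuming a Python stack with NumPy and optionally TensorFlow or PyTorch; note that seeding alone does not guarantee determinism when data ordering or op-level nondeterminism is involved.

```python
# Sketch of reducing training nondeterminism by seeding every relevant source.
# Full determinism may also require controlling data ordering and
# framework-level determinism settings.
import os
import random
import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
os.environ["PYTHONHASHSEED"] = str(SEED)

# If using TensorFlow:
#   import tensorflow as tf
#   tf.random.set_seed(SEED)
# If using PyTorch:
#   import torch
#   torch.manual_seed(SEED)
```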
Model specs are unit tested.
Although model specifications may seem like “configuration”, such files can have bugs and need to be tested. Unfortunately, testing a model specification can be very hard. Unit tests should run quickly and require no external dependencies but model training is often a very slow process that involves pulling in lots of data from many sources.
How?
It’s useful to distinguish two kinds of model tests: tests of API usage and tests of algorithmic correctness.
ML APIs can be complex, and code using them can be wrong in subtle ways. Even if code errors would be apparent after training (due to a model that fails to train or results in poor performance), training is expensive and so the development loop is slow. We have found in practice that a simple unit test to generate random input data, and train the model for a single step of gradient descent is quite powerful for detecting a host of common library mistakes, resulting in a much faster development cycle. Another useful assertion is that a model can restore from a checkpoint after a mid-training job crash.
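Below is a hedged sketch of such a one-step smoke test, written here with PyTorch on random data; build_model is a hypothetical stand-in for whatever model-construction entry point your codebase exposes.

```python
# Sketch of a fast "one training step" smoke test on random data.
# build_model is a hypothetical stand-in for the real model-spec entry point.
import torch
from torch import nn

def build_model() -> nn.Module:
    return nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))

def test_single_training_step_runs_and_updates_weights():
    torch.manual_seed(0)
    model = build_model()
    x, y = torch.randn(32, 8), torch.randn(32, 1)
    before = [p.detach().clone() for p in model.parameters()]

    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()

    assert torch.isfinite(loss), "loss should be a finite number"
    assert any((p != b).any() for p, b in zip(model.parameters(), before)), \
        "at least one parameter should change after a gradient step"
```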
Testing correctness of a novel implementation of an ML algorithm is more difficult, but still necessary — it is not sufficient that code produces a model with high quality predictions, but that it does so for the expected reasons. One solution is to make assertions that specific subcomputations of the algorithm are correct, e.g. that a specific part of an RNN was executed exactly once per element of the input sequence. Another solution involves not training to completion in the unit test but only training for a few iterations and verifying that loss decreases with training. Still another is to purposefully train a model for overfitting: if one can get a model to effectively memorize its training data, then that provides some confidence that learning reliably happens. When testing models, pains should be taken to avoid “golden tests”, i.e., tests that partially train a model and compare the results to a previously generated model — such tests are difficult to maintain over time without blindly updating the golden file. In addition to problems in training non-determinism, when these tests do break they provide very little insight into how or why. Additionally, flaky tests remain a real danger here.
The ML pipeline is integration tested.
A complete ML pipeline typically consists of assembling training data, feature generation, model training, model verification, and deployment to a serving system. Although a single engineering team may be focused on a small part of the process, each stage can introduce errors that may affect subsequent stages, possibly even several stages away. That means there must be a fully automated test that runs
regularly and exercises the entire pipeline, validating that data and code can successfully move through each stage and that the resulting model performs well.
How?
The integration test should run both continuously as well as with new releases of models or servers, in order to catch problems well before they reach production. Faster running integration tests with a subset of training data or a simpler model can give faster feedback.
Model quality is validated before serving.
After a model is trained but before it actually affects real traffic, an automated system needs to inspect it and verify that its quality is sufficient; that system must either bless the model or veto it, terminating its entry to the production environment.
How?
It is important to test for both slow degradations in quality over many versions as well as sudden drops in a new version. For the former, setting loose thresholds and comparing against predictions on a validation set can be useful; for the latter, it is useful to compare predictions to the previous version of the model while setting tighter thresholds.
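As an illustration, the sketch below shows a simple bless-or-veto gate comparing a candidate model's validation AUC against the currently serving model; the metric, floor, and allowed drop are assumptions for the example.

```python
# Sketch of an automated "bless or veto" gate that compares a candidate model's
# validation metric to the currently serving model. Thresholds are illustrative.
def validate_candidate(candidate_auc: float, serving_auc: float,
                       long_term_floor: float = 0.80,
                       max_drop_vs_serving: float = 0.005) -> bool:
    """Return True if the candidate may be pushed to production."""
    if candidate_auc < long_term_floor:                     # guards against slow degradation
        return False
    if serving_auc - candidate_auc > max_drop_vs_serving:   # guards against sudden drops
        return False
    return True

assert validate_candidate(candidate_auc=0.86, serving_auc=0.85) is True
assert validate_candidate(candidate_auc=0.79, serving_auc=0.85) is False
```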
The model is debuggable.
When someone finds a case where a model is behaving bizarrely, how difficult is it to figure out why? Is there an easy, well documented process for feeding a single example to the model and investigating
the computation through each stage of the model (e.g. each internal node of a neural network)? Observing the step-by-step computation through the model on small amounts of data is an especially useful debugging strategy for issues like numerical instability.
How?
An internal tool that allows users to enter examples and see how a specific model version interprets them can be very helpful. The TensorFlow debugger is one example of such a tool.
Models are canaried before serving.
Offline testing, however extensive, cannot by itself guarantee the model will perform well in live production settings, as the real world often contains significant non-stationarity or other issues that limit the utility of historical data. Consequently, there is always some risk when turning on a new model in production.
One recurring problem that canarying can help catch is a mismatch between model artifacts and serving infrastructure. Modeling code can change more frequently than serving code, so there is a danger that an older serving system will not be able to serve a model trained from newer code. For example, a refactoring in the core learning library might change the low-level implementation of an operation Op in the model from Op0.1 to a more efficient implementation, Op0.2. A newly trained model will thus expect to be implemented with Op0.2; an older deployed server will not include Op0.2 and so will refuse to load the model.
How?
To mitigate the mismatch issue, one approach is testing that a model successfully loads into production serving binaries and that inference on production input data succeeds. To mitigate the new-model risk more generally, one can turn up new models gradually, running old and new models concurrently, with new models only seeing a small fraction of traffic, gradually increased as the new model is observed to behave sanely.
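A rough sketch of the load-and-infer half of that check is below; the pickle-based loader and the model's predict interface are hypothetical stand-ins for whatever your serving stack actually provides.

```python
# Sketch of a canary check: confirm the new model artifact loads and can run
# inference on logged production inputs. The loader and model interface are
# hypothetical stand-ins for a real serving stack.
import pickle

def canary_check(model_path: str, sample_production_inputs: list) -> None:
    with open(model_path, "rb") as f:        # stand-in for the real model loader;
        model = pickle.load(f)               # loading must not fail (e.g. unknown ops)
    for example in sample_production_inputs:
        prediction = model.predict([example])
        assert prediction is not None, "inference failed on a logged production example"
    print(f"canary OK: served {len(sample_production_inputs)} examples")
```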
Serving models can be rolled back.
A model “roll back” procedure is a key part of incident response to many of the issues that can be detected by the monitoring discussed in the next section. Being able to quickly revert to a previous known-good state is as crucial with ML models as with any other aspect of a serving system. Because rolling back is an emergency procedure, operators should practice doing it normally, when not in emergency conditions.
Monitoring Tests
It is crucial to know not just that your ML system worked correctly at launch, but that it continues to work correctly over time. An ML system is by definition making predictions on previously unseen data, and typically also incorporates new data over time into training. The standard approach is to monitor the system, i.e. to have a constantly-updated “dashboard” user interface displaying relevant graphs and statistics, and to automatically alert the engineering team when particular metrics deviate significantly from expectations. For ML systems, it is important to monitor serving systems, training pipelines, and input data. Here we recommend specific metrics to monitor throughout the system. The usual sorts of incident response approaches apply; one response unique to ML is to roll back not the system code but the learned model, hence the earlier test to regularly ensure that this process is safe and easy.
Dependency changes result in notification.
ML systems typically consume data from a wide array of other systems to generate useful features. Partial outages, version upgrades, and other changes in the source system can radically change the feature’s meaning and thus confuse the model’s training or inference, without necessarily producing values that are strange enough to trigger other monitoring.
How?
Make sure that your team is subscribed to and reads announcement lists for all dependencies, and make sure that the dependent team knows your team is using the data.
Data invariants hold for inputs.
It can be difficult to effectively monitor the internal behavior of a learned model for correctness, but the input data should be more transparent. Consequently, analyzing and comparing data sets is the first line of defense for detecting problems where the world is changing in ways that can confuse an ML system.
How?
Using the schema constructed in test Data 1, measure whether the data matches the schema and alert when it diverges significantly. In practice, careful tuning of alerting thresholds is needed to achieve a useful balance between false positive and false negative rates so that these alerts remain useful and actionable.
Training and serving are not skewed.
The codepaths that actually generate input features may differ at training and inference time. Ideally the different codepaths should generate the same values, but in practice a common problem is that they do not. This is sometimes called “training/serving skew” and requires careful monitoring to detect and avoid. As one concrete example, imagine adding a new feature to an existing production system. While the value of the feature in the serving system might be computed based on data from live user behavior, the feature will not be present in training data, and so must be backfilled by imputing it from other stored data, likely using an entirely independent codepath. Another example is when the computation at training time is done using code that is highly flexible (for easy experimentation) but inefficient, while at serving time the same computation is heavily optimized for low latency.
How?
To measure this, it is crucial to log a sample of actual serving traffic. For systems that use serving input as future training data, adding identifiers to each example at serving time will allow direct comparison; the feature values should be perfectly identical at training and serving time for the same example. Important metrics to monitor here are the number of features that exhibit skew, and the number of examples exhibiting skew for each skewed feature.
Another approach is to compute distribution statistics on the training features and the sampled serving features, and ensure that they match. Typical statistics include the minimum, maximum, or average values, the fraction of missing values, etc. Again, thresholds for alerting on these metrics must be carefully tuned to ensure a low enough false positive rate for actionable response.
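For example, a minimal sketch of the distribution-statistics comparison, with synthetic feature values and an illustrative relative tolerance:

```python
# Sketch of comparing summary statistics of a feature between training data and
# sampled serving traffic. The tolerance and feature values are illustrative.
import numpy as np

def skew_alerts(train_values, serving_values, rel_tolerance=0.05):
    """Return alert messages if basic statistics diverge beyond the tolerance."""
    alerts = []
    for stat_name, fn in [("mean", np.mean), ("min", np.min), ("max", np.max)]:
        t, s = fn(train_values), fn(serving_values)
        if abs(t - s) > rel_tolerance * (abs(t) + 1e-9):
            alerts.append(f"{stat_name} skew: train={t:.3f} serving={s:.3f}")
    return alerts

train = np.random.default_rng(0).normal(5.0, 1.0, size=10_000)
serving = np.random.default_rng(1).normal(5.6, 1.0, size=10_000)  # shifted distribution
print(skew_alerts(train, serving))   # expect skew alerts for the shifted distribution
```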
Models are not too stale.
In test Model 4 we discussed testing the effect that an old (“stale”) model has on prediction quality. Here, we recommend monitoring how old the model in production is, using the prior measurement as a guide for determining what age is problematic enough to raise an alert. Surprisingly, infrequently updated models also incur a maintenance cost. Imagine a model that is manually retrained once or twice a year by a given engineer. If that engineer leaves the team, this process may be difficult to replicate; even carefully written instructions may become stale or incorrect over this kind of time horizon.
How?
For models that re-train regularly (e.g. weekly or more often), the most obvious metric is the age of the model in production. It is also important to measure the age of the model at each stage of the training pipeline, to quickly determine where a stall has occurred and react appropriately. Even for models that re-train more infrequently, there is often a dependence on data aggregation or other such
processes to produce features, which can themselves grow stale. For example, consider using a feature based on the most popular n items (movies, apps, cars, etc). The process that computes the top-n table must be re-run frequently, and it is crucial to monitor the age of this table, so that if the process stops running, alerts will fire.
Models are numerically stable.
Invalid or implausible numeric values can potentially crop up during model training without triggering explicit errors, and knowing that they have occurred can speed diagnosis of the problem.
How?
Explicitly monitor the initial occurrence of any NaNs or infinities. Set plausible bounds for weights and the fraction of ReLU units in a layer returning zero values, and trigger alerts during training if these exceed appropriate thresholds.
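A small sketch of such checks, assuming the weights and ReLU activations of a layer are available as NumPy arrays and using illustrative thresholds:

```python
# Sketch of per-step numerical-stability checks during training: flag NaN/Inf
# values, implausible weight magnitudes, and an excessive fraction of dead ReLU
# units. Thresholds are illustrative assumptions.
import numpy as np

def numeric_health_alerts(weights: np.ndarray, relu_activations: np.ndarray,
                          max_abs_weight: float = 1e3,
                          max_dead_fraction: float = 0.5) -> list:
    alerts = []
    if not np.isfinite(weights).all():
        alerts.append("NaN or Inf detected in weights")
    if np.abs(weights).max() > max_abs_weight:
        alerts.append("weight magnitude exceeds plausible bound")
    dead = (relu_activations == 0).mean()
    if dead > max_dead_fraction:
        alerts.append(f"{dead:.0%} of ReLU units are outputting zero")
    return alerts

# Example: a layer where most units have died and one weight blew up.
print(numeric_health_alerts(np.array([0.1, -2e4, 0.3]),
                            np.array([0.0, 0.0, 0.0, 1.2])))
```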
Computing performance has not regressed.
The computational performance (as opposed to predictive quality) of an ML system is often a key concern at scale. Deep neural networks can be slow to train and run inference on; wide linear models with feature crosses can use a lot of memory; any ML model may take days to train; and so forth. Swiftly reacting to changes in this performance due to changes in data, features, modeling, or the underlying compute library or infrastructure is crucial to maintaining a performant system.
How?
While measuring computational performance is a standard part of any monitoring, it is useful to slice performance metrics not just by the versions and components of code, but also by data and model versions. Degradations in computational performance may occur with dramatic changes (for which comparison to the performance of prior versions or time slices can help detection) or as slow leaks (for which a pre-set alerting threshold can help detection).
Prediction quality has not regressed.
Validation data will always be older than real serving input data, so measuring a model’s quality on that validation data before pushing it to serving is only an estimate of quality metrics on
actual live serving inputs. However, it is not always possible to know the correct labels even shortly after serving time, making quality measurement difficult.
How?
Here are some options to make sure that there is no degradation in served prediction quality due to changes in data, differing codepaths, etc.
- Measure statistical bias in predictions, i.e. the average of predictions in a particular slice of data. Generally speaking, models should have zero bias, in aggregate and on slices (e.g. 90% of predictions of probability 0.9 should in fact be positive). Knowing that a model is unbiased is not enough to know it is any good, but knowing there is bias can be a useful canary to detect problems.
- In some tasks, the label actually is available immediately or soon after the prediction is made (e.g. will a user click on an ad). In this case, we can judge the quality of predictions in almost real-time and identify problems quickly.
- Finally, it can be useful to periodically add new training data by having human raters manually annotate labels for logged serving inputs. Some of this data can be held out to validate the served predictions.
However the measurement is done, thresholds for acceptable quality must be set (e.g., based on bounds of quality at the launch of the initial system), and a responder should be notified immediately if quality drifts outside that threshold. As with computational performance, it is crucial to monitor both dramatic and slow-leak regressions in prediction quality.
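As a concrete illustration of the statistical-bias check from the first option above, the sketch below compares the average predicted probability on a slice against the observed positive rate; the slice data and threshold are invented for the example.

```python
# Sketch of a statistical-bias check: on a slice of traffic where labels arrive
# quickly, the mean predicted probability should roughly match the observed
# positive rate. Slice data and threshold are illustrative.
import numpy as np

def bias_alert(predicted_probs, observed_labels, max_abs_bias=0.02):
    """Return an alert string if predictions are biased on this slice, else None."""
    bias = float(np.mean(predicted_probs) - np.mean(observed_labels))
    if abs(bias) > max_abs_bias:
        return f"prediction bias {bias:+.3f} exceeds {max_abs_bias} on this slice"
    return None

probs = np.array([0.9, 0.8, 0.85, 0.95])   # model says ~87% positive on average
labels = np.array([1, 0, 1, 0])            # but only 50% were actually positive
print(bias_alert(probs, labels))
```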
Your Score
Because technical debt is difficult to quantify, it can be difficult to prioritize paying it down or to measure improvements. To address this, our rubric provides a quantified ML Test Score which can be measured and improved over time. This provides a way to incentivize ML system developers to achieve strong levels of reliability, by giving a clear indicator of readiness and clear guidelines for how to improve.
How to Calculate
The final test score is computed as follows:
- For each test, half a point is awarded for executing the test manually, with the results documented and distributed.
- A full point is awarded if there is a system in place to run that test automatically on a repeated basis.
- Sum the score for each of the 4 sections individually.
- The final ML Test Score is computed by taking the minimum of the scores for each of the 4 sections.
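A tiny sketch of the arithmetic, with made-up per-test results (0 for not run, 0.5 for manual and documented, 1.0 for automated and repeated):

```python
# Sketch of computing the final ML Test Score from per-test results.
# Each test scores 0 (not run), 0.5 (run manually and documented), or 1.0
# (run automatically and repeatedly); the section contents here are illustrative.
section_scores = {
    "data":           [1.0, 1.0, 0.5, 0.5, 1.0, 0.0, 1.0],
    "model":          [0.5, 1.0, 1.0, 0.5, 1.0, 0.5, 1.0],
    "infrastructure": [1.0, 0.5, 1.0, 1.0, 0.0, 1.0, 0.5],
    "monitoring":     [1.0, 1.0, 0.5, 1.0, 1.0, 0.5, 1.0],
}

section_totals = {name: sum(tests) for name, tests in section_scores.items()}
ml_test_score = min(section_totals.values())
print(section_totals, "-> ML Test Score:", ml_test_score)
```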
We choose the minimum because we believe all four sections are important, and so a system must consider all in order to raise the score. One downside of this approach is that it reduces the extent to which an individual’s efforts are reflected in higher system scores and ranks; it remains to be seen how this will affect the adoption of our system.
0 points: Not production ready
1–2 points: Might have reliability holes
3–4 points: Reasonably tested
5–6 points: Good level of testing
7+ points: Very strong levels of automated testing