Statistical Metamorphic Testing using the API
This intermediate-level tutorial demonstrates the Causal Testing Framework’s (CTF) core Python API, using statistical metamorphic testing as an example application. Assuming basic Python familiarity, it shows how to construct causal test cases that vary the sampling window dimensions of a Poisson Line Tessellation (PLT) model by customising the API. Theoretical background is in Section 5.1 of our paper, with further resources at the end of the notebook.
Step 1: Defining your Input Configurations
Before diving into the details, a good first step is to define your file paths, including your input configurations:
[1]:
# Step 1: Define your file paths
from pathlib import Path
base_dir = Path.cwd() # Get the current working directory
data_dir = base_dir / "data" # Define your data directory here
dag_file = "dag.dot" # Define your DAG filename here
data_file = "data/random/data_random_1000.csv" # This is your input runtime data (usually a .csv file)
output_file = "causal_test_results.json" # This is your output file containing the causal test results
Under the hood, the CTF consists of three main components, namely a Modelling Scenario, a Causal Directed Acyclic Graph (DAG), and a Causal Test Case. The modelling scenario specifies the (or part of the) system under test by defining the observable variables and any constraints that exist between those variables. The CTF currently supports three types of variables:
Inputvariables, which are inputs to the system.Outputvariables, which are outputs from the system.Metavariables, which are not directly observable but can be related to the system in terms of otherInputorOutputvariables.
The causal DAG encodes information about the expected causal structure of the system through nodes representing variables and directed edges representing causal relationships, which is a model of how the data could have been generated. Together, the Causal DAG and modelling scenario form the causal specification.
Note: The CTF currently doesn’t support native visualisation tools, but it is possible to use existing frameworks such as NetworkX to visualise your DAG. Alternatively, browser-based environments such as DAGitty may also be useful.
Step 2: Read in the Data
At this point, it might be worthwhile to interrogate your data and apply any pre-processing, transforming or cleaning as necessary. However, for the purposes of this tutorial, there won’t be any additional processing required. Section 5.13 of our paper explains in detail how this dataset was generated.
[2]:
import pandas as pd
df = pd.read_csv(data_file, index_col=0)
df
[2]:
| width | height | intensity | num_lines_abs | num_shapes_abs | num_lines_unit | num_shapes_unit | |
|---|---|---|---|---|---|---|---|
| 0 | 4.908266 | 8.367783 | 10.684103 | 278 | 13991 | 6.768716 | 340.651464 |
| 1 | 9.239562 | 0.632306 | 12.826541 | 238 | 2603 | 40.737877 | 445.549133 |
| 2 | 5.144895 | 9.633530 | 15.219349 | 425 | 33859 | 8.574859 | 683.143879 |
| 3 | 3.489453 | 4.535775 | 8.362393 | 126 | 3305 | 7.960892 | 208.815469 |
| 4 | 7.985650 | 1.090290 | 1.032276 | 24 | 74 | 2.756506 | 8.499228 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 5 | 8.198583 | 2.479352 | 7.394970 | 152 | 3378 | 7.477674 | 166.181475 |
| 6 | 5.539523 | 9.453742 | 12.416852 | 378 | 22500 | 7.217980 | 429.641692 |
| 7 | 1.182998 | 4.851435 | 4.361107 | 53 | 534 | 9.234678 | 93.043734 |
| 8 | 2.438737 | 1.708604 | 1.107249 | 6 | 14 | 1.439942 | 3.359864 |
| 9 | 4.799560 | 0.032911 | 9.314464 | 94 | 129 | 595.097328 | 816.676121 |
1000 rows × 7 columns
In this case, the PLT model has three positive floating-point input parameters: thee width and height of the sampling window, and the intensity of the Poisson process. The model then outputs the total number of lines intersecting the sampling window, and the number of polygons formed by the intersecting lines. Note: in this dataset, the output variables appended by the suffix _unit are normalised with respect to their respective areas (width*height).
Step 3: Create a Modelling Scenario
[3]:
from causal_testing.specification.variable import Input, Output
from causal_testing.specification.causal_dag import CausalDAG
from causal_testing.specification.scenario import Scenario
# Define the input variables
width = Input("width", float)
height = Input("height", float)
intensity = Input("intensity", float)
# Define the output variables
num_lines_abs = Output("num_lines_abs", float)
num_lines_unit = Output("num_lines_unit", float)
num_shapes_abs = Output("num_shapes_abs", float)
num_shapes_unit = Output("num_shapes_unit", float)
# Pass these variables into the Scenario class
scenario = Scenario(
variables={
width,
height,
intensity,
num_lines_abs,
num_lines_unit,
num_shapes_abs,
num_shapes_unit})
causal_dag = CausalDAG(dag_file) # Secondly, create the Causal DAG
Step 4: Create Causal Test Cases
Now that we’ve created our Modelling Scenario, we’re ready to create our Causal Tests. Causal tests are essentially metamorphic tests that are executed using statistical causal inference. A causal test expresses the change in a given output that we expect to see when we change a particular input in some way.
Firstly, a base test case, which specifies the relationship between the given output and input and the desired effect, is required to build a causal test case. Together, the causal test case forms the complete executable test, which is the minimum required to perform identification on the DAG.
In this tutorial, the two metamorphic relations we would like to investigate are the following:
Doubling the intensity should cause the number of polygons per unit area to increase by a factor of 4.
The number of polygons per unit area should be independent of width and height.
Metamorphic Relation 1: Doubling the intensity should cause the number of polygons per unit area to increase by a factor of 4
[4]:
from causal_testing.testing.base_test_case import BaseTestCase
from causal_testing.testing.causal_test_case import CausalTestCase
from causal_testing.testing.causal_effect import ExactValue, Positive
from causal_testing.estimation.linear_regression_estimator import LinearRegressionEstimator
base_test_case = BaseTestCase(treatment_variable=intensity, outcome_variable=num_shapes_unit) # Create the base test case
# Perform identification on the DAG using the base test case
adjustment_set = causal_dag.identification(base_test_case) # Note: an empty adjustment set means there are no confounding variables that need to be controlled for
Following this, we can now create our causal test case. The minimum parameter’s well need to create this are: the expected causal effect as a CausalEffect object (e.g. ExactValue), the estimate type, which is a str specifying the type of estimate to return, and an estimator, which can be is an Estimator object. Since the relation we’re investigating is inherently linear, we can use the LinearRegressionEstimator class to build our causal test case.
[5]:
import numpy as np
from causal_testing.estimation.linear_regression_estimator import LinearRegressionEstimator
control_values, treatment_values = 2 ** np.arange(0, 4), 2 ** np.arange(1, 5) # Initialise the dummy intensity variables
intensity_results = [] # Initiate an empty list to store the causal test results
for (control, treatment) in zip(control_values, treatment_values): # Simultaneously loop over control and treatment
estimator=LinearRegressionEstimator(
df=df, # Pass in the dataframe
base_test_case=base_test_case, # Base test case we created above
treatment_value=treatment, # Doubled intensity values
control_value=control, # Baseline intensity values
adjustment_set=adjustment_set, # Adjustment set (no confounders in this example)
formula="num_shapes_unit ~ I(intensity ** 2) + intensity - 1", # Patsy formula describing a linear regression model
alpha=0.05) # Significance level
causal_test_case = CausalTestCase(
base_test_case=base_test_case, # Pass in the base test case
expected_causal_effect=ExactValue(4, atol=0.5), # Include a tolerence of 0.5
estimate_type="risk_ratio", # As described in our paper
estimator = estimator) # Pass in the estimator we created above
test_results = causal_test_case.execute_test() # Execute the tests
# Parse the test result values we need:
intensity_results += [
{
"width": test_results.estimator.control_value,
"height": test_results.estimator.treatment_value,
"control": test_results.estimator.control_value,
"treatment": test_results.estimator.treatment_value,
"risk_ratio": test_results.effect_estimate.value[0],
}]
Finally, we can parse the causal test results as a pandas dataframe and optionally export it to a .csv file.
[6]:
intensity_results_df = pd.DataFrame(intensity_results)
intensity_results_df
# intensity_results_df.to_csv("intensity_test_results.csv", index=0) # Uncomment this to save as a csv.
[6]:
| width | height | control | treatment | risk_ratio | |
|---|---|---|---|---|---|
| 0 | 1 | 2 | 1 | 2 | 2.827958 |
| 1 | 2 | 4 | 2 | 4 | 3.171104 |
| 2 | 4 | 8 | 4 | 8 | 3.477219 |
| 3 | 8 | 16 | 8 | 16 | 3.699311 |
Summary
From the above causal test results and the risk ratios, we can conclude that doubling the intensity does not cause the number of polygons per unit area to increase by a factor of 4 as we expected. Instead, it increases by factors ranging from 2.8 to 3.7, meaning that the metamorphic relation is violated at a 95% confidence level.
Metamorphic Relation 2: The number of polygons per unit area should be independent of the width and height
In a very similar way to the method above, we can test our second metamorphic relation that the number of polygons per unit area should be independent of sample width and height. Since we are only interested in whether there is some effect, we use the average treatment effect (ATE) instead of the risk ratio from above, which quantifies the additive change in outcome caused by the intervention.
To investigate whether the width affects number of polygons per unit area, we need to execute a new set of test cases, but this time fixing the intensity and varying the width. Note: we don’t need to redefine the modelling scenario, nor the perform identification again; but we have to redefine our base test case since we’re now considering the Polygon’s width as the treatment variable.
[7]:
control_values, treatment_values = np.arange(1,10), np.arange(1, 17) # Initialise the dummy width variables
width_results = [] # Empty list for storing test case results
base_test_case = BaseTestCase(treatment_variable=width, outcome_variable=num_shapes_unit) # Create the base test case
adjustment_set = causal_dag.identification(base_test_case) # Calculate the adjustment set again (if it exists)
for intensity in treatment_values:
for width_value in control_values:
estimator = LinearRegressionEstimator(
df=df, # Pass in the dataframe
base_test_case = base_test_case, # Base test case we created above
treatment_value = width_value + 1.0, # Changing the width
control_value=float(width_value), # Baseline width values
adjustment_set=adjustment_set, # Use the same adjustment set as list comprehension
effect_modifiers={"intensity": intensity},
formula="num_shapes_unit ~ width + I(intensity ** 2)+I(width ** -1)+intensity-1", # Patsy formula describing a linear regression model
alpha=0.05) # Significance level
causal_test_case = CausalTestCase(
base_test_case = base_test_case, # Pass in the base test case
expected_causal_effect = Positive(), # We expect a positive increase
estimate_type = "ate_calculated", # Calls the ate_calculated method in the linear regression estimator
estimator=estimator) # Pass in the estimator we created above
test_results = causal_test_case.execute_test() # Execute the tests
# Parse the test result values we need:
width_results += [
{
"control": test_results.estimator.control_value,
"treatment": test_results.estimator.treatment_value,
"intensity": estimator.effect_modifiers["intensity"],
"ate": test_results.effect_estimate.value[0],
"ci_low": test_results.effect_estimate.ci_low[0],
"ci_high": test_results.effect_estimate.ci_high[0],
}]
[8]:
# Convert the results to a dataframe and print the first 10 rows
width_results_df = pd.DataFrame(width_results)
width_results_df.head(10)
[8]:
| control | treatment | intensity | ate | ci_low | ci_high | |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 2.0 | 1 | -7.378642 | -13.918239 | -0.839046 |
| 1 | 2.0 | 3.0 | 1 | -2.709659 | -9.802883 | 4.383566 |
| 2 | 3.0 | 4.0 | 1 | -1.542413 | -11.120888 | 8.036062 |
| 3 | 4.0 | 5.0 | 1 | -1.075514 | -13.708422 | 11.557393 |
| 4 | 5.0 | 6.0 | 1 | -0.842065 | -16.741294 | 15.057163 |
| 5 | 6.0 | 7.0 | 1 | -0.708666 | -19.972927 | 18.555596 |
| 6 | 7.0 | 8.0 | 1 | -0.625291 | -23.308418 | 22.057837 |
| 7 | 8.0 | 9.0 | 1 | -0.569708 | -26.704347 | 25.564931 |
| 8 | 9.0 | 10.0 | 1 | -0.530800 | -30.138282 | 29.076683 |
| 9 | 1.0 | 2.0 | 2 | -7.378642 | -16.381136 | 1.623851 |
Summary
The causal test results in this case demonstrate that the ATE values for width increases from 1 → 2 through 9 → 10, revealing that while most changes produce non-significant effects (ATEs ranging from -2.7097 to -0.5308 with confidence intervals containing zero), the width change from 1 → 2 produces a statistically significant negative effect of -7.3786 with a confidence interval of [-13.9182, -0.8390]. This indicates that there is a problem with either the program or
the metamorphic property itself. A likely interpretation is that, geometrically, lines are less likely to intersect a smaller sample window. As the sample window becomes larger, there is more area to average over. Therefore, the metamorphic relations should ideally specify a minimum window size to which they apply.
Additionally, in the paper we further demonstrate that these results show that the CTF was able to identify the same discrepancy as conventional statistical metamorphic testing, but using only a fifth of the data. Ultimately, this highlights the potential of causal inference-driven approaches to offer economical alternatives to testing techniques that depend on repeated potentially costly executions of the system under test.