Plover Found 9 Errors in the UC Irvine AI4I Predictive Maintenance Dataset
Finding errors in the code behind the synthetic data
2024-11-20 by Baxter Eaves in [data science, data quality, plover]
TL;DR: Plover found nine mislabeled records. Download the cleaned data here. Try Plover in your browser here.
I believe most, if not all, of the people I know who work with data would agree that every real-world dataset of substance has errors. But under what conditions could (or should) a dataset not have errors? In the machine learning space, one way we get around the messiness and sparsity of real-world data is by writing computer programs to programmatically generate synthetic data. You may think, "surely data generated by a program would be, by definition, error free, right?" Wrong.
There is a subtle distinction that often gets overlooked by the data quality community: your data are true, no matter how "bad" they are. The data were generated by the true data generating process and ended up in your database. So-called "erroneous" data aren't in fact erroneous; they are evidence of an error in the data process. Maybe someone fat-fingered an entry; maybe units were not normalized across sources. In the latter case, the erroneous data actually reflect erroneous code.
Here we'll show how we used Plover to identify errors in (the code that generated) a highly curated synthetic dataset in one of the most used repositories for clean machine-learning-ready data: the UCI Machine Learning Repository AI4I 2020 Predictive Maintenance Dataset.
To quote the UCI repository page:
The AI4I 2020 Predictive Maintenance Dataset is a synthetic dataset that reflects real predictive maintenance data encountered in industry.
The dataset consists of 10,000 data points stored as rows with 14 features in columns:
- UID: unique identifier ranging from 1 to 10000
- product ID: consisting of a letter L, M, or H for low (50% of all products), medium (30%) and high (20%) as product quality variants and a variant-specific serial number
- air temperature [K]: generated using a random walk process later normalized to a standard deviation of 2 K around 300 K
- process temperature [K]: generated using a random walk process normalized to a standard deviation of 1 K, added to the air temperature plus 10 K.
- rotational speed [rpm]: calculated from a power of 2860 W, overlaid with a normally distributed noise
- torque [Nm]: torque values are normally distributed around 40 Nm with a σ = 10 Nm and no negative values.
- tool wear [min]: the quality variants H/M/L add 5/3/2 minutes of tool wear to the used tool in the process.
- machine failure: a label that indicates whether the machine failed on this particular data point, i.e., whether any of the following failure modes are true.
The machine failure label consists of five independent failure modes:
- tool wear failure (TWF): the tool will be replaced or fail at a randomly selected tool wear time between 200 and 240 minutes (120 times in our dataset). At this point in time, the tool is replaced 69 times and fails 51 times (randomly assigned).
- heat dissipation failure (HDF): heat dissipation causes a process failure if the difference between air and process temperature is below 8.6 K and the tool's rotational speed is below 1380 rpm. This is the case for 115 data points.
- power failure (PWF): the product of torque and rotational speed (in rad/s) equals the power required for the process. If this power is below 3500 W or above 9000 W, the process fails, which is the case 95 times in our dataset.
- overstrain failure (OSF): if the product of tool wear and torque exceeds 11,000 minNm for the L product variant (12,000 M, 13,000 H), the process fails due to overstrain. This is true for 98 datapoints.
- random failures (RNF): each process has a 0.1% chance of failing regardless of its process parameters. This is the case for only 5 data points, fewer than would be expected for 10,000 data points in our dataset.
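Three of these failure modes (HDF, PWF, and OSF) are deterministic functions of the other columns, so we can recompute them straight from the data. Below is a minimal pandas sketch of those rules, assuming the column names as they appear later in this post; TWF and RNF are random and can't be recomputed this way.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("ai4i2020.csv", index_col="UID")

# HDF: air/process temperature difference below 8.6 K
# and rotational speed below 1380 rpm
hdf = (
    (df["Process temperature [K]"] - df["Air temperature [K]"] < 8.6)
    & (df["Rotational speed [rpm]"] < 1380)
)

# PWF: power = torque * angular velocity (rad/s), failing outside [3500 W, 9000 W]
power = df["Torque [Nm]"] * df["Rotational speed [rpm]"] * 2 * np.pi / 60
pwf = (power < 3500) | (power > 9000)

# OSF: tool wear * torque above the variant-specific threshold (minNm)
osf_limit = df["Type"].map({"L": 11_000, "M": 12_000, "H": 13_000})
osf = df["Tool wear [min]"] * df["Torque [Nm]"] > osf_limit

# Per the documentation these should fire roughly 115, 95, and 98 times
print(hdf.sum(), pwf.sum(), osf.sum())
```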
Let's dig in. We're going to do everything through the Python bindings today, so we'll start by creating a local Plover instance.
```python
from plover import Plover
from plover.source import DataFrame

plvr = (
    Plover.local(
        source=DataFrame.csv(
            "ai4i2020.csv",
            index_col="UID",
            schema="auto",
        ),
        store="ai4i2020.plvrstore",
    )
    .fit()
    .compute_metrics()
    .metalearn()
    .persist()
)
```
We load the data using the `DataFrame` source, store our metadata locally using the `Local` store, and use a local machine learning backend. The `fit` method builds an inference model of the data. The `compute_metrics` method computes the error/anomaly metrics for each cell. The `metalearn` method creates a second-order machine learner that allows similarity queries, as we'll see below. The `persist` method saves everything to the local store so we can resume later.
Now that we've done all that, let's find the top five most likely errors.
```python
plvr.errors(top=5)
```
row | col | ic | obs | pred |
---|---|---|---|---|
9016 | TWF | 142.939812 | 0 | 0 |
5537 | HDF | 115.492924 | 0 | 1 |
9016 | OSF | 113.370143 | 0 | 1 |
4703 | PWF | 75.064483 | 0 | 1 |
1493 | OSF | 50.205087 | 0 | 0 |
The above table reports the inconsistency metric as `ic`. The important thing to know is that more inconsistency means a cell is more likely to be an error. You can see the steep falloff in `ic` from top to bottom. Interestingly, there are two cells from record `9016` in the top five errors. Let's ask Plover to explain which features are responsible for `TWF`'s high inconsistency on record `9016`.
```python
plvr.explain(row="9016", col="TWF")
```
 | feature | ic | observed | predicted |
---|---|---|---|---|
0 | | 142.94 | | |
1 | Machine failure | 5.06351 | 1 | 0 |
2 | Tool wear [min] | 0.549236 | 210.0 | 114.47 |
The above table shows how much inconsistency is left after removing certain features. Features are sorted by their contribution to the uncertainty in the target variable, in this case `TWF`. Plover is telling us that the `Machine failure` value is responsible for essentially all of the inconsistency. A distant second is `Tool wear [min]`. Let's take a look at record `9016`.
```python
plvr.data(row="9016")
```
 | 9016 |
---|---|
Type | L |
Air temperature [K] | 297.2 |
Process temperature [K] | 308.1 |
Rotational speed [rpm] | 1431.0 |
Torque [Nm] | 49.7 |
Tool wear [min] | 210.0 |
Machine failure | 1 |
TWF | 0 |
HDF | 0 |
PWF | 0 |
OSF | 0 |
RNF | 0 |
Right off the bat we see that `Machine failure` is 1 but all of the failure modes are 0. According to the data documentation:
If at least one of the above failure modes is true, the process fails and the 'machine failure' label is set to 1.
We have an error! Plover correctly identified that `Machine failure` should be 0.
An error that happens once can happen again, so we need to find similar errors. We can do this in one of two ways: we can write a rule and filter the dataset (which would be really easy in this case, but really difficult in general), or we can ask Plover to find similar cells.
```python
plvr.similar_cells(row="9016", col="Machine failure").head(10)
```
row | similarity |
---|---|
4045 | 0.789062 |
5942 | 0.789062 |
4685 | 0.789062 |
1438 | 0.773438 |
5537 | 0.773438 |
2750 | 0.742188 |
6479 | 0.71875 |
8507 | 0.71875 |
5910 | 0.5625 |
4703 | 0.554688 |
There are eight cells with high meta-similarity to the error we found, after which the similarity falls off sharply. Since it's easy to do in this case, let's filter the data to pull out all the rows in which `Machine failure` is 1 but every failure mode is 0.
```python
failure_modes = ["TWF", "HDF", "PWF", "OSF", "RNF"]

df = plvr.df()
df[
    (df[failure_modes].sum(axis=1) == 0)
    & (df["Machine failure"] > 0)
][failure_modes + ["Machine failure"]]
```
row | TWF | HDF | PWF | OSF | RNF | Machine failure |
---|---|---|---|---|---|---|
1438 | 0 | 0 | 0 | 0 | 0 | 1 |
2750 | 0 | 0 | 0 | 0 | 0 | 1 |
4045 | 0 | 0 | 0 | 0 | 0 | 1 |
4685 | 0 | 0 | 0 | 0 | 0 | 1 |
5537 | 0 | 0 | 0 | 0 | 0 | 1 |
5942 | 0 | 0 | 0 | 0 | 0 | 1 |
6479 | 0 | 0 | 0 | 0 | 0 | 1 |
8507 | 0 | 0 | 0 | 0 | 0 | 1 |
9016 | 0 | 0 | 0 | 0 | 0 | 1 |
Cross-checking the meta-similar records with the filter, we see that meta-similarity found the same entries as the hard-coded rule. Nice!
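If you want to check that overlap programmatically, here's a sketch. It assumes `similar_cells` returns a frame with a `row` column, as rendered in the table above; the exact return types may differ.

```python
# Top eight meta-similar cells (similarity drops sharply after that)
similar = plvr.similar_cells(row="9016", col="Machine failure").head(8)

# Rows flagged by the hard-coded rule
rule_hits = df[
    (df[failure_modes].sum(axis=1) == 0) & (df["Machine failure"] > 0)
].index

# The rule also catches the query cell (9016), which is not its own neighbor
assert set(similar["row"].astype(int)) == set(rule_hits) - {9016}
```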
Digging in more and recreating the failure rules supplied by the dataset authors, it turns out that entry `9016` could have suffered a tool wear failure (TWF), since its tool wear was greater than 200 minutes. However, because TWF is a random and rare failure for tool wear greater than 200 minutes, we decided to mark `9016` as not having failed in the cleaned dataset.
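To make that concrete, here is the quick check: TWF can only fire inside the documented 200-240 minute tool wear window, and whether it actually fires there is random.

```python
wear = df.loc[9016, "Tool wear [min]"]  # 210.0 for this record
could_be_twf = 200 <= wear <= 240       # True: 9016 is inside the TWF window
print(could_be_twf)
```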
Conclusion
All processes are prone to errors, so all data, even synthetic data, can contain the evidence of those errors in the form of so-called bad data. We showed how Plover can easily identify erroneous data and how we can use meta-similarity to identify similar errors. In this case, Plover found a very salient error in record `9016` and used meta-similarity to find all other data exhibiting that particular error, without our having to write a single rule.
Changelog
Download the cleaned data here. And try Plover in your browser here.
UID (row) | Column | Original | New |
---|---|---|---|
9016 | Machine failure | 1 | 0 |
5537 | Machine failure | 1 | 0 |
1438 | Machine failure | 1 | 0 |
2750 | Machine failure | 1 | 0 |
4045 | Machine failure | 1 | 0 |
4685 | Machine failure | 1 | 0 |
5942 | Machine failure | 1 | 0 |
6479 | Machine failure | 1 | 0 |
8507 | Machine failure | 1 | 0 |
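If you'd rather apply the fix yourself, here is a minimal sketch using pandas on the original CSV; the output filename is our own choice.

```python
import pandas as pd

df = pd.read_csv("ai4i2020.csv", index_col="UID")

# The nine mislabeled records from the changelog above
bad_uids = [9016, 5537, 1438, 2750, 4045, 4685, 5942, 6479, 8507]
df.loc[bad_uids, "Machine failure"] = 0

df.to_csv("ai4i2020_cleaned.csv")  # hypothetical output name
```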