Plover Found 9 Errors in the UC Irvine AI4I Predictive Maintenance Dataset

Finding errors in the code behind the synthetic data

2024-11-20 by Baxter Eaves in [data science, data quality, plover]

TL;DR: Plover found nine mislabeled records. Download the cleaned data here. Try Plover in your browser here.


I believe most, if not all, of the people I know who work with data would agree that every real-world dataset of substance has errors. But under what conditions could (or should) a dataset not have errors? In the machine learning space, one way we get around the messiness and sparsity of real-world data is by writing programs to generate synthetic data. You may think, "surely data generated by a program would be, by definition, error free, right?" Wrong.

There is a subtle distinction that often gets overlooked by the data quality community: your data are true, no matter how "bad" they are. The data were generated by the true data generating process and ended up in your database. So-called "erroneous" data aren't in fact erroneous, but are evidence of an error in the data process. Maybe someone fat-fingered an entry; maybe units were not normalized across sources. In the latter case, the erroneous data actually reflect erroneous code.

Here we'll show how we used Plover to identify errors in (the code that generated) a highly curated synthetic dataset in one of the most used repositories for clean machine-learning-ready data: the UCI Machine Learning Repository AI4I 2020 Predictive Maintenance Dataset.

To quote the UCI repository page:

The AI4I 2020 Predictive Maintenance Dataset is a synthetic dataset that reflects real predictive maintenance data encountered in industry.

The dataset consists of 10,000 data points stored as rows with 14 features in columns:

  • UID: unique identifier ranging from 1 to 10000

  • product ID: consisting of a letter L, M, or H for low (50% of all products), medium (30%) and high (20%) as product quality variants and a variant-specific serial number
  • air temperature [K]: generated using a random walk process later normalized to a standard deviation of 2 K around 300 K
  • process temperature [K]: generated using a random walk process normalized to a standard deviation of 1 K, added to the air temperature plus 10 K.
  • rotational speed [rpm]: calculated from a power of 2860 W, overlaid with a normally distributed noise
  • torque [Nm]: torque values are normally distributed around 40 Nm with a σ = 10 Nm and no negative values.
  • tool wear [min]: the quality variants H/M/L add 5/3/2 minutes of tool wear to the used tool in the process
  • 'machine failure': a label that indicates whether the machine has failed in this particular data point, which is the case if any of the following failure modes are true

The machine failure consists of five independent failure modes:

  • tool wear failure (TWF): the tool will be replaced or fail at a randomly selected tool wear time between 200 and 240 mins (120 times in our dataset). At this point in time, the tool is replaced 69 times and fails 51 times (randomly assigned).
  • heat dissipation failure (HDF): heat dissipation causes a process failure if the difference between air and process temperature is below 8.6 K and the tool's rotational speed is below 1380 rpm. This is the case for 115 data points.
  • power failure (PWF): the product of torque and rotational speed (in rad/s) equals the power required for the process. If this power is below 3500 W or above 9000 W, the process fails, which is the case 95 times in our dataset.
  • overstrain failure (OSF): if the product of tool wear and torque exceeds 11,000 minNm for the L product variant (12,000 M, 13,000 H), the process fails due to overstrain. This is true for 98 datapoints.
  • random failures (RNF): each process has a chance of 0.1% to fail regardless of its process parameters. This is the case for only 5 data points, fewer than would be expected for 10,000 data points in our dataset.
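To make the rules above concrete, here is a minimal sketch of their deterministic parts in Python with pandas, assuming the column names that appear in the CSV. The function name failure_mode_flags is my own. Note that TWF and RNF are stochastic, so we can only encode TWF's eligibility window, not its outcome.

import numpy as np
import pandas as pd

def failure_mode_flags(df: pd.DataFrame) -> pd.DataFrame:
    # Overstrain threshold depends on the product quality variant
    osf_threshold = df["Type"].map({"L": 11_000, "M": 12_000, "H": 13_000})
    # Convert rotational speed to rad/s to compute power in watts
    omega = df["Rotational speed [rpm]"] * 2 * np.pi / 60
    power = df["Torque [Nm]"] * omega
    return pd.DataFrame({
        # TWF can occur (at random) only inside the 200-240 min window
        "twf_eligible": df["Tool wear [min]"].between(200, 240),
        "hdf": (
            (df["Process temperature [K]"] - df["Air temperature [K]"] < 8.6)
            & (df["Rotational speed [rpm]"] < 1380)
        ),
        "pwf": (power < 3500) | (power > 9000),
        "osf": df["Tool wear [min]"] * df["Torque [Nm]"] > osf_threshold,
    })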

Let's dig in. We're going to do everything in the Python bindings today, so we'll start by creating a local Plover instance.

from plover import Plover
from plover.source import DataFrame

plvr = (
    Plover.local(
        # Read the dataset from a local CSV, indexed by UID, letting
        # Plover infer the column types
        source=DataFrame.csv(
            "ai4i2020.csv",
            index_col="UID",
            schema="auto"
        ),
        # Keep all metadata in a local store so we can resume later
        store="ai4i2020.plvrstore",
    )
    .fit()              # build an inference model of the data
    .compute_metrics()  # compute error/anomaly metrics for each cell
    .metalearn()        # train a second-order learner for similarity queries
    .persist()          # save everything to the local store
)

We read a local dataset using the DataFrame source, store our metadata locally using the Local store, and use a local machine learning backend. The fit method builds an inference model of the data. The compute_metrics method computes the error/anomaly metrics for each cell. The metalearn method creates a second-order machine learner that allows similarity queries, as we'll see below. The persist method saves everything to the local store so we can resume later.

Now that we've done all that, let's find the top five most likely errors.

plvr.errors(top=5)
row   col  ic          obs  pred
9016  TWF  142.939812  0    0
5537  HDF  115.492924  0    1
9016  OSF  113.370143  0    1
4703  PWF   75.064483  0    1
1493  OSF   50.205087  0    0

The above table shows the inconsistency metric as ic. The important thing is that higher inconsistency means a cell is more likely to be an error. You can see the steep falloff in ic from top to bottom. Interestingly, there are two cells from record 9016 in the top five errors. Let's ask Plover to explain which features are responsible for TWF's high inconsistency on record 9016.

plvr.explain(row="9016", col="TWF")
   feature          ic        observed  predicted
0                   142.94
1  Machine failure  5.06351   1         0
2  Tool wear [min]  0.549236  210.0     114.47

The above table shows us how much inconsistency is left after removing certain features. Features are sorted by their contribution to the uncertainty in the target variable, in this case TWF. Plover is telling us that the Machine failure value is responsible for nearly all the inconsistency; Tool wear [min] is a distant second. Let's take a look at record "9016".

plvr.data(row="9016")
                         9016
Type                     L
Air temperature [K]      297.2
Process temperature [K]  308.1
Rotational speed [rpm]   1431.0
Torque [Nm]              49.7
Tool wear [min]          210.0
Machine failure          1
TWF                      0
HDF                      0
PWF                      0
OSF                      0
RNF                      0

Right off the bat we see that Machine failure is 1 but all the failure modes are 0. According to the data documentation:

If at least one of the above failure modes is true, the process fails and the 'machine failure' label is set to 1.

We have an error! Plover correctly identified that Machine failure should be 0.
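As a quick sanity check independent of Plover, we can verify the rule for this record directly with pandas (column names as shown in the record above):

import pandas as pd

# Per the documentation, 'Machine failure' should be the OR of the
# five failure-mode flags. For record 9016 the two disagree.
df = pd.read_csv("ai4i2020.csv", index_col="UID")
rec = df.loc[9016]
modes_any = int(rec[["TWF", "HDF", "PWF", "OSF", "RNF"]].any())
print(modes_any, rec["Machine failure"])  # 0 1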

An error that happens once can happen again, so we need to find similar errors. We can do this in one of two ways: we can write a rule and filter the dataset (which would be really easy in this case, but really difficult in general), or we can ask Plover to find similar cells.

plvr.similar_cells(row="9016", col="Machine failure").head(10)
row   similarity
4045  0.789062
5942  0.789062
4685  0.789062
1438  0.773438
5537  0.773438
2750  0.742188
6479  0.718750
8507  0.718750
5910  0.562500
4703  0.554688

There are eight cells with high meta-similarity to the error we found (note the sharp drop in similarity after the eighth row). Since it's easy to do in this case, let's filter the data to pull out all the rows in which Machine failure is 1 but every failure mode is 0.

failure_modes = ["TWF", "HDF", "PWF", "OSF", "RNF"]
df = plvr.df()
df[
    (df[failure_modes].sum(axis=1) == 0)
    & (df["Machine failure"] > 0)
][failure_modes + ["Machine failure"]]
row   TWF  HDF  PWF  OSF  RNF  Machine failure
1438  0    0    0    0    0    1
2750  0    0    0    0    0    1
4045  0    0    0    0    0    1
4685  0    0    0    0    0    1
5537  0    0    0    0    0    1
5942  0    0    0    0    0    1
6479  0    0    0    0    0    1
8507  0    0    0    0    0    1
9016  0    0    0    0    0    1

Cross-checking the meta-similar records against the filter, we see that the eight high-similarity cells, plus record 9016 itself, are exactly the entries returned by the hard-coded rule. Nice!
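We can confirm the overlap programmatically. A sketch, assuming similar_cells returns a DataFrame with row and similarity columns as in the table above, and taking the eight rows above the gap at roughly 0.7:

# Rows flagged by meta-similarity, plus the original error, should
# match the rule-based filter exactly
similar = plvr.similar_cells(row="9016", col="Machine failure")
meta_rows = set(similar[similar["similarity"] > 0.7]["row"]) | {9016}

mask = (df[failure_modes].sum(axis=1) == 0) & (df["Machine failure"] > 0)
rule_rows = set(df[mask].index)

print(meta_rows == rule_rows)  # True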

Digging in more and recreating the failure rules supplied by the dataset authors, it turns out that entry 9016 could have suffered a tool wear failure (TWF), since its tool wear was greater than 200 minutes. However, since TWF is a random and rare failure for tool wear greater than 200 minutes, we decided to mark 9016 as not having failed in the cleaned dataset.
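A quick check of 9016 against the documented TWF window, using the df loaded above:

# 9016's tool wear (210 min) falls inside the 200-240 min window in
# which TWF occurs at a randomly selected time, so a genuine tool
# wear failure could not be ruled out from the parameters alone
wear = df.loc[9016, "Tool wear [min]"]
print(200 <= wear <= 240)  # True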

Conclusion

All processes are prone to errors, so all data, even synthetic data, can contain evidence of those errors in the form of so-called bad data. We showed how Plover can easily identify erroneous data and how we can use meta-similarity to find similar errors. In this case, Plover found a very salient error in record 9016 and used meta-similarity to find all other data exhibiting that particular error, without our having to write a single rule.

Changelog

Download the cleaned data here. And try Plover in your browser here.

UID (row)  Column           Original  New
9016       Machine failure  1         0
5537       Machine failure  1         0
1438       Machine failure  1         0
2750       Machine failure  1         0
4045       Machine failure  1         0
4685       Machine failure  1         0
5942       Machine failure  1         0
6479       Machine failure  1         0
8507       Machine failure  1         0
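For reproducibility, a minimal script that applies the changelog above to the original file (the output filename is just an example):

import pandas as pd

# Set 'Machine failure' to 0 for the nine mislabeled UIDs listed above
mislabeled = [1438, 2750, 4045, 4685, 5537, 5942, 6479, 8507, 9016]

df = pd.read_csv("ai4i2020.csv", index_col="UID")
df.loc[mislabeled, "Machine failure"] = 0
df.to_csv("ai4i2020_cleaned.csv")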
