Baxter Eaves

The what, why, and how of synthetic data

Faster, more secure product development in four lines of code

Privacy requirements often prevent organizations from sharing sensitive data for purposes such as running a public hackathon, evaluating a proof of concept (PoC) with external organizations, or even developing products across different units within the originating organization.

Synthetic data are simulated data designed to mimic the properties of a real-world dataset. Synthetic data are often used as externally-available alternatives to sensitive data. For example,

Generating synthetic data with the Redpoll Reformer

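If you have a Reformer server running, generating a CSV of synthetic data takes four lines of Python code:

import redpoll
# client_address is the address of your running Reformer server
rp = redpoll.Client(client_address)
# Generate 1000 rows of synthetic data
df = rp.simulate(redpoll.AllColumns, n=1000)
# Write the synthetic data to a CSV file (the file name is only an example)
df.to_csv("my-synthetic-data.csv", index=False)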
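
Since synthetic data are meant to mimic the properties of the real dataset, a quick sanity check is to compare summary statistics of the two. The following is a minimal sketch using only pandas and NumPy; it is not part of the Reformer API. The helper name `compare_summaries` is hypothetical, and the `real` and `synthetic` frames are random stand-ins so that the example runs without a server. In practice, `synthetic` would be the `df` returned by `rp.simulate`.

```python
import numpy as np
import pandas as pd


def compare_summaries(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Side-by-side mean and standard deviation for shared numeric columns.

    Hypothetical helper for illustration; not part of the redpoll package.
    """
    shared = [c for c in real.columns if c in synthetic.columns]
    real_num = real[shared].select_dtypes(include="number")
    synth_num = synthetic[real_num.columns]
    return pd.DataFrame({
        "real_mean": real_num.mean(),
        "synthetic_mean": synth_num.mean(),
        "real_std": real_num.std(),
        "synthetic_std": synth_num.std(),
    })


# Stand-in frames (assumption): random data in place of a real dataset and
# of the frame returned by rp.simulate.
rng = np.random.default_rng(0)
real = pd.DataFrame({"x": rng.normal(0.0, 1.0, 1000),
                     "y": rng.normal(5.0, 2.0, 1000)})
synthetic = pd.DataFrame({"x": rng.normal(0.0, 1.0, 1000),
                          "y": rng.normal(5.0, 2.0, 1000)})

summary = compare_summaries(real, synthetic)
print(summary)
```

Matching means and standard deviations are a necessary check rather than a sufficient one: they say nothing about relationships between columns, so a fuller comparison would also look at joint behavior.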

Synthetic data with balanced classes

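Class imbalance, where one class has far fewer examples than another, is a common problem in real-world datasets. Conditioning the simulation on the class label with the given argument lets you generate the same number of examples for each class: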

import redpoll
import pandas as pd
rp = redpoll.Client(client_address)
# Generate 1000 rows of synthetic binary classification data with 
# 500 examples of each class
df0 = rp.simulate(redpoll.AllColumns, given={'class': 0}, n=500)
df1 = rp.simulate(redpoll.AllColumns, given={'class': 1}, n=500)
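# Combine both classes and write them to one CSV, dropping the duplicated row index
pd.concat([df0, df1], ignore_index=True).to_csv("my-balanced-synthetic-data.csv", index=False)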
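
To confirm that the combined data are balanced, you can count the rows per class before sharing the file. This is a small pandas-only sketch; the helper name `class_counts` is hypothetical, and `df0` and `df1` here are stand-in frames (with an invented `feature` column) that take the place of the two `rp.simulate` results so the example runs without a server.

```python
import pandas as pd


def class_counts(df: pd.DataFrame, label: str = "class") -> dict:
    """Return the number of rows for each value of the label column."""
    return df[label].value_counts().to_dict()


# Stand-ins (assumption) for the frames returned by the two rp.simulate calls
df0 = pd.DataFrame({"feature": range(500), "class": 0})
df1 = pd.DataFrame({"feature": range(500), "class": 1})
balanced = pd.concat([df0, df1], ignore_index=True)

counts = class_counts(balanced)
print(counts)
```

Each class should appear 500 times; any other count suggests that the conditioning column name or value did not match the dataset.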