
Baxter Eaves
2020-12-21
[ai]
The what, why, and how of synthetic data
Faster, more secure product development in four lines of code
Privacy requirements often prevent organizations from distributing sensitive data for purposes such as running a public hackathon, proof-of-concept (PoC) evaluation with external organizations, and even product development across different units within the originating organization.
Synthetic data are simulated data designed to mimic the properties of a real-world dataset. Synthetic data are often used as externally-available alternatives to sensitive data. For example,
Generating synthetic data with the Redpoll Reformer
If you have a Reformer server running, generating a CSV of synthetic data takes four lines of Python code:
    import redpoll

    rp = redpoll.Client(client_address)

    # Generate 1000 rows of synthetic data
    df = rp.simulate(redpoll.AllColumns, n=1000)
    df.to_csv("my-synthetic-data.csv")
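Since synthetic data are meant to mimic the properties of the real dataset, a quick sanity check is to compare summary statistics of the two. The following is a minimal sketch using only pandas and NumPy; it does not touch the Reformer API. The helper name `compare_summaries` and the toy `real` and `synthetic` frames are assumptions made for illustration. In practice `real` would be the sensitive dataset and `synthetic` would be read back from the generated CSV.

```python
import numpy as np
import pandas as pd


def compare_summaries(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Side-by-side mean and standard deviation for shared numeric columns.

    Hypothetical helper for illustration; not part of any library.
    """
    shared = real.select_dtypes("number").columns.intersection(synthetic.columns)
    return pd.DataFrame({
        "real_mean": real[shared].mean(),
        "synthetic_mean": synthetic[shared].mean(),
        "real_std": real[shared].std(),
        "synthetic_std": synthetic[shared].std(),
    })


# Toy stand-ins (assumption): both frames are drawn from the same distributions
# here, so their summaries should be close.
rng = np.random.default_rng(0)
real = pd.DataFrame({"x": rng.normal(0.0, 1.0, 1000), "y": rng.normal(5.0, 2.0, 1000)})
synthetic = pd.DataFrame({"x": rng.normal(0.0, 1.0, 1000), "y": rng.normal(5.0, 2.0, 1000)})

summary = compare_summaries(real, synthetic)
print(summary)
```

Matching means and standard deviations are only a first check. They say nothing about relationships between columns, so a fuller comparison would also look at correlations or joint distributions.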
Synthetic data with balanced classes
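Class imbalance is a common problem in real-world datasets: when one class vastly outnumbers another, a model trained on the data tends to favor the majority class. Conditioning the simulation on the class label lets you generate as many examples of each class as you need.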
    import redpoll
    import pandas as pd

    rp = redpoll.Client(client_address)

    # Generate 1000 rows of synthetic binary classification data with
    # 500 examples of each class
    df0 = rp.simulate(redpoll.AllColumns, given={'class': 0}, n=500)
    df1 = rp.simulate(redpoll.AllColumns, given={'class': 1}, n=500)
    pd.concat([df0, df1]).to_csv("my-balanced-synthetic-data.csv")
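Concatenating the two frames as above leaves all class-0 rows stacked ahead of the class-1 rows, and row labels may repeat across the two halves. A minimal pandas-only sketch of confirming the balance and shuffling the rows before writing the file might look like the following. The `df0` and `df1` frames here are toy stand-ins for the simulated output, and the `feature` column is a made-up placeholder.

```python
import pandas as pd

# Toy stand-ins (assumption) for the two simulated frames, 500 rows per class.
df0 = pd.DataFrame({"feature": range(500), "class": 0})
df1 = pd.DataFrame({"feature": range(500), "class": 1})

# ignore_index=True gives the combined frame a fresh 0..999 row index
balanced = pd.concat([df0, df1], ignore_index=True)

# Shuffle the rows so the classes are interleaved rather than stacked in blocks
balanced = balanced.sample(frac=1, random_state=0).reset_index(drop=True)

# Count the examples of each class to confirm the dataset is balanced
counts = balanced["class"].value_counts()
print(counts)
```

Shuffling matters mostly for downstream code that reads the file sequentially or splits it by position, for example taking the first 80% of rows as a training set.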