Local-first · Open source · Apache-2.0

Synthetic data you design,
versioned like source code.

Declare a dataset — distributions, causal structure, difficulty, and failure modes — in one spec file, and regenerate it byte-for-byte identically, forever, from (spec hash, seed). No network. No telemetry. No account.

$ pip install datadoom

Why DataDoom

Realistic and reproducible — no trade-off.

⚙️

Deterministic by construction

One seeded RNG underpins everything. The same spec + seed yields a bitwise-identical dataset on the pinned path.
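
As a rough illustration of the idea (not DataDoom's actual code), the sketch below derives a single RNG from a spec hash and a seed, then fingerprints the generated bytes. The function names, the toy spec text, and the binary layout are all assumptions made for this example.

```python
import hashlib
import random
import struct


def generate_bytes(spec_text: str, seed: int, n_rows: int = 1000) -> bytes:
    """Toy generator: all randomness flows through one RNG keyed by (spec hash, seed)."""
    spec_hash = hashlib.sha256(spec_text.encode("utf-8")).hexdigest()
    # Seeding random.Random with a string is deterministic in Python 3.
    rng = random.Random(f"{spec_hash}:{seed}")
    # Pack each value as a fixed-width little-endian double so the byte layout is stable.
    return b"".join(struct.pack("<d", rng.gauss(0.0, 1.0)) for _ in range(n_rows))


def fingerprint(data: bytes) -> str:
    """SHA-256 of the generated bytes, used to compare two runs."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical spec text; the real spec format is not shown on this page.
spec = "columns: {amount: {dist: normal, mean: 0, std: 1}}"

first = fingerprint(generate_bytes(spec, seed=42))
second = fingerprint(generate_bytes(spec, seed=42))
other = fingerprint(generate_bytes(spec, seed=43))

print(first == second)  # same spec + seed: identical bytes
print(first == other)   # different seed: different bytes
```

Routing every draw through one RNG keyed by both the spec hash and the seed means that editing the spec or changing the seed each produce a different, but still repeatable, dataset.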

📊

Honest statistics

Distributions are sampled correctly and their fit is reported (KS / chi-square, compliance score) — parameters are never refit to flatter the sample.
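
To make "fit is reported, never refit" concrete, here is a minimal sketch of a one-sample Kolmogorov-Smirnov check against the declared parameters. It uses only the standard library; the helper names and the example parameters are assumptions, and DataDoom's own report may be computed differently.

```python
import math
import random


def normal_cdf(x, mean, std):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def ks_statistic(sample, cdf):
    """One-sample KS statistic D = sup |F_n(x) - F(x)|."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


rng = random.Random(42)
declared_mean, declared_std = 10.0, 2.0  # hypothetical declared parameters
sample = [rng.gauss(declared_mean, declared_std) for _ in range(5000)]

# Test against the DECLARED parameters, not parameters re-estimated from the sample.
d_declared = ks_statistic(sample, lambda x: normal_cdf(x, declared_mean, declared_std))

# Large-sample approximation of the 5% critical value for the one-sample KS test.
critical = 1.358 / math.sqrt(len(sample))
print(f"D = {d_declared:.4f}, 5% critical value = {critical:.4f}")
```

Refitting the mean and standard deviation to the sample before testing would shrink D and make the fit look better than it is, which is why the comparison is made against the declared values.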

🕸️

Causal structure

A DAG of structural equations with per-node noise and do() interventions, plus a true-graph and mutual-information report.
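
A structural-equation DAG with a do() intervention can be sketched in a few lines. The three-node graph, coefficients, and function names below are invented for illustration and are not taken from DataDoom's spec format or API.

```python
import random


def sample_scm(n, seed, do=None):
    """Sample a toy DAG income -> spend -> fraud; `do` pins a node to a fixed value."""
    do = do or {}
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Draw every noise term unconditionally so an intervention
        # never shifts the RNG stream seen by the other nodes.
        u_income, u_spend, u_fraud = rng.gauss(0, 1), rng.gauss(0, 1), rng.random()
        income = do.get("income", 50.0 + 10.0 * u_income)
        spend = do.get("spend", 0.4 * income + 5.0 * u_spend)
        p_fraud = min(1.0, max(0.0, 0.01 * spend))
        fraud = do.get("fraud", int(u_fraud < p_fraud))
        rows.append({"income": income, "spend": spend, "fraud": fraud})
    return rows


def mean(rows, column):
    return sum(row[column] for row in rows) / len(rows)


observed = sample_scm(10_000, seed=42)
intervened = sample_scm(10_000, seed=42, do={"spend": 60.0})

print("fraud rate, observed:     ", mean(observed, "fraud"))
print("fraud rate, do(spend=60): ", mean(intervened, "fraud"))
print("mean income unchanged:    ", mean(observed, "income") == mean(intervened, "income"))
```

Pinning `spend` cuts its incoming edge from `income`: the upstream node keeps its distribution, while the downstream `fraud` node responds to the forced value.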

💥

Failure injection

Eight mechanisms — MCAR/MAR/MNAR, label & feature noise, drift, covariate shift, leakage — corrupt a copy while the clean baseline is kept.
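
Two of these mechanisms can be sketched as follows, assuming rows are plain dictionaries. The function names and parameters are hypothetical, not DataDoom's plugin interface. Each injector works on a deep copy, so the clean baseline stays intact.

```python
import copy
import random


def inject_mcar(rows, column, rate, seed):
    """Missing Completely At Random: every cell is dropped with the same probability."""
    rng = random.Random(seed)
    corrupted = copy.deepcopy(rows)  # the clean baseline is never mutated
    for row in corrupted:
        if rng.random() < rate:
            row[column] = None
    return corrupted


def inject_mnar(rows, column, threshold, rate, seed):
    """Missing Not At Random: missingness depends on the (hidden) value itself."""
    rng = random.Random(seed)
    corrupted = copy.deepcopy(rows)
    for row in corrupted:
        if row[column] > threshold and rng.random() < rate:
            row[column] = None
    return corrupted


clean = [{"amount": float(i)} for i in range(1000)]
mcar = inject_mcar(clean, "amount", rate=0.2, seed=7)
mnar = inject_mnar(clean, "amount", threshold=800.0, rate=0.5, seed=7)

print(sum(row["amount"] is None for row in mcar), "cells missing under MCAR")
print(sum(row["amount"] is None for row in mnar), "cells missing under MNAR")
```

Keeping the clean copy alongside the corrupted one is what lets a downstream evaluation measure how much damage each failure mode causes.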

🎯

Difficulty targeting

Calibrate a binary label to a chosen baseline-model AUROC band, reported with the achieved metric and bisection trace.
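
One way such a calibration could work is to bisect over a label-noise scale until a baseline's AUROC lands in the target band. The sketch below is an assumption-laden toy: the "baseline model" is simply the raw signal used as a score, and every name in it is hypothetical rather than DataDoom's implementation.

```python
import random


def auroc(scores, labels):
    """AUROC via the rank-sum (Mann-Whitney U) identity; ties get average ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def calibrate(target_lo, target_hi, n=4000, seed=42, max_steps=40):
    """Bisect the noise scale until the baseline AUROC falls inside the band."""
    rng = random.Random(seed)
    signal = [rng.gauss(0, 1) for _ in range(n)]
    noise = [rng.gauss(0, 1) for _ in range(n)]  # fixed draws keep the search repeatable
    lo, hi = 0.0, 50.0
    trace = []
    for _ in range(max_steps):
        mid = (lo + hi) / 2.0
        labels = [int(s + mid * e > 0) for s, e in zip(signal, noise)]
        achieved = auroc(signal, labels)
        trace.append((mid, achieved))
        if target_lo <= achieved <= target_hi:
            return mid, achieved, trace
        if achieved > target_hi:
            lo = mid  # task too easy: add label noise
        else:
            hi = mid  # task too hard: remove label noise
    raise RuntimeError("target AUROC band not reached")


noise_scale, achieved, trace = calibrate(0.70, 0.75)
print(f"noise scale {noise_scale:.3f} gives AUROC {achieved:.3f} after {len(trace)} steps")
```

The bisection relies on AUROC decreasing (approximately) as the noise scale grows. Returning the trace alongside the achieved metric makes it possible to audit how the search arrived at its answer.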

🧩

Extensible & ready to consume

Distributions, failure modes, and exporters ship as plugins. Export CSV / JSON / Parquet, load into pandas / PyTorch / TF / HuggingFace.

Install

Up and running in two commands.

1. Install the engine + CLI

pip install datadoom              # engine + CLI
pip install "datadoom[server]"    # + web Canvas
pip install "datadoom[parquet]"   # + Parquet export

2. Generate & prove reproducibility

# generate a dataset from a spec
datadoom run examples/causal-fraud.datadoom.yaml --seed 42 --out ./out

# regenerate and compare bytes
datadoom verify examples/causal-fraud.datadoom.yaml --seed 42 --against ./out
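
How `datadoom verify` compares bytes internally is not described here. As a sketch of what a byte-level comparison of two output directories could look like, the following hashes every file's relative path and contents in sorted order. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def tree_digest(root):
    """Hash every file under `root` (relative path + contents) in sorted order."""
    root = Path(root)
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def same_bytes(dir_a, dir_b):
    """True when both directories hold the same files with the same bytes."""
    return tree_digest(dir_a) == tree_digest(dir_b)
```

Calling `same_bytes("./out", "./out-regenerated")` on two runs of the same spec and seed would return True only if every file matches byte for byte; both directory names here are placeholders.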

3. Or design in the web Canvas

pip install "datadoom[server]"
datadoom serve   # http://127.0.0.1:8000 — no Node toolchain

Prefer Docker? This command starts the Canvas automatically:

docker run --rm -p 8000:8000 ghcr.io/santhoshreddy352/datadoom:latest

A dataset should be as shareable and reproducible as source code.