Deterministic by construction
One seeded RNG underpins everything. The same spec + seed yields a bitwise-identical dataset on the pinned path.
Local-first · Open source · Apache-2.0
Declare a dataset (distributions, causal structure, difficulty, and failure modes) in one spec file, and regenerate it byte-for-byte identically, forever, from (spec hash, seed).
No network. No telemetry. No account.
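As a rough illustration of what regenerating from (spec hash, seed) can mean, the sketch below derives a single RNG from both values and checks that two runs serialize to the same bytes. It is not DataDoom's implementation: the spec text, the `spec_hash` and `generate` helpers, and the CSV layout are all hypothetical, and only the Python standard library is used.

```python
import hashlib
import random

def spec_hash(spec_text: str) -> str:
    """Hash the spec text (hypothetical canonical form: the UTF-8 bytes as written)."""
    return hashlib.sha256(spec_text.encode("utf-8")).hexdigest()

def generate(spec_text: str, seed: int, rows: int = 5) -> bytes:
    """Produce CSV bytes from one RNG derived from (spec hash, seed)."""
    # Mix both inputs into one integer seed, so changing either one
    # changes the whole random stream.
    material = f"{spec_hash(spec_text)}:{seed}".encode("utf-8")
    derived = int.from_bytes(hashlib.sha256(material).digest()[:8], "big")
    rng = random.Random(derived)
    lines = ["id,amount"]
    for i in range(rows):
        # Fixed-width formatting keeps the serialized bytes stable too.
        lines.append(f"{i},{rng.gauss(100.0, 15.0):.6f}")
    return ("\n".join(lines) + "\n").encode("utf-8")

# Hypothetical spec content, for illustration only.
SPEC = "columns:\n  amount: {dist: normal, mean: 100, std: 15}\n"

first = generate(SPEC, seed=42)
second = generate(SPEC, seed=42)
other = generate(SPEC, seed=43)

print("same seed, same bytes:", first == second)
print("different seed differs:", first != other)
```

Routing every draw through one RNG is what makes the output a pure function of its inputs; a second, unseeded source of randomness anywhere in the pipeline would break the guarantee.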
Why DataDoom
Distributions are sampled correctly and their fit is reported (KS / chi-square, compliance score) — parameters are never refit to flatter the sample.
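To make the fit-reporting idea concrete, here is a minimal sketch of a one-sample Kolmogorov-Smirnov statistic computed against the declared parameters rather than parameters re-estimated from the sample. The `ks_statistic` helper and the 1.36/sqrt(n) critical-value approximation are illustrative assumptions, not DataDoom's reporting code.

```python
import random
from statistics import NormalDist

def ks_statistic(sample, cdf):
    """One-sample Kolmogorov-Smirnov statistic against a fixed CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # Compare the declared CDF with the empirical CDF just below and at x.
        d = max(d, f - i / n, (i + 1) / n - f)
    return d

# Declared distribution: the test always uses these parameters as written.
declared = NormalDist(mu=0.0, sigma=1.0)
rng = random.Random(42)
sample = [rng.gauss(declared.mean, declared.stdev) for _ in range(2000)]

d_declared = ks_statistic(sample, declared.cdf)
# A mis-specified reference, to show how a bad fit looks.
d_shifted = ks_statistic(sample, NormalDist(mu=1.0, sigma=1.0).cdf)
# Common large-sample approximation of the 5% critical value.
critical = 1.36 / len(sample) ** 0.5

print(f"D vs declared N(0, 1): {d_declared:.4f}")
print(f"D vs shifted  N(1, 1): {d_shifted:.4f}")
print(f"approx. 5% critical value: {critical:.4f}")
```

Testing against the declared parameters keeps the report honest: refitting the mean and standard deviation to the sample first would shrink the statistic and hide a sampler that drifted from the spec.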
A DAG of structural equations with per-node noise and do()
interventions, plus a true-graph and mutual-information report.
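The sketch below shows one way structural equations with per-node noise and a do() intervention can be sampled. The three-node graph, its coefficients, and the `sample_row` and `mean_of` helpers are hypothetical and chosen only for illustration; DataDoom's spec syntax and engine may differ.

```python
import random

# Hypothetical graph: income -> spend -> fraud, plus income -> fraud.
# Each node maps to (parents, structural equation of parent values and noise).
EQUATIONS = {
    "income": ((), lambda p, e: 50.0 + 10.0 * e),
    "spend": (("income",), lambda p, e: 0.4 * p["income"] + 2.0 * e),
    "fraud": (("income", "spend"),
              lambda p, e: 0.1 * p["spend"] - 0.02 * p["income"] + e),
}
ORDER = ["income", "spend", "fraud"]  # a topological order of the DAG

def sample_row(rng, do=None):
    """Sample one row; `do` maps node names to forced values."""
    do = do or {}
    row = {}
    for node in ORDER:
        parents, equation = EQUATIONS[node]
        noise = rng.gauss(0.0, 1.0)  # drawn even under do(), so streams stay aligned
        if node in do:
            row[node] = do[node]  # do(): ignore the parents and fix the value
        else:
            row[node] = equation({k: row[k] for k in parents}, noise)
    return row

def mean_of(node, n, seed, do=None):
    rng = random.Random(seed)
    return sum(sample_row(rng, do)[node] for _ in range(n)) / n

# The true graph as an edge list, read straight from the equations.
edges = [(parent, node) for node in ORDER for parent in EQUATIONS[node][0]]
observed = mean_of("fraud", 5000, seed=42)
intervened = mean_of("fraud", 5000, seed=42, do={"spend": 40.0})

print("true graph:", edges)
print(f"E[fraud] observed:        {observed:.3f}")
print(f"E[fraud | do(spend=40)]:  {intervened:.3f}")
```

Under these made-up coefficients the expected value of fraud is 1.0 without intervention and 3.0 under do(spend=40), so the two printed means should land near those values.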
Eight mechanisms — MCAR/MAR/MNAR, label & feature noise, drift, covariate shift, leakage — corrupt a copy while the clean baseline is kept.
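The three missingness mechanisms differ only in what the probability of dropping a value depends on. The following sketch corrupts a deep copy so the clean baseline stays intact; the column names, thresholds, and the `corrupt` helper are hypothetical and are not DataDoom's plugin API.

```python
import copy
import random

def corrupt(rows, mechanism, rate, rng):
    """Return a corrupted copy of rows; the input list is never modified."""
    out = copy.deepcopy(rows)
    for row in out:
        if mechanism == "MCAR":
            p = rate  # missingness ignores every value
        elif mechanism == "MAR":
            p = 2 * rate if row["age"] > 50 else 0.0  # driven by an observed column
        elif mechanism == "MNAR":
            p = 2 * rate if row["income"] > 60 else 0.0  # driven by the hidden value itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        if rng.random() < p:
            row["income"] = None
    return out

rng = random.Random(42)
clean = [
    {"age": rng.randint(18, 80), "income": rng.gauss(60.0, 15.0)}
    for _ in range(1000)
]

for mechanism in ("MCAR", "MAR", "MNAR"):
    dirty = corrupt(clean, mechanism, rate=0.2, rng=rng)
    missing = sum(row["income"] is None for row in dirty)
    print(mechanism, "missing:", missing, "of", len(dirty))
```

Keeping the clean rows alongside the corrupted copy is what lets a downstream evaluation measure exactly how much each failure mode costs.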
Calibrate a binary label so that a baseline model scores within a chosen AUROC band; the achieved metric and the bisection trace are reported.
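One way to picture this calibration is a bisection over a signal-strength knob, reusing the same random draws at every step so that the achieved AUROC moves smoothly with the knob. Everything in the sketch is an assumption made for illustration: the label model, the use of the raw feature as the "baseline model" score, the search interval, and the `auroc` and `calibrate` helpers.

```python
import random

def auroc(scores, labels):
    """Rank-based AUROC (Mann-Whitney U); assumes no tied scores."""
    order = sorted(range(len(scores)), key=scores.__getitem__)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(rank for rank, i in enumerate(order, start=1) if labels[i])
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def calibrate(band, n=2000, seed=42, max_iter=40):
    """Bisect signal strength until the baseline AUROC falls inside band."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]      # the single feature
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]  # fixed label noise
    lo, hi = 0.0, 8.0  # hypothetical search interval for the signal weight
    trace = []
    for _ in range(max_iter):
        signal = (lo + hi) / 2
        labels = [1 if signal * xi + ei > 0 else 0 for xi, ei in zip(x, noise)]
        achieved = auroc(x, labels)  # stand-in baseline: score by the raw feature
        trace.append((signal, achieved))
        if achieved < band[0]:
            lo = signal  # task too hard: strengthen the signal
        elif achieved > band[1]:
            hi = signal  # task too easy: weaken the signal
        else:
            break
    return signal, achieved, trace

signal, achieved, trace = calibrate(band=(0.74, 0.76))
print(f"signal weight: {signal:.4f}, achieved AUROC: {achieved:.4f}")
for step, (s, a) in enumerate(trace, start=1):
    print(f"  step {step}: signal={s:.4f} auroc={a:.4f}")
```

Reporting the trace alongside the final metric makes the difficulty setting auditable: anyone can see which knob values were tried and why the search stopped where it did.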
Distributions, failure modes, and exporters ship as plugins. Export CSV / JSON / Parquet, load into pandas / PyTorch / TF / HuggingFace.
Install
pip install datadoom # engine + CLI
pip install "datadoom[server]" # + web Canvas
pip install "datadoom[parquet]" # + Parquet export
# generate a dataset from a spec
datadoom run examples/causal-fraud.datadoom.yaml --seed 42 --out ./out
# regenerate and compare bytes
datadoom verify examples/causal-fraud.datadoom.yaml --seed 42 --against ./out
pip install "datadoom[server]"
datadoom serve # http://127.0.0.1:8000 — no Node toolchain
Prefer Docker? This command starts the Canvas automatically:
docker run --rm -p 8000:8000 ghcr.io/santhoshreddy352/datadoom:latest
Documentation
Write your first spec, end to end.
The authoring contract for AI tools.
The full spec surface + live manifest.
Extend the engine with your own components.
How determinism, layering, and the pipeline fit.
Runnable example specs and domain templates.