Spec reference

A DataDoom spec is a single YAML (or JSON) file with datadoom_version: 1. The format is additive: within version 1, only new optional fields are ever introduced, so older specs keep working.

The live manifest is authoritative

The spec surface is exposed as a machine-readable manifest built from the engine's live registries, so every built-in and every installed plugin appears in it:

datadoom spec-reference          # CLI: full capability manifest as JSON
GET /api/spec-reference          # same manifest over HTTP (when running the server)

The manifest enumerates every distribution, structural function, failure mode, difficulty tier, feature type, exporter, and text provider, plus the hard validation rules. Prefer it over any static list when building tooling.
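As a rough illustration of building tooling on the manifest rather than a static list, the sketch below runs the CLI command shown above and counts the entries under each top-level key. It assumes the command writes the JSON to stdout and that the top level is a JSON object; the sample fragment and its key names are hypothetical, since the real manifest layout is not described here.

```python
import json
import subprocess


def fetch_manifest_via_cli() -> dict:
    """Run `datadoom spec-reference` and parse its JSON output.

    Assumes the manifest is written to stdout as a single JSON object.
    """
    result = subprocess.run(
        ["datadoom", "spec-reference"],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


def summarize_manifest(manifest: dict) -> dict[str, int]:
    """Count entries under each top-level key that holds a list or mapping."""
    return {
        key: len(value)
        for key, value in manifest.items()
        if isinstance(value, (list, dict))
    }


# Hypothetical manifest fragment, for illustration only.
# Real key names and contents may differ; inspect the live manifest.
sample = {
    "distributions": ["normal", "uniform"],
    "exporters": ["csv", "json", "parquet"],
    "version": 1,
}
print(summarize_manifest(sample))
```

Because the summary makes no assumptions about key names, it keeps working when a plugin adds a new capability category.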

Top-level shape

| Key | Purpose |
| --- | --- |
| datadoom_version | Spec format version (1). |
| name | Human label for the dataset. |
| rows | Number of rows to generate. |
| seed | Default seed (overridable on the CLI / API). |
| features | The columns: a discriminated union by type (numeric, categorical, boolean, datetime, text, timeseries). |
| causal | Optional DAG of structural equations over the features. |
| difficulty | Optional baseline-AUROC targeting for a binary label. |
| failures | Optional ordered list of corruption mechanisms (applied to a copy). |
| export | Output formats (csv/json/parquet) and versions (clean/injected). |
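To make the top-level shape concrete, here is a minimal sketch of a parsed spec as a Python mapping, with a small check of the version and the key names listed above. This is not the engine's validator: the helper check_top_level, the example values, and the empty features and export placeholders are all assumptions for illustration, and the real constraints on each field are defined by the full spec reference and the live manifest.

```python
# Top-level keys listed in this section.
TOP_LEVEL_KEYS = {
    "datadoom_version",
    "name",
    "rows",
    "seed",
    "features",
    "causal",
    "difficulty",
    "failures",
    "export",
}


def check_top_level(spec: dict) -> tuple[list[str], list[str]]:
    """Return (errors, unknown_keys) for a parsed spec mapping.

    Hypothetical helper: it checks only the format version and reports
    top-level keys that this section does not list.
    """
    errors = []
    if spec.get("datadoom_version") != 1:
        errors.append("datadoom_version must be 1")
    unknown_keys = sorted(set(spec) - TOP_LEVEL_KEYS)
    return errors, unknown_keys


# Illustrative values only; the field contents are placeholders.
example_spec = {
    "datadoom_version": 1,
    "name": "churn-demo",
    "rows": 1000,
    "seed": 42,
    "features": [],  # feature entries omitted; their fields are not covered here
    "export": {},    # export settings omitted for the same reason
}

errors, unknown_keys = check_top_level(example_spec)
print(errors, unknown_keys)
```

Unknown keys are reported separately from errors because the format is additive: a key missing from this list may simply be a newer optional field.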

Full surface

The complete prose spec reference (every field, type, and constraint) lives in the authoritative design set:

For a guided, example-driven walkthrough, start with the YAML authoring guide; for the AI-authoring contract, see the LLM reference.