LLM / agent authoring reference¶

The authoring contract for AI tools and agents that emit DataDoom specs: the complete capability surface, the hard validation rules, and validated few-shot examples. Embedded verbatim from the authoritative source.

Live manifest

For a machine-readable, always-current capability list (including any installed plugins), run datadoom spec-reference or GET /api/spec-reference instead of hard-coding this page.

DataDoom spec authoring — reference for AI models¶

This is a dense, authoritative reference for an AI/agent tasked with writing a DataDoom spec (*.datadoom.yaml) from a natural-language request. It is the prose companion to the machine-readable capabilities manifest, which you should fetch programmatically when available:

datadoom spec-reference          # JSON manifest on stdout (CLI)
GET /api/spec-reference           # same manifest over HTTP

The manifest is generated from the live engine registries, so it always reflects the running build and any installed plugins (extra distributions, structural functions, failure modes, exporters). When a plugin is present, prefer the manifest's names over this static list.

For a gentle, human-oriented tutorial see 20_YAML_Authoring_Reference. This document is the terse contract optimized for correct machine generation.

0. Output contract (follow exactly)¶

Emit only YAML for a single spec document — no prose, no code fences unless the caller asked for them.
Use only the names/fields enumerated here (or in the manifest). Never invent a distribution, structural function, failure type, difficulty tier, feature type, or export format. If unsure, choose the closest listed option.
Honor every rule in §9. A spec that violates one is rejected by datadoom validate with a locator. Self-check before returning.
Prefer the simplest spec that satisfies the request. Add causal, failures, or difficulty only when the request implies them.
Always set datadoom_version: "1", a slug name, an integer rows, and a seed (for reproducibility) unless told otherwise.
Numbers are YAML numbers (no quotes); dates and the version are quoted strings.

1. Document skeleton¶

datadoom_version: "1"     # REQUIRED, always "1"
name: "slug-name"         # REQUIRED, matches [A-Za-z0-9_-]+ (no spaces)
description: "..."        # optional
seed: 42                  # optional int — fixes the draws (not part of the hash)
rows: 1000                # REQUIRED, int >= 1
features: { ... }         # REQUIRED — map of column name -> feature (§2)
causal: { ... }           # optional — derive columns from others (§4)
difficulty: { ... }       # optional — calibrate a binary label's hardness (§6)
failures: [ ... ]         # optional — ordered data-quality corruptions (§5)
export: { ... }           # optional — output formats/variants/splits (§7)
meta: { ... }             # optional — free-form, ignored by the engine

Column names must start with a letter or _. Every feature may also carry description (string) and emit (bool — false makes it latent: computed and able to drive the SEM, but not exported and excluded from probe/compliance).

2. Feature types¶

Each feature has a type selecting its fields.

`numeric` — numbers from a distribution (or causal-derived)¶

age: { type: numeric, dist: normal, params: { mean: 40, std: 12 }, min: 18, max: 90, dtype: int }

- dist: a distribution name (§3). Omit dist to make it a causal-derived column (its values then come from causal edges; it must be an edge to). - params: object — the keys required by dist. - min / max: optional clamps. dtype: float (default) or int (rounds).

`categorical` — one label per row¶

tier: { type: categorical, categories: [bronze, silver, gold], weights: [0.6, 0.3, 0.1] }

- categories: non-empty string list. weights: optional non-negative list, positionally matched, normalized (default uniform).

`boolean` — true/false¶

active: { type: boolean, rate: 0.4 }     # P(true)=0.4; default 0.5

Omit rate and a boolean may instead be a causal-derived target (no rate).

`datetime` — timestamps in a range¶

ts: { type: datetime, start: "2023-01-01", end: "2024-12-31", granularity: day }

- granularity: second | minute | hour | day (default day).

`text` — strings (filler or realistic)¶

note:    { type: text, generator: lorem, length: { min: 5, max: 30 } }
person:  { type: text, generator: name, locale: en }

- generator: lorem (uses length) or a realistic key (§8). locale: default en.

`timeseries` — ordered additive series over the row index¶

temp:
  type: timeseries
  trend: { slope: 0.002, intercept: 20 }
  seasonality:
    - { amplitude: 6, period: 24, phase: 0 }      # daily
    - { amplitude: 2, period: 168, phase: 1.5 }   # weekly
  ar: [0.5]            # AR(1); sum(|coeffs|) MUST be < 1
  noise_std: 0.8
  dtype: float

Realizes Xt = trend + Σ seasonality + AR(p) + N(0, noise_std²). Row order is the time axis (preserved). A timeseries may be a causal parent; it is never a causal target and is not distribution-compliance assessed.

3. Distributions (`numeric.dist` and `feature_noise.dist`)¶

name	required `params`	constraints / notes
`normal`	`mean`, `std`	`std > 0`; any real value
`lognormal`	`mu`, `sigma`	`sigma > 0`; positive-only; `mu`/`sigma` are in log space (median = e^mu)
`uniform`	`low`, `high`	`high > low`; flat
`exponential`	`scale`	`scale > 0`; non-negative; mean = scale
`poisson`	`lam`	`lam > 0`; discrete integer counts
`pareto`	`alpha`, `xm`	both `> 0`; values `>= xm`; heavy tail

Pick by shape: counts → poisson; money/positive-skew → lognormal (or pareto for heavy tails); bounded score → normal + min/max; latency → exponential.

4. Causal graph (`causal`)¶

Derive columns from others. A derived feature is declared with no dist/rate and appears as an edge to.

causal:
  edges:
    - { from: age,       to: income,   fn: linear,   weight: 800, bias: 10000 }
    - { from: education, to: income,   fn: map,      mapping: { hs: 0, college: 15000, grad: 40000 } }
    - { from: income,    to: is_fraud, fn: logistic, weight: -0.00002, bias: 1.0 }
  noise:
    income:   { dist: normal, params: { mean: 0, std: 5000 } }
    is_fraud: { dist: none }
  interventions:
    - { do: { income: 50000 } }    # optional: pin a node to a constant

A node sums its incoming edges' contributions, then adds node noise. A boolean target's summed value is turned into a probability (use logistic as its incoming edge) and sampled.

Structural functions (fn):

`fn`	contribution	needs	parent type
`linear`	`weight·parent + bias`	`weight` (+ `bias?`)	numeric/boolean/timeseries
`logistic`	`σ(weight·parent + bias)`	`weight` (+ `bias?`)	numeric/boolean/timeseries
`polynomial`	`Σ coeffs[i]·parent^i`	`coeffs` (list)	numeric/boolean/timeseries
`map`	`mapping[parent]`	`mapping` (covers all categories)	categorical
`identity`	`parent`	—	numeric/boolean/timeseries

noise[node] is { dist: none } (deterministic) or { dist: <name>, params: {…} }.

5. Failures (`failures`) — ordered list¶

Applied after a clean baseline is captured; export injected to write the corrupted file (§7). Each item has type + fields. rate is a fraction in [0,1].

`type`	purpose	fields
`mcar`	random blanks	`column` or `columns:[…]`; `rate`
`mar`	blanks driven by another column	`column`; `driver`; `rate`; `strength?` (def 2.0)
`mnar`	blanks driven by the column's own value	`column`; `driver?`; `rate`; `strength?` (def 2.0)
`label_noise`	flip/reassign a label	`column` (boolean/categorical); `rate`
`feature_noise`	add noise to a number	`column` (numeric); `dist`; `params`
`drift`	shift across the row index	`column` (numeric); `schedule`
`covariate_shift`	rescale to target moments	`column` (numeric); `target:{mean?,std?}`
`leakage`	plant a proxy column	`target` (numeric/boolean); `into` (new ≠ target); `noise?` (def 0.05)

drift.schedule: { kind: linear|step, magnitude: <total shift> } (or rate: <per-row slope>; at: 0..1 jump point for step, default 0.5).

failures:
  - { type: mnar, column: income, rate: 0.12, strength: 2.5 }
  - { type: label_noise, column: is_fraud, rate: 0.03 }
  - { type: drift, column: income, schedule: { kind: linear, magnitude: 8000 } }
  - { type: leakage, target: is_fraud, into: fraud_score, noise: 0.05 }

A failure must not reference a latent (emit:false) feature.

6. Difficulty (`difficulty`) — binary classification only¶

Calibrates the dataset so a baseline probe lands in a target AUROC band.

difficulty:
  target: advanced        # tier name OR { band: [0.72, 0.80] }
  label: defaulted        # boolean / 2-class categorical; not latent
  probe: logreg           # logreg | tree
  max_iters: 10           # int >= 1 (default 8)
  knobs: [noise, label_noise]   # subset of {noise, label_noise}

tier	AUROC band
`beginner`	0.90–0.99
`intermediate`	0.80–0.90
`advanced`	0.72–0.80
`kaggle`	0.62–0.72

Best paired with a causal label (often a latent risk_score emit:false that combines drivers via logistic). Note: calibration blurs predictors, so the distribution-compliance score legitimately drops — that is intended.

7. Export (`export`)¶

export:
  formats: [csv, json, parquet]   # default [csv]; parquet needs the optional extra
  versions: [clean, injected]     # default [clean]; injected only with failures
  splits: { train: 0.8, test: 0.2 }   # ratios MUST sum to 1.0
  shuffle: true                   # default true
  metadata: true                  # default true

8. Realistic text generators (`text.generator`)¶

lorem (filler) plus these seeded providers: name, first_name, last_name, email, username, phone, occupation, title, nationality, address, street, city, state, country, postal_code, company, currency, price, url, hostname, ipv4, word, sentence, color.

9. Hard rules (validation will enforce these)¶

Required top-level keys present; name is a slug; rows >= 1.
A causal-derived feature has no dist/rate and is some edge's to.
A feature is never both sampled (dist) and a causal target.
The causal graph is acyclic; only numeric/boolean nodes can be targets.
map needs a categorical parent and a mapping covering every category; other fns need a numeric/boolean/timeseries parent.
difficulty.label is boolean or 2-class categorical and not latent; knobs ⊆ {noise, label_noise}; target is a tier or {band:[a,b]}.
Failures are ordered; export.versions must include injected to emit the corrupted file; a failure can't reference a latent.
export.splits ratios sum to 1.0; export.formats are known formats.
Time-series ar satisfies sum(|coeffs|) < 1; seasonality period > 0.
Distribution params satisfy their domain (e.g. std > 0, uniform.high > low, poisson.lam > 0).

10. Worked few-shot examples¶

Request: "1k customers with age, country, and a 30%-true churn flag."¶

datadoom_version: "1"
name: "customers-basic"
seed: 42
rows: 1000
features:
  age:     { type: numeric, dist: normal, params: { mean: 42, std: 13 }, min: 18, max: 90, dtype: int }
  country: { type: categorical, categories: [US, UK, IN, DE], weights: [4, 2, 3, 1] }
  churned: { type: boolean, rate: 0.3 }
export: { formats: [csv] }

Request: "Fraud dataset where income depends on age+education and drives fraud, with MNAR income and a leaky proxy. Train/test split."¶

datadoom_version: "1"
name: "fraud-causal"
seed: 7
rows: 5000
features:
  age:       { type: numeric, dist: normal, params: { mean: 40, std: 12 }, min: 18, max: 90, dtype: int }
  education: { type: categorical, categories: [hs, college, grad], weights: [0.5, 0.4, 0.1] }
  income:    { type: numeric, dtype: float, min: 0 }   # derived
  is_fraud:  { type: boolean }                          # derived target
causal:
  edges:
    - { from: age,       to: income,   fn: linear,   weight: 800, bias: 10000 }
    - { from: education, to: income,   fn: map,      mapping: { hs: 0, college: 15000, grad: 40000 } }
    - { from: income,    to: is_fraud, fn: logistic, weight: -0.00002, bias: 1.0 }
  noise:
    income:   { dist: normal, params: { mean: 0, std: 5000 } }
    is_fraud: { dist: none }
failures:
  - { type: mnar,    column: income,   rate: 0.12, strength: 2.5 }
  - { type: leakage, target: is_fraud, into: fraud_score, noise: 0.05 }
export:
  formats: [csv]
  versions: [clean, injected]
  splits: { train: 0.8, test: 0.2 }

Request: "Credit-default benchmark calibrated to 'advanced' difficulty, with a hidden risk score."¶

datadoom_version: "1"
name: "credit-advanced"
seed: 17
rows: 6000
features:
  income:     { type: numeric, dist: normal, params: { mean: 60000, std: 20000 }, min: 0 }
  debt_ratio: { type: numeric, dist: normal, params: { mean: 0.35, std: 0.12 }, min: 0 }
  inquiries:  { type: numeric, dist: poisson, params: { lam: 2 }, dtype: int }
  risk_score: { type: numeric, dtype: float, emit: false }   # latent
  defaulted:  { type: boolean }                              # label (derived)
causal:
  edges:
    - { from: income,     to: risk_score, fn: linear,   weight: -0.00003 }
    - { from: debt_ratio, to: risk_score, fn: linear,   weight: 6.0 }
    - { from: inquiries,  to: risk_score, fn: linear,   weight: 0.5 }
    - { from: risk_score, to: defaulted,  fn: logistic, weight: 3.0, bias: -0.5 }
  noise:
    risk_score: { dist: none }
    defaulted:  { dist: none }
difficulty:
  target: advanced
  label: defaulted
  probe: logreg
  knobs: [noise, label_noise]
export: { formats: [csv] }

Request: "Hourly temperature time-series for a year with daily seasonality and a warming trend."¶

datadoom_version: "1"
name: "temperature-hourly"
seed: 1
rows: 8760
features:
  temperature:
    type: timeseries
    trend: { slope: 0.0005, intercept: 15 }
    seasonality:
      - { amplitude: 8, period: 24, phase: 0 }
    ar: [0.6]
    noise_std: 1.0
    dtype: float
export: { formats: [csv] }

After generating, the author can verify with datadoom validate <file> and generate with datadoom run <file> --seed <n> --out <dir>. The reproducibility contract: same (spec, seed) → identical bytes.

LLM / agent authoring reference¶

DataDoom spec authoring — reference for AI models¶

0. Output contract (follow exactly)¶

1. Document skeleton¶

2. Feature types¶

numeric — numbers from a distribution (or causal-derived)¶

categorical — one label per row¶

boolean — true/false¶

datetime — timestamps in a range¶

text — strings (filler or realistic)¶

timeseries — ordered additive series over the row index¶

3. Distributions (numeric.dist and feature_noise.dist)¶

4. Causal graph (causal)¶

5. Failures (failures) — ordered list¶

6. Difficulty (difficulty) — binary classification only¶

7. Export (export)¶

8. Realistic text generators (text.generator)¶