Architecture
DataDoom is a pure, deterministic engine wrapped by thin surfaces (the CLI, the HTTP API, and the Python entry point datadoom.generate()). Its guarantees come from a handful of non-negotiable invariants enforced in CI.
Layering
- engine/ is a clean, installable library: it imports nothing from the web/DB/CLI layers or any web framework. This is enforced by import-linter (the "engine stays framework-free" contract).
- Higher layers may import lower ones, never the reverse.
- store, plugins, and adapters sit beside the engine and depend only on it.
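
The layering rule can be illustrated with a small import check. This is only a sketch of the idea, not how import-linter works internally and not the project's actual contract: the forbidden module names below (datadoom.api, datadoom.cli, datadoom.db and the listed web frameworks) are assumptions chosen for illustration.

```python
from __future__ import annotations

import ast

# Hypothetical list of modules the engine must never import.
# The real contract lives in the project's import-linter configuration.
FORBIDDEN_PREFIXES = (
    "datadoom.api",
    "datadoom.cli",
    "datadoom.db",
    "fastapi",
    "flask",
    "django",
)


def forbidden_imports(source: str, forbidden: tuple[str, ...] = FORBIDDEN_PREFIXES) -> list[str]:
    """Return the sorted absolute imports in `source` that match a forbidden prefix."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Relative imports (level > 0) are skipped in this sketch.
            names = [node.module]
        else:
            continue
        for name in names:
            # Match the module itself or any submodule, but not lookalikes
            # such as "fastapi_utils".
            if any(name == p or name.startswith(p + ".") for p in forbidden):
                found.append(name)
    return sorted(found)
```

Running a check like this over every file under engine/ and failing the build on a non-empty result gives the same kind of guarantee, in miniature, that the import-linter contract provides in CI.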
The single pipeline
The CLI, the HTTP API, and datadoom.generate() all call
engine.pipeline.generate() — generation logic is never duplicated. The pipeline
is a fixed sequence of stages:
intake → snapshot → seed → base_generation → causal → difficulty
→ failure_injection → compliance → packaging
base_generation samples root features; causal derives the rest via the SEM walk in topological order. difficulty calibrates the clean frame to a target baseline-AUROC band. failure_injection corrupts a copy, preserving the clean baseline. compliance reports fit honestly; packaging writes byte-stable artifacts.
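
The "one pipeline, many surfaces" shape might look roughly like the sketch below. The stage names are taken from the sequence above; everything else (the dict-based state, the placeholder stage bodies, and the cli_main and http_handler wrappers) is a hypothetical stand-in and not the real engine.pipeline code.

```python
from typing import Callable

Stage = Callable[[dict], dict]

# Stage names as listed in this document; the order is fixed.
STAGE_ORDER = (
    "intake",
    "snapshot",
    "seed",
    "base_generation",
    "causal",
    "difficulty",
    "failure_injection",
    "compliance",
    "packaging",
)


def _make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records that it ran."""
    def stage(state: dict) -> dict:
        return {**state, "trace": state["trace"] + [name]}
    return stage


STAGES = [(name, _make_stage(name)) for name in STAGE_ORDER]


def generate(spec: dict, seed: int) -> dict:
    """Run every stage once, in the fixed order (illustrative only)."""
    state = {"spec": spec, "seed": seed, "trace": []}
    for _name, stage in STAGES:
        state = stage(state)
    return state


# Thin surfaces: each one translates its input and delegates to generate().
def cli_main(spec: dict, seed: int) -> dict:
    return generate(spec, seed)


def http_handler(payload: dict) -> dict:
    return generate(payload["spec"], payload["seed"])
```

Because each surface only adapts its inputs and delegates, a fix or a new stage lands in one place and every entry point picks it up.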
Determinism invariant
All randomness flows through engine.rng. The RNG key is:
No stdlib random, uuid4, time, or global np.random.* ever appears in the
data path. On the pinned path (pinned numpy/scikit-learn/mimesis), the same
(spec_hash, seed) yields byte-identical artifacts — proven by the
determinism gate and the cross-OS reproducibility matrix in CI.
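
As a rough illustration of keyed randomness, the sketch below derives every draw from an explicit key and a call counter, with no global state. It is not the engine.rng API, and the key layout (spec_hash, seed, and a stage label joined into one string) is an assumption made for this example. It also uses a SHA-256 counter stream from the standard library instead of a numpy generator, purely to stay dependency-free.

```python
from __future__ import annotations

import hashlib
import struct


class KeyedStream:
    """Hypothetical stand-in for a keyed RNG; not the engine.rng API."""

    def __init__(self, spec_hash: str, seed: int, label: str = "") -> None:
        # Assumed key layout, for illustration only.
        self._key = f"{spec_hash}|{seed}|{label}".encode("utf-8")
        self._counter = 0

    def next_float(self) -> float:
        """Return a float in [0, 1) that depends only on the key and call count."""
        block = hashlib.sha256(self._key + struct.pack(">Q", self._counter)).digest()
        self._counter += 1
        # Keep 53 bits so the value is exactly representable as a double.
        value = int.from_bytes(block[:8], "big") >> 11
        return value / float(1 << 53)


def sample(spec_hash: str, seed: int, n: int) -> list[float]:
    """Draw n values for a given (spec_hash, seed) pair."""
    stream = KeyedStream(spec_hash, seed, label="base_generation")
    return [stream.next_float() for _ in range(n)]
```

Calling sample() twice with the same (spec_hash, seed) returns identical lists, while changing either component gives a different stream. That is the property the determinism gate checks at the level of whole artifacts.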
Reproducible artifacts
metadata.json and the data files carry no timestamps or ambient state. Every run
also ships a locked resolved spec (canonical body + baked-in seed) and an
audit report, all checksummed, so a dataset is as reproducible and
shareable as source code.
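
One way to picture a checksummed, locked spec is the sketch below. The canonical form used here (JSON with sorted keys and compact separators), the SHA-256 digest, and the lock_spec helper with its field names are assumptions for illustration; the actual canonicalization and file layout are defined by the project, not by this example.

```python
import hashlib
import json


def canonical_bytes(document: dict) -> bytes:
    """Serialize to a byte string that does not depend on dict insertion order."""
    text = json.dumps(document, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return text.encode("utf-8")


def sha256_hex(payload: bytes) -> str:
    """Hex digest used as the checksum in this sketch."""
    return hashlib.sha256(payload).hexdigest()


def lock_spec(spec: dict, seed: int) -> dict:
    """Return a hypothetical locked spec: the body with the seed baked in, plus its checksum."""
    body = dict(spec, seed=seed)
    # No timestamps or host information are added, so the result is stable.
    return {"body": body, "sha256": sha256_hex(canonical_bytes(body))}
```

Two specs that differ only in key order produce the same checksum, and any change to the body or the seed changes it, which is what makes a locked spec safe to share and verify.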
Full design set
The authoritative architecture documents: