# Parallelism & Cache
TL;DR: FastFlowTransform executes models in parallel DAG levels and uses deterministic fingerprints to skip unchanged nodes — while a separate HTTP cache accelerates API models.
FastFlowTransform introduces a level-wise parallel scheduler and a build cache driven by stable fingerprints. This document explains how parallel execution works, when nodes are skipped, the exact fingerprint formula, and the meta table written after successful builds.
## Table of Contents
- Parallel Scheduler
- Cache Policy
- Fingerprint Formula
- Meta Table Schema
- CLI Recipes
- Troubleshooting & FAQ
- Example: simple_duckdb
- Appendix: Environment Inputs
## Parallel Scheduler
FastFlowTransform splits the DAG into levels (sets of nodes that can run together without violating dependencies). Within a level, up to `--jobs` nodes execute in parallel.

- Dependencies are never violated.
- `--keep-going`: tasks already started in a level finish; subsequent levels won't start if any task in the current level fails.
- Logs are serialized through an internal queue to keep lines readable and per-node timing visible.
Quick start:

```bash
# Run with 4 workers per level
fft run . --env dev --jobs 4

# Keep tasks in the same level running even if one fails
fft run . --env dev --jobs 4 --keep-going
```
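Conceptually, level-wise execution looks like the sketch below. This is an illustrative Python approximation, not FastFlowTransform's actual scheduler; `dag_levels` (a list of node lists, one per level) and `build_node` (builds one node) are hypothetical placeholders.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_levels(dag_levels, build_node, jobs=4, keep_going=False):
    """Illustrative sketch only (not the real scheduler).

    dag_levels: list of lists of node names, one list per DAG level.
    build_node: callable that builds a single node by name.
    """
    for level in dag_levels:
        failed = []
        with ThreadPoolExecutor(max_workers=jobs) as pool:
            futures = {pool.submit(build_node, name): name for name in level}
            for fut in as_completed(futures):
                if fut.cancelled():
                    continue
                if fut.exception() is not None:
                    failed.append(futures[fut])
                    if not keep_going:
                        # Stop handing out queued work in this level;
                        # tasks that already started still run to completion.
                        for pending in futures:
                            pending.cancel()
        if failed:
            break  # subsequent levels never start once a level has failed
```

Two of the behaviors described above are visible in the sketch: tasks that have already started always run to completion, and no later level starts once a level has failed.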
## Cache Policy
The cache decides whether a node can be skipped when nothing relevant changed. Modes:
```text
--cache=off        # always build
--cache=rw         # default; skip on match; write cache after build
--cache=ro         # skip on match; on miss build but don't write cache
--cache=wo         # always build and write cache
--rebuild <glob>   # ignore cache for matching nodes
--no-cache         # alias for --cache=off
```
### Skip condition
A node is skipped iff:
- The current fingerprint matches the on-disk cache value, and
- The physical relation exists on the target engine.
If the relation was dropped externally, FastFlowTransform will rebuild even if the fingerprint matches.
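The decision itself is just a conjunction of those two checks. A minimal sketch, with hypothetical inputs (`current_fp`, `cached_fp`, `relation_exists`) standing in for the real lookups:

```python
def should_skip(current_fp, cached_fp, relation_exists):
    """Skip only if the fingerprint matches AND the relation still exists."""
    return cached_fp is not None and current_fp == cached_fp and relation_exists

# A matching fingerprint alone is not enough once the table has been dropped.
assert should_skip("abc123", "abc123", relation_exists=True)
assert not should_skip("abc123", "abc123", relation_exists=False)
assert not should_skip("abc123", "def456", relation_exists=True)
```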
### HTTP Response Cache
In addition to the build cache, FastFlowTransform provides an HTTP response cache for API models using `fastflowtransform.api.http.get_df(...)`.

- Purpose: avoid redundant API calls and support offline mode.
- Location: controlled by `FF_HTTP_CACHE_DIR` (e.g. `.local/http-cache`).
- Controls (environment):
    - `FF_HTTP_ALLOWED_DOMAINS`: comma-separated list of hosts allowed to cache.
    - `FF_HTTP_MAX_RPS`, `FF_HTTP_MAX_RETRIES`, `FF_HTTP_TIMEOUT`: rate limiting & retry policy.
    - `FF_HTTP_OFFLINE=1`: run in offline mode; serve only from cache, no network calls.
- CLI visibility: each run writes HTTP stats (`requests`, `cache_hits`, `bytes`, `used_offline`) to `.fastflowtransform/target/run_results.json`.
- Makefile helpers: see `make api-show-http` in the API demo to inspect HTTP cache usage.
This cache is independent of the build cache; it stores API responses, not SQL or fingerprints.
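To check what the HTTP cache did during a run, the stats listed above can be read back from `run_results.json`. A minimal sketch, assuming the run was executed from the project root; the exact nesting of the HTTP stats inside the JSON payload is an assumption here:

```python
import json
from pathlib import Path

# Path as documented above; the layout of the HTTP stats inside the
# payload is an assumption for illustration.
payload = json.loads(Path(".fastflowtransform/target/run_results.json").read_text())
http_stats = payload.get("http", payload)  # fall back to top level if not nested

for key in ("requests", "cache_hits", "bytes", "used_offline"):
    print(f"{key}: {http_stats.get(key)}")
```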
## Fingerprint Formula
Fingerprints are stable hashes that change on any relevant input:

- SQL models: `fingerprint_sql(node, rendered_sql, env_ctx, dep_fps)`
  Uses the rendered SQL (after Jinja), not the raw template.
- Python models: `fingerprint_py(node, func_src, env_ctx, dep_fps)`
  Uses `inspect.getsource(func)` with a file-content fallback if needed.

`env_ctx` includes:

- `engine` (e.g., `duckdb`, `postgres`, `bigquery`)
- `profile_name` (CLI `--env`)
- Selected environment entries: all `FF_*` keys (key + value)
- A normalized portion of `sources.yml` (sorted keys/dump)

`dep_fps` are the upstream fingerprints; any upstream change invalidates downstream fingerprints.
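As a mental model, such a fingerprint can be built by hashing a canonical serialization of the body, the environment context, and the sorted upstream fingerprints. The sketch below is illustrative only; it is not the real `fingerprint_sql` / `fingerprint_py` implementation.

```python
import hashlib
import json

def fingerprint(node_name, body, env_ctx, dep_fps):
    """Illustrative stand-in for fingerprint_sql / fingerprint_py.

    body:    rendered SQL or Python function source
    env_ctx: dict with engine, profile_name, FF_* vars, normalized sources.yml
    dep_fps: iterable of upstream fingerprints
    """
    payload = json.dumps(
        {
            "node": node_name,
            "body": body,
            "env": env_ctx,
            "deps": sorted(dep_fps),
        },
        sort_keys=True,  # canonical ordering => same inputs, same hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```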
Properties
- Same inputs ⇒ same hash.
- Minimal change in SQL/function ⇒ different hash.
- Dependency changes propagate downstream.
Note: The active engine and profile name are part of the fingerprint. Switching from `duckdb` to `postgres` automatically invalidates the cache, so cross-engine runs never reuse outdated fingerprints.
## Meta Table Schema
After a successful build, FastFlowTransform writes a per-node audit row:
```sql
_ff_meta (
    node_name   TEXT/STRING,  -- logical name, e.g. "users.ff"
    relation    TEXT/STRING,  -- physical table/view, e.g. "users"
    fingerprint TEXT/STRING,
    engine      TEXT/STRING,
    built_at    TIMESTAMP
)
```
Backends:

- DuckDB: table `_ff_meta` in `main`.
- Postgres: table `_ff_meta` in the active schema.
- BigQuery: table `<dataset>._ff_meta`.
Note: Skip logic uses the file-backed fingerprint cache and a direct relation existence check; the meta table is for auditing and tooling.
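For auditing or tooling, `_ff_meta` can be queried like any other relation. A minimal sketch for the DuckDB backend (requires the `duckdb` Python package; the database path is an example, point it at your project's DuckDB file):

```python
import duckdb

# Example path; use the same file as FF_DUCKDB_PATH in your project.
con = duckdb.connect("local/dev.duckdb")
rows = con.execute(
    "SELECT node_name, relation, fingerprint, engine, built_at "
    "FROM main._ff_meta ORDER BY built_at DESC"
).fetchall()

for node_name, relation, fingerprint, engine, built_at in rows:
    print(node_name, relation, fingerprint[:12], engine, built_at)
```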
## CLI Recipes
```bash
# First run — builds everything, writes cache and meta
fft run . --env dev --cache=rw

# No-op run — should skip all nodes (if nothing changed)
fft run . --env dev --cache=rw

# Force rebuild of a single model (ignores cache for it)
fft run . --env dev --cache=rw --rebuild marts_daily.ff

# Read-only cache (skip on match, build on miss, no writes)
fft run . --env dev --cache=ro

# Always build and write cache
fft run . --env dev --cache=wo

# Disable cache entirely
fft run . --env dev --no-cache
```
With parallelism:
```bash
fft run . --env dev --jobs 4
fft run . --env dev --jobs 4 --keep-going
```
## Troubleshooting & FAQ
“Why did it skip?” A skip requires a fingerprint match and an existing relation. Fingerprints include:

- rendered SQL / Python function source,
- `sources.yml` (normalized),
- engine/profile,
- all `FF_*` environment variables,
- upstream fingerprints.

Any change in the above triggers a rebuild downstream.
“Relation missing but cache says skip?” We also check relation existence. If the table/view was dropped externally, FastFlowTransform will rebuild.
“My logs interleave under parallelism.”
Logs are serialized via a queue; use `-v` / `-vv` for richer but still stable output. Each node prints start/end and duration; levels summarize.
“Utest cache?”
`fft utest --cache {off|ro|rw}` defaults to `off` for deterministic runs. With `rw`, expensive unit cases can be accelerated. Unit tests do not rely on the meta table by default.
## Example: simple_duckdb
The demo contains two independent staging nodes (`users.ff.sql`, `orders.ff.sql`). They run in parallel within the same level.
Makefile targets:
```make
run_parallel:
	FF_ENGINE=duckdb FF_DUCKDB_PATH="$(DB)" fft run "$(PROJECT)" --env dev --jobs 4

cache_rw_first:
	FF_ENGINE=duckdb FF_DUCKDB_PATH="$(DB)" fft run "$(PROJECT)" --env dev --cache=rw

cache_rw_second:
	FF_ENGINE=duckdb FF_DUCKDB_PATH="$(DB)" fft run "$(PROJECT)" --env dev --cache=rw

cache_invalidate_env:
	FF_ENGINE=duckdb FF_DUCKDB_PATH="$(DB)" FF_DEMO_TOGGLE=1 fft run "$(PROJECT)" --env dev --cache=rw
```
## Appendix: Environment Inputs
Only environment variables with the `FF_` prefix affect fingerprints (keys and values). If you change one (e.g., `FF_RUN_DATE`, `FF_REGION`), fingerprints change and downstream nodes rebuild.
```bash
# Will invalidate fingerprints and rebuild affected nodes
FF_RUN_DATE=2025-01-01 fft run . --env dev --cache=rw
```
---
### 🔗 `docs/index.md` – Link to the new chapter

```diff
--- a/docs/index.md
+++ b/docs/index.md
@@ -10,6 +10,7 @@
 - [User Guide – Operational](./Technical_Overview.md#part-i--operational-guide)
 - [Modeling Reference](./Config_and_Macros.md)
+- [Parallelism & Cache](./Cache_and_Parallelism.md)
 - [Developer Guide – Architecture & Internals](./Technical_Overview.md#part-ii--architecture--internals)
```