# Cost monitoring, query stats & budgets

This document describes the new cost & observability features in
fastflowtransform:

- Per-query cost guards via environment variables
- Per-model query statistics in `run_results.json`
- Optional project-level budgets via `budgets.yml`
All of these are opt-in and backwards-compatible: if you do nothing, existing projects keep working exactly as before.
## 1. Per-query cost guards (budgets.yml or FF_*_MAX_BYTES)
For engines that can estimate query size, FFT can now abort a query before running it if it would scan more data than you allow.
### Supported engines & env vars

| Engine | Env var | Applies to |
|---|---|---|
| BigQuery | `FF_BQ_MAX_BYTES` | All model-driven SQL queries |
| DuckDB | `FF_DUCKDB_MAX_BYTES` | All model-driven SQL queries |
| Postgres | `FF_PG_MAX_BYTES` | All model-driven SQL queries |
| Databricks / Spark | `FF_SPK_MAX_BYTES` | All `spark.sql(...)` queries |
| Snowflake (Snowpark) | `FF_SF_MAX_BYTES` | All `session.sql(...)` queries |
If the env var is unset or ≤ 0, the guard is disabled for that engine.
### Value format

All `FF_*_MAX_BYTES` values use the same parser as `fastflowtransform.executors.budget.parse_max_bytes_env`:

- Plain integers: `5000000000`
- With unit suffix (case-insensitive, `_` and `,` ignored): `10GB`, `5G`, `750MB`, `100M`, `42KB`, `10K`
- Units are powers of 1024
Examples:
```bash
# Limit BigQuery queries to ~1 GB
export FF_BQ_MAX_BYTES=1GB

# Limit DuckDB queries to ~2 GB
export FF_DUCKDB_MAX_BYTES=2GB

# Limit Snowflake queries to ~100 MB
export FF_SF_MAX_BYTES=100mb

# Limit Postgres queries to ~500 MB
export FF_PG_MAX_BYTES=500M

# Limit Spark queries to ~10 GB
export FF_SPK_MAX_BYTES=10g
```
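For reference, here is a minimal sketch of how such a value can be interpreted, following the rules above (illustrative only; the actual `parse_max_bytes_env` implementation may differ in details):

```python
import os
import re

# Illustrative parser: powers of 1024, case-insensitive suffixes, "_" and "," ignored.
_UNITS = {"": 1, "k": 1024, "kb": 1024, "m": 1024**2, "mb": 1024**2,
          "g": 1024**3, "gb": 1024**3}

def parse_max_bytes(value: str | None) -> int | None:
    if not value:
        return None  # unset -> guard disabled
    cleaned = value.strip().lower().replace("_", "").replace(",", "")
    match = re.fullmatch(r"(\d+)([a-z]*)", cleaned)
    if not match or match.group(2) not in _UNITS:
        return None
    limit = int(match.group(1)) * _UNITS[match.group(2)]
    return limit if limit > 0 else None  # <= 0 -> guard disabled

# e.g. parse_max_bytes(os.environ.get("FF_BQ_MAX_BYTES"))  # "1GB" -> 1073741824
```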
### Behaviour
1. **Before running a query**, the executor asks the engine for a best-effort
   size estimate:
     * BigQuery: dry-run load/query job reports logical bytes.
     * DuckDB: `EXPLAIN (FORMAT JSON)` stats (cardinality × row width), best-effort.
     * Postgres: `EXPLAIN (FORMAT JSON)` → `Plan Rows × Plan Width`.
     * Spark/Databricks: `DataFrame._jdf.queryExecution().optimizedPlan().stats().sizeInBytes`.
     * Snowflake Snowpark: `EXPLAIN USING JSON` and `GlobalStats.bytesAssigned`.
2. If the estimate is **`None`** or **≤ 0** (cannot estimate), the query runs
as usual and is **not blocked**.
3. If the estimate is **> limit**, FFT raises a `RuntimeError` *before* the
query is submitted:
> Aborting <Engine> query: estimated scanned bytes 123456789 (117 MB)
> exceed FF_XY_MAX_BYTES=100000000 (95.4 MB).
> Increase FF_XY_MAX_BYTES or unset it to allow this query.
The CLI surfaces this error as a **failed model** like any other engine error.
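Conceptually, the guard is a pre-flight check along these lines (an illustrative sketch, not the actual executor code; the real message also includes the human-readable sizes shown above):

```python
def check_scan_guard(estimated_bytes: int | None, max_bytes: int | None, env_var: str) -> None:
    """Abort before submitting a query whose estimate exceeds the limit."""
    if max_bytes is None or max_bytes <= 0:
        return  # guard disabled for this engine
    if estimated_bytes is None or estimated_bytes <= 0:
        return  # no usable estimate -> run as usual, never block
    if estimated_bytes > max_bytes:
        raise RuntimeError(
            f"Aborting query: estimated scanned bytes {estimated_bytes} "
            f"exceed {env_var}={max_bytes}. "
            f"Increase {env_var} or unset it to allow this query."
        )
```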
### Configuring limits in budgets.yml
You can define the same per-query limits directly in `budgets.yml` so that
every engineer sees a single source of truth:
```yaml
version: 1

query_limits:
  duckdb:
    max_bytes: 2_000_000_000
  postgres:
    max_bytes: 500_000_000
```

When `query_limits.<engine>.max_bytes` is set, FFT automatically enforces it for
that engine. Environment variables still work and explicitly override the YAML
value, which is useful for quick experiments in CI or ad-hoc runs.
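The precedence can be pictured roughly like this (illustrative only; `parse_max_bytes` is the hypothetical helper from the sketch above, and the real lookup lives in FFT's budget helpers):

```python
import os

_ENV_VARS = {
    "bigquery": "FF_BQ_MAX_BYTES",
    "duckdb": "FF_DUCKDB_MAX_BYTES",
    "postgres": "FF_PG_MAX_BYTES",
    "databricks_spark": "FF_SPK_MAX_BYTES",
    "snowflake_snowpark": "FF_SF_MAX_BYTES",
}

def effective_max_bytes(engine: str, query_limits: dict) -> int | None:
    """Env var (if set to a positive value) wins over the budgets.yml value."""
    env_limit = parse_max_bytes(os.environ.get(_ENV_VARS[engine]))  # see parsing sketch above
    if env_limit:
        return env_limit
    return (query_limits.get(engine) or {}).get("max_bytes")
```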
## 2. Query statistics per model (run_results.json)

FFT now collects per-model query statistics and writes them into
`.fastflowtransform/target/run_results.json` alongside the usual run
information.
### What is recorded?

For each model in a run, FFT now aggregates (best effort):

- `bytes_scanned`: total bytes processed by all SQL queries in that model
- `rows`: total rows affected/returned (where the engine reports it)
- `query_duration_ms`: total time spent inside the engine for SQL queries (excluding Python overhead)
These are stored on each `RunNodeResult`:

```json
{
  "results": [
    {
      "name": "fct_events",
      "status": "success",
      "duration_ms": 1234,
      "bytes_scanned": 987654321,
      "rows": 123456,
      "query_duration_ms": 1100
    },
    {
      "name": "dim_users",
      "status": "success",
      "duration_ms": 250,
      "bytes_scanned": 1048576,
      "rows": 1000,
      "query_duration_ms": 200
    }
  ]
}
```
Engines that don’t support a metric simply leave it as `null`/`0`; all
statistics are best effort and must never break a run.
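In other words, every metric is collected defensively; conceptually something like this illustrative sketch:

```python
from typing import Callable

def best_effort(collect: Callable[[], int | None]) -> int | None:
    """Best-effort metric collection: a failing or unsupported probe
    must never break the model run, it just leaves the field empty."""
    try:
        value = collect()
    except Exception:
        return None  # engine cannot report this metric
    return value if value and value > 0 else None
```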
### Where does it come from?

- Each executor implements a private `_execute_sql(...)` method that is used for all model-driven SQL.
- After each query, the executor updates its internal “current node stats”.
- At the end of a model run, `_RunEngine` calls `executor.get_node_stats()` and stores the aggregated numbers in `run_results.json`.
This design means:

- All SQL executed by FFT for a model contributes to these stats.
- Python models now contribute metrics on engines that can inspect their DataFrames:
  - Postgres (pandas): in-memory dataframe size / row count.
  - BigQuery (pandas): `load_table_from_dataframe` duration + pandas memory usage.
  - Databricks / Spark: optimized plan stats via the Spark query execution.
  - Snowflake Snowpark: `EXPLAIN USING JSON` for the Snowpark plan.
- Engines without support fall back to `None` for bytes/rows.

A rough mental model of how an executor accumulates these numbers is sketched below.
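The sketch is illustrative only: class, method and field names follow the description above (`_execute_sql`, `get_node_stats`), not necessarily the real executor classes.

```python
import time
from dataclasses import dataclass

@dataclass
class NodeStats:
    bytes_scanned: int = 0
    rows: int = 0
    query_duration_ms: int = 0

class SketchExecutor:
    """Every model-driven SQL call funnels through _execute_sql and
    updates the stats of the node currently being built."""

    def __init__(self) -> None:
        self._stats = NodeStats()

    def _execute_sql(self, sql: str):
        start = time.monotonic()
        result = self._run_on_engine(sql)  # engine-specific call (DuckDB, Postgres, ...)
        self._stats.query_duration_ms += int((time.monotonic() - start) * 1000)
        self._stats.bytes_scanned += self._estimate_bytes(sql) or 0   # best effort
        self._stats.rows += getattr(result, "rowcount", 0) or 0       # where reported
        return result

    def get_node_stats(self) -> NodeStats:
        """Called by the run engine at the end of a model run."""
        return self._stats

    # Engine-specific hooks, stubbed here:
    def _run_on_engine(self, sql: str):
        raise NotImplementedError

    def _estimate_bytes(self, sql: str) -> int | None:
        return None
```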
### How to use it

You can post-process `run_results.json` to:
- Identify the most expensive models (by bytes scanned or query time)
- Build dashboards / cost reports in your own tooling
- Feed data into a documentation UI (e.g. show last run cost per model)
Example:

```python
import json
from pathlib import Path

path = Path("my_project/.fastflowtransform/target/run_results.json")
data = json.loads(path.read_text())

for node in data["results"]:
    print(
        node["name"],
        node.get("bytes_scanned"),
        node.get("query_duration_ms"),
    )
```
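To find the most expensive models of the last run, you can sort the same data, for example by bytes scanned:

```python
import json
from pathlib import Path

data = json.loads(Path("my_project/.fastflowtransform/target/run_results.json").read_text())

# Models with missing stats (None) sort last as 0
ranked = sorted(data["results"], key=lambda node: node.get("bytes_scanned") or 0, reverse=True)

for node in ranked[:10]:
    print(f"{node['name']:<30} {node.get('bytes_scanned') or 0:>15,} bytes")
```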
## 3. Project-level budgets (budgets.yml)

In addition to per-query env guards, you can define project-level budgets in
a `budgets.yml` file at the project root:

- Env vars (`FF_*_MAX_BYTES`) protect individual queries
- `budgets.yml` protects the entire run (and optional per-engine / per-model slices) based on the aggregated stats in `run_results.json`

> ⚠️ If you don’t create `budgets.yml`, nothing changes – budgets are fully opt-in.
### File location

Place the file next to `project.yml`:

```
my_project/
  project.yml
  budgets.yml     # ← new
  models/
  ...
```
### Basic example

```yaml
version: 1

# Per-engine query limits (applied before executing individual queries)
query_limits:
  duckdb:
    max_bytes: 5_000_000
  postgres:
    max_bytes: 10_000_000
  bigquery:
    max_bytes: 50_000_000
  databricks_spark:
    max_bytes: 50_000_000
  snowflake_snowpark:
    max_bytes: 50_000_000

# Global limits across the entire fft run
total:
  bytes_scanned:
    # ~10 MB – adjust down if you want to force a warning
    warn: 100
    # ~100 MB – adjust down if you want to force an error
    error: 100_000_000
  # Optional: total query time across all queries in the run
  query_duration_ms:
    warn: "30s"   # human-friendly duration, parsed to ms
    error: "2m"

# Per-model limits (keys must match node names: stg_users.ff, mart_user_orders.ff, http_users, ...)
models:
  stg_users.ff:
    bytes_scanned:
      # keep this fairly low so you can see a warn if you want
      warn: 100
      error: 10_000_000
  stg_orders.ff:
    bytes_scanned:
      warn: 1_000_000
      error: 10_000_000
  mart_user_orders.ff:
    bytes_scanned:
      warn: 1_000_000
      error: 100_000_000
  http_users:
    # HTTP model → mainly interesting on engines that can report bytes_scanned
    bytes_scanned:
      warn: 5_000_000
      error: 50_000_000
  py_constants:
    bytes_scanned:
      warn: 5_000_000
      error: 50_000_000

# Per-tag budgets (aggregated over all models with that tag)
tags:
  "example:cache_demo":
    bytes_scanned:
      warn: 10_000_000
```
### Value syntax

- `warn` / `error` for bytes use the same notation as `FF_*_MAX_BYTES`:

  ```yaml
  warn: "5GB"
  error: "500_000_000"
  ```

- Durations are currently in milliseconds (plain integers).
### When are budgets evaluated?

1. You run `fft run` as usual.
2. FFT executes models and writes `run_results.json` with the new stats.
3. After the run completes, FFT:
   - aggregates `bytes_scanned` / `query_duration_ms` across all `RunNodeResult`s,
   - compares the totals to the budgets defined in `budgets.yml`,
   - emits warnings and/or fails the run depending on your config.
### What happens on exceed?

- If a **warn** budget is exceeded, FFT prints a warning but the CLI exit code stays `0` (unless there were other errors).
- If an **error** budget is exceeded, FFT treats it like a failed run and exits with `1`.

If both `warn` and `error` are defined and exceeded, the error behaviour wins. A sketch of this evaluation logic follows below.
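A minimal sketch of that post-run evaluation, assuming the budget values are already resolved to plain byte counts (illustrative, not the actual implementation):

```python
def evaluate_bytes_budget(results: list[dict], budget: dict) -> str:
    """Compare the aggregated bytes_scanned of a run against a warn/error budget.
    Returns "ok", "warn" or "error"; error wins if both thresholds are exceeded."""
    total = sum(node.get("bytes_scanned") or 0 for node in results)
    status = "ok"
    warn, error = budget.get("warn"), budget.get("error")
    if warn is not None and total > warn:
        status = "warn"   # printed as a warning, exit code stays 0
    if error is not None and total > error:
        status = "error"  # run is treated as failed, exit code 1
    return status

# e.g. evaluate_bytes_budget(run_results["results"], {"warn": 100, "error": 100_000_000})
```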
### Interaction with env-level guards

- Env vars (`FF_*_MAX_BYTES`) are hard per-query gates. If an estimate is above the limit, the query is never sent to the engine.
- Budgets (`budgets.yml`) are soft run-level gates. They look at what was actually executed and may warn or fail after the fact.

You can happily combine both:

- Env vars to avoid accidentally launching a single “monster query”
- Budgets to catch an overall run that’s becoming too expensive over time
## 4. Backwards compatibility

- All new features are opt-in:
  - No env var → no per-query guard
  - No `budgets.yml` → no run-level budgets
- Existing `run_results.json` readers remain valid; they simply gain more fields (`bytes_scanned`, `rows`, `query_duration_ms`).
- Engines that can’t provide certain metrics just leave them empty/`null`. No behaviour changes for those engines.