Add dbt-factory bundle template and example to contrib #163

Open
mwojtyczka wants to merge 2 commits into databricks:main from mwojtyczka:dbt-factory-contrib
mwojtyczka commented Jul 2, 2026

What this adds

By default, running dbt on Databricks executes your whole dbt project as a single, opaque Workflow task — one green or red box. You can't see which model failed, you can't rerun just the failed models, and independent models don't run in parallel.

This PR provides a template that turns a dbt project into a Databricks Workflow with one task per dbt object
(model, seed, snapshot, test), with dependencies wired to match your dbt DAG. That gives you:

  • Faster execution — independent models run in parallel, and the notebook task type keeps
    dbt's dependencies pre-cached in the serverless environment, avoiding a cold start on every task.
  • Visibility & simplified troubleshooting — pinpoint failures at the model level in the UI.
  • Enhanced logging & notifications — per-task logs and precise, model-level error alerts.
  • Improved retriability — retry only the failed model tasks without rerunning the whole project.
  • Seamless testing — dbt data tests run as their own tasks right after each model finishes.
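
To illustrate the parallelism point above, here is a minimal sketch using Python's standard-library `graphlib`. It groups a dependency graph into waves of tasks whose upstream dependencies are all satisfied. The model names are hypothetical, and the waves are only an illustration: a Workflow starts each task as soon as its own upstream tasks finish, not in strict rounds.

```python
from graphlib import TopologicalSorter


def parallel_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group nodes into waves; every node in a wave can run concurrently."""
    sorter = TopologicalSorter(deps)  # maps node -> set of upstream nodes
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # nodes with no unmet dependencies
        waves.append(ready)
        sorter.done(*ready)  # mark them finished so downstream nodes unlock
    return waves


# Hypothetical dbt DAG: two independent staging models feed one mart.
deps = {
    "stg_orders": {"raw_orders"},
    "stg_customers": {"raw_customers"},
    "orders_by_customer": {"stg_orders", "stg_customers"},
}
waves = parallel_waves(deps)
for index, wave in enumerate(waves, start=1):
    print(f"wave {index}: {wave}")
```

With a single opaque dbt task, these five nodes would sit behind one box in the UI; with one task per node, the two staging models in the sketch are free to run side by side.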

It ships two things:

  • contrib/dbt_factory/ — a complete, deployable example you can clone and run.
  • contrib/templates/dbt-factory/ — a databricks bundle init template that scaffolds a new dbt project already wired up this way (bring your own models, or migrate an existing project).

How it works

flowchart TD
    subgraph setup["One-time setup"]
      T["dbt-factory bundle template"] -->|databricks bundle init| B["Scaffolded project:<br/>dbt project + PyDABs hook + factory code"]
      X["Existing dbt project<br/>(optional)"] -.->|move models/seeds/... into src/| B
    end
    subgraph deploy["Every deploy"]
      C["make manifest<br/>(dbt parse)"] --> D["target/manifest.json"]
      D --> E["databricks bundle deploy"]
      E --> F["PyDABs load_resources reads the<br/>manifest and generates the job"]
    end
    subgraph runtime["At run time — serverless"]
      G["Databricks Workflow:<br/>one task per model / seed / snapshot / test"] --> H["Each task triggers dbt<br/>via the runner notebook"]
      H --> I[("SQL warehouse")]
    end
    B --> C
    F --> G

    classDef optional stroke:#999,stroke-dasharray:5 4,color:#888;
    class X optional;

The job is not written as YAML. It's generated at databricks bundle deploy time from the dbt manifest via the PyDABs
python.resources hook: resources/__init__.py's load_resources reads target/manifest.json and builds one Databricks task per dbt node, wiring up dependencies. No per-model YAML is checked in — the task graph tracks the dbt DAG automatically.
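
The manifest-to-task mapping described above can be sketched roughly as follows. This is not the code in `resources/__init__.py` or in the bundled library; it is a simplified illustration that uses plain dictionaries shaped like Jobs API tasks (`task_key`, `depends_on`). The `build_tasks` and `task_key` helpers, the task-key naming scheme, and the sample manifest are all assumptions made for this example.

```python
import json
from pathlib import Path

# dbt resource types that become their own Workflow task in this sketch.
RUNNABLE_TYPES = {"model", "seed", "snapshot", "test"}


def task_key(unique_id: str) -> str:
    """Hypothetical naming scheme: dbt unique_id with dots replaced."""
    return unique_id.replace(".", "_")


def build_tasks(manifest: dict) -> list[dict]:
    """Build one task dict per runnable dbt node, wired to its upstream nodes."""
    nodes = {
        uid: node
        for uid, node in manifest.get("nodes", {}).items()
        if node.get("resource_type") in RUNNABLE_TYPES
    }
    tasks = []
    for uid, node in sorted(nodes.items()):
        # Keep only dependencies that are tasks themselves (sources are not).
        upstream = [d for d in node.get("depends_on", {}).get("nodes", []) if d in nodes]
        tasks.append(
            {
                "task_key": task_key(uid),
                "depends_on": [{"task_key": task_key(d)} for d in upstream],
            }
        )
    return tasks


def load_manifest(path: str = "target/manifest.json") -> dict:
    """Read the manifest written by `dbt parse`."""
    return json.loads(Path(path).read_text())


# Tiny hand-written manifest fragment, for illustration only.
example_manifest = {
    "nodes": {
        "seed.shop.raw_orders": {"resource_type": "seed", "depends_on": {"nodes": []}},
        "model.shop.orders": {
            "resource_type": "model",
            "depends_on": {"nodes": ["seed.shop.raw_orders", "source.shop.events"]},
        },
        "test.shop.not_null_orders_id": {
            "resource_type": "test",
            "depends_on": {"nodes": ["model.shop.orders"]},
        },
    }
}
tasks = build_tasks(example_manifest)
```

Because the task list is derived from the manifest on every deploy, adding or removing a dbt model changes the job without any hand-edited job definition.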

The generation logic comes from the databricks-dbt-factory library, whose source is included under src/databricks_dbt_factory/ (adapted from commit e767a9d, v0.2.1, MIT — reformatted to this repo's style; see NOTICE for attribution). The only integration code is the small resources/__init__.py.

Design choices

  • Serverless compute and the notebook task type (dbt runs via a small runner notebook using dbtRunner) for the fastest task start times.
  • A minimal set of exposed options: bundle_tests, environment_key, extra_dbt_command_options. Target / warehouse / catalog / schema live in dbt_profiles/profiles.yml.
  • The example commits a target/manifest.json so it deploys out of the box; make manifest regenerates it. The README covers migrating an existing dbt project (init, then move your dbt files into the generated project).
  • Ships both an example and a template.
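
As a rough idea of what a per-task runner might do, the sketch below builds the dbt arguments for a single node and hands them to dbt's programmatic entry point, `dbtRunner`. It is not the runner notebook from this PR: the `build_dbt_args` and `run_node` helpers and the resource-type-to-command mapping are assumptions for illustration, and extra options are simply split shell-style and appended.

```python
import shlex

# Assumed mapping from dbt resource type to the dbt subcommand that builds it.
DBT_COMMANDS = {"model": "run", "seed": "seed", "snapshot": "snapshot", "test": "test"}


def build_dbt_args(resource_type: str, name: str, extra_options: str = "") -> list[str]:
    """Build the dbt CLI arguments that select exactly one node."""
    command = DBT_COMMANDS[resource_type]
    # shlex.split keeps quoted values such as --vars '{"k": "v"}' in one piece.
    return [command, "--select", name, *shlex.split(extra_options)]


def run_node(resource_type: str, name: str, extra_options: str = ""):
    """Invoke dbt in-process for one node; requires dbt-core to be installed."""
    from dbt.cli.main import dbtRunner  # imported lazily so the helper above works without dbt

    result = dbtRunner().invoke(build_dbt_args(resource_type, name, extra_options))
    if not result.success:
        # Fail the task so the Workflow marks this node's box red.
        raise RuntimeError(f"dbt failed for {resource_type} {name}") from result.exception
    return result


args = build_dbt_args("model", "orders", "--vars '{\"env\": \"dev\"}'")
print(args)
```

Raising on failure is what lets the Workflow surface the error on the specific model's task and makes a repair run retry only that task.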

Testing

  • Deployed and ran end-to-end on a serverless SQL warehouse: databricks bundle deploy + databricks bundle run completes with SUCCESS, and all model and test tasks pass.
  • Offline test suite (make test) passes — including an integration test that exercises load_resources against the committed manifest with no workspace.
  • databricks bundle init on the template renders cleanly; the generated project's resources/__init__.py compiles and its bundle validates.
  • Passes ruff format --check (the repo's fmt CI check).

Note for reviewers

The databricks_dbt_factory/ core (and its tests) appear twice: once in the example (contrib/dbt_factory/) and once in the template (contrib/templates/dbt-factory/template/{{.project_name}}/). This is intentional, not an
oversight:

  • Each artifact must be self-contained. The example is meant to be cloned and run as-is, and databricks bundle init can only stamp out files that live under the template's template/ directory — a template file can't reference code outside it. So both need their own copy.
  • Consistent with repo precedent. This mirrors the existing contrib/templates/data-engineering + contrib/data_engineering pairing, which likewise duplicates its shared files (e.g. scripts/, conftest.py) between template and example.

The code is owned by this repo (bundle-examples) — the NOTICE files credit the original databricks-dbt-factory source for attribution; it's been reformatted to the repo's style and passes ruff format --check.

Slight downside: a future update to the core means editing it in both places, re-syncing each copy from the pinned upstream commit. This is low-cost in practice: it's stable code touched only on version bumps, not something maintained in-repo day to day.
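
If drift between the two copies ever becomes a concern, a small check along these lines could run in CI. This is only a sketch and is not part of this PR: the `find_drift` helper is hypothetical, and it assumes both copies are meant to be byte-identical `.py` files (it would need adjusting if the template copy uses a template suffix or placeholders). The demo runs against temporary directories and not the real repo paths.

```python
import tempfile
from pathlib import Path


def find_drift(left: Path, right: Path) -> list[str]:
    """Return relative paths of .py files that are missing or differ between two trees."""
    left_files = {p.relative_to(left).as_posix() for p in left.rglob("*.py")}
    right_files = {p.relative_to(right).as_posix() for p in right.rglob("*.py")}
    drift = set(left_files ^ right_files)  # present on one side only
    for rel in left_files & right_files:
        if (left / rel).read_bytes() != (right / rel).read_bytes():
            drift.add(rel)  # present on both sides with different content
    return sorted(drift)


# Demo on throwaway directories standing in for the example and template copies.
with tempfile.TemporaryDirectory() as tmp:
    example_copy = Path(tmp) / "example"
    template_copy = Path(tmp) / "template"
    example_copy.mkdir()
    template_copy.mkdir()
    (example_copy / "core.py").write_text("x = 1\n")
    (template_copy / "core.py").write_text("x = 1\n")
    in_sync = find_drift(example_copy, template_copy)

    (template_copy / "core.py").write_text("x = 2\n")
    (example_copy / "extra.py").write_text("")
    out_of_sync = find_drift(example_copy, template_copy)

print(in_sync, out_of_sync)
```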

TODOs

  • Manual testing

Runs a dbt project on Databricks as a Workflow with one task per dbt object
(model/seed/snapshot/test) on serverless. The job is generated at deploy time
from the dbt manifest via the PyDABs `load_resources` hook, which calls the
bundled databricks-dbt-factory core -- no per-model YAML is checked in.

The core under src/databricks_dbt_factory/ is adapted from
mwojtyczka/databricks-dbt-factory@e767a9d (v0.2.1, MIT), reformatted to this
repo's style; see NOTICE for attribution. The only integration code is
resources/__init__.py. A committed target/manifest.json lets the bundle deploy
out of the box; `make manifest` regenerates it. The README lists the benefits,
includes an end-to-end Mermaid diagram (greenfield + existing-project paths),
and covers migrating an existing dbt project.

Verified end-to-end on a serverless SQL warehouse: bundle deploy + run
completes SUCCESS with all model and test tasks passing. Passes
`ruff format --check`.

Co-authored-by: Isaac

`databricks bundle init` template that scaffolds a dbt project wired to the
PyDABs load_resources hook, derived from the contrib/dbt_factory example.
Prompts expose the project name, catalog/schema, warehouse HTTP path, and the
factory options (bundle_tests, environment_key, extra_dbt_command_options).
The extra-options value is rendered with printf %q so quoted dbt args (e.g.
--vars) stay valid Python. The README lists the benefits and includes an
end-to-end Mermaid diagram of the template flow, including the optional
existing-project migration path. The manifest is generated post-init via
`make manifest`.
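
To see why the quoting step matters, the following sketch renders a quoted dbt option into a line of Python source in two ways. It uses `json.dumps` as a Python stand-in for Go's `printf "%q"` (the two agree for typical ASCII input but are not identical in general), and the `EXTRA_DBT_COMMAND_OPTIONS` variable name is an assumption for illustration, not the name used in the template.

```python
import ast
import json

# A dbt option value containing both single and double quotes.
extra = """--vars '{"run_date": "2026-07-02"}'"""

# Naive interpolation: the inner double quotes end the string literal early.
naive = f'EXTRA_DBT_COMMAND_OPTIONS = "{extra}"'

# Quoted rendering: inner quotes are escaped, so the literal stays intact.
quoted = f"EXTRA_DBT_COMMAND_OPTIONS = {json.dumps(extra)}"


def is_valid_python(source: str) -> bool:
    """Return True if the rendered line parses as Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


# Evaluate the quoted rendering to confirm the value round-trips unchanged.
namespace = {}
exec(quoted, namespace)
round_tripped = namespace["EXTRA_DBT_COMMAND_OPTIONS"]

print(is_valid_python(naive), is_valid_python(quoted))
```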

Co-authored-by: Isaac