Add dbt-factory bundle template and example to contrib #163
Open
mwojtyczka wants to merge 2 commits into
Runs a dbt project on Databricks as a Workflow with one task per dbt object (model/seed/snapshot/test) on serverless. The job is generated at deploy time from the dbt manifest via the PyDABs `load_resources` hook, which calls the bundled databricks-dbt-factory core -- no per-model YAML is checked in. The core under src/databricks_dbt_factory/ is adapted from mwojtyczka/databricks-dbt-factory@e767a9d (v0.2.1, MIT), reformatted to this repo's style; see NOTICE for attribution. The only integration code is resources/__init__.py. A committed target/manifest.json lets the bundle deploy out of the box; `make manifest` regenerates it. The README lists the benefits, includes an end-to-end Mermaid diagram (greenfield + existing-project paths), and covers migrating an existing dbt project. Verified end-to-end on a serverless SQL warehouse: bundle deploy + run completes SUCCESS with all model and test tasks passing. Passes `ruff format --check`. Co-authored-by: Isaac
`databricks bundle init` template that scaffolds a dbt project wired to the PyDABs load_resources hook, derived from the contrib/dbt_factory example. Prompts expose the project name, catalog/schema, warehouse HTTP path, and the factory options (bundle_tests, environment_key, extra_dbt_command_options). The extra-options value is rendered with printf %q so quoted dbt args (e.g. --vars) stay valid Python. The README lists the benefits and includes an end-to-end Mermaid diagram of the template flow, including the optional existing-project migration path. The manifest is generated post-init via `make manifest`. Co-authored-by: Isaac
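The quoting concern in the last commit message can be illustrated with a small sketch. This is not the template's code: the template uses Go-template `printf %q`, and the Python below only stands in for it, with `repr()` playing the role of `%q`. The function names and the sample `--vars` value are hypothetical.

```python
def render_naive(value: str) -> str:
    # Pastes the raw value between double quotes, as an unquoted template would.
    return f'extra_dbt_command_options = "{value}"'


def render_quoted(value: str) -> str:
    # repr() escapes embedded quotes, standing in for the template's printf %q.
    return f"extra_dbt_command_options = {value!r}"


def is_valid_python(source: str) -> bool:
    # True if the rendered line compiles as Python source.
    try:
        compile(source, "<rendered>", "exec")
    except SyntaxError:
        return False
    return True


# Hypothetical prompt answer containing both single and double quotes.
options = '--vars \'{"run_date": "2024-01-01"}\''

naive_ok = is_valid_python(render_naive(options))
quoted_ok = is_valid_python(render_quoted(options))
```

Under these assumptions the naive rendering fails to compile because the embedded double quotes end the string literal early, while the quoted rendering compiles and evaluates back to the original value.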
What this adds
By default, running dbt on Databricks executes your whole dbt project as a single, opaque Workflow task — one green or red box. You can't see which model failed, you can't rerun just the failed models, and independent models don't run in parallel.
This PR provides a template that turns a dbt project into a Databricks Workflow with one task per dbt object
(model, seed, snapshot, test), with dependencies wired to match your dbt DAG. That gives you:
dbt's dependencies pre-cached in the serverless environment, avoiding a cold start on every task.
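To make "one task per dbt object, with dependencies wired to match your dbt DAG" concrete, here is a minimal sketch of the mapping. It is not the `databricks-dbt-factory` implementation: it uses plain dictionaries instead of PyDABs task objects, the `task_key` naming scheme is hypothetical, and the inline manifest is a made-up three-node example that keeps only the `resource_type` and `depends_on.nodes` fields from a dbt `manifest.json`.

```python
import json

# dbt resource types that become one task each.
RUNNABLE_TYPES = {"model", "seed", "snapshot", "test"}


def task_key(unique_id: str) -> str:
    # Hypothetical naming scheme: turn a dbt unique_id into a key without dots.
    return unique_id.replace(".", "__")


def tasks_from_manifest(manifest: dict) -> list[dict]:
    # Keep only the nodes that should become tasks.
    nodes = {
        uid: node
        for uid, node in manifest.get("nodes", {}).items()
        if node.get("resource_type") in RUNNABLE_TYPES
    }
    tasks = []
    for uid, node in sorted(nodes.items()):
        upstream = node.get("depends_on", {}).get("nodes", [])
        tasks.append(
            {
                "task_key": task_key(uid),
                # Skip upstream ids that are not tasks themselves (e.g. sources).
                "depends_on": [
                    {"task_key": task_key(dep)} for dep in upstream if dep in nodes
                ],
            }
        )
    return tasks


# Made-up manifest fragment for illustration only.
EXAMPLE_MANIFEST = json.loads(
    """
    {
      "nodes": {
        "seed.shop.raw_orders": {
          "resource_type": "seed",
          "depends_on": {"nodes": []}
        },
        "model.shop.orders": {
          "resource_type": "model",
          "depends_on": {"nodes": ["seed.shop.raw_orders", "source.shop.crm.customers"]}
        },
        "test.shop.not_null_orders_id": {
          "resource_type": "test",
          "depends_on": {"nodes": ["model.shop.orders"]}
        }
      }
    }
    """
)

example_tasks = tasks_from_manifest(EXAMPLE_MANIFEST)
```

In this example the `orders` model task depends only on the seed task, because the source it references is not a runnable node, and the test task depends on the model task.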
It ships two things:
- `contrib/dbt_factory/`: a complete, deployable example you can clone and run.
- `contrib/templates/dbt-factory/`: a `databricks bundle init` template that scaffolds a new dbt project already wired up this way (bring your own models, or migrate an existing project).

How it works
flowchart TD
  subgraph setup["One-time setup"]
    T["dbt-factory bundle template"] -->|databricks bundle init| B["Scaffolded project:<br/>dbt project + PyDABs hook + factory code"]
    X["Existing dbt project<br/>(optional)"] -.->|move models/seeds/... into src/| B
  end
  subgraph deploy["Every deploy"]
    C["make manifest<br/>(dbt parse)"] --> D["target/manifest.json"]
    D --> E["databricks bundle deploy"]
    E --> F["PyDABs load_resources reads the<br/>manifest and generates the job"]
  end
  subgraph runtime["At run time — serverless"]
    G["Databricks Workflow:<br/>one task per model / seed / snapshot / test"] --> H["Each task triggers dbt<br/>via the runner notebook"]
    H --> I[("SQL warehouse")]
  end
  B --> C
  F --> G
  classDef optional stroke:#999,stroke-dasharray:5 4,color:#888;
  class X optional;

The job is not written as YAML. It's generated at `databricks bundle deploy` time from the dbt manifest via the PyDABs `python.resources` hook: `resources/__init__.py`'s `load_resources` reads `target/manifest.json` and builds one Databricks task per dbt node, wiring up dependencies. No per-model YAML is checked in; the task graph tracks the dbt DAG automatically.

The generation logic comes from the databricks-dbt-factory library, whose source is included under `src/databricks_dbt_factory/` (adapted from commit `e767a9d`, v0.2.1, MIT, and reformatted to this repo's style; see `NOTICE` for attribution). The only integration code is the small `resources/__init__.py`.

Design choices
- `dbtRunner`) for the fastest task start times.
- `bundle_tests`, `environment_key`, `extra_dbt_command_options`. Target / warehouse / catalog / schema live in `dbt_profiles/profiles.yml`.
- `target/manifest.json` so it deploys out of the box; `make manifest` regenerates it. The README covers migrating an existing dbt project (init, then move your dbt files into the generated project).

Testing
- `databricks bundle deploy` + `databricks bundle run` completes SUCCESS with all model and test tasks passing.
- `make test`) passes, including an integration test that exercises `load_resources` against the committed manifest with no workspace.
- `databricks bundle init` on the template renders cleanly; the generated project's `resources/__init__.py` compiles and its bundle validates.
- `ruff format --check` (the repo's `fmt` CI check).

Note for reviewers
The `databricks_dbt_factory/` core (and its tests) appear twice: once in the example (`contrib/dbt_factory/`) and once in the template (`contrib/templates/dbt-factory/template/{{.project_name}}/`). This is intentional, not an oversight:
- `databricks bundle init` can only stamp out files that live under the template's `template/` directory; a template file can't reference code outside it. So both need their own copy.
- `contrib/templates/data-engineering` + `contrib/data_engineering` pairing, which likewise duplicates its shared files (e.g. `scripts/`, `conftest.py`) between template and example.
- The code is owned by this repo (bundle-examples). The `NOTICE` files credit the original `databricks-dbt-factory` source for attribution; it's been reformatted to the repo's style and passes `ruff format --check`.

Slight downside: a future update to the core means editing it in both places (re-synced from the pinned upstream commit). Low-cost in practice: it's stable code touched only on version bumps, not something maintained in-repo day to day.
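If the two copies ever need guarding against drift, a check along these lines could be run locally or in CI. This is only a sketch and not part of the PR: `find_drift` is a hypothetical helper, it assumes both copies store the core as plain `.py` files with identical relative paths, and it would miss template files that carry a different suffix.

```python
import filecmp
from pathlib import Path


def relative_py_files(root: Path) -> set[str]:
    # All Python files under root, as POSIX-style paths relative to root.
    return {p.relative_to(root).as_posix() for p in root.rglob("*.py")}


def find_drift(left: Path, right: Path) -> list[str]:
    """Describe every file that is missing on one side or differs in content."""
    left_files = relative_py_files(left)
    right_files = relative_py_files(right)

    drift = [f"{name}: only in {left}" for name in sorted(left_files - right_files)]
    drift += [f"{name}: only in {right}" for name in sorted(right_files - left_files)]
    drift += [
        f"{name}: contents differ"
        for name in sorted(left_files & right_files)
        # shallow=False compares file contents instead of trusting os.stat() data.
        if not filecmp.cmp(left / name, right / name, shallow=False)
    ]
    return drift
```

Called with the example's `src/databricks_dbt_factory/` directory and the matching directory under the template's `template/{{.project_name}}/` tree (the exact layout below that point is an assumption), an empty list would mean the copies are in sync.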
TODOs