
Kmonte/custom launcher validator#605

Draft
kmontemayor2-sc wants to merge 9 commits into main from kmonte/custom-launcher-validator

Conversation

@kmontemayor2-sc
Collaborator

Scope of work done

Where is the documentation for this feature?: N/A

Did you add automated tests or write a test plan?

Updated Changelog.md? NO

Ready for code review?: NO

kmontemayor and others added 9 commits April 17, 2026 01:56
Adds `CustomResourceConfig { string launcher_fn; map<string,string>
launcher_args; }` as a new oneof member (field 5) on both
TrainerResourceConfig and InferencerResourceConfig. Also surfaces the
new oneof in the GiglResourceConfigPbWrapper so callers can retrieve
it alongside the existing VertexAi/Local/KFP/Dataflow variants.
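As a rough Python stand-in for the new oneof, the "at most one variant set" shape can be sketched like this (the dataclasses and the `active_config` helper are illustrative, not the generated proto API or the real wrapper):

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Union

@dataclass
class VertexAiResourceConfig:
    # Illustrative stand-in for the existing VertexAi variant.
    machine_type: str = "n1-standard-4"

@dataclass
class CustomResourceConfig:
    # Dotted path to a user-supplied launcher function, e.g. "my_pkg.launch".
    launcher_fn: str = ""
    # Opaque key/value args forwarded to the launcher untouched.
    launcher_args: Dict[str, str] = field(default_factory=dict)

@dataclass
class TrainerResourceConfig:
    # In the real proto these are members of a single oneof; here we model
    # the exclusivity with optional fields and a runtime check.
    vertex_ai: Optional[VertexAiResourceConfig] = None
    custom: Optional[CustomResourceConfig] = None

    def active_config(self) -> Union[VertexAiResourceConfig, CustomResourceConfig]:
        set_fields = [c for c in (self.vertex_ai, self.custom) if c is not None]
        if len(set_fields) != 1:
            raise ValueError("exactly one resource config must be set")
        return set_fields[0]
```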

Also widens the Union parameter of
resource_config_checks._validate_machine_config and the trainer
annotation in check_if_trainer_resource_config_valid to include the
new type, with a trivial launcher_fn-truthy branch; full semantic
validation (e.g. import-resolvability, dry-run) is intentionally
deferred to the follow-up validator PR.
Adds gigl/src/common/custom_launcher.py with launch_custom(), which
resolves CustomResourceConfig.launcher_fn via import_obj and invokes
it with the standard trainer/inferencer kwargs plus the opaque
launcher_args map.
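A minimal sketch of the resolve-and-invoke shape described above, assuming `import_obj` behaves like a plain importlib lookup (the real helper and the real `launch_custom` signature may differ):

```python
import importlib
from typing import Any, Callable, Dict

def import_obj(dotted_path: str) -> Any:
    # Resolve "package.module.attr" to the attribute object. A plain
    # importlib sketch, not the project's actual import_obj.
    module_path, _, attr = dotted_path.rpartition(".")
    return getattr(importlib.import_module(module_path), attr)

def launch_custom(launcher_fn: str, launcher_args: Dict[str, str], **kwargs: Any) -> Any:
    # Resolve the user-supplied launcher and invoke it with the standard
    # kwargs plus the opaque launcher_args map.
    fn: Callable[..., Any] = import_obj(launcher_fn)
    return fn(launcher_args=launcher_args, **kwargs)
```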

Wires CustomResourceConfig isinstance branches into the existing
dispatch in glt_trainer.py (__execute_VAI_training + run()) and
glt_inferencer.py (__execute_VAI_inference + run()). V1 trainer and
v1 gnn_inferencer remain Vertex-only.
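The dispatch shape can be sketched as follows (the classes and return values are illustrative stand-ins for the glt_trainer/glt_inferencer internals):

```python
class VertexAiResourceConfig:  # stand-in for the existing variant
    pass

class CustomResourceConfig:  # stand-in for the new variant
    launcher_fn = "user_pkg.launch"

def run(resource_config) -> str:
    # New branch first: a custom config hands off to the pluggable launcher.
    if isinstance(resource_config, CustomResourceConfig):
        return "launch_custom"
    # Existing Vertex AI path is untouched.
    if isinstance(resource_config, VertexAiResourceConfig):
        return "execute_vai_training"
    raise ValueError(f"unsupported resource config: {type(resource_config).__name__}")
```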

Also adds gigl/env/runtime.py with is_ray_runtime()/get_runtime_env()
for callers that need to branch on execution environment.
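One plausible sketch of such helpers, assuming Ray initialization is the signal (the real gigl/env/runtime.py may branch on a different indicator entirely):

```python
def is_ray_runtime() -> bool:
    # Hedged heuristic: treat the process as a Ray runtime if Ray is
    # importable and has been initialized in this process.
    try:
        import ray  # type: ignore
    except ImportError:
        return False
    return ray.is_initialized()

def get_runtime_env() -> str:
    # Collapse the check into a simple label callers can branch on.
    return "ray" if is_ray_runtime() else "local"
```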

Step 2 of 3 in the upstream series (A: proto, B: dispatch, C: validator).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends the resource-config validator to handle CustomResourceConfig:
- check_if_trainer_resource_config_valid / check_if_inferencer_resource_
  config_valid short-circuit for custom configs (launcher-pluggable; no
  machine shape to validate).
- Reverts the minimal stub from PR #A inside _validate_machine_config
  — that function's "unrecognized config" else-branch is restored as
  the contract demands.
- New check_if_custom_resource_config_dry_run_valid helper invokes
  launch_custom(..., is_dry_run=True) (lazy import keeps
  assert_yaml_configs_parse import-free).
- New --check_custom_launcher_dry_run CLI flag on config_validator.
- New check_custom_resource_config_requires_glt_backend compatibility
  check raises when CustomResourceConfig is paired with a task config
  that has should_use_glt_backend=False (v1 dispatchers don't consult
  the custom oneof).
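The compatibility check in the last bullet can be sketched as follows (both config classes are illustrative stand-ins for the real proto/task-config types):

```python
from dataclasses import dataclass

@dataclass
class CustomResourceConfig:  # illustrative stand-in
    launcher_fn: str = "user_pkg.launch"

@dataclass
class TaskConfig:  # illustrative stand-in
    should_use_glt_backend: bool = True

def check_custom_resource_config_requires_glt_backend(resource_config, task_config) -> None:
    # v1 dispatchers never consult the custom oneof, so a custom config
    # paired with should_use_glt_backend=False would be silently ignored;
    # fail loudly instead.
    if isinstance(resource_config, CustomResourceConfig) and not task_config.should_use_glt_backend:
        raise ValueError(
            "CustomResourceConfig requires should_use_glt_backend=True; "
            "v1 dispatchers do not consult the custom oneof"
        )
```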

Step 3 of 3 in the upstream series (A: proto, B: dispatch, C: validator).
Introduces a generic, repeatable --env_vars KEY=VALUE flag on
gigl.orchestration.kubeflow.runner that bakes environment variables into
every GiGL-owned container at compile time via PipelineTask.set_env_variable.
Applied uniformly across all SPECED_COMPONENTS plus the GLT eligibility
check and log_metrics_to_ui tasks; the managed VertexNotificationEmailOp
exit handler is intentionally excluded.

The flag is rejected in combination with --action=run_no_compile to prevent
a silent UX failure (envs are baked at compile time, so the flag would do
nothing in that mode).
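A sketch of a repeatable KEY=VALUE flag plus the run_no_compile rejection, using plain argparse (the runner's actual flag handling and action names are assumptions here):

```python
import argparse
from typing import Tuple

def _parse_env_var(raw: str) -> Tuple[str, str]:
    # Accept KEY=VALUE; the VALUE part may itself contain '=' characters.
    key, sep, value = raw.partition("=")
    if not sep or not key:
        raise argparse.ArgumentTypeError(f"expected KEY=VALUE, got {raw!r}")
    return key, value

parser = argparse.ArgumentParser()
parser.add_argument("--action", default="compile")
parser.add_argument("--env_vars", action="append", type=_parse_env_var,
                    metavar="KEY=VALUE",
                    help="Repeatable; baked into containers at compile time.")

def validate(args: argparse.Namespace) -> None:
    # Env vars are applied at compile time, so they cannot take effect in
    # run_no_compile mode; reject rather than silently ignore them.
    if args.env_vars and args.action == "run_no_compile":
        raise ValueError("--env_vars has no effect with --action=run_no_compile")
```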

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ThreadPoolExecutor(max_workers=0) raises ValueError. The previous code
passed len(node_data_references) / len(edge_data_references) directly,
which crashes when a preprocessor returns empty preprocessing-spec
dicts (a legitimate use case for an end-to-end harness).
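One way to guard this, with `pool_size` as a hypothetical helper rather than the project's actual fix:

```python
from concurrent.futures import ThreadPoolExecutor

def pool_size(num_references: int, cap: int = 32) -> int:
    # ThreadPoolExecutor requires max_workers > 0; clamp the reference count
    # into [1, cap] rather than passing len(...) through unchecked.
    return max(1, min(cap, num_references))

# Passing 0 directly reproduces the crash described above.
try:
    ThreadPoolExecutor(max_workers=0)
except ValueError:
    pass  # "max_workers must be greater than 0"
```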

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same root cause as the enumerate fix: ThreadPoolExecutor(max_workers=0)
raises ValueError. When both node_ref_to_preprocessing_spec and
edge_ref_to_preprocessing_spec are empty, num_dataflow_jobs is 0 and
the executor blows up in __init__. Early-return an empty
PreprocessedMetadataReferences in that case — no Dataflow work to do.
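The early-return shape can be sketched with toy specs standing in for the real preprocessing-spec dicts and Dataflow jobs (function name and return type are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict

def build_preprocessed_metadata(node_specs: Dict[str, str],
                                edge_specs: Dict[str, str]) -> Dict[str, str]:
    num_dataflow_jobs = len(node_specs) + len(edge_specs)
    if num_dataflow_jobs == 0:
        # Early return: no Dataflow work to do, and constructing
        # ThreadPoolExecutor(max_workers=0) would raise ValueError in __init__.
        return {}
    with ThreadPoolExecutor(max_workers=num_dataflow_jobs) as pool:
        # Toy stand-in for launching one Dataflow job per spec.
        all_specs = {**node_specs, **edge_specs}
        return dict(pool.map(lambda kv: (kv[0], f"processed:{kv[1]}"),
                             all_specs.items()))
```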

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
