
Cld2labs/airgap#115

Open
HarikaDev296 wants to merge 29 commits into opea-project:dell-airgap from cld2labs:cld2labs/airgap

@HarikaDev296

Contributor

Overview:

This PR adds support for deploying Enterprise Inference in fully air-gapped environments on Dell single-node systems. It uses JFrog Artifactory as a local mirror for container images, Helm charts, OS packages, and HuggingFace models, so no internet access is needed during deployment.

What's Included:

  • JFrog Artifactory Setup

    Added scripts under third_party/Dell/air-gap/jfrog-setup/ to install, configure, populate, and uninstall JFrog Artifactory. The setup script handles pulling container images, Helm charts, pip packages, and AI models into the local registry. A README walks through the full setup and teardown process.

  • Air-Gap Deployment Guides

    Added air-gap.md under third_party/Dell/air-gap/EI/single-node/ covering the full deployment flow: internet-connected preparation, offline transfer, and the final air-gapped install. It includes both HuggingFace and JFrog model deployment options. Also added air-gap-troubleshooting.md covering common failure scenarios such as option 5 reporting success without removing anything, step 3f apt caching issues, and JFrog connectivity problems.

Core Fixes and Enhancements:

  • Moved airgap registry mirrors and version pins from all.yml to offline.yml for cleaner separation
  • Added an internet reachability check that exits early if airgap_enabled=yes but the node can still reach the internet
  • Fixed several internet leaks across playbooks and shell scripts that were causing pip and apt to attempt outbound connections during air-gapped installs
  • Fixed the undeploy path missing airgap vars which caused pip installs to reach out to the internet
  • Fixed the list-models task failing when no vllm models are deployed
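  • Fixed step 3f apt caching to use apt-get download instead of apt-get install --download-only --reinstall, so JFrog reliably caches packages that are already installed on the node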
  • Added JFrog model download support for HuggingFace CPU deployments
  • Added a pre-check playbook that exits early if the required model is not present in JFrog
  • Pinned the JFrog installer to v7.111.8 to avoid a known regression in v7.146.10 related to a missing db5.3-util bundle

Harika and others added 29 commits June 16, 2026 08:55
Enables full EI stack deployment (Kubernetes + LLM serving + GenAI Gateway)
on internet-blocked machines by routing all dependencies through a local
JFrog Artifactory instance.

Changes:
- Add airgap_enabled / jfrog_url / jfrog_username / jfrog_password vars
- Dual-task pattern in all playbooks (internet vs JFrog path)
- setup-env.sh: pip, kubespray, ansible collections, apt from JFrog
- prereq-check.sh: connectivity check against JFrog ping endpoint
- offline.yml: Kubespray binary URLs redirected to JFrog
- containerd mirror config for all 5 registries via JFrog
- Kubespray hosts.toml.j2 patched to not write skip_verify unless true
- inference-tools role: helm, pip, jq installs all JFrog-aware
- nri_cpu_balloons role: helm repo and airgap vars wired up
- JFrog setup script + README for offline bundle preparation
- Air-gap troubleshooting and deployment documentation

Signed-off-by: Harika <codewith3@gmail.com>
Signed-off-by: Harika <harika.devulapally@cloud2labs.com>
Signed-off-by: Harika <codewith3@gmail.com>
Signed-off-by: Harika <harika.devulapally@cloud2labs.com>
- Remove step 3i (Meta-Llama-3.1-8B-Instruct)
- Renumber Llama-3.2-3B-Instruct as step 3i
- Add step 3j for Qwen/Qwen3.5-0.8B
- Add step 3k for Qwen/Qwen3.5-4B

Signed-off-by: Harika <codewith3@gmail.com>
Signed-off-by: Harika <harika.devulapally@cloud2labs.com>
- Replace Llama-3.1-8B with Qwen3.5-0.8B and Qwen3.5-4B
- Update HuggingFace credentials section with model table
- Update disk space requirement note
- Update --hf-token flag description and step-by-step table

Signed-off-by: Harika <codewith3@gmail.com>
Signed-off-by: Harika <harika.devulapally@cloud2labs.com>
- Rename Qwen3.5-0.8B -> Qwen3-0.6B and Qwen3.5-4B -> Qwen3-4B throughout
  (script, README, step headers, HuggingFace repo IDs, JFrog folder names)
- Fix SKIP_STEPS loop in should_run: drop erroneous `:-` default expansion
  that caused an empty-string iteration when no steps were skipped

Signed-off-by: Harika <codewith3@gmail.com>
Signed-off-by: Harika <harika.devulapally@cloud2labs.com>
Signed-off-by: Harika <codewith3@gmail.com>
Signed-off-by: Harika <harika.devulapally@cloud2labs.com>
…install

apt-get install --download-only --reinstall can use apt's in-memory package
state and skip the network fetch entirely for already-installed packages like
python3-pip, so JFrog never caches the .deb. apt-get download always fetches
from the configured sources regardless of install state, reliably triggering
the JFrog remote proxy to cache the package.

Signed-off-by: Harika <codewith3@gmail.com>
Signed-off-by: Harika <harika.devulapally@cloud2labs.com>
grep returns rc=1 when it finds no matches, which Ansible treats as
a task failure. Allow rc=0 (matches found) and rc=1 (no matches) as
both valid; only fail on real errors like helm not being available.

Signed-off-by: Harika <codewith3@gmail.com>
Signed-off-by: Harika <harika.devulapally@cloud2labs.com>
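The exit-status rule in the commit above can be illustrated with a minimal Python sketch. It is not the code from the PR: the function name `matches_from_grep` is hypothetical, and the actual fix lives in an Ansible task (where such a rule is typically expressed with `failed_when`, though the exact condition is not shown here).

```python
def matches_from_grep(returncode: int, stdout: str, stderr: str = "") -> list[str]:
    """Interpret a grep result the way the commit describes.

    rc=0 -> matches found, return them
    rc=1 -> no matches, which is a valid empty result
    other -> a real error (for example, the upstream command was missing)
    """
    if returncode == 0:
        return stdout.splitlines()
    if returncode == 1:
        return []
    raise RuntimeError(f"grep failed (rc={returncode}): {stderr.strip()}")


# Hypothetical usage: no vllm models deployed, so grep exits with 1.
print(matches_from_grep(1, ""))
```

Treating only rc values outside {0, 1} as failures keeps genuine errors visible while letting an empty model list pass.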
….yml

all.yml is copied for every deployment (airgap and non-airgap). Having
containerd_registries_mirrors with JFROG_HOST placeholders in all.yml
causes non-airgap deployments to fail — containerd tries to resolve
the literal string JFROG_HOST as a DNS name and image pulls fail.

offline.yml is only copied when airgap_enabled=yes, and setup-env.sh
substitutes JFROG_HOST with the real JFrog IP before Kubespray runs.
Moving mirrors, calico_version, and coredns_version there ensures:
- airgap=no: no registry mirrors configured, internet pulls work
- airgap=yes: mirrors point to JFrog with real IP substituted

Signed-off-by: Harika <codewith3@gmail.com>
Signed-off-by: Harika <harika.devulapally@cloud2labs.com>
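To make the placeholder issue in the commit above concrete, here is a small Python sketch of the substitution step. It assumes a plain string replacement; the real work is done by setup-env.sh, whose implementation is not shown here. The function names, the sample mirror line, and the port are hypothetical.

```python
PLACEHOLDER = "JFROG_HOST"


def substitute_jfrog_host(config_text: str, jfrog_host: str) -> str:
    """Replace every JFROG_HOST placeholder with the real JFrog address."""
    if not jfrog_host:
        raise ValueError("jfrog_host must be set before substitution")
    return config_text.replace(PLACEHOLDER, jfrog_host)


def has_unresolved_placeholder(config_text: str) -> bool:
    """True if the literal placeholder would still reach containerd."""
    return PLACEHOLDER in config_text


# Hypothetical mirror entry; the URL layout and port are assumptions.
sample = "mirror: http://JFROG_HOST:8082/docker-remote"
resolved = substitute_jfrog_host(sample, "10.0.0.5")
print(resolved)
print(has_unresolved_placeholder(resolved))
```

A check like `has_unresolved_placeholder` is one way to catch the failure mode described above, where a file that never goes through substitution leaves containerd trying to resolve the literal string.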
If the system still has internet connectivity while airgap mode is
enabled, Docker images not cached in JFrog may silently fall through
to the internet, breaking the airgap guarantee. Detect this condition
early and exit with a clear message directing the user to disable
internet access before proceeding.

Signed-off-by: Harika <codewith3@gmail.com>
Signed-off-by: Harika <harika.devulapally@cloud2labs.com>
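The early-exit behaviour described in the commit above could look roughly like the following Python sketch. It is an illustration only: the probe target, port, and timeout are assumptions, and the PR implements the check in its own scripts and playbooks rather than in Python.

```python
import socket
from typing import Callable


def can_reach(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Best-effort TCP probe; any socket error counts as unreachable."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_airgap(airgap_enabled: bool, probe: Callable[[], bool]) -> None:
    """Exit early if airgap mode is on but the internet is still reachable."""
    if airgap_enabled and probe():
        raise SystemExit(
            "airgap_enabled=yes but the internet is reachable; "
            "disable internet access on this node and retry"
        )


# Hypothetical usage; the probe host is an assumption, not taken from the PR.
# check_airgap(True, lambda: can_reach("registry-1.docker.io"))
```

Passing the probe in as a callable keeps the decision logic testable without real network access.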
…ation

- Clarify VM2 network requirement: internet must be disabled before running
  EI with airgap_enabled=yes; deployment now exits with an error if not
- Update step 3f description to reflect apt-get download fix for reliable
  python3-pip caching in JFrog
- Add troubleshooting entry for the internet connectivity exit with
  instructions on how to disable internet access on VM2

Signed-off-by: Harika <codewith3@gmail.com>
Signed-off-by: Harika <harika.devulapally@cloud2labs.com>
Clearly document which models have been tested and validated end-to-end
in airgap mode: Llama-3.2-3B-Instruct, Qwen3-0.6B, Qwen3-1.7B, and
Qwen3-4B. Includes a note that other models are not supported without
manual JFrog uploads and have not been validated.

Signed-off-by: Harika <codewith3@gmail.com>
Signed-off-by: Harika <harika.devulapally@cloud2labs.com>
JFrog's bundled installer expects db5.3-util to be present on the system
but the package was missing from our prerequisites list, causing install.sh
to fail when trying to install it from the bundled .deb.

Signed-off-by: Harika <codewith3@gmail.com>
Signed-off-by: Harika <harika.devulapally@cloud2labs.com>
…sue in 7.146.10

Signed-off-by: Harika <codewith3@gmail.com>
Signed-off-by: Harika <harika.devulapally@cloud2labs.com>
Signed-off-by: Harika <codewith3@gmail.com>
Signed-off-by: Harika <harika.devulapally@cloud2labs.com>
Signed-off-by: Harika <harika.devulapally@cloud2labs.com>
Signed-off-by: Harika <harika.devulapally@cloud2labs.com>
Signed-off-by: Harika <harika.devulapally@cloud2labs.com>
Co-authored-by: alexsin368 <109180236+alexsin368@users.noreply.github.com>
Signed-off-by: Harika <harika.devulapally@cloud2labs.com>
Signed-off-by: Harika <harika.devulapally@cloud2labs.com>
…o air-gap.md

Signed-off-by: Harika <harika.devulapally@cloud2labs.com>
Signed-off-by: Harika <harika.devulapally@cloud2labs.com>
Signed-off-by: Harika <harika.devulapally@cloud2labs.com>
Signed-off-by: Harika <harika.devulapally@cloud2labs.com>
Documents the bug where entering a deployment name with the -cpu suffix
causes the remove-model script to silently do nothing and still report success.

Signed-off-by: Harika <harika.devulapally@cloud2labs.com>
…ments

- Set HF_HUB_OFFLINE=1 and TRANSFORMERS_OFFLINE=1 for all CPU model
  deployments in airgap mode to prevent vLLM from contacting HuggingFace
- Fix vllm configmap template to merge per-model and default configMapValues
  so HF_HUB_OFFLINE correctly reaches containers with per-model xeon configs
- Add JFrog download tasks for Llama-3.2-3B and Qwen3-1.7B validated models;
  use local hostPath for LLM_MODEL_ID so vLLM loads weights without HF hub
- Guard helm repo update calls across ingress, keycloak, genai-gateway, NRI,
  observability, ceph, istio, and bastion playbooks to prevent internet
  connection attempts in airgap mode
- Guard internet binary downloads in setup-bastion.yml

Signed-off-by: Harika <harika.devulapally@cloud2labs.com>
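The configmap fix in the commit above comes down to merging two maps instead of picking one. A minimal Python sketch of the difference follows, assuming that per-model values take precedence over defaults (the commit does not state the precedence). The real template is a Helm template, and the per-model key shown is only an illustrative example.

```python
def select_values(defaults: dict[str, str], per_model: dict[str, str]) -> dict[str, str]:
    """Old behaviour (assumed): a per-model map replaces the defaults entirely."""
    return dict(per_model) if per_model else dict(defaults)


def merge_values(defaults: dict[str, str], per_model: dict[str, str]) -> dict[str, str]:
    """New behaviour: start from the defaults, then layer per-model values on top."""
    return {**defaults, **per_model}


defaults = {"HF_HUB_OFFLINE": "1", "TRANSFORMERS_OFFLINE": "1"}
per_model = {"VLLM_CPU_KVCACHE_SPACE": "40"}  # illustrative per-model setting

print(select_values(defaults, per_model))  # offline flags are lost
print(merge_values(defaults, per_model))   # offline flags survive
```

With selection, any model that carries its own config silently loses `HF_HUB_OFFLINE`; with merging, the offline flags reach every container unless a model explicitly overrides them.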
uninstall-model.sh did not pass airgap_enabled, jfrog_url, jfrog_username,
or jfrog_password to the Ansible playbook. This caused the inference-tools
role to run the internet pip install task instead of the airgap JFrog path,
failing with 404 errors on the JFrog debian repo.

Signed-off-by: Harika <harika.devulapally@cloud2labs.com>
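As a sketch of what the commit above changes, the following Python builds an `ansible-playbook` command line that forwards the four airgap variables as extra vars. The variable names come from the commit message and `-e` is the standard extra-vars flag; the playbook file name and the helper function are hypothetical, and the actual fix is made in uninstall-model.sh.

```python
def playbook_command(
    playbook: str,
    airgap_enabled: bool,
    jfrog_url: str = "",
    jfrog_username: str = "",
    jfrog_password: str = "",
) -> list[str]:
    """Build an ansible-playbook argv that always carries the airgap vars."""
    extra_vars = {"airgap_enabled": "yes" if airgap_enabled else "no"}
    if airgap_enabled:
        extra_vars.update(
            jfrog_url=jfrog_url,
            jfrog_username=jfrog_username,
            jfrog_password=jfrog_password,
        )
    cmd = ["ansible-playbook", playbook]
    for key, value in extra_vars.items():
        cmd += ["-e", f"{key}={value}"]
    return cmd


# Hypothetical playbook name and credentials, for illustration only.
print(playbook_command("undeploy-model.yml", True, "http://10.0.0.5:8082", "admin", "secret"))
```

Note that passing a password as a command-line argument exposes it in the process list; a vars file or environment lookup may be preferable in practice.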
In airgap mode, derive JFrog folder from model ID (strip org prefix),
download weights to /opt/ei-models/<model_id>, and use local path for
LLM_MODEL_ID so vLLM loads from disk without contacting HuggingFace.

Convention: JFrog folder = model name without org prefix (e.g.
Qwen/Qwen3-4B -> Qwen3-4B), matching the naming used in jfrog-setup.sh.

Signed-off-by: Harika <harika.devulapally@cloud2labs.com>
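The naming convention in the commit above is simple enough to sketch in a few lines of Python. The helper names are hypothetical. The commit writes the download location as /opt/ei-models/<model_id> without saying whether the org prefix is kept on disk, so this sketch joins the model ID verbatim as an assumption.

```python
from pathlib import PurePosixPath

MODEL_ROOT = PurePosixPath("/opt/ei-models")


def jfrog_folder(model_id: str) -> str:
    """Strip the org prefix: 'Qwen/Qwen3-4B' -> 'Qwen3-4B'."""
    return model_id.rsplit("/", 1)[-1]


def local_model_path(model_id: str) -> PurePosixPath:
    """Local weights directory; joining the full model ID is an assumption."""
    return MODEL_ROOT / model_id


print(jfrog_folder("Qwen/Qwen3-4B"))
print(local_model_path("Qwen/Qwen3-4B"))
```

Using `rsplit` with a limit of one also handles IDs that have no org prefix, which are returned unchanged.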
Signed-off-by: Harika <harika.devulapally@cloud2labs.com>
@HarikaDev296 HarikaDev296 mentioned this pull request Jun 16, 2026