Skip to content

feat: add MLflow to observability stack#400

Open
dimakis wants to merge 8 commits into
mainfrom
feat/mlflow-observability
Open

feat: add MLflow to observability stack#400
dimakis wants to merge 8 commits into
mainfrom
feat/mlflow-observability

Conversation

@dimakis

@dimakis dimakis commented Jun 25, 2026

Copy link
Copy Markdown
Owner

Summary

  • Adds MLflow v2.22.0 container to docker-compose.yml alongside Jaeger, Loki, and Grafana
  • Port 5050 externally (avoids macOS AirPlay conflict on 5000)
  • SQLite backend + file-based artifact store via Docker volume (.mlflow-data/)
  • Updates ensure-observability.sh to verify MLflow container on startup

Context

MLflow tracks knowledge space experiments (contrastive learning, V-learning trace efficiency) from the mgmt repo's knowledge_space/ pipeline. Dual OTel export sends traces to both Jaeger (operational) and MLflow (experiment tracking).

Related: dimakis/mgmt#203

Test plan

  • podman compose up -d — mlflow container starts
  • curl http://localhost:5050/api/2.0/mlflow/experiments/list returns 200
  • ./scripts/ensure-observability.sh — shows mlflow as running
  • MLflow UI accessible at http://localhost:5050

🤖 Generated with Claude Code

Adds MLflow v2.22.0 as a persistent container alongside Jaeger, Loki,
and Grafana. Port 5050 (avoids macOS AirPlay on 5000). SQLite backend
with file-based artifact store for experiment tracking persistence.

Used by knowledge_space/ scripts in mgmt for patent validation
experiments (contrastive learning, V-learning trace efficiency).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@dimakis dimakis left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Centaur Review

Found 4 issue(s) (1 critical) (1 warning).

.gitignore

The MLflow service definition works but is missing a .gitignore entry for .mlflow-data/ (will pollute the repo), and has redundant/incorrect environment variables that should be cleaned up.

  • 🔴 bugs: .mlflow-data/ is not in .gitignore. The .jaeger-data/ volume mount is gitignored (line 14), but the analogous .mlflow-data/ directory will be tracked by git — SQLite DB, artifacts, and all. This will bloat the repo and likely cause merge conflicts. [fixable]

docker-compose.yml

The MLflow service definition works but is missing a .gitignore entry for .mlflow-data/ (will pollute the repo), and has redundant/incorrect environment variables that should be cleaned up.

  • 🟡 unsafe_assumptions (L54): The MLFLOW_BACKEND_STORE_URI env var and --backend-store-uri flag both use sqlite:///mlflow/mlflow.db (three slashes = relative path inside the container). This works because the command runs from /, making the resolved path /mlflow/mlflow.db which lands inside the mounted volume. However, the MLFLOW_BACKEND_STORE_URI env var is redundant with the CLI flag — MLflow's CLI flag takes precedence. The env var MLFLOW_ARTIFACTS_DESTINATION is also not used by mlflow server (the correct env var for the server's artifact root is MLFLOW_DEFAULT_ARTIFACT_ROOT, or the --default-artifact-root CLI flag which is already set). These orphan env vars may confuse future maintainers. [fixable]
  • 🔵 style (L56): The command line duplicates both --backend-store-uri and --default-artifact-root values that are also set (or attempted) via environment variables. Pick one mechanism — the CLI flags in command are sufficient and the environment block could be dropped entirely to avoid drift between the two. [fixable]

scripts/ensure-observability.sh

The MLflow service definition works but is missing a .gitignore entry for .mlflow-data/ (will pollute the repo), and has redundant/incorrect environment variables that should be cleaned up.

  • 🔵 missing_tests (L27): The container name pattern mitzo_${svc}_1 is a Docker Compose v1 convention. Podman Compose (and Docker Compose v2) may use mitzo-${svc}-1 (hyphens instead of underscores). This was a pre-existing issue, but adding mlflow to the loop extends its blast radius. If the verification step is silently failing for all services today, it would be worth confirming the actual container naming convention. [fixable]

Comment thread docker-compose.yml Outdated
ports:
- '5050:5000' # 5050 externally — 5000 conflicts with macOS AirPlay
environment:
MLFLOW_BACKEND_STORE_URI: 'sqlite:///mlflow/mlflow.db'

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 unsafe_assumptions: The MLFLOW_BACKEND_STORE_URI env var and --backend-store-uri flag both use sqlite:///mlflow/mlflow.db (three slashes = relative path inside the container). This works because the command runs from /, making the resolved path /mlflow/mlflow.db which lands inside the mounted volume. However, the MLFLOW_BACKEND_STORE_URI env var is redundant with the CLI flag — MLflow's CLI flag takes precedence. The env var MLFLOW_ARTIFACTS_DESTINATION is also not used by mlflow server (the correct env var for the server's artifact root is MLFLOW_DEFAULT_ARTIFACT_ROOT, or the --default-artifact-root CLI flag which is already set). These orphan env vars may confuse future maintainers. [fixable]

Comment thread docker-compose.yml Outdated
environment:
MLFLOW_BACKEND_STORE_URI: 'sqlite:///mlflow/mlflow.db'
MLFLOW_ARTIFACTS_DESTINATION: '/mlflow/artifacts'
command: mlflow server --host 0.0.0.0 --backend-store-uri sqlite:///mlflow/mlflow.db --default-artifact-root /mlflow/artifacts

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 style: The command line duplicates both --backend-store-uri and --default-artifact-root values that are also set (or attempted) via environment variables. Pick one mechanism — the CLI flags in command are sufficient and the environment block could be dropped entirely to avoid drift between the two. [fixable]

Comment thread scripts/ensure-observability.sh Outdated
# 3. Verify
for svc in jaeger loki grafana; do
for svc in jaeger loki grafana mlflow; do
container="mitzo_${svc}_1"

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 missing_tests: The container name pattern mitzo_${svc}_1 is a Docker Compose v1 convention. Podman Compose (and Docker Compose v2) may use mitzo-${svc}-1 (hyphens instead of underscores). This was a pre-existing issue, but adding mlflow to the loop extends its blast radius. If the verification step is silently failing for all services today, it would be worth confirming the actual container naming convention. [fixable]

- Add .mlflow-data/ to .gitignore (prevents SQLite + artifacts in repo)
- Remove redundant environment variables from MLflow service (CLI flags
  in command are sufficient and authoritative)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@dimakis dimakis left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Centaur Review

Found 4 issue(s) (1 warning).

scripts/ensure-observability.sh

Clean infrastructure-only addition. The main actionable item is updating the onboarding doc's observability table to include MLflow; the container name pattern in ensure-observability.sh is a pre-existing fragility worth noting.

  • 🟡 bugs (L27): Container name pattern mitzo_${svc}_1 assumes Compose V1 naming (project_service_N). Podman Compose and Docker Compose V2 use project-service-N (hyphens, not underscores). If Podman Compose uses V2-style naming, the verification loop will always report FAILED TO START for all services (including the existing ones — this is a pre-existing issue, but adding mlflow extends it). Consider using podman compose ps --format json or a service-based filter instead of hardcoded container names. [fixable]

docs/onboarding.md

Clean infrastructure-only addition. The main actionable item is updating the onboarding doc's observability table to include MLflow; the container name pattern in ensure-observability.sh is a pre-existing fragility worth noting.

  • 🔵 style: The onboarding doc's Observability section (line 540-546) lists Jaeger, Grafana, and Loki in a components table but doesn't mention MLflow. The table should include an MLflow row (http://localhost:5050, Experiment tracking). Similarly, the step at line 144 says 'Jaeger/Grafana/Loki' without MLflow. [fixable]

docker-compose.yml

Clean infrastructure-only addition. The main actionable item is updating the onboarding doc's observability table to include MLflow; the container name pattern in ensure-observability.sh is a pre-existing fragility worth noting.

  • 🔵 unsafe_assumptions (L53): The MLflow SQLite backend (sqlite:///mlflow/mlflow.db) is fine for local single-user use, but the volume mounts the entire /mlflow directory including both the DB and artifacts. If artifact storage grows large, this could consume significant disk. Not a blocker for a local dev tool, but worth noting in docs or a comment.
  • 🔵 style (L48): The existing services (jaeger, loki, grafana) form a logical group with grafana depending on loki and jaeger. The new mlflow service is independent with no depends_on or network links to the other services, which is correct. However, a one-line comment above the service (like the port comment) explaining what MLflow will be used for in Mitzo's context (e.g., experiment tracking for prompt tuning) would help future readers understand why it's in the observability stack. [fixable]

Comment thread scripts/ensure-observability.sh Outdated
# 3. Verify
for svc in jaeger loki grafana; do
for svc in jaeger loki grafana mlflow; do
container="mitzo_${svc}_1"

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 bugs: Container name pattern mitzo_${svc}_1 assumes Compose V1 naming (project_service_N). Podman Compose and Docker Compose V2 use project-service-N (hyphens, not underscores). If Podman Compose uses V2-style naming, the verification loop will always report FAILED TO START for all services (including the existing ones — this is a pre-existing issue, but adding mlflow extends it). Consider using podman compose ps --format json or a service-based filter instead of hardcoded container names. [fixable]

Comment thread docker-compose.yml Outdated
restart: always
ports:
- '5050:5000' # 5050 externally — 5000 conflicts with macOS AirPlay
command: mlflow server --host 0.0.0.0 --backend-store-uri sqlite:///mlflow/mlflow.db --default-artifact-root /mlflow/artifacts

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 unsafe_assumptions: The MLflow SQLite backend (sqlite:///mlflow/mlflow.db) is fine for local single-user use, but the volume mounts the entire /mlflow directory including both the DB and artifacts. If artifact storage grows large, this could consume significant disk. Not a blocker for a local dev tool, but worth noting in docs or a comment.

Comment thread docker-compose.yml
- loki
- jaeger

mlflow:

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 style: The existing services (jaeger, loki, grafana) form a logical group with grafana depending on loki and jaeger. The new mlflow service is independent with no depends_on or network links to the other services, which is correct. However, a one-line comment above the service (like the port comment) explaining what MLflow will be used for in Mitzo's context (e.g., experiment tracking for prompt tuning) would help future readers understand why it's in the observability stack. [fixable]

@dimakis dimakis left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Centaur Review

Found 4 issue(s) (2 warning).

CLAUDE.md

Clean infra-only addition of MLflow to the observability stack. Main gap is that CLAUDE.md — which serves as the project's architecture source of truth — is not updated to reflect the new fourth container.

  • 🟡 regressions: CLAUDE.md documents the observability stack as 'Three containers via docker-compose.yml' (line 248) listing Jaeger, Grafana, and Loki. The PR adds a fourth container (MLflow) but does not update CLAUDE.md to reflect this — the count, container list, and env var documentation are now stale. This will mislead developers and AI agents relying on CLAUDE.md for architecture context. [fixable]

scripts/ensure-observability.sh

Clean infra-only addition of MLflow to the observability stack. Main gap is that CLAUDE.md — which serves as the project's architecture source of truth — is not updated to reflect the new fourth container.

  • 🟡 unsafe_assumptions (L27): The container name pattern mitzo_${svc}_1 assumes the legacy docker-compose v1 naming convention (project_service_N). Podman Compose and Docker Compose v2 use project-service-N (hyphens, not underscores) by default. If the existing services already work with this pattern it's fine for MLflow too, but this is worth verifying — if the naming convention doesn't match, the verification loop will always report 'FAILED TO START' for all services, not just MLflow. This is a pre-existing issue that now affects one more service. [fixable]

docker-compose.yml

Clean infra-only addition of MLflow to the observability stack. Main gap is that CLAUDE.md — which serves as the project's architecture source of truth — is not updated to reflect the new fourth container.

  • 🔵 style (L53): The SQLite backend store URI sqlite:///mlflow/mlflow.db is a triple-slash relative path, resolving to /mlflow/mlflow.db inside the container. This works because the volume mounts .mlflow-data at /mlflow, but the path coupling is implicit. Consider adding a brief comment or using an env var to make the relationship between the volume mount and the backend-store-uri clearer — other services in this file use named mount points that match their config files. [fixable]
  • 🔵 missing_tests: No integration point between the Mitzo server and MLflow exists yet — this is purely infra. The design doc (docs/design/otel-deep-instrumentation.md) mentions '(and later MLflow)' so this is clearly preparatory work. Not blocking, but the PR could benefit from a brief note (in the PR description or a doc) explaining how MLflow will be wired into Mitzo's server code in a follow-up.

Comment thread scripts/ensure-observability.sh Outdated
# 3. Verify
for svc in jaeger loki grafana; do
for svc in jaeger loki grafana mlflow; do
container="mitzo_${svc}_1"

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 unsafe_assumptions: The container name pattern mitzo_${svc}_1 assumes the legacy docker-compose v1 naming convention (project_service_N). Podman Compose and Docker Compose v2 use project-service-N (hyphens, not underscores) by default. If the existing services already work with this pattern it's fine for MLflow too, but this is worth verifying — if the naming convention doesn't match, the verification loop will always report 'FAILED TO START' for all services, not just MLflow. This is a pre-existing issue that now affects one more service. [fixable]

Comment thread docker-compose.yml Outdated
restart: always
ports:
- '5050:5000' # 5050 externally — 5000 conflicts with macOS AirPlay
command: mlflow server --host 0.0.0.0 --backend-store-uri sqlite:///mlflow/mlflow.db --default-artifact-root /mlflow/artifacts

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 style: The SQLite backend store URI sqlite:///mlflow/mlflow.db is a triple-slash relative path, resolving to /mlflow/mlflow.db inside the container. This works because the volume mounts .mlflow-data at /mlflow, but the path coupling is implicit. Consider adding a brief comment or using an env var to make the relationship between the volume mount and the backend-store-uri clearer — other services in this file use named mount points that match their config files. [fixable]

- Add purpose comment above MLflow service in docker-compose
- Add MLflow to onboarding doc observability table and architecture diagram

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@dimakis dimakis left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Centaur Review

Found 4 issue(s) (2 warning).

CLAUDE.md

Clean infrastructure addition. Main actionable item is updating CLAUDE.md to reflect the new 4-container stack; the docker-compose and script changes are correct and follow existing conventions.

  • 🔵 regressions: CLAUDE.md's Observability section documents the stack as 'Jaeger/Grafana/Loki' in multiple places and doesn't mention MLflow. Since CLAUDE.md is the primary project context document, it should be updated to stay consistent with docker-compose.yml and onboarding.md. Relevant sections: the 'Infrastructure' bullet under Observability (~3 containers → 4), and the env vars section. [fixable]

docker-compose.yml

Clean infrastructure addition. Main actionable item is updating CLAUDE.md to reflect the new 4-container stack; the docker-compose and script changes are correct and follow existing conventions.

  • 🟡 unsafe_assumptions (L54): The SQLite backend-store (sqlite:///mlflow/mlflow.db) runs inside the container with no WAL mode or concurrency guarantees. This is fine for single-user local dev, but the file is mounted on a host volume (.mlflow-data/). If MLflow is ever hit by concurrent requests (e.g., multiple experiment-logging scripts), SQLite without WAL can produce 'database is locked' errors. Consider adding ?check_same_thread=False to the URI or documenting that this is intentionally single-user only. Low risk for now but worth a comment in docker-compose.yml. [fixable]
  • 🔵 missing_tests: No automated test or CI step verifies that the MLflow container starts and the UI is reachable. This is consistent with how Jaeger/Loki/Grafana are handled (no container integration tests exist for those either), so this is not a regression — just a gap in the existing pattern.

scripts/ensure-observability.sh

Clean infrastructure addition. Main actionable item is updating CLAUDE.md to reflect the new 4-container stack; the docker-compose and script changes are correct and follow existing conventions.

  • 🟡 unsafe_assumptions (L27): The container name pattern mitzo_${svc}_1 assumes Docker Compose V1 naming convention. Podman Compose and Docker Compose V2 may use mitzo-${svc}-1 (hyphens instead of underscores). This pre-existing issue (not introduced by this PR) now affects MLflow verification too. If the naming convention works for the existing three services, it will work for MLflow — but worth noting as a latent fragility. [fixable]

Comment thread docker-compose.yml Outdated
restart: always
ports:
- '5050:5000' # 5050 externally — 5000 conflicts with macOS AirPlay
command: mlflow server --host 0.0.0.0 --backend-store-uri sqlite:///mlflow/mlflow.db --default-artifact-root /mlflow/artifacts

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 unsafe_assumptions: The SQLite backend-store (sqlite:///mlflow/mlflow.db) runs inside the container with no WAL mode or concurrency guarantees. This is fine for single-user local dev, but the file is mounted on a host volume (.mlflow-data/). If MLflow is ever hit by concurrent requests (e.g., multiple experiment-logging scripts), SQLite without WAL can produce 'database is locked' errors. Consider adding ?check_same_thread=False to the URI or documenting that this is intentionally single-user only. Low risk for now but worth a comment in docker-compose.yml. [fixable]

Comment thread scripts/ensure-observability.sh Outdated
# 3. Verify
for svc in jaeger loki grafana; do
for svc in jaeger loki grafana mlflow; do
container="mitzo_${svc}_1"

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 unsafe_assumptions: The container name pattern mitzo_${svc}_1 assumes Docker Compose V1 naming convention. Podman Compose and Docker Compose V2 may use mitzo-${svc}-1 (hyphens instead of underscores). This pre-existing issue (not introduced by this PR) now affects MLflow verification too. If the naming convention works for the existing three services, it will work for MLflow — but worth noting as a latent fragility. [fixable]

@dimakis dimakis left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Centaur Review

Found 4 issue(s) (1 warning).

scripts/ensure-observability.sh

Clean infrastructure addition — no bugs, but the ensure-observability.sh container name pattern may not match actual compose-generated names (pre-existing issue now affecting mlflow too), and CLAUDE.md should be updated to mention MLflow.

  • 🟡 regressions (L27): Container name pattern mitzo_${svc}_1 assumes Docker Compose V1 naming (project_service_N). Compose V2 (and podman-compose) uses project-service-N (hyphens, not underscores). Since the docker-compose.yml has no container_name directives and no name: project key, the actual name depends on the compose implementation. Adding mlflow inherits this existing bug, but if the verification has been silently failing for the other three services it won't be caught. Consider adding explicit container_name: to each service in docker-compose.yml, or switching the check to filter by compose service label (--filter label=com.docker.compose.service=$svc). [fixable]

docker-compose.yml

Clean infrastructure addition — no bugs, but the ensure-observability.sh container name pattern may not match actual compose-generated names (pre-existing issue now affecting mlflow too), and CLAUDE.md should be updated to mention MLflow.

  • 🔵 style (L54): The SQLite backend-store URI sqlite:///mlflow/mlflow.db is a triple-slash relative path resolved inside the container, which works but is fragile if the working directory changes. Using an absolute path sqlite:////mlflow/mlflow.db (four slashes) would be more explicit and resilient to future working_dir changes in the service definition. [fixable]
  • 🔵 missing_tests: No health check is defined for the mlflow service. The other services also lack health checks, but MLflow's startup (SQLite init + artifact root creation) can take a few seconds. Adding healthcheck: { test: ['CMD', 'curl', '-f', 'http://localhost:5000/health'], interval: 30s, timeout: 5s, retries: 3 } would let dependent services and the verification script rely on container health status. [fixable]

CLAUDE.md

Clean infrastructure addition — no bugs, but the ensure-observability.sh container name pattern may not match actual compose-generated names (pre-existing issue now affecting mlflow too), and CLAUDE.md should be updated to mention MLflow.

  • 🔵 style: CLAUDE.md's Observability section documents Jaeger, Loki, and Grafana but does not mention MLflow. The onboarding docs were updated but CLAUDE.md was not — this is the primary project context file and should list all observability components. [fixable]

Comment thread scripts/ensure-observability.sh Outdated
# 3. Verify
for svc in jaeger loki grafana; do
for svc in jaeger loki grafana mlflow; do
container="mitzo_${svc}_1"

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 regressions: Container name pattern mitzo_${svc}_1 assumes Docker Compose V1 naming (project_service_N). Compose V2 (and podman-compose) uses project-service-N (hyphens, not underscores). Since the docker-compose.yml has no container_name directives and no name: project key, the actual name depends on the compose implementation. Adding mlflow inherits this existing bug, but if the verification has been silently failing for the other three services it won't be caught. Consider adding explicit container_name: to each service in docker-compose.yml, or switching the check to filter by compose service label (--filter label=com.docker.compose.service=$svc). [fixable]

Comment thread docker-compose.yml Outdated
restart: always
ports:
- '5050:5000' # 5050 externally — 5000 conflicts with macOS AirPlay
command: mlflow server --host 0.0.0.0 --backend-store-uri sqlite:///mlflow/mlflow.db --default-artifact-root /mlflow/artifacts

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 style: The SQLite backend-store URI sqlite:///mlflow/mlflow.db is a triple-slash relative path resolved inside the container, which works but is fragile if the working directory changes. Using an absolute path sqlite:////mlflow/mlflow.db (four slashes) would be more explicit and resilient to future working_dir changes in the service definition. [fixable]

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dimakis

dimakis commented Jun 27, 2026

Copy link
Copy Markdown
Owner Author

Centaur Review

Found 2 issue(s) (1 warning).

docker-compose.yml

Clean infrastructure-only PR. One SQLite URI path issue (relative vs absolute) could cause data loss if the MLflow image changes its WORKDIR; otherwise LGTM.

  • 🟡 bugs (L54): SQLite URI sqlite:///mlflow/mlflow.db uses a relative path (3 slashes = relative in SQLAlchemy). It resolves correctly only if the container's WORKDIR is /. Use sqlite:////mlflow/mlflow.db (4 slashes) for an absolute path to guarantee the database lands in the /mlflow volume mount regardless of the image's working directory. [fixable]

docs/onboarding.md

Clean infrastructure-only PR. One SQLite URI path issue (relative vs absolute) could cause data loss if the MLflow image changes its WORKDIR; otherwise LGTM.

  • 🔵 style (L566): The 'Enable tracing' section lists env vars for Jaeger (OTEL_EXPORTER_OTLP_ENDPOINT) and Loki (LOKI_HOST) but does not mention MLFLOW_TRACKING_URI. CLAUDE.md documents it as optional — consider adding it here too for discoverability. [fixable]

…ation

Support both Compose v1 (mitzo_svc_1) and v2 (mitzo-svc-1) naming
conventions in ensure-observability.sh. Add config validation tests
for docker-compose.yml service definitions and port mappings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@dimakis dimakis left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Centaur Review

Found 1 issue(s).

docs/onboarding.md

Clean infrastructure PR — MLflow service is well-configured, shell script improvements benefit all services, tests cover the key assertions, and docs are updated consistently across CLAUDE.md and onboarding. One minor doc gap noted.

  • 🔵 style: The 'Enable tracing' section (lines 564-573) lists env vars for .env (OTEL_EXPORTER_OTLP_ENDPOINT, LOKI_HOST) but doesn't mention MLFLOW_TRACKING_URI, which CLAUDE.md documents as optional. Adding it here would keep the two docs consistent for users following the onboarding guide. [fixable]

@dimakis

dimakis commented Jun 27, 2026

Copy link
Copy Markdown
Owner Author

Centaur Review

Found 1 issue(s).

docs/onboarding.md

Clean infrastructure addition — MLflow service, docs, tests, and flexible container verification all look correct. One minor documentation gap where the onboarding guide doesn't mention the MLFLOW_TRACKING_URI env var that CLAUDE.md references.

  • 🔵 missing_tests (L97): The onboarding env var table (lines 82-98) lists OTEL_EXPORTER_OTLP_ENDPOINT and LOKI_HOST but omits the new MLFLOW_TRACKING_URI env var that CLAUDE.md documents. The "Enable tracing" section (lines 564-571) similarly omits it. Minor doc inconsistency — consider adding a row for MLFLOW_TRACKING_URI to the env var table and/or the .env snippet. [fixable]

@dimakis dimakis left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Centaur Review

Found 2 issue(s) (1 warning).

scripts/ensure-observability.sh

Clean infrastructure-only PR that adds MLflow to the observability stack with good docs, tests, and a nice improvement to container name matching in the verification script. No bugs or regressions.

  • 🟡 bugs (L28): The regex name=mitzo.*${svc} uses .* which is a PCRE/glob pattern, but podman's --filter name= uses Go's regexp.MatchString (RE2). This works in practice because .* is valid RE2, but the intent is substring matching — a simpler name=${svc} or name=.*${svc} would be more precise. The current pattern would also match a hypothetical container named mitzo_foo_grafana2 but that's unlikely in this compose setup. Not a real bug, just worth noting.

docker-compose.yml

Clean infrastructure-only PR that adds MLflow to the observability stack with good docs, tests, and a nice improvement to container name matching in the verification script. No bugs or regressions.

  • 🔵 style (L54): The MLflow SQLite path sqlite:///mlflow/mlflow.db is an absolute path inside the container (/mlflow/mlflow.db), which is correct since the volume maps .mlflow-data:/mlflow. However, unlike the Jaeger service which gets a TTL (BADGER_SPAN_STORE_TTL: 336h), the MLflow SQLite DB will grow unbounded. Consider documenting a manual cleanup strategy or noting this tradeoff, especially since .mlflow-data/ is gitignored and could grow large over time.

# 3. Verify — match both Compose v1 (mitzo_svc_1) and v2 (mitzo-svc-1) naming
for svc in jaeger loki grafana mlflow; do
# Filter matches any container whose name contains the service name
container=$("$PODMAN" ps --filter "name=mitzo.*${svc}" --format '{{.Names}}' 2>/dev/null | head -1)

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 bugs: The regex name=mitzo.*${svc} uses .* which is a PCRE/glob pattern, but podman's --filter name= uses Go's regexp.MatchString (RE2). This works in practice because .* is valid RE2, but the intent is substring matching — a simpler name=${svc} or name=.*${svc} would be more precise. The current pattern would also match a hypothetical container named mitzo_foo_grafana2 but that's unlikely in this compose setup. Not a real bug, just worth noting.

Comment thread docker-compose.yml Outdated
restart: always
ports:
- '5050:5000' # 5050 externally — 5000 conflicts with macOS AirPlay
command: mlflow server --host 0.0.0.0 --backend-store-uri sqlite:///mlflow/mlflow.db --default-artifact-root /mlflow/artifacts

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 style: The MLflow SQLite path sqlite:///mlflow/mlflow.db is an absolute path inside the container (/mlflow/mlflow.db), which is correct since the volume maps .mlflow-data:/mlflow. However, unlike the Jaeger service which gets a TTL (BADGER_SPAN_STORE_TTL: 336h), the MLflow SQLite DB will grow unbounded. Consider documenting a manual cleanup strategy or noting this tradeoff, especially since .mlflow-data/ is gitignored and could grow large over time.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@dimakis dimakis left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Centaur Review

Found 2 issue(s) (1 warning).

scripts/ensure-observability.sh

Clean addition of MLflow to the observability stack with good docs, tests, and a nice fix to container name matching — one set -e/pipefail interaction in the verify loop could crash the script if podman is flaky.

  • 🟡 bugs (L28): With set -euo pipefail at the top, the bare assignment container=$("$PODMAN" ps ... | head -1) will crash the script if podman ps returns non-zero (e.g. socket not ready). The original code was inside an if condition, which suppresses set -e. Wrap in container=$(...) || true or move it back into the if condition (e.g. if container=$(...) && [ -n "$container" ]; then). [fixable]

.gitignore

Clean addition of MLflow to the observability stack with good docs, tests, and a nice fix to container name matching — one set -e/pipefail interaction in the verify loop could crash the script if podman is flaky.

  • 🔵 style (L15): .mlflow-data/ is correctly added, but the sibling volume-mount directories .loki-data/ and .grafana-data/ (from docker-compose.yml) are also not gitignored. Pre-existing issue, but easy to fix while you're editing this file. [fixable]

# 3. Verify — match both Compose v1 (mitzo_svc_1) and v2 (mitzo-svc-1) naming
for svc in jaeger loki grafana mlflow; do
# Filter matches any container whose name contains the service name
container=$("$PODMAN" ps --filter "name=mitzo.*${svc}" --format '{{.Names}}' 2>/dev/null | head -1)

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 bugs: With set -euo pipefail at the top, the bare assignment container=$("$PODMAN" ps ... | head -1) will crash the script if podman ps returns non-zero (e.g. socket not ready). The original code was inside an if condition, which suppresses set -e. Wrap in container=$(...) || true or move it back into the if condition (e.g. if container=$(...) && [ -n "$container" ]; then). [fixable]

Comment thread .gitignore
certs/
logs/
.jaeger-data/
.mlflow-data/

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 style: .mlflow-data/ is correctly added, but the sibling volume-mount directories .loki-data/ and .grafana-data/ (from docker-compose.yml) are also not gitignored. Pre-existing issue, but easy to fix while you're editing this file. [fixable]

@dimakis dimakis left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Centaur Review

Found 4 issue(s) (1 warning).

docker-compose.yml

Clean infrastructure addition — MLflow service is well-integrated with consistent patterns. Minor suggestions around command syntax consistency and a volume/gitignore coherence test.

  • 🟡 unsafe_assumptions (L52): MLflow uses SQLite inside a container volume (sqlite:///mlflow/mlflow.db). SQLite is single-writer and not designed for container restarts under load — if the container is killed mid-write the WAL can corrupt. This is acceptable for a local dev tool but worth documenting the limitation. Additionally, there's no healthcheck on the MLflow service, unlike Grafana which has depends_on constraints. If downstream tooling ever depends on MLflow being ready, a healthcheck would be needed. [fixable]
  • 🔵 style (L52): The command value is a long inline string. Other services in this file use either array syntax (command: ['--config', ...] for jaeger) or short flags (command: -config.file=... for loki). Consider using YAML array syntax for consistency and readability: command: ['mlflow', 'server', '--host', '0.0.0.0', '--backend-store-uri', 'sqlite:///mlflow/mlflow.db', '--default-artifact-root', '/mlflow/artifacts']. [fixable]

server/__tests__/observability-config.test.ts

Clean infrastructure addition — MLflow service is well-integrated with consistent patterns. Minor suggestions around command syntax consistency and a volume/gitignore coherence test.

  • 🔵 missing_tests: The test validates that config files mention expected services and ports, but doesn't verify the MLflow volume mount pattern (.mlflow-data) matches the .gitignore entry. A test asserting the gitignore contains .mlflow-data/ would guard against accidental data commits, consistent with how .jaeger-data/ is already gitignored. [fixable]
  • 🔵 style (L35): The regex /\$\{?PODMAN\}?.*mitzo_\w+_1/ tests that $PODMAN commands don't use hardcoded mitzo_svc_1 names, but the error fallback line echo "[ensure-observability] mitzo_${svc}: FAILED TO START" still uses the mitzo_ prefix (without _1). The test passes because the regex requires _1 at the end, but the test description ("not hardcoded _svc_1 names in commands") could be clearer that it only checks $PODMAN invocations, not echo statements.

Comment thread docker-compose.yml
mlflow:
image: ghcr.io/mlflow/mlflow:v2.22.0
restart: always
ports:

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 unsafe_assumptions: MLflow uses SQLite inside a container volume (sqlite:///mlflow/mlflow.db). SQLite is single-writer and not designed for container restarts under load — if the container is killed mid-write the WAL can corrupt. This is acceptable for a local dev tool but worth documenting the limitation. Additionally, there's no healthcheck on the MLflow service, unlike Grafana which has depends_on constraints. If downstream tooling ever depends on MLflow being ready, a healthcheck would be needed. [fixable]

Comment thread docker-compose.yml
mlflow:
image: ghcr.io/mlflow/mlflow:v2.22.0
restart: always
ports:

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 style: The command value is a long inline string. Other services in this file use either array syntax (command: ['--config', ...] for jaeger) or short flags (command: -config.file=... for loki). Consider using YAML array syntax for consistency and readability: command: ['mlflow', 'server', '--host', '0.0.0.0', '--backend-store-uri', 'sqlite:///mlflow/mlflow.db', '--default-artifact-root', '/mlflow/artifacts']. [fixable]

// Should use --filter pattern, not hardcoded _svc_1 names in commands
expect(script).toContain('--filter');
// Actual container references use dynamic filter, not hardcoded names
expect(script).not.toMatch(/\$\{?PODMAN\}?.*mitzo_\w+_1/);

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 style: The regex /\$\{?PODMAN\}?.*mitzo_\w+_1/ tests that $PODMAN commands don't use hardcoded mitzo_svc_1 names, but the error fallback line echo "[ensure-observability] mitzo_${svc}: FAILED TO START" still uses the mitzo_ prefix (without _1). The test passes because the regex requires _1 at the end, but the test description ("not hardcoded _svc_1 names in commands") could be clearer that it only checks $PODMAN invocations, not echo statements.

- Convert MLflow command to YAML array syntax for consistency
- Add SQLite limitation comment
- Add gitignore coverage test for compose data volumes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@dimakis dimakis left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Centaur Review

Found 1 issue(s).

server/__tests__/observability-config.test.ts

Clean infra-only PR — well-documented, consistent with existing services, good test coverage. One minor regex robustness nit in the test.

  • 🔵 unsafe_assumptions (L42): The volume regex /\.\/.\w[\w-]*data/g is not anchored to the volume-path boundary (:) — it would match a substring like ./.mlflow-database as ./.mlflow-data, producing a wrong gitignore entry. Consider a stricter pattern like /\.\/.\w[\w-]*data(?=[\/:])/g or /\.\/.\w[\w-]*data\b/g to avoid partial matches on hypothetical future volumes. [fixable]

const gitignore = readFileSync(resolve(ROOT, '.gitignore'), 'utf-8');
const compose = readFileSync(resolve(ROOT, 'docker-compose.yml'), 'utf-8');
// Extract volume mount patterns like ./.foo-data
const volumeMatches = compose.match(/\.\/\.\w[\w-]*data/g) || [];

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 unsafe_assumptions: The volume regex /\.\/.\w[\w-]*data/g is not anchored to the volume-path boundary (:) — it would match a substring like ./.mlflow-database as ./.mlflow-data, producing a wrong gitignore entry. Consider a stricter pattern like /\.\/.\w[\w-]*data(?=[\/:])/g or /\.\/.\w[\w-]*data\b/g to avoid partial matches on hypothetical future volumes. [fixable]

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant