This repository contains a complete, production-like observability stack optimized for Fedora Workstation with rootless Podman. It is designed as an educational environment to help Developers and DevOps Engineers understand how modern monitoring tools interlock to provide comprehensive metrics, logging, tracing, profiling and alerting capabilities. The entire stack is automatically configured upon startup, including pre-provisioned Grafana dashboards, datasources, and alerting rules.
- Educational Benefits
- Architecture & Data Flow
- Service Port Map
- Tooling & Functionality
- Installation & Startup
- Additional scripts
- Usage & Exploration (Screenshots)
- Troubleshooting
- Teardown & Cleanup
Why use this stack? This environment is built to teach you:
- The four Pillars of Observability: How to seamlessly connect Metrics (Prometheus), Logs (Loki), Traces (Tempo) and Profiles (Pyroscope).
- Contextual Drill-down: How to configure Grafana datasources so you can jump directly from a spike in a metric to the specific log line, then to the exact application trace and finally to the specific line of code causing the bottleneck via a Flame Graph.
- Modern Collection: Using Grafana Alloy and OpenTelemetry Collector as modern, vendor-neutral data pipelines.
- S3-Compatible Storage: How Loki, Tempo and Pyroscope use MinIO object storage (the pgsty fork) for scalable, long-term data retention instead of local disks.
- Advanced Alerting Routing: The flow of an alert from Prometheus -> Alertmanager -> KeepHQ / Karma / Webhook-tester.
- Secure Local Networking: Running a complex stack via Traefik Reverse Proxy with TLS/SSL on your own custom domain using rootless Podman.
- Automated Validation: How to programmatically verify the health of all individual components and validate the end-to-end data flows across the entire observability pipeline.
The stack is designed around specific data flows.
Node-exporter, Podman-exporter, and Blackbox-exporter expose metrics -> Prometheus scrapes them -> Grafana visualizes them.
System (journald) and Container logs -> Grafana Alloy collects them -> Pushed to Loki -> Stored in MinIO -> Visualized in Grafana.
Application traces -> OpenTelemetry Collector -> Pushed to Tempo -> Stored in MinIO -> Visualized in Grafana.
Prometheus evaluates alert.rules.yml -> Fires to Alertmanager -> Alertmanager routes to Karma (UI), KeepHQ (AIOps), and Webhook-tester.
Applications can expose pprof endpoints -> Grafana Alloy scrapes these CPU and Memory profiles -> Pushed to Pyroscope -> Stored in MinIO -> Visualized as Flame Graphs in Grafana.
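For example, you can follow the first flow by hand: fetch the raw text-format metrics that node-exporter exposes (the same endpoint Prometheus scrapes), assuming the stack is already running and the local CA is trusted:

```bash
# raw Prometheus text-format metrics, as scraped on every interval
curl -s https://node-exporter.localhost/metrics | grep '^node_load1'
```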
| Service | Internal Port | Public URL | Description |
|---|---|---|---|
| Nginx | 80 | https://localhost | Landing page portal |
| Traefik | 443 / 8082 | https://traefik.localhost | Reverse proxy & Ingress routing |
| Grafana | 3000 | https://grafana.localhost | Main visualization & Dashboard UI |
| Prometheus | 9090 | https://prometheus.localhost | Time-series database |
| Loki | 3100 | https://loki.localhost | Log aggregation engine |
| Tempo | 3200 | https://tempo.localhost | Distributed Tracing backend |
| Pyroscope | 4040 | https://pyroscope.localhost | Continuous Profiling backend (flamegraph) |
| MinIO fork by pgsty | 9000 / 9001 | https://minio.localhost | S3 Object Storage for Loki & Tempo |
| Alloy | 12345 | https://alloy.localhost | Log collection pipeline |
| OTel Collector | 4317 / 8888 | https://otel-collector.localhost | Trace collection pipeline |
| Alertmanager | 9093 | https://alertmanager.localhost | Alert routing and deduplication |
| Karma | 8080 | https://karma.localhost | Alert visualization dashboard |
| KeepHQ | 3000 / 8080 | https://keep.localhost | Open-source AIOps and alert management |
| Webhook Tester | 8080 | https://webhook-tester.localhost | Endpoint for inspecting webhook payloads |
| node-exporter | 9100 | https://node-exporter.localhost | Host metrics |
| podman-exporter | 9882 | https://podman-exporter.localhost | Container metrics |
| Blackbox | 9115 | https://blackbox-exporter.localhost | HTTP/TCP endpoint probe |
Note: Instead of localhost, you can configure your own DOMAIN using the .env file.
1. Visualization & Portal
- Nginx (Portal): Serves as a static, central hub linking to all services and endpoints.
- Grafana: The 'single pane of glass'. Dashboards and Datasources are loaded automatically via Infrastructure as Code (IaC).
2. Metrics (The "What is happening?")
- Prometheus: Scrapes targets, stores time-series data, and evaluates alert rules.
- Exporters:
- Node Exporter: Collects host hardware and OS metrics.
- Podman Exporter: Collects metrics from rootless Podman containers.
- Blackbox Exporter: Probes endpoints over HTTP/TCP to monitor uptime.
3. Logging (The "Why is it happening?")
- Grafana Loki: Highly efficient log aggregation system. Uses MinIO for storage.
- Grafana Alloy: The collector that reads journald and /var/run/podman.sock (Podman) and pushes logs to Loki.
4. Tracing (The "Where is it happening?")
- Grafana Tempo: High-scale distributed tracing backend. Uses MinIO for storage.
- OpenTelemetry (OTel) Collector: Receives OTLP traces and forwards them to Tempo.
5. Profiling (The "Why is the code consuming resources?")
- Grafana Pyroscope: Continuous profiling backend. Analyzes performance profiles to identify CPU and memory bottlenecks. Uses MinIO for storage.
- Grafana Alloy: Scrapes pprof endpoints from running containers and sends them to Pyroscope.
6. Storage & Infrastructure
- MinIO (fork by pgsty): S3-compatible storage providing scalable object storage for Tempo, Loki and Pyroscope data.
- PostgreSQL: Relational database backend for KeepHQ.
- Traefik: Reverse proxy that acts as the entry point, handling routing and TLS termination for all *.${DOMAIN} domains.
7. Alerting & AIOps
- Alertmanager: Groups, routes, and throttles alerts from Prometheus and Loki.
- Karma: A clean, concise dashboard for viewing Alertmanager alerts.
- KeepHQ: Centralized alert management and AIOps platform.
- Webhook Tester: A simple tool to view the raw JSON payloads Alertmanager sends out.
This stack uses podman and podman compose where you may be used to docker and docker-compose. While Docker is more common, Podman offers several key architectural and security advantages:
- Daemonless Architecture: Unlike Docker, which requires a heavy, central background daemon (`dockerd`) running as root to manage containers, Podman is daemonless. It interacts directly with the container registry and runtime. This means no single point of failure: if the Docker daemon crashes, container management halts, whereas with Podman each container runs as an independent process.
- Rootless by Design (Enhanced Security): Security is a primary focus for Podman. It allows you to run containers as a standard, non-root user out of the box. If a container is somehow compromised, the attacker is confined to the privileges of that standard user, preventing them from gaining root access to the host machine.
- Fully Open Source & Unrestricted: Podman is a fully open-source project driven by the community and Red Hat. Unlike Docker Desktop, which has introduced commercial licensing and subscription models for enterprise environments, Podman remains completely free and unrestricted for all use cases.
- Drop-in Replacement: The transition is practically seamless. Podman's CLI is intentionally designed to be identical to Docker's. You can simply add `alias docker=podman` to your shell profile, and all your familiar commands (`build`, `run`, `ps`, `pull`) will work exactly as expected.
- Native Systemd Integration: Podman integrates fully into Linux environments. It can easily generate and manage `systemd` unit files from running containers, allowing you to treat containers as native system services that start automatically on boot.
- Kubernetes Readiness: Podman introduces the concept of "pods" (groups of containers sharing the same network and namespaces) locally, mirroring how Kubernetes operates. It can even generate Kubernetes YAML from local containers or run existing Kubernetes YAML directly, making the transition from local development to production orchestration much smoother.
When working with this stack, you will notice we use the command podman compose (with a space) instead of podman-compose (with a hyphen). While they look almost identical, there is a crucial difference in how they operate:
- `podman-compose` (with a hyphen): This is a community-driven Python script installed via the package manager. It acts as the actual "engine" or provider that parses the `compose.yml` file, translates it into Podman API calls, and starts the containers.
- `podman compose` (with a space): This is a native sub-command built directly into the Podman CLI. It acts as a smart wrapper (a "conductor"). It doesn't process the YAML itself; instead, it prepares the environment and then delegates the actual work to an external provider (like the `podman-compose` Python script).
Why we use podman compose:
The primary reason is environment variable handling. In our `compose.yml`, we use dynamic variables like `${DOMAIN:-localhost}`. If you run the Python script directly using `podman-compose --env-file .env up -d`, it injects these variables into the containers but struggles to substitute them within the YAML file itself.
However, by running the native wrapper using `podman compose --env-file .env up -d`, Podman correctly loads the `.env` variables into the host's environment before passing execution to the Python script. This ensures consistent interpolation of all your variables across the configuration.
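To see why interpolation matters, here is the shell parameter-expansion behavior that `${DOMAIN:-localhost}` in compose.yml relies on (a minimal illustration in plain bash, not part of the stack itself):

```bash
# ${VAR:-default} expands to "default" when VAR is unset or empty
unset DOMAIN
echo "grafana.${DOMAIN:-localhost}"    # prints: grafana.localhost
export DOMAIN=monitoring.home
echo "grafana.${DOMAIN:-localhost}"    # prints: grafana.monitoring.home
```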
Note: Even though we type `podman compose`, you must not uninstall the `podman-compose` package. The native wrapper relies on it under the hood to function!
This monitoring stack has been tested on Fedora Linux (Fedora 43 and 44).
Overview of the installation and deployment:
- Clone this repo.
- Configure your `DOMAIN` and passwords in a `.env` file.
- Export the environment variables from the `.env` file.
- If you use an HTTP internet proxy, run `source ./prepare_no_proxy.sh` to add your custom domain to `no_proxy`.
- Run the installation script `install.sh`; it will automatically configure the following:
  - Tools: `podman`, `podman-compose` and `gettext` will be installed if missing.
  - Podman Socket: The rootless user socket will be enabled for the Podman Exporter, Grafana Alloy and Traefik.
  - Networking: Unprivileged ports will be enabled, and `/etc/hosts` will be updated dynamically with your chosen domain.
  - TLS/SSL: A self-signed wildcard certificate will be generated and added to the Fedora trust store.
  - Secrets: Configuration files are created from the `./template` directory (for Alertmanager, index.html, Loki, Tempo, Traefik and Pyroscope) and secrets are substituted.
  - Domain: The stack will be configured to run on your custom `DOMAIN` (defaults to `localhost`).
- Start the stack using `podman compose`.
# 1. Clone the repository
git clone https://github.com/tedsluis/monitoring.git
cd monitoring
# 2. Show default variables
cat .env.example
# ==========================================
# Monitoring Stack Environment Variables
# Copy this file to '.env' and fill in your own values before running the stack.
# Don't use any special characters in the values in this file,
# as it may cause issues exporting them to environment variables.
# ==========================================
# Domain name (default: localhost)
DOMAIN=localhost
# Grafana
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=admin
# MinIO Storage. (Needs a password of at least 8 characters.)
MINIO_ROOT_USER=minio
MINIO_ROOT_PASSWORD=minio123
# Keep Database (PostgreSQL)
KEEP_DB_USER=keep
KEEP_DB_PASSWORD=keep
KEEP_DB_NAME=keep
# Keep API & Application
# Generate a secure string for the API key (e.g., via uuidgen)
KEEP_API_KEY=585af6cc-5c07-427f-966f-a263473ad402
# Generate a random string for NextAuth
NEXTAUTH_SECRET=change_me_to_a_secure_string
# External Integrations
OPENAI_API_KEY=dummy-key
# Webhook Tester (UUID for your specific test-endpoint)
WEBHOOK_TESTER_UUID=65ae26f0-131e-4390-8daa-bdaec17e77c2
# 3. Copy the example environment file
cp .env.example .env
# 4. Edit the .env file and
# fill in your secure passwords and custom DOMAIN (using an editor like vi, vim, code or nano).
vi .env
# 5. Load environment variables from the .env file
export $(grep -v '^#' .env | xargs)
# 6. Are you using an HTTP internet proxy? Add the necessary hostnames and IP addresses
# to your no_proxy/NO_PROXY environment variables:
source ./prepare_no_proxy.sh
# 7. Execute the installation script
./install.sh
======================================================
🚀 Starting installation
======================================================
✅ environment variables loaded from .env
✅ Installation is running for domain: localhost
📦 Checking prerequisites...
======================================================
📝 Generating configuration from templates...
copy template/traefik.yaml > traefik/traefik.yaml
copy template/traefik-dynamic.yaml > traefik/dynamic/traefik-dynamic.yaml
copy template/index.html > landing-page/index.html
copy template/alertmanager.yml > alertmanager/alertmanager.yml
copy template/loki-config.yaml > loki/loki-config.yaml
copy template/tempo.yaml > tempo/tempo.yaml
copy template/pyroscope.yaml > pyroscope/pyroscope.yaml
✅ Templates successfully processed.
======================================================
======================================================
🔐 Generating TLS certificates...
=== Start Certificate Renewal for localhost ===
Cleaning up old files...
Generating SAN configuration...
Generating Root CA...
[... openssl key generation progress output truncated ...]
-----
Generating Server Certificate...
Certificate request self-signature ok
subject=C=NL, ST=Utrecht, L=Utrecht, O=Utrecht, OU=Utrecht, CN=*.localhost
Fixing permissions (chmod 644)...
Updating Fedora Trust Store...
Checking if System Bundle trusts the certificate...
✓ SUCCESS: System bundle now trusts your certificate!
Restarting Traefik...
>>>> Executing external compose provider "/usr/bin/podman-compose". Please see podman-compose(1) for how to disable this message. <<<<
traefik
traefik
be0526f19960583f2e1ee78fb4098fe07ec7c7fdd9d970616b521afc39993a3a
traefik
=== Done! ===
Test now with: curl -v https://traefik.localhost
======================================================
======================================================
🔀 Configuring proxy settings...
You are not using a HTTP proxy.
Neither http_proxy, https_proxy, HTTP_PROXY nor HTTPS_PROXY is set. The no_proxy variable will not have any effect.
Please set http_proxy, https_proxy, HTTP_PROXY and HTTPS_PROXY environment variables if you intend to use a proxy.
# 8. Start the monitoring stack
podman compose up -d

Notes:
- The first time, the `minio-init` container will automatically create the required buckets (`loki-data`, `tempo-data` and `pyroscope-data`).
- You can edit the `.env` file, rerun the `./install.sh` script and run `podman compose down && podman compose up -d` every time you want to change the `DOMAIN` or update secrets in the templates.
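Put together, a typical reconfiguration cycle looks like this (the commands are taken from the steps above):

```bash
# change DOMAIN or secrets, then re-render templates and recreate the stack
vi .env
export $(grep -v '^#' .env | xargs)
./install.sh
podman compose down && podman compose up -d
```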
Important:
- Before you go to `https://localhost` (or your custom domain), restart your browser!
- If you are using an HTTP internet proxy, make sure you add `*.your-domain` and `your-domain` to your browser's no-proxy list.
podman ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
9a7d19394fea quay.io/prometheus/alertmanager@sha256:88b605de9aba0410775c1eb3438f951115054e0d307f23f274a4c705f51630c1 --config.file=/et... 24 hours ago Up 24 hours (healthy) 9093/tcp alertmanager
4385be7aa616 docker.io/grafana/alloy@sha256:1f40cf52adda8fab3e058f9347a5d165624ecb9fbc1527769cb744748961940d run --server.http... 24 hours ago Up 24 hours alloy
e39d9970fdc9 quay.io/prometheus/blackbox-exporter@sha256:e753ff9f3fc458d02cca5eddab5a77e1c175eee484a8925ac7d524f04366c2fc --config.file=/co... 24 hours ago Up 24 hours 9115/tcp blackbox-exporter
479da2e5f9fa docker.io/library/postgres@sha256:52098013b4b64a746626437d38afc03cabff6cdeb4d3d92e2342aa95f0ce56ea postgres 24 hours ago Up 24 hours (healthy) 5432/tcp keep-db
f19ac9b0fc24 docker.io/pgsty/minio@sha256:14cea493d9a34af32f524e538b8346cf79f3321eff8e708c1e2960462bd8936e server /data --co... 24 hours ago Up 24 hours (healthy) 9000/tcp minio
acafb4c0f6ca docker.io/library/nginx@sha256:5616878291a2eed594aee8db4dade5878cf7edcb475e59193904b198d9b830de nginx -g daemon o... 24 hours ago Up 24 hours (healthy) 80/tcp nginx
545194db4adb quay.io/prometheus/node-exporter@sha256:337ff1d356b68d39cef853e8c6345de11ce7556bb34cda8bd205bcf2ed30b565 --path.rootfs=/ho... 24 hours ago Up 24 hours (healthy) 9100/tcp node-exporter
d475858f4b28 quay.io/navidys/prometheus-podman-exporter@sha256:2ebb9e09101d8cc1e28e3f306b56a722450918e628208435201ed39bd62403cb 24 hours ago Up 24 hours (healthy) 9882/tcp podman-exporter
579a72c7b7e3 quay.io/prometheus/prometheus@sha256:7571a304e67fbd794be02422b13627dc7de822152f74e99e2bef95d29eceecde --config.file=/et... 24 hours ago Up 24 hours (healthy) 9090/tcp prometheus
bf15c54adfd7 docker.io/tarampampam/webhook-tester@sha256:85818267b450d3d386cad6510c561e09b974183ed2832c373bc83b125fc1b221 start 24 hours ago Up 24 hours webhook-tester
1a80fa997e58 ghcr.io/prymitive/karma@sha256:cae0afb8d083756a7a44413480847fa59c072659d909734924a10640e1de600d 24 hours ago Up 24 hours 8080/tcp karma
e56abf8f23b1 us-central1-docker.pkg.dev/keephq/keep/keep-api@sha256:0e95b90210f2caeaf6a654daec274cfe43101cf1c4cdbc9cd1fec1a99e791af6 gunicorn keep.api... 24 hours ago Up 24 hours (healthy) keep-backend
226eb18ecf08 docker.io/pgsty/mc@sha256:a7fe349ef4bd8521fb8497f55c6042871b2ae640607cf99d9bede5e9bdf11727 24 hours ago Exited (0) 24 hours ago minio-init
1c1061ab48b2 us-central1-docker.pkg.dev/keephq/keep/keep-ui@sha256:2041f65c7bbd64c2a800a4d11eedf0e99b89debfd6b88f0bbb109443eb6bcc23 24 hours ago Up 24 hours (healthy) 3000/tcp keep-frontend
9d15366f4ab2 docker.io/grafana/loki@sha256:73e905b51a7f917f7a1075e4be68759df30226e03dcb3cd2213b989cc0dc8eb4 -config.file=/etc... 24 hours ago Up 24 hours 3100/tcp loki
bae17c2ad577 docker.io/grafana/pyroscope:1.13.0 -config.file=/etc... 24 hours ago Up 24 hours 4040/tcp pyroscope
92ec7ed8987b docker.io/grafana/tempo@sha256:a6616c9d224770c883a67b50e4941e99c5df81b076ef05f516bb7cce5a96cec0 -config.file=/etc... 24 hours ago Up 24 hours tempo
c6150f320361 docker.io/otel/opentelemetry-collector-contrib@sha256:a516c26968aa1feb5e5fc0562e3338ea13755cb4f373603226bcc4e276374ad0 --config=/etc/ote... 24 hours ago Up 24 hours 4317-4318/tcp, 55679/tcp otel-collector
820acb069f16 docker.io/grafana/grafana@sha256:2e986801428cd689c2358605289c90ab37d2b39e24808874971f54c99bcdc412 24 hours ago Up 24 hours (healthy) 3000/tcp grafana
1c00aa3f62ae docker.io/library/traefik@sha256:34d5089d0b414945342848518b383f11f5b3a645504ed87b77ffeb9d683d0e48 traefik 22 minutes ago Up 22 minutes (healthy) 0.0.0.0:80->80/tcp, 0.0.0.0:443->443/tcp, 0.0.0.0:4317->4317/tcp traefik

Note: The minio-init container only runs briefly when starting MinIO and will have an Exited (0) status.
To ensure all components are successfully communicating with each other, you can run `run-tests.sh`, the automated test suite. It verifies the health of all individual components and validates the end-to-end data flows across the entire observability pipeline.
./run-tests.sh
========================================
🚀 Starting Automated Validation Suite
========================================
✔ Continue!
🔍 [CHECK] Smoketest: Are all defined containers running?
[INFO] Expected container count from compose.yml: 20
[INFO] Currently running containers: 20
✅ [SUCCESS] All required containers are running.
----------------------------------------
⏳ [WAIT] Checking container health status (Alertmanager, Grafana, Keep-db, Keep-frontend, Minio, Nginx, Node-exporter, Podman-exporter, Prometheus, Traefik)...
[INFO] Waiting for alertmanager to become healthy...
[SUCCESS] alertmanager is healthy!
[INFO] Waiting for grafana to become healthy...
[SUCCESS] grafana is healthy!
[INFO] Waiting for keep-db to become healthy...
[SUCCESS] keep-db is healthy!
[INFO] Waiting for keep-frontend to become healthy...
[SUCCESS] keep-frontend is healthy!
[INFO] Waiting for minio to become healthy...
[SUCCESS] minio is healthy!
[INFO] Waiting for nginx to become healthy...
[SUCCESS] nginx is healthy!
[INFO] Waiting for node-exporter to become healthy...
[SUCCESS] node-exporter is healthy!
[INFO] Waiting for podman-exporter to become healthy...
[SUCCESS] podman-exporter is healthy!
[INFO] Waiting for prometheus to become healthy...
[SUCCESS] prometheus is healthy!
[INFO] Waiting for traefik to become healthy...
[SUCCESS] traefik is healthy!
🔍 [CHECK] Identifying internal Podman network...
🔌 [INFO] Using internal network: monitoring_monitoring-net
[INFO] Using ephemeral curl container for internal API testing.
----------------------------------------
🔍 [TEST] Prometheus API & Base Health
✅ [SUCCESS] Prometheus API is reachable and reports healthy.
----------------------------------------
🔍 [TEST] Prometheus Targets (Max 2 minutes wait)
[INFO] Fetching Prometheus targets (Attempt 1/12)...
✅ [SUCCESS] All Prometheus targets are UP and successfully scraped.
========================================
🌐 Starting Podman monitoring-net network Tests (via HTTP)
========================================
----------------------------------------
🔍 [TEST] Grafana API
✅ [SUCCESS] http://grafana:3000/api/health is reachable and healthy.
----------------------------------------
🔍 [TEST] Alertmanager
✅ [SUCCESS] http://alertmanager:9093/-/healthy is reachable and healthy.
----------------------------------------
🔍 [TEST] Keep API
✅ [SUCCESS] http://keep-backend:8080/ is reachable and healthy.
----------------------------------------
🔍 [TEST] Traefik Routing (using Nginx)
✅ [SUCCESS] http://traefik:80 is routing requests correctly.
----------------------------------------
🔍 [TEST] Alloy
✅ [SUCCESS] http://alloy:12345/-/healthy is reachable and healthy.
----------------------------------------
🔍 [TEST] Blackbox Exporter
✅ [SUCCESS] http://blackbox-exporter:9115/-/healthy is reachable and healthy.
----------------------------------------
🔍 [TEST] Karma Dashboard
✅ [SUCCESS] http://karma:8080/health is reachable and healthy.
----------------------------------------
🔍 [TEST] Keep Frontend
✅ [SUCCESS] http://keep-frontend:3000/api/healthcheck is reachable and healthy.
----------------------------------------
🔍 [TEST] Loki
✅ [SUCCESS] http://loki:3100/ready is reachable and healthy.
----------------------------------------
🔍 [TEST] MinIO
✅ [SUCCESS] http://minio:9000/minio/health/live is reachable and healthy.
----------------------------------------
🔍 [TEST] Nginx
✅ [SUCCESS] http://nginx:80 is reachable.
----------------------------------------
🔍 [TEST] Node Exporter
✅ [SUCCESS] http://host.containers.internal:9100 is reachable.
----------------------------------------
🔍 [TEST] OpenTelemetry Collector
✅ [SUCCESS] http://otel-collector:8888/metrics is reachable.
----------------------------------------
🔍 [TEST] Podman Exporter
✅ [SUCCESS] http://podman-exporter:9882/metrics is reachable.
----------------------------------------
🔍 [TEST] Pyroscope
✅ [SUCCESS] http://pyroscope:4040/ready is reachable and healthy.
----------------------------------------
🔍 [TEST] Tempo
✅ [SUCCESS] http://tempo:3200/ready is reachable and healthy.
----------------------------------------
🔍 [TEST] Webhook Tester
✅ [SUCCESS] http://webhook-tester:8080 is reachable.
========================================
🌐 Starting Reverse Proxy Tests (via HTTPS/443)
========================================
----------------------------------------
🔍 [TEST] Proxy: Alloy
✅ [SUCCESS] https://alloy.localhost/-/healthy is reachable.
----------------------------------------
🔍 [TEST] Proxy: Alertmanager
✅ [SUCCESS] https://alertmanager.localhost/-/healthy is reachable.
----------------------------------------
🔍 [TEST] Proxy: Grafana
✅ [SUCCESS] https://grafana.localhost/api/health is reachable.
----------------------------------------
🔍 [TEST] Proxy: Karma
✅ [SUCCESS] https://karma.localhost/health is reachable.
----------------------------------------
🔍 [TEST] Proxy: KeepHQ (Frontend)
✅ [SUCCESS] https://keep.localhost/api/healthcheck is reachable.
----------------------------------------
🔍 [TEST] Proxy: MinIO Console
✅ [SUCCESS] https://minio.localhost/ is reachable.
----------------------------------------
🔍 [TEST] Proxy: Traefik Dashboard
✅ [SUCCESS] https://traefik.localhost/dashboard/ is reachable.
----------------------------------------
🔍 [TEST] Proxy: Webhook Tester
✅ [SUCCESS] https://webhook-tester.localhost/ is reachable.
========================================
🔗 Starting End-to-End Tempo Tracing Pipeline Test
========================================
🔍 [TEST] Flow: Traefik -> Grafana -> OTel -> Tempo -> Prometheus
[INFO] Injected Traceparent: 00-b9b8cc6ab6b843d78638c58e1b4f9d0f-59cc819b0ed64b55-01
[INFO] Waiting for the tracing pipeline to buffer and flush (max 30s)...
✔ Continue!
✅ [SUCCESS] Tempo successfully received and stored the exact Trace ID!
[INFO] Verifying tracing metrics flow in Prometheus...
✅ [SUCCESS] Prometheus confirms that tracing metrics are actively flowing!
========================================
📜 Starting End-to-End Loki Logging Pipeline Test
========================================
🔍 [TEST] Flow: Script -> Loki API (Push) -> MinIO (Storage) -> Loki API (Query)
[INFO] Injected Log Message: e2e-test-log-entry-8490416f-e91a-43cd-bf29-cfc5f534ae5c
[INFO] Successfully pushed log to Loki API.
[INFO] Waiting for Loki to index the log (max 50s)...
✔ Continue!
✅ [SUCCESS] Loki successfully ingested, indexed, and returned the test log!
========================================
🪵 Starting Alloy Auto-Discovery Test
========================================
🔍 [TEST] Flow: Container Logs -> Alloy -> Loki
[INFO] Verifying if Alloy is actively scraping containers and sending them to Loki...
✅ [SUCCESS] Alloy is actively scraping container logs and shipping them to Loki!
========================================
🚨 Starting End-to-End Alerting Pipeline Tests
========================================
🔍 [TEST] Flow: Prometheus (Rules Engine) -> Alertmanager
[INFO] Checking if Alertmanager is receiving the 'Watchdog' alert from Prometheus...
✅ [SUCCESS] Alertmanager is receiving alerts from Prometheus!
----------------------------------------
🔍 [TEST] Flow: Loki (Ruler) -> Alertmanager
[INFO] Checking if Alertmanager is receiving the 'LokiWatchdog' alert from Loki...
✅ [SUCCESS] Alertmanager is receiving alerts from Loki!
----------------------------------------
🔍 [TEST] Flow: Alertmanager -> Karma Dashboard
[INFO] Checking if Karma is actively parsing and visualizing alerts from Alertmanager...
✅ [SUCCESS] Karma is successfully receiving and grouping alerts from Alertmanager (Total: 2)!
========================================
📊 Starting PromQL Data Integrity Test
========================================
🔍 [TEST] Flow: Exporters -> Prometheus TSDB -> PromQL Evaluation
[INFO] Evaluating PromQL: up{job="node-exporter"}
✅ [SUCCESS] PromQL successfully evaluated the metric (value: 1).
----------------------------------------
🔍 [TEST] Flow: Verify all Prometheus targets are UP (via PromQL)
[INFO] Evaluating PromQL: up == 0
✅ [SUCCESS] No targets are reporting '0'. All targets are UP in the TSDB!
----------------------------------------
[INFO] Verifying Blackbox Exporter End-to-End flow...
✅ [SUCCESS] Prometheus confirms Blackbox Exporter is successfully executing HTTP probes!
----------------------------------------
[INFO] Verifying Podman Exporter End-to-End flow (Rootless Socket)...
✅ [SUCCESS] Prometheus confirms Podman Exporter is actively reading container metrics from the rootless socket!
----------------------------------------
[INFO] Verifying Traefik Metrics End-to-End flow...
✅ [SUCCESS] Prometheus confirms Traefik is actively exposing internal metrics!
========================================
🔥 Starting End-to-End Pyroscope Profiling Pipeline Test
========================================
🔍 [TEST] Flow: Alloy (Scraper) -> Pyroscope
[INFO] Verifying profiling metrics flow in Prometheus...
✅ [SUCCESS] Prometheus confirms that Alloy is actively scraping and sending profiles to Pyroscope!
========================================
🪣 Starting Storage Verification Test (MinIO)
========================================
🔍 [TEST] Flow: minio-init -> MinIO Buckets
[INFO] Checking if Loki and Tempo buckets exist in MinIO...
✅ [SUCCESS] Bucket 'loki-data' exists.
✅ [SUCCESS] Bucket 'tempo-data' exists.
✅ [SUCCESS] Bucket 'pyroscope-data' exists.
========================================
🎉 [COMPLETE] All tests completed successfully! Stack is stable.

podman compose is a utility designed to help you define and run multi-container applications seamlessly without relying on a central daemon.
- What it is: `podman compose` is a command that allows you to manage multi-container environments using Podman. It is fully compatible with the Compose specification, meaning you can often use your existing `docker-compose` projects without any modifications.
- How it works: Under the hood, `podman compose` reads your configuration file and translates the instructions into native Podman commands. Because Podman is daemonless and rootless, `podman compose` executes these commands in the context of the user running it. It automatically handles the creation of networks (or Pods, depending on the configuration) so your containers can securely discover and communicate with each other locally.
- The Role of `./compose.yml`: The `./compose.yml` file serves as the definitive blueprint for your application stack. It is a declarative YAML file where you define your entire infrastructure as code: services, image versions, port mappings, persistent volumes, and environment variables. Instead of manually executing long strings of CLI commands, you simply run `podman compose up -d`, and the tool reads this file to build, connect, and start your entire environment in a reproducible way.
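A few commands to see this in practice: bring the stack up from compose.yml and inspect what Podman created (output will vary with your setup):

```bash
# start everything defined in compose.yml
podman compose up -d

# list the resulting containers with their health status
podman ps --format '{{.Names}}\t{{.Status}}'

# show the network that was created for service discovery
podman network ls
```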
# Podman Compose help
podman compose --help
# stop all containers
podman compose down
# start all containers
podman compose up -d
# restart all containers
podman compose down && podman compose up -d
# restart a specific container and include changes from compose.yaml
podman compose down webhook-tester && podman compose up -d --force-recreate webhook-tester
# restart a specific container without applying compose.yaml changes
podman restart webhook-tester

# Podman help
podman --help
# check container log
podman logs prometheus
# keep following container log
podman logs -f blackbox-exporter
# list running containers
podman ps
# list all containers (including stopped containers)
podman ps -a
# restart a container
podman restart loki
# execute a query in a Postgres container
podman exec -it keep-db psql -U keep -d keep -c "\d tenant;"
# Look up health state log properties of a container
podman inspect --format='{{json .State.Health}}' tempo | jq '.Log[-1]'
# run an HTTPS request to docker.io in a temporary curl container
podman run --rm docker.io/curlimages/curl:latest -sI "https://auth.docker.io/token?service=registry.docker.io"

Docs: https://podman.io/docs
This script is only needed if you use an HTTP proxy for your internet connection and have configured environment variables like http_proxy, https_proxy, no_proxy, HTTP_PROXY, HTTPS_PROXY and NO_PROXY. In that case, you need to add the hostnames and IP addresses used inside this monitoring stack to your no_proxy and NO_PROXY variables.
The install.sh script already executes this during setup. However, because environment variables are session-specific, you might need to run this again when you open a new terminal shell:
Source the script below to add the necessary hostnames and IP addresses:
source ./prepare_no_proxy.sh

Important: You also need to add your custom domain to the HTTP proxy settings of your browser: no proxy = YOURDOMAIN, *.YOURDOMAIN
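For reference, after sourcing the script you can check the result; the exact entries depend on your DOMAIN and on what the script adds:

```bash
# inspect the no_proxy variable after sourcing the script
echo "$no_proxy"
# e.g. localhost,127.0.0.1,.localhost   (illustrative output, not exhaustive)
```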
To ensure secure connections (https://*.${DOMAIN}) without browser warnings, you need a TLS certificate signed by a local CA that has been added to your Fedora trust store.
Note: The install.sh script already generates these TLS certificates automatically. You only need to run this script manually if your certificates expire, or if you have issues with your local trust store.
# set your own DOMAIN, like monitoring.home
vi .env
./renew-certs.sh
=== Start Certificate Renewal ===
Cleaning up old files...
Generating SAN configuration...
Generating Root CA...
[... openssl key generation progress output truncated ...]
-----
Generating Server Certificate...
Certificate request self-signature ok
subject=C=NL, ST=Utrecht, L=Utrecht, O=Utrecht, OU=Utrecht, CN=*.localhost
Fixing permissions (chmod 644)...
Updating Fedora Trust Store...
Checking if System Bundle trusts the certificate...
✓ SUCCESS: System bundle now trusts your certificate!
Restarting Traefik...
WARN[0010] StopSignal SIGTERM failed to stop container traefik in 10 seconds, resorting to SIGKILL
traefik
traefik
5d693930d305bbc871c7b212eeb1bc0f830ddc24318fd993e721d346f9dca013
traefik
=== Done! ===
Test now with: curl -v https://grafana.localhost

Note: Before you try https://localhost in your web browser, make sure you restart your browser first!
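You can also verify the certificate from the command line before opening the browser (standard curl/openssl usage; hostnames assume DOMAIN=localhost):

```bash
# the TLS handshake should succeed without -k once the CA is trusted
curl -sI https://grafana.localhost | head -n 1

# inspect the served certificate's subject, issuer and validity window
openssl s_client -connect grafana.localhost:443 -servername grafana.localhost </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -issuer -dates
```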
Go to https://localhost (or your own custom DOMAIN).
To make navigating this observability stack effortless, we use NGINX to serve a static landing page ./landing-page/index.html. This page acts as the central frontend portal for all the monitoring tools.
Instead of memorizing various ports and subdomains, this portal provides a clean, unified interface with quick links to everything you need:
- Tools: Direct access to all core applications like Grafana, Prometheus, Alertmanager, Karma, KeepHQ and MinIO.
- Metrics Exporters: Quick links to the raw metric endpoints for all running services and exporters.
- Grafana Dashboards: Direct links to instantly open the pre-provisioned dashboards.
- Drilldown & Explore: Shortcuts to advanced Grafana Explore and Drilldown views for metrics, logs, and traces.
See the screenshots below for an impression of the NGINX landing pages:
When you navigate to Grafana or MinIO, you need to log in with the user accounts defined in your .env file. If you used the example values, these are:
| Service | Username | Password | Note |
|---|---|---|---|
| Grafana | admin | value of GRAFANA_ADMIN_PASSWORD | Configured via .env file. |
| MinIO | minio | value of MINIO_ROOT_PASSWORD | Configured via .env file. |
Prometheus is the core metrics engine of this observability stack. It is a powerful time-series database (TSDB) that records numeric data—such as CPU utilization, network traffic, memory consumption, and application-specific metrics.
Unlike traditional monitoring tools that wait for systems to send data to them, Prometheus primarily uses a pull-based model. It actively "scrapes" (fetches over HTTP) metrics from designated target endpoints (like our exporters) at regular intervals. Once the data is ingested, users can leverage its highly flexible query language, PromQL, to slice, dice, and aggregate the metrics for visualization in Grafana. It also continuously evaluates these metrics against custom rules to trigger real-time notifications via Alertmanager when specific thresholds are breached.
How it works in this stack (prometheus.yml): The central brain instructing Prometheus what to do is located in ./prometheus/prometheus.yml. This configuration file orchestrates several crucial tasks:
- Global Settings: It defines the default scrape_interval (typically 15 seconds), dictating how often Prometheus polls the targets for fresh data.
- Rule Files: It instructs Prometheus to load and evaluate the alert rules defined in alert.rules.yml (e.g., "Alert if disk space is > 90%").
- Alerting Configuration: It specifies the destination for fired alerts, pointing Prometheus to the local Alertmanager container (`http://alertmanager:9093`).
- Scrape Configurations (scrape_configs): This is the most important section. It contains the inventory of all services Prometheus needs to monitor. It maps out jobs and targets using the internal container network hostnames, such as node-exporter:9100, podman-exporter:9882, alloy:12345, traefik:8082, and the various blackbox HTTP/TCP probes.
Note: While Prometheus is famous for pulling data, version 3.x also supports pushing metrics natively. In this stack, Tempo is configured to push its internal metrics directly to Prometheus.
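You can exercise the pull model and PromQL directly from the shell. The `up` metric exists in every Prometheus installation; `jq` is assumed to be installed:

```bash
# ask Prometheus which scrape targets are up (1) or down (0)
curl -s 'https://prometheus.localhost/api/v1/query' \
  --data-urlencode 'query=up' \
  | jq '.data.result[] | {job: .metric.job, instance: .metric.instance, up: .value[1]}'
```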
Go to https://prometheus.localhost
| Endpoint path | Description |
|---|---|
| `/query` | Metrics querier |
| `/alerts` | Alert rule overview |
| `/targets` | Status of the scrape targets |
| `/config` | Full Prometheus configuration |
See the screenshot below for an impression of the Prometheus UI - alert rules overview:

| configuration | configuration file |
|---|---|
| scrape target | ./prometheus/prometheus.yml |
| alert rules | ./prometheus/alert.rules.yml |
Prometheus exposes and scrapes its own metrics. Using these metrics, you can monitor Prometheus, see below:
See the screenshot below for an impression of the Prometheus metrics dashboard:

Docs:
- https://prometheus.io/docs/introduction/overview/
- https://prometheus.io/docs/instrumenting/exporters/
- https://github.com/prometheus/prometheus
Grafana Loki is a log aggregation system inspired by Prometheus. Unlike traditional logging systems (such as Elasticsearch) that index the full text of every log line, Loki only indexes the metadata (labels) attached to each log stream. This unique design choice makes it exceptionally lightweight, cost-effective, and fast to operate.
In a typical workflow, a collector like Grafana Alloy gathers logs from your containers or system journals and pushes them to Loki. Loki then compresses this data into chunks and stores it efficiently in an object storage backend. Users can seamlessly search and analyze these logs in Grafana using LogQL (Loki Query Language), leveraging the exact same labels used in Prometheus to instantly correlate metrics spikes with their underlying log events.
How it works in this stack (loki-config.yaml): The core behavior of Loki in this environment is defined in ./loki/loki-config.yaml:
- S3 Storage Backend (MinIO): Rather than saving heavy log files to local disk, Loki is configured to use the `s3` storage type. It connects directly to the local MinIO instance (`http://minio:9000`) using the credentials defined in your `.env` file and stores all log chunks in the loki-data bucket.
- TSDB Indexing: The `schema_config` defines that Loki uses `tsdb` (Time Series Database) for its index. This is the modern, highly optimized index format for Loki that drastically improves query performance and reduces storage costs compared to older formats.
- Data Retention & Compactor: To prevent the disk/MinIO from filling up indefinitely, the limits_config enforces a strict retention period of `168h` (7 days). The compactor component runs periodically to scan the MinIO bucket and automatically delete log data that has exceeded this age limit.
- The Ruler (Alerting): Loki isn't just for searching; it can proactively monitor your logs. The ruler block configures Loki to continuously evaluate LogQL alert rules stored in `/loki/rules` (e.g., triggering an alert if the word "ERROR" appears more than 10 times in a minute). If a rule threshold is met, Loki sends the alert directly to `http://alertmanager:9093`.
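A quick way to watch this pipeline end to end is to push a log line to Loki's HTTP API and query it back with LogQL (a minimal sketch; the `job` label used here is arbitrary):

```bash
# push one test log line (timestamps are nanoseconds since epoch)
curl -s -X POST 'https://loki.localhost/loki/api/v1/push' \
  -H 'Content-Type: application/json' \
  -d "{\"streams\":[{\"stream\":{\"job\":\"readme-test\"},\"values\":[[\"$(date +%s%N)\",\"hello from the README\"]]}]}"

# query it back with LogQL (query_range defaults to the last hour)
curl -s -G 'https://loki.localhost/loki/api/v1/query_range' \
  --data-urlencode 'query={job="readme-test"}' \
  | jq '.data.result[].values'
```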
Loki does not include a built-in user interface. Instead, it relies entirely on Grafana to serve as the unified dashboard for exploring and analyzing your logs, for example:
See the screenshot below for an impression of the Loki logging dashboard:

| configuration | configuration file |
|---|---|
| Loki config | ./loki/loki-config.yaml |
| Loki alert rules | ./loki/rules/fake/loki-alert-rules.yaml |
Like most modern cloud-native services, Loki exposes Prometheus metrics too, which are used to monitor Loki using the dashboard below:
See the screenshot below for an impression of the Loki metrics dashboard:

Docs:
- https://grafana.com/docs/loki/latest/
- https://github.com/grafana/loki
Grafana Tempo is a high-volume, distributed tracing backend designed to track the lifecycle of requests as they travel through complex, interconnected microservices. It helps developers and operators pinpoint exactly where latency, bottlenecks, or errors are occurring in a system. Unlike older tracing tools that require heavy, complex databases for indexing (like Elasticsearch or Cassandra), Tempo is exceptionally cost-effective because it only requires a basic object storage backend to store the raw trace data.
In this observability stack, applications (and components like Traefik and Grafana) send their traces to the OpenTelemetry Collector, which acts as a router and pushes them to Tempo. Within Grafana, users can query and visualize these request lifecycles using TraceQL. Thanks to standard Trace IDs, you can seamlessly jump directly from a log line in Loki or an exemplar in Prometheus to the exact corresponding trace span in Tempo for rapid root cause analysis.
How it works in this stack (tempo.yaml): The internal workings and storage behaviors of Tempo are configured in ./tempo/tempo.yaml. This file instructs Tempo on how to handle incoming traces and where to put them:
- Receivers: Configures Tempo to ingest trace data. In our setup, it primarily receives traces via the OTLP protocol directly from the local OpenTelemetry Collector.
- S3 Storage Backend (MinIO): Instructs Tempo to use the `s3` storage backend. It connects to our local MinIO instance (`http://minio:9000`) using the minio credentials and stores all trace blocks securely in the tempo-data bucket.
- WAL (Write-Ahead Log): Defines a local path (`/var/tempo/wal`) where Tempo temporarily buffers incoming traces before they are fully batched and uploaded to MinIO. This ensures no traces are lost if the container unexpectedly restarts.
- Compactor: A background process that periodically scans the MinIO bucket, combining smaller trace blocks into larger ones to improve querying performance and manage data retention policies.
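Once a trace has been ingested (for example via the traceparent trick described in the Grafana section below), you can fetch it straight from Tempo's HTTP API by its Trace ID. A sketch, assuming the example Trace ID used later in this README and that Tempo returns the trace as OTLP-style JSON:

```bash
# retrieve a stored trace by ID from Tempo's query endpoint
curl -s 'https://tempo.localhost/api/traces/11112222333344445555666677778888' \
  | jq '.batches[0].resource'
```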
Tempo does not include a built-in user interface. Instead, it relies entirely on Grafana to serve as the unified dashboard for exploring and analyzing your traces, for example:
See the screenshot below for an impression of a Tempo Trace through Traefik and Grafana:

| configuration | configuration file |
|---|---|
| Tempo config | ./tempo/tempo.yaml |
Tempo exposes Prometheus metrics too, which are used to monitor Tempo using the dashboard below:

Docs:
- https://grafana.com/docs/tempo/latest/
- https://github.com/grafana/tempo
Grafana Pyroscope is a continuous profiling tool. While metrics tell you what is happening (e.g., CPU is at 100%), and traces tell you where it is happening (e.g., a specific API endpoint is slow), profiling tells you exactly why it is happening by showing you the exact function or line of code responsible for the resource consumption.
Go to https://pyroscope.localhost
How it works in this stack:
- Scraping via Alloy: Instead of having applications push profiles directly, Grafana Alloy is configured to actively scrape standard pprof endpoints. Alloy collects the CPU and memory profiles from containers; in this stack, from the monitoring tools themselves.
- S3 Storage Backend (MinIO): Pyroscope connects to the local MinIO instance (`http://minio:9000`) and stores all profiling data in the pyroscope-data bucket.
- Data Retention: Profiling data can grow quickly. Pyroscope's built-in compactor is configured to aggregate this data and enforce a strict 14-day retention policy (`block_retention: 336h`), automatically cleaning up old profiles from MinIO.
- Trace-to-Profile Integration: In Grafana, the Tempo datasource is explicitly linked to the Pyroscope datasource using the service.name tag. This creates a seamless UI experience where you can jump from a trace span directly into a Flame Graph.
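You can inspect a pprof endpoint yourself. This sketch assumes that Alloy, like most Go services, serves the standard net/http/pprof index on its HTTP port:

```bash
# list the pprof profile types Alloy exposes on its own HTTP endpoint
curl -s https://alloy.localhost/debug/pprof/ | grep -oE 'heap|goroutine|profile' | sort -u
```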
| configuration | configuration file |
|---|---|
| Pyroscope config | ./pyroscope/pyroscope.yaml |
| Alloy scrape config | ./alloy/config.alloy |
See the screenshot below for an impression of the Pyroscope metrics dashboard:

Docs:
- https://grafana.com/docs/pyroscope/latest/
- https://github.com/grafana/pyroscope
Alertmanager handles alerts sent by client applications such as the Prometheus server and Loki's Ruler. While Prometheus and Loki evaluate data and fire alerts based on predefined thresholds, Alertmanager takes over the complex logistics of notification management.
Its primary goal is to prevent "alert fatigue" during major incidents. It achieves this by deduplicating redundant alerts, grouping related alerts together into a single notification, and intelligently routing them to the correct downstream receivers (like email, Slack, or webhook endpoints). It also provides operational features such as silencing (temporarily muting specific alerts) and inhibition (suppressing lower-priority alerts, like warnings, when a related critical alert is already active).
How it works in this stack: The core behavior and routing logic of Alertmanager are defined in ./alertmanager/alertmanager.yml. This configuration file orchestrates several key mechanisms:
- The Routing Tree (route): This section defines how incoming alerts are processed. It groups alerts based on specific labels (like alertname or severity). It sets timers such as group_wait (how long to wait to bundle alerts before sending the first notification), group_interval (how long to wait before sending updates about a group), and repeat_interval (how long to wait before re-sending a persistent alert).
- Receivers (receivers): This section defines the actual destinations for your alerts. In our educational stack, instead of sending emails or Slack messages, the receivers are configured as webhooks. Alerts are routed to the Webhook Tester (`http://webhook-tester:8080`) so you can easily inspect the raw JSON alert payloads for debugging, and to KeepHQ (`http://keep-backend:8080`) where the AIOps platform correlates and processes them further.
- Inhibition Rules (inhibit_rules): Defines logic to mute certain alerts if other specific alerts are already firing, keeping the dashboard and notifications focused on the root cause.
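You can inspect what Alertmanager currently holds via its standard v2 API (jq assumed):

```bash
# list active alerts together with their state and configured receivers
curl -s 'https://alertmanager.localhost/api/v2/alerts' \
  | jq '.[] | {alert: .labels.alertname, state: .status.state, receivers: [.receivers[].name]}'
```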
Go to https://alertmanager.localhost
| Path | Description |
|---|---|
| `/#/alerts` | Overview of current alerts |
| `/#/silences` | Ability to silence alerts |
| `/#/status` | Alertmanager status and configuration overview |
| `/#/settings` | Alertmanager UI settings |
See the screenshot below for an impression of the Alertmanager UI:

| configuration | configuration file |
|---|---|
| Alertmanager config | ./alertmanager/alertmanager.yml |
Alertmanager exposes Prometheus metrics too, which are used to monitor Alertmanager using the dashboard below:
See the screenshot below for an impression of the Alertmanager metrics dashboard:

Docs:
- https://prometheus.io/docs/alerting/latest/alertmanager/
- https://github.com/prometheus/alertmanager
Go to https://grafana.localhost
Grafana is the central visual heart of this stack, functioning as the 'single pane of glass' for all your observability data. While Prometheus, Loki, and Tempo act as the backend storage and query engines, Grafana provides the unified frontend interface. It allows you to query, visualize, alert on, and understand your metrics, logs, and traces all in one place.
A major highlight of this environment is that Grafana is fully pre-provisioned via Infrastructure as Code (IaC). Instead of manually clicking through the UI to connect databases and build dashboards from scratch, everything is automatically injected the moment the container starts.
Docs:
- https://grafana.com/docs/grafana/latest/
Automated dashboard provisioning: ./grafana-provisioning/dashboards/dashboard.yaml acts as a dashboard provider configuration. It tells Grafana to recursively scan the local directory ./grafana-provisioning/dashboards/json/ for any .json files and automatically load them into the UI. Because of this, all the specialized dashboards (for Node Exporter, Podman, Alloy, Blackbox, MinIO, etc.) are instantly available for use without requiring manual import steps.
See the screenshot below for an overview of the Grafana Dashboards:

The Explore mode provides an advanced interface for ad-hoc analysis and troubleshooting, where users can execute queries directly. Explore thus facilitates rapid incident diagnosis and root-cause analysis, without the need to configure predefined dashboards in advance.
Loki logs explore
The Loki datasource combined with LogQL makes it possible to efficiently filter log streams by labels, search for specific text patterns or regular expressions, and visualize log volumes alongside raw log lines.
See the screenshot below for an impression of the Explore logs:

Prometheus metrics explore
The Prometheus datasource, combined with PromQL queries, enables iterative exploration of time-series data, trend visualization, and comparison of metrics using split-view functionality.
See the screenshot below for an impression of the Explore metrics:

Tempo tracing explore
The Tempo datasource combined with TraceQL provides a detailed visualization of the lifecycle of requests through the distributed architecture. Using the waterfall view, users can analyze latency per component, isolating performance bottlenecks and errors within specific spans. Integration with TraceQL enables targeted filtering of traces, which, combined with correlated logs and metrics, allows efficient root-cause analysis during incidents. For example, it can be interesting to filter for requests that do not have an HTTP status code of 4xx or 5xx, or requests that take longer than 500ms.
See the screenshot below for an impression of the Explore traces:

To manually test the proxy path by sending a traceparent header, run this command in your terminal:
curl -k -H "traceparent: 00-11112222333344445555666677778888-1111222233334444-01" https://grafana.localhost/api/healthNext, in Grafana, go to Tempo Explore and search for the exact Trace ID: 11112222333344445555666677778888. If propagation works, you'll see a beautiful trace tree with the Traefik span at the top and the Grafana span below.
See the screenshot below for an impression of the Explore traces - service graph:

Pyroscope profiling explore
The Pyroscope datasource allows you to query continuous profiling data. Using Flame Graphs, you can visually analyze exactly which functions or lines of code are consuming the most CPU time or Memory allocations over a selected period. You can also use the "Diff" view to compare a profile from a healthy period against a profile from an incident period.
See the screenshot below for an impression of the Explore profiles:

The drill-down functionality within Grafana connects metrics, logs, traces and profiles contextually for in-depth analysis. From an anomaly in a metrics dashboard, you can navigate directly to the correlated log lines in Loki, and then use automatically detected trace IDs to switch to detailed request spans in Tempo. Finally, you can click on a specific Tempo span to open the Pyroscope Flame Graph for that exact moment in time. This integration eliminates the need to manually synchronize timestamps and identifiers between different datasources, significantly increasing the efficiency of root cause analysis and performance optimization.
See the screenshot below for an impression of the Metrics drilldown:

See the screenshot below for an impression of the Logs drilldown:

See the screenshot below for an impression of the Traces drilldown:

See the screenshot below for an impression of the Profiling drilldown:

Grafana Alerting provides a central interface for monitoring alerts. This module aggregates alert rules from both Prometheus (for metrics) and Loki (for log data), creating an overview of the operational status. Through this dashboard you can analyze the real-time status of alerts (‘Pending’ or ‘Firing’), examine the underlying query definitions, and gain insight into the evaluation criteria that safeguard the platform’s stability and availability.
See the screenshot below for an impression of the Grafana Alerting:

Datasources in Grafana serve as the technical interface to the underlying data storage systems, allowing the application to retrieve data without persisting it itself. In this configuration, Prometheus, Loki and Tempo are defined as the primary sources for exposing metrics, log files and distributed traces, respectively.
./grafana-provisioning/datasources/datasources.yaml instructs Grafana exactly how to connect to the internal network endpoints for Prometheus (http://prometheus:9090), Loki (http://loki:3100), and Tempo (http://tempo:3200). More importantly, this file configures the contextual correlations between them. For example, it defines "Derived Fields" for Loki, telling Grafana: "If you see a 32-character string that looks like a Trace ID in a log line, make it a clickable button that instantly opens that exact trace in Tempo." It also sets up exemplar links between Prometheus metrics and Tempo traces.
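To confirm the provisioning worked, you can list the datasources through Grafana's standard HTTP API, using the admin credentials from your .env file:

```bash
# list the provisioned datasources (names, types and internal URLs)
curl -s -u "admin:${GRAFANA_ADMIN_PASSWORD}" \
  'https://grafana.localhost/api/datasources' \
  | jq '.[] | {name, type, url}'
```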
See the screenshot below for an impression of the Grafana Datasources:

The datasources for Prometheus, Loki and Tempo are configured in ./grafana-provisioning/datasources/datasources.yaml.
Karma is a specialized, highly visual dashboard designed specifically for Alertmanager. While Alertmanager excels at routing and grouping alerts, its default UI is quite basic. Karma fills this gap by providing an intuitive, color-coded, and auto-refreshing interface that gives Operations and DevOps teams a consolidated overview of the platform's health at a glance.
Go to https://karma.localhost
How it works in this stack:
- Direct Alertmanager Integration: Karma continuously polls Alertmanager to display active alerts in organized, collapsible groups based on their severity and source.
- Prometheus History: It connects directly to Prometheus to enrich the current alerts with historical context, allowing you to see if an alert has been flapping.
- Custom Color Coding: As defined in karma.yaml, alerts are customized with distinct colors based on their severity (e.g., Red for Critical, Orange for Warning) and the specific job that triggered them (e.g., node-exporter, loki, alloy). This makes visual identification instantaneous (see the sketch after the configuration table below).
- Noise Reduction: It automatically filters out constant background alerts like the 'Watchdog' (dead man's switch) and strips redundant receiver labels to keep the dashboard clean and actionable.
- Live Auto-Refresh: The dashboard automatically refreshes every 20 seconds so you never miss a critical state change.
| configuration | configuration file |
|---|---|
| Karma config | ./karma/karma.yaml |
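As a hedged sketch of what the severity-based coloring in karma.yaml can look like, using Karma's alertmanager.servers and labels.color.custom configuration keys (the hex values and server name are illustrative):

```yaml
alertmanager:
  servers:
    - name: alertmanager
      uri: http://alertmanager:9093      # internal address of the Alertmanager container
labels:
  color:
    custom:
      severity:
        - value: critical
          color: "#d32f2f"               # red for critical alerts (illustrative hex)
        - value: warning
          color: "#f57c00"               # orange for warnings (illustrative hex)
```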
See the screenshot below for an impression of the Karma UI: an overview of all active warnings (e.g., "Disk almost full", "Container down" or "Health Check Failed").
See the screenshot below for an impression of the karma metrics dashboard:

Docs: https://github.com/prymitive/karma
Webhook-tester is a lightweight and incredibly useful utility for debugging and inspecting incoming HTTP requests. In this observability stack, it acts as a "dummy" or "catch-all" receiver for Alertmanager.
Go to https://webhook-tester.localhost
How it works in this stack: When Prometheus fires an alert, Alertmanager processes and routes it based on its configuration. By configuring Alertmanager to send a webhook to this tester, you can inspect the exact, raw JSON payloads that Alertmanager generates in real-time. This is highly beneficial for:
- Debugging Alert Payloads: Understanding the exact data structure, labels, and annotations that get sent out when an alert triggers.
- Template Development: Testing custom notification templates before connecting them to real-world communication channels (like Slack, Microsoft Teams, or PagerDuty).
- Integration Testing: Verifying that the alert routing rules in Alertmanager are working correctly and actually triggering the appropriate webhooks.
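A minimal sketch of such a webhook receiver in alertmanager.yaml follows; the receiver name and URL are illustrative, and the actual file is rendered from ./templates/ by install.sh:

```yaml
route:
  receiver: webhook-tester               # default route shown for brevity; the real routing tree may differ
receivers:
  - name: webhook-tester
    webhook_configs:
      # webhook-tester records every POST it receives under this session UUID
      - url: http://webhook-tester:8080/<WEBHOOK_TESTER_UUID>   # placeholder for the UUID from .env
        send_resolved: true              # also deliver a payload when the alert resolves
```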
See the screenshot below for an impression of the Webhook-tester UI:

Docs: https://github.com/tarampampam/webhook-tester
KeepHQ is an open-source AIOps and alert management platform. While Alertmanager handles the initial routing and deduplication of alerts, KeepHQ takes alert management a step further by providing advanced correlation, noise reduction, and automated workflow execution (auto-remediation). It acts as a single pane of glass for all your alerts, enriching them with context from various tools.
How it works in this stack: KeepHQ is deployed using three containers: a PostgreSQL database (keep-db), the core API and AIOps engine (keep-backend), and the web interface (keep-frontend).
Automatic Provider Configuration (IaC): For KeepHQ to intelligently correlate alerts and execute workflows, it needs access to your metrics and logs. Instead of manually configuring these connections in the Keep UI, this stack automatically provisions them on startup using provider configuration files located in ./keep/providers/:
| provider config | description |
|---|---|
| prometheus.yml | Automatically configures the local Prometheus instance as a data source (http://prometheus:9090). This allows KeepHQ to dynamically query time-series metrics to gather deeper context when an alert fires. |
| loki.yml | Automatically configures the local Grafana Loki instance as a data source (http://loki:3100). This enables KeepHQ to directly fetch relevant log lines and event streams associated with an incident. |
By injecting these configurations via Infrastructure as Code, KeepHQ is instantly ready to query both metrics and logs the moment the stack boots up, significantly accelerating troubleshooting and providing a seamless AIOps experience.
See the screenshot below for an impression of the KeepHQ feeds UI:

See the screenshot below for an impression of the KeepHQ plugins:

See the screenshot below for an impression of the KeepHQ metrics dashboard:

Docs: https://docs.keephq.dev/
MinIO is a high-performance, S3-compatible object storage server. In this observability stack, it serves as the persistent, long-term storage backend for both Grafana Loki (logs) and Grafana Tempo (traces).
Note: The https://github.com/minio/minio/ project is no longer maintained, so we use the fork https://github.com/pgsty/minio/, see https://vonng.com/en/db/minio-resurrect/ for more info.
Go to https://minio.localhost
Why use MinIO? Modern observability tools like Loki and Tempo have deliberately moved away from requiring heavy, complex databases (like Elasticsearch or Cassandra) for storage. Instead, they maintain a lightweight local index and push the bulk of their compressed log chunks and trace data into cheap, scalable object storage. MinIO provides this exact S3-like API locally, mimicking what you would use in the cloud (like AWS S3 or Google Cloud Storage).
How it works in this stack:
- Automatic Bucket Provisioning: When you start the stack, a temporary helper container named `minio-init` runs alongside the main MinIO server. It automatically connects to the server and creates the necessary storage buckets (loki-data and tempo-data). Once done, the helper container gracefully exits.
- Storage Flow: Loki and Tempo are configured to treat MinIO just like AWS S3. As they collect logs and traces, they bundle them into chunks and push them to their respective buckets in MinIO (see the configuration sketch after this list).
- Console & Management: Through the MinIO UI (link above), you can browse these objects, inspect bucket policies, and see exactly how much storage your logs and traces are consuming.
See the screenshot below for an impression of the MinIO UI - login:

See the screenshot below for an impression of the MinIO UI - object browser:

See the screenshot below for an impression of the MinIO UI - metrics info:

See the screenshot below for an impression of the MinIO overview dashboard:

See the screenshot below for an impression of the MinIO bucket dashboard:

See the screenshot below for an impression of the MinIO node dashboard:

Docs: https://min.io/docs/ (upstream) and https://github.com/pgsty/minio (the fork used here)
Grafana Alloy is a highly configurable, vendor-neutral observability data pipeline. In this monitoring stack, Alloy acts as the primary log collector, processor and profiling agent, bridging the gap between your raw logs (both container and host-level) and Grafana Loki, as well as collecting continuous profiling data for Pyroscope.
Go to https://alloy.localhost
How it works in this stack (config.alloy): The configuration file located at ./alloy/config.alloy defines three main data streams that converge into a single output pushed to Loki and Pyroscope:
- Stream 1: Container Logs (Podman Socket): Alloy discovers all running containers via the local Podman socket (`/var/run/docker.sock`). Instead of just grabbing raw logs, it enriches them with highly useful metadata: it extracts the `container_name`, shortens the `container_id` to 12 characters for readability, and tags the `image`, `pod_name`, and compose project. This enrichment is what allows you to effortlessly filter logs in Grafana based on specific containers or pods.
- Stream 2: Host System Logs (Journald): Alloy also reads the host machine's system logs directly from `/var/log/journal`. It extracts the systemd unit (e.g., sshd.service), the `syslog_identifier`, and the log level (e.g., info, warning, err) so you can quickly filter for host-level errors.
- Smart Deduplication: Because rootless Podman also writes container logs to the host's system journal, collecting both streams as-is would produce duplicate logs in Loki. The config.alloy explicitly prevents this with a `loki.relabel` rule that drops any journald log carrying a container ID. This keeps your logs clean and accurate.
- Stream 3: Continuous Profiling (pprof Scraping): Alloy is configured to actively scrape standard Go `pprof` endpoints from the monitoring tools in the stack (such as Prometheus, Loki, Tempo, Traefik, Node Exporter, and Alloy itself). It routinely collects CPU, memory, goroutine, block, and mutex profiles and forwards them to the Pyroscope backend. This agent-based pull model eliminates the need for each application to explicitly push its own profiles.
Through the Alloy web UI, you can view the health of these components and visually inspect the data flow pipeline using the Graph tab.
See the screenshot below for an impression of the Alloy UI:

See the screenshot below for an impression of the Alloy Graph:

See the screenshot below for an impression of the Alloy metrics dashboard:

Docs: https://grafana.com/docs/alloy/latest/
The Prometheus Blackbox Exporter is a probing tool that allows you to monitor the external health, availability, and response times of your endpoints. Instead of relying on internal application metrics (white-box monitoring), the Blackbox Exporter performs active "black-box" testing by making HTTP requests, TCP connections, or ICMP pings over the network just like a real user or client would.
How it works in this stack: The Blackbox Exporter acts as a proxy. Prometheus asks the Blackbox Exporter to probe a specific target using a specific module, and the Exporter returns metrics based on the result of that probe (e.g., probe_success, probe_duration_seconds).
- Configuration (blackbox.yml): The configuration file located at ./blackbox/blackbox.yml defines the modules (the "how"). For instance, it configures an `http_2xx` module which dictates that a probe is only successful if the target returns an HTTP 200 OK status. It also defines modules like `tcp_connect` to verify whether a raw network port is open.
- Prometheus Scrape Jobs (prometheus.yaml): While blackbox.yml defines the methods, prometheus.yaml defines the targets (the "what"). This stack includes several dedicated scrape jobs to ensure critical services are running (see the sketch after the table below):
| prometheus scrape Job | description |
|---|---|
| blackbox-http | A general-purpose job that probes standard web endpoints to verify if HTTP services are responding correctly. |
| blackbox-keep-api | A targeted probe specifically monitoring the backend API of KeepHQ to ensure the AIOps engine is healthy and accepting requests. |
| blackbox-keep-ui | A targeted probe verifying that the KeepHQ frontend interface is accessible to users. |
| blackbox-tcp | This job uses the TCP module to probe non-HTTP services. It checks if specific ports (like database ports or internal communication sockets) are open and successfully accepting TCP handshakes. |
| blackbox_exporter | This job doesn't probe external targets. Instead, it scrapes the internal metrics of the Blackbox Exporter container itself, allowing you to monitor how many probes have been executed, how long they took, and whether the exporter is experiencing any errors. |
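To make the two halves concrete, here is a hedged sketch of both files; module options, targets, and the exporter's container name are illustrative, and the repository's blackbox.yml and prometheus.yaml are authoritative:

```yaml
# ./blackbox/blackbox.yml -- defines the "how"
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_status_codes: [200]          # probe only succeeds on HTTP 200 OK
  tcp_connect:
    prober: tcp                          # succeeds if the TCP handshake completes
---
# prometheus.yaml (fragment) -- defines the "what"
scrape_configs:
  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]                 # ask the exporter to use the http_2xx module
    static_configs:
      - targets: ['http://grafana:3000'] # illustrative target
    relabel_configs:
      - source_labels: [__address__]     # the listed target becomes the ?target= parameter
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance           # keep the probed URL as the instance label
      - target_label: __address__
        replacement: blackbox-exporter:9115  # assumed container name; Prometheus actually scrapes the exporter
```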
See the screenshot below for an impression of the Blackbox dashboard:

Docs: https://github.com/prometheus/blackbox_exporter
The Prometheus Node Exporter is a fundamental component for infrastructure monitoring. While other exporters focus on specific applications, databases, or container engines, the Node Exporter focuses entirely on the host machine itself (in this case, your underlying Fedora Workstation).
How it works in this stack: It exposes a wide variety of hardware and OS-level metrics, such as CPU utilization, memory consumption, disk space, disk I/O, network bandwidth, and system load. Prometheus scrapes these metrics, allowing you to trigger alerts (e.g., "Disk almost full") and visualize the overall health of your host hardware.
Bypassing Container Isolation (compose.yml): By design, containers are isolated from the host. To accurately measure the host's hardware, the Node Exporter container requires special configuration. In the compose.yml, it is explicitly set to use network_mode: host and pid: host. Additionally, it mounts the host's entire root filesystem (/) to a /host directory inside the container. This deliberately breaks the container's isolation, allowing the exporter to read the actual /proc and /sys files of the underlying host operating system.
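A minimal compose.yml sketch of this pattern follows; the image tag and flags are illustrative, and the repository's compose.yml is authoritative:

```yaml
node-exporter:
  image: quay.io/prometheus/node-exporter:latest  # illustrative tag
  network_mode: host                # see the host's real network interfaces
  pid: host                         # see the host's real process table
  volumes:
    - /:/host:ro,rslave             # mount the host root filesystem read-only
  command:
    - '--path.rootfs=/host'         # tell the exporter where the host filesystem lives
```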
See the screenshot below for an impression of the node-exporter-full dashboard:

Docs: https://github.com/prometheus/node_exporter
The Prometheus Podman Exporter is designed to extract metrics specifically from a Podman environment. Since this observability stack intentionally uses daemonless, rootless Podman instead of Docker, traditional Docker exporters will not work. This exporter bridges that gap by providing deep visibility into your container runtime.
How it works in this stack: It exposes comprehensive metrics about running containers, pods, images, and volumes (e.g., container CPU/memory usage, network I/O, and container state). Prometheus scrapes these metrics, which power the dedicated Podman Grafana dashboards, allowing you to track the exact resource footprint of each service in the stack.
Rootless Socket Connection (compose.yml): To gather these metrics securely, the exporter needs to talk to the Podman API. In the compose.yml, this is achieved by mapping the host user's specific rootless Podman socket (/run/user/1000/podman/podman.sock) directly into the container. Furthermore, an environment variable CONTAINER_HOST=unix:///run/podman/podman.sock directs the exporter to connect to that socket at its in-container path, allowing it to monitor the containers without requiring root privileges on the host machine.
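Sketched as a compose.yml fragment (the image reference is an assumption; the socket wiring mirrors the description above):

```yaml
podman-exporter:
  image: quay.io/navidys/prometheus-podman-exporter:latest  # assumed upstream image
  environment:
    # point the exporter at the socket as it appears *inside* the container
    CONTAINER_HOST: unix:///run/podman/podman.sock
  volumes:
    # map the host user's rootless socket onto that in-container path
    - /run/user/1000/podman/podman.sock:/run/podman/podman.sock:ro
```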
See the screenshot below for an impression of the podman-exporter dashboard:

Docs: https://github.com/containers/prometheus-podman-exporter
The OpenTelemetry (OTel) Collector is a vendor-agnostic proxy, router, and processor for telemetry data. While it has the capability to handle metrics and logs, in this observability stack it is primarily dedicated to handling distributed traces.
How it works in this stack: Instead of applications sending trace data directly to the storage backend (Tempo), they send it to the OTel Collector. This architectural pattern decouples your applications from the storage backend, allowing you to easily switch backends, filter sensitive data, or batch requests without needing to change any application code.
- Trace Ingestion (OTLP): The collector listens for incoming traces via the standard OpenTelemetry Protocol (OTLP) over gRPC on port 4317. For instance, Grafana itself is configured in the compose.yml to send its internal traces to this exact port (`GF_TRACING_OPENTELEMETRY_OTLP_ADDRESS=otel-collector:4317`).
- Forwarding to Tempo: Once the collector receives and processes the incoming trace spans, it exports them directly to the local Grafana Tempo container, which subsequently stores them persistently in MinIO.
- Traefik gRPC Routing (compose.yml): To allow external applications or microservices to securely send traces to the collector, Traefik is configured with a dedicated TCP router using Server Name Indication (SNI). The rule `HostSNI('otel-collector.localhost')` routes incoming gRPC traffic directly to the collector. Additionally, the collector exposes its own internal health and performance metrics via an HTTP endpoint on port 8888.
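A minimal sketch of the corresponding collector configuration is shown below; the pipeline and exporter names are illustrative, and the collector config shipped in the repository is authoritative:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317        # OTLP/gRPC ingestion port
exporters:
  otlp/tempo:
    endpoint: tempo:4317              # forward spans to the local Tempo container
    tls:
      insecure: true                  # plain gRPC inside the container network
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]
  telemetry:
    metrics:
      address: 0.0.0.0:8888           # the collector's own metrics, scraped by Prometheus
```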
See the screenshot below for an impression of the OpenTelemetry-collector dashboard:

Docs: https://opentelemetry.io/docs/collector/
Traefik acts as the Edge Router and Reverse Proxy for this entire observability stack. It is the single entry point that intercepts all incoming requests (like when you visit https://grafana.localhost) and dynamically routes them to the correct backend container. Furthermore, it handles all TLS/SSL termination, ensuring your local connections are secure and free of browser warnings.
Go to: https://traefik.localhost
How it works in this stack: Traefik uses a combination of auto-discovery and file-based configurations to manage routing:
- Container Auto-Discovery (./compose.yml): By mounting the rootless Podman socket, Traefik automatically discovers running containers. The routing rules are defined directly on the containers using Docker labels (e.g., `traefik.http.routers.grafana.rule=Host('grafana.localhost')`).
- Static Configuration (./traefik/traefik.yaml): This is the main startup configuration. It defines the global "EntryPoints" (port 80 for HTTP, 443 for HTTPS, and 4317 for OTLP). It enforces an automatic redirect from HTTP to HTTPS for all traffic. Additionally, it configures Traefik to send its own internal distributed traces to the OpenTelemetry Collector and exposes its metrics for Prometheus to scrape.
- Dynamic Certificates (./traefik/dynamic/tls.yaml): Traefik continuously watches the dynamic directory. This specific file instructs Traefik where to find the custom wildcard certificates (`server.crt` and `server.key`) generated by the `renew-certs.sh` script, applying them automatically to all `*.localhost` routes.
- Dynamic Routing (./traefik/dynamic/traefik-dynamic.yaml): While most routing is handled automatically via labels, some services require manual rules. Because the Node Exporter runs on the host network (network_mode: host) to collect accurate hardware data, it lives outside the standard container bridge network. This file explicitly tells Traefik to route requests for node-exporter.localhost out of the container network and into the host machine via `http://host.containers.internal:9100` (see the sketch after this list).
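As a sketch of those two dynamic files in Traefik's file-provider syntax (the in-container certificate paths and router names are illustrative; the generated files under ./traefik/dynamic/ are authoritative):

```yaml
# tls.yaml -- wildcard certificate produced by renew-certs.sh
tls:
  certificates:
    - certFile: /certs/server.crt     # illustrative in-container path
      keyFile: /certs/server.key
---
# traefik-dynamic.yaml -- manual route to the host-networked Node Exporter
http:
  routers:
    node-exporter:
      rule: Host(`node-exporter.localhost`)
      service: node-exporter
      tls: {}                         # terminate TLS with the wildcard cert above
  services:
    node-exporter:
      loadBalancer:
        servers:
          - url: http://host.containers.internal:9100
```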
See the screenshot below for an impression of the Traefik UI:

See the screenshot below for an impression of the Traefik dashboard:

Docs: https://doc.traefik.io/traefik/
8.1 Changes to files like alertmanager.yaml, index.html, loki-config.yaml, pyroscope.yaml, tempo.yaml, traefik-dynamic.yaml or traefik.yaml have no effect after restarting the stack.
Problem: You modified a configuration file (e.g., alertmanager.yaml, index.html) directly in the service directory, rebuilt or restarted the stack with podman compose down && podman compose up -d, but your changes are not visible.
Cause: These files are not consumed directly by the containers. The real configuration files are generated from templates located in the ./templates/ directory when you run ./install.sh. Modifying the output files is pointless because they will be overwritten on the next installation run.
Solution:
Make your edits in the corresponding template file inside the ./templates/ directory (e.g. ./templates/alertmanager.yaml).
Run the installer to regenerate all configuration files from the templates:
```bash
./install.sh
```
Restart the stack so the containers pick up the new files:
```bash
podman compose down
podman compose up -d
```

8.2 Services are misconfigured when the stack is started with podman-compose instead of podman compose

Problem: You started the stack with the podman-compose command, but services are misconfigured (e.g. domain names missing, wrong paths).
Cause: The podman-compose tool (the stand-alone Python package) does not automatically substitute environment variables into the compose.yml file. The native podman compose (a subcommand of the podman client) does perform this substitution when you have exported the variables correctly.
Solution:
Always use the podman compose command (with a space, not a hyphen). If you previously used podman-compose, tear down the stack:
```bash
podman-compose down 2>/dev/null || true
```
Make sure your environment variables are loaded from .env (see the first troubleshooting item). Re-run the installer and start the stack with the correct command:
```bash
./install.sh
podman compose down
podman compose up -d
```

8.3 Requests for *.localhost are sent to an HTTP proxy by the browser

Problem: You are behind a corporate or personal HTTP proxy. When you visit https://my-domain the request is forwarded to the proxy instead of staying local, causing connection failures or timeouts.
Cause: Your browser is configured to use a proxy for all traffic, including requests for .localhost domains.
Solution: Configure your browser's proxy settings to exclude *.localhost and localhost itself (or your custom domain). The exact method depends on the browser:
- Firefox: Settings → Network Settings → "No proxy for" → add .localhost, localhost
- Chrome/Edge: These browsers usually respect the system proxy settings. Add an exception in your operating system's proxy configuration for .localhost and localhost.

After the change, restart your browser to ensure the new settings take effect.
8.4 Stack components still route internal traffic through the proxy

Problem: Even though the browser bypasses the proxy, components like Grafana Alloy, the OpenTelemetry Collector, or the installer script cannot connect to internal services or the outside world correctly.
Cause: The HTTP_PROXY, HTTPS_PROXY and NO_PROXY environment variables in your shell are not set or do not include the local monitoring domain. Internal traffic is being sent to the proxy, which either rejects it or cannot resolve the internal addresses.
Solution:
Use the provided helper script to prepare a correct NO_PROXY list:
```bash
source prepare_no_proxy.sh
```
Alternatively, manually ensure your environment contains the proper NO_PROXY value:
```bash
export NO_PROXY=".localhost,localhost,127.0.0.1,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,*.local,*.internal,.svc,${NO_PROXY}"
```
Re-run the installer and restart the stack to allow all components to pick up the new proxy settings:
```bash
./install.sh
podman compose down
podman compose up -d
```

8.5 Certificate warning persists after changing DOMAIN in .env

Problem: You updated the DOMAIN value in .env, ran ./install.sh and restarted the stack, but your browser still displays a certificate warning for the new domain.
Cause: The browser has not been restarted since the Traefik-generated certificate for the previous domain was cached, or the certificate for the new domain has not been regenerated yet.
Solution:
Restart your browser completely (close all windows and reopen). If the problem persists, Traefik might need a few seconds to load the new certificate. Wait a moment and reload the page. As a last resort, force the certificates to be regenerated:
```bash
./install.sh
```
...and restart all your browser sessions!
8.6 The export command fails with unmatched quotes or sets variables incorrectly

Problem: Running the export command produces an error like xargs: unmatched double quote, or variables are set incorrectly (e.g., truncated values, missing characters).
Cause: The .env file contains special characters such as spaces, quotes, $, &, or # in comments that confuse the xargs parser.
Solution:
Inspect your environment variables:
```bash
env | grep -P '(DOMAIN|GRAFANA_ADMIN_USER|GRAFANA_ADMIN_PASSWORD|MINIO_ROOT_USER|MINIO_ROOT_PASSWORD|KEEP_DB_USER|KEEP_DB_PASSWORD|KEEP_DB_NAME|KEEP_API_KEY|NEXTAUTH_SECRET|OPENAI_API_KEY|WEBHOOK_TESTER_UUID)'
WEBHOOK_TESTER_UUID=65ae26f0-131e-4390-8daa-bdaec17e77c2
MINIO_ROOT_PASSWORD=minio123
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=admin
OPENAI_API_KEY=dummy-key
DOMAIN=localhost
KEEP_DB_NAME=keep
MINIO_ROOT_USER=minio
KEEP_API_KEY=585af6cc-5c07-427f-966f-a263473ad402
NEXTAUTH_SECRET=change_me_to_a_secure_string
KEEP_DB_PASSWORD=keep
KEEP_DB_USER=keep
```
Missing any environment variable(s)? Or do they contain special characters like below?
```bash
GRAFANA_ADMIN_PASSWORD='My p@$$w0rd!'
```
Fix your .env file and:
```bash
export $(grep -v '^#' .env | xargs)
./install.sh
podman compose down && podman compose up -d
```

8.7 The ./run-tests.sh script reports failures, or podman ps -a shows containers with "Exited" status
Problem: The test script outputs errors (e.g. FAIL: Expected metric ... not found) or services are unreachable. Running podman ps -a reveals one or more containers in the "Exited" state instead of "Up".
Cause: A container terminated unexpectedly. This can be caused by misconfigured templates, missing environment variables, port conflicts, MinIO bucket creation not finished, volume permission issues, or a dependency loop.
Solution:
- Identify the failing container(s):
```bash
podman ps -a --filter "status=exited"
```
- Check the specific container logs – this is the most important diagnostic step:
```bash
podman logs <container-name>
```
For continuous monitoring while starting:
```bash
podman compose logs -f
```
- Inspect the exit code for a quick hint:
```bash
podman inspect <container-name> --format='{{.State.ExitCode}}'
```
- Common causes and their fixes:
  - Environment variables not loaded from .env: Symptom: logs show missing or default values. Fix: re-export the variables (export $(grep -v '^#' .env | xargs)), run ./install.sh, and then podman compose down && podman compose up -d.
  - Port conflict: Symptom: a log entry like bind: address already in use. Fix: check occupied ports with ss -tulpn, stop the conflicting process, or change the port in .env and re-run the installer.
  - MinIO bucket creation not yet completed: Symptom: Loki, Tempo or Pyroscope crash because they cannot find their bucket. Fix: wait for the minio-init container to finish, then restart the dependent services (podman restart loki tempo pyroscope).
  - Volume permission errors: Symptom: permission denied on a mounted file inside the container. Fix: verify that the host files are readable by your user (the containers run as your UID in rootless Podman). Check the SELinux context with ls -Z. Temporarily disable SELinux for debugging (sudo setenforce 0) and restore it later.
  - Template not regenerated after editing: Symptom: the container uses an outdated configuration. Fix: make changes in the ./templates/ directory, run ./install.sh, and restart the stack.
- After fixing the root cause, tear down and bring the stack back up cleanly:
```bash
podman compose down
podman compose up -d
```
Re-run the test script to verify that everything is healthy:
```bash
./run-tests.sh
```
Tip: Use podman compose ps to see the current status of all containers at a glance. Combine with watch for real-time observation:
```bash
watch -n 2 podman compose ps
```

This section explains how to remove everything.
```bash
# stop all containers
podman compose down

# (optional) remove the compose network if it still exists
# check the network name first; typically 'monitoring_monitoring-net'
podman network ls | grep monitoring || true
podman network rm monitoring_monitoring-net 2>/dev/null || true

# show volumes
podman volume ls | grep monitoring_
local monitoring_prometheus-data
local monitoring_loki-wal
local monitoring_tempo-wal
local monitoring_minio-data
local monitoring_grafana-data
local monitoring_keep-db-data
local monitoring_keep-state

# one-shot removal of any remaining project volumes
podman volume rm $(podman volume ls -q | grep '^monitoring_') 2>/dev/null || true

# remove certificates
sudo rm /etc/pki/ca-trust/source/anchors/my-local-ca.*
sudo update-ca-trust extract

# disable podman socket
systemctl --user disable --now podman.socket

# remove rootless ports configuration file
sudo rm /etc/sysctl.d/99-rootless-ports.conf
# reset the runtime sysctl to the default privileged port start (1024)
sudo sysctl -w net.ipv4.ip_unprivileged_port_start=1024

# remove images
for I in $(grep image: compose.yml | awk '{print $2}' | sed -r 's/:.+$//'); do echo $I; for ID in $(podman images | grep $I | awk '{print $3}'); do podman rmi $ID; done; done

# (optional) prune any stopped containers, unused networks, and images
# This impacts your whole Podman host, not just this project.
podman system prune -a -f

# remove monitoring repo
rm -rf path-to-your-repo/monitoring
```

Notes:
- If your browser trusted the local CA, restart the browser to ensure trust store changes take effect.
- The compose network is usually removed by `podman compose down`, but the explicit removal ensures a clean state.



