Monitoring & Observability
Container metrics, log aggregation, real-time streaming, notifications, and system health.
Vardo ships a full observability stack as part of the production Docker Compose profile. Container metrics come from cAdvisor, get stored in Redis TimeSeries, and stream to the UI over Server-Sent Events. Logs flow through Promtail into Loki. No external services required.
Architecture
Three services run automatically in the production Compose profile:
| Service | Image | Default port | Role |
|---|---|---|---|
| vardo-cadvisor | gcr.io/cadvisor/cadvisor:latest | 7300 | Container resource metrics |
| vardo-loki | grafana/loki:3.4 | 7400 | Log aggregation and storage |
| vardo-promtail | grafana/promtail:3.4 | internal | Log collection from Docker |
These communicate over the internal Docker network and aren't exposed publicly by default.
Container metrics
For each running container labeled host.managed=true, Vardo collects:
| Metric | Description |
|---|---|
| CPU % | Container CPU utilization |
| Memory usage | Bytes used by the container |
| Memory limit | Container memory cap |
| Network Rx | Bytes received |
| Network Tx | Bytes transmitted |
| Disk writes | Bytes written to disk |
Disk usage (total and per-project) is collected separately via df and Docker system df.
Collection schedule
The metrics collector uses a two-phase schedule:
- Warmup (first 20 ticks) — every 5 seconds. Populates time series quickly so charts aren't empty on first load.
- Normal (after warmup) — every 30 seconds. Steady-state collection.
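The two-phase schedule amounts to choosing an interval per tick. A minimal sketch, assuming the constants above (the function and constant names are illustrative, not Vardo's actual code):

```typescript
// Two-phase collection schedule: fast warmup, then steady state.
const WARMUP_TICKS = 20;
const WARMUP_INTERVAL_MS = 5_000;   // every 5s during warmup
const NORMAL_INTERVAL_MS = 30_000;  // every 30s afterwards

function intervalForTick(tick: number): number {
  return tick < WARMUP_TICKS ? WARMUP_INTERVAL_MS : NORMAL_INTERVAL_MS;
}

// Drive collection with setTimeout rather than setInterval, so the
// delay can change once warmup completes.
function startCollector(collect: () => void, tick = 0): void {
  collect();
  setTimeout(() => startCollector(collect, tick + 1), intervalForTick(tick));
}
```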
Storage
Metrics are stored in Redis TimeSeries (TS.ADD, TS.RANGE). Each metric key gets:
- 7-day retention (168h)
- Duplicate policy — LAST (latest value wins on timestamp collision)
- Labels — project, container, metric, organization for cross-series queries
Keys are created lazily on first write. An in-process Set avoids redundant TS.CREATE calls.
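The lazy-creation pattern can be sketched as follows. This is a hedged sketch, not Vardo's implementation: `TsClient` is a hypothetical thin wrapper over a Redis connection, and the key/label shapes are assumptions.

```typescript
// Lazily create a Redis TimeSeries key on first write, guarding with an
// in-process Set so TS.CREATE is issued at most once per key.
// `TsClient` is a hypothetical minimal Redis command interface.
type TsClient = { send: (...args: string[]) => Promise<unknown> };

const RETENTION_MS = 7 * 24 * 60 * 60 * 1000; // 168h retention

const createdKeys = new Set<string>();

async function writeMetric(
  client: TsClient,
  key: string,
  value: number,
  labels: Record<string, string>,
): Promise<void> {
  if (!createdKeys.has(key)) {
    const labelArgs = Object.entries(labels).flat();
    // DUPLICATE_POLICY LAST: latest value wins on timestamp collision.
    await client.send(
      "TS.CREATE", key,
      "RETENTION", String(RETENTION_MS),
      "DUPLICATE_POLICY", "LAST",
      "LABELS", ...labelArgs,
    );
    createdKeys.add(key);
  }
  // "*" lets Redis stamp the sample with the current server time.
  await client.send("TS.ADD", key, "*", String(value));
}
```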
Disk write alerts
Every 6th tick after warmup (~3 minutes at normal interval), the collector compares recent disk write rates against thresholds and emits disk-write-alert notifications when exceeded. You can set the threshold per-app in the app settings dialog.
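The rate comparison itself is simple: take two samples of the cumulative disk-write counter and divide the delta by elapsed time. A sketch, assuming bytes-per-second units (field and function names are illustrative):

```typescript
// Compare a container's recent disk-write rate against its configured threshold.
interface WriteSample {
  timestampMs: number;
  diskWriteBytes: number; // cumulative bytes written
}

function diskWriteRate(prev: WriteSample, curr: WriteSample): number {
  const seconds = (curr.timestampMs - prev.timestampMs) / 1000;
  if (seconds <= 0) return 0;
  return (curr.diskWriteBytes - prev.diskWriteBytes) / seconds;
}

function shouldAlert(prev: WriteSample, curr: WriteSample, thresholdBytesPerSec: number): boolean {
  return diskWriteRate(prev, curr) > thresholdBytesPerSec;
}
```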
Real-time streaming (SSE)
Vardo pushes live metrics to the browser over Server-Sent Events — no polling needed.
Endpoints
| Endpoint | Scope |
|---|---|
| /api/v1/organizations/[orgId]/stats/stream | All containers in an org |
| /api/v1/organizations/[orgId]/projects/[projectId]/stats/stream | All containers in a project |
| /api/v1/organizations/[orgId]/apps/[appId]/stats/stream | Single app's containers |
Broadcast pattern
A shared broadcast loop prevents redundant cAdvisor polls when multiple browser tabs are open:
- First SSE subscriber starts the polling loop (5-second interval).
- Each new subscriber gets the latest cached snapshot on connect, then live updates.
- When the last subscriber disconnects, polling stops.
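The steps above are a reference-counted broadcast loop. A self-contained sketch of the pattern (class and type names are illustrative, not Vardo's actual code):

```typescript
// One polling loop shared by all SSE subscribers; starts with the first
// subscriber, stops with the last, and caches the latest snapshot.
type Snapshot = Record<string, number>;

class Broadcaster {
  private subscribers = new Set<(s: Snapshot) => void>();
  private timer: ReturnType<typeof setInterval> | null = null;
  private latest: Snapshot | null = null;

  constructor(
    private poll: () => Promise<Snapshot>,
    private intervalMs = 5_000,
  ) {}

  // Returns an unsubscribe function.
  subscribe(onSnapshot: (s: Snapshot) => void): () => void {
    this.subscribers.add(onSnapshot);
    // New subscribers get the cached snapshot immediately on connect.
    if (this.latest) onSnapshot(this.latest);
    // First subscriber starts the polling loop.
    if (this.timer === null) {
      this.timer = setInterval(async () => {
        this.latest = await this.poll();
        for (const cb of this.subscribers) cb(this.latest);
      }, this.intervalMs);
    }
    return () => {
      this.subscribers.delete(onSnapshot);
      // Last subscriber stops polling.
      if (this.subscribers.size === 0 && this.timer !== null) {
        clearInterval(this.timer);
        this.timer = null;
      }
    };
  }

  getLatestSnapshot(): Snapshot | null {
    return this.latest;
  }
}
```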
Snapshot cache
getLatestSnapshot() returns the most recent metrics without waiting for the next poll. The health API uses this to serve fast responses (~20ms) without hitting cAdvisor on every request.
Historical metrics
Historical data is queryable at three scopes:
| Endpoint | Scope |
|---|---|
| /api/v1/organizations/[orgId]/stats/history | Org-level |
| /api/v1/organizations/[orgId]/projects/[projectId]/stats/history | Project-level |
| /api/v1/organizations/[orgId]/apps/[appId]/stats/history | App-level |
Results are bucketed and aggregated from Redis TimeSeries with configurable bucket sizes (default 5 minutes).
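Bucketing means snapping each sample to a fixed window start and aggregating per window. A sketch of the idea, assuming average aggregation (Redis TimeSeries does this server-side via TS.RANGE's AGGREGATION option; this client-side version is purely illustrative):

```typescript
// Bucket raw samples into fixed windows and average each bucket.
interface Sample { timestampMs: number; value: number; }
interface Bucket { bucketStartMs: number; avg: number; }

function bucketize(samples: Sample[], bucketMs = 5 * 60 * 1000): Bucket[] {
  const acc = new Map<number, { total: number; count: number }>();
  for (const s of samples) {
    // Snap the timestamp down to its bucket's start.
    const start = Math.floor(s.timestampMs / bucketMs) * bucketMs;
    const entry = acc.get(start) ?? { total: 0, count: 0 };
    entry.total += s.value;
    entry.count += 1;
    acc.set(start, entry);
  }
  return [...acc.entries()]
    .sort(([a], [b]) => a - b)
    .map(([bucketStartMs, { total, count }]) => ({ bucketStartMs, avg: total / count }));
}
```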
Vardo also tracks business metrics per org — deploy counts, success rates, backup totals — in separate TimeSeries keys. These show up on the admin metrics tab.
Log aggregation
How logs flow
- Promtail mounts /var/run/docker.sock and /var/lib/docker/containers read-only.
- It discovers containers via Docker socket service discovery, refreshing every 5 seconds.
- Only containers with the label host.managed=true are scraped.
- Logs ship to Loki at http://loki:3100/loki/api/v1/push.
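A Promtail configuration implementing this flow looks roughly like the fragment below. This is an illustrative sketch, not the shipped config/promtail.yml, which may differ in detail:

```yaml
# Illustrative Promtail config fragment — not Vardo's shipped file.
clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
        # Scrape only Vardo-managed containers.
        filters:
          - name: label
            values: ["host.managed=true"]
    relabel_configs:
      # Map Docker labels onto Loki stream labels (dots become underscores
      # in the discovered __meta_docker_container_label_* names).
      - source_labels: ["__meta_docker_container_label_host_project"]
        target_label: project
      - source_labels: ["__meta_docker_container_label_host_environment"]
        target_label: environment
      - source_labels: ["__meta_docker_container_name"]
        regex: "/(.*)"
        target_label: container
```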
Labels attached to logs
Promtail extracts these Docker labels and attaches them as Loki stream labels:
| Loki label | Docker label | Description |
|---|---|---|
| project | host.project | Vardo project name |
| project_id | host.project_id | Vardo project UUID |
| environment | host.environment | Environment (production, staging) |
| service | com.docker.compose.service | Compose service name |
| container | container name | Docker container name |
These labels make it possible to query logs by project, environment, or container.
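For example, a LogQL selector over these stream labels (the project and environment values here are placeholders):

```logql
{project="my-app", environment="production"} |= "error"
```

The `|=` filter narrows the matched streams to lines containing the literal string "error".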
Loki configuration
Key settings from config/loki.yml:
| Setting | Value | Notes |
|---|---|---|
| Retention | 168h (7 days) | Matches metrics retention |
| Max ingestion rate | 10 MB/s (burst 20 MB/s) | Per instance |
| Max query series | 500 | Prevents runaway queries |
| Storage | Filesystem (TSDB schema v13) | Local volume loki_data |
| Memory limit | 512 MB | Docker container cap |
Log streaming
/api/v1/organizations/[orgId]/apps/[appId]/logs/stream streams container logs to the browser over SSE by querying Loki in real time. The LOKI_URL environment variable configures the connection (http://loki:3100 in production).
When Loki isn't available, Vardo falls back to reading logs directly from Docker.
System health
GET /api/health returns a fast health snapshot:
- CPU usage (aggregated across all containers)
- Memory usage (vs system total)
- Disk usage (via df -B1 /var/lib/docker)
- Per-resource status (ok, warning, critical)
Thresholds:
| Resource | Warning | Critical |
|---|---|---|
| CPU | 80% | 95% |
| Memory | 85% | 95% |
| Disk | 80% | 90% |
The health endpoint reads from the in-memory metrics snapshot, so it responds in ~20ms.
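Mapping a usage percentage to a status is a straightforward threshold check. A sketch using the table's values (whether the boundaries are inclusive is an assumption; names are illustrative):

```typescript
// Classify a resource's usage percentage against the documented thresholds.
type Status = "ok" | "warning" | "critical";
type Resource = "cpu" | "memory" | "disk";

const THRESHOLDS: Record<Resource, { warning: number; critical: number }> = {
  cpu: { warning: 80, critical: 95 },
  memory: { warning: 85, critical: 95 },
  disk: { warning: 80, critical: 90 },
};

function statusFor(resource: Resource, usagePercent: number): Status {
  const t = THRESHOLDS[resource];
  if (usagePercent >= t.critical) return "critical";
  if (usagePercent >= t.warning) return "warning";
  return "ok";
}
```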
System alerts
Vardo runs a background monitor that checks system health every 60 seconds. It watches for:
| Alert type | Trigger |
|---|---|
| system-alert-service | A monitored service went down |
| system-alert-disk | System disk usage is critical |
| system-alert-restart | A container restarted unexpectedly |
| system-alert-cert | TLS certificate expiry warning |
| system-alert-update | Vardo update available |
Alerts are dispatched through the notification system to any configured channels.
Notifications
Vardo's notification system delivers event alerts to configured channels. Channels are set up per organization under Settings > Notifications.
Event types
| Event | Trigger |
|---|---|
| deploy-success | App deployment completed |
| deploy-failed | App deployment failed |
| backup-success | Backup job completed |
| backup-failed | One or more backup volumes failed |
| cron-failed | Scheduled cron job failed |
| volume-drift | Volume content drifted from baseline |
| disk-write-alert | Container disk write rate exceeded threshold |
| auto-rollback | Automatic rollback triggered after failed deploy |
| invitation-sent | User invited to organization |
| invitation-accepted | Invitation accepted |
| system-alert-service | Monitored service down |
| system-alert-disk | System disk usage critical |
| system-alert-restart | Container restarted unexpectedly |
| system-alert-cert | TLS certificate expiry warning |
| system-alert-update | Vardo update available |
| weekly-digest | Weekly summary of deploys, backups and cron failures |
Channels
| Channel | Description |
|---|---|
| email | Email via the configured provider (SMTP, Mailpace, Resend, Postmark) |
| webhook | HTTP POST to a URL with optional HMAC secret signing |
| slack | Slack incoming webhook |
Each channel can subscribe to all events or a specific subset.
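Webhook HMAC signing typically works as sketched below: sign the request body with a shared secret and let the receiver recompute and compare. This is a generic HMAC-SHA256 sketch, not Vardo's exact scheme — the encoding and header conventions are assumptions:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Sign a webhook payload with an HMAC-SHA256 secret (hex-encoded).
function signPayload(secret: string, body: string): string {
  return createHmac("sha256", secret).update(body).digest("hex");
}

// Receiver side: recompute the signature and compare in constant time
// to avoid leaking information through timing differences.
function verifySignature(secret: string, body: string, signature: string): boolean {
  const expected = signPayload(secret, body);
  if (expected.length !== signature.length) return false;
  return timingSafeEqual(Buffer.from(expected), Buffer.from(signature));
}
```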
How dispatch works
When an event fires:
- Active channels for the org are loaded from the database.
- Each channel's subscribed events filter is checked.
- Matching channels receive the event via their transport.
- Failures are enqueued for retry with backoff.
- All deliveries (success and failure) are logged to the notification_logs table.
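The filtering step above can be sketched as follows (a sketch only — the `Channel` shape and the convention that an empty subscription list means "all events" are assumptions):

```typescript
// Select the org's channels that should receive a given event.
interface Channel {
  type: "email" | "webhook" | "slack";
  active: boolean;
  // null/undefined is assumed to mean "subscribed to all events".
  subscribedEvents?: string[] | null;
}

function channelsForEvent(channels: Channel[], event: string): Channel[] {
  return channels.filter(
    (c) => c.active && (!c.subscribedEvents || c.subscribedEvents.includes(event)),
  );
}
```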
Weekly digest
A scheduled job assembles a weekly summary — deploy counts (succeeded/failed), backup counts, cron failures and disk write alerts. The digest goes to all channels subscribed to weekly-digest.
Grafana integration
Grafana isn't included in the Docker Compose stack, but you can connect it to the bundled Loki instance for advanced log analysis. Add a Loki data source pointing to http://<your-server>:7400 (or http://loki:3100 if Grafana is on the same Docker network).
Don't expose Loki on a public port without authentication. Use Grafana's built-in auth or put Traefik in front of the Loki port.
Setup
If you installed Vardo using install.sh, monitoring is already running. Verify with:
```shell
docker ps | grep -E '(cadvisor|loki|promtail)'
```

If you installed manually without the production profile:

```shell
cd /opt/vardo
COMPOSE_PROFILES=production docker compose up -d cadvisor loki promtail
```

Confirm cAdvisor is collecting data:

```shell
curl http://localhost:7300/api/v1.3/containers/ | jq '.subcontainers | length'
```

Confirm Loki is receiving logs:

```shell
curl 'http://localhost:7400/loki/api/v1/labels' | jq .
```