Vardo

Monitoring & Observability

Container metrics, log aggregation, real-time streaming, notifications, and system health.

Vardo ships a full observability stack as part of the production Docker Compose profile. Container metrics come from cAdvisor, get stored in Redis TimeSeries, and stream to the UI over Server-Sent Events. Logs flow through Promtail into Loki. No external services required.

Architecture

Three services run automatically in the production Compose profile:

| Service | Image | Default port | Role |
| --- | --- | --- | --- |
| vardo-cadvisor | gcr.io/cadvisor/cadvisor:latest | 7300 | Container resource metrics |
| vardo-loki | grafana/loki:3.4 | 7400 | Log aggregation and storage |
| vardo-promtail | grafana/promtail:3.4 | internal | Log collection from Docker |

These communicate over the internal Docker network and aren't exposed publicly by default.

Container metrics

For each running container labeled host.managed=true, Vardo collects:

| Metric | Description |
| --- | --- |
| CPU % | Container CPU utilization |
| Memory usage | Bytes used by the container |
| Memory limit | Container memory cap |
| Network Rx | Bytes received |
| Network Tx | Bytes transmitted |
| Disk writes | Bytes written to disk |

Disk usage (total and per-project) is collected separately via df and docker system df.

Collection schedule

The metrics collector uses a two-phase schedule:

  • Warmup (first 20 ticks) — every 5 seconds. Populates time series quickly so charts aren't empty on first load.
  • Normal (after warmup) — every 30 seconds. Steady-state collection.
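The two-phase schedule can be sketched as a small function that picks the delay for each tick. This is a minimal illustration, not Vardo's actual collector: the function names and the 0-based tick counter are assumptions, while the tick count and intervals come from the list above.

```python
WARMUP_TICKS = 20        # warmup length, from the schedule above
WARMUP_INTERVAL_S = 5    # seconds between warmup ticks
NORMAL_INTERVAL_S = 30   # seconds between steady-state ticks


def interval_for_tick(tick: int) -> int:
    """Return the delay in seconds before the next collection.

    `tick` is a 0-based count of collections already performed
    (an assumption; the source does not say how ticks are numbered).
    """
    return WARMUP_INTERVAL_S if tick < WARMUP_TICKS else NORMAL_INTERVAL_S


def warmup_duration_s() -> int:
    """Total time spent in the warmup phase."""
    return WARMUP_TICKS * WARMUP_INTERVAL_S
```

Under these numbers the warmup phase lasts about 100 seconds (20 ticks at 5 seconds each) before the collector settles into the 30-second interval.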

Storage

Metrics are stored in Redis TimeSeries (TS.ADD, TS.RANGE). Each metric key gets:

  • Retention: 7 days (168h)
  • Duplicate policy: LAST (latest value wins on timestamp collision)
  • Labels: project, container, metric, and organization, for cross-series queries

Keys are created lazily on first write. An in-process Set avoids redundant TS.CREATE calls.
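The lazy-creation pattern might look roughly like the sketch below. It assumes a client object exposing `execute_command(*args)`, as redis-py does; the `MetricWriter` class, its method names, and the key format are hypothetical and not taken from Vardo's source.

```python
RETENTION_MS = 7 * 24 * 60 * 60 * 1000  # 168h expressed in milliseconds


class MetricWriter:
    """Writes samples to Redis TimeSeries, creating each key on first use."""

    def __init__(self, client) -> None:
        # `client` is anything with execute_command(*args), e.g. redis.Redis.
        self._client = client
        self._known_keys = set()  # in-process cache of keys already created

    def write(self, key: str, timestamp_ms: int, value: float, labels: dict) -> None:
        if key not in self._known_keys:
            label_args = []
            for name, label_value in labels.items():
                label_args += [name, label_value]
            # Note: TS.CREATE returns an error if the key already exists
            # (for example after a process restart); a real collector
            # would need to handle that case.
            self._client.execute_command(
                "TS.CREATE", key,
                "RETENTION", RETENTION_MS,
                "DUPLICATE_POLICY", "LAST",
                "LABELS", *label_args,
            )
            self._known_keys.add(key)
        self._client.execute_command("TS.ADD", key, timestamp_ms, value)
```

Because the set lives in process memory, it is empty after a restart, so the first write per key after startup attempts TS.CREATE again.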

Disk write alerts

Every 6th tick after warmup (~3 minutes at normal interval), the collector compares recent disk write rates against thresholds and emits disk-write-alert notifications when exceeded. You can set the threshold per-app in the app settings dialog.
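As a rough sketch of the check cadence and rate comparison: the helper names, the 1-based tick counter, and the bytes-per-second unit are assumptions made for illustration. Only the "every 6th tick after a 20-tick warmup" rule comes from the text above.

```python
WARMUP_TICKS = 20   # from the collection schedule
CHECK_EVERY = 6     # alert check runs on every 6th post-warmup tick


def is_alert_check_tick(tick: int) -> bool:
    """True on every 6th tick once warmup is over (`tick` is 1-based)."""
    if tick <= WARMUP_TICKS:
        return False
    return (tick - WARMUP_TICKS) % CHECK_EVERY == 0


def write_rate(prev_bytes: int, curr_bytes: int, elapsed_s: float) -> float:
    """Bytes written per second between two cumulative counter readings."""
    if elapsed_s <= 0:
        return 0.0
    # Clamp at zero so a counter reset (e.g. container restart) is not
    # reported as a negative rate.
    return max(curr_bytes - prev_bytes, 0) / elapsed_s


def exceeds_threshold(rate_bytes_per_s: float, threshold_bytes_per_s: float) -> bool:
    """True when the observed rate is above the per-app threshold."""
    return rate_bytes_per_s > threshold_bytes_per_s
```

With a 30-second normal interval, six ticks span 180 seconds, which matches the roughly three-minute cadence described above.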

Real-time streaming (SSE)

Vardo pushes live metrics to the browser over Server-Sent Events — no polling needed.

Endpoints

| Endpoint | Scope |
| --- | --- |
| /api/v1/organizations/[orgId]/stats/stream | All containers in an org |
| /api/v1/organizations/[orgId]/projects/[projectId]/stats/stream | All containers in a project |
| /api/v1/organizations/[orgId]/apps/[appId]/stats/stream | Single app's containers |
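Outside the browser, any SSE-capable client can consume these endpoints. The sketch below parses the standard `data:` framing and decodes each event as JSON. The JSON payload format and the need for authentication headers are assumptions; the source does not document either.

```python
import json
import urllib.request
from typing import Iterable, Iterator, Optional


def parse_sse(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete event from SSE text lines."""
    data = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield "\n".join(data)
                data = []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("data:"):
            value = line[5:]
            # The spec strips a single leading space after the colon.
            data.append(value[1:] if value.startswith(" ") else value)


def stream_stats(url: str, headers: Optional[dict] = None) -> Iterator[dict]:
    """Connect to a stats stream and yield decoded events.

    Assumes each event carries a JSON document; pass any required
    auth headers via `headers`.
    """
    request = urllib.request.Request(
        url, headers={"Accept": "text/event-stream", **(headers or {})}
    )
    with urllib.request.urlopen(request) as response:
        text_lines = (chunk.decode("utf-8") for chunk in response)
        for payload in parse_sse(text_lines):
            yield json.loads(payload)
```

An event that is still incomplete when the connection closes (no terminating blank line) is dropped, which matches how the SSE specification treats unterminated events.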

Broadcast pattern

A shared broadcast loop prevents redundant cAdvisor polls when multiple browser tabs are open:

  1. First SSE subscriber starts the polling loop (5-second interval).
  2. Each new subscriber gets the latest cached snapshot on connect, then live updates.
  3. When the last subscriber disconnects, polling stops.
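The three steps above can be sketched as a small broadcaster class. This is an illustrative model only: `StatsBroadcaster`, `tick()`, and the snapshot shape are hypothetical, and `tick()` stands in for whatever timer drives the real 5-second loop.

```python
from typing import Callable, Dict, List, Optional

Snapshot = Dict[str, float]
Listener = Callable[[Snapshot], None]


class StatsBroadcaster:
    """Shares one polling loop among all connected subscribers."""

    def __init__(self, poll: Callable[[], Snapshot]) -> None:
        self._poll = poll                       # e.g. a function that queries cAdvisor
        self._listeners: List[Listener] = []
        self._latest: Optional[Snapshot] = None
        self.polling = False

    def subscribe(self, listener: Listener) -> None:
        if not self._listeners:
            self.polling = True                 # first subscriber starts the loop
        self._listeners.append(listener)
        if self._latest is not None:
            listener(self._latest)              # replay cached snapshot on connect

    def unsubscribe(self, listener: Listener) -> None:
        self._listeners.remove(listener)
        if not self._listeners:
            self.polling = False                # last subscriber stops the loop

    def tick(self) -> None:
        """One loop iteration; a timer would call this every 5 seconds."""
        if not self.polling:
            return
        self._latest = self._poll()
        for listener in list(self._listeners):
            listener(self._latest)

    def get_latest_snapshot(self) -> Optional[Snapshot]:
        """Return the cached snapshot without triggering a poll."""
        return self._latest
```

However many subscribers are attached, each tick calls the poll function once, so opening extra browser tabs does not multiply the load on cAdvisor.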

Snapshot cache

getLatestSnapshot() returns the most recent metrics without waiting for the next poll. The health API uses this to serve fast responses (~20ms) without hitting cAdvisor on every request.

Historical metrics

Historical data is queryable at three scopes:

| Endpoint | Scope |
| --- | --- |
| /api/v1/organizations/[orgId]/stats/history | Org-level |
| /api/v1/organizations/[orgId]/projects/[projectId]/stats/history | Project-level |
| /api/v1/organizations/[orgId]/apps/[appId]/stats/history | App-level |

Results are bucketed and aggregated from Redis TimeSeries with configurable bucket sizes (default 5 minutes).
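To make the bucketing concrete, here is a pure-Python sketch of what averaging into fixed-width buckets does. Redis TimeSeries performs this kind of aggregation server-side through the AGGREGATION option of TS.RANGE. Whether Vardo averages, sums, or takes the maximum per bucket is not stated, so the use of an average here is an assumption.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

DEFAULT_BUCKET_MS = 5 * 60 * 1000  # 5-minute default bucket size


def bucket_average(
    samples: Iterable[Tuple[int, float]],
    bucket_ms: int = DEFAULT_BUCKET_MS,
) -> List[Tuple[int, float]]:
    """Average (timestamp_ms, value) samples into fixed-width buckets.

    Each result is (bucket_start_ms, mean_value), sorted by bucket start.
    """
    sums = defaultdict(lambda: [0.0, 0])  # bucket start -> [total, count]
    for timestamp_ms, value in samples:
        start = timestamp_ms - (timestamp_ms % bucket_ms)
        sums[start][0] += value
        sums[start][1] += 1
    return [(start, total / count) for start, (total, count) in sorted(sums.items())]
```

For example, two samples at 0 ms and 60,000 ms fall into the same 5-minute bucket and are averaged, while a sample at 300,000 ms starts the next bucket.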

Vardo also tracks business metrics per org — deploy counts, success rates, backup totals — in separate TimeSeries keys. These show up on the admin metrics tab.

Log aggregation

How logs flow

  1. Promtail mounts /var/run/docker.sock and /var/lib/docker/containers read-only.
  2. It discovers containers via Docker socket service discovery, refreshing every 5 seconds.
  3. Only containers with the label host.managed=true are scraped.
  4. Logs ship to Loki at http://loki:3100/loki/api/v1/push.

Labels attached to logs

Promtail extracts these Docker labels and attaches them as Loki stream labels:

| Loki label | Docker label | Description |
| --- | --- | --- |
| project | host.project | Vardo project name |
| project_id | host.project_id | Vardo project UUID |
| environment | host.environment | Environment (production, staging) |
| service | com.docker.compose.service | Compose service name |
| container | container name | Docker container name |

These labels make it possible to query logs by project, environment or container.
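As an illustration of querying by these labels, the sketch below builds a LogQL stream selector and a URL for Loki's `/loki/api/v1/query_range` endpoint. The helper names and the example label values are hypothetical; the base URL is the in-network Loki address used elsewhere on this page.

```python
from urllib.parse import urlencode


def build_selector(labels: dict) -> str:
    """Build a LogQL stream selector such as {project="blog"}."""
    parts = []
    for name, value in sorted(labels.items()):
        # Escape backslashes and double quotes inside label values.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        parts.append(f'{name}="{escaped}"')
    return "{" + ", ".join(parts) + "}"


def query_range_url(
    base_url: str, labels: dict, start_ns: int, end_ns: int, limit: int = 100
) -> str:
    """Return a query_range URL; start and end are Unix epoch nanoseconds."""
    params = urlencode(
        {
            "query": build_selector(labels),
            "start": start_ns,
            "end": end_ns,
            "limit": limit,
        }
    )
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{params}"
```

A call such as `query_range_url("http://loki:3100", {"project": "blog", "environment": "production"}, start_ns, end_ns)` produces a URL that can be fetched with any HTTP client.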

Loki configuration

Key settings from config/loki.yml:

| Setting | Value | Notes |
| --- | --- | --- |
| Retention | 168h (7 days) | Matches metrics retention |
| Max ingestion rate | 10 MB/s (burst 20 MB/s) | Per instance |
| Max query series | 500 | Prevents runaway queries |
| Storage | Filesystem (TSDB schema v13) | Local volume loki_data |
| Memory limit | 512 MB | Docker container cap |

Log streaming

The /api/v1/organizations/[orgId]/apps/[appId]/logs/stream endpoint streams container logs to the browser over SSE by querying Loki in real time. Set the LOKI_URL environment variable to configure the connection (http://loki:3100 in production).

When Loki isn't available, Vardo falls back to reading logs directly from Docker.

System health

GET /api/health returns a fast health snapshot:

  • CPU usage (aggregated across all containers)
  • Memory usage (vs system total)
  • Disk usage (via df -B1 /var/lib/docker)
  • Per-resource status (ok, warning, critical)

Thresholds:

| Resource | Warning | Critical |
| --- | --- | --- |
| CPU | 80% | 95% |
| Memory | 85% | 95% |
| Disk | 80% | 90% |

The health endpoint reads from the in-memory metrics snapshot, so it responds in ~20ms.
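The thresholds listed above map naturally onto a small classification function. This sketch assumes a value exactly at a threshold counts as reaching it (inclusive comparison), which the source does not specify; the function and constant names are illustrative.

```python
# (warning, critical) thresholds in percent, from the table above
THRESHOLDS = {
    "cpu": (80.0, 95.0),
    "memory": (85.0, 95.0),
    "disk": (80.0, 90.0),
}


def classify(resource: str, percent: float) -> str:
    """Return "ok", "warning", or "critical" for a usage percentage."""
    warning, critical = THRESHOLDS[resource]
    if percent >= critical:
        return "critical"
    if percent >= warning:
        return "warning"
    return "ok"
```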

System alerts

Vardo runs a background monitor that checks system health every 60 seconds. It watches for:

| Alert type | Trigger |
| --- | --- |
| system-alert-service | A monitored service went down |
| system-alert-disk | System disk usage is critical |
| system-alert-restart | A container restarted unexpectedly |
| system-alert-cert | TLS certificate expiry warning |
| system-alert-update | Vardo update available |

Alerts are dispatched through the notification system to any configured channels.

Notifications

Vardo's notification system delivers event alerts to configured channels. Channels are set up per organization under Settings > Notifications.

Event types

| Event | Trigger |
| --- | --- |
| deploy-success | App deployment completed |
| deploy-failed | App deployment failed |
| backup-success | Backup job completed |
| backup-failed | One or more backup volumes failed |
| cron-failed | Scheduled cron job failed |
| volume-drift | Volume content drifted from baseline |
| disk-write-alert | Container disk write rate exceeded threshold |
| auto-rollback | Automatic rollback triggered after failed deploy |
| invitation-sent | User invited to organization |
| invitation-accepted | Invitation accepted |
| system-alert-service | Monitored service down |
| system-alert-disk | System disk usage critical |
| system-alert-restart | Container restarted unexpectedly |
| system-alert-cert | TLS certificate expiry warning |
| system-alert-update | Vardo update available |
| weekly-digest | Weekly summary of deploys, backups and cron failures |

Channels

| Channel | Description |
| --- | --- |
| email | Email via the configured provider (SMTP, Mailpace, Resend, Postmark) |
| webhook | HTTP POST to a URL with optional HMAC secret signing |
| slack | Slack incoming webhook |

Each channel can subscribe to all events or a specific subset.
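For webhook receivers, HMAC signing generally means recomputing a keyed hash over the raw request body and comparing it with the signature sent alongside the request. The sketch below shows that general technique only. The hash algorithm (SHA-256), the hex encoding, and how the signature is transmitted are all assumptions here; check the webhook channel settings for the actual scheme.

```python
import hashlib
import hmac


def sign_payload(secret: str, body: bytes) -> str:
    """Compute a hex HMAC-SHA256 signature over the raw request body."""
    return hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()


def verify_signature(secret: str, body: bytes, received_signature: str) -> bool:
    """Check a received signature using a constant-time comparison."""
    expected = sign_payload(secret, body)
    return hmac.compare_digest(expected, received_signature)
```

Verification should run against the exact bytes received, before any JSON parsing or re-serialization, since even whitespace changes alter the digest.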

How dispatch works

When an event fires:

  1. Active channels for the org are loaded from the database.
  2. Each channel's subscribed events filter is checked.
  3. Matching channels receive the event via their transport.
  4. Failures are enqueued for retry with backoff.
  5. All deliveries (success and failure) are logged to the notification_logs table.
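The dispatch steps above can be modeled in a few lines. This is a sketch under stated assumptions: the `Channel` dataclass, the in-memory retry queue and log lists, and the exponential backoff parameters (30-second base, one-hour cap) are all hypothetical stand-ins for the database-backed pieces described in the list.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set


@dataclass
class Channel:
    name: str
    send: Callable[[str, dict], None]   # transport: email, webhook, slack, ...
    events: Optional[Set[str]] = None   # None means "subscribed to all events"
    active: bool = True


def backoff_delay_s(attempt: int, base_s: int = 30, cap_s: int = 3600) -> int:
    """Exponential backoff: base * 2^attempt, capped at cap_s seconds."""
    return min(base_s * 2 ** attempt, cap_s)


def dispatch(event: str, payload: dict, channels: List[Channel],
             retry_queue: list, log: list) -> None:
    """Deliver an event to every active, subscribed channel."""
    for channel in channels:
        if not channel.active:
            continue
        if channel.events is not None and event not in channel.events:
            continue
        try:
            channel.send(event, payload)
            log.append((channel.name, event, "success"))
        except Exception:
            # Failed deliveries are queued for a later retry and still logged.
            retry_queue.append((channel.name, event, backoff_delay_s(0)))
            log.append((channel.name, event, "failure"))
```

Catching the exception per channel keeps one failing transport from blocking delivery to the others.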

Weekly digest

A scheduled job assembles a weekly summary — deploy counts (succeeded/failed), backup counts, cron failures and disk write alerts. The digest goes to all channels subscribed to weekly-digest.

Grafana integration

Grafana isn't included in the Docker Compose stack, but you can connect it to the bundled Loki instance for advanced log analysis. Add a Loki data source pointing to http://<your-server>:7400 (or http://loki:3100 if Grafana is on the same Docker network).

Don't expose Loki on a public port without authentication. Use Grafana's built-in auth or put Traefik in front of the Loki port.

Setup

If you installed Vardo using install.sh, monitoring is already running. Verify with:

    docker ps | grep -E '(cadvisor|loki|promtail)'

If you installed manually without the production profile:

    cd /opt/vardo
    COMPOSE_PROFILES=production docker compose up -d cadvisor loki promtail

Confirm cAdvisor is collecting data:

    curl -s http://localhost:7300/api/v1.3/containers/ | jq '.subcontainers | length'

Confirm Loki is receiving logs:

    curl -s 'http://localhost:7400/loki/api/v1/labels' | jq .

Troubleshooting
