Monitoring & Observability
Container metrics, log aggregation, real-time streaming, notifications, and system health.
Vardo ships a full observability stack as part of the production Docker Compose profile. Container metrics come from cAdvisor, get stored in Redis TimeSeries, and stream to the UI over Server-Sent Events. Logs flow through Promtail into Loki. No external services required.
Architecture
Three services run automatically in the production Compose profile:
| Service | Image | Default port | Role |
|---|---|---|---|
| vardo-cadvisor | gcr.io/cadvisor/cadvisor:latest | 7300 | Container resource metrics |
| vardo-loki | grafana/loki:3.4 | 7400 | Log aggregation and storage |
| vardo-promtail | grafana/promtail:3.4 | internal | Log collection from Docker |
These communicate over the internal Docker network and aren't exposed publicly by default.
Container metrics
For each running container labeled host.managed=true, Vardo collects:
| Metric | Description |
|---|---|
| CPU % | Container CPU utilization |
| Memory usage | Bytes used by the container |
| Memory limit | Container memory cap |
| Network Rx | Bytes received |
| Network Tx | Bytes transmitted |
| Disk writes | Bytes written to disk |
Disk usage (total and per-project) is collected separately via df and Docker system df.
Collection schedule
The metrics collector uses a two-phase schedule:
- Warmup (first 20 ticks) — every 5 seconds. Populates time series quickly so charts aren't empty on first load.
- Normal (after warmup) — every 30 seconds. Steady-state collection.
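The two-phase schedule amounts to choosing an interval per tick. A minimal sketch, assuming the constants above (the function and constant names are illustrative, not Vardo's actual code):

```typescript
// Two-phase collection schedule: fast warmup, then steady state.
const WARMUP_TICKS = 20;
const WARMUP_INTERVAL_MS = 5_000;   // every 5s during warmup
const NORMAL_INTERVAL_MS = 30_000;  // every 30s afterwards

function intervalForTick(tick: number): number {
  return tick < WARMUP_TICKS ? WARMUP_INTERVAL_MS : NORMAL_INTERVAL_MS;
}

// Drive collection with setTimeout rather than setInterval, so the
// delay can change once warmup completes.
function startCollector(collect: () => void, tick = 0): void {
  collect();
  setTimeout(() => startCollector(collect, tick + 1), intervalForTick(tick));
}
```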
Storage
Metrics are stored in Redis TimeSeries (TS.ADD, TS.RANGE). Each metric key gets:
- 7-day retention (168h)
- Duplicate policy — LAST (latest value wins on timestamp collision)
- Labels — project, container, metric, organization for cross-series queries
Keys are created lazily on first write. An in-process Set avoids redundant TS.CREATE calls.
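The lazy-creation pattern can be sketched as follows. This is a hedged sketch, not Vardo's implementation: `TsClient` is a hypothetical thin wrapper over a Redis connection, and the key/label shapes are assumptions.

```typescript
// Lazily create a Redis TimeSeries key on first write, guarding with an
// in-process Set so TS.CREATE is issued at most once per key.
// `TsClient` is a hypothetical minimal Redis command interface.
type TsClient = { send: (...args: string[]) => Promise<unknown> };

const RETENTION_MS = 7 * 24 * 60 * 60 * 1000; // 168h retention

const createdKeys = new Set<string>();

async function writeMetric(
  client: TsClient,
  key: string,
  value: number,
  labels: Record<string, string>,
): Promise<void> {
  if (!createdKeys.has(key)) {
    const labelArgs = Object.entries(labels).flat();
    // DUPLICATE_POLICY LAST: latest value wins on timestamp collision.
    await client.send(
      "TS.CREATE", key,
      "RETENTION", String(RETENTION_MS),
      "DUPLICATE_POLICY", "LAST",
      "LABELS", ...labelArgs,
    );
    createdKeys.add(key);
  }
  // "*" lets Redis stamp the sample with the current server time.
  await client.send("TS.ADD", key, "*", String(value));
}
```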
Disk write alerts
Every 6th tick after warmup (~3 minutes at normal interval), the collector compares recent disk write rates against thresholds and emits disk-write-alert notifications when exceeded. You can set the threshold per-app in the app settings dialog.
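The rate comparison itself is simple: take two samples of the cumulative disk-write counter and divide the delta by elapsed time. A sketch, assuming bytes-per-second units (field and function names are illustrative):

```typescript
// Compare a container's recent disk-write rate against its configured threshold.
interface WriteSample {
  timestampMs: number;
  diskWriteBytes: number; // cumulative bytes written
}

function diskWriteRate(prev: WriteSample, curr: WriteSample): number {
  const seconds = (curr.timestampMs - prev.timestampMs) / 1000;
  if (seconds <= 0) return 0;
  return (curr.diskWriteBytes - prev.diskWriteBytes) / seconds;
}

function shouldAlert(prev: WriteSample, curr: WriteSample, thresholdBytesPerSec: number): boolean {
  return diskWriteRate(prev, curr) > thresholdBytesPerSec;
}
```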
Real-time streaming (SSE)
Vardo pushes live metrics to the browser over Server-Sent Events — no polling needed.
Endpoints
| Endpoint | Scope |
|---|---|
| /api/v1/organizations/[orgId]/stats/stream | All containers in an org |
| /api/v1/organizations/[orgId]/projects/[projectId]/stats/stream | All containers in a project |
| /api/v1/organizations/[orgId]/apps/[appId]/stats/stream | Single app's containers |
Broadcast pattern
A shared broadcast loop prevents redundant cAdvisor polls when multiple browser tabs are open:
- First SSE subscriber starts the polling loop (5-second interval).
- Each new subscriber gets the latest cached snapshot on connect, then live updates.
- When the last subscriber disconnects, polling stops.
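The steps above are a reference-counted broadcast loop. A self-contained sketch of the pattern (class and type names are illustrative, not Vardo's actual code):

```typescript
// One polling loop shared by all SSE subscribers; starts with the first
// subscriber, stops with the last, and caches the latest snapshot.
type Snapshot = Record<string, number>;

class Broadcaster {
  private subscribers = new Set<(s: Snapshot) => void>();
  private timer: ReturnType<typeof setInterval> | null = null;
  private latest: Snapshot | null = null;

  constructor(
    private poll: () => Promise<Snapshot>,
    private intervalMs = 5_000,
  ) {}

  // Returns an unsubscribe function.
  subscribe(onSnapshot: (s: Snapshot) => void): () => void {
    this.subscribers.add(onSnapshot);
    // New subscribers get the cached snapshot immediately on connect.
    if (this.latest) onSnapshot(this.latest);
    // First subscriber starts the polling loop.
    if (this.timer === null) {
      this.timer = setInterval(async () => {
        this.latest = await this.poll();
        for (const cb of this.subscribers) cb(this.latest);
      }, this.intervalMs);
    }
    return () => {
      this.subscribers.delete(onSnapshot);
      // Last subscriber stops polling.
      if (this.subscribers.size === 0 && this.timer !== null) {
        clearInterval(this.timer);
        this.timer = null;
      }
    };
  }

  getLatestSnapshot(): Snapshot | null {
    return this.latest;
  }
}
```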
Snapshot cache
getLatestSnapshot() returns the most recent metrics without waiting for the next poll. The health API uses this to serve fast responses (~20ms) without hitting cAdvisor on every request.
Historical metrics
Historical data is queryable at three scopes:
| Endpoint | Scope |
|---|---|
| /api/v1/organizations/[orgId]/stats/history | Org-level |
| /api/v1/organizations/[orgId]/projects/[projectId]/stats/history | Project-level |
| /api/v1/organizations/[orgId]/apps/[appId]/stats/history | App-level |
Results are bucketed and aggregated from Redis TimeSeries with configurable bucket sizes (default 5 minutes).
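Bucketing means snapping each sample to a fixed window start and aggregating per window. A sketch of the idea, assuming average aggregation (Redis TimeSeries does this server-side via TS.RANGE's AGGREGATION option; this client-side version is purely illustrative):

```typescript
// Bucket raw samples into fixed windows and average each bucket.
interface Sample { timestampMs: number; value: number; }
interface Bucket { bucketStartMs: number; avg: number; }

function bucketize(samples: Sample[], bucketMs = 5 * 60 * 1000): Bucket[] {
  const acc = new Map<number, { total: number; count: number }>();
  for (const s of samples) {
    // Snap the timestamp down to its bucket's start.
    const start = Math.floor(s.timestampMs / bucketMs) * bucketMs;
    const entry = acc.get(start) ?? { total: 0, count: 0 };
    entry.total += s.value;
    entry.count += 1;
    acc.set(start, entry);
  }
  return [...acc.entries()]
    .sort(([a], [b]) => a - b)
    .map(([bucketStartMs, { total, count }]) => ({ bucketStartMs, avg: total / count }));
}
```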
Vardo also tracks business metrics per org — deploy counts, success rates, backup totals — in separate TimeSeries keys. These show up on the admin metrics tab.
Log aggregation
How logs flow
- Promtail mounts /var/run/docker.sock and /var/lib/docker/containers read-only.
- It discovers containers via Docker socket service discovery, refreshing every 5 seconds.
- Only containers with the label host.managed=true are scraped.
- Logs ship to Loki at http://loki:3100/loki/api/v1/push.
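A Promtail configuration implementing this flow looks roughly like the fragment below. This is an illustrative sketch, not the shipped config/promtail.yml, which may differ in detail:

```yaml
# Illustrative Promtail config fragment — not Vardo's shipped file.
clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
        # Scrape only Vardo-managed containers.
        filters:
          - name: label
            values: ["host.managed=true"]
    relabel_configs:
      # Map Docker labels onto Loki stream labels (dots become underscores
      # in the discovered __meta_docker_container_label_* names).
      - source_labels: ["__meta_docker_container_label_host_project"]
        target_label: project
      - source_labels: ["__meta_docker_container_label_host_environment"]
        target_label: environment
      - source_labels: ["__meta_docker_container_name"]
        regex: "/(.*)"
        target_label: container
```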
Labels attached to logs
Promtail extracts these Docker labels and attaches them as Loki stream labels:
| Loki label | Docker label | Description |
|---|---|---|
| project | host.project | Vardo project name |
| project_id | host.project_id | Vardo project UUID |
| environment | host.environment | Environment (production, staging) |
| service | com.docker.compose.service | Compose service name |
| container | container name | Docker container name |
These labels make it possible to query logs by project, environment, or container.
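For example, a LogQL selector over these stream labels (the project and environment values here are placeholders):

```logql
{project="my-app", environment="production"} |= "error"
```

The `|=` filter narrows the matched streams to lines containing the literal string "error".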
Loki configuration
Key settings from config/loki.yml:
| Setting | Value | Notes |
|---|---|---|
| Retention | 168h (7 days) | Matches metrics retention |
| Max ingestion rate | 10 MB/s (burst 20 MB/s) | Per instance |
| Max query series | 500 | Prevents runaway queries |
| Storage | Filesystem (TSDB schema v13) | Local volume loki_data |
| Memory limit | 512 MB | Docker container cap |
Log streaming
/api/v1/organizations/[orgId]/apps/[appId]/logs/stream streams container logs to the browser over SSE by querying Loki in real time. The LOKI_URL environment variable configures the connection (http://loki:3100 in production).
When Loki isn't available, Vardo falls back to reading logs directly from Docker.
System health
GET /api/health returns a fast health snapshot:
- CPU usage (aggregated across all containers)
- Memory usage (vs system total)
- Disk usage (via df -B1 /var/lib/docker)
- Per-resource status (ok, warning, critical)
Thresholds:
| Resource | Warning | Critical |
|---|---|---|
| CPU | 80% | 95% |
| Memory | 85% | 95% |
| Disk | 80% | 90% |
The health endpoint reads from the in-memory metrics snapshot, so it responds in ~20ms.
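Mapping a usage percentage to a status is a straightforward threshold check. A sketch using the table's values (whether the boundaries are inclusive is an assumption; names are illustrative):

```typescript
// Classify a resource's usage percentage against the documented thresholds.
type Status = "ok" | "warning" | "critical";
type Resource = "cpu" | "memory" | "disk";

const THRESHOLDS: Record<Resource, { warning: number; critical: number }> = {
  cpu: { warning: 80, critical: 95 },
  memory: { warning: 85, critical: 95 },
  disk: { warning: 80, critical: 90 },
};

function statusFor(resource: Resource, usagePercent: number): Status {
  const t = THRESHOLDS[resource];
  if (usagePercent >= t.critical) return "critical";
  if (usagePercent >= t.warning) return "warning";
  return "ok";
}
```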
System alerts
Vardo runs a background monitor that checks system health every 60 seconds. It watches for:
| Alert type | Trigger |
|---|---|
| system-alert-service | A monitored service went down |
| system-alert-disk | System disk usage is critical |
| system-alert-restart | A container restarted unexpectedly |
| system-alert-cert | TLS certificate expiry warning |
| system-alert-update | Vardo update available |
Alerts are dispatched through the notification system to any configured channels.
Notifications
Vardo's notification system delivers event alerts to configured channels. Channels are set up per organization under Settings > Notifications.
Event types
| Event | Trigger |
|---|---|
| deploy-success | App deployment completed |
| deploy-failed | App deployment failed |
| backup-success | Backup job completed |
| backup-failed | One or more backup volumes failed |
| cron-failed | Scheduled cron job failed |
| volume-drift | Volume content drifted from baseline |
| disk-write-alert | Container disk write rate exceeded threshold |
| auto-rollback | Automatic rollback triggered after failed deploy |
| invitation-sent | User invited to organization |
| invitation-accepted | Invitation accepted |
| system-alert-service | Monitored service down |
| system-alert-disk | System disk usage critical |
| system-alert-restart | Container restarted unexpectedly |
| system-alert-cert | TLS certificate expiry warning |
| system-alert-update | Vardo update available |
| weekly-digest | Weekly summary of deploys, backups and cron failures |
Channels
| Channel | Description |
|---|---|
| email | Email via the configured provider (SMTP, Mailpace, Resend, Postmark) |
| webhook | HTTP POST to a URL with optional HMAC secret signing |
| slack | Slack incoming webhook |
Each channel can subscribe to all events or a specific subset.
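Webhook HMAC signing typically works as sketched below: sign the request body with a shared secret and let the receiver recompute and compare. This is a generic HMAC-SHA256 sketch, not Vardo's exact scheme — the encoding and header conventions are assumptions:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Sign a webhook payload with an HMAC-SHA256 secret (hex-encoded).
function signPayload(secret: string, body: string): string {
  return createHmac("sha256", secret).update(body).digest("hex");
}

// Receiver side: recompute the signature and compare in constant time
// to avoid leaking information through timing differences.
function verifySignature(secret: string, body: string, signature: string): boolean {
  const expected = signPayload(secret, body);
  if (expected.length !== signature.length) return false;
  return timingSafeEqual(Buffer.from(expected), Buffer.from(signature));
}
```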
How dispatch works
When an event fires:
- Active channels for the org are loaded from the database.
- Each channel's subscribed events filter is checked.
- Matching channels receive the event via their transport.
- Failures are enqueued for retry with backoff.
- All deliveries (success and failure) are logged to the notification_logs table.
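The filtering step above can be sketched as follows (a sketch only — the `Channel` shape and the convention that an empty subscription list means "all events" are assumptions):

```typescript
// Select the org's channels that should receive a given event.
interface Channel {
  type: "email" | "webhook" | "slack";
  active: boolean;
  // null/undefined is assumed to mean "subscribed to all events".
  subscribedEvents?: string[] | null;
}

function channelsForEvent(channels: Channel[], event: string): Channel[] {
  return channels.filter(
    (c) => c.active && (!c.subscribedEvents || c.subscribedEvents.includes(event)),
  );
}
```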
Weekly digest
A scheduled job assembles a weekly summary — deploy counts (succeeded/failed), backup counts, cron failures and disk write alerts. The digest goes to all channels subscribed to weekly-digest.
Grafana integration
Grafana isn't included in the Docker Compose stack, but you can connect it to the bundled Loki instance for advanced log analysis. Add a Loki data source pointing to http://<your-server>:7400 (or http://loki:3100 if Grafana is on the same Docker network).
Don't expose Loki on a public port without authentication. Use Grafana's built-in auth or put Traefik in front of the Loki port.
Setup
If you installed Vardo using install.sh, monitoring is already running. Verify with:
```shell
docker ps | grep -E '(cadvisor|loki|promtail)'
```

If you installed manually without the production profile:

```shell
cd /opt/vardo
COMPOSE_PROFILES=production docker compose up -d cadvisor loki promtail
```

Confirm cAdvisor is collecting data:

```shell
curl http://localhost:7300/api/v1.3/containers/ | jq '.subcontainers | length'
```

Confirm Loki is receiving logs:

```shell
curl 'http://localhost:7400/loki/api/v1/labels' | jq .
```