# Alert builder + Maintenance
/alerts has three views: Alerts (live incidents), Rules (rule builder), Maintenance (planned-downtime windows). The evaluator worker ticks every 60s and checks every enabled rule.
## Rules
A rule is `{target, metric, operator, threshold, duration, severity}`. The worker fires only when the condition has held continuously for ≥ `duration_secs` on a given (rule, agent) pair.
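As a hedged sketch, that tuple might map onto a struct like the following (field names and the JSON shape of the target filter are assumptions, mirrored from the `maintenance_windows` filter described below):

```go
package main

// AlertRule sketches the rule shape above. Anything beyond the six
// listed fields (e.g. the TargetFilter JSON shape) is an assumption.
type AlertRule struct {
	ID           int64
	TargetType   string         // "all" | "agents" | "tag"
	TargetFilter map[string]any // e.g. {"agent_ids": [...]} or {"tag": "production"}
	Metric       string         // "cpu_pct", "mem_pct", "agent_offline", ...
	Operator     string         // one of > >= < <= == !=
	Threshold    float64
	DurationSecs int    // 0 = fire immediately on breach
	Severity     string // e.g. "warning" | "critical"
	Enabled      bool
}
```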
### Targets
- `all`: every active agent in the tenant
- `agents`: selected agent IDs
- `tag`: agents with this tag (e.g. `production`)
### Metrics
Per agent:
| metric | source | unit |
|---|---|---|
| `cpu_pct` | latest `cpu_metrics.cpu_percent` | % |
| `mem_pct` | latest `memory_metrics.mem_used / mem_total` × 100 | % |
| `disk_pct` | MAX over `disk_metrics.used_bytes / total_bytes` × 100 per mount, last 5min | % |
| `load_avg` | latest `cpu_metrics.load1` | |
| `network_rx_mbps` / `network_tx_mbps` | SUM `network_metrics.{rx,tx}_bytes_per_sec` / 125000, last 90s | Mbps |
| `agent_offline` | `EPOCH(NOW - last_seen_at)` | sec |
| `container_status` | COUNT containers with status ≠ 'running' (or a specific `container_name`) | count |
| `process_running` | COUNT `process_samples` rows with `name=X`, last 2min | count |
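As a sketch, the derived per-agent values reduce to the helpers below (names are illustrative, not the hub's actual functions). The network divisor follows from 1 Mbps = 10⁶ bits/s = 125000 bytes/s:

```go
package main

import "fmt"

// memPct and bpsToMbps mirror the table's formulas.
func memPct(used, total uint64) float64 { return float64(used) / float64(total) * 100 }

func bpsToMbps(bytesPerSec float64) float64 { return bytesPerSec / 125000 }

// diskPct reports the fullest mount, matching the MAX-per-mount rule.
func diskPct(used, total map[string]uint64) float64 {
	worst := 0.0
	for mount, u := range used {
		if pct := float64(u) / float64(total[mount]) * 100; pct > worst {
			worst = pct
		}
	}
	return worst
}

func main() {
	fmt.Println(bpsToMbps(12_500_000)) // 12.5 MB/s on the wire = 100 Mbps
}
```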
Fleet (multi-agent aggregates):
| metric | source |
|---|---|
| `fleet_offline_count` | COUNT agents in target set with `last_seen_at < NOW - 90s` |
| `fleet_critical_alerts` | COUNT unresolved critical alerts in the last 15min |
| `tag_offline_pct` | (offline / total) × 100 within the target set |
Fleet rules emit one alert per cycle, attributed to the first agent in the target set.
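A minimal sketch of one fleet aggregate, assuming the same 90s offline cutoff as `fleet_offline_count` (the `Agent` type and function name are mine):

```go
package main

import "time"

// Agent is a minimal stand-in for the hub's agent row.
type Agent struct{ LastSeenAt time.Time }

// tagOfflinePct: the share of the target set not seen for 90s, as a percentage.
func tagOfflinePct(targetSet []Agent, now time.Time) float64 {
	if len(targetSet) == 0 {
		return 0
	}
	offline := 0
	for _, a := range targetSet {
		if now.Sub(a.LastSeenAt) > 90*time.Second {
			offline++
		}
	}
	return float64(offline) / float64(len(targetSet)) * 100
}
```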
### Operators
`>`, `>=`, `<`, `<=`, `==`, `!=`.
### Duration
`duration_secs=0` fires immediately on breach. A value > 0 requires the condition to hold for at least that many seconds continuously. The worker tracks this per (rule_id, agent_id) in the `alert_rules.pending_since` JSONB map; when the condition clears, the key is dropped from the map.
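In pseudocode terms, the gate looks roughly like this (an in-memory sketch; the real worker persists the map in `pending_since`):

```go
package main

import "time"

// pendingKey identifies one (rule, agent) pair.
type pendingKey struct{ ruleID, agentID int64 }

var pendingSince = map[pendingKey]time.Time{}

// shouldFire reports whether a breach has held long enough to alert.
func shouldFire(ruleID, agentID int64, breached bool, durationSecs int, now time.Time) bool {
	k := pendingKey{ruleID, agentID}
	if !breached {
		delete(pendingSince, k) // condition cleared: drop the key
		return false
	}
	if durationSecs == 0 {
		return true // fire immediately on breach
	}
	start, seen := pendingSince[k]
	if !seen {
		pendingSince[k] = now // first breached tick: start the clock
		return false
	}
	return now.Sub(start) >= time.Duration(durationSecs)*time.Second
}
```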
## Quick-start templates
The UI's + New rule modal offers 8 templates; clicking one populates the form:
- Sustained high CPU >85% for 1h
- Memory >95% for 5m (critical)
- Disk usage >90% (critical)
- Load avg >10 for 10m
- Agent offline >5m (critical)
- Network TX >100 Mbps for 5m
- Production container not running (target_type=tag, tag=production)
- Fleet incident ≥3 production agents offline (fleet rule)
## Maintenance windows
Time-bounded silencing of all alert sources (rules, process_dna, honeypot, heartbeat-silence) for a target set. Use case: a rolling deploy of your flows → open a 30min window → no pages.
```
maintenance_windows
  name                text not null
  starts_at           timestamptz
  ends_at             timestamptz
  target_type         'all' | 'agents' | 'tag'
  target_filter       jsonb     -- {agent_ids: [...]} | {tag: "..."}
  silence_categories  text[]    -- empty = silence everything
```

The important detail: every alert that falls within the window is suppressed, not just rule alerts. Every insertion goes through a central `InsertAlert(...)` helper in the hub, which checks it against `isUnderMaintenanceForCategory`.
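A hedged sketch of that choke point (the signatures are assumptions inferred from the names in the text):

```go
package main

import (
	"context"
	"time"
)

// Alert is a minimal stand-in for a row in alerts.
type Alert struct {
	AgentID  int64
	Category string // "rule" | "security" | "heartbeat" | "compliance" | "cve"
	Severity string
	Message  string
}

// isUnderMaintenanceForCategory would check active maintenance_windows
// rows: is now inside [starts_at, ends_at), is the agent in the target
// set, and is the category absent from silence_categories (or is the
// list empty, which silences everything)?
func isUnderMaintenanceForCategory(ctx context.Context, agentID int64, category string, now time.Time) bool {
	// ... SELECT against maintenance_windows ...
	return false
}

// InsertAlert is the single choke point every alert source calls, so
// suppression applies uniformly to rule, security, heartbeat,
// compliance and CVE alerts alike.
func InsertAlert(ctx context.Context, a Alert) error {
	if isUnderMaintenanceForCategory(ctx, a.AgentID, a.Category, time.Now()) {
		return nil // suppressed: dropped during the window
	}
	// ... INSERT INTO alerts ...
	return nil
}
```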
Categories you can whitelist per window in `silence_categories` (e.g. still let security alerts through during a routine upgrade):
- `rule`: alert builder triggers
- `security`: process DNA, honeypot
- `heartbeat`: agent offline
- `compliance`: control failures
- `cve`: new CVE matches

No `silence_categories` (an empty array) = silence everything.
## Notify channels
`notify_channels` on a rule is an array of webhook IDs or strings. On rule fire:
- Insert into `alerts` with `category='rule'`
- `NotifyWorker` picks it up within 5s and pushes to `ntfy` for `severity=critical`
- Webhook deliveries go out to every subscription subscribed to `alert.<severity>` or `alert.*`
Webhook payloads are HMAC-SHA256 signed with the subscription secret; the signature is sent in the `X-Monsys-Signature` header. Failed deliveries retry with exponential backoff up to 1h (tracked in `webhook_outbox`).
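For consumers, verifying the signature looks roughly like this (hex encoding of the header value is an assumption; confirm against a real delivery):

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
)

// verifySignature recomputes HMAC-SHA256 over the raw request body and
// compares it to the X-Monsys-Signature header in constant time.
func verifySignature(secret, body []byte, header string) bool {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	expected := hex.EncodeToString(mac.Sum(nil))
	return hmac.Equal([]byte(expected), []byte(header))
}
```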
## Audit
Every POST/PUT/DELETE to `/api/v1/alert-rules*` and `*/maintenance-windows*` writes to `audit_log` with method, path, status, resource_id, and payload (secrets scrubbed via regex).
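A sketch of what regex-based scrubbing could look like; the hub's actual field names and patterns are not documented here:

```go
package main

import "regexp"

// secretKeys matches common secret-bearing JSON fields; the real list
// of key names is an assumption.
var secretKeys = regexp.MustCompile(`(?i)"(secret|token|password)"\s*:\s*"[^"]*"`)

// scrubPayload redacts matched values before the payload hits audit_log.
func scrubPayload(payload string) string {
	return secretKeys.ReplaceAllString(payload, `"$1":"[REDACTED]"`)
}
```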