# Alert builder + Maintenance
/alerts has three views: Alerts (live incidents), Rules (rule builder), Maintenance (planned-downtime windows). The evaluator worker ticks every 60s and checks every enabled rule.
## Rules
A rule is `{target, metric, operator, threshold, duration, severity}`. The worker fires only when the condition has held continuously for ≥ `duration_secs` on a given (rule, agent) pair.
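As a hedged sketch, that tuple might map onto a struct like the following (field names and the JSON shape of the target filter are assumptions, mirrored from the `maintenance_windows` filter described below):

```go
package main

// AlertRule sketches the rule shape above. Anything beyond the six
// listed fields (e.g. the TargetFilter JSON shape) is an assumption.
type AlertRule struct {
	ID           int64
	TargetType   string         // "all" | "agents" | "tag"
	TargetFilter map[string]any // e.g. {"agent_ids": [...]} or {"tag": "production"}
	Metric       string         // "cpu_pct", "mem_pct", "agent_offline", ...
	Operator     string         // one of > >= < <= == !=
	Threshold    float64
	DurationSecs int    // 0 = fire immediately on breach
	Severity     string // e.g. "warning" | "critical"
	Enabled      bool
}
```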
### Targets
- `all`: every active agent in the tenant
- `agents`: selected agent IDs
- `tag`: agents with this tag (e.g. `production`)
### Metrics
Per agent:
| metric | source | unit |
|---|---|---|
| `cpu_pct` | latest `cpu_metrics.cpu_percent` | % |
| `mem_pct` | latest `memory_metrics.mem_used / mem_total` × 100 | % |
| `disk_pct` | MAX over `disk_metrics.used_bytes / total_bytes` × 100 per mount, last 5min | % |
| `load_avg` | latest `cpu_metrics.load1` | |
| `network_rx_mbps` / `network_tx_mbps` | SUM `network_metrics.{rx,tx}_bytes_per_sec` / 125000, last 90s | Mbps |
| `agent_offline` | `EPOCH(NOW - last_seen_at)` | sec |
| `container_status` | COUNT containers with status ≠ 'running' (or a specific `container_name`) | count |
| `process_running` | COUNT `process_samples` rows with `name=X`, last 2min | count |
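As a sketch, the derived per-agent values reduce to the helpers below (names are illustrative, not the hub's actual functions). The network divisor follows from 1 Mbps = 10⁶ bits/s = 125000 bytes/s:

```go
package main

import "fmt"

// memPct and bpsToMbps mirror the table's formulas.
func memPct(used, total uint64) float64 { return float64(used) / float64(total) * 100 }

func bpsToMbps(bytesPerSec float64) float64 { return bytesPerSec / 125000 }

// diskPct reports the fullest mount, matching the MAX-per-mount rule.
func diskPct(used, total map[string]uint64) float64 {
	worst := 0.0
	for mount, u := range used {
		if pct := float64(u) / float64(total[mount]) * 100; pct > worst {
			worst = pct
		}
	}
	return worst
}

func main() {
	fmt.Println(bpsToMbps(12_500_000)) // 12.5 MB/s on the wire = 100 Mbps
}
```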
Fleet (multi-agent aggregates):
| metric | source |
|---|---|
| `fleet_offline_count` | COUNT agents in target set with `last_seen_at < NOW - 90s` |
| `fleet_critical_alerts` | COUNT unresolved critical alerts in the last 15min |
| `tag_offline_pct` | (offline / total) × 100 within the target set |
Fleet rules emit one alert per cycle, attributed to the first agent in the target set.
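A minimal sketch of one fleet aggregate, assuming the same 90s offline cutoff as `fleet_offline_count` (the `Agent` type and function name are mine):

```go
package main

import "time"

// Agent is a minimal stand-in for the hub's agent row.
type Agent struct{ LastSeenAt time.Time }

// tagOfflinePct: the share of the target set not seen for 90s, as a percentage.
func tagOfflinePct(targetSet []Agent, now time.Time) float64 {
	if len(targetSet) == 0 {
		return 0
	}
	offline := 0
	for _, a := range targetSet {
		if now.Sub(a.LastSeenAt) > 90*time.Second {
			offline++
		}
	}
	return float64(offline) / float64(len(targetSet)) * 100
}
```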
### Operators
`>`, `>=`, `<`, `<=`, `==`, `!=`.
### Duration
`duration_secs=0` fires immediately on breach. A value > 0 requires the condition to hold for at least that many seconds continuously. The worker tracks this per (rule_id, agent_id) in the `alert_rules.pending_since` JSONB map; when the condition clears, the key is dropped from the map.
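In pseudocode terms, the gate looks roughly like this (an in-memory sketch; the real worker persists the map in `pending_since`):

```go
package main

import "time"

// pendingKey identifies one (rule, agent) pair.
type pendingKey struct{ ruleID, agentID int64 }

var pendingSince = map[pendingKey]time.Time{}

// shouldFire reports whether a breach has held long enough to alert.
func shouldFire(ruleID, agentID int64, breached bool, durationSecs int, now time.Time) bool {
	k := pendingKey{ruleID, agentID}
	if !breached {
		delete(pendingSince, k) // condition cleared: drop the key
		return false
	}
	if durationSecs == 0 {
		return true // fire immediately on breach
	}
	start, seen := pendingSince[k]
	if !seen {
		pendingSince[k] = now // first breached tick: start the clock
		return false
	}
	return now.Sub(start) >= time.Duration(durationSecs)*time.Second
}
```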
## Quick-start templates
The UI's + New rule modal offers 8 templates; clicking one populates the form:
- Sustained high CPU >85% for 1h
- Memory >95% for 5m (critical)
- Disk usage >90% (critical)
- Load avg >10 for 10m
- Agent offline >5m (critical)
- Network TX >100 Mbps for 5m
- Production container not running (target_type=tag, tag=production)
- Fleet incident ≥3 production agents offline (fleet rule)
## Maintenance windows
Time-bounded silencing of all alert sources (rules, process_dna, honeypot, heartbeat-silence) for a target set. Use case: a rolling deploy of your flows → open a 30min window → no pages.
```
maintenance_windows
  name                text not null
  starts_at           timestamptz
  ends_at             timestamptz
  target_type         'all' | 'agents' | 'tag'
  target_filter       jsonb     -- {agent_ids: [...]} | {tag: "..."}
  silence_categories  text[]    -- empty = silence everything
```

The important detail: every alert that falls within the window is suppressed, not just rule alerts. Every insertion goes through a central `InsertAlert(...)` helper in the hub, which checks it against `isUnderMaintenanceForCategory`.
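A hedged sketch of that choke point (the signatures are assumptions inferred from the names in the text):

```go
package main

import (
	"context"
	"time"
)

// Alert is a minimal stand-in for a row in alerts.
type Alert struct {
	AgentID  int64
	Category string // "rule" | "security" | "heartbeat" | "compliance" | "cve"
	Severity string
	Message  string
}

// isUnderMaintenanceForCategory would check active maintenance_windows
// rows: is now inside [starts_at, ends_at), is the agent in the target
// set, and is the category absent from silence_categories (or is the
// list empty, which silences everything)?
func isUnderMaintenanceForCategory(ctx context.Context, agentID int64, category string, now time.Time) bool {
	// ... SELECT against maintenance_windows ...
	return false
}

// InsertAlert is the single choke point every alert source calls, so
// suppression applies uniformly to rule, security, heartbeat,
// compliance and CVE alerts alike.
func InsertAlert(ctx context.Context, a Alert) error {
	if isUnderMaintenanceForCategory(ctx, a.AgentID, a.Category, time.Now()) {
		return nil // suppressed: dropped during the window
	}
	// ... INSERT INTO alerts ...
	return nil
}
```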
Categories you can whitelist per window in `silence_categories` (e.g. still let security alerts through during a routine upgrade):
- `rule`: alert builder triggers
- `security`: process DNA, honeypot
- `heartbeat`: agent offline
- `compliance`: control failures
- `cve`: new CVE matches

No `silence_categories` (an empty array) = silence everything.
## Notify channels
`notify_channels` on a rule is an array of webhook IDs or strings. On rule fire:
- Insert into `alerts` with `category='rule'`
- `NotifyWorker` picks it up within 5s and pushes to `ntfy` for `severity=critical`
- Webhook deliveries go out to every subscription subscribed to `alert.<severity>` or `alert.*`
Webhook payloads are HMAC-SHA256 signed with the subscription secret; the signature is sent in the `X-Monsys-Signature` header. Failed deliveries retry with exponential backoff up to 1h (tracked in `webhook_outbox`).
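For consumers, verifying the signature looks roughly like this (hex encoding of the header value is an assumption; confirm against a real delivery):

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
)

// verifySignature recomputes HMAC-SHA256 over the raw request body and
// compares it to the X-Monsys-Signature header in constant time.
func verifySignature(secret, body []byte, header string) bool {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	expected := hex.EncodeToString(mac.Sum(nil))
	return hmac.Equal([]byte(expected), []byte(header))
}
```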
## Audit
Every POST/PUT/DELETE to `/api/v1/alert-rules*` and `*/maintenance-windows*` writes to `audit_log` with method, path, status, resource_id, and payload (secrets scrubbed via regex).
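A sketch of what regex-based scrubbing could look like; the hub's actual field names and patterns are not documented here:

```go
package main

import "regexp"

// secretKeys matches common secret-bearing JSON fields; the real list
// of key names is an assumption.
var secretKeys = regexp.MustCompile(`(?i)"(secret|token|password)"\s*:\s*"[^"]*"`)

// scrubPayload redacts matched values before the payload hits audit_log.
func scrubPayload(payload string) string {
	return secretKeys.ReplaceAllString(payload, `"$1":"[REDACTED]"`)
}
```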