Insights — anomaly detection
/insights shows metrics that are statistically outside their normal distribution. It is not machine learning and we don’t claim it is anywhere — it’s a z-score over a 7-day baseline.
Method
For every (agent, metric) in {CPU%, memory%, network total bps}:
    baseline = AVG(metric) ± STDDEV_POP(metric) over [NOW-7d, NOW-1d]
    current  = AVG(metric) over [NOW-15min, NOW]
    z        = (current - baseline_mean) / baseline_stddev

A metric is flagged anomalous when |z| > 2.5. If the metric is normally distributed, that corresponds to a false-positive rate of about 1.2% (roughly 0.6% in each tail). In practice the rate is higher because metrics are rarely truly Gaussian, but it is good enough to surface the real outliers.
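The calculation above can be sketched in Python. This is a minimal illustration, not the product's implementation: the function names `z_score`, `is_anomalous` and `two_sided_tail` are hypothetical, the samples are assumed to be already fetched as plain lists of floats, and returning `None` for a zero-variance baseline is an assumption the text does not cover.

```python
import math
from statistics import fmean, pstdev

Z_THRESHOLD = 2.5  # threshold stated in the text


def z_score(baseline, current):
    """Return (mean(current) - mean(baseline)) / population stddev(baseline)."""
    mean = fmean(baseline)
    stddev = pstdev(baseline)  # population stddev, the analogue of STDDEV_POP
    if stddev == 0:
        # Assumption: a perfectly flat baseline yields no score
        # instead of a division by zero.
        return None
    return (fmean(current) - mean) / stddev


def is_anomalous(z, threshold=Z_THRESHOLD):
    """Flag when |z| exceeds the threshold; unscored metrics are never flagged."""
    return z is not None and abs(z) > threshold


def two_sided_tail(threshold):
    """P(|Z| > threshold) for a standard normal variable Z."""
    return math.erfc(threshold / math.sqrt(2))


baseline = [10.0, 12.0, 11.0, 9.0, 10.0, 12.0, 11.0, 9.0]  # made-up CPU% samples
current = [20.0, 22.0]                                     # made-up last 15 minutes
z = z_score(baseline, current)
print(is_anomalous(z), round(two_sided_tail(Z_THRESHOLD), 4))
```

`two_sided_tail(2.5)` evaluates to about 0.0124 because it counts both tails of the normal distribution, which is the case that matters when the flag is on |z| rather than on z.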
Only agents with ≥60 baseline samples (~1 day at 15s metrics) qualify. Otherwise a newly registered agent would always look “abnormal”.
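The sample-count guard can be pictured as a filter applied before any z-score is computed. The sketch below assumes baseline samples are grouped in a dictionary keyed by `(agent, metric)`; the function name `eligible` and the data layout are hypothetical.

```python
MIN_BASELINE_SAMPLES = 60  # minimum stated in the text


def eligible(baseline_by_key, min_samples=MIN_BASELINE_SAMPLES):
    """Keep only the (agent, metric) keys that have enough baseline samples."""
    return {
        key: samples
        for key, samples in baseline_by_key.items()
        if len(samples) >= min_samples
    }


baselines = {
    ("web-1", "cpu"): [12.0] * 60,  # enough history, gets scored
    ("new-1", "cpu"): [12.0] * 59,  # newly registered, skipped
}
print(list(eligible(baselines)))
```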
What you see in the UI
Four KPI cards at the top:
- Anomalies (live) — count of current flags
- Agents covered — agents online in the last hour
- Metrics (24h) — total datapoints analysed
- Method — “z-score, threshold |z| > 2.5”
Below the KPIs: table of top-50 anomalies, sorted by |z|. Columns: agent (clickable → agent detail), metric, current value, baseline mean, stddev, z-score, direction (↑ above / ↓ below).
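The table logic amounts to filtering on the threshold, sorting by |z| and truncating to 50 rows. The sketch below makes some assumptions: the `Anomaly` class and `top_anomalies` function are hypothetical names, the field names only mirror the columns listed above, and the sort is taken to be descending (largest deviation first).

```python
from dataclasses import dataclass


@dataclass
class Anomaly:
    agent: str
    metric: str
    current: float
    baseline_mean: float
    stddev: float
    z: float

    @property
    def direction(self):
        # Positive z means the current value is above the baseline mean.
        return "↑ above" if self.z > 0 else "↓ below"


def top_anomalies(rows, threshold=2.5, limit=50):
    """Return flagged rows ordered by |z|, largest first, capped at `limit`."""
    flagged = [row for row in rows if abs(row.z) > threshold]
    return sorted(flagged, key=lambda row: abs(row.z), reverse=True)[:limit]


rows = [
    Anomaly("web-1", "cpu", 85.0, 20.0, 15.0, 4.33),
    Anomaly("db-1", "memory", 10.0, 60.0, 8.0, -6.25),
    Anomaly("web-2", "cpu", 25.0, 20.0, 5.0, 1.0),  # below threshold, not shown
]
for row in top_anomalies(rows):
    print(row.agent, row.metric, row.z, row.direction)
```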
What this does WELL
- Sudden spikes where the system is 4× its normal usage
- Sudden dips (“why is this webserver’s CPU suddenly at 1%?”)
- Pattern breaks: deviations from a normal range you never had to configure explicitly
What this does NOT do
- Seasonal patterns — if your nightly batch pegs CPU at 80% every 02:00, that stays in the baseline and stops being flagged. Intentional.
- Cross-metric correlations: a multivariate ML model would do that; this method scores each metric on its own.
- Prediction — no “this server will fall over in 3 hours”. For predictive growth see Capacity planning.
How to use it operationally
- Check /insights once per shift as part of your status check
- Click through a row to see the agent and what’s going on
- If the deviation is expected (“nightly batch”) → working as intended, ignore
- If the deviation is unexpected → investigate
Not an alert replacement
Insights is passive observation, not a replacement for alert rules. An alert rule fires push notifications + audit log + webhooks; Insights asks you to look proactively. For pages that should wake you at night: use /alerts → Rules with severity=critical.