Insights — anomaly detection

/insights shows metrics that are statistically outside their normal range. It is not machine learning, and nowhere do we claim it is: it’s a z-score against a baseline drawn from the past 7 days.

Method

For every (agent, metric) pair, with metric in {CPU%, memory%, network total bps}:

baseline_mean = AVG(metric) over [NOW-7d, NOW-1d]
baseline_stddev = STDDEV_POP(metric) over [NOW-7d, NOW-1d]
current = AVG(metric) over [NOW-15min, NOW]
z = (current - baseline_mean) / baseline_stddev
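
As a minimal sketch of the calculation above, assuming the samples for one (agent, metric) pair are already available as plain lists of floats. The function name z_score, the zero-stddev guard, and the sample values are illustrative assumptions, not taken from the product:

```python
from statistics import fmean, pstdev

def z_score(baseline_samples, current_samples):
    """Return (z, baseline_mean, baseline_stddev) for one (agent, metric) pair."""
    baseline_mean = fmean(baseline_samples)      # AVG over [NOW-7d, NOW-1d]
    baseline_stddev = pstdev(baseline_samples)   # STDDEV_POP over the same window
    current = fmean(current_samples)             # AVG over [NOW-15min, NOW]
    if baseline_stddev == 0:
        # Assumption: a perfectly flat baseline yields no usable z-score.
        return None, baseline_mean, baseline_stddev
    z = (current - baseline_mean) / baseline_stddev
    return z, baseline_mean, baseline_stddev

baseline = [20.0, 22.0, 18.0, 21.0, 19.0] * 12   # 60 hypothetical CPU% samples
current = [45.0, 47.0, 46.0]                     # hypothetical last 15 minutes
z, mean, stddev = z_score(baseline, current)
print(f"mean={mean:.1f} stddev={stddev:.2f} z={z:.2f}")
```

pstdev is the population standard deviation, which matches STDDEV_POP rather than the sample variant.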

A metric is flagged anomalous when |z| > 2.5. If the metric were normally distributed, that would correspond to a false-positive rate of about 1.2% (about 0.6% in each tail). In practice the FP rate is higher because metrics are rarely truly Gaussian, but it’s good enough to surface the real outliers.
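
For reference, the Gaussian tail probability behind a z threshold can be computed with the standard library alone. This is a sketch of the textbook formula, not product code, and the function names are made up for illustration. For a threshold of 2.5 it gives roughly 0.62% per tail and roughly 1.24% for both tails combined:

```python
import math

def one_sided_tail(z_threshold):
    """P(Z > z_threshold) for a standard normal Z."""
    return 0.5 * math.erfc(z_threshold / math.sqrt(2))

def two_sided_tail(z_threshold):
    """P(|Z| > z_threshold) for a standard normal Z."""
    return math.erfc(z_threshold / math.sqrt(2))

p_one = one_sided_tail(2.5)
p_two = two_sided_tail(2.5)
print(f"one tail: {p_one:.2%}, both tails: {p_two:.2%}")
```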

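Only agents with ≥60 baseline samples qualify. At 15s metrics that is only 15 minutes of data, but because the baseline window ends at NOW-1d, an agent first qualifies about a day after it registers. Without this rule a newly registered agent would always look “abnormal”.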
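
A sketch of the qualification check, assuming samples arrive as (timestamp, value) pairs and that “baseline samples” means samples inside the baseline window [NOW-7d, NOW-1d]. The helper names and the two example agents are hypothetical:

```python
from datetime import datetime, timedelta, timezone

MIN_BASELINE_SAMPLES = 60  # threshold from the rule above

def baseline_window(samples, now):
    """Keep (timestamp, value) pairs that fall inside [now-7d, now-1d]."""
    start, end = now - timedelta(days=7), now - timedelta(days=1)
    return [(ts, v) for ts, v in samples if start <= ts <= end]

def qualifies(samples, now):
    """True when the agent has enough samples inside the baseline window."""
    return len(baseline_window(samples, now)) >= MIN_BASELINE_SAMPLES

now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
# Hypothetical agent registered 2 hours ago, reporting every 15 s.
new_agent = [(now - timedelta(seconds=15 * i), 30.0) for i in range(480)]
# Hypothetical agent that has been reporting every 15 s for 2 days.
old_agent = [(now - timedelta(seconds=15 * i), 30.0) for i in range(2 * 24 * 240)]
print(qualifies(new_agent, now), qualifies(old_agent, now))
```

The new agent has 480 samples, but none of them are older than one day, so it does not qualify yet.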

What you see in the UI

Four KPI cards at the top:

  • Anomalies (live) — count of current flags
  • Agents covered — agents online in the last hour
  • Metrics (24h) — total datapoints analysed
  • Method — “z-score, threshold |z| > 2.5”

Below the KPIs: table of top-50 anomalies, sorted by |z|. Columns: agent (clickable → agent detail), metric, current value, baseline mean, stddev, z-score, direction (↑ above / ↓ below).
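
The ranking behind that table could look roughly like the sketch below. The Anomaly class, its field names, and the example rows are hypothetical, and sorting in descending order of |z| is an assumption:

```python
from dataclasses import dataclass

@dataclass
class Anomaly:
    agent: str
    metric: str
    current: float
    baseline_mean: float
    stddev: float
    z: float

    @property
    def direction(self):
        # Positive z means the current value is above the baseline mean.
        return "↑ above" if self.z > 0 else "↓ below"

def top_anomalies(rows, threshold=2.5, limit=50):
    """Keep rows with |z| > threshold, largest |z| first, capped at limit."""
    flagged = [r for r in rows if abs(r.z) > threshold]
    return sorted(flagged, key=lambda r: abs(r.z), reverse=True)[:limit]

rows = [
    Anomaly("web-01", "cpu", 92.0, 30.0, 10.0, 6.2),
    Anomaly("db-02", "memory", 41.0, 40.0, 5.0, 0.2),
    Anomaly("web-03", "cpu", 1.0, 35.0, 8.0, -4.25),
]
result = top_anomalies(rows)
for r in result:
    print(r.agent, r.metric, f"{r.z:+.2f}", r.direction)
```

Sorting on the absolute value keeps strong dips next to strong spikes instead of pushing negative z-scores to the bottom.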

What this does WELL

  • Sudden spikes where the system is 4× its normal usage
  • Sudden dips (“why is this webserver’s CPU suddenly at 1%?”)
  • Pattern breaks: deviations from a normal range you never had to configure explicitly

What this does NOT do

  • Seasonal patterns: there is no time-of-day model. If your nightly batch pegs CPU at 80% every 02:00, those samples are part of the baseline and widen it, so a recurring peak can stop being flagged. Intentional.
  • Cross-metric correlations: each metric is evaluated on its own. A cross-metric ML model would catch correlations; this method doesn’t.
  • Prediction: no “this server will fall over in 3 hours”. For predictive growth see Capacity planning.

How to use it operationally

  1. Check /insights once per shift as part of your status check
  2. Click through a row to see the agent and what’s going on
  3. If the deviation is expected (“nightly batch”) → working as intended, ignore
  4. If the deviation is unexpected → investigate

Not an alert replacement

Insights is passive observation, not a replacement for alert rules. An alert rule fires push notifications, audit log entries, and webhooks; Insights only helps if you look at it proactively. For pages that should wake you at night: use /alerts → Rules with severity=critical.