Agent Integrity Monitoring
The hub runs an IntegrityCheckWorker every 10 minutes over all active agents. It generates no extra network traffic to the agents: it is a pure server-side analysis of data that has already been ingested.
What is being checked
| Kind | Detection | Severity |
|---|---|---|
| unsigned_payload | Ingest without X-Monsys-Signature header (or agent without pinned pubkey) | medium / high |
| signature_invalid | Signature fails verification against pinned pubkey | critical |
| clock_drift | Average abs(captured_at − server_now) > 5 min over last 20 metrics | high |
| cadence_anomaly | Median gap between ingests outside [7s, 23s] (expected 15s) | medium |
| flat_metrics | CPU stddev < 0.001 over 30+ samples in 1 h | high |
| version_downgrade | agent_version semver-lower than previously seen | high |
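
As a rough illustration, the three statistical checks in the table (clock_drift, cadence_anomaly, flat_metrics) could be sketched in Python as below. This is not the worker's actual code: the function names and input shapes are hypothetical, and server_now is assumed to be the hub's receive time for each metric. Only the thresholds come from the table.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean, median, stdev

# Thresholds copied from the table above.
MAX_CLOCK_DRIFT_S = 5 * 60          # clock_drift: 5 min
CADENCE_RANGE_S = (7.0, 23.0)       # cadence_anomaly: allowed median gap
FLAT_STDDEV = 0.001                 # flat_metrics: stddev floor
FLAT_MIN_SAMPLES = 30               # flat_metrics: minimum sample count


def clock_drift(pairs):
    """pairs: list of (captured_at, received_at), oldest first (assumed shape)."""
    recent = pairs[-20:]  # only the last 20 metrics count
    if not recent:
        return False
    avg_s = mean(abs((rx - cap).total_seconds()) for cap, rx in recent)
    return avg_s > MAX_CLOCK_DRIFT_S


def cadence_anomaly(ingest_times):
    """ingest_times: sorted list of hub-side ingest timestamps."""
    gaps = [(b - a).total_seconds() for a, b in zip(ingest_times, ingest_times[1:])]
    if not gaps:
        return False
    lo, hi = CADENCE_RANGE_S
    return not (lo <= median(gaps) <= hi)


def flat_metrics(cpu_percent):
    """cpu_percent: CPU samples from the last hour."""
    if len(cpu_percent) < FLAT_MIN_SAMPLES:
        return False  # too few samples to judge
    # statistics.stdev is the sample standard deviation, like STDDEV_SAMP in SQL.
    return stdev(cpu_percent) < FLAT_STDDEV
```

The median is used for the cadence check so that a single long gap, for example one agent restart, does not trigger a finding on its own.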
Open / resolved model
integrity_anomalies is append-only for the hub-app user: it holds only the INSERT and UPDATE privileges, while DELETE and TRUNCATE are revoked. Each finding exists in one of two states:
- open: resolved_at IS NULL
- resolved: an admin has clicked “mark as resolved”, or the check passes again in the next cycle
Resolved items are kept as an audit trail. An auditor can see exactly when each deviation was detected, and when and by whom it was closed.
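
The lifecycle can be pictured with a small in-memory sketch. It is illustrative only: the Anomaly class and the reconcile and resolve_manually helpers are hypothetical, and the rule of at most one open row per (agent, kind) is an assumption. The field names match the columns used in the auditor queries.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional, Set


@dataclass
class Anomaly:
    agent_id: str
    kind: str
    detected_at: datetime
    resolved_at: Optional[datetime] = None
    resolved_by: Optional[str] = None  # assumption: stays None when auto-resolved

    @property
    def is_open(self) -> bool:
        return self.resolved_at is None


def reconcile(rows: List[Anomaly], agent_id: str, failing: Set[str], now: datetime) -> None:
    """Apply one worker cycle for one agent. Rows are updated or appended, never removed."""
    open_by_kind = {r.kind: r for r in rows if r.agent_id == agent_id and r.is_open}
    # The check passes again: close the open row but keep it as audit trail.
    for kind, row in open_by_kind.items():
        if kind not in failing:
            row.resolved_at = now
    # The check fails and nothing is open yet: append a new row.
    for kind in sorted(failing):
        if kind not in open_by_kind:
            rows.append(Anomaly(agent_id, kind, detected_at=now))


def resolve_manually(row: Anomaly, admin_id: str, now: datetime) -> None:
    """An admin clicks the resolve button."""
    row.resolved_at = now
    row.resolved_by = admin_id
```

A finding that recurs after being resolved becomes a new row in this sketch, so the earlier, closed row remains visible to an auditor.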
Dashboard
/integrity shows six KPI cards (one per kind), each with the number of open items. The table below them lists severity, agent (clickable), summary, raw-detail JSON, detection time, and a resolve button.
Examples
Example 1: unsigned_payload with a stolen token
An attacker has stolen an ms_... token and tries to push metrics from their own machine, without the signing key:

    POST /api/v1/ingest
    Authorization: Bearer ms_<stolen>
    Content-Type: application/json
    (no X-Monsys-Signature)

    → 403 Forbidden
    → integrity_anomalies row:
        kind: unsigned_payload
        severity: high
        summary: agent has a pinned signing_pubkey but sent no X-Monsys-Signature
        detected_at: 2026-05-09 22:03:40

The finding appears on /integrity and turns the KPI strip red. Through the SWR refresh, the admin sees an open high-severity alert within 30 seconds.
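
The decision the hub makes for a request like this can be sketched as follows. The classify_ingest function and the injected verify callback are hypothetical, and the signature scheme itself is left open. Treating an agent without a pinned key as the medium variant of unsigned_payload is an assumption; the high and critical cases follow the table and the example above.

```python
from typing import Callable, Mapping, Optional, Tuple

SIGNATURE_HEADER = "X-Monsys-Signature"


def classify_ingest(
    headers: Mapping[str, str],
    body: bytes,
    pinned_pubkey: Optional[bytes],
    verify: Callable[[bytes, bytes, str], bool],
) -> Optional[Tuple[str, str]]:
    """Return (kind, severity) for an anomalous ingest, or None if it is clean."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    signature = lowered.get(SIGNATURE_HEADER.lower())

    if pinned_pubkey is None:
        # Assumption: an agent without a pinned key is the "medium" case.
        return ("unsigned_payload", "medium")
    if signature is None:
        # Pinned key but no header: the situation shown in Example 1.
        return ("unsigned_payload", "high")
    if not verify(pinned_pubkey, body, signature):
        return ("signature_invalid", "critical")
    return None
```

Passing verify in as a parameter keeps the sketch independent of any particular cryptographic library.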
Example 2: flat_metrics — agent sends fake telemetry
An attacker with root on the host replaces the real metrics collector with a script that constantly sends CPU=4.20%:

    SELECT STDDEV_SAMP(cpu_usage_percent), COUNT(*)
    FROM metrics
    WHERE agent_id = $1
      AND time > NOW() - INTERVAL '1 hour';
    -- stddev: 0.000 (over 240 samples) → flat_metrics anomaly

On its next 10-minute cycle, the worker upserts:

    kind: flat_metrics
    severity: high
    summary: CPU stddev = 0 over the last hour — telemetry seems manipulated
    detail: {"samples": 240, "stddev": 0.0}

SQL for auditor

    -- All open anomalies per agent, sorted by severity
    SELECT a.name, ia.kind, ia.severity, ia.summary, ia.detected_at
    FROM integrity_anomalies ia
    JOIN agents a ON a.id = ia.agent_id
    WHERE ia.resolved_at IS NULL
    ORDER BY
      CASE ia.severity
        WHEN 'critical' THEN 1
        WHEN 'high' THEN 2
        WHEN 'medium' THEN 3
        WHEN 'low' THEN 4
        ELSE 5
      END,
      ia.detected_at DESC;

    -- How many resolved items per kind in the last 30 days?
    SELECT kind, count(*)
    FROM integrity_anomalies
    WHERE resolved_at > NOW() - INTERVAL '30 days'
    GROUP BY kind;

    -- Who closed which item?
    SELECT u.email, ia.kind, ia.summary, ia.resolved_at
    FROM integrity_anomalies ia
    JOIN users u ON u.id = ia.resolved_by
    WHERE ia.resolved_at > NOW() - INTERVAL '30 days'
    ORDER BY ia.resolved_at DESC;

Limitations
- flat_metrics requires at least 30 samples, so agents that have just started stay below the detection threshold for their first 7-8 min (30 samples at the expected 15 s cadence).
- clock_drift uses the hub’s timestamp as its reference. An attacker who also compromises the hub host and manipulates its clock evades this check.
- version_downgrade only recognizes semver-style 0.x.y versions. Build suffixes such as 0.1.0-beta1 are treated as 0.1.0 for detection purposes.
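
The suffix limitation of version_downgrade is easiest to see in a small sketch. The parse_version and is_downgrade helpers below are hypothetical and only mimic the described behaviour: three numeric components are compared and anything after them is ignored. Accepting any major version and not flagging unparseable strings are both assumptions.

```python
import re
from typing import Optional, Tuple

# Three dot-separated numbers at the start; any suffix is ignored.
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)")


def parse_version(raw: str) -> Optional[Tuple[int, ...]]:
    """Return (major, minor, patch), or None if the string does not match."""
    match = _VERSION_RE.match(raw.strip())
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def is_downgrade(previous: str, current: str) -> bool:
    """True if current is numerically lower than the previously seen version."""
    prev, cur = parse_version(previous), parse_version(current)
    if prev is None or cur is None:
        return False  # assumption: unparseable versions are not flagged
    return cur < prev  # tuples compare component by component
```

Under these rules a move from 0.1.0 to 0.1.0-beta1 is not flagged, because both parse to the same triple, even though strict semantic versioning ranks the pre-release lower.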