Agent Integrity Monitoring

The hub runs an IntegrityCheckWorker every 10 minutes over all active agents. It generates no extra network traffic to the agents: it is a pure server-side analysis of data that has already been ingested.

What is being checked

Kind	Detection	Severity
`unsigned_payload`	Ingest without `X-Monsys-Signature` header (or agent without pinned pubkey)	medium / high
`signature_invalid`	Signature fails verification against pinned pubkey	critical
`clock_drift`	Average abs(captured_at − server_now) > 5 min over last 20 metrics	high
`cadence_anomaly`	Median gap between ingests outside [7s, 23s] (expected 15s)	medium
`flat_metrics`	CPU stddev < 0.001 over 30+ samples in the last hour	high
`version_downgrade`	`agent_version` is semver-lower than previously seen	high
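
The two timing checks in the table can be sketched as follows. This is a minimal illustration, not the worker's actual code: the function names and the input shapes are hypothetical, and it assumes the hub stores its own receive time next to each `captured_at` so the two can be compared.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean, median

# Thresholds taken from the table above.
MAX_AVG_DRIFT_S = timedelta(minutes=5).total_seconds()
CADENCE_RANGE_S = (7.0, 23.0)  # expected gap is 15 s
DRIFT_WINDOW = 20              # last 20 metrics


def has_clock_drift(samples, window=DRIFT_WINDOW):
    """samples: list of (captured_at, received_at) pairs, oldest first.

    received_at is the hub's clock at ingest time (an assumption of this sketch).
    """
    recent = samples[-window:]
    if not recent:
        return False
    avg_abs = mean(abs((cap - rec).total_seconds()) for cap, rec in recent)
    return avg_abs > MAX_AVG_DRIFT_S


def has_cadence_anomaly(ingest_times):
    """ingest_times: hub-side ingest timestamps, sorted ascending."""
    gaps = [(b - a).total_seconds() for a, b in zip(ingest_times, ingest_times[1:])]
    if not gaps:
        return False  # fewer than two ingests: nothing to measure
    low, high = CADENCE_RANGE_S
    return not (low <= median(gaps) <= high)
```

Using the median rather than the mean for cadence means a single long gap (for example one dropped request) does not trigger the check on its own.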

Open / resolved model

integrity_anomalies is append-only for the hub-app user: it holds only INSERT and UPDATE privileges, while DELETE and TRUNCATE are revoked. Each finding is in one of two states:

open: resolved_at IS NULL
resolved: an admin clicked “mark as resolved”, or the check passes again in a later cycle

Resolved items are kept as an audit trail, so an auditor can see exactly when each deviation was detected, and when and by whom it was closed.
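
The open/resolved lifecycle described above can be modelled in a few lines. The sketch below is an in-memory stand-in for the table, under the assumption that at most one open row exists per agent and kind; the class and method names are hypothetical, and only `resolved_at` and `resolved_by` correspond to columns mentioned in this document.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Anomaly:
    agent_id: int
    kind: str
    detected_at: datetime
    resolved_at: Optional[datetime] = None
    resolved_by: Optional[int] = None  # stays None when the worker auto-resolves

    @property
    def is_open(self) -> bool:
        return self.resolved_at is None


class AnomalyLog:
    """Rows are only ever added or updated, never removed."""

    def __init__(self):
        self.rows = []

    def _open_row(self, agent_id, kind):
        return next((r for r in self.rows
                     if r.agent_id == agent_id and r.kind == kind and r.is_open), None)

    def report(self, agent_id, kind, failing, now):
        """Called once per check per worker cycle."""
        row = self._open_row(agent_id, kind)
        if failing and row is None:
            self.rows.append(Anomaly(agent_id, kind, detected_at=now))
        elif not failing and row is not None:
            row.resolved_at = now  # check passes again: auto-resolve

    def resolve(self, agent_id, kind, user_id, now):
        """Called when an admin uses the resolve button."""
        row = self._open_row(agent_id, kind)
        if row is not None:
            row.resolved_at, row.resolved_by = now, user_id
```

A finding that recurs after being resolved produces a new row instead of reopening the old one, which keeps the earlier detection and closure times intact for the audit trail.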

Dashboard

/integrity shows six KPI cards (one per kind), each with its number of open items. The table below them lists severity, agent (clickable), summary, raw detail JSON, detection time, and a resolve button.

Examples

Example 1: unsigned_payload with a stolen token

An attacker has stolen an ms_... token and tries to push metrics from their own machine, without the signing key:

POST /api/v1/ingest
  Authorization: Bearer ms_<stolen>
  Content-Type: application/json
  (no X-Monsys-Signature)

→ 403 Forbidden
→ integrity_anomalies row:
    kind:     unsigned_payload
    severity: high
    summary:  agent has a pinned signing_pubkey but sent no X-Monsys-Signature
    detected_at: 2026-05-09 22:03:40

The finding appears on /integrity and is highlighted red in the KPI strip. Through the SWR refresh, the admin sees an open high-severity alert within 30 seconds.
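
The decision the hub makes at ingest time can be sketched as below. This is a hedged illustration only: `classify_ingest` and the `verify` callback are hypothetical names, the signature scheme is deliberately left abstract because this document does not name it, and mapping the "no pinned pubkey" case to medium is an assumption based on the severity column of the table.

```python
from typing import Callable, Mapping, Optional, Tuple

# (kind, severity), or None when the payload passes.
Finding = Optional[Tuple[str, str]]


def classify_ingest(
    headers: Mapping[str, str],
    body: bytes,
    pinned_pubkey: Optional[bytes],
    verify: Callable[[bytes, bytes, str], bool],
) -> Finding:
    """Classify one ingest request against the agent's pinned key.

    verify(pubkey, body, signature) is supplied by the caller and returns
    True when the signature is valid for the body.
    """
    signature = headers.get("X-Monsys-Signature")
    if pinned_pubkey is None:
        # Nothing to verify against (assumed to be the "medium" case).
        return ("unsigned_payload", "medium")
    if signature is None:
        # The stolen-token scenario above: key is pinned, header is missing.
        return ("unsigned_payload", "high")
    if not verify(pinned_pubkey, body, signature):
        return ("signature_invalid", "critical")
    return None
```

Note that a plain mapping lookup is case-sensitive; a real HTTP framework normally provides case-insensitive header access.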

Example 2: flat_metrics when an agent sends fake telemetry

An attacker with root on the host replaces the real metrics collector with a script that constantly sends CPU=4.20%:

SELECT STDDEV_SAMP(cpu_usage_percent), COUNT(*)
FROM metrics
WHERE agent_id = $1 AND time > NOW() - INTERVAL '1 hour';
-- stddev: 0.000  (over 240 samples) → flat_metrics anomaly

On its next run, at most 10 minutes later, the worker upserts:

kind:     flat_metrics
severity: high
summary:  CPU stddev = 0 over the last hour — telemetry seems manipulated
detail:   {"samples": 240, "stddev": 0.0}
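
The same check can be expressed outside SQL. The sketch below assumes the CPU samples of the last hour have already been loaded into a list; `flat_metrics_detail` is a hypothetical helper, and the thresholds are the ones from the table (30+ samples, stddev below 0.001).

```python
from statistics import stdev

MIN_SAMPLES = 30
STDDEV_THRESHOLD = 0.001


def flat_metrics_detail(cpu_samples):
    """Return the detail dict for a flat_metrics finding, or None."""
    n = len(cpu_samples)
    if n < MIN_SAMPLES:
        return None  # too few samples to judge
    sd = stdev(cpu_samples)  # sample standard deviation, like STDDEV_SAMP
    if sd < STDDEV_THRESHOLD:
        return {"samples": n, "stddev": round(sd, 3)}
    return None


# A script that always reports 4.20% yields zero variance over 240 samples.
print(flat_metrics_detail([4.20] * 240))
```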

SQL for auditors

-- All open anomalies per agent, sorted by severity
SELECT a.name, ia.kind, ia.severity, ia.summary, ia.detected_at
  FROM integrity_anomalies ia
  JOIN agents a ON a.id = ia.agent_id
 WHERE ia.resolved_at IS NULL
 ORDER BY CASE ia.severity
            WHEN 'critical' THEN 1 WHEN 'high' THEN 2
            WHEN 'medium'   THEN 3 WHEN 'low'  THEN 4 ELSE 5 END,
          ia.detected_at DESC;

-- How many resolved items per kind in the last 30 days?
SELECT kind, count(*)
  FROM integrity_anomalies
 WHERE resolved_at > NOW() - INTERVAL '30 days'
 GROUP BY kind;

-- Who closed which item?
SELECT u.email, ia.kind, ia.summary, ia.resolved_at
  FROM integrity_anomalies ia
  JOIN users u ON u.id = ia.resolved_by
 WHERE ia.resolved_at > NOW() - INTERVAL '30 days'
 ORDER BY ia.resolved_at DESC;

Limitations

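flat_metrics requires at least 30 samples. Agents that have just started stay below that threshold for roughly their first 7 to 8 minutes (30 samples at the expected 15 s interval is 7.5 minutes), so the check cannot fire during that window.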
clock_drift uses the hub’s timestamp as reference. An attacker who also compromises the hub host and manipulates its clock evades this check.
version_downgrade only recognizes semver-style 0.x.y. Build suffixes such as 0.1.0-beta1 are treated as 0.1.0 for detection purposes.
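
The suffix limitation is easiest to see in code. The following is a minimal sketch of a comparison that behaves as described, assuming three numeric components and ignoring everything after them; `parse_version` and `is_downgrade` are hypothetical names, and treating an unparseable version as "no finding" is an assumption.

```python
import re

# Three numeric components; anything after them (e.g. "-beta1") is ignored.
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)")


def parse_version(raw):
    """Return (major, minor, patch) as ints, or None if not recognized."""
    match = _VERSION_RE.match(raw.strip())
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def is_downgrade(previous, current):
    """True when current is strictly lower than the previously seen version."""
    old, new = parse_version(previous), parse_version(current)
    if old is None or new is None:
        return False  # unrecognized format: no finding (assumption)
    return new < old
```

With this logic, moving from 0.1.0 to 0.1.0-beta1 is not flagged because both parse to the same tuple, even though the Semantic Versioning specification orders a pre-release below its release. Comparing integer tuples also avoids the string-comparison trap where "0.10.0" would sort below "0.9.0".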