
Agent Integrity Monitoring

The hub runs an IntegrityCheckWorker every 10 minutes over all active agents. It generates no extra network traffic to the agents: it is a pure server-side analysis of data that has already been ingested.

What is being checked

| Kind | Detection | Severity |
|---|---|---|
| unsigned_payload | Ingest without X-Monsys-Signature header (or agent without pinned pubkey) | medium / high |
| signature_invalid | Signature fails verification against pinned pubkey | critical |
| clock_drift | Average abs(captured_at − server_now) > 5 min over last 20 metrics | high |
| cadence_anomaly | Median gap between ingests outside [7s, 23s] (expected 15s) | medium |
| flat_metrics | CPU stddev < 0.001 over 30+ samples in 1 hour | high |
| version_downgrade | agent_version semver-lower than previously seen | high |
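
The two timing checks in the table can be sketched in a few lines of Python. This is an illustration of the thresholds only, not the worker's actual code: the function names are hypothetical, and it assumes that server_now means the hub's receive time for each metric.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean, median

DRIFT_LIMIT = timedelta(minutes=5)  # clock_drift threshold from the table
DRIFT_WINDOW = 20                   # number of most recent metrics considered
CADENCE_RANGE = (7.0, 23.0)         # allowed median gap in seconds (expected 15s)


def has_clock_drift(samples, limit=DRIFT_LIMIT):
    """samples: (captured_at, received_at) datetime pairs, oldest first.

    Assumption: received_at stands in for server_now.
    """
    recent = samples[-DRIFT_WINDOW:]
    if not recent:
        return False
    avg_abs = mean(abs((captured - received).total_seconds())
                   for captured, received in recent)
    return avg_abs > limit.total_seconds()


def has_cadence_anomaly(received_times, allowed=CADENCE_RANGE):
    """received_times: ingest timestamps as datetimes, oldest first."""
    gaps = [(later - earlier).total_seconds()
            for earlier, later in zip(received_times, received_times[1:])]
    if not gaps:
        return False  # a single ingest has no gap to judge
    low, high = allowed
    return not (low <= median(gaps) <= high)
```

Using the median rather than the mean for the cadence check means a single long pause (for example an agent restart) does not trigger the anomaly on its own.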

Open / resolved model

integrity_anomalies is append-only for the hub-app user: it may only INSERT and UPDATE, and the DELETE and TRUNCATE privileges are revoked. Each finding is in one of two states:

  • open: resolved_at IS NULL
  • resolved: an admin clicked “mark as resolved”, or the check passes again in the next cycle

Resolved items are kept as an audit trail. An auditor can see exactly which deviation was detected when, and who closed it and when.
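
The open/resolved lifecycle can be sketched against an in-memory SQLite database. This is a minimal illustration, not the hub's schema: the column types and the helper names record_finding and resolve_finding are assumptions, and only the column names that appear elsewhere on this page are taken from the source.

```python
import sqlite3

# Hypothetical, simplified schema; the real table likely has more columns.
SCHEMA = """
CREATE TABLE integrity_anomalies (
    id          INTEGER PRIMARY KEY,
    agent_id    TEXT NOT NULL,
    kind        TEXT NOT NULL,
    severity    TEXT NOT NULL,
    summary     TEXT NOT NULL,
    detected_at TEXT NOT NULL,
    resolved_at TEXT,   -- NULL means the finding is open
    resolved_by TEXT    -- NULL when the worker closed it automatically
);
"""


def record_finding(db, agent_id, kind, severity, summary, now):
    """Open a finding unless one is already open for this agent and kind."""
    row = db.execute(
        "SELECT id FROM integrity_anomalies "
        "WHERE agent_id = ? AND kind = ? AND resolved_at IS NULL",
        (agent_id, kind),
    ).fetchone()
    if row:
        return row[0]
    cur = db.execute(
        "INSERT INTO integrity_anomalies "
        "(agent_id, kind, severity, summary, detected_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (agent_id, kind, severity, summary, now),
    )
    return cur.lastrowid


def resolve_finding(db, agent_id, kind, now, resolved_by=None):
    """Close the open finding; the row stays in place as the audit trail."""
    cur = db.execute(
        "UPDATE integrity_anomalies SET resolved_at = ?, resolved_by = ? "
        "WHERE agent_id = ? AND kind = ? AND resolved_at IS NULL",
        (now, resolved_by, agent_id, kind),
    )
    return cur.rowcount  # number of findings that were closed
```

Because resolving is an UPDATE and nothing is ever deleted, a recurrence of the same problem after a resolve creates a new row, so the history of each incident remains visible.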

Dashboard

/integrity shows 6 KPI cards (one per kind), each with the number of open items. The table below them lists severity, agent (clickable), summary, raw detail JSON, detection time, and a resolve button.

Examples

Example 1: unsigned_payload with a stolen token

An attacker has stolen an ms_... token and tries to push metrics from their own machine, without the signing key:

POST /api/v1/ingest
Authorization: Bearer ms_<stolen>
Content-Type: application/json
(no X-Monsys-Signature)
→ 403 Forbidden
→ integrity_anomalies row:
kind: unsigned_payload
severity: high
summary: agent has a pinned signing_pubkey but sent no X-Monsys-Signature
detected_at: 2026-05-09 22:03:40

The finding appears on /integrity and turns the KPI strip red. Within 30 seconds (via SWR refresh) the admin sees an open high-severity alert.
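
How an ingest request maps to a signature-related finding can be sketched as a small decision function. The function name and its boolean inputs are hypothetical. The "high" case follows the row shown in this example; treating an agent without a pinned pubkey as "medium" is an assumption based on the "medium / high" entry in the table, not something this page states explicitly.

```python
from typing import Optional, Tuple


def classify_signature(has_pinned_pubkey: bool,
                       signature_present: bool,
                       signature_valid: bool) -> Optional[Tuple[str, str]]:
    """Return (kind, severity) for a finding, or None if the request is fine."""
    if not has_pinned_pubkey:
        # Assumption: nothing to verify against, so the lower severity applies.
        return ("unsigned_payload", "medium")
    if not signature_present:
        # Matches the example: pinned pubkey but no X-Monsys-Signature header.
        return ("unsigned_payload", "high")
    if not signature_valid:
        return ("signature_invalid", "critical")
    return None
```

For the stolen-token request in this example, classify_signature(True, False, False) returns ("unsigned_payload", "high").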

Example 2: flat_metrics — agent sends fake telemetry

An attacker with root on the host replaces the real metrics collector with a script that constantly sends CPU=4.20%:

SELECT STDDEV_SAMP(cpu_usage_percent), COUNT(*)
FROM metrics
WHERE agent_id = $1 AND time > NOW() - INTERVAL '1 hour';
-- stddev: 0.000 (over 240 samples) → flat_metrics anomaly

On its next run, at most 10 minutes later, the worker upserts:

kind: flat_metrics
severity: high
summary: CPU stddev = 0 over the last hour — telemetry seems manipulated
detail: {"samples": 240, "stddev": 0.0}
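
The same decision can be sketched in Python, which makes the 30-sample minimum from the table explicit. The function name is_flat is hypothetical; statistics.stdev computes the sample standard deviation, the same statistic as STDDEV_SAMP in the query above.

```python
from statistics import stdev

MIN_SAMPLES = 30      # below this the check does not fire
STDDEV_FLOOR = 0.001  # CPU stddev threshold from the table


def is_flat(cpu_samples, min_samples=MIN_SAMPLES, floor=STDDEV_FLOOR):
    """cpu_samples: CPU usage percentages from the last hour."""
    if len(cpu_samples) < min_samples:
        return False  # too little data to judge
    return stdev(cpu_samples) < floor
```

A constant series such as [4.2] * 240 is flagged, while the same value repeated only 10 times is not, because it falls below the sample minimum.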

SQL for auditor

-- All open anomalies per agent, sorted by severity
SELECT a.name, ia.kind, ia.severity, ia.summary, ia.detected_at
FROM integrity_anomalies ia
JOIN agents a ON a.id = ia.agent_id
WHERE ia.resolved_at IS NULL
ORDER BY CASE ia.severity
           WHEN 'critical' THEN 1
           WHEN 'high'     THEN 2
           WHEN 'medium'   THEN 3
           WHEN 'low'      THEN 4
           ELSE 5
         END,
         ia.detected_at DESC;

-- How many resolved items per kind in the last 30 days?
SELECT kind, count(*)
FROM integrity_anomalies
WHERE resolved_at > NOW() - INTERVAL '30 days'
GROUP BY kind;

-- Who closed which item?
SELECT u.email, ia.kind, ia.summary, ia.resolved_at
FROM integrity_anomalies ia
JOIN users u ON u.id = ia.resolved_by
WHERE ia.resolved_at > NOW() - INTERVAL '30 days'
ORDER BY ia.resolved_at DESC;

Limitations

  • flat_metrics requires at least 30 samples. At the expected 15s cadence, an agent that just started stays below the detection threshold for its first 7-8 min.
  • clock_drift uses the hub’s timestamp as reference. An attacker who also compromises the hub-host (clock manipulation) evades this check.
  • version_downgrade only recognizes semver-style 0.x.y. Build-suffixes like 0.1.0-beta1 are treated as 0.1.0 for detection purposes.
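
The suffix handling described in the last limitation can be sketched as follows. This is an illustration under assumptions, not the hub's parser: parse_version and is_downgrade are hypothetical names, and the sketch accepts any three numeric parts and ignores everything after them.

```python
import re

# Three dot-separated numeric parts at the start; any suffix is ignored.
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)")


def parse_version(version):
    """Return (major, minor, patch) as ints, or None if unrecognized."""
    match = _VERSION_RE.match(version.strip())
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def is_downgrade(previous, current):
    """True if current is semver-lower than the previously seen version."""
    prev, cur = parse_version(previous), parse_version(current)
    if prev is None or cur is None:
        return False  # unrecognized versions are not flagged
    return cur < prev
```

With this parsing, is_downgrade("0.1.0", "0.1.0-beta1") is False, because both sides compare as (0, 1, 0). Comparing integer tuples rather than strings also keeps 0.1.10 ordered above 0.1.9.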