mTLS — hub ↔ agent client certificates

Every agent gets a hub-signed X.509 client certificate at first boot. From that moment on, every request to api.monsys.ai proves two things: that the caller presents a valid bearer token (as before), and that the client side of the TLS handshake holds the private key matching its certificate, stored at /var/lib/monsys/mtls/client.key. A leaked token alone is no longer enough to impersonate an agent: the attacker also needs the on-disk key, which stays on the agent host once it has been issued.

Threat model

Before mTLS (bearer-only):

Token theft from an inventory dump, log line, or memory scrape → an attacker can call every agent endpoint as that agent until the operator rotates the token.

After mTLS:

A stolen token is not enough on its own: the attacker must also exfiltrate /var/lib/monsys/mtls/client.key (root-owned, mode 600) separately.
A misconfigured proxy that strips the cert is rejected — Caddy sets the X-Monsys-Client-Verified header from the cert fingerprint, and hub middleware aborts if the CN doesn’t match the bearer’s agent_id.

How it works

CA bootstrap (one-time). On first start the hub generates an RSA-4096 CA, encrypts the private key with CLOUD_ENCRYPTION_KEY (AES-256-GCM), and stores both in the singleton hub_settings row. The public CA cert is also exposed unauthenticated at GET /api/v1/agents/ca-cert so agents can pin it.
Per-agent cert issue. The agent calls POST /api/v1/agents/issue-client-cert with its bearer token. The hub signs an RSA-2048 cert (CN = agent_id UUID, OU = tenant_id UUID) with 365-day validity, stores the public cert in agent_certificates, and returns the cert, private key, and CA PEM once.
Persistence. The agent writes three files with mode 600 under /var/lib/monsys/mtls/: client.crt, client.key, and hub-ca.crt.
Every subsequent request uses the cert. Caddy is configured with client_auth.mode = verify_if_given against the hub CA, so older bearer-only agents keep working during rollout.
Caddy propagation. After verification, Caddy injects X-Monsys-Client-Subject (full DN) and X-Monsys-Client-Verified (cert fingerprint) into the upstream request. Inbound copies of these headers are stripped first so a non-mTLS client can’t forge them.
Hub cross-check. The AgentAuth middleware extracts the CN from the subject DN and compares it to the bearer-resolved agent_id. Match: success + last_seen_at bumped on the cert row. Mismatch: HTTP 401 + integrity_anomaly recorded — strong signal of token theft or proxy misconfiguration.
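
The propagation and cross-check steps can be modelled in a few lines. This is only a sketch under stated assumptions, not the hub's actual middleware: the function names are hypothetical, the DN is assumed to be a simple comma-separated string with no escaped commas, and bearer-only requests are assumed to pass because of verify_if_given.

```python
from typing import Optional

# Headers the proxy owns; client-supplied copies must never reach the hub.
TRUSTED_HEADERS = ("x-monsys-client-subject", "x-monsys-client-verified")


def proxy_headers(inbound: dict, subject_dn: Optional[str],
                  fingerprint: Optional[str]) -> dict:
    """Model of the proxy step: strip inbound copies, then set verified values."""
    out = {k: v for k, v in inbound.items() if k.lower() not in TRUSTED_HEADERS}
    if subject_dn is not None and fingerprint is not None:
        out["X-Monsys-Client-Subject"] = subject_dn
        out["X-Monsys-Client-Verified"] = fingerprint
    return out


def common_name(subject_dn: str) -> Optional[str]:
    """Return the CN attribute of a comma-separated DN, or None if absent."""
    for part in subject_dn.split(","):
        key, _, value = part.strip().partition("=")
        if key.upper() == "CN":
            return value
    return None


def cross_check(headers: dict, bearer_agent_id: str) -> int:
    """Return an HTTP status for the request (hypothetical helper)."""
    subject = headers.get("X-Monsys-Client-Subject")
    if subject is None:
        return 200  # assumption: bearer-only agents pass during rollout
    cn = common_name(subject)
    if cn is not None and cn.lower() == bearer_agent_id.lower():
        return 200  # the real hub would also bump last_seen_at here
    return 401  # the real hub would also record an integrity_anomaly
```

Because the inbound copies are dropped before the verified values are set, a client that sends its own X-Monsys-Client-Subject header without presenting a cert ends up with no subject header at all.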

Rollout — what changes for existing agents

Nothing immediate. Caddy’s verify_if_given mode allows non-mTLS connections to keep working. On the next agent auto-update (or manual restart), the new binary calls issue-client-cert once and all subsequent traffic is mTLS-authenticated. No downtime, no re-enrollment, no token change.

The Trust Score agent_health component soft-penalises (-5 points) agents that have not yet bootstrapped a cert. This nudges operators to roll out the new binary without forcing a hard break.

Operational notes

Topic	Detail
CA expiry	10 years from first boot. Alert wired into `ca_not_after` for future automation.
Client cert expiry	365 days. The agent requests a new cert automatically once the current one is within 30 days of expiry.
CA private key	AES-256-GCM-encrypted in `hub_settings.ca_key_enc`. Restore requires the same `CLOUD_ENCRYPTION_KEY` — back it up out-of-band.
Rotation	Calling `POST /api/v1/agents/issue-client-cert` again rotates the cert: the old row is marked `revoked_at = NOW(), revoke_reason = 'rotated'` and the new row is inserted in the same transaction.
Revocation	Operator sets `revoked_at` in the DB. Hub middleware rejects revoked certs even though Caddy still accepts them (until a CRL endpoint is added).
Key algorithms	RSA-2048 client keys, RSA-4096 CA. Ed25519 is reserved for Emergency Action Tokens, where we control both sides.
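
The client cert expiry rule in the table comes down to a date comparison. A minimal sketch, assuming the 30-day boundary is inclusive and that the agent reads the expiry from its own cert; the function name and the dates are illustrative only.

```python
from datetime import datetime, timedelta, timezone

RENEWAL_WINDOW = timedelta(days=30)


def needs_renewal(not_after: datetime, now: datetime) -> bool:
    """True once the cert is inside the renewal window or already expired."""
    return now >= not_after - RENEWAL_WINDOW


# Illustrative dates: a cert issued on 1 January with 365-day validity.
issued = datetime(2025, 1, 1, tzinfo=timezone.utc)
not_after = issued + timedelta(days=365)

print(needs_renewal(not_after, issued + timedelta(days=300)))
print(needs_renewal(not_after, issued + timedelta(days=340)))
```

With a 365-day cert, the window opens on day 335, so the first check falls outside it and the second falls inside.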

Endpoints

Method	Path	Auth	What
GET	`/api/v1/agents/ca-cert`	none	Hub CA public certificate (for trust pinning)
POST	`/api/v1/agents/issue-client-cert`	bearer	Issue or rotate the calling agent’s client cert + private key
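
As an illustration, the two calls could be built like this with the Python standard library. The paths and host come from this page; the `Authorization: Bearer` header format and the empty POST body are assumptions, and the response format is not specified here, so the sketch stops at building the requests.

```python
import urllib.request

BASE_URL = "https://api.monsys.ai"


def ca_cert_request() -> urllib.request.Request:
    """Unauthenticated request for the hub CA public certificate."""
    return urllib.request.Request(f"{BASE_URL}/api/v1/agents/ca-cert", method="GET")


def issue_cert_request(bearer_token: str) -> urllib.request.Request:
    """Bearer-authenticated request to issue or rotate the caller's client cert."""
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/agents/issue-client-cert",
        data=b"",  # assumption: no request body is required
        method="POST",
        headers={"Authorization": f"Bearer {bearer_token}"},  # assumed header format
    )
```

A caller would pass either request to `urllib.request.urlopen`. Once a cert has been issued, later calls can supply a `context` argument holding an `ssl.SSLContext` on which `load_cert_chain` has loaded client.crt and client.key.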

Files on the agent host

/var/lib/monsys/mtls/
├── client.crt   # RSA-2048 client certificate (PEM)
├── client.key   # RSA-2048 private key (PEM, mode 600)
└── hub-ca.crt   # hub CA root (PEM)
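
One way an agent could write this layout so the files never exist with looser permissions is sketched below. It is not the agent's actual code: the helper names are hypothetical, the 0700 directory mode is an assumption, and the demo writes placeholder bytes to a temporary directory rather than /var/lib/monsys/mtls/.

```python
import os
import stat
import tempfile


def write_private(path: str, pem: bytes) -> None:
    """Create or replace a file that only its owner can read or write."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as f:
        os.fchmod(f.fileno(), 0o600)  # enforce the mode even if the file existed
        f.write(pem)


def persist_bundle(directory: str, cert: bytes, key: bytes, ca: bytes) -> None:
    """Write client.crt, client.key and hub-ca.crt under the given directory."""
    os.makedirs(directory, mode=0o700, exist_ok=True)  # assumed directory mode
    for name, pem in (("client.crt", cert), ("client.key", key), ("hub-ca.crt", ca)):
        write_private(os.path.join(directory, name), pem)


# Demo against a throwaway directory with placeholder contents.
demo_dir = os.path.join(tempfile.mkdtemp(), "mtls")
persist_bundle(demo_dir, b"CERT PEM", b"KEY PEM", b"CA PEM")
for name in sorted(os.listdir(demo_dir)):
    mode = stat.S_IMODE(os.stat(os.path.join(demo_dir, name)).st_mode)
    print(name, oct(mode))
```

Passing the mode to `os.open` means a newly created file is restricted from the moment it exists, and `os.fchmod` covers the case where an older file with a wider mode is being overwritten.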

Compliance mapping: ISO 27001 A.8.20 (Network security management) + CRA Annex I §3 (secure-by-default communication). Both controls are auto-evaluated by counting active, non-revoked rows in agent_certificates.