DevOps — SLOs, supply chain, LLM observability
1. SLO + error budget per application
In the dashboard
- Sidebar → Apps → click your application → SLO tab
- ‘Define SLO’ button → target 0.999 + 30 day window
- Burn-down sparkline appears in the same tab
- ‘Embed badge’ button copies markdown for your README
Define an application and bind metrics to it:
Or via API (advanced — for automation)
    curl -X POST https://app.monsys.ai/api/v1/apps \
      -H "Authorization: Bearer $TOKEN" \
      -d '{
        "name": "checkout-api",
        "type": "http",
        "agent_id": "<agent_uuid_of_web_lb>",
        "endpoint": "https://checkout.example.com/healthz",
        "interval_seconds": 30
      }'

Set SLO + window:
    curl -X POST https://app.monsys.ai/api/v1/apps/<app_id>/slo \
      -H "Authorization: Bearer $TOKEN" \
      -d '{
        "target": 0.999,
        "window_days": 30,
        "minimum_data": 0.95
      }'

/operations/mttr shows per application:
- Current rolling 30d uptime (% and minutes-down)
- Error budget remaining (in minutes)
- Burn-down sparkline (current burn rate vs target)
- “Days until budget exhausted” if you keep burning at this week’s rate
For your sprint review: pull the SLO CSV via
/api/v1/apps/<id>/slo/history?days=90. For your deploy gate, check
whether burn rate was < 2.0 during the last hour.
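The budget and burn-rate figures above can be reproduced by hand. The sketch below uses the conventional definitions (budget = allowed downtime in the window, burn rate = observed error rate divided by the error rate the SLO allows). How monsys computes them internally is not stated here, so treat the function names and the formula as assumptions.

```python
def error_budget_minutes(target: float, window_days: int) -> float:
    """Total downtime, in minutes, that the SLO allows over the window."""
    return (1.0 - target) * window_days * 24 * 60


def burn_rate(bad_minutes: float, observed_minutes: float, target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    1.0 means the budget is consumed exactly at the sustainable pace.
    """
    if observed_minutes <= 0:
        raise ValueError("observed_minutes must be positive")
    return (bad_minutes / observed_minutes) / (1.0 - target)


def deploy_allowed(bad_minutes_last_hour: float, target: float,
                   max_burn: float = 2.0) -> bool:
    """Deploy gate: burn rate over the last hour must stay below max_burn."""
    return burn_rate(bad_minutes_last_hour, 60.0, target) < max_burn
```

With a 0.999 target over 30 days the budget comes to roughly 43.2 minutes, and 0.3 bad minutes in the last hour is a burn rate of about 5, which this gate would reject.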
2. Slack/ntfy alert when a container restarts > 3× in an hour
In the dashboard
- Sidebar → Settings → Alert rules → ‘New rule’
- Metric: container_restart_count, operator >, threshold 3, window 1h
- Webhook URL field: https://hooks.slack.com/services/T0/B0/xxx
- WebhookDispatchWorker retries failed deliveries and never includes secrets in the payload
monsys.ai itself doesn’t post to Slack — it posts to ntfy (self-host) or via a webhook. Slack webhook example:
Or via API (advanced — for automation)
    curl -X POST https://app.monsys.ai/api/v1/alert-rules \
      -H "Authorization: Bearer $TOKEN" \
      -d '{
        "name": "Container flapping",
        "metric": "container_restart_count",
        "operator": ">",
        "threshold": 3,
        "window": "1h",
        "severity": "warning",
        "group_by": ["agent_id", "container_name"],
        "webhook_url": "https://hooks.slack.com/services/T0/B0/xxx"
      }'

WebhookDispatchWorker makes the outgoing POST with a minimal JSON
body ({title, severity, agent, link}), with no prompt content or
secrets. It retries 3× with backoff and logs to /audit-log with
event_type webhook_delivery.
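To make the rule semantics concrete, here is a minimal sketch of a sliding-window check for "more than 3 restarts within 1 hour". It illustrates the threshold logic only. The class name `RestartWindow` is hypothetical, and this is not how monsys evaluates rules internally.

```python
from collections import deque
from datetime import datetime, timedelta


class RestartWindow:
    """Counts restarts of one container inside a sliding time window."""

    def __init__(self, threshold: int = 3, window: timedelta = timedelta(hours=1)):
        self.threshold = threshold
        self.window = window
        self._events = deque()

    def record(self, at: datetime) -> bool:
        """Record one restart; return True when the rule should fire."""
        self._events.append(at)
        cutoff = at - self.window
        # Drop restarts that have fallen out of the window.
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        return len(self._events) > self.threshold
```

To mirror the rule's group_by, one would keep one `RestartWindow` per (agent_id, container_name) pair, for example in a dict keyed by that tuple.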
3. Auto-update CVE-vulnerable npm deps, but pin our DB driver
In the dashboard
- Sidebar → Recommendations → ‘Application CVEs’ tab
- Per vulnerable package: ‘Auto-update’ button or ‘3-dots → Suppress this fix’
- On suppression: enter expires_at + reason
- Future ‘Auto-update all’ runs skip the package and record the reason in the monthly evidence
Application dependency CVE scanning runs hourly. Open
/recommendations → “Application CVEs”. Suppose you see 12 packages
with available fixes.
Click “Auto-update all” — conditions:
- TOTP required
- Capped at 25 EATs per click (avoids fleet-wide blast)
- Per (project, ecosystem) each host gets one EAT with the combined package list
Want to pin a package (e.g. mysql2@2.3.0 because 2.4.x has a
breaking ABI change)? Add a suppression:
Or via API (advanced — for automation)
    curl -X POST https://app.monsys.ai/api/v1/cve-suppressions \
      -H "Authorization: Bearer $TOKEN" \
      -d '{
        "package_name": "mysql2",
        "ecosystem": "npm",
        "version_range": ">=2.4.0",
        "reason": "ABI break - see PROD-123 ticket",
        "expires_at": "2026-09-01"
      }'

Future auto-update-all skips this upgrade AND records in evidence that it was skipped with the stated reason.
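A rough sketch of how such a suppression could be matched against a candidate upgrade, assuming plain dotted numeric versions, a ">=" range only, and that the suppression stops applying from the expires_at date onward. The helper names are hypothetical, and the real matcher may support full semver ranges.

```python
from datetime import date


def parse_version(text: str) -> tuple:
    """Turn '2.4.1' into (2, 4, 1); pre-release tags are not handled."""
    return tuple(int(part) for part in text.split("."))


def is_suppressed(candidate: str, suppression: dict, today: date) -> bool:
    """True when upgrading to `candidate` is blocked by an unexpired suppression."""
    # Assumption: the suppression no longer applies on or after expires_at.
    if today >= date.fromisoformat(suppression["expires_at"]):
        return False
    version_range = suppression["version_range"]
    if not version_range.startswith(">="):
        raise ValueError("only '>=' ranges are handled in this sketch")
    return parse_version(candidate) >= parse_version(version_range[2:])
```

With the suppression above, an upgrade to 2.4.1 is skipped until 2026-09-01, while a fix that lands on 2.3.5 still goes through.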
4. Detect npm maintainer change (supply-chain attack indicator)
In the dashboard
- Sidebar → Inventory → Dependencies tab → filter ‘Maintainer changed (7d)’
- Per package: current + previous maintainer set
- Click suspect package → ‘Pin version’ or ‘Investigate’
- A hit automatically triggers a webhook to your incident tooling
Since schema 076, monsys tracks the set of publisher emails for each npm package you depend on. Any change to that set raises an alert.
Or via API (advanced — for automation)
    SELECT package_name, ecosystem, maintainers, deprecated,
           last_maintainer_change_at, seen_at
    FROM supply_chain_maintainer_cache
    WHERE tenant_id = $1::UUID
      AND last_maintainer_change_at >= NOW() - INTERVAL '7 days'
    ORDER BY last_maintainer_change_at DESC;

Workflow on a hit:
- Check whether the new maintainer is a legitimate org collaborator (often yes → accept the change in the UI)
- If not: pin the current version via cve_suppressions + open a ticket upstream
- Optionally trigger an IsolateNetwork EAT on staging hosts that already installed the package
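The first step of that workflow is a set comparison. A minimal sketch, assuming you keep your own list of trusted collaborator emails (the `trusted` argument and the returned labels are hypothetical and not part of the monsys API):

```python
def maintainer_diff(previous, current) -> dict:
    """Emails added to and removed from a package's maintainer set."""
    prev, cur = set(previous), set(current)
    return {"added": sorted(cur - prev), "removed": sorted(prev - cur)}


def triage(previous, current, trusted) -> str:
    """Suggest a next step for a maintainer change."""
    diff = maintainer_diff(previous, current)
    if not diff["added"] and not diff["removed"]:
        return "unchanged"
    # Every newly added maintainer is a known collaborator: accept the change.
    if set(diff["added"]) <= set(trusted):
        return "accept"
    return "pin-and-investigate"
```

An unknown address appearing in the set yields "pin-and-investigate", which maps to the pin + upstream ticket step above.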
5. Terraform-managed config vs drift detection
In the dashboard
- Sidebar → Settings → API tokens → generate scoped readonly token
- Sidebar → AI Insights → ‘Deploy gate template’ → copy bash snippet
- Paste in your CI yaml before the deploy step
- Test with --dry-run, which shows the Trust Score + open critical alerts
We assume Terraform is the source of truth for /etc/ssh/sshd_config.
Drift = somebody made a manual change.
Or via API (advanced — for automation)
    # After every successful terraform apply: bump the baseline
    curl -X POST https://app.monsys.ai/api/v1/agents/<id>/config-hashes/rebase \
      -H "Authorization: Bearer $TOKEN" \
      -d '{
        "paths": ["/etc/ssh/sshd_config", "/etc/nginx/nginx.conf"],
        "reason": "terraform apply commit 7a3c…"
      }'

Any drift within 7 days of a baseline rebase, made by anyone other than the Terraform service user, automatically opens a ticket in your issue tracker via WebhookDispatch.
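Conceptually, drift detection compares a content hash of each tracked file against the hash stored at the last rebase. The sketch below shows that idea with SHA-256 over the file bytes. The hash algorithm and the function names are assumptions, since this page does not say how the agent computes config hashes.

```python
import hashlib
from pathlib import Path


def file_sha256(path) -> str:
    """Hex SHA-256 of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def snapshot(paths) -> dict:
    """Baseline: map each path to its current hash (what a rebase would store)."""
    return {str(p): file_sha256(p) for p in paths}


def drifted(baseline: dict, paths) -> list:
    """Paths whose current hash differs from the baseline, or that have no baseline."""
    current = snapshot(paths)
    return sorted(p for p in current if baseline.get(p) != current[p])
```

Taking a new `snapshot` right after `terraform apply` plays the role of the rebase call, and any later non-empty result from `drifted` points at a manual change.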
6. LLM cost per team via SDK
In the dashboard
- Sidebar → AI → ‘Per team’ tab
- Default grouping on ‘team’ tag from SDK
- Filter descending on cost OR p95 latency
- ‘Export monthly CSV’ button for financial cross-charge
Or via API (advanced — for automation)
    # In your Python service
    import os

    import openai
    from monsys_ai import MonsysClient

    client = MonsysClient(
        hub_url="https://hub.monsys.ai",
        api_key=os.environ["MONSYS_AI_TOKEN"],
        tenant="acme",
        team="checkout",  # → tag on every trace
    )

    with client.trace("price-lookup") as t:
        resp = openai.chat.completions.create(...)
        t.add_llm_call(
            model=resp.model,
            prompt_tokens=resp.usage.prompt_tokens,
            completion_tokens=resp.usage.completion_tokens,
        )

In /ai/quadrant you see per team the breakdown: cost, p95 latency,
top 3 models, refusal rate. CSV export for monthly cross-charging.
PII redaction happens in the SDK, not at the hub. The agent never sees the raw prompt — only token counts and a SHA256 hash of the prompt for dedup.
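As an illustration of what "token counts plus a hash" means in practice, the sketch below builds a trace record that carries only the SHA-256 of the prompt. The field names and the `trace_payload` helper are hypothetical. The SDK's actual wire format is not documented here.

```python
import hashlib


def trace_payload(prompt: str, prompt_tokens: int,
                  completion_tokens: int, team: str) -> dict:
    """Build a record that can leave the service without the raw prompt."""
    return {
        "team": team,
        # The hash allows dedup of identical prompts without revealing them.
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
    }
```

Identical prompts produce identical hashes, so duplicates can be grouped. Keep in mind that a hash of a short or guessable prompt can still be matched by brute force, so redaction before hashing remains important.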
7. Deploy gate: block release when a critical alert is open
In the dashboard
- Sidebar → Webhooks → ‘Register endpoint’
- URL + shared HMAC secret + events filter (alert.created, alert.resolved)
- Min severity = warning + agent_filter on tag
- Retries 3× with backoff; failures land in audit_log
For your CI pipeline:
Or via API (advanced — for automation)
    #!/usr/bin/env bash
    # Before `helm upgrade` / `kubectl apply` etc.
    OPEN_CRITICAL=$(curl -s \
      "https://app.monsys.ai/api/v1/alerts?severity=critical&is_resolved=false&tag=production" \
      -H "Authorization: Bearer $MONSYS_TOKEN" | jq 'length')

    if [[ "$OPEN_CRITICAL" -gt 0 ]]; then
      echo "✗ refusing deploy: $OPEN_CRITICAL critical alerts open on production"
      exit 1
    fi

    # Trust Score gate (floor: -lt only compares integers, in case the score is fractional)
    SCORE=$(curl -s "https://app.monsys.ai/api/v1/trust-score/v12/tenant" \
      -H "Authorization: Bearer $MONSYS_TOKEN" | jq '.final_score | floor')

    if [[ "$SCORE" -lt 70 ]]; then
      echo "✗ refusing deploy: Trust Score $SCORE/100 below threshold 70"
      exit 1
    fi

    echo "✓ Trust Score $SCORE, no open critical alerts - proceeding"

This prevents the “ship now, look at alerts later” anti-pattern.
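If your pipeline runs Python rather than bash, the same decision can be expressed as a small pure function that is easy to unit-test. This is a sketch of the gate logic only. Fetching the alerts and the score is left out, and the `gate` function is a hypothetical helper, not part of any monsys SDK.

```python
def gate(open_critical_alerts, trust_score, threshold=70):
    """Return the reasons to refuse a deploy; an empty list means proceed."""
    reasons = []
    if open_critical_alerts:
        reasons.append(f"{len(open_critical_alerts)} critical alerts open")
    if trust_score is None:
        # Fail closed: a missing score should not wave a release through.
        reasons.append("Trust Score unavailable")
    elif trust_score < threshold:
        reasons.append(f"Trust Score {trust_score}/100 below threshold {threshold}")
    return reasons
```

A CI step would call `gate(...)` with the parsed API responses and exit non-zero when the returned list is not empty.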
8. Webhook integration with your own incident-response tooling
In the dashboard
- Sidebar → Settings → ‘Embed badge’ (Trust Score panel)
- Copy SVG URL with your tenant_id
- Paste markdown in README or internal status page
- Updates every 30 min, no JavaScript, no tracking
WebhookDispatchWorker POSTs on every alert with severity ≥ warning.
Register a custom endpoint:
Or via API (advanced — for automation)
    curl -X POST https://app.monsys.ai/api/v1/webhooks \
      -H "Authorization: Bearer $TOKEN" \
      -d '{
        "url": "https://incidents.acme.com/api/v1/intake",
        "secret": "shared-hmac-secret",
        "events": ["alert.created", "alert.resolved"],
        "min_severity": "warning",
        "agent_filter": {"tag": "production"}
      }'

Body shape:

    {
      "event": "alert.created",
      "alert": {
        "id": "…",
        "severity": "critical",
        "category": "cpu",
        "title": "Host CPU > 95% sustained 10min",
        "agent": { "id": "…", "hostname": "web-03", "tags": ["production"] }
      },
      "tenant_id": "…",
      "timestamp": "2026-05-19T20:34:00Z"
    }

Headers:
- X-Monsys-Signature: sha256=… (HMAC over body with secret)
- X-Monsys-Delivery: <uuid> (idempotency key)
- X-Monsys-Event: alert.created
Retries: 3× with exponential backoff (10s / 60s / 5min). Final
failure lands in /audit-log as webhook_delivery_failed and your
endpoint goes into a 10-min cooldown to avoid hot loops.
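On the receiving side, the signature and delivery headers are what make the endpoint safe against forged and repeated deliveries. A minimal sketch, assuming the signature is the hex-encoded HMAC-SHA256 of the raw request body (the encoding is an assumption, so check it against a real delivery). The `Receiver` class is hypothetical and keeps seen delivery ids in memory only.

```python
import hashlib
import hmac


def verify_signature(secret: str, body: bytes, header: str) -> bool:
    """Check an 'X-Monsys-Signature: sha256=<hex>' header against the raw body."""
    prefix = "sha256="
    if not header.startswith(prefix):
        return False
    expected = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    provided = header[len(prefix):]
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), provided.encode("utf-8"))


class Receiver:
    """Rejects bad signatures and ignores retried deliveries."""

    def __init__(self, secret: str):
        self.secret = secret
        self.seen = set()

    def handle(self, body: bytes, headers: dict) -> str:
        if not verify_signature(self.secret, body, headers.get("X-Monsys-Signature", "")):
            return "rejected"
        delivery_id = headers.get("X-Monsys-Delivery")
        if delivery_id in self.seen:
            return "duplicate"  # a retry of something already processed
        self.seen.add(delivery_id)
        return "accepted"
```

Verify against the exact bytes received, before any JSON parsing or re-serialisation, because re-encoding the body can change it and invalidate the HMAC. In production the seen-id set would live in shared storage with an expiry.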
9. Embed Trust Score in your README/dashboard
Or via API (advanced — for automation)
Server-rendered SVG. Updates every 30 minutes (Trust Score worker cycle). The badge colour-codes by band: <50 red, 50-70 orange, 70-85 green, >85 dark green. No JavaScript, no tracking — rendered directly from TimescaleDB.
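If you want to mirror the badge colours in your own dashboard, the bands can be written as a small lookup. The boundary handling is an assumption: the ranges above overlap at 70 and 85, so this sketch treats 70 as green (matching the deploy-gate threshold of 70) and 85 as green rather than dark green.

```python
def badge_band(score: float) -> str:
    """Map a Trust Score (0-100) to the badge colour band."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score < 50:
        return "red"
    if score < 70:
        return "orange"
    if score <= 85:
        return "green"
    return "dark green"
```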
Tip for multi-tenant orgs: per repo use the tenant id of the project (DEV, STG, PROD separate) and put 3 badges side-by-side — instantly visible if staging performs differently from production.