DevOps — SLOs, supply chain, LLM observability

1. SLO + error budget per application

In the dashboard

  1. Sidebar → Apps → click your application → SLO tab
  2. ‘Define SLO’ button → target 0.999 + 30 day window
  3. Burn-down sparkline appears in the same tab
  4. ‘Embed badge’ button copies markdown for your README

Define an application and bind metrics to it:

Or via API (advanced — for automation)

curl -X POST https://app.monsys.ai/api/v1/apps \
-H "Authorization: Bearer $TOKEN" \
-d '{
"name": "checkout-api",
"type": "http",
"agent_id": "<agent_uuid_of_web_lb>",
"endpoint": "https://checkout.example.com/healthz",
"interval_seconds": 30
}'

Set SLO + window:

curl -X POST https://app.monsys.ai/api/v1/apps/<app_id>/slo \
-H "Authorization: Bearer $TOKEN" \
-d '{
"target": 0.999,
"window_days": 30,
"minimum_data": 0.95
}'

/operations/mttr shows per application:

  • Current rolling 30d uptime (% and minutes-down)
  • Error budget remaining (in minutes)
  • Burn-down sparkline (current burn rate vs target)
  • “Days until budget exhausted” if you keep burning at this week’s rate

For your sprint review: pull the SLO CSV via /api/v1/apps/<id>/slo/history?days=90. For your deploy gate, check whether burn rate was < 2.0 during the last hour.
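The error-budget figures above follow from simple arithmetic. The sketch below works them through for the 0.999 / 30-day SLO; the function names are illustrative and not part of the monsys API, and it assumes burn rate means "observed error rate divided by the rate the SLO allows".

```python
def error_budget_minutes(target: float, window_days: int) -> float:
    """Total downtime the SLO allows over the window, in minutes."""
    return (1.0 - target) * window_days * 24 * 60


def burn_rate(bad_minutes: float, period_minutes: float, target: float) -> float:
    """Observed error rate over a period divided by the allowed error rate.

    1.0 means the budget lasts exactly the window; 2.0 means it is spent twice as fast.
    """
    return (bad_minutes / period_minutes) / (1.0 - target)


def days_until_exhausted(remaining_minutes: float, rate: float, target: float) -> float:
    """Days left if the given burn rate continues unchanged."""
    if rate <= 0:
        return float("inf")
    minutes_burned_per_day = rate * (1.0 - target) * 24 * 60
    return remaining_minutes / minutes_burned_per_day


def deploy_allowed(last_hour_rate: float, limit: float = 2.0) -> bool:
    """Deploy gate from the text: allow only when the last hour burned below the limit."""
    return last_hour_rate < limit


budget = error_budget_minutes(0.999, 30)  # about 43.2 minutes
rate = burn_rate(bad_minutes=0.09, period_minutes=60, target=0.999)  # about 1.5
```

A target of 0.999 over 30 days therefore leaves roughly 43 minutes of downtime, which is why a single long incident can consume most of a month's budget.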


2. Slack/ntfy alert when a container restarts > 3× in an hour

In the dashboard

  1. Sidebar → Settings → Alert rules → ‘New rule’
  2. Metric: container_restart_count, operator >, threshold 3, window 1h
  3. Webhook URL field: https://hooks.slack.com/services/T0/B0/xxx
  4. WebhookDispatchWorker retries failed deliveries and never includes secrets in the payload

monsys.ai itself doesn’t post to Slack — it posts to ntfy (self-host) or via a webhook. Slack webhook example:

Or via API (advanced — for automation)

curl -X POST https://app.monsys.ai/api/v1/alert-rules \
-H "Authorization: Bearer $TOKEN" \
-d '{
"name": "Container flapping",
"metric": "container_restart_count",
"operator": ">",
"threshold": 3,
"window": "1h",
"severity": "warning",
"group_by": ["agent_id", "container_name"],
"webhook_url": "https://hooks.slack.com/services/T0/B0/xxx"
}'

WebhookDispatchWorker makes the outgoing POST with a minimal JSON body ({title, severity, agent, link}) that contains no prompt content or secrets. It retries 3× with backoff and logs each delivery to /audit-log with event_type webhook_delivery.
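To make the rule semantics concrete (more than 3 restarts within 1h, evaluated separately per agent_id and container_name), here is a minimal sliding-window sketch. It illustrates the rule only; it is not how monsys evaluates it, and the RestartRule class is hypothetical.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


class RestartRule:
    """Fires when one group sees more than `threshold` restarts within `window`."""

    def __init__(self, threshold: int = 3, window: timedelta = timedelta(hours=1)):
        self.threshold = threshold
        self.window = window
        # (agent_id, container_name) -> restart timestamps still inside the window
        self._events = defaultdict(deque)

    def record(self, agent_id: str, container_name: str, at: datetime) -> bool:
        """Record one restart; return True if the rule is breached for this group.

        Assumes restarts are recorded in chronological order.
        """
        events = self._events[(agent_id, container_name)]
        events.append(at)
        # Drop restarts that have fallen out of the window.
        while events and at - events[0] > self.window:
            events.popleft()
        return len(events) > self.threshold


rule = RestartRule()
start = datetime(2026, 1, 1, 12, 0)
# Four restarts of the same container, 10 minutes apart: only the fourth breaches.
fired = [rule.record("agent-1", "checkout", start + timedelta(minutes=10 * i)) for i in range(4)]
```

Grouping by (agent_id, container_name) keeps one noisy container from masking, or being masked by, restarts elsewhere on the same host.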


3. Auto-update CVE-vulnerable npm deps, but pin our DB driver

In the dashboard

  1. Sidebar → Recommendations → ‘Application CVEs’ tab
  2. Per vulnerable package: ‘Auto-update’ button or ‘3-dots → Suppress this fix’
  3. On suppression: enter expires_at + reason
  4. Future ‘Auto-update all’ runs skip the package and list the reason in the monthly evidence

Application dependency CVE scanning runs hourly. Open /recommendations → “Application CVEs”. Suppose you see 12 packages with available fixes.

Click “Auto-update all” — conditions:

  • TOTP required
  • Capped at 25 EATs per click (avoids fleet-wide blast)
  • Each host gets one EAT per (project, ecosystem), carrying the combined package list
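The grouping and cap described in the list above can be sketched as follows. The field names (host, project, ecosystem, package) and the plan_eats helper are hypothetical, and treating anything over the cap as "deferred" is an assumption; the source only states that a click is capped at 25 EATs.

```python
from collections import defaultdict

MAX_EATS_PER_CLICK = 25  # cap stated in the conditions above


def plan_eats(fixes: list[dict]) -> tuple[list[dict], int]:
    """Group fixes into one EAT per (host, project, ecosystem), then apply the cap.

    Returns (planned EATs, number of EATs left out because of the cap).
    """
    grouped = defaultdict(list)
    for fix in fixes:
        key = (fix["host"], fix["project"], fix["ecosystem"])
        grouped[key].append(fix["package"])

    eats = [
        {"host": host, "project": project, "ecosystem": ecosystem, "packages": sorted(packages)}
        for (host, project, ecosystem), packages in grouped.items()
    ]
    return eats[:MAX_EATS_PER_CLICK], max(0, len(eats) - MAX_EATS_PER_CLICK)


fixes = [
    {"host": "web-01", "project": "shop", "ecosystem": "npm", "package": "lodash"},
    {"host": "web-01", "project": "shop", "ecosystem": "npm", "package": "axios"},
    {"host": "web-02", "project": "shop", "ecosystem": "npm", "package": "lodash"},
]
planned, deferred = plan_eats(fixes)  # two EATs: one per host
```

Combining packages per (project, ecosystem) means a host runs one update per lockfile instead of one per package.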

Want to pin a package (e.g. mysql2@2.3.0 because 2.4.x has a breaking ABI change)? Add a suppression:

Or via API (advanced — for automation)

curl -X POST https://app.monsys.ai/api/v1/cve-suppressions \
-H "Authorization: Bearer $TOKEN" \
-d '{
"package_name": "mysql2",
"ecosystem": "npm",
"version_range": ">=2.4.0",
"reason": "ABI break — see PROD-123 ticket",
"expires_at": "2026-09-01"
}'

Future auto-update-all skips this upgrade AND records in evidence that it was skipped with the stated reason.
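As a rough illustration of how such a suppression could be matched against a candidate fix, the sketch below checks package, ecosystem, expiry and version range. It handles only the `>=X.Y.Z` form used in the example and plain numeric versions. Treating the suppression as still active on the expires_at date itself is an assumption. Neither helper is part of the monsys API.

```python
from datetime import date
from typing import Optional


def parse_version(version: str) -> tuple:
    """Parse a plain 'MAJOR.MINOR.PATCH' string (no pre-release tags)."""
    return tuple(int(part) for part in version.split("."))


def suppression_reason(package: str, ecosystem: str, fix_version: str,
                       suppressions: list, today: date) -> Optional[str]:
    """Return the stated reason if the fix should be skipped, else None."""
    for s in suppressions:
        if s["package_name"] != package or s["ecosystem"] != ecosystem:
            continue
        if date.fromisoformat(s["expires_at"]) < today:
            continue  # expired suppressions no longer apply
        if not s["version_range"].startswith(">="):
            continue  # only the '>=X.Y.Z' form is handled in this sketch
        lower_bound = parse_version(s["version_range"][2:])
        if parse_version(fix_version) >= lower_bound:
            return s["reason"]
    return None


suppressions = [{
    "package_name": "mysql2",
    "ecosystem": "npm",
    "version_range": ">=2.4.0",
    "reason": "ABI break, see PROD-123 ticket",
    "expires_at": "2026-09-01",
}]
reason = suppression_reason("mysql2", "npm", "2.4.1", suppressions, date(2026, 6, 1))
```

Once expires_at has passed, the same call returns None and the package becomes eligible for auto-update again, which is why the expiry is worth setting deliberately.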


4. Detect npm maintainer change (supply-chain attack indicator)

In the dashboard

  1. Sidebar → Inventory → Dependencies tab → filter ‘Maintainer changed (7d)’
  2. Per package: current + previous maintainer set
  3. Click suspect package → ‘Pin version’ or ‘Investigate’
  4. A hit automatically triggers a webhook to your incident tooling

Since schema 076, monsys tracks the set of publisher emails for each npm package you depend on. Any change to that set raises an alert.

Or via API (advanced — for automation)

SELECT package_name,
ecosystem,
maintainers,
deprecated,
last_maintainer_change_at,
seen_at
FROM supply_chain_maintainer_cache
WHERE tenant_id = $1::UUID
AND last_maintainer_change_at >= NOW() - INTERVAL '7 days'
ORDER BY last_maintainer_change_at DESC;

Workflow on a hit:

  1. Check whether the new maintainer is a legitimate org collaborator (often yes → accept the change in the UI)
  2. If not: pin the current version via cve_suppressions + open a ticket upstream
  3. Optionally trigger an IsolateNetwork EAT on staging hosts that already installed the package
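The first two workflow steps amount to a set comparison followed by an allowlist check. A minimal sketch, assuming you keep your own set of trusted collaborator emails (the trusted set and both function names are hypothetical):

```python
def maintainer_diff(previous: set, current: set) -> dict:
    """Compare two maintainer sets and report who was added or removed."""
    added = current - previous
    removed = previous - current
    return {
        "changed": bool(added or removed),
        "added": sorted(added),
        "removed": sorted(removed),
    }


def triage(previous: set, current: set, trusted: set) -> str:
    """Map a maintainer change onto the workflow: accept, or pin and investigate."""
    diff = maintainer_diff(previous, current)
    if not diff["changed"]:
        return "no_change"
    # A change that only removes people, or adds known collaborators, is accepted.
    if set(diff["added"]) <= trusted:
        return "accept"
    return "pin_and_investigate"


before = {"alice@example.org"}
after = {"alice@example.org", "mallory@example.net"}
decision = triage(before, after, trusted={"alice@example.org", "bob@example.org"})
```

Here the unknown address leads to "pin_and_investigate", matching step 2 of the workflow.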

5. Terraform-managed config vs drift detection

In the dashboard

  1. Sidebar → Settings → API tokens → generate scoped readonly token
  2. Sidebar → AI Insights → ‘Deploy gate template’ → copy bash snippet
  3. Paste in your CI yaml before the deploy step
  4. Test with --dry-run: shows Trust Score + open critical alerts

We assume Terraform is the source of truth for /etc/ssh/sshd_config. Drift = somebody made a manual change.

Or via API (advanced — for automation)

# After every successful terraform apply: bump the baseline
curl -X POST https://app.monsys.ai/api/v1/agents/<id>/config-hashes/rebase \
-H "Authorization: Bearer $TOKEN" \
-d '{
"paths": ["/etc/ssh/sshd_config", "/etc/nginx/nginx.conf"],
"reason": "terraform apply commit 7a3c…"
}'

Drift within 7 days of a baseline rebase, caused by anyone other than the Terraform service user, automatically opens a ticket in your issue tracker via WebhookDispatchWorker.
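Conceptually, drift detection compares a content hash per file against the baseline recorded at rebase time. The sketch below shows the idea on a temporary file; using SHA-256 and these helper names is an assumption, since the source does not say how monsys computes config hashes.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_files(paths: list) -> dict:
    """Map each path to the SHA-256 hex digest of its current contents."""
    return {p: hashlib.sha256(Path(p).read_bytes()).hexdigest() for p in paths}


def drifted(baseline: dict, paths: list) -> list:
    """Return the paths whose contents no longer match the baseline."""
    current = hash_files(paths)
    return [p for p in paths if current[p] != baseline.get(p)]


# Demo on a temporary file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "sshd_config"
    cfg.write_text("PermitRootLogin no\n")
    baseline = hash_files([str(cfg)])       # taken right after `terraform apply`
    clean = drifted(baseline, [str(cfg)])   # nothing changed yet
    cfg.write_text("PermitRootLogin yes\n")  # manual edit on the host
    changed = drifted(baseline, [str(cfg)])
```

Rebasing after every successful apply matters because otherwise each legitimate Terraform change would itself show up as drift.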


6. LLM cost per team via SDK

In the dashboard

  1. Sidebar → AI → ‘Per team’ tab
  2. Default grouping on ‘team’ tag from SDK
  3. Sort descending by cost or p95 latency
  4. ‘Export monthly CSV’ button for financial cross-charge

Or via API (advanced — for automation)

# In your Python service
import os

import openai
from monsys_ai import MonsysClient

client = MonsysClient(
    hub_url="https://hub.monsys.ai",
    api_key=os.environ["MONSYS_AI_TOKEN"],
    tenant="acme",
    team="checkout",  # → tag on every trace
)

# Wrap each LLM call in a trace so its token usage is attributed to the team.
with client.trace("price-lookup") as t:
    resp = openai.chat.completions.create(...)  # arguments elided
    t.add_llm_call(
        model=resp.model,
        prompt_tokens=resp.usage.prompt_tokens,
        completion_tokens=resp.usage.completion_tokens,
    )

/ai/quadrant shows a per-team breakdown: cost, p95 latency, top 3 models, refusal rate. Export it as CSV for monthly cross-charging.

PII redaction happens in the SDK, not at the hub. The agent never sees the raw prompt — only token counts and a SHA256 hash of the prompt for dedup.
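To illustrate what "redact in the SDK, send only a hash" can look like, here is a minimal sketch. The email regex stands in for whatever redaction the SDK really applies, and hashing the redacted text rather than the raw prompt is an assumption; neither detail is confirmed by the source.

```python
import hashlib
import re

# Deliberately simple pattern for the demo; real PII redaction covers far more.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(prompt: str) -> str:
    """Replace email addresses with a placeholder before anything leaves the process."""
    return EMAIL.sub("[EMAIL]", prompt)


def prompt_fingerprint(prompt: str) -> str:
    """SHA-256 hex digest of the redacted prompt, usable as a dedup key."""
    return hashlib.sha256(redact(prompt).encode("utf-8")).hexdigest()


fingerprint_a = prompt_fingerprint("Send the invoice to bob@example.com today")
fingerprint_b = prompt_fingerprint("Send the invoice to eve@example.org today")
```

Because redaction runs first, two prompts that differ only in the email address produce the same fingerprint, so they deduplicate without the address ever being hashed or transmitted.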


7. Deploy gate: block release when a critical alert is open

In the dashboard

  1. Sidebar → Webhooks → ‘Register endpoint’
  2. URL + shared HMAC secret + events filter (alert.created, alert.resolved)
  3. Min severity = warning + agent_filter on tag
  4. Retries 3× with backoff; failures land in audit_log

For your CI pipeline:

Or via API (advanced — for automation)

#!/usr/bin/env bash
# Run before `helm upgrade` / `kubectl apply` etc.
# Fail closed: if curl or jq fails, the script exits non-zero and the deploy is blocked.
set -euo pipefail

# Count open critical alerts on production-tagged agents
OPEN_CRITICAL=$(curl -fsS \
  "https://app.monsys.ai/api/v1/alerts?severity=critical&is_resolved=false&tag=production" \
  -H "Authorization: Bearer $MONSYS_TOKEN" | jq 'length')
if [[ "$OPEN_CRITICAL" -gt 0 ]]; then
  echo "✗ refusing deploy: $OPEN_CRITICAL critical alerts open on production"
  exit 1
fi

# Trust Score gate (floor so a fractional score still compares as an integer)
SCORE=$(curl -fsS "https://app.monsys.ai/api/v1/trust-score/v12/tenant" \
  -H "Authorization: Bearer $MONSYS_TOKEN" | jq '.final_score | floor')
if [[ "$SCORE" -lt 70 ]]; then
  echo "✗ refusing deploy: Trust Score $SCORE/100 below threshold 70"
  exit 1
fi
echo "✓ Trust Score $SCORE, no open critical alerts; proceeding"

This prevents the “ship now, look at alerts later” anti-pattern.


8. Webhook integration with your own incident-response tooling

In the dashboard

  1. Sidebar → Settings → ‘Embed badge’ (Trust Score panel)
  2. Copy SVG URL with your tenant_id
  3. Paste markdown in README or internal status page
  4. Updates every 30 min, no JavaScript, no tracking

WebhookDispatchWorker POSTs on every alert with severity ≥ warning. Register a custom endpoint:

Or via API (advanced — for automation)

curl -X POST https://app.monsys.ai/api/v1/webhooks \
-H "Authorization: Bearer $TOKEN" \
-d '{
"url": "https://incidents.acme.com/api/v1/intake",
"secret": "shared-hmac-secret",
"events": ["alert.created","alert.resolved"],
"min_severity": "warning",
"agent_filter": {"tag":"production"}
}'

Body shape:

{
"event": "alert.created",
"alert": {
"id": "",
"severity": "critical",
"category": "cpu",
"title": "Host CPU > 95% sustained 10min",
"agent": { "id": "", "hostname": "web-03", "tags": ["production"] }
},
"tenant_id": "",
"timestamp": "2026-05-19T20:34:00Z"
}

Headers:

  • X-Monsys-Signature: sha256=… (HMAC over body with secret)
  • X-Monsys-Delivery: <uuid> (idempotency key)
  • X-Monsys-Event: alert.created

Retries: 3× with exponential backoff (10s / 60s / 5min). Final failure lands in /audit-log as webhook_delivery_failed and your endpoint goes into a 10-min cooldown to avoid hot loops.
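On the receiving side, the headers above suggest two checks: verify the signature and ignore repeated deliveries. The sketch below assumes the signature is a hex-encoded HMAC-SHA256 over the raw request body, which matches the `sha256=…` shape but is not spelled out in the source; DeliveryLog is a hypothetical in-memory stand-in for a persistent store.

```python
import hashlib
import hmac


def verify_signature(secret: str, body: bytes, header: str) -> bool:
    """Check an 'X-Monsys-Signature: sha256=<hex>' value against the raw body."""
    scheme, _, received = header.partition("=")
    if scheme != "sha256" or not received:
        return False
    expected = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8"))


class DeliveryLog:
    """Remembers X-Monsys-Delivery ids so a retried delivery is processed once."""

    def __init__(self) -> None:
        self._seen = set()

    def first_time(self, delivery_id: str) -> bool:
        if delivery_id in self._seen:
            return False
        self._seen.add(delivery_id)
        return True


body = b'{"event": "alert.created"}'
header = "sha256=" + hmac.new(b"shared-hmac-secret", body, hashlib.sha256).hexdigest()
ok = verify_signature("shared-hmac-secret", body, header)
```

Verify against the exact bytes received, before any JSON parsing or re-serialisation, since even a whitespace change alters the digest.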


9. Embed Trust Score in your README/dashboard

Or via API (advanced — for automation)

![Trust Score](https://badge.monsys.ai/tenant/<tenant_id>/trust-score.svg)

Server-rendered SVG. Updates every 30 minutes (Trust Score worker cycle). The badge colour-codes by band: <50 red, 50-70 orange, 70-85 green, >85 dark green. No JavaScript, no tracking — rendered directly from TimescaleDB.
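If you want to mirror the badge colours in your own dashboard, the bands map to a small lookup. The listed ranges share their boundary values, so this sketch assumes a score of exactly 50 is orange, 70 is green, and 85 is still green; badge_colour is a hypothetical helper, not a monsys function.

```python
def badge_colour(score: float) -> str:
    """Map a Trust Score (0-100) to the colour band described above."""
    if score < 50:
        return "red"
    if score < 70:
        return "orange"
    if score <= 85:
        return "green"
    return "dark green"


colours = [badge_colour(s) for s in (42, 65, 70, 91)]
```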

Tip for multi-tenant orgs: in each repo, use the tenant id of the matching project (DEV, STG and PROD are separate tenants) and put the 3 badges side by side. A gap between staging and production is then visible at a glance.