Sysadmin — day-to-day
1. Monday morning — what changed over the weekend
In the dashboard
- Sidebar → Audit-log
- Filter: last 72h grouped by event_type
- Click any drift_event row to see operator + reason
- Unexplained drift? → agent → Inventory → Restore
Click path:
- /audit-log → filter created_at >= now() - INTERVAL '72 hours' → sort by event_type
- /agents → “Drift events” panel: all config hashes that changed outside an EAT window
- Per suspicious agent: /agents/<id> → Kernel tab → Reboot history: who rebooted, EAT-driven or manual
Equivalent in one SQL query:
Or via API (advanced — for automation)
    SELECT a.hostname,
           l.event_type,
           l.event_data->>'reason' AS reason,
           u.email AS actor,
           l.created_at
    FROM audit_log l
    LEFT JOIN agents a ON a.id = l.agent_id
    LEFT JOIN users u ON u.id = l.user_id
    WHERE l.tenant_id = $1::UUID
      AND l.created_at >= NOW() - INTERVAL '72 hours'
    ORDER BY l.created_at DESC;

What you typically do next: for every drift_event without a matching
emergency_token_issued row, check whether it was a legitimate
change. If not, go to /agents/<id> → Inventory tab → Config hashes → Restore,
or quarantine the host with an IsolateNetwork EAT.
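The matching step can be scripted once the query result is exported. The sketch below is illustrative only: it assumes each row is a dict with the `hostname`, `event_type` and `created_at` columns from the query, and it assumes that "matching" means a token was issued for the same host at most 30 minutes before the drift. That window (`MATCH_WINDOW`) is a hypothetical value, not something the product defines here.

```python
from datetime import datetime, timedelta

# Assumption: a drift_event is "explained" when an emergency_token_issued row
# exists for the same host no more than MATCH_WINDOW before the drift.
MATCH_WINDOW = timedelta(minutes=30)


def unexplained_drift(rows):
    """Return drift_event rows that have no matching emergency_token_issued row."""
    tokens = [r for r in rows if r["event_type"] == "emergency_token_issued"]
    unexplained = []
    for r in rows:
        if r["event_type"] != "drift_event":
            continue
        matched = any(
            t["hostname"] == r["hostname"]
            and timedelta(0) <= r["created_at"] - t["created_at"] <= MATCH_WINDOW
            for t in tokens
        )
        if not matched:
            unexplained.append(r)
    return unexplained
```

Every row this returns is a candidate for the Restore or IsolateNetwork step described above.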
2. Fleet-wide kernel update with no downtime
In the dashboard
- Sidebar → Kernel CVEs → Active batches tab
- ‘New batch’ button → tag selector + target kernel + reboot-strategy
- Enter TOTP → Start canary
- Watch canary → primary → completed in the same view
Say USN-2026-1234 affects 80 Ubuntu hosts in production. Target kernel
is 6.8.0-49.49.
Or via API (advanced — for automation)
    curl -X POST https://app.monsys.ai/api/v1/kernel-updates/batches \
      -H "Authorization: Bearer $TOKEN" \
      -H "X-TOTP-Code: 123456" \
      -H "Content-Type: application/json" \
      -d '{
        "title": "USN-2026-1234 - kernel 6.8.0-49.49",
        "target_kernel": "6.8.0-49.49",
        "package_manager": "apt",
        "reboot_strategy": "auto-at-window",
        "selector_kind": "tag",
        "selector_value": {"tag":"production"},
        "maintenance_window_id": "8c34…"
      }'

What happens behind the scenes:
- Hub picks 3 canary hosts (10% of 80 = 8, capped at 3)
- For each canary: Ed25519 EAT signed, pushed over WebSocket
- Agent shells out to /usr/local/sbin/monsys-kernel-update (only via the sudoers rule that allows this one wrapper)
- Wrapper installs kernel + headers, runs update-grub, writes /var/run/reboot-required and, if inside the maintenance window, runs shutdown -r +5
- Hub detects the running_release flip on next ingest → marks phase rebooted_new
- Only when all 3 canary hosts hit rebooted_new → primary fires (77 EATs in one tick)
- One canary reporting phase='failed' → batch aborted, no primary EATs sent
- Canary stuck for 2h with no transition → automatic abort (“canary timeout”)
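The canary rules in the list above can be summarised as a small decision function. This is a minimal sketch, not the Hub's actual code. The rounding rule (round up, at least one canary) is an assumption, since the text only gives the 80-host case, and the function names, the returned action strings and the `installing` phase used in the usage note are hypothetical.

```python
import math
from datetime import datetime, timedelta

CANARY_PERCENT = 10                  # from the text: 10% of the batch
CANARY_CAP = 3                       # from the text: capped at 3
CANARY_TIMEOUT = timedelta(hours=2)  # from the text: abort after 2h without a transition


def canary_count(total_hosts):
    """Number of canary hosts for a batch of total_hosts (assumed to be >= 1)."""
    # Assumption: round up and never go below one canary.
    wanted = math.ceil(total_hosts * CANARY_PERCENT / 100)
    return max(1, min(CANARY_CAP, wanted))


def next_action(canary_phases, last_transition, now):
    """Decide what the batch does next, given the phase of every canary host."""
    if any(p == "failed" for p in canary_phases):
        return "abort"                      # one failure stops the whole batch
    if all(p == "rebooted_new" for p in canary_phases):
        return "start_primary"              # every canary runs the new kernel
    if now - last_transition >= CANARY_TIMEOUT:
        return "abort_canary_timeout"
    return "wait"
```

For the example batch, `canary_count(80)` gives 3, which leaves 77 hosts for the primary phase. A call such as `next_action(["rebooted_new", "installing", "installing"], last, now)` keeps returning `"wait"` until either all canaries flip or two hours pass without a transition.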
Track progress: /kernel-cves → Active batches tab. Or:
    curl https://app.monsys.ai/api/v1/kernel-updates/batches/<id> \
      -H "Authorization: Bearer $TOKEN" | jq .members

3. One application is slow: where to look
In the dashboard
- Sidebar → Apps → click host web-03 → Metrics tab
- Processes tab for top CPU/RAM consumers
- Topology tab for blast-radius dependencies
- Logs tab for recent agent WARN/ERROR
Click path for host web-03:
- /agents/web-03 → Metrics tab: CPU/mem/load over last 6h
- /agents/web-03 → Processes tab: top 10 CPU/RAM consumers
- /agents/web-03 → Topology tab: which other agents/apps it depends on
- /agents/web-03 → Containers tab: restart_count per container
- /agents/web-03 → Logs tab: last 200 WARN/ERROR lines from the agent itself (not full syslog; that’s in Loki)
- /alerts → filter agent=web-03 AND is_resolved=false
Common root causes:
| Symptom | First check |
|---|---|
| CPU 100% for hours | Processes tab — which PID? Then /agents/web-03 → Logs |
| Memory slowly fills | Capacity tab — RAM trend + swap usage |
| App unresponsive | Containers tab — restart_count + Inventory → systemd_services for service state |
| DB errors | Topology tab — is the DB agent stale? Check last_seen_at |
4. Auto-reboot only on weekends
In the dashboard
- Sidebar → Settings → Maintenance windows → ‘New window’
- Configure recurrence: Saturday 02:00-06:00 tag-target ‘production’
- On kernel batch creation: reboot_strategy = auto-at-window
- Wrapper only runs shutdown -r +5 inside active window
Define a maintenance window first:
Or via API (advanced — for automation)
    curl -X POST https://app.monsys.ai/api/v1/maintenance-windows \
      -H "Authorization: Bearer $TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "title": "Weekend reboot window",
        "starts_at": "2026-05-23T02:00:00Z",
        "ends_at": "2026-05-23T06:00:00Z",
        "recurrence": "weekly",
        "target_type": "tag",
        "target_filter": {"tag":"production"},
        "silence_categories": []
      }'

When creating a kernel update batch: set reboot_strategy=auto-at-window.
The monsys-kernel-update wrapper only runs shutdown -r +5 when the
host is inside an active window; outside that, it stays at
“reboot required” and waits for an operator confirmation.
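To make the "inside an active window" check concrete, here is a rough sketch of how a weekly recurring window could be evaluated. It assumes that `recurrence: weekly` repeats the first window exactly every 7 days in UTC; the function names are hypothetical and the real wrapper may decide differently.

```python
from datetime import datetime, timedelta, timezone


def in_weekly_window(now, starts_at, ends_at):
    """True if now falls inside the first window or any later weekly repeat."""
    if now < starts_at:
        return False
    week = timedelta(days=7)
    # Position of `now` within the 7-day cycle that began at starts_at.
    offset = (now - starts_at) % week
    return offset < (ends_at - starts_at)


def reboot_decision(now, starts_at, ends_at):
    """Mirror the described behaviour: schedule the reboot only inside the window."""
    if in_weekly_window(now, starts_at, ends_at):
        return "shutdown -r +5"
    return "reboot required"
```

With the window from the request above (Saturday 02:00 to 06:00 UTC), a host that finishes installing on a Wednesday stays at "reboot required", while the same host on the following Saturday at 03:00 UTC gets the delayed reboot.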
5. Config drift after a migration
In the dashboard
- Sidebar → Agents → click host → Inventory tab → Config hashes section
- Table shows sha256 current vs baseline per watched path
- Per drift event: ‘Accept as baseline’ OR ‘Investigate’
- Drift without linked_eat_id = unexplained, requires action
Per agent you see sha256 of all watched paths (default: /etc/passwd,
/etc/shadow, /etc/sudoers, /etc/ssh/sshd_config + 4 others).
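As an illustration of what "current vs baseline" means, the following sketch hashes a set of files and reports the ones whose sha256 differs from a stored baseline. It is a simplified stand-in and not the agent's implementation; the `baseline` dict shape and the function names are assumptions, and the output keys borrow the `prev_sha256` / `new_sha256` field names used by the drift-events API.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hex sha256 of a file, read in chunks so large files are fine."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def detect_drift(baseline):
    """Compare current hashes to a {path: sha256} baseline; return changed paths."""
    drifted = {}
    for path, expected in baseline.items():
        current = sha256_of(path)  # raises FileNotFoundError if the file is gone
        if current != expected:
            drifted[path] = {"prev_sha256": expected, "new_sha256": current}
    return drifted
```

Reading files such as /etc/shadow requires root, so a script like this only makes sense under the same privileges as the agent.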
Or via API (advanced — for automation)
    curl 'https://app.monsys.ai/api/v1/agents/<id>/drift-events?limit=50' \
      -H "Authorization: Bearer $TOKEN"

For each drift event you see:
- path: which file
- prev_sha256 / new_sha256
- detected_at
- linked_eat_id: null means nobody authorised this with an EAT; this is unexplained drift
Workflow:
- Legitimate change (manual patch, migration): POST /api/v1/agents/<id>/drift-events/<id>/accept (baseline updates)
- Suspicious: POST /api/v1/agents/<id>/emergency with action IsolateNetwork + TOTP, then inspect manually
6. Find every Ubuntu 20.04 server (end-of-standard-support 2025-04)
In the dashboard
- Sidebar → Agents → filter os_name=Ubuntu + os_version~20.04
- Bulk-select shown hosts
- ‘Add tag’ button → tag = ‘eol-2025’
- Sidebar → Settings → Alert rules → new with threshold-date 2025-04-30
Or via API (advanced — for automation)
    curl 'https://app.monsys.ai/api/v1/agents?os_name=Ubuntu&os_version_like=20.04' \
      -H "Authorization: Bearer $TOKEN" | jq '.[] | .hostname'

Or in SQL:

    SELECT hostname, ip_addresses, last_seen_at
    FROM agents
    WHERE tenant_id = $1::UUID
      AND os_name = 'Ubuntu'
      AND os_version LIKE '20.04%'
      AND is_active = true
    ORDER BY last_seen_at DESC;

Typically: tag these hosts with eol-2025 via bulk-tagging in
/groups, then build an SLA alert that fires whenever any agent still
carries that tag after 2025-04-30.
7. “Why did I get 438 CPU alerts last night?”
In the dashboard
- Sidebar → Alerts → ‘Grouped by title’ tab
- Sort descending by count
- Click biggest group — does the title contain a value?
- Settings → Alert rules → edit the rule so the value goes into description
Since the alert-storm fix, InsertAlert does 30-minute dedup per
(tenant, agent, category, title). If you still see a storm: most
likely a changing title (e.g. count embedded) defeating the dedup
key.
Quick check:
Or via API (advanced — for automation)
    SELECT title, COUNT(*), MIN(created_at), MAX(created_at)
    FROM alerts
    WHERE tenant_id = $1::UUID
      AND created_at >= NOW() - INTERVAL '24 hours'
    GROUP BY title
    ORDER BY 2 DESC
    LIMIT 10;

If one title appears 100+ times → bug somewhere. If 100 unique titles
1× each → title contains a value (count, percentage, timestamp). Fix
the emitting worker to move the value to description.
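A toy model shows why a value in the title defeats the dedup key. The class below is an illustrative stand-in for the dedup step, not the real InsertAlert. It uses the key (tenant, agent, category, title) and the 30-minute window from the text, and it assumes the window is measured from the last inserted alert.

```python
from datetime import datetime, timedelta

DEDUP_WINDOW = timedelta(minutes=30)  # from the text


class AlertDeduper:
    """Hypothetical in-memory version of 30-minute alert dedup."""

    def __init__(self):
        self._last_inserted = {}  # dedup key -> time of the last inserted alert

    def should_insert(self, tenant, agent, category, title, now):
        key = (tenant, agent, category, title)
        last = self._last_inserted.get(key)
        if last is not None and now - last < DEDUP_WINDOW:
            return False  # same key inside the window: suppressed
        self._last_inserted[key] = now
        return True
```

Ten alerts titled "CPU high" one minute apart produce a single insert, whereas ten alerts titled "CPU at 90%", "CPU at 91%" and so on produce ten, because each title forms a new key. Moving the percentage into the description keeps the key stable.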
8. Employee left — what did they have access to
In the dashboard
- Sidebar → Identity surface → search alice@
- See linked systems (hosts, SSH keys, Copilot/OpenAI seats)
- Per system ‘Revoke’ button OR one ‘revoke-user’ playbook EAT across all hosts
- Audit-log shows ‘identity_revoked’ per executed revocation
/identity/surface → search "alice@…" shows:
- Dashboard account (roles per tenant)
- Local users on which hosts (inventory_users.username='alice')
- SSH keys where public-key fingerprint matched
- GitHub Copilot seat (if Copilot Audit module enabled)
- OpenAI org member (if OpenAI Audit module enabled)
- Sudo rights (inventory_sudo_rules.who='alice')
Bulk-revoke from the Identity surface page: either per system
(POST /api/v1/identity/persons/<id>/revoke), or as a single
RunPlaybook EAT that handles all hosts at once.
9. On-call rotation, timezone-aware
In the dashboard
- Sidebar → Settings → On-call rotations → ‘New rotation’
- For team eu-ops: shifts Mon-Fri 09:00-18:00 Europe/Brussels
- Add buddy fallback for out-of-hours
- NotifyWorker resolves which shift is open per alert and routes to ntfy
Or via API (advanced — for automation)
    -- For team eu-ops, 09:00-18:00 Brussels time, Mon-Fri, with buddy fallback
    INSERT INTO oncall_shifts
      (tenant_id, group_id, person_id, weekday, start_time, end_time, timezone)
    VALUES
      ($1, $2, $alice, 1, '09:00', '18:00', 'Europe/Brussels'),
      ($1, $2, $alice, 2, '09:00', '18:00', 'Europe/Brussels'),
      -- … and
      ($1, $2, $bob, 1, '18:00', '09:00', 'Europe/Brussels'); -- buddy

NotifyWorker resolves per alert which shift is open at
alerts.created_at, then routes to the right ntfy topic + email.
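The shift lookup can be sketched as follows. This is not NotifyWorker's code; it only illustrates one way to resolve a UTC alert timestamp against shifts stored in local time. Two assumptions are baked in and may not match the product: `weekday` follows ISO numbering (1 = Monday), and a shift whose end is not later than its start (the buddy row, 18:00 to 09:00) runs past midnight into the next day. The `Shift` class and `on_call` function are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, time, timezone
from zoneinfo import ZoneInfo


@dataclass
class Shift:
    person: str
    weekday: int  # assumption: ISO weekday, 1 = Monday ... 7 = Sunday
    start: time
    end: time     # end <= start is treated as an overnight shift
    tz: str       # IANA timezone name, e.g. "Europe/Brussels"


def on_call(shifts, created_at):
    """Return the people whose shift covers the timezone-aware created_at."""
    matches = []
    for s in shifts:
        local = created_at.astimezone(ZoneInfo(s.tz))
        now, day = local.time(), local.isoweekday()
        if s.start < s.end:
            hit = day == s.weekday and s.start <= now < s.end
        else:
            # Overnight: evening part on s.weekday, morning part on the next day.
            next_day = s.weekday % 7 + 1
            hit = (day == s.weekday and now >= s.start) or (
                day == next_day and now < s.end
            )
        if hit:
            matches.append(s.person)
    return matches
```

Converting to the shift's own timezone before comparing is what makes the lookup survive daylight-saving changes: 07:30 UTC is 09:30 in Brussels in summer but 08:30 in winter, so the same UTC time can fall inside or outside the day shift depending on the date.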
10. Are production DB backups actually still working?
In the dashboard
- Sidebar → Inventory → Backups tab → filter tag ‘production-db’
- Sort ‘Last successful’ ascending
- Per stale host (>25h): click ‘Create alert rule’
- Threshold = 25h + severity = critical
Or via API (advanced — for automation)
    SELECT a.hostname,
           b.tool AS backup_tool,
           b.destination AS target,
           b.last_successful_run AS last_ok,
           NOW() - b.last_successful_run AS age,
           b.last_failure_message
    FROM backup_configs b
    JOIN agents a ON a.id = b.agent_id
    WHERE b.tenant_id = $1::UUID
      AND 'production-db' = ANY(a.tags)
    ORDER BY b.last_successful_run NULLS FIRST;

Alert rule if age > 25h (default schedule = daily):
    curl -X POST https://app.monsys.ai/api/v1/alert-rules \
      -H "Authorization: Bearer $TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "name": "Backup stale > 25h on production-db",
        "tag": "production-db",
        "category": "backup",
        "metric": "backup_age_hours",
        "operator": ">",
        "threshold": 25,
        "severity": "critical"
      }'
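For an ad-hoc check outside the alerting pipeline, the same 25-hour rule can be applied to the rows returned by the SQL query above. The sketch assumes each row is a dict with the `hostname` and `last_ok` columns from that query, and it treats a backup that never succeeded (`last_ok` is NULL) as stale, which matches the NULLS FIRST ordering but is otherwise an assumption.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=25)  # threshold from the text


def stale_backups(rows, now):
    """Return hostnames whose last successful backup is missing or older than 25h."""
    stale = []
    for row in rows:
        last_ok = row["last_ok"]
        if last_ok is None or now - last_ok > STALE_AFTER:
            stale.append(row["hostname"])
    return stale
```

A cron job could run the query, pass the rows to `stale_backups`, and exit non-zero when the returned list is not empty.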