Sysadmin — day-to-day

1. Monday morning — what changed over the weekend

In the dashboard

Sidebar → Audit-log
Filter: last 72h grouped by event_type
Click any drift_event row to see operator + reason
Unexplained drift? → agent → Inventory → Restore

Click path:

/audit-log → filter created_at >= now() - INTERVAL '72 hours' → sort by event_type
/agents → “Drift events” panel — all config hashes that changed outside an EAT window
Per suspicious agent: /agents/<id> → Kernel tab → Reboot history — who rebooted, EAT-driven or manual

Equivalent in one SQL query:

Or via API (advanced — for automation)

SELECT a.hostname, l.event_type, l.event_data->>'reason' AS reason,
       u.email AS actor, l.created_at
  FROM audit_log l
  LEFT JOIN agents a ON a.id = l.agent_id
  LEFT JOIN users  u ON u.id = l.user_id
 WHERE l.tenant_id = $1::UUID
   AND l.created_at >= NOW() - INTERVAL '72 hours'
 ORDER BY l.created_at DESC;

What you typically do next: for every drift_event without a matching emergency_token_issued row, check whether the change was legitimate. If it was not, either restore the file via /agents/<id> → Inventory tab → Config hashes → Restore, or quarantine the host with an IsolateNetwork EAT.

2. Fleet-wide kernel update with no downtime

In the dashboard

Sidebar → Kernel CVEs → Active batches tab
‘New batch’ button → tag selector + target kernel + reboot-strategy
Enter TOTP → Start canary
Watch canary → primary → completed in the same view

Say USN-2026-1234 affects 80 Ubuntu hosts in production. Target kernel is 6.8.0-49.49.

Or via API (advanced — for automation)

curl -X POST https://app.monsys.ai/api/v1/kernel-updates/batches \
  -H "Authorization: Bearer $TOKEN" \
  -H "X-TOTP-Code: 123456" \
  -H "Content-Type: application/json" \
  -d '{
    "title":           "USN-2026-1234 — kernel 6.8.0-49.49",
    "target_kernel":   "6.8.0-49.49",
    "package_manager": "apt",
    "reboot_strategy": "auto-at-window",
    "selector_kind":   "tag",
    "selector_value":  {"tag":"production"},
    "maintenance_window_id": "8c34…"
  }'

What happens behind the scenes:

Hub picks 3 canary hosts (10% of 80 = 8, capped at 3)
For each canary: Ed25519 EAT signed, pushed over WebSocket
Agent shells out to /usr/local/sbin/monsys-kernel-update (only via the sudoers rule that allows this one wrapper)
Wrapper installs kernel + headers, runs update-grub, writes /var/run/reboot-required and, if the host is inside the maintenance window, runs shutdown -r +5
Hub detects running_release flip on next ingest → marks phase rebooted_new
Only when all 3 canary hosts hit rebooted_new → primary fires (77 EATs in one tick)
One canary reporting phase='failed' → batch aborted, no primary EATs sent
Canary stuck for 2h with no transition → automatic abort (“canary timeout”)
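
The canary sizing and gating rules above can be sketched in a few lines of Python. This is a minimal model for reasoning about a batch, not the Hub's implementation: the function names, the returned reason strings, rounding the 10% down, and the floor of one canary for small batches are all assumptions. Only the 10% share, the cap of 3, the rebooted_new and failed phases, and the 2h timeout come from the text.

```python
from datetime import timedelta

CANARY_PERCENT = 10                  # from the text: 10% of the batch
CANARY_CAP = 3                       # from the text: capped at 3
CANARY_TIMEOUT = timedelta(hours=2)  # from the text: 2h without a transition


def canary_count(total_hosts: int) -> int:
    """Number of canary hosts for a batch (rounding down is an assumption)."""
    if total_hosts <= 0:
        return 0
    share = total_hosts * CANARY_PERCENT // 100
    # Assumption: a small batch still gets at least one canary.
    return min(CANARY_CAP, max(1, share))


def next_batch_state(canary_phases, since_last_transition):
    """Decide what the batch does next, as a (state, reason) tuple.

    canary_phases: one phase string per canary host.
    since_last_transition: timedelta since any canary last changed phase.
    """
    if any(phase == "failed" for phase in canary_phases):
        return ("aborted", "canary failed")
    if canary_phases and all(p == "rebooted_new" for p in canary_phases):
        return ("primary", None)
    if since_last_transition >= CANARY_TIMEOUT:
        return ("aborted", "canary timeout")
    return ("canary", None)
```

For the 80-host example, canary_count(80) gives 3, which leaves 77 hosts for the primary phase.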

Track progress: /kernel-cves → Active batches tab. Or:

curl "https://app.monsys.ai/api/v1/kernel-updates/batches/<id>" \
  -H "Authorization: Bearer $TOKEN" | jq .members

3. One application is slow — where to look

In the dashboard

Sidebar → Apps → click host web-03 → Metrics tab
Processes tab for top CPU/RAM consumers
Topology tab for blast-radius dependencies
Logs tab for recent agent WARN/ERROR

Click path for host web-03:

/agents/web-03 → Metrics tab — CPU/mem/load over last 6h
/agents/web-03 → Processes tab — top 10 CPU/RAM consumers
/agents/web-03 → Topology tab — which other agents/apps it depends on
/agents/web-03 → Containers tab — restart_count per container
/agents/web-03 → Logs tab — last 200 WARN/ERROR lines from the agent itself (not full syslog — that’s in Loki)
/alerts filter agent=web-03 AND is_resolved=false

Common root causes:

Symptom	First check
CPU 100% for hours	`Processes tab` — which PID? Then `/agents/web-03 → Logs`
Memory slowly fills	`Capacity tab` — RAM trend + swap usage
App unresponsive	`Containers tab` — restart_count + `Inventory → systemd_services` for service state
DB errors	`Topology tab` — is the DB agent stale? Check `last_seen_at`

4. Auto-reboot only on weekends

In the dashboard

Sidebar → Settings → Maintenance windows → ‘New window’
Configure recurrence: Saturday 02:00-06:00 tag-target ‘production’
On kernel batch creation: reboot_strategy = auto-at-window
Wrapper only runs shutdown -r +5 inside active window

Define a maintenance window first:

Or via API (advanced — for automation)

curl -X POST https://app.monsys.ai/api/v1/maintenance-windows \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "title":             "Weekend reboot window",
    "starts_at":         "2026-05-23T02:00:00Z",
    "ends_at":           "2026-05-23T06:00:00Z",
    "recurrence":        "weekly",
    "target_type":       "tag",
    "target_filter":     {"tag":"production"},
    "silence_categories": []
  }'

When creating a kernel update batch: set reboot_strategy=auto-at-window. The monsys-kernel-update wrapper only runs shutdown -r +5 when the host is inside an active window; outside that, it stays at “reboot required” and waits for an operator confirmation.
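
To check ahead of time whether a given moment falls inside the weekly window, the recurrence can be modelled as below. This is a sketch under assumptions: the helper name in_weekly_window is made up, and it assumes that "weekly" means the starts_at/ends_at pair repeats every 7 days at the same UTC instants, which the text does not spell out.

```python
from datetime import datetime, timedelta, timezone

WEEK = timedelta(days=7)


def in_weekly_window(now: datetime, starts_at: datetime, ends_at: datetime) -> bool:
    """True if `now` is inside the first window or any weekly repeat of it."""
    if now < starts_at:
        return False  # the recurrence has not started yet
    # Position of `now` within the current 7-day cycle.
    offset = (now - starts_at) % WEEK
    return offset < (ends_at - starts_at)


if __name__ == "__main__":
    start = datetime(2026, 5, 23, 2, 0, tzinfo=timezone.utc)  # a Saturday
    end = datetime(2026, 5, 23, 6, 0, tzinfo=timezone.utc)
    probe = datetime(2026, 5, 30, 3, 0, tzinfo=timezone.utc)  # next Saturday
    print(in_weekly_window(probe, start, end))
```

Because the window in the example is defined in UTC, the local wall-clock time of the window shifts by an hour when the host's timezone changes to or from daylight saving time.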

5. Config drift after a migration

In the dashboard

Sidebar → Agents → click host → Inventory tab → Config hashes section
Table shows sha256 current vs baseline per watched path
Per drift event: ‘Accept as baseline’ OR ‘Investigate’
Drift without linked_eat_id = unexplained, requires action

Per agent you see sha256 of all watched paths (default: /etc/passwd, /etc/shadow, /etc/sudoers, /etc/ssh/sshd_config + 4 others).

Or via API (advanced — for automation)

curl "https://app.monsys.ai/api/v1/agents/<id>/drift-events?limit=50" \
  -H "Authorization: Bearer $TOKEN"

For each drift event you see:

path — which file
prev_sha256 / new_sha256
detected_at
linked_eat_id — null means: nobody authorised this with an EAT, this is unexplained drift

Workflow:

Legitimate change (manual patch, migration): POST /api/v1/agents/<id>/drift-events/<event-id>/accept, which updates the baseline
Suspicious: POST /api/v1/agents/<id>/emergency with action IsolateNetwork + TOTP, then inspect the host manually
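
For automation, the first triage step (separating explained from unexplained drift) might look like the sketch below. It assumes the drift-events endpoint returns a JSON array of objects carrying the fields listed above, and that detected_at is an ISO 8601 string; the response shape is not confirmed by this page, and triage_drift_events is a hypothetical helper name.

```python
def triage_drift_events(events):
    """Split drift events into (explained, unexplained).

    An event counts as unexplained when linked_eat_id is missing or null,
    meaning no EAT authorised the change.
    """
    explained, unexplained = [], []
    for event in events:
        if event.get("linked_eat_id") is None:
            unexplained.append(event)
        else:
            explained.append(event)
    # Newest unexplained drift first. Assumption: detected_at is an
    # ISO 8601 string in one timezone, so string order is time order.
    unexplained.sort(key=lambda e: e.get("detected_at", ""), reverse=True)
    return explained, unexplained
```

Each entry in the unexplained list then goes through the accept-or-isolate decision from the workflow above.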

6. Find every Ubuntu 20.04 server (end-of-standard-support 2025-04)

In the dashboard

Sidebar → Agents → filter os_name=Ubuntu + os_version~20.04
Bulk-select shown hosts
‘Add tag’ button → tag = ‘eol-2025’
Sidebar → Settings → Alert rules → new with threshold-date 2025-04-30

Or via API (advanced — for automation)

curl 'https://app.monsys.ai/api/v1/agents?os_name=Ubuntu&os_version_like=20.04' \
  -H "Authorization: Bearer $TOKEN" | jq '.[] | .hostname'

Or in SQL:

SELECT hostname, ip_addresses, last_seen_at
  FROM agents
 WHERE tenant_id = $1::UUID
   AND os_name = 'Ubuntu'
   AND os_version LIKE '20.04%'
   AND is_active = true
 ORDER BY last_seen_at DESC;

Typically: tag these hosts with eol-2025 via bulk-tagging in /groups, then build an SLA alert that fires whenever any agent still carries that tag after 2025-04-30.

7. “Why did I get 438 CPU alerts last night?”

In the dashboard

Sidebar → Alerts → ‘Grouped by title’ tab
Sort descending by count
Click biggest group — does the title contain a value?
Settings → Alert rules → edit the rule so the value goes into description

Since the alert-storm fix, InsertAlert does 30-minute dedup per (tenant, agent, category, title). If you still see a storm, the most likely cause is a title that changes from alert to alert (e.g. an embedded count), which defeats the dedup key.

Quick check:

Or via API (advanced — for automation)

SELECT title, COUNT(*), MIN(created_at), MAX(created_at)
  FROM alerts
 WHERE tenant_id = $1::UUID
   AND created_at >= NOW() - INTERVAL '24 hours'
 GROUP BY title
 ORDER BY 2 DESC LIMIT 10;

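If one title appears 100+ times → with a 30-minute dedup a single agent can produce at most 48 rows per title in 24 hours, so either the same alert fires on many agents or the dedup is not being applied; break the count down per agent. If you instead see around 100 unique titles at 1× each → the title contains a value (count, percentage, timestamp). Fix the emitting worker to move the value to description.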
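
The effect of a changing title on the dedup key can be reproduced with a toy in-memory model. This is not the actual InsertAlert code: the AlertDeduper class, its insert method, and measuring the 30 minutes from the last stored alert are assumptions for illustration. Only the key (tenant, agent, category, title) and the 30-minute window come from the text.

```python
from datetime import datetime, timedelta

DEDUP_WINDOW = timedelta(minutes=30)


class AlertDeduper:
    """Toy model of a 30-minute dedup on (tenant, agent, category, title)."""

    def __init__(self):
        self._last_stored = {}

    def insert(self, tenant, agent, category, title, now):
        """Return True if the alert is stored, False if it is deduplicated."""
        key = (tenant, agent, category, title)
        last = self._last_stored.get(key)
        if last is not None and now - last < DEDUP_WINDOW:
            return False
        self._last_stored[key] = now
        return True


def count_stored(titles, start, step=timedelta(minutes=1)):
    """Feed one alert per step for a single agent and count stored rows."""
    deduper = AlertDeduper()
    return sum(
        deduper.insert("tenant-1", "web-03", "cpu", title, start + i * step)
        for i, title in enumerate(titles)
    )
```

Ten alerts one minute apart with a fixed title such as "CPU usage high" collapse into one stored row, while ten titles of the form "CPU at 91%", "CPU at 92%" and so on produce ten rows, because each title forms its own key.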

8. Employee left — what did they have access to

In the dashboard

Sidebar → Identity surface → search alice@
See linked systems (hosts, SSH keys, Copilot/OpenAI seats)
Per system ‘Revoke’ button OR one ‘revoke-user’ playbook EAT across all hosts
Audit-log shows ‘identity_revoked’ per executed revocation

/identity/surface → search "alice@…" shows:

Dashboard account (roles per tenant)
Local users on which hosts (inventory_users.username='alice')
SSH keys where public-key fingerprint matched
GitHub Copilot seat (if Copilot Audit module enabled)
OpenAI org member (if OpenAI Audit module enabled)
Sudo rights (inventory_sudo_rules.who='alice')

Bulk-revoke from the Identity surface page: either per system (POST /api/v1/identity/persons/<id>/revoke), or as a single RunPlaybook EAT that handles all hosts at once.

9. On-call rotation, timezone-aware

In the dashboard

Sidebar → Settings → On-call rotations → ‘New rotation’
For team eu-ops: shifts Mon-Fri 09:00-18:00 Europe/Brussels
Add buddy fallback for out-of-hours
NotifyWorker resolves which shift is open per alert and routes to ntfy

Or via API (advanced — for automation)

-- For team eu-ops, 09:00-18:00 Brussels time, Mon-Fri, with buddy fallback
-- Parameters: $1 = tenant_id, $2 = group_id, $3 = alice's person_id, $4 = bob's person_id
INSERT INTO oncall_shifts (tenant_id, group_id, person_id, weekday,
                           start_time, end_time, timezone)
VALUES
  ($1, $2, $3, 1, '09:00', '18:00', 'Europe/Brussels'),
  ($1, $2, $3, 2, '09:00', '18:00', 'Europe/Brussels'),
  -- … and
  ($1, $2, $4, 1, '18:00', '09:00', 'Europe/Brussels'); -- buddy, shift wraps past midnight

NotifyWorker resolves per alert which shift is open at alerts.created_at, then routes to the right ntfy topic + email.
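
How a timezone-aware lookup of the open shift could work is sketched below. This is not NotifyWorker's code. The Shift dataclass and on_call function are hypothetical, weekday numbering is assumed to be ISO (1 = Monday), and a shift whose end is not later than its start is assumed to wrap past midnight into the next day, as the buddy row suggests.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo


@dataclass
class Shift:
    person: str
    weekday: int  # assumption: ISO numbering, 1 = Monday ... 7 = Sunday
    start: time
    end: time     # end <= start means the shift wraps past midnight
    tz: str       # IANA zone name, e.g. "Europe/Brussels"


def on_call(shifts, alert_utc, tz_lookup=ZoneInfo):
    """Return the person whose shift covers alert_utc, or None.

    alert_utc must be timezone-aware. tz_lookup turns a zone name into a
    tzinfo object; it is a parameter so it can be swapped in tests.
    """
    for shift in shifts:
        local = alert_utc.astimezone(tz_lookup(shift.tz))
        today, now = local.isoweekday(), local.time()
        yesterday = 7 if today == 1 else today - 1
        if shift.start < shift.end:
            # Same-day shift.
            if shift.weekday == today and shift.start <= now < shift.end:
                return shift.person
        else:
            # Overnight shift: evening part on its own weekday,
            # morning part on the following day.
            if shift.weekday == today and now >= shift.start:
                return shift.person
            if shift.weekday == yesterday and now < shift.end:
                return shift.person
    return None
```

Converting the alert timestamp into each shift's own timezone before comparing keeps the 09:00-18:00 boundaries at Brussels wall-clock time across daylight saving changes. ZoneInfo needs the system timezone database (or the tzdata package) to be present.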

10. Are production DB backups actually still working?

In the dashboard

Sidebar → Inventory → Backups tab → filter tag ‘production-db’
Sort ‘Last successful’ ascending
Per stale host (>25h): click ‘Create alert rule’
Threshold = 25h + severity = critical

Or via API (advanced — for automation)

SELECT a.hostname,
       b.tool                            AS backup_tool,
       b.destination                     AS target,
       b.last_successful_run             AS last_ok,
       NOW() - b.last_successful_run     AS age,
       b.last_failure_message
  FROM backup_configs b
  JOIN agents a ON a.id = b.agent_id
 WHERE b.tenant_id = $1::UUID
   AND 'production-db' = ANY(a.tags)
 ORDER BY b.last_successful_run NULLS FIRST;

Alert rule if age > 25h (default schedule = daily):

curl -X POST https://app.monsys.ai/api/v1/alert-rules \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name":     "Backup stale > 25h on production-db",
    "tag":      "production-db",
    "category": "backup",
    "metric":   "backup_age_hours",
    "operator": ">",
    "threshold": 25,
    "severity": "critical"
  }'
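
One detail worth checking when scripting against the SQL result: a backup that has never succeeded has a NULL last_successful_run, so its age is NULL and a plain "age > 25h" comparison does not match it. Whether the backup_age_hours metric treats that case as stale is not stated here. The sketch below, with a hypothetical stale_backups helper, flags both cases explicitly.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=25)  # daily schedule plus one hour of slack


def stale_backups(rows, now):
    """Return (hostname, age) for every backup that needs attention.

    rows: iterable of (hostname, last_successful_run), where the timestamp
    is a timezone-aware datetime or None. age is None for a backup that
    has never completed successfully.
    """
    flagged = []
    for hostname, last_ok in rows:
        if last_ok is None:
            flagged.append((hostname, None))  # never succeeded
        elif now - last_ok > STALE_AFTER:
            flagged.append((hostname, now - last_ok))
    return flagged
```

The rows can come straight from the first two relevant columns of the query above (hostname and last_ok).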