Sysadmin — day-to-day

1. Monday morning — what changed over the weekend

In the dashboard

  1. Sidebar → Audit-log
  2. Filter: last 72h grouped by event_type
  3. Click any drift_event row to see operator + reason
  4. Unexplained drift? → agent → Inventory → Restore

Click path:

  1. /audit-log → filter created_at >= now() - INTERVAL '72 hours' → sort by event_type
  2. /agents → “Drift events” panel — all config hashes that changed outside an EAT window
  3. Per suspicious agent: /agents/<id> → Kernel tab → Reboot history — who rebooted, EAT-driven or manual

Or via API (advanced — for automation)

Equivalent in one SQL query:

SELECT a.hostname, l.event_type, l.event_data->>'reason' AS reason,
u.email AS actor, l.created_at
FROM audit_log l
LEFT JOIN agents a ON a.id = l.agent_id
LEFT JOIN users u ON u.id = l.user_id
WHERE l.tenant_id = $1::UUID
AND l.created_at >= NOW() - INTERVAL '72 hours'
ORDER BY l.created_at DESC;

What you typically do next: for every drift_event without a matching emergency_token_issued row → check whether it was a legitimate change. If not → /agents/<id> → Inventory tab → Config hashes → Restore or quarantine the host with an IsolateNetwork EAT.
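The pairing rule above can be sketched in Python. This is a minimal sketch under stated assumptions, not the product's own logic: it assumes audit rows arrive as dicts with agent_id, event_type and created_at, and it treats an emergency_token_issued row as "matching" when it is for the same agent and was issued at most one hour before the drift. The one-hour window and the function name are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical matching window: an EAT issued up to 1 hour before the drift
# counts as its explanation. This value is an assumption, not a product setting.
MATCH_WINDOW = timedelta(hours=1)


def unexplained_drift(rows):
    """Return drift_event rows that have no emergency_token_issued row for
    the same agent within MATCH_WINDOW before the drift was recorded."""
    tokens = [r for r in rows if r["event_type"] == "emergency_token_issued"]
    result = []
    for row in rows:
        if row["event_type"] != "drift_event":
            continue
        explained = any(
            t["agent_id"] == row["agent_id"]
            and timedelta(0) <= row["created_at"] - t["created_at"] <= MATCH_WINDOW
            for t in tokens
        )
        if not explained:
            result.append(row)
    return result
```

Feed it the rows returned by the query above; whatever comes back is the list to review by hand.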


2. Fleet-wide kernel update with no downtime

In the dashboard

  1. Sidebar → Kernel CVEs → Active batches tab
  2. ‘New batch’ button → tag selector + target kernel + reboot-strategy
  3. Enter TOTP → Start canary
  4. Watch canary → primary → completed in the same view

Say USN-2026-1234 affects 80 Ubuntu hosts in production. Target kernel is 6.8.0-49.49.

Or via API (advanced — for automation)

curl -X POST https://app.monsys.ai/api/v1/kernel-updates/batches \
-H "Authorization: Bearer $TOKEN" \
-H "X-TOTP-Code: 123456" \
-H "Content-Type: application/json" \
-d '{
"title": "USN-2026-1234 — kernel 6.8.0-49.49",
"target_kernel": "6.8.0-49.49",
"package_manager": "apt",
"reboot_strategy": "auto-at-window",
"selector_kind": "tag",
"selector_value": {"tag":"production"},
"maintenance_window_id": "8c34…"
}'

What happens behind the scenes:

  • Hub picks 3 canary hosts (10% of 80 = 8, capped at 3)
  • For each canary: Ed25519 EAT signed, pushed over WebSocket
  • Agent shells out to /usr/local/sbin/monsys-kernel-update (only via the sudoers rule that allows this one wrapper)
  • Wrapper installs kernel + headers, runs update-grub, writes /var/run/reboot-required and — if inside the maintenance window — shutdown -r +5
  • Hub detects running_release flip on next ingest → marks phase rebooted_new
  • Only when all 3 canary hosts hit rebooted_new → primary fires (77 EATs in one tick)
  • One canary reporting phase='failed' → batch aborted, no primary EATs sent
  • Canary stuck for 2h with no transition → automatic abort (“canary timeout”)
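The canary sizing and gating rules in the list above can be written down as a small decision function. This is an illustrative sketch, not the Hub's code: the 10% fraction, the cap of 3 and the 2h timeout come from the text, while rounding down, the minimum of one canary, the phase name "installing" and both function names are assumptions.

```python
CANARY_CAP = 3             # from the text: capped at 3
CANARY_TIMEOUT_HOURS = 2   # from the text: 2h with no transition


def canary_count(total_hosts: int) -> int:
    """10% of the selected hosts, capped at CANARY_CAP.
    Rounding down and the floor of one canary are assumptions."""
    if total_hosts <= 0:
        return 0
    return min(CANARY_CAP, max(1, total_hosts // 10))


def next_action(canary_phases, hours_since_last_transition):
    """Decide what the batch does next, given the phase of every canary."""
    if any(p == "failed" for p in canary_phases):
        return "abort"            # one failed canary stops the batch
    if canary_phases and all(p == "rebooted_new" for p in canary_phases):
        return "start_primary"    # every canary runs the new kernel
    if hours_since_last_transition >= CANARY_TIMEOUT_HOURS:
        return "abort_timeout"    # the "canary timeout" case
    return "wait"
```

For the 80-host example, canary_count(80) gives 3, which leaves 77 hosts for the primary phase.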

Track progress: /kernel-cves → Active batches tab. Or:

curl "https://app.monsys.ai/api/v1/kernel-updates/batches/<id>" \
-H "Authorization: Bearer $TOKEN" | jq .members
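If you poll this endpoint from a script, a per-phase count is usually easier to read than the raw member list. The sketch below assumes the response is a JSON object with a members array whose entries carry a phase field; that shape is a guess based on the jq filter above, not a documented schema.

```python
import json
from collections import Counter


def phase_summary(batch_json: str) -> dict:
    """Count batch members per phase.
    Assumes {"members": [{"phase": ...}, ...]}; the field names are assumptions."""
    batch = json.loads(batch_json)
    phases = (m.get("phase", "unknown") for m in batch.get("members", []))
    return dict(Counter(phases))
```

Pipe the curl output into a script that calls phase_summary and you get something like a count of hosts per phase for the whole batch.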

3. One application is slow — where to look

In the dashboard

  1. Sidebar → Apps → click host web-03 → Metrics tab
  2. Processes tab for top CPU/RAM consumers
  3. Topology tab for blast-radius dependencies
  4. Logs tab for recent agent WARN/ERROR

Click path for host web-03:

  1. /agents/web-03 → Metrics tab — CPU/mem/load over last 6h
  2. /agents/web-03 → Processes tab — top 10 CPU/RAM consumers
  3. /agents/web-03 → Topology tab — which other agents/apps it depends on
  4. /agents/web-03 → Containers tab — restart_count per container
  5. /agents/web-03 → Logs tab — last 200 WARN/ERROR lines from the agent itself (not full syslog — that’s in Loki)
  6. /alerts filter agent=web-03 AND is_resolved=false

Common root causes:

Symptom | First check
--- | ---
CPU 100% for hours | Processes tab — which PID? Then /agents/web-03 → Logs
Memory slowly fills | Capacity tab — RAM trend + swap usage
App unresponsive | Containers tab — restart_count + Inventory → systemd_services for service state
DB errors | Topology tab — is the DB agent stale? Check last_seen_at

4. Auto-reboot only on weekends

In the dashboard

  1. Sidebar → Settings → Maintenance windows → ‘New window’
  2. Configure recurrence: Saturday 02:00-06:00 tag-target ‘production’
  3. On kernel batch creation: reboot_strategy = auto-at-window
  4. Wrapper only runs shutdown -r +5 inside active window

Define a maintenance window first:

Or via API (advanced — for automation)

curl -X POST https://app.monsys.ai/api/v1/maintenance-windows \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"title": "Weekend reboot window",
"starts_at": "2026-05-23T02:00:00Z",
"ends_at": "2026-05-23T06:00:00Z",
"recurrence": "weekly",
"target_type": "tag",
"target_filter": {"tag":"production"},
"silence_categories": []
}'

When creating a kernel update batch: set reboot_strategy=auto-at-window. The monsys-kernel-update wrapper only runs shutdown -r +5 when the host is inside an active window; outside that, it stays at “reboot required” and waits for an operator confirmation.
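To make the "inside an active window" test concrete, here is a minimal sketch of a weekly recurrence check. It assumes recurrence "weekly" means the starts_at/ends_at pair repeats every 7 days from the first occurrence, and that the end of the window is exclusive; both are assumptions, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def in_weekly_window(now, starts_at, ends_at):
    """True if `now` falls inside the first window or any weekly repeat of it.
    The end of the window is treated as exclusive (an assumption)."""
    if now < starts_at:
        return False
    # Position of `now` within the current 7-day cycle.
    offset = (now - starts_at) % timedelta(weeks=1)
    return offset < (ends_at - starts_at)
```

With the window from the example (Saturday 02:00 to 06:00 UTC), a host checked on a Wednesday gets False and therefore stays at "reboot required".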


5. Config drift after a migration

In the dashboard

  1. Sidebar → Agents → click host → Inventory tab → Config hashes section
  2. Table shows sha256 current vs baseline per watched path
  3. Per drift event: ‘Accept as baseline’ OR ‘Investigate’
  4. Drift without linked_eat_id = unexplained, requires action

Per agent you see sha256 of all watched paths (default: /etc/passwd, /etc/shadow, /etc/sudoers, /etc/ssh/sshd_config + 4 others).

Or via API (advanced — for automation)

curl "https://app.monsys.ai/api/v1/agents/<id>/drift-events?limit=50" \
-H "Authorization: Bearer $TOKEN"

For each drift event you see:

  • path — which file
  • prev_sha256 / new_sha256
  • detected_at
  • linked_eat_id — null means nobody authorised this with an EAT; this is unexplained drift

Workflow:

  1. Legitimate change (manual patch, migration): POST /api/v1/agents/<agent_id>/drift-events/<event_id>/accept — baseline updates
  2. Suspicious: POST /api/v1/agents/<id>/emergency with action IsolateNetwork + TOTP, then inspect manually
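The first pass of this workflow is a split on linked_eat_id. A short sketch, assuming the drift-events response has been parsed into a list of dicts with the fields listed above (the function name is hypothetical):

```python
def triage(events):
    """Split drift events into (explained, unexplained).
    An event is unexplained when linked_eat_id is missing or null."""
    explained, unexplained = [], []
    for event in events:
        if event.get("linked_eat_id") is None:
            unexplained.append(event)
        else:
            explained.append(event)
    return explained, unexplained
```

Only the unexplained list needs a human decision between accepting the new baseline and isolating the host.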

6. Find every Ubuntu 20.04 server (end-of-standard-support 2025-04)

In the dashboard

  1. Sidebar → Agents → filter os_name=Ubuntu + os_version~20.04
  2. Bulk-select shown hosts
  3. ‘Add tag’ button → tag = ‘eol-2025’
  4. Sidebar → Settings → Alert rules → new with threshold-date 2025-04-30

Or via API (advanced — for automation)

curl 'https://app.monsys.ai/api/v1/agents?os_name=Ubuntu&os_version_like=20.04' \
-H "Authorization: Bearer $TOKEN" | jq '.[] | .hostname'

Or in SQL:

SELECT hostname, ip_addresses, last_seen_at
FROM agents
WHERE tenant_id = $1::UUID
AND os_name = 'Ubuntu'
AND os_version LIKE '20.04%'
AND is_active = true
ORDER BY last_seen_at DESC;

Typically: tag these hosts with eol-2025 via bulk-tagging in /groups, then build an SLA alert that fires whenever any agent still carries that tag after 2025-04-30.
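The SLA condition described above is simple enough to state as code. This is a sketch of the rule only, not of the alert engine: it assumes each agent record has hostname and tags fields (tags appears as an array in the SQL elsewhere on this page), and the function name is hypothetical.

```python
from datetime import date

EOL_TAG = "eol-2025"
DEADLINE = date(2025, 4, 30)


def sla_violations(agents, today):
    """Hostnames that still carry EOL_TAG after DEADLINE.
    Returns an empty list up to and including the deadline day."""
    if today <= DEADLINE:
        return []
    return sorted(a["hostname"] for a in agents if EOL_TAG in a.get("tags", []))
```

Removing the tag once a host is upgraded is what clears the violation.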


7. “Why did I get 438 CPU alerts last night?”

In the dashboard

  1. Sidebar → Alerts → ‘Grouped by title’ tab
  2. Sort descending by count
  3. Click biggest group — does the title contain a value?
  4. Settings → Alert rules → edit the rule so the value goes into description

Since the alert-storm fix, InsertAlert does 30-minute dedup per (tenant, agent, category, title). If you still see a storm: most likely a changing title (e.g. count embedded) defeating the dedup key.

Or via API (advanced — for automation)

Quick check:

SELECT title, COUNT(*), MIN(created_at), MAX(created_at)
FROM alerts
WHERE tenant_id = $1::UUID
AND created_at >= NOW() - INTERVAL '24 hours'
GROUP BY title
ORDER BY 2 DESC LIMIT 10;

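If one title appears 100+ times, the dedup is not collapsing it: check whether many agents emit it (the dedup key is per agent) or whether there is a bug in the dedup path. If you see many unique titles 1× each, the title contains a value (count, percentage, timestamp). Fix the emitting worker to move the value to description.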
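One way to spot value-bearing titles is to mask the numbers and count again: titles that differ only in a number collapse into one group. The sketch below is a diagnostic aid, not part of InsertAlert; the regular expression, the placeholder and the function names are assumptions. It also masks digits inside hostnames such as web-03, so alerts for different hosts may merge.

```python
import re
from collections import Counter

# Integers, decimals and an optional trailing percent sign.
_NUMBER = re.compile(r"\d+(?:\.\d+)?%?")


def normalize_title(title: str) -> str:
    """Replace every number in the title with a fixed placeholder."""
    return _NUMBER.sub("<n>", title)


def storm_candidates(titles, min_count=10):
    """Group titles by their normalized form and return the groups that
    occur at least min_count times, largest first."""
    counts = Counter(normalize_title(t) for t in titles)
    return [(t, c) for t, c in counts.most_common() if c >= min_count]
```

A group that is large after normalization but made of unique raw titles points at the rule whose title needs the value moved to description.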


8. Employee left — what did they have access to

In the dashboard

  1. Sidebar → Identity surface → search alice@
  2. See linked systems (hosts, SSH keys, Copilot/OpenAI seats)
  3. Per system ‘Revoke’ button OR one ‘revoke-user’ playbook EAT across all hosts
  4. Audit-log shows ‘identity_revoked’ per executed revocation

/identity/surface → search "alice@…" shows:

  • Dashboard account (roles per tenant)
  • Local users on which hosts (inventory_users.username='alice')
  • SSH keys where public-key fingerprint matched
  • GitHub Copilot seat (if Copilot Audit module enabled)
  • OpenAI org member (if OpenAI Audit module enabled)
  • Sudo rights (inventory_sudo_rules.who='alice')

Bulk-revoke from the Identity surface page: either per system (POST /api/v1/identity/persons/<id>/revoke), or as a single RunPlaybook EAT that handles all hosts at once.


9. On-call rotation, timezone-aware

In the dashboard

  1. Sidebar → Settings → On-call rotations → ‘New rotation’
  2. For team eu-ops: shifts Mon-Fri 09:00-18:00 Europe/Brussels
  3. Add buddy fallback for out-of-hours
  4. NotifyWorker resolves which shift is open per alert and routes to ntfy

Or via API (advanced — for automation)

-- For team eu-ops, 09:00-18:00 Brussels time, Mon-Fri, with buddy fallback
-- Parameters: $1 = tenant_id, $2 = group_id, $3 = alice's person_id, $4 = bob's person_id
INSERT INTO oncall_shifts (tenant_id, group_id, person_id, weekday,
start_time, end_time, timezone)
VALUES
($1, $2, $3, 1, '09:00', '18:00', 'Europe/Brussels'), -- alice
($1, $2, $3, 2, '09:00', '18:00', 'Europe/Brussels'), -- alice
-- … and so on for the remaining weekdays
($1, $2, $4, 1, '18:00', '09:00', 'Europe/Brussels'); -- bob, buddy (shift wraps past midnight)

NotifyWorker resolves per alert which shift is open at alerts.created_at, then routes to the right ntfy topic + email.
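The shift lookup can be sketched as follows. This is not NotifyWorker's implementation: it assumes weekday uses ISO numbering (1 = Monday), that a shift whose end_time is earlier than its start_time wraps past midnight and belongs to the day it starts on, and that end times are exclusive. The function names are hypothetical.

```python
from datetime import datetime, time, timezone
from zoneinfo import ZoneInfo


def shift_is_open(shift, at_utc):
    """True if the shift covers the given UTC instant, evaluated in the
    shift's own timezone."""
    local = at_utc.astimezone(ZoneInfo(shift["timezone"]))
    weekday, now = local.isoweekday(), local.time()
    start, end = shift["start_time"], shift["end_time"]
    if start <= end:
        # Same-day shift, e.g. 09:00-18:00.
        return weekday == shift["weekday"] and start <= now < end
    # Overnight shift, e.g. 18:00-09:00: evening part on the start day,
    # morning part on the following day.
    if weekday == shift["weekday"] and now >= start:
        return True
    next_day = shift["weekday"] % 7 + 1
    return weekday == next_day and now < end


def on_call(shifts, at_utc):
    """person_id of every shift open at the given instant."""
    return [s["person_id"] for s in shifts if shift_is_open(s, at_utc)]
```

Converting to the shift's timezone before comparing is what keeps the 09:00 boundary correct across daylight-saving changes.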


10. Are production DB backups actually still working?

In the dashboard

  1. Sidebar → Inventory → Backups tab → filter tag ‘production-db’
  2. Sort ‘Last successful’ ascending
  3. Per stale host (>25h): click ‘Create alert rule’
  4. Threshold = 25h + severity = critical

Or via API (advanced — for automation)

SELECT a.hostname,
b.tool AS backup_tool,
b.destination AS target,
b.last_successful_run AS last_ok,
NOW() - b.last_successful_run AS age,
b.last_failure_message
FROM backup_configs b
JOIN agents a ON a.id = b.agent_id
WHERE b.tenant_id = $1::UUID
AND 'production-db' = ANY(a.tags)
ORDER BY b.last_successful_run NULLS FIRST;

Alert rule if age > 25h (default schedule = daily):

curl -X POST https://app.monsys.ai/api/v1/alert-rules \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "Backup stale > 25h on production-db",
"tag": "production-db",
"category": "backup",
"metric": "backup_age_hours",
"operator": ">",
"threshold": 25,
"severity": "critical"
}'
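The 25h threshold is the daily schedule plus one hour of slack. The same check can be run offline against the rows from the SQL query above; the sketch below assumes each row is a dict with hostname and last_ok, and treats a host that has never had a successful run (last_ok is NULL) as stale. The function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=25)  # daily schedule + 1h slack


def stale_backups(rows, now):
    """Hostnames whose last successful backup is older than STALE_AFTER,
    or that have never completed one."""
    stale = []
    for row in rows:
        last_ok = row["last_ok"]
        if last_ok is None or now - last_ok > STALE_AFTER:
            stale.append(row["hostname"])
    return stale
```

A backup that is exactly 25h old is not yet stale here, which matches the ">" operator in the alert rule.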