Architecture

Components

agent (Rust, on every host)
  │  HTTPS batches: metrics · heartbeats · inventory · alerts · honeypot events
  ▼
hub-api  (Go + Gin)  ──►  TimescaleDB  (Postgres with Timescale extension)
                     ├──  Redis        (rate limiting · cache)
                     ├──  NATS         (internal message bus)
                     └──  Ollama       (local LLM for "Explain this")
hub-api  ──►  dashboard (Next.js)
         └──  websocket emergency push  ──►  agent

Why no Prometheus

We want all data in one database: joining alerts to metrics is then plain SQL rather than PromQL plus application logic. TimescaleDB hypertables provide time-series performance without introducing a second operational system.
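As an illustration of what "joins are SQL" can look like from hub-api, here is a minimal sketch. The table and column names (alerts, metrics, host_id, fired_at, name, value, time) and the metric name cpu_percent are assumptions made for this example, not the actual schema. The query returns, for each alert of the last day, the average CPU value over the minutes leading up to it.

```go
package main

import "fmt"

// alertContextSQL joins alerts to the metrics recorded on the same host
// shortly before each alert fired. All table and column names here are
// hypothetical; adapt them to the real schema.
const alertContextSQL = `
SELECT a.id, a.rule, a.fired_at, avg(m.value) AS avg_cpu
FROM alerts a
JOIN metrics m
  ON m.host_id = a.host_id
 AND m.name = 'cpu_percent'
 AND m.time BETWEEN a.fired_at - make_interval(mins => $1) AND a.fired_at
WHERE a.fired_at > now() - INTERVAL '1 day'
GROUP BY a.id, a.rule, a.fired_at
ORDER BY a.fired_at DESC`

// alertContextQuery returns the query text and its positional arguments,
// ready to hand to a Postgres driver (for example via database/sql).
func alertContextQuery(lookbackMinutes int) (string, []any) {
	return alertContextSQL, []any{lookbackMinutes}
}

func main() {
	q, args := alertContextQuery(5)
	fmt.Println(q)
	fmt.Println("args:", args)
}
```

With Prometheus as the metrics store, the same question would need a PromQL range query per alert plus code to stitch the results onto alert rows.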

Trade-off: we lose Prometheus’ pull-based service discovery and the exporter ecosystem. To monitor the platform itself, we still expose /metrics on api.monsys.ai in the Prometheus exposition format.
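For readers unfamiliar with the exposition format, the sketch below renders a few gauges as plain text and serves them from an HTTP handler. It uses only the Go standard library rather than Gin so that it stays self-contained; the metric names and the snapshot callback are invented for the example and are not the metrics hub-api actually exports.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sort"
	"strings"
)

// renderMetrics formats gauges in the Prometheus text exposition format:
// a "# TYPE" line followed by "<name> <value>" for each metric.
// Names are sorted so the output is stable between scrapes.
func renderMetrics(gauges map[string]float64) string {
	names := make([]string, 0, len(gauges))
	for n := range gauges {
		names = append(names, n)
	}
	sort.Strings(names)

	var b strings.Builder
	for _, n := range names {
		fmt.Fprintf(&b, "# TYPE %s gauge\n%s %g\n", n, n, gauges[n])
	}
	return b.String()
}

// metricsHandler serves whatever snapshot() returns at request time.
// snapshot is a hypothetical hook into the hub's internal counters.
func metricsHandler(snapshot func() map[string]float64) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
		fmt.Fprint(w, renderMetrics(snapshot()))
	}
}

func main() {
	h := metricsHandler(func() map[string]float64 {
		return map[string]float64{
			"hub_agents_connected":   42, // hypothetical metric
			"hub_ingest_queue_depth": 3,  // hypothetical metric
		}
	})

	// Exercise the handler in-process instead of opening a port.
	rec := httptest.NewRecorder()
	h(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil))
	fmt.Print(rec.Body.String())
}
```

Any Prometheus-compatible scraper can read this output, so an external monitor can watch the platform without Prometheus becoming part of the data path.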

Why Caddy

Caddy obtains and renews TLS certificates from Let’s Encrypt automatically, with no extra configuration. The subdomains are stable, so there is no reason to manage the TLS lifecycle by hand, as we would have to with nginx.

Why Rust for the agent

The agent runs on every monitored host, so footprint and reliability are the dominant concerns. Rust provides a statically linked binary of ~12 MB with no runtime dependencies, predictable memory use under load (no GC pauses that look like CPU spikes), and a type system that catches at compile time many of the errors a long-running daemon cannot afford.

Why Ed25519 for Emergency Action Tokens

We don’t want to give the agent permanent root privileges. A token grants the same operational power for the few minutes an incident lasts, and then expires. If the hub is compromised, nonces block replays of previously issued tokens; if an agent is compromised, there is no long-lived key material on it to steal.
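The sketch below shows how expiry, nonces, and an Ed25519 signature can combine in such a token. It is written in Go for brevity, although the real agent is Rust. The token layout, field names, and action strings are assumptions for illustration, not the actual wire format. It also assumes that the hub holds the private key and the agent stores only the hub's public key.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"encoding/binary"
	"errors"
	"fmt"
	"time"
)

var (
	errBadSig  = errors.New("bad signature")
	errExpired = errors.New("token expired")
	errReplay  = errors.New("nonce already used")
)

// Token is a hypothetical layout, not the project's actual wire format.
type Token struct {
	Action      string
	Nonce       [16]byte
	ExpiresUnix int64
	Sig         []byte
}

// signedBytes is the byte string covered by the signature:
// fixed-width nonce and expiry first, then the action.
func (t *Token) signedBytes() []byte {
	buf := append([]byte{}, t.Nonce[:]...)
	buf = binary.BigEndian.AppendUint64(buf, uint64(t.ExpiresUnix))
	return append(buf, t.Action...)
}

// Issue runs on the hub, which holds the private key.
func Issue(priv ed25519.PrivateKey, action string, ttl time.Duration, now time.Time) (*Token, error) {
	t := &Token{Action: action, ExpiresUnix: now.Add(ttl).Unix()}
	if _, err := rand.Read(t.Nonce[:]); err != nil {
		return nil, err
	}
	t.Sig = ed25519.Sign(priv, t.signedBytes())
	return t, nil
}

// Verifier runs on the agent, which holds only the hub's public key.
type Verifier struct {
	hubKey ed25519.PublicKey
	seen   map[[16]byte]int64 // nonce -> expiry, so old entries can be pruned
}

// Verify checks signature, then expiry, then nonce reuse.
func (v *Verifier) Verify(t *Token, now time.Time) error {
	if !ed25519.Verify(v.hubKey, t.signedBytes(), t.Sig) {
		return errBadSig
	}
	if now.Unix() >= t.ExpiresUnix {
		return errExpired
	}
	for n, exp := range v.seen { // forget nonces whose tokens have expired anyway
		if exp <= now.Unix() {
			delete(v.seen, n)
		}
	}
	if _, dup := v.seen[t.Nonce]; dup {
		return errReplay
	}
	v.seen[t.Nonce] = t.ExpiresUnix
	return nil
}

// demo exercises four cases: fresh, replayed, expired, tampered.
func demo() (fresh, replay, expired, tampered error) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}
	v := &Verifier{hubKey: pub, seen: map[[16]byte]int64{}}
	now := time.Now()
	issue := func() *Token {
		t, err := Issue(priv, "restart-service", 5*time.Minute, now)
		if err != nil {
			panic(err)
		}
		return t
	}

	t1 := issue()
	fresh = v.Verify(t1, now)
	replay = v.Verify(t1, now) // same nonce presented twice

	expired = v.Verify(issue(), now.Add(10*time.Minute))

	t3 := issue()
	t3.Action = "wipe-disk" // modified after signing
	tampered = v.Verify(t3, now)
	return
}

func main() {
	fresh, replay, expired, tampered := demo()
	fmt.Println("fresh:", fresh)
	fmt.Println("replay:", replay)
	fmt.Println("expired:", expired)
	fmt.Println("tampered:", tampered)
}
```

Because the nonce cache only needs to remember tokens that have not yet expired, a short lifetime also keeps the agent's replay state small.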