Files
util.nix/docs/specs/sentinella-p2p-design.md
yukkop 6035397e9b feat(sentinèlla): p2p topology with DNS peer discovery
- Replace central sentinel with watcher: each node polls peers discovered
  via a single DNS name with multiple A records (e.g. peers.sentinella.com)
- Auto-detect own IPs via hostname -I; SELF env var available as optional
  override for NAT/floating-IP setups
- Fix Basic Auth bug in router.sh: compare tok against AUTH_TOKENS instead
  of unset $USER/$PASS
- Rename sentinel binary to watcher; drop unused shellplot dep
- Add inetutils to watcher runtime deps for hostname -I
- Update NixOS module: replace sentinel options with watcher p2p options
  (peersDns, self, peersPort, peersScheme, pollingIntervalSec)
- Add sentinèlla test suite: probe-status-empty, probe-disk, watcher-state-file
2026-04-26 21:54:07 +00:00

4.0 KiB

Spec: sentinella-p2p-design

Scope: feature

sentinèlla P2P Design Spec

Goal

Replace the hub-and-spoke sentinel topology with a fully peer-to-peer model where every node is equal.

Topology

  • Every node runs both probe and watcher
  • No privileged coordinator; any node can go down without breaking monitoring of the others
  • Duplicate Telegram alerts from multiple nodes detecting the same failure are accepted (reliability over deduplication)

Peer Discovery — DNS multi-A record

  • One DNS name (e.g. peers.sentinella.com) has multiple A records, one per node IP
  • Configured externally via any DNS registrar (Cloudflare, Namecheap, etc.)
  • Recommended TTL: 60 seconds so new nodes propagate quickly
  • Each watcher resolves the name via getent hosts $PEERS_DNS on every poll cycle
  • Own IP ($SELF) is stripped from the result so a node never polls itself
  • No per-node DNS names needed; IP addresses are used directly in peer URLs
peers.sentinella.com  A  1.2.3.4    TTL 60
peers.sentinella.com  A  5.6.7.8    TTL 60
peers.sentinella.com  A  9.10.11.12 TTL 60

Environment Variables

watcher (new, replaces sentinel)

Variable Default Required Description
PEERS_DNS yes DNS name resolving to all peer IPs
SELF yes This node's own IP; excluded from peer list
PEERS_PORT 5988 no Port all peers listen on
PEERS_SCHEME http no URL scheme for peer connections
PEERS_TOKEN no Single Basic Auth token sent to all peers (replaces per-server TOKENS)
TG_TOKEN yes Telegram bot token
TG_CHAT_ID yes Telegram chat ID
TIMEOUT 5 no curl timeout seconds
POLLING_INTERVAL_SEC 3 no Seconds between poll rounds
STATE_DIR /var/lib/sentinel no Directory for state files
SPAM 0 no If 1, notify on every poll

probe / router (unchanged)

Variable Default Description
PORT 5988 TCP port to listen on
URLS Space-separated URLs to health-check
VOLUMES all from df -P Mount points to report
TIMEOUT 5 curl timeout
AUTH_FILE Path to user:pass auth file

Key Implementation Details

resolve_peers() in watcher.sh

resolve_peers() {
  getent hosts "$PEERS_DNS" \
    | awk '{print $1}' \
    | grep -v "^${SELF}$" \
    | awk -v s="$PEERS_SCHEME" -v p="$PEERS_PORT" '{print s"://"$1":"p}'
}

Called at the top of every outer poll loop iteration — no restart needed when DNS changes.

Auth simplification

  • Old: per-server CSV TOKENS aligned with SERVERS
  • New: single optional PEERS_TOKEN; either all peers require auth or none do

State files

  • Unchanged: $STATE_DIR/$(cksum url).state contains last known state string
  • Format: up:N/M:200 or down:0/0:000

Binaries

Old name New name Role
sentinel watcher Polls peers, sends alerts
probe probe socat TCP listener (unchanged)
router router HTTP handler (unchanged + auth bug fixed)
base64 base64 awk base64 util (unchanged)

NixOS Module Options

hectic.sentinella.enable            bool
hectic.sentinella.peersDns          string   # e.g. "peers.sentinella.com"
hectic.sentinella.self              string   # this node's own IP
hectic.sentinella.port              int      # default 5988
hectic.sentinella.urls              [string] # URLs for probe to health-check
hectic.sentinella.volumes           [string] # mount points for probe
hectic.sentinella.tgToken           string
hectic.sentinella.tgChatId          string
hectic.sentinella.pollingIntervalSec int     # default 3

Generates two systemd services: sentinella-probe and sentinella-watcher.

Known Bug to Fix (router.sh)

The Basic Auth check references $USER and $PASS which are never populated. Fix: move auth_ok=false before the header loop and compare $tok against each entry in $AUTH_TOKENS (which is correctly populated from AUTH_FILE).