feat(sentinèlla): p2p topology with DNS peer discovery

- Replace central sentinel with watcher: each node polls peers discovered
  via a single DNS name with multiple A records (e.g. peers.sentinella.com)
- Auto-detect own IPs via hostname -I; SELF env var available as optional
  override for NAT/floating-IP setups
- Fix Basic Auth bug in router.sh: compare tok against AUTH_TOKENS instead
  of unset $USER/$PASS
- Rename sentinel binary to watcher; drop unused shellplot dep
- Add inetutils to watcher runtime deps for hostname -I
- Update NixOS module: replace sentinel options with watcher p2p options
  (peersDns, self, peersPort, peersScheme, pollingIntervalSec)
- Add sentinèlla test suite: probe-status-empty, probe-disk, watcher-state-file
This commit is contained in:
2026-04-26 21:54:07 +00:00
parent 249d027f6b
commit 6035397e9b
15 changed files with 687 additions and 134 deletions

View File

@@ -0,0 +1,101 @@
# Spec: sentinella-p2p-design
Scope: feature
# sentinèlla P2P Design Spec
## Goal
Replace the hub-and-spoke sentinel topology with a fully peer-to-peer model where every node is equal.
## Topology
- Every node runs both `probe` and `watcher`
- No privileged coordinator; any node can go down without breaking monitoring of the others
- Duplicate Telegram alerts from multiple nodes detecting the same failure are **accepted** (reliability over deduplication)
## Peer Discovery — DNS multi-A record
- One DNS name (e.g. `peers.sentinella.com`) has multiple A records, one per node IP
- Configured externally via any DNS registrar (Cloudflare, Namecheap, etc.)
- Recommended TTL: **60 seconds** so new nodes propagate quickly
- Each watcher resolves the name via `getent hosts $PEERS_DNS` on every poll cycle
- Own IP (`$SELF`) is stripped from the result so a node never polls itself
- No per-node DNS names needed; IP addresses are used directly in peer URLs
```
peers.sentinella.com A 1.2.3.4 TTL 60
peers.sentinella.com A 5.6.7.8 TTL 60
peers.sentinella.com A 9.10.11.12 TTL 60
```
## Environment Variables
### watcher (new, replaces sentinel)
| Variable | Default | Required | Description |
|---|---|---|---|
| `PEERS_DNS` | — | yes | DNS name resolving to all peer IPs |
| `SELF` | — | yes | This node's own IP; excluded from peer list |
| `PEERS_PORT` | `5988` | no | Port all peers listen on |
| `PEERS_SCHEME` | `http` | no | URL scheme for peer connections |
| `PEERS_TOKEN` | — | no | Single Basic Auth token sent to all peers (replaces per-server TOKENS) |
| `TG_TOKEN` | — | yes | Telegram bot token |
| `TG_CHAT_ID` | — | yes | Telegram chat ID |
| `TIMEOUT` | `5` | no | curl timeout seconds |
| `POLLING_INTERVAL_SEC` | `3` | no | Seconds between poll rounds |
| `STATE_DIR` | `/var/lib/sentinel` | no | Directory for state files |
| `SPAM` | `0` | no | If 1, notify on every poll |
### probe / router (unchanged)
| Variable | Default | Description |
|---|---|---|
| `PORT` | `5988` | TCP port to listen on |
| `URLS` | — | Space-separated URLs to health-check |
| `VOLUMES` | all from df -P | Mount points to report |
| `TIMEOUT` | `5` | curl timeout |
| `AUTH_FILE` | — | Path to user:pass auth file |
## Key Implementation Details
### resolve_peers() in watcher.sh
```sh
resolve_peers() {
getent hosts "$PEERS_DNS" \
| awk '{print $1}' \
| grep -v "^${SELF}$" \
| awk -v s="$PEERS_SCHEME" -v p="$PEERS_PORT" '{print s"://"$1":"p}'
}
```
Called at the top of every outer poll loop iteration — no restart needed when DNS changes.
### Auth simplification
- Old: per-server CSV `TOKENS` aligned with `SERVERS`
- New: single optional `PEERS_TOKEN`; either all peers require auth or none do
### State files
- Unchanged: `$STATE_DIR/$(cksum url).state` contains last known state string
- Format: `up:N/M:200` or `down:0/0:000`
## Binaries
| Old name | New name | Role |
|---|---|---|
| `sentinel` | `watcher` | Polls peers, sends alerts |
| `probe` | `probe` | socat TCP listener (unchanged) |
| `router` | `router` | HTTP handler (unchanged + auth bug fixed) |
| `base64` | `base64` | awk base64 util (unchanged) |
## NixOS Module Options
```
hectic.sentinella.enable bool
hectic.sentinella.peersDns string # e.g. "peers.sentinella.com"
hectic.sentinella.self string # this node's own IP
hectic.sentinella.port int # default 5988
hectic.sentinella.urls [string] # URLs for probe to health-check
hectic.sentinella.volumes [string] # mount points for probe
hectic.sentinella.tgToken string
hectic.sentinella.tgChatId string
hectic.sentinella.pollingIntervalSec int # default 3
```
Generates two systemd services: `sentinella-probe` and `sentinella-watcher`.
## Known Bug to Fix (router.sh)
The Basic Auth check references `$USER` and `$PASS` which are never populated.
Fix: move `auth_ok=false` before the header loop and compare `$tok` against
each entry in `$AUTH_TOKENS` (which is correctly populated from `AUTH_FILE`).