- Replace central sentinel with watcher: each node polls peers discovered via a single DNS name with multiple A records (e.g. peers.sentinella.com)
- Auto-detect own IPs via hostname -I; SELF env var available as optional override for NAT/floating-IP setups
- Fix Basic Auth bug in router.sh: compare tok against AUTH_TOKENS instead of unset $USER/$PASS
- Rename sentinel binary to watcher; drop unused shellplot dep
- Add inetutils to watcher runtime deps for hostname -I
- Update NixOS module: replace sentinel options with watcher p2p options (peersDns, self, peersPort, peersScheme, pollingIntervalSec)
- Add sentinèlla test suite: probe-status-empty, probe-disk, watcher-state-file
# Spec: sentinella-p2p-design

Scope: feature

# sentinèlla P2P Design Spec

## Goal
Replace the hub-and-spoke sentinel topology with a fully peer-to-peer model where every node is equal.

## Topology
- Every node runs both `probe` and `watcher`
- No privileged coordinator; any node can go down without breaking monitoring of the others
- Duplicate Telegram alerts from multiple nodes detecting the same failure are **accepted** (reliability over deduplication)

## Peer Discovery — DNS multi-A record
- One DNS name (e.g. `peers.sentinella.com`) has multiple A records, one per node IP
- Configured externally via any DNS registrar (Cloudflare, Namecheap, etc.)
- Recommended TTL: **60 seconds** so new nodes propagate quickly
- Each watcher resolves the name via `getent hosts $PEERS_DNS` on every poll cycle
- Own IP (`$SELF`) is stripped from the result so a node never polls itself
- No per-node DNS names needed; IP addresses are used directly in peer URLs

```
peers.sentinella.com A 1.2.3.4 TTL 60
peers.sentinella.com A 5.6.7.8 TTL 60
peers.sentinella.com A 9.10.11.12 TTL 60
```

## Environment Variables

### watcher (new, replaces sentinel)
| Variable | Default | Required | Description |
|---|---|---|---|
| `PEERS_DNS` | — | yes | DNS name resolving to all peer IPs |
| `SELF` | — | yes | This node's own IP; excluded from peer list |
| `PEERS_PORT` | `5988` | no | Port all peers listen on |
| `PEERS_SCHEME` | `http` | no | URL scheme for peer connections |
| `PEERS_TOKEN` | — | no | Single Basic Auth token sent to all peers (replaces per-server TOKENS) |
| `TG_TOKEN` | — | yes | Telegram bot token |
| `TG_CHAT_ID` | — | yes | Telegram chat ID |
| `TIMEOUT` | `5` | no | curl timeout in seconds |
| `POLLING_INTERVAL_SEC` | `3` | no | Seconds between poll rounds |
| `STATE_DIR` | `/var/lib/sentinel` | no | Directory for state files |
| `SPAM` | `0` | no | If 1, notify on every poll |

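The defaults in the watcher table could be applied with POSIX parameter expansion at the top of the script. The following is only a minimal sketch under that assumption, not the actual `watcher.sh`; the function name `load_watcher_config` is hypothetical.

```sh
# Hypothetical helper (not taken from watcher.sh): apply the documented
# defaults and fail fast when a required variable is missing.
load_watcher_config() {
  # Optional variables: assign the default from the table if unset or empty.
  : "${PEERS_PORT:=5988}"
  : "${PEERS_SCHEME:=http}"
  : "${TIMEOUT:=5}"
  : "${POLLING_INTERVAL_SEC:=3}"
  : "${STATE_DIR:=/var/lib/sentinel}"
  : "${SPAM:=0}"

  # Required variables: report the first one that is missing.
  # Note: POSIX sh has no "local", so name/value are plain globals here.
  for name in PEERS_DNS SELF TG_TOKEN TG_CHAT_ID; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "watcher: $name is required" >&2
      return 1
    fi
  done
}
```

Calling `load_watcher_config || exit 1` before the poll loop would stop the service early on an incomplete environment instead of failing on the first poll.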
### probe / router (unchanged)
| Variable | Default | Description |
|---|---|---|
| `PORT` | `5988` | TCP port to listen on |
| `URLS` | — | Space-separated URLs to health-check |
| `VOLUMES` | all from `df -P` | Mount points to report |
| `TIMEOUT` | `5` | curl timeout in seconds |
| `AUTH_FILE` | — | Path to user:pass auth file |

## Key Implementation Details

### resolve_peers() in watcher.sh
```sh
resolve_peers() {
  # Resolve all A records, keep the address column, drop our own IP
  # (fixed-string, whole-line match so dots in $SELF are not regex wildcards),
  # then format each remaining address as a peer URL.
  getent hosts "$PEERS_DNS" \
    | awk '{print $1}' \
    | grep -Fxv -- "$SELF" \
    | awk -v s="$PEERS_SCHEME" -v p="$PEERS_PORT" '{print s"://"$1":"p}'
}
```
Called at the top of every outer poll loop iteration, so no restart is needed when DNS changes.

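How the outer loop might call `resolve_peers` on each iteration can be sketched as follows. This is an assumption about the loop's shape, not the real `watcher.sh`: `poll_round`, `watch_forever`, and `check_peer` are hypothetical names, and `check_peer` stands in for whatever the watcher does per peer.

```sh
# Hypothetical sketch of one poll round. resolve_peers is the function from
# this spec; check_peer is a placeholder for the per-peer health check.
poll_round() {
  resolve_peers | while IFS= read -r url; do
    [ -n "$url" ] || continue
    # Redirect stdin so the check cannot consume the remaining peer list.
    check_peer "$url" </dev/null
  done
}

# Outer loop: peers are re-resolved on every iteration, so DNS changes are
# picked up without a restart.
watch_forever() {
  while :; do
    poll_round
    sleep "${POLLING_INTERVAL_SEC:-3}"
  done
}
```

Because the `while` body runs in a pipeline subshell, this sketch only works if `check_peer` keeps its state on disk rather than in shell variables, which matches the state-file approach described in this spec.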
### Auth simplification
- Old: per-server CSV `TOKENS` aligned with `SERVERS`
- New: single optional `PEERS_TOKEN`; either all peers require auth or none do

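A single optional token means the request code needs only one branch. The sketch below assumes `PEERS_TOKEN` already holds the base64-encoded `user:pass` value that follows `Basic` in the header; the spec does not state the token format, and `peer_auth_header` and `fetch_peer` are hypothetical names.

```sh
# Hypothetical helper: print the Authorization header for the optional token.
# Assumption: PEERS_TOKEN is already base64-encoded "user:pass".
peer_auth_header() {
  if [ -n "${PEERS_TOKEN:-}" ]; then
    printf 'Authorization: Basic %s\n' "$PEERS_TOKEN"
  fi
}

# Hypothetical fetch: send the header only when a token is configured.
fetch_peer() {
  url=$1
  header=$(peer_auth_header)
  if [ -n "$header" ]; then
    curl -fsS --max-time "${TIMEOUT:-5}" -H "$header" "$url"
  else
    curl -fsS --max-time "${TIMEOUT:-5}" "$url"
  fi
}
```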
### State files
- Unchanged: `$STATE_DIR/$(cksum url).state` contains last known state string
- Format: `up:N/M:200` or `down:0/0:000`

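One way to derive the state file path and detect a transition is sketched below. The exact `cksum` invocation in `watcher.sh` is not shown in this spec, so hashing the URL string from stdin is an assumption, and `state_file` and `state_changed` are hypothetical names.

```sh
# Hypothetical helper: map a peer URL to its state file path.
state_file() {
  # cksum prints "<crc> <byte-count>"; keep only the CRC as the file name.
  sum=$(printf '%s' "$1" | cksum | awk '{print $1}')
  printf '%s/%s.state\n' "${STATE_DIR:-/var/lib/sentinel}" "$sum"
}

# Hypothetical helper: store the new state and succeed only when it differs
# from the stored one, so the caller can notify on transitions.
state_changed() {
  file=$(state_file "$1")
  new=$2
  old=
  [ -f "$file" ] && old=$(cat "$file")
  [ "$old" = "$new" ] && return 1
  mkdir -p "${file%/*}"
  printf '%s\n' "$new" > "$file"
}
```

A caller could then write `state_changed "$url" "$state" && notify ...`, and bypass the check entirely when `SPAM=1`.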
## Binaries
| Old name | New name | Role |
|---|---|---|
| `sentinel` | `watcher` | Polls peers, sends alerts |
| `probe` | `probe` | socat TCP listener (unchanged) |
| `router` | `router` | HTTP handler (unchanged + auth bug fixed) |
| `base64` | `base64` | awk base64 util (unchanged) |

## NixOS Module Options
```
hectic.sentinella.enable              bool
hectic.sentinella.peersDns            string    # e.g. "peers.sentinella.com"
hectic.sentinella.self                string    # this node's own IP
hectic.sentinella.port                int       # default 5988
hectic.sentinella.urls                [string]  # URLs for probe to health-check
hectic.sentinella.volumes             [string]  # mount points for probe
hectic.sentinella.tgToken             string
hectic.sentinella.tgChatId            string
hectic.sentinella.pollingIntervalSec  int       # default 3
```
Generates two systemd services: `sentinella-probe` and `sentinella-watcher`.

## Known Bug to Fix (router.sh)
The Basic Auth check references `$USER` and `$PASS` which are never populated.
Fix: move `auth_ok=false` before the header loop and compare `$tok` against
each entry in `$AUTH_TOKENS` (which is correctly populated from `AUTH_FILE`).
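
The corrected comparison could look roughly like the sketch below. It isolates the token check from the header loop and assumes `AUTH_TOKENS` is a whitespace-separated list of tokens; the real `router.sh` may store and iterate them differently, and `token_allowed` is a hypothetical name.

```sh
# Hypothetical sketch of the fixed check (router.sh may be structured
# differently). Assumption: AUTH_TOKENS is a whitespace-separated token list.
token_allowed() {
  tok=$1
  auth_ok=false                 # reset before examining any candidate
  for candidate in ${AUTH_TOKENS:-}; do
    if [ "$tok" = "$candidate" ]; then
      auth_ok=true
      break
    fi
  done
  [ "$auth_ok" = true ]
}
```

Initialising `auth_ok=false` once, before any comparison, keeps a missing or malformed `Authorization` header from inheriting a stale `true` value.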