some

2026-05-23 22:57:09 +00:00
parent 9c6de9d067
commit 4c0a178646
6 changed files with 946 additions and 21 deletions
--- a/nixos/system/bfs.poland.xray/MATRIX-FAILOVER-RUNBOOK.md
+++ b/nixos/system/bfs.poland.xray/MATRIX-FAILOVER-RUNBOOK.md
@@ -0,0 +1,230 @@
+# Matrix Cluster Failover Runbook (`accord.tube`)
+
+Primary: `hectic-lab` (NL, `128.140.75.58`)
+Standby: `bfs.poland.xray` (PL, `91.198.166.181`)
+
+Module: `hectic.generic.matrix-cluster` (`nixos/module/generic/matrix-cluster.nix`).
+Shared secrets: `sus/matrix-cluster.yaml`.
+
+All `psql` and `pg_ctl` invocations use PostgreSQL **17** at data dir
+`/var/lib/postgresql/17`.
+
+## Initial setup
+
+### 1. Provision shared SOPS file (`sus/matrix-cluster.yaml`)
+
+On a workstation with both yukkop and yukkop-alt age keys available:
+
+```sh
+sudo cat /var/lib/matrix-synapse/homeserver.signing.key  # on NL (hectic-lab)
+# Copy the single-line key; it is pasted verbatim in the next step.
+
+sops sus/matrix-cluster.yaml  # on the workstation
+```
+
+Populate the editor with:
+
+```yaml
+matrix:
+  signing-key: <paste verbatim signing-key line from NL>
+  postgres-replication-password: <openssl rand -base64 32>
+  object-storage:
+    credentials: |
+      ACCESS_KEY_ID=<verbatim copy from sus/hectic-lab.yaml>
+      SECRET_ACCESS_KEY=<verbatim copy from sus/hectic-lab.yaml>
+  porkbun-api-key: <PORKBUN_API_KEY>
+  porkbun-secret-api-key: <PORKBUN_SECRET_API_KEY>
+```
+
+Verify recipients:
+
+```sh
+sops updatekeys sus/matrix-cluster.yaml
+sops -d sus/matrix-cluster.yaml | grep -E 'signing-key|postgres-replication-password|object-storage|porkbun-api-key|porkbun-secret-api-key'
+```
+
+Expected: one matching line for each of the five keys, exit 0.
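The five-key check above can also be scripted so that a missing key is named explicitly. A minimal sketch, assuming the decrypted YAML uses exactly the key names shown in step 1; `check_matrix_secrets` is a hypothetical helper, not something the module provides:

```shell
# check_matrix_secrets (hypothetical helper): reads decrypted YAML on stdin
# and reports any of the five expected key names that do not appear.
check_matrix_secrets() {
  yaml=$(cat)
  missing=0
  for key in signing-key postgres-replication-password credentials \
             porkbun-api-key porkbun-secret-api-key; do
    # Anchor on leading whitespace so porkbun-api-key does not match
    # inside some longer key name.
    if ! printf '%s\n' "$yaml" | grep -Eq "^[[:space:]]*${key}:"; then
      echo "missing: ${key}" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (on the workstation):
#   sops -d sus/matrix-cluster.yaml | check_matrix_secrets
```

The function only checks that the key names exist; it does not validate the values.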
+
+### 2. Deploy NL primary first
+
+```sh
+nixos-rebuild switch --flake .#'hectic-lab|x86_64-linux' --target-host root@128.140.75.58
+```
+
+Verify on NL:
+
+```sh
+sudo systemctl status matrix-synapse postgresql matrix-cluster-replication-password
+sudo -u postgres psql -c "select rolname, rolreplication from pg_roles where rolname='replication';"
+# Expected: replication | t
+```
+
+### 3. Seed PL replica with `pg_basebackup`
+
+On PL:
+
+```sh
+sudo systemctl stop postgresql
+sudo rm -rf /var/lib/postgresql/17
+sudo -u postgres install -d -m 0700 /var/lib/postgresql/17
+sudo -u postgres PGPASSWORD="$(sudo cat /run/secrets/matrix/postgres-replication-password)" \
+  pg_basebackup \
+    -h 128.140.75.58 \
+    -p 5432 \
+    -U replication \
+    -D /var/lib/postgresql/17 \
+    -Fp -Xs -P -R \
+    --no-password
+```
+
+`-R` creates `standby.signal` and writes an initial `primary_conninfo` to
+`postgresql.auto.conf`. On the next boot, the matrix-cluster module's
+`matrix-cluster-standby-bootstrap` service overwrites it to use a libpq passfile.
+
+### 4. Deploy PL standby
+
+```sh
+nixos-rebuild switch --flake .#'bfs.poland.xray|x86_64-linux' --target-host root@91.198.166.181  # from the workstation
+sudo systemctl start postgresql  # on PL
+```
+
+Verify streaming on NL:
+
+```sh
+sudo -u postgres psql -c 'select client_addr, state, sync_state from pg_stat_replication;'
+# Expected: 91.198.166.181 | streaming | async
+```
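The streaming check above can be turned into a pass/fail test by parsing unaligned `psql` output. A sketch under the assumption that the query is run with `psql -At` (default `|` field separator); `is_streaming` is a hypothetical helper name:

```shell
# is_streaming (hypothetical helper): reads "client_addr|state|sync_state"
# rows on stdin and succeeds only if the given peer is in state "streaming".
is_streaming() {
  peer=$1
  awk -F'|' -v peer="$peer" '
    $1 == peer && $2 == "streaming" { found = 1 }
    END { exit (found ? 0 : 1) }
  '
}

# Usage (on NL):
#   sudo -u postgres psql -At \
#     -c 'select client_addr, state, sync_state from pg_stat_replication;' \
#     | is_streaming 91.198.166.181 && echo "PL is streaming"
```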
+
+Verify standby on PL:
+
+```sh
+sudo -u postgres psql -c 'select pg_is_in_recovery();'
+# Expected: t
+sudo systemctl is-active matrix-synapse
+# Expected: inactive (standby keeps Synapse off)
+```
+
+### 5. Remove duplicate S3 credentials from `sus/hectic-lab.yaml`
+
+Only AFTER NL is confirmed healthy reading from the new shared file:
+
+```sh
+sops sus/hectic-lab.yaml
+# Delete the matrix/object-storage/credentials block.
+sudo nixos-rebuild switch --flake .#'hectic-lab|x86_64-linux'
+```
+
+## Normal operations
+
+```sh
+# NL: replication health
+sudo -u postgres psql -c 'select * from pg_stat_replication;'
+# Expected: 1 row, state=streaming, sync_state=async
+
+# PL: replay lag (also grows while NL is idle, because nothing new arrives to replay)
+sudo -u postgres psql -c 'select now() - pg_last_xact_replay_timestamp() as lag;'
+
+# Both: cert renewal
+sudo systemctl status acme-accord.tube.timer
+sudo journalctl -u acme-accord.tube.service --since '24 hours ago'
+
+# Synapse health (NL primary)
+curl -sf https://accord.tube/_matrix/client/versions | head
+```
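For alerting, the replay-lag query above can be reduced to whole seconds and compared against a threshold. A rough sketch: `lag_ok` is a hypothetical helper and the 60-second default is an arbitrary assumption, not a value from the module:

```shell
# lag_ok (hypothetical helper): succeeds if the lag in whole seconds is at
# most the threshold. An empty or non-numeric value (for example NULL when
# nothing has been replayed yet, or a negative value from clock skew) is
# reported as unknown with exit status 2.
lag_ok() {
  lag=$1
  max=${2:-60}
  case $lag in
    ''|*[!0-9]*)
      echo "lag unknown: '${lag}'" >&2
      return 2
      ;;
  esac
  [ "$lag" -le "$max" ]
}

# Usage (on PL):
#   lag=$(sudo -u postgres psql -At -c \
#     'select extract(epoch from now() - pg_last_xact_replay_timestamp())::int;')
#   lag_ok "$lag" 60 || echo "replay lag too high or unknown"
```

Because the measured value also rises when the primary is simply idle, a failure here is a prompt to look at `pg_stat_replication` on NL, not proof that replication is broken.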
+
+## Planned failover (NL -> PL)
+
+```sh
+# 1. On NL: drain by stopping Synapse, then Postgres.
+sudo systemctl stop matrix-synapse
+sudo systemctl stop postgresql   # ensure no new WAL after this point
+
+# 2. On PL: promote the replica.
+sudo -u postgres pg_ctl -D /var/lib/postgresql/17 promote
+# Wait until pg_is_in_recovery() returns f:
+sudo -u postgres psql -c 'select pg_is_in_recovery();'
+
+# 3. Make the role switch declarative before rebuilding.
+#    Edit the flake so rebuilds match the promoted database state:
+#      - nixos/system/bfs.poland.xray/bfs.poland.xray.nix:
+#          hectic.generic.matrix-cluster.role = "primary";
+#          hectic.generic.matrix-cluster.overrideEnableSynapse = true;
+#          hectic.generic.matrix-cluster.secretsFile = config.sops.secrets."matrix/secrets".path;
+#      - nixos/system/hectic-lab/hectic-lab.nix:
+#          hectic.generic.matrix-cluster.role = "standby";
+#          hectic.generic.matrix-cluster.overrideEnableSynapse = false;
+#          hectic.generic.matrix-cluster.replication.peerHost = "91.198.166.181";
+#          hectic.generic.matrix-cluster.replication.allowedSourceIPs = [ "128.140.75.58/32" ];
+#    (You will also need a matrix/secrets entry on PL - copy from NL via SOPS.)
+sudo nixos-rebuild switch --flake .#'bfs.poland.xray|x86_64-linux'   # on PL
+sudo nixos-rebuild switch --flake .#'hectic-lab|x86_64-linux'        # on NL
+sudo systemctl status matrix-synapse                                 # on PL: expect active
+
+# 4. Swap DNS A record at Porkbun:
+#    accord.tube  A  91.198.166.181   (was 128.140.75.58)
+#    TTL: set to 300 in advance of any planned failover.
+#    Porkbun UI: https://porkbun.com/account/domainsSpeedy -> accord.tube -> DNS -> edit A record.
+#    Or via API (PORKBUN_API_KEY and PORKBUN_SECRET_API_KEY must be set in the calling shell):
+curl -sX POST https://api.porkbun.com/api/json/v3/dns/editByNameType/accord.tube/A \
+  -H 'content-type: application/json' \
+  -d "$(jq -n --arg k "$PORKBUN_API_KEY" --arg s "$PORKBUN_SECRET_API_KEY" \
+        '{secretapikey:$s,apikey:$k,content:"91.198.166.181",ttl:"300"}')"
+
+# 5. Federation smoke test.
+curl -s 'https://federationtester.matrix.org/api/report?server_name=accord.tube' | jq .FederationOK
+# Expected: true
+```
+
+Expected after the rebuilds:
+
+- `bfs.poland.xray` evaluates and runs as `role = "primary"`.
+- `hectic-lab` evaluates as `role = "standby"` with Synapse forced off.
+- Future `nixos-rebuild` runs preserve the promoted topology instead of reapplying standby settings to PL.
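Step 2 of the planned failover says to wait until `pg_is_in_recovery()` returns `f`. One way to make that wait explicit is a bounded polling loop. This is a sketch only: `wait_for_promotion` and `in_recovery` are hypothetical names, and the attempt count and delay are arbitrary defaults.

```shell
# wait_for_promotion (hypothetical helper): runs the given command repeatedly
# until it prints "f", giving up after a fixed number of attempts.
wait_for_promotion() {
  check_cmd=$1
  attempts=${2:-30}
  delay=${3:-2}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if [ "$($check_cmd 2>/dev/null)" = "f" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# in_recovery (hypothetical wrapper): prints "t" or "f" on the local node.
in_recovery() {
  sudo -u postgres psql -At -c 'select pg_is_in_recovery();'
}

# Usage (on PL, right after pg_ctl promote):
#   wait_for_promotion in_recovery || echo "PL is still in recovery" >&2
```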
+
+## Failback (PL -> NL)
+
+```sh
+# 1. On NL: stop Postgres and Synapse if still running, then clear the data dir.
+sudo systemctl stop postgresql matrix-synapse
+sudo rm -rf /var/lib/postgresql/17
+
+# 2. Re-seed NL from PL (now the live primary).
+sudo -u postgres install -d -m 0700 /var/lib/postgresql/17
+sudo -u postgres PGPASSWORD="$(sudo cat /run/secrets/matrix/postgres-replication-password)" \
+  pg_basebackup -h 91.198.166.181 -p 5432 -U replication \
+    -D /var/lib/postgresql/17 -Fp -Xs -P -R --no-password
+
+# 3. Temporarily flip roles in the flake:
+#    - hectic-lab.nix: role = "standby";   peerHost = "91.198.166.181";
+#    - bfs.poland.xray.nix: role = "primary"; peerHost = "128.140.75.58";
+#    Rebuild both.
+
+# 4. Once NL is confirmed streaming from PL, run the failover in reverse:
+sudo systemctl stop matrix-synapse postgresql         # on PL; no new WAL after this point
+sudo -u postgres pg_ctl -D /var/lib/postgresql/17 promote   # on NL
+# Then revert the flake role assignments back to NL=primary / PL=standby and
+# rebuild both hosts.
+
+# 5. Swap DNS back at Porkbun (A -> 128.140.75.58).
+```
+
+## Disaster recovery (NL permanently lost)
+
+```sh
+# 1. Promote PL as the new permanent primary.
+sudo -u postgres pg_ctl -D /var/lib/postgresql/17 promote
+
+# 2. Edit nixos/system/bfs.poland.xray/bfs.poland.xray.nix:
+#      hectic.generic.matrix-cluster.role = "primary";
+#      hectic.generic.matrix-cluster.overrideEnableSynapse = lib.mkForce null;
+#      hectic.generic.matrix-cluster.replication.peerHost = "<new-standby-ip>";
+#      hectic.generic.matrix-cluster.replication.allowedSourceIPs = [ "<new-standby-ip>/32" ];
+
+# 3. Provision a new host (any region with Porkbun-managed DNS) and import
+#    self.nixosModules.matrix-cluster with role = "standby" pointed at PL's IP.
+
+# 4. Bootstrap the new standby via pg_basebackup from PL exactly as in
+#    "Initial setup" step 3, replacing 128.140.75.58 with PL's IP (91.198.166.181).
+
+# 5. Update the Porkbun A record to PL's IP (91.198.166.181) permanently.
+```