2026-05-23 22:57:09 +00:00
parent 9c6de9d067
commit 4c0a178646
6 changed files with 946 additions and 21 deletions

@@ -0,0 +1,230 @@
# Matrix Cluster Failover Runbook (`accord.tube`)
Primary: `hectic-lab` (NL, `128.140.75.58`)
Standby: `bfs.poland.xray` (PL, `91.198.166.181`)
Module: `hectic.generic.matrix-cluster` (`nixos/module/generic/matrix-cluster.nix`).
Shared secrets: `sus/matrix-cluster.yaml`.
All `psql` and `pg_ctl` invocations use PostgreSQL **17** at data dir
`/var/lib/postgresql/17`.
## Initial setup
### 1. Provision shared SOPS file (`sus/matrix-cluster.yaml`)
On a workstation with both yukkop and yukkop-alt age keys available:
```sh
sudo cat /var/lib/matrix-synapse/homeserver.signing.key # on NL (hectic-lab)
# Copy the single line value into the buffer for the next step.
sops sus/matrix-cluster.yaml
```
Populate the editor with:
```yaml
matrix:
  signing-key: <paste verbatim signing-key line from NL>
  postgres-replication-password: <openssl rand -base64 32>
  object-storage:
    credentials: |
      ACCESS_KEY_ID=<verbatim copy from sus/hectic-lab.yaml>
      SECRET_ACCESS_KEY=<verbatim copy from sus/hectic-lab.yaml>
  porkbun-api-key: <PORKBUN_API_KEY>
  porkbun-secret-api-key: <PORKBUN_SECRET_API_KEY>
```
Verify recipients:
```sh
sops updatekeys sus/matrix-cluster.yaml
sops -d sus/matrix-cluster.yaml | grep -E 'signing-key|postgres-replication-password|object-storage|porkbun-api-key|porkbun-secret-api-key'
```
Expected: all five keys present, exit 0.
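The `grep` above echoes the decrypted values to the terminal. A minimal sketch of a check that prints only key names, assuming the five key names from the YAML template above; the function name `check_cluster_keys` is hypothetical and not part of the repository:

```sh
# Hypothetical helper: reads decrypted YAML on stdin and reports which of the
# expected key names are missing. Only key names are printed, never values.
check_cluster_keys() {
    input=$(cat)
    missing=0
    for key in signing-key postgres-replication-password credentials \
               porkbun-api-key porkbun-secret-api-key; do
        # Match the key at the start of a line, after optional indentation,
        # so "porkbun-api-key" does not match "porkbun-secret-api-key".
        if ! printf '%s\n' "$input" | grep -Eq "^[[:space:]]*${key}:"; then
            echo "missing: ${key}"
            missing=1
        fi
    done
    return "$missing"
}
```

Usage: `sops -d sus/matrix-cluster.yaml | check_cluster_keys`. A zero exit status with no output means all five keys are present.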
### 2. Deploy NL primary first
```sh
nixos-rebuild switch --flake .#'hectic-lab|x86_64-linux' --target-host root@128.140.75.58
```
Verify on NL:
```sh
sudo systemctl status matrix-synapse postgresql matrix-cluster-replication-password
sudo -u postgres psql -c "select rolname, rolreplication from pg_roles where rolname='replication';"
# Expected: replication | t
```
### 3. Seed PL replica with `pg_basebackup`
On PL:
```sh
sudo systemctl stop postgresql
sudo rm -rf /var/lib/postgresql/17
sudo -u postgres install -d -m 0700 /var/lib/postgresql/17
sudo -u postgres PGPASSWORD="$(sudo cat /run/secrets/matrix/postgres-replication-password)" \
  pg_basebackup \
    -h 128.140.75.58 \
    -p 5432 \
    -U replication \
    -D /var/lib/postgresql/17 \
    -Fp -Xs -P -R \
    --no-password
```
`-R` writes `standby.signal` and an initial `primary_conninfo` (appended to
`postgresql.auto.conf`) into the data dir. The matrix-cluster module's
`matrix-cluster-standby-bootstrap` service will overwrite `primary_conninfo`
to use a libpq passfile on next boot.
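Before starting PostgreSQL on PL it can help to confirm that `pg_basebackup -R` left both artifacts behind. A minimal sketch, assuming the data dir layout described above; the function name `check_standby_seed` is hypothetical:

```sh
# Hypothetical helper: succeeds when DATA_DIR looks like a freshly seeded
# standby, i.e. standby.signal exists and postgresql.auto.conf sets
# primary_conninfo.
check_standby_seed() {
    data_dir=$1
    if [ ! -f "$data_dir/standby.signal" ]; then
        echo "no standby.signal"
        return 1
    fi
    if ! grep -q '^primary_conninfo' "$data_dir/postgresql.auto.conf" 2>/dev/null; then
        echo "no primary_conninfo in postgresql.auto.conf"
        return 1
    fi
    echo "standby seed looks complete"
}
```

Usage: `check_standby_seed /var/lib/postgresql/17`, run as root or `postgres` because the directory is mode 0700.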
### 4. Deploy PL standby
```sh
# From the workstation:
nixos-rebuild switch --flake .#'bfs.poland.xray|x86_64-linux' --target-host root@91.198.166.181
# Then on PL:
sudo systemctl start postgresql
```
Verify streaming on NL:
```sh
sudo -u postgres psql -c 'select client_addr, state, sync_state from pg_stat_replication;'
# Expected: 91.198.166.181 | streaming | async
```
Verify standby on PL:
```sh
sudo -u postgres psql -c 'select pg_is_in_recovery();'
# Expected: t
sudo systemctl is-active matrix-synapse
# Expected: inactive (standby keeps Synapse off)
```
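The streaming check above can be turned into a pass/fail test for scripts. A minimal sketch that parses unaligned `psql -At` output (fields separated by `|`); the function name `standby_is_streaming` is hypothetical:

```sh
# Hypothetical helper: reads rows of "client_addr|state|sync_state" on stdin
# (as produced by psql -At) and succeeds only if EXPECTED_ADDR is present
# with state "streaming".
standby_is_streaming() {
    expected_addr=$1
    awk -F'|' -v addr="$expected_addr" '
        $1 == addr && $2 == "streaming" { found = 1 }
        END { exit found ? 0 : 1 }
    '
}
```

Usage on NL: `sudo -u postgres psql -Atc 'select client_addr, state, sync_state from pg_stat_replication;' | standby_is_streaming 91.198.166.181`.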
### 5. Remove duplicate S3 credentials from `sus/hectic-lab.yaml`
Only AFTER NL is confirmed healthy reading from the new shared file:
```sh
sops sus/hectic-lab.yaml
# Delete the matrix/object-storage/credentials block.
sudo nixos-rebuild switch --flake .#'hectic-lab|x86_64-linux'
```
## Normal operations
```sh
# NL: replication health
sudo -u postgres psql -c 'select * from pg_stat_replication;'
# Expected: 1 row, state=streaming, sync_state=async
# PL: replay status
sudo -u postgres psql -c 'select now() - pg_last_xact_replay_timestamp() as lag;'
# Both: cert renewal
sudo systemctl status acme-accord.tube.timer
sudo journalctl -u acme-accord.tube.service --since '24 hours ago'
# Synapse health (NL primary)
curl -sf https://accord.tube/_matrix/client/versions | head
```
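The replay-lag query returns an interval, which is awkward to alert on. A minimal sketch of a threshold check, assuming the lag is fetched in seconds via `extract(epoch from ...)`; the function name `lag_within` and the 60-second threshold in the usage line are assumptions, not values from the module:

```sh
# Hypothetical helper: succeeds when LAG_SECONDS (possibly fractional) is at
# or below MAX_SECONDS. An empty value means pg_last_xact_replay_timestamp()
# was NULL, i.e. the standby has not replayed any transaction yet.
lag_within() {
    lag_seconds=$1
    max_seconds=$2
    if [ -z "$lag_seconds" ]; then
        echo "lag unknown (no replayed transaction)"
        return 2
    fi
    awk -v lag="$lag_seconds" -v max="$max_seconds" \
        'BEGIN { exit (lag + 0 <= max + 0) ? 0 : 1 }'
}
```

Usage on PL: `lag_within "$(sudo -u postgres psql -Atc 'select extract(epoch from now() - pg_last_xact_replay_timestamp());')" 60`. Note that this figure also grows while the primary is idle, since it measures time since the last replayed transaction rather than bytes outstanding.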
## Planned failover (NL -> PL)
```sh
# 1. On NL: drain by stopping Synapse, then PostgreSQL.
sudo systemctl stop matrix-synapse
sudo systemctl stop postgresql # ensure no new WAL after this point
# 2. On PL: promote the replica.
sudo -u postgres pg_ctl -D /var/lib/postgresql/17 promote
# Repeat until pg_is_in_recovery() returns f:
sudo -u postgres psql -c 'select pg_is_in_recovery();'
# 3. Make the role switch declarative before rebuilding.
# Edit the flake so rebuilds match the promoted database state:
# - nixos/system/bfs.poland.xray/bfs.poland.xray.nix:
# hectic.generic.matrix-cluster.role = "primary";
# hectic.generic.matrix-cluster.overrideEnableSynapse = true;
# hectic.generic.matrix-cluster.secretsFile = config.sops.secrets."matrix/secrets".path;
# - nixos/system/hectic-lab/hectic-lab.nix:
# hectic.generic.matrix-cluster.role = "standby";
# hectic.generic.matrix-cluster.overrideEnableSynapse = false;
# hectic.generic.matrix-cluster.replication.peerHost = "91.198.166.181";
# hectic.generic.matrix-cluster.replication.allowedSourceIPs = [ "128.140.75.58/32" ];
# (You will also need a matrix/secrets entry on PL - copy from NL via SOPS.)
sudo nixos-rebuild switch --flake .#'bfs.poland.xray|x86_64-linux' # on PL
sudo nixos-rebuild switch --flake .#'hectic-lab|x86_64-linux' # on NL
sudo systemctl status matrix-synapse # on PL: expect active
# 4. Swap DNS A record at Porkbun:
# accord.tube A 91.198.166.181 (was 128.140.75.58)
# TTL: lower to 600 (Porkbun's documented minimum) well ahead of any planned failover.
# Porkbun UI: https://porkbun.com/account/domainsSpeedy -> accord.tube -> DNS -> edit A record.
# Or via API (needs PORKBUN_API_KEY and PORKBUN_SECRET_API_KEY exported in the shell; root is not required):
curl -sX POST https://api.porkbun.com/api/json/v3/dns/editByNameType/accord.tube/A \
  -H 'content-type: application/json' \
  -d "$(jq -n --arg k "$PORKBUN_API_KEY" --arg s "$PORKBUN_SECRET_API_KEY" \
        '{secretapikey:$s,apikey:$k,content:"91.198.166.181",ttl:"600"}')"
# 5. Federation smoke test.
curl -s 'https://federationtester.matrix.org/api/report?server_name=accord.tube' | jq .FederationOK
# Expected: true
```
Expected after the rebuilds:
- `bfs.poland.xray` evaluates and runs as `role = "primary"`.
- `hectic-lab` evaluates as `role = "standby"` with Synapse forced off.
- Future `nixos-rebuild` runs preserve the promoted topology instead of reapplying standby settings to PL.
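Step 2 of the planned failover waits for `pg_is_in_recovery()` to return `f`. Rather than re-running the query by hand, the wait can be sketched as a bounded polling loop. The function name `wait_for_promotion` and the timeout in the usage line are hypothetical:

```sh
# Hypothetical helper: runs the given command once per second until it prints
# exactly "f", or until TIMEOUT seconds have passed.
# Usage: wait_for_promotion TIMEOUT COMMAND [ARGS...]
wait_for_promotion() {
    timeout=$1
    shift
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        if [ "$("$@" 2>/dev/null)" = "f" ]; then
            echo "promoted after ${elapsed}s"
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    echo "still in recovery after ${timeout}s"
    return 1
}
```

Usage on PL: `wait_for_promotion 60 sudo -u postgres psql -Atc 'select pg_is_in_recovery();'`. Passing the check as a command keeps the loop independent of how PostgreSQL is reached.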
## Failback (PL -> NL)
```sh
# 1. On NL: stop services if still up, then clear the data dir.
sudo systemctl stop postgresql matrix-synapse
sudo rm -rf /var/lib/postgresql/17
# 2. On NL: re-seed from PL (now the live primary).
sudo -u postgres install -d -m 0700 /var/lib/postgresql/17
sudo -u postgres PGPASSWORD="$(sudo cat /run/secrets/matrix/postgres-replication-password)" \
  pg_basebackup -h 91.198.166.181 -p 5432 -U replication \
    -D /var/lib/postgresql/17 -Fp -Xs -P -R --no-password
# 3. Temporarily flip roles in the flake:
# - hectic-lab.nix: role = "standby"; peerHost = "91.198.166.181";
# - bfs.poland.xray.nix: role = "primary"; peerHost = "128.140.75.58";
# Rebuild both.
# 4. Once NL is streaming green, do the reverse failover dance:
sudo systemctl stop matrix-synapse # on PL
sudo systemctl stop postgresql # on PL; ensure no new WAL after this point
sudo -u postgres pg_ctl -D /var/lib/postgresql/17 promote # on NL
# Then revert the flake role assignments back to NL=primary / PL=standby and
# rebuild both hosts.
# 5. Swap DNS back at Porkbun (A -> 128.140.75.58).
```
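After either DNS swap, it is worth confirming what resolvers actually return before declaring the move complete. A minimal sketch that checks `dig +short` output against the expected address; the function name `a_record_is` and the choice of public resolver in the usage line are assumptions:

```sh
# Hypothetical helper: reads resolver answers on stdin (one address per line,
# as from `dig +short`) and succeeds only if there is at least one answer and
# every answer equals EXPECTED_IP.
a_record_is() {
    expected_ip=$1
    awk -v ip="$expected_ip" '
        NF { seen = 1; if ($0 != ip) bad = 1 }
        END { exit (seen && !bad) ? 0 : 1 }
    '
}
```

Usage after failback: `dig +short accord.tube A @1.1.1.1 | a_record_is 128.140.75.58`. Cached answers can persist until the previous TTL expires, so a failure shortly after the swap is not necessarily an error.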
## Disaster recovery (NL permanently lost)
```sh
# 1. On PL: promote it as the new permanent primary.
sudo -u postgres pg_ctl -D /var/lib/postgresql/17 promote
# 2. Edit nixos/system/bfs.poland.xray/bfs.poland.xray.nix:
# hectic.generic.matrix-cluster.role = "primary";
# hectic.generic.matrix-cluster.overrideEnableSynapse = lib.mkForce null;
# hectic.generic.matrix-cluster.replication.peerHost = "<new-standby-ip>";
# hectic.generic.matrix-cluster.replication.allowedSourceIPs = [ "<new-standby-ip>/32" ];
# 3. Provision a new host (any region with Porkbun-managed DNS) and import
# self.nixosModules.matrix-cluster with role = "standby" pointed at PL's IP.
# 4. Bootstrap the new standby via pg_basebackup from PL exactly as in
# "Initial setup" step 3, replacing 128.140.75.58 with PL's IP.
# 5. Update Porkbun A record to PL's IP permanently.
```