Multi-Region Replication
SapixDB replicates across cloud regions with push + pull sync, automatic retry on transient failures, and an in-process Codios auth cache that removes the auth round-trip from your write latency critical path.
How replication works
Every write committed to the local WAL is immediately pushed to all registered peers via POST /v1/mesh/ingest. Each block carries its BLAKE3 content hash and Ed25519 signature — replicas verify authenticity on receipt. Ingest is idempotent: the same block arriving twice is silently ignored.
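The idempotent ingest described above can be sketched roughly as follows. This is an illustration under assumptions, not SapixDB's implementation: Python's standard library has neither BLAKE3 nor Ed25519, so the sketch substitutes BLAKE2b for the content hash and leaves signature verification out, and the names `Replica`, `ingest`, and `content_hash` are hypothetical.

```python
import hashlib


def content_hash(payload: bytes) -> str:
    # Stand-in for BLAKE3 (not in the standard library): BLAKE2b, 32-byte digest.
    return hashlib.blake2b(payload, digest_size=32).hexdigest()


class Replica:
    """Hypothetical in-memory replica that stores blocks by content hash."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}

    def ingest(self, claimed_hash: str, payload: bytes) -> str:
        # Reject a block whose payload does not match the hash it carries.
        if content_hash(payload) != claimed_hash:
            return "rejected"
        # Idempotence: a block that is already stored is ignored.
        if claimed_hash in self.blocks:
            return "duplicate"
        self.blocks[claimed_hash] = payload
        return "stored"
```

Keying storage by content hash is what makes a repeated delivery harmless: the second arrival maps to an entry that already exists.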
Push (primary → replicas) happens in a tokio::spawn fire-and-forget after every write. A per-peer in-memory retry queue with exponential backoff handles transient failures automatically — no blocks are silently dropped.
Pull (catch-up / bootstrap) — a replica that fell behind calls POST /v1/mesh/sync/{peer_agent_id}, which fetches paginated history from the peer's export endpoint starting from a persisted cursor. Bootstrap (POST /v1/cluster/divide) does this automatically for all existing data when you first add a node.
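A cursor-based catch-up loop might look like the sketch below. It assumes the cursor is a simple integer offset and that `fetch_page(cursor, page_size)` is a hypothetical callable wrapping the peer's export endpoint; the real cursor format and endpoint parameters are not specified here.

```python
import json
from pathlib import Path


def pull_sync(peer_id, fetch_page, cursor_file: Path, page_size: int = 100) -> list:
    """Fetch blocks after the persisted cursor, one page at a time."""
    cursors = json.loads(cursor_file.read_text()) if cursor_file.exists() else {}
    cursor = cursors.get(peer_id, 0)  # assumption: cursor is an integer offset
    fetched = []
    while True:
        page = fetch_page(cursor, page_size)  # hypothetical: blocks after `cursor`
        if not page:
            break
        fetched.extend(page)
        cursor += len(page)
        # Persist after every page so an interrupted sync resumes where it stopped.
        cursors[peer_id] = cursor
        cursor_file.write_text(json.dumps(cursors))
    return fetched
```

Because the cursor is written after each page, a second call only fetches blocks added since the previous run.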
Automatic push retry
When a push to a peer fails (network error or non-2xx response), the block is placed in a per-peer in-memory retry queue. A background task wakes every 500 ms, checks which peers have ready messages, and drains them at full speed. When a peer comes back online, its backoff resets immediately and the entire queue drains without waiting.
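The retry behavior above can be approximated with a small per-peer queue. This is a sketch under assumptions: the 0.5 s base delay and 60 s cap are illustrative values, and `PeerRetryQueue` and its `send` callback are hypothetical names, not part of SapixDB.

```python
from collections import deque


class PeerRetryQueue:
    """Hypothetical per-peer retry queue with exponential backoff."""

    def __init__(self, base_delay: float = 0.5, max_delay: float = 60.0) -> None:
        self.base_delay = base_delay  # assumed value, not from the docs
        self.max_delay = max_delay    # assumed value, not from the docs
        self.pending: deque[bytes] = deque()
        self.failures = 0
        self.next_attempt = 0.0  # earliest time the next drain may run

    def enqueue(self, block: bytes) -> None:
        self.pending.append(block)

    def drain(self, now: float, send) -> int:
        """Send queued blocks in order; stop and back off on the first failure."""
        if now < self.next_attempt:
            return 0
        sent = 0
        while self.pending:
            if send(self.pending[0]):
                self.pending.popleft()
                sent += 1
                self.failures = 0  # peer is healthy again: reset backoff
                self.next_attempt = 0.0
            else:
                self.failures += 1
                delay = min(self.base_delay * 2 ** (self.failures - 1), self.max_delay)
                self.next_attempt = now + delay
                break
        return sent
```

A background task would call `drain` for each peer on every tick. Once a send succeeds, the failure count resets and the loop keeps going until the queue is empty.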
Setting up multi-region (two nodes)
Step 1 — Start the primary (Region A)
docker run -d \
  -e SAPIX_AGENT_ID=primary \
  -e SAPIX_MASTER_SEED=<32-byte-hex> \
  -e SAPIX_STRAND_DIR=/data/strand \
  -e SAPIX_GRAPH_DIR=/data/graph \
  -e SAPIX_BLOB_DIR=/data/blobs \
  -e SAPIX_BIND_ADDR=0.0.0.0:7475 \
  -p 7475:7475 \
  -v sapix_strand:/data/strand \
  -v sapix_graph:/data/graph \
  -v sapix_blobs:/data/blobs \
  sapixdb/agent:latest
Step 2 — Start the replica (Region B)
# Use the same seed as the primary: HKDF derives a different keypair per agent_id
docker run -d \
  -e SAPIX_AGENT_ID=replica-eu \
  -e SAPIX_MASTER_SEED=<same-32-byte-hex> \
  -e SAPIX_STRAND_DIR=/data/strand \
  -e SAPIX_GRAPH_DIR=/data/graph \
  -e SAPIX_BLOB_DIR=/data/blobs \
  -e SAPIX_BIND_ADDR=0.0.0.0:7475 \
  -p 7475:7475 \
  -v sapix_strand:/data/strand \
  -v sapix_graph:/data/graph \
  -v sapix_blobs:/data/blobs \
  sapixdb/agent:latest
Step 3 — Bootstrap the replica (one call)
POST http://primary-region-a:7475/v1/cluster/divide
Authorization: Bearer <SAPIX_ROOT_KEY>
Content-Type: application/json
{
"new_agent_id": "replica-eu",
"endpoint": "http://replica-region-b:7475",
"role": "Replica"
}
# Response
{
"replica_agent_id": "replica-eu",
"blocks_pushed": 1842,
"endpoint": "http://replica-region-b:7475"
}
This copies all existing blocks to the replica synchronously, then registers it as a peer for all future pushes. The replica is immediately readable after the call returns.
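For scripting the bootstrap call, a minimal Python sketch using only the standard library could build the request as shown here. The helper name `build_divide_request` is hypothetical; the path, headers, and body fields are taken from the example above.

```python
import json
import urllib.request


def build_divide_request(primary_url: str, root_key: str,
                         new_agent_id: str, endpoint: str) -> urllib.request.Request:
    """Build the POST /v1/cluster/divide request without sending it."""
    body = json.dumps({
        "new_agent_id": new_agent_id,
        "endpoint": endpoint,
        "role": "Replica",
    }).encode("utf-8")
    return urllib.request.Request(
        url=f"{primary_url}/v1/cluster/divide",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {root_key}",
            "Content-Type": "application/json",
        },
    )


# Sending it requires a running primary:
# req = build_divide_request("http://primary-region-a:7475", root_key,
#                            "replica-eu", "http://replica-region-b:7475")
# with urllib.request.urlopen(req, timeout=60) as resp:
#     print(json.load(resp)["blocks_pushed"])
```

Since the call copies all existing blocks synchronously, a generous timeout is probably sensible for large datasets; the 60 s shown is only a placeholder.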
Option B — docker-compose for local/staging
cp docker-compose.replication.yml docker-compose.yml
SAPIX_MASTER_SEED=<32-byte-hex> docker compose up -d
# Node 1 on :7475, node 2 on :7476, both on a shared bridge network
CP vs AP consistency
By default SapixDB runs CP — writes return 503 if any peer is unreachable (partition detected). Switch to AP for cross-region deployments where you can tolerate a brief divergence window during a network partition.
CP (default)
- Writes return 503 if any peer is unreachable
- Guarantees all nodes share the same strand tip
- Best for financial ledgers, audit trails, SOX
AP
- Writes always succeed on reachable nodes
- Divergent state reconciled on partition heal via pull-sync
- Best for high-throughput ingest, IoT, analytics
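The difference between the two modes comes down to one decision at write time, which the following sketch tries to capture. The function name `write_status` and the 200 success code are assumptions for illustration; only the 503 on an unreachable peer in CP mode comes from the description above.

```python
def write_status(mode: str, peers_reachable: list[bool]) -> int:
    """Return the HTTP status a write would get under the given consistency mode."""
    if mode == "cp" and not all(peers_reachable):
        return 503  # CP: refuse the write while any peer is unreachable
    return 200  # AP, or CP with every peer reachable (success code is assumed)
```

For example, `write_status("cp", [True, False])` yields 503 while `write_status("ap", [True, False])` yields 200, leaving the divergence to be reconciled by pull-sync once the partition heals.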
# Switch to AP consistency (tolerate partition, reconcile on heal)
POST /v1/cluster/consistency
Content-Type: application/json
{ "mode": "Ap" }
# Response
{ "consistency": "ap", "previous": "cp" }
Codios auth cache
Every POST /v1/records previously called Codios synchronously before appending — one round-trip per write. The auth cache eliminates this latency for repeated operations: the first call to Codios is made normally, the result is cached for 5 seconds, and all subsequent writes during that window skip the network call entirely.
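A TTL cache of this kind can be sketched in a few lines. The sketch is illustrative only: `AuthCache`, the cache key, and the `call_codios` callback are hypothetical, and what happens when Codios is unreachable with no stale permit is left to the caller because it is not specified here. The 5 s permit and 2 s deny TTLs match the documented values.

```python
import time


class AuthCache:
    """Hypothetical TTL cache for permit/deny decisions."""

    def __init__(self, permit_ttl: float = 5.0, deny_ttl: float = 2.0,
                 clock=time.monotonic) -> None:
        self.permit_ttl = permit_ttl
        self.deny_ttl = deny_ttl
        self.clock = clock
        self.entries: dict[str, tuple[bool, float]] = {}  # key -> (permit, expires_at)

    def check(self, key: str, call_codios) -> bool:
        now = self.clock()
        cached = self.entries.get(key)
        if cached and now < cached[1]:
            return cached[0]  # cache hit: no network call
        try:
            permit = call_codios(key)
        except OSError:
            if cached and cached[0]:
                return True  # Codios unreachable: serve the stale permit
            raise  # no stale permit: this sketch leaves the decision to the caller
        ttl = self.permit_ttl if permit else self.deny_ttl
        self.entries[key] = (permit, now + ttl)
        return permit
```

Injecting the clock keeps the cache easy to test, and the shorter deny TTL means a freshly granted permission takes effect sooner than a permit expires.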
- Cache hit (permit): Write proceeds immediately. No Codios call.
- Cache hit (deny): Write blocked immediately. No Codios call.
- Cache miss: Codios is called; result is cached for the next 5 s (permit) or 2 s (deny).
- Codios unreachable + stale permit: Stale permit is served; write proceeds. Logged as auth_degraded.
… SAPIX_CODIOS_URL). Never cross regions for a synchronous auth call. The cache then handles burst throughput, and cross-region latency is eliminated entirely.
Circuit breaker (opt-in)
If Codios is continuously unreachable for longer than SAPIX_AUTH_CIRCUIT_OPEN_AFTER_SECS (default: 30 s), the circuit opens. Writes proceed using the last cached auth decisions, and every bypassed call is logged with auth_degraded: true. When Codios responds again, the circuit closes automatically and normal enforcement resumes.
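The open/close logic can be illustrated with a small state holder. This is a sketch, not the shipped implementation: `AuthCircuit` and its method names are hypothetical, and the only value taken from the text is the 30 s default threshold.

```python
class AuthCircuit:
    """Hypothetical breaker that opens after continuous Codios failures."""

    def __init__(self, open_after_secs: float = 30.0) -> None:
        self.open_after_secs = open_after_secs
        self.first_failure_at = None  # start of the current failure streak

    def record_success(self) -> None:
        self.first_failure_at = None  # Codios answered: the circuit closes

    def record_failure(self, now: float) -> None:
        # Only the first failure of a streak starts the timer.
        if self.first_failure_at is None:
            self.first_failure_at = now

    def is_open(self, now: float) -> bool:
        if self.first_failure_at is None:
            return False
        return now - self.first_failure_at > self.open_after_secs
```

Tracking the start of the failure streak, rather than counting failures, matches the "continuously unreachable for longer than" wording: any successful response resets the timer.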
environment:
  SAPIX_CODIOS_URL: "https://codios-api.midlantics.com"
  SAPIX_CODIOS_API_KEY: "codios_sk_..."
  SAPIX_AUTH_CACHE_TTL_SECS: "5"
  SAPIX_AUTH_CIRCUIT_BREAKER: "true"
  SAPIX_AUTH_CIRCUIT_OPEN_AFTER_SECS: "30"
| Env var | Default | Description |
|---|---|---|
| SAPIX_AUTH_CACHE_TTL_SECS | 5 | Permit cache TTL in seconds. Deny results always use a 2 s TTL. |
| SAPIX_AUTH_CIRCUIT_BREAKER | false | Set true to enable the circuit breaker. |
| SAPIX_AUTH_CIRCUIT_OPEN_AFTER_SECS | 30 | Seconds of continuous Codios failures before the circuit opens. |
Set SAPIX_REQUIRE_CODIOS=true to keep fail-closed semantics instead.
Ops runbook
Monitor replica lag
The retry queue handles transient failures automatically. For prolonged outages (hours), check lag and trigger a manual catch-up:
# Check replica lag (run this periodically)
curl http://replica-region-b:7475/v1/cluster/status

# If a peer's last_sync_ms is stale (> 5 min), trigger catch-up
curl -X POST http://replica-region-b:7475/v1/mesh/sync/primary \
  -H "Authorization: Bearer $SAPIX_ROOT_KEY"

# The cursor is persisted in sync_cursors.json so catch-up only fetches what's missing
Production monitoring cron
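# Alert if a node is unreachable or its last_sync_ms is more than 5 minutes stale
# (a crontab entry must stay on one line; cron does not support backslash continuation)
*/5 * * * * curl -sf http://replica:7475/v1/cluster/status | python3 -c "import sys,json,time; s=json.load(sys.stdin); now=time.time()*1000; [print('WARN lag',n['agent_id']) for n in s['nodes'] if not n['reachable'] or now-n['last_sync_ms']>300000]"
Cluster status response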
{
"nodes": [
{
"agent_id": "primary",
"endpoint": "http://primary-region-a:7475",
"role": "Primary",
"added_at_ms": 1715000000000,
"last_sync_ms": 1715001234567,
"reachable": true
},
{
"agent_id": "replica-eu",
"endpoint": "http://replica-region-b:7475",
"role": "Replica",
"added_at_ms": 1715000050000,
"last_sync_ms": 1715001230000,
"reachable": true
}
],
"partitioned": false,
"consistency": "ap"
}
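Given a status response shaped like the one above, a lag check could be written as the short sketch below. The function name `stale_nodes` and the 5-minute default threshold are illustrative choices; the field names come from the sample response.

```python
def stale_nodes(status: dict, now_ms: int, max_lag_ms: int = 5 * 60 * 1000) -> list[str]:
    """Return agent_ids that are unreachable or whose last_sync_ms is too old."""
    return [
        node["agent_id"]
        for node in status["nodes"]
        if not node["reachable"] or now_ms - node["last_sync_ms"] > max_lag_ms
    ]
```

Passing `now_ms` in explicitly, for example `int(time.time() * 1000)`, keeps the check deterministic and easy to run against a saved status response.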