
Multi-Region Replication

SapixDB replicates across cloud regions with push + pull sync, automatic retry on transient failures, and an in-process Codios auth cache that removes the auth round-trip from your write latency critical path.

Push + Pull replication
Writes fan out to peers instantly. A background retry queue with exponential backoff covers transient failures automatically.
CP or AP consistency
Switch between CP (writes block during partition) and AP (writes continue, reconcile on heal) with a single API call.
Auth cache
Codios auth results are cached in-process. Cache hits skip the network round-trip entirely — no added latency on hot paths.
Circuit breaker
Optional: if Codios is unreachable for 30 s, the circuit opens and writes proceed with stale cached decisions until Codios recovers.

How replication works

Every write committed to the local WAL is immediately pushed to all registered peers via POST /v1/mesh/ingest. Each block carries its BLAKE3 content hash and Ed25519 signature — replicas verify authenticity on receipt. Ingest is idempotent: the same block arriving twice is silently ignored.
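The idempotent-ingest behaviour described above can be illustrated with a small sketch. This is not SapixDB's implementation: the `IngestStore` class and its return values are hypothetical, `hashlib.blake2b` stands in for BLAKE3 (which is not in the Python standard library), and Ed25519 signature verification is left out entirely.

```python
import hashlib


class IngestStore:
    """Toy replica store that ignores blocks it has already seen."""

    def __init__(self):
        self.blocks = {}  # content hash (hex) -> payload bytes

    @staticmethod
    def content_hash(payload: bytes) -> str:
        # Stand-in only: the real system uses BLAKE3, not BLAKE2b.
        return hashlib.blake2b(payload, digest_size=32).hexdigest()

    def ingest(self, payload: bytes, claimed_hash: str) -> str:
        # Reject a block whose payload does not match the hash it carries.
        if self.content_hash(payload) != claimed_hash:
            return "rejected"
        # Idempotency: a block that arrives a second time changes nothing.
        if claimed_hash in self.blocks:
            return "duplicate"
        self.blocks[claimed_hash] = payload
        return "stored"


store = IngestStore()
block_hash = IngestStore.content_hash(b"hello")
first = store.ingest(b"hello", block_hash)
second = store.ingest(b"hello", block_hash)
```

Keying the store by content hash is what makes a retried push safe: the sender can resend without tracking whether an earlier attempt actually landed.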

Push (primary → replicas) happens in a tokio::spawn fire-and-forget after every write. A per-peer in-memory retry queue with exponential backoff handles transient failures automatically — no blocks are silently dropped.

Pull (catch-up / bootstrap) — a replica that fell behind calls POST /v1/mesh/sync/{peer_agent_id}, which fetches paginated history from the peer's export endpoint starting from a persisted cursor. Bootstrap (POST /v1/cluster/divide) does this automatically for all existing data when you first add a node.
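As a rough illustration of cursor-based catch-up, the sketch below resumes from a persisted cursor and fetches pages until the peer has nothing more to return. The function name `pull_sync`, the `fetch_page(cursor, limit)` callback, the integer cursor, and the JSON layout of the cursor file are all assumptions made for this example; the actual export endpoint and cursor format may differ.

```python
import json
import os
import tempfile


def pull_sync(fetch_page, cursor_path, peer_id, page_size=100):
    """Fetch missing blocks from a peer, resuming from a persisted cursor."""
    cursors = {}
    if os.path.exists(cursor_path):
        with open(cursor_path) as f:
            cursors = json.load(f)
    cursor = cursors.get(peer_id, 0)
    fetched = []
    while True:
        page = fetch_page(cursor, page_size)  # hypothetical export call
        if not page:
            break
        fetched.extend(page)
        cursor += len(page)
        # Persist after every page so a crash resumes from this point.
        cursors[peer_id] = cursor
        with open(cursor_path, "w") as f:
            json.dump(cursors, f)
    return fetched


# Usage with a fake in-memory peer standing in for the export endpoint.
history = [f"block-{i}" for i in range(250)]
fake_export = lambda cursor, limit: history[cursor:cursor + limit]
cursor_file = os.path.join(tempfile.mkdtemp(), "sync_cursors.json")
first = pull_sync(fake_export, cursor_file, "primary")   # full catch-up
second = pull_sync(fake_export, cursor_file, "primary")  # nothing new
```

Because the cursor is written after each page, a second call only transfers blocks appended since the last successful page.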

Automatic push retry

When a push to a peer fails (network error or non-2xx response), the block is placed in a per-peer in-memory retry queue. A background task wakes every 500 ms, checks which peers have ready messages, and drains them at full speed. When a peer comes back online, its backoff resets immediately and the entire queue drains without waiting.

Queue cap: 50 000 blocks (~50 MB per peer)
Initial backoff: 1 s (doubles on each failure)
Max backoff: 60 s (extended-failure alert logged)
Queue is in-memory by design. The WAL on the primary is already durable. If the primary restarts during a partition, pull-sync from the persisted cursor covers catch-up. The retry queue only needs to survive transient peer-down scenarios within a running process.
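The backoff schedule above (1 s initial, doubling per failure, capped at 60 s, reset on success) can be sketched as a per-peer queue. The class and method names are hypothetical, and two details are assumptions: what happens when the 50 000-block cap is reached is not documented, so this sketch simply refuses the block, and recovery is only noticed at the next scheduled attempt rather than immediately.

```python
from collections import deque

INITIAL_BACKOFF_S = 1.0
MAX_BACKOFF_S = 60.0
QUEUE_CAP = 50_000


class PeerRetryQueue:
    def __init__(self):
        self.blocks = deque()
        self.backoff_s = INITIAL_BACKOFF_S
        self.next_attempt_at = 0.0

    def enqueue(self, block) -> bool:
        # Assumption: overflow behaviour is undocumented; refuse the block here.
        if len(self.blocks) >= QUEUE_CAP:
            return False
        self.blocks.append(block)
        return True

    def drain(self, send, now: float) -> int:
        """Send queued blocks in order; back off on the first failure."""
        if now < self.next_attempt_at:
            return 0  # still waiting out the current backoff
        sent = 0
        while self.blocks:
            if not send(self.blocks[0]):
                self.next_attempt_at = now + self.backoff_s
                self.backoff_s = min(self.backoff_s * 2, MAX_BACKOFF_S)
                return sent
            self.blocks.popleft()
            sent += 1
            self.backoff_s = INITIAL_BACKOFF_S  # success resets the backoff
        self.next_attempt_at = 0.0
        return sent


queue = PeerRetryQueue()
for i in range(3):
    queue.enqueue(i)
peer_down = lambda block: False
peer_up = lambda block: True
```

A background task would call `drain` for each peer on every tick (every 500 ms in the description above), passing the real push function as `send`.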

Setting up multi-region (two nodes)

Step 1 — Start the primary (Region A)

shell — Region A
docker run -d \
  -e SAPIX_AGENT_ID=primary \
  -e SAPIX_MASTER_SEED=<32-byte-hex> \
  -e SAPIX_STRAND_DIR=/data/strand \
  -e SAPIX_GRAPH_DIR=/data/graph \
  -e SAPIX_BLOB_DIR=/data/blobs \
  -e SAPIX_BIND_ADDR=0.0.0.0:7475 \
  -p 7475:7475 \
  -v sapix_strand:/data/strand \
  -v sapix_graph:/data/graph \
  -v sapix_blobs:/data/blobs \
  sapixdb/agent:latest

Step 2 — Start the replica (Region B)

shell — Region B
# The seed matches the primary's; HKDF derives a different keypair per agent_id.
docker run -d \
  -e SAPIX_AGENT_ID=replica-eu \
  -e SAPIX_MASTER_SEED=<same-32-byte-hex> \
  -e SAPIX_STRAND_DIR=/data/strand \
  -e SAPIX_GRAPH_DIR=/data/graph \
  -e SAPIX_BLOB_DIR=/data/blobs \
  -e SAPIX_BIND_ADDR=0.0.0.0:7475 \
  -p 7475:7475 \
  -v sapix_strand:/data/strand \
  -v sapix_graph:/data/graph \
  -v sapix_blobs:/data/blobs \
  sapixdb/agent:latest

Step 3 — Bootstrap the replica (one call)

HTTP
POST http://primary-region-a:7475/v1/cluster/divide
Authorization: Bearer <SAPIX_ROOT_KEY>
Content-Type: application/json

{
  "new_agent_id": "replica-eu",
  "endpoint":     "http://replica-region-b:7475",
  "role":         "Replica"
}

# Response
{
  "replica_agent_id": "replica-eu",
  "blocks_pushed":    1842,
  "endpoint":         "http://replica-region-b:7475"
}

This copies all existing blocks to the replica synchronously, then registers it as a peer for all future pushes. The replica is immediately readable after the call returns.
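For scripted deployments, the same bootstrap call can be built from Python's standard library. This is a minimal sketch: the helper name `build_divide_request` is hypothetical, the hostnames are the placeholders used above, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_divide_request(primary_url, root_key, new_agent_id, replica_endpoint):
    """Build the POST /v1/cluster/divide request shown above."""
    body = {
        "new_agent_id": new_agent_id,
        "endpoint": replica_endpoint,
        "role": "Replica",
    }
    return urllib.request.Request(
        url=f"{primary_url}/v1/cluster/divide",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {root_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_divide_request(
    "http://primary-region-a:7475",
    "<SAPIX_ROOT_KEY>",
    "replica-eu",
    "http://replica-region-b:7475",
)
```

Passing `req` to `urllib.request.urlopen` sends it. Since the call copies existing blocks synchronously, a generous timeout is probably sensible on large datasets.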

Option B — docker-compose for local/staging

shell
cp docker-compose.replication.yml docker-compose.yml
SAPIX_MASTER_SEED=<32-byte-hex> docker compose up -d

# Node 1 on :7475, node 2 on :7476 — both on a shared bridge network

CP vs AP consistency

By default SapixDB runs CP — writes return 503 if any peer is unreachable (partition detected). Switch to AP for cross-region deployments where you can tolerate a brief divergence window during a network partition.

CP (default)
  • Writes return 503 if any peer is unreachable
  • Guarantees all nodes share the same strand tip
  • Best for financial ledgers, audit trails, SOX
AP
  • Writes always succeed on reachable nodes
  • Divergent state reconciled on partition heal via pull-sync
  • Best for high-throughput ingest, IoT, analytics
HTTP
# Switch to AP consistency (tolerate partition, reconcile on heal)
POST /v1/cluster/consistency
Content-Type: application/json

{ "mode": "Ap" }

# Response
{ "consistency": "ap", "previous": "cp" }
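The CP and AP write rules described above reduce to a single check on peer reachability. The function below is only a sketch of that decision under the documented rules; the name `accepts_write` and its arguments are hypothetical, and it says nothing about how reachability is actually detected.

```python
def accepts_write(mode, peers_reachable):
    """Return True if a write is accepted, False if it would get a 503.

    mode: "cp" or "ap"; peers_reachable: one boolean per registered peer.
    """
    if mode == "cp":
        # CP: any unreachable peer counts as a partition, so the write is refused.
        return all(peers_reachable)
    # AP: the local write proceeds; divergence is reconciled later by pull-sync.
    return True


healthy = accepts_write("cp", [True, True])
partitioned_cp = accepts_write("cp", [True, False])
partitioned_ap = accepts_write("ap", [True, False])
```

The trade-off is visible in the middle case: the same partition that blocks a CP write is tolerated in AP at the cost of temporary divergence.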

Codios auth cache

Every POST /v1/records previously called Codios synchronously before appending — one round-trip per write. The auth cache eliminates this latency for repeated operations: the first call to Codios is made normally, the result is cached for 5 seconds, and all subsequent writes during that window skip the network call entirely.

Cache hit (permit): Write proceeds immediately. No Codios call.
Cache hit (deny): Write blocked immediately. No Codios call.
Cache miss: Codios is called; result is cached for the next 5 s (permit) or 2 s (deny).
Codios unreachable + stale permit: Stale permit is served; write proceeds. Logged as auth_degraded.
For multi-region: co-locate SapixDB and Codios. Configure each regional SapixDB node to point to a Codios endpoint in the same AZ (SAPIX_CODIOS_URL). Never cross regions for a synchronous auth call. The cache then absorbs burst throughput, so the write path never pays cross-region auth latency.
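The four cache outcomes listed above can be sketched as a small TTL cache. The names `AuthCache`, `CodiosUnreachable`, and `check_remote` are hypothetical. One behaviour is an assumption: when Codios is unreachable and no stale permit exists, this sketch re-raises the error, which the text above does not specify.

```python
import time

PERMIT_TTL_S = 5.0
DENY_TTL_S = 2.0


class CodiosUnreachable(Exception):
    """Raised by the (hypothetical) remote check when Codios cannot be reached."""


class AuthCache:
    def __init__(self, check_remote, clock=time.monotonic):
        self.check_remote = check_remote  # key -> bool, may raise CodiosUnreachable
        self.clock = clock
        self.entries = {}  # key -> (permitted, expires_at)

    def authorize(self, key):
        now = self.clock()
        entry = self.entries.get(key)
        if entry and now < entry[1]:
            return entry[0]  # fresh hit: permit or deny, no Codios call
        try:
            permitted = self.check_remote(key)
        except CodiosUnreachable:
            if entry and entry[0]:
                return True  # stale permit served (would be logged as auth_degraded)
            raise  # assumption: no stale permit means the error propagates
        ttl = PERMIT_TTL_S if permitted else DENY_TTL_S
        self.entries[key] = (permitted, now + ttl)
        return permitted


# Usage with a fake clock and a fake Codios that always permits.
fake_now = [0.0]
remote_calls = []


def fake_codios(key):
    remote_calls.append(key)
    return True


cache = AuthCache(fake_codios, clock=lambda: fake_now[0])
```

Injecting the clock keeps the sketch testable without sleeping. A production cache would also need eviction and thread safety, both omitted here.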

Circuit breaker (opt-in)

If Codios is continuously unreachable for longer than SAPIX_AUTH_CIRCUIT_OPEN_AFTER_SECS (default: 30 s), the circuit opens. Writes proceed using the last cached auth decisions, and every bypassed call is logged with auth_degraded: true. When Codios responds again, the circuit closes automatically and normal enforcement resumes.

docker-compose.yml — circuit breaker enabled
environment:
  SAPIX_CODIOS_URL:                  "https://codios-api.midlantics.com"
  SAPIX_CODIOS_API_KEY:              "codios_sk_..."
  SAPIX_AUTH_CACHE_TTL_SECS:         "5"
  SAPIX_AUTH_CIRCUIT_BREAKER:        "true"
  SAPIX_AUTH_CIRCUIT_OPEN_AFTER_SECS: "30"
Env var | Default | Description
SAPIX_AUTH_CACHE_TTL_SECS | 5 | Permit cache TTL in seconds. Deny results always use a 2 s TTL.
SAPIX_AUTH_CIRCUIT_BREAKER | false | Set true to enable the circuit breaker.
SAPIX_AUTH_CIRCUIT_OPEN_AFTER_SECS | 30 | Seconds of continuous Codios failures before the circuit opens.
Circuit breaker is opt-in for a reason. When the circuit is open, writes proceed without real-time Codios authorization. This is the right trade-off for high-availability deployments but should not be used where strict fail-closed behavior is required (e.g. financial compliance). Use SAPIX_REQUIRE_CODIOS=true to keep fail-closed semantics instead.
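The open/close rule can be sketched by tracking when the current streak of Codios failures began. The class below is illustrative only: `AuthCircuitBreaker` and its method names are hypothetical, and treating "longer than" as a strict comparison is an assumption.

```python
class AuthCircuitBreaker:
    def __init__(self, open_after_s=30.0):
        self.open_after_s = open_after_s
        self.first_failure_at = None  # start of the current failure streak

    def record_success(self):
        # Any successful Codios response ends the streak and closes the circuit.
        self.first_failure_at = None

    def record_failure(self, now):
        # Only the first failure of a streak sets the start time.
        if self.first_failure_at is None:
            self.first_failure_at = now

    def is_open(self, now):
        # Open once Codios has been failing continuously for longer than the threshold.
        return (
            self.first_failure_at is not None
            and now - self.first_failure_at > self.open_after_s
        )


breaker = AuthCircuitBreaker(open_after_s=30.0)
breaker.record_failure(now=0.0)
breaker.record_failure(now=10.0)
```

While `is_open` returns true, the write path would serve the last cached decisions and log `auth_degraded: true`, as described above.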

Ops runbook

Monitor replica lag

The retry queue handles transient failures automatically. For prolonged outages (hours), check lag and trigger a manual catch-up:

shell
# Check replica lag — run this periodically
curl http://replica-region-b:7475/v1/cluster/status

# If a peer's last_sync_ms is stale (> 5 min), trigger catch-up
curl -X POST http://replica-region-b:7475/v1/mesh/sync/primary \
  -H "Authorization: Bearer $SAPIX_ROOT_KEY"

# The cursor is persisted in sync_cursors.json so catch-up only fetches what's missing

Production monitoring cron

crontab
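# Alert if any node is unreachable or its last_sync_ms is more than 5 minutes stale.
# A crontab entry must fit on one line; cron does not support backslash continuation.
*/5 * * * * curl -sf http://replica:7475/v1/cluster/status | python3 -c "import sys,json,time; s=json.load(sys.stdin); now=time.time()*1000; [print('WARN lag',n['agent_id']) for n in s['nodes'] if not n['reachable'] or now-n['last_sync_ms']>300000]"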

Cluster status response

JSON
{
  "nodes": [
    {
      "agent_id":     "primary",
      "endpoint":     "http://primary-region-a:7475",
      "role":         "Primary",
      "added_at_ms":  1715000000000,
      "last_sync_ms": 1715001234567,
      "reachable":    true
    },
    {
      "agent_id":     "replica-eu",
      "endpoint":     "http://replica-region-b:7475",
      "role":         "Replica",
      "added_at_ms":  1715000050000,
      "last_sync_ms": 1715001230000,
      "reachable":    true
    }
  ],
  "partitioned": false,
  "consistency":  "ap"
}
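A monitoring script can turn this response into a list of lagging nodes. The sketch below assumes only the fields shown above; the function name `lagging_nodes` and the 5-minute threshold constant are illustrative choices, and the sample is a trimmed copy of the response.

```python
import json

STALE_AFTER_MS = 5 * 60 * 1000  # 5 minutes, matching the runbook threshold


def lagging_nodes(status, now_ms):
    """Return agent_ids that are unreachable or have not synced recently."""
    return [
        node["agent_id"]
        for node in status["nodes"]
        if not node["reachable"] or now_ms - node["last_sync_ms"] > STALE_AFTER_MS
    ]


# Trimmed copy of the sample response above.
sample = json.loads("""
{
  "nodes": [
    {"agent_id": "primary", "last_sync_ms": 1715001234567, "reachable": true},
    {"agent_id": "replica-eu", "last_sync_ms": 1715001230000, "reachable": true}
  ],
  "partitioned": false,
  "consistency": "ap"
}
""")

fresh = lagging_nodes(sample, now_ms=1715001235567)
stale = lagging_nodes(sample, now_ms=1715001530001)
```

In production, `now_ms` would come from `time.time() * 1000`. A fixed value is used here so the result is reproducible.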

Summary

Push replication retries automatically with exponential backoff. No blocks are silently dropped on transient peer failures.
The Codios auth cache removes the auth round-trip from your write latency critical path for repeated operations.
Stale-while-revalidate keeps writes flowing during brief Codios blips without operator intervention.
The circuit breaker (opt-in) handles extended Codios outages but bypasses real-time authorization — use only where availability outranks strict enforcement.
Push retry has no second durability layer by design — the WAL on the primary is already durable. If the primary restarts during a partition, pull-sync catches up from the persisted cursor.