Overview

Redis is the platform's KV store. Each cluster runs an independent three-node Redis deployment (one primary, two replicas) fronted by three Sentinels for in-cluster failover. DC and DR are not replicated directly at the Redis layer; instead, Kafka acts as a write-ahead log (WAL) and the redis-applier replays that WAL into the local Redis primary on each side. The result is two eventually-consistent copies of the same dataset, decoupled at the storage layer but coupled at the streaming layer.

This pattern (ADR-0018) was chosen because Redis-native replication across heterogeneous regions is fragile (split-brain, cascading auth, network blips), but Kafka MM2 between two RKE2 clusters is well-understood and already needed for other replication paths.

Operator: Spotahome redis-operator 3.2.13 reconciling a RedisFailover CR
Topology per cluster: 3 Redis pods (1 primary + 2 replicas) + 3 Sentinel pods
Persistence: Longhorn-backed PVCs (one per Redis pod)
Failover (within cluster): Sentinel quorum-driven primary election
Failover (DC ⟷ DR): Independent instances; no Redis-native replication. Convergence via the Kafka WAL (ADR-0018) replayed by redis-applier on each side
Cross-cluster carrier: Kafka topic redis-writes, mirrored DC→DR by MirrorMaker 2
Validation: End-to-end DR test PASSED 2026-05-05 (a write to DC Redis appeared on DR Redis after the MM2 + applier hop)

The Redis layer (per cluster)

Topology

Within each cluster, the Redis layout is the standard Spotahome RedisFailover arrangement: one primary, two replicas, and three Sentinels watching them, plus the local redis-applier consumer (covered in the next section).

┌── one cluster (DC or DR) ────────────────────────────────┐
│                                                          │
│   ┌──────────┐   ┌──────────┐   ┌──────────┐             │
│   │ sentinel │   │ sentinel │   │ sentinel │             │
│   │   pod    │   │   pod    │   │   pod    │   quorum=2  │
│   └─────┬────┘   └─────┬────┘   └─────┬────┘             │
│         └──────────────┼──────────────┘                  │
│                        │ monitors                        │
│         ┌──────────────┼──────────────┐                  │
│   ┌─────▼────┐   ┌─────▼────┐   ┌─────▼────┐             │
│   │  redis   │   │  redis   │   │  redis   │             │
│   │ primary  │   │ replica  │   │ replica  │  PVC each   │
│   └──────────┘   └──────────┘   └──────────┘             │
│                                                          │
│   ┌────────────────┐                                     │
│   │ redis-applier  │  ← consumes redis-writes from local │
│   │   (3 pods)     │     Kafka and writes to primary     │
│   └────────────────┘                                     │
│                                                          │
└──────────────────────────────────────────────────────────┘

Setup & configuration

Deployed via Argo CD. Each cluster's app definition under clusters/<dc|dr>/apps/redis/ renders a RedisFailover CR that the Spotahome operator reconciles.
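
For orientation, a minimal RedisFailover CR for this topology might look like the sketch below. The metadata, storage size, and storageClassName are illustrative assumptions, not the exact values rendered from clusters/<dc|dr>/apps/redis/.

apiVersion: databases.spotahome.com/v1
kind: RedisFailover
metadata:
  name: redis                      # illustrative; a name of "redis" yields the rfs-redis Sentinel service
  namespace: redis
spec:
  sentinel:
    replicas: 3                    # quorum = 2 of 3
  redis:
    replicas: 3                    # 1 primary + 2 replicas after election
    storage:
      persistentVolumeClaim:
        metadata:
          name: redis-data         # illustrative PVC template name
        spec:
          accessModes:
            - ReadWriteOnce
          storageClassName: longhorn
          resources:
            requests:
              storage: 5Gi         # illustrative size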

In-cluster failover

Sentinel quorum (2 of 3) elects a new primary in seconds. Sentinel-aware clients receive a +switch-master notification, re-resolve, and continue. The redis-applier reconnects via Sentinel and resumes consuming from its committed offset, so no Kafka messages are lost during a primary swap.

Day-to-day operations

Cross-cluster: the Kafka WAL pattern

Flow

Across clusters, the Kafka WAL stitches the two independent Redis instances into one eventually-consistent dataset. App writes go into Kafka rather than directly into Redis; on each side, the local redis-applier consumes the same topic and materialises it into the local Redis primary.

      ┌──── DC cluster ────┐                ┌──── DR cluster ────┐
      │                    │                │                    │
      │  app  ─────write──▶│  Kafka         │                    │
      │                    │  redis-writes  │                    │
      │                    │  topic         │                    │
      │                    │     │          │                    │
      │  redis-applier ◀───┘     │          │                    │
      │     │                    │          │                    │
      │     ▼ apply              │          │                    │
      │  Redis primary           │          │                    │
      │     │                    │          │                    │
      │  Sentinels               │          │                    │
      │                          ▼          │                    │
      │                       MM2 ─────────▶│ Kafka              │
      │                                     │ redis-writes topic │
      │                                     │      │             │
      │                                     │      ▼             │
      │                                     │ redis-applier      │
      │                                     │      │             │
      │                                     │      ▼ apply       │
      │                                     │ Redis primary      │
      │                                     │  Sentinels         │
      └─────────────────────────────────────┴────────────────────┘

Two properties make this work: messages are keyed by the Redis key they mutate, so Kafka and MM2 preserve per-key ordering end to end, and applying an op is deterministic, so both appliers replaying the same ordered stream converge on the same Redis state.

How redis-applier consumes Kafka

The redis-applier is a small Go consumer (16.9 MB distroless image, v0.1.1) running as 3 replicas per cluster. Its contract: consume redis-writes from the local Kafka, apply each op to the local Redis primary (discovered via Sentinel), and commit the Kafka offset only after the op has been applied, so a restart or primary swap resumes from the last committed offset without dropping writes.

On DR, MirrorMaker 2 mirrors redis-writes from DC Kafka into DR Kafka with the original partition keys preserved. The DR redis-applier sees the same messages in the same per-key order and produces the same Redis state, modulo replication lag. The end-to-end path was validated on 2026-05-05 (11:21 UTC).
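
The applier itself is Go, but its consume-apply-commit loop is small enough to sketch. The Java sketch below is illustrative only: the bootstrap address, consumer group name, and the SET-only handling of the op payload are assumptions, since the real op schema isn't defined on this page. The point it demonstrates is the contract above: offsets are committed only after the batch has been applied to the local primary.

import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.api.sync.RedisCommands;

public class RedisApplierSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka.kafka.svc.cluster.local:9092"); // assumption: local Kafka bootstrap
        props.put("group.id", "redis-applier");                               // assumption: consumer group name
        props.put("enable.auto.commit", "false");                             // commit manually, only after a successful apply
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // Discover the current primary through Sentinel, as the real applier does.
        RedisURI uri = RedisURI.Builder
                .sentinel("rfs-redis.redis.svc.cluster.local", 26379, "mymaster")
                .build();
        RedisCommands<String, String> redis = RedisClient.create(uri).connect().sync();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("redis-writes"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Assumed message shape: record key = Redis key, value = payload to SET.
                    // The real applier parses a structured op message (SET/DEL/...).
                    redis.set(record.key(), record.value());
                }
                // Commit only after every record in the batch is applied, so a crash or
                // primary swap replays at-least-once instead of losing writes.
                consumer.commitSync();
            }
        }
    }
}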

DC ⟷ DR cutover semantics

Today there is no edge HAProxy frontend for Redis itself, so DC/DR cutover for Redis is implicit: the application's Kafka client (which is failover-aware via the kafka-rke2-be backup backend) starts producing into DR Kafka when DC drops. The DR redis-applier — which has been steadily replaying the mirrored topic the whole time — keeps writing into DR Redis. The application's read path needs to switch to DR Redis at the same time; this is currently the application's responsibility.

Lag during a cutover is the sum of the MM2 mirroring lag (DC Kafka into DR Kafka) and the DR redis-applier's consumer lag on the mirrored redis-writes topic.

Client guidance

There are two distinct client roles. Pick the right one for what you're building.

1. Application writes (durable, replicated)

Don't write to Redis directly from the application. Instead, produce a structured Redis op message into Kafka topic redis-writes. The redis-applier on the local side will replay it into Redis; MM2 + the DR applier will replay it on DR.

Recommended client setup: use the application's existing failover-aware Kafka producer (the kafka-rke2-be backup-backend path described above) and key each message by the Redis key it mutates, so per-key ordering is preserved through MM2 and both appliers.
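
A minimal producer sketch follows; the bootstrap address, serializer choices, and the JSON op payload are illustrative assumptions, since the real op schema isn't defined on this page.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class RedisWriteProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka.kafka.svc.cluster.local:9092"); // assumption: use the failover-aware endpoint in practice
        props.put("acks", "all");                   // the Kafka write is the durable source of truth
        props.put("enable.idempotence", "true");    // avoid duplicated ops on producer retry
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key by the Redis key being mutated so all ops for that key land in one
            // partition and keep their order through MM2 and the appliers.
            String redisKey = "orders:42";
            String op = "{\"op\":\"SET\",\"key\":\"orders:42\",\"value\":\"...\"}"; // assumed op schema
            producer.send(new ProducerRecord<>("redis-writes", redisKey, op));
            producer.flush();
        }
    }
}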

2. Application reads (low-latency, local)

Reads go directly to the local Redis primary, via Sentinel discovery. Use a Sentinel-aware client; do not hard-code a Redis pod address.

Java / JBoss example (Lettuce, Sentinel)

import java.time.Duration;

import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.api.StatefulRedisConnection;
import io.lettuce.core.api.sync.RedisCommands;

RedisURI uri = RedisURI.Builder
    .sentinel("rfs-redis.redis.svc.cluster.local", 26379, "mymaster")
    .withTimeout(Duration.ofSeconds(2))
    .build();

RedisClient client = RedisClient.create(uri);
StatefulRedisConnection<String, String> conn = client.connect();
RedisCommands<String, String> redis = conn.sync();

// Reads only — writes go through Kafka redis-writes
String v = redis.get("orders:42");

Off-cluster clients

Redis is not currently exposed via the edge HAProxy. Off-cluster reads aren't supported today — if you need them, the right path is a TCP backend on HAProxy with DC-primary/DR-backup and a Sentinel-aware client pointing at the public hostname. Until that's wired, off-cluster traffic should go via an in-cluster service (HTTP API, gateway, etc.) that fronts Redis.

Evaluation

Strengths

Weaknesses & known limitations

How it should be improved

References