Terrakube
Self-hosted Terraform / OpenTofu collaboration platform — UI, API, executor, registry; OIDC via Keycloak.
Quick facts
What it is
The DR images azbuilder/api-server:2.30.1 and azbuilder/executor:2.30.1 were hand-imported via ctr because Docker Hub pulls on DR are unreliable and slow.
Architecture
Four components per cluster:
- terrakube-ui: React SPA (port 80 ingress as terrakube-ui.apps...)
- terrakube-api: Spring Boot backend (terrakube-api.apps...)
- terrakube-executor: runs Terraform/OpenTofu plans (no edge ingress; called internally by api)
- terrakube-registry: module registry (terrakube-reg.apps...)
Each cluster has its own Postgres + Redis StatefulSets (chart prereqs). Remote state shared via MinIO bucket terrakube-state (chmod-protected by per-app credentials at ~/cloud-init/minio-terrakube-secret-key).
OIDC: public client terrakube in the Keycloak comptech realm, with PKCE and a groups claim mapper (so a user's Keycloak group membership becomes their Terrakube role).
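To check by hand which groups a user's token actually carries, a minimal sketch like the one below can decode the token payload. It assumes the mapper writes to a claim named `groups` (the configured claim name is not stated here), and `groups_from_token` is a hypothetical helper, not part of Terrakube or Keycloak. It does not verify the signature, so it is for debugging only.

```python
import base64
import json


def groups_from_token(token: str, claim: str = "groups") -> list[str]:
    """Return the group names carried in a JWT payload.

    Assumption: the Keycloak mapper writes groups to a claim named `claim`.
    The signature is NOT verified; use this for inspection only.
    """
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a three-part JWT (header.payload.signature)")
    payload_b64 = parts[1]
    # JWTs strip base64url padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    groups = payload.get(claim, [])
    # With "full group path" enabled, Keycloak emits "/TERRAKUBE_ADMIN";
    # drop the leading slash so names compare cleanly.
    return [g.lstrip("/") for g in groups]
```

If the returned list is empty for a user who should have access, the group membership or the mapper configuration in Keycloak is the first thing to check.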
Configuration
Source: shared/helm-values/terrakube.yaml + clusters/<cluster>/values/terrakube.yaml. Two ArgoCD Applications: terrakube-prereqs (Postgres + Redis) at sync-wave N-1, then terrakube (the four components) at wave N.
DR-specific issue: azbuilder/api-server:2.30.1 and azbuilder/executor:2.30.1 Docker Hub pulls on DR take >15 min and often fail. Workaround: ctr import the image tarballs onto each DR node manually after a podman pull on the dl385 host. Long-term fix: Nexus docker-proxy (planned).
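The manual workaround can be sketched as a small command generator. This is an illustration under assumptions, not the procedure actually used: the node name, the `/tmp` staging path, the `scp`/`ssh` transfer, and the `k8s.io` containerd namespace are all hypothetical, and on the DR nodes `ctr` may need an explicit binary path or socket address. The script only prints commands; it runs nothing.

```python
import shlex

# Images named in this page; the docker.io/ prefix is an assumption.
IMAGES = [
    "docker.io/azbuilder/api-server:2.30.1",
    "docker.io/azbuilder/executor:2.30.1",
]


def tarball_name(image: str) -> str:
    """Turn an image reference into a flat tarball filename."""
    name = image.split("/", 1)[1] if "/" in image else image
    return name.replace("/", "_").replace(":", "_") + ".tar"


def import_commands(image: str, nodes: list[str]) -> list[str]:
    """Build the pull/save/copy/import command sequence for one image."""
    tar = shlex.quote(tarball_name(image))
    img = shlex.quote(image)
    cmds = [
        f"podman pull {img}",          # pull once on the build host
        f"podman save -o {tar} {img}",  # export to a tar archive
    ]
    for node in nodes:
        host = shlex.quote(node)
        cmds.append(f"scp {tar} {host}:/tmp/{tar}")
        # k8s.io is the containerd namespace the kubelet reads images from.
        cmds.append(f"ssh {host} sudo ctr -n k8s.io images import /tmp/{tar}")
    return cmds


if __name__ == "__main__":
    for image in IMAGES:
        for cmd in import_commands(image, ["dr-node-1"]):  # hypothetical node
            print(cmd)
```

Printing the commands rather than executing them keeps the step reviewable, which suits a workaround that is meant to be retired once the Nexus docker-proxy exists.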
Memorable settings: nginx.ingress.kubernetes.io/ssl-redirect=false (MR #32 — registry needs to serve HTTP for Terraform's CLI), registry resource limit bumped to 1Gi (MR #31 — OOM at 512Mi).
Operations
- UI: https://terrakube-ui.apps.sub.comptech-lab.com (login via Keycloak).
- Onboard a new user: in Keycloak admin UI, add the user to a group (e.g. TERRAKUBE_ADMIN); first login auto-provisions on Terrakube.
- Workspace state: stored in MinIO under terrakube-state/<org>/<workspace>/...
- Executor logs: kubectl -n terrakube logs deploy/terrakube-executor
- Re-run failed plans: from the UI's job history, click "Re-run"; this pulls the same Terraform module + variables.
Failover
Three edge HAProxy backends, terrakube-{ui,api,reg}-rke2-be, each with DC primary + DR backup. Healthcheck: GET /actuator/health for api/registry, GET / for ui. A response counts as healthy when its status code matches ^(200|301|302|307|308)$.
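As a quick way to reason about which responses keep a backend in rotation, the status pattern can be exercised directly. This is a sketch of the matching rule only, assuming the pattern is applied to the HTTP status code of the probe response; `is_healthy` and `PROBE_PATHS` are hypothetical names, and the real check lives in the HAProxy configuration.

```python
import re

# Status pattern quoted in the health check description.
HEALTHY_STATUS = re.compile(r"^(200|301|302|307|308)$")

# Probe path per backend: ui is probed at /, api and registry at the
# Spring Boot actuator endpoint.
PROBE_PATHS = {
    "ui": "/",
    "api": "/actuator/health",
    "reg": "/actuator/health",
}


def is_healthy(status_code: int) -> bool:
    """True when a probe response status would count as healthy."""
    return HEALTHY_STATUS.match(str(status_code)) is not None
```

Redirects count as healthy under this pattern, while any 4xx or 5xx (for example a 503 during api cold start) does not.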
Smoke test PASSED 2026-05-05: cutover ~16 s; cutback ~65 s — Spring Boot api cold-start dominates the cutback window. 200s served continuously from DR backup throughout.
Caveat: workspace state is in MinIO so it's shared across clusters cleanly. Job history is per-cluster Postgres so DR has no record of DC's plans until manually exported.
References
- GitOps: clusters/dc/manifests/terrakube/
- MR #28-#32 (rollout, ssl-redirect, registry mem)
- Keycloak — IdP
- MinIO — remote state
- Terrakube docs · terrakube source