
README
Temporal workflow workers for automated infrastructure operations. Each domain (backup, vulnerability scanning, node cleanup) runs as an independent worker with its own container image, task queue, and Nomad service job. The cleanup worker hosts two workflows — orphaned-data cleanup and Docker registry garbage collection — on a shared task queue. Workflows fire on cron from Temporal Schedules, managed as code in the infrastructure/terragrunt repo.
Each worker is a long-running Temporal worker process that polls its dedicated task queue. Workflows are pure orchestration; all I/O happens in activities. Activities are registered as struct methods, sharing pooled connections (DB, S3) across invocations.
Table of Contents
- Workflow Domains
- Registry GC Saga
- Shared Infrastructure
- Scheduling
- Retry and Error Handling
- Configuration
- Observability
- Development
- Deployment
- Project Structure
Workflow Domains
Backup
Snapshots Nomad Raft state, Consul Raft state (includes Vault data), and PostgreSQL. The three legs run concurrently. The PostgreSQL leg dumps cluster-wide globals (roles, tablespaces, grants) once, enumerates the databases, then dumps each one to its own file with bounded concurrency (PG_DUMP_CONCURRENCY, default 4). Each artifact is stored locally on an NFS mount and uploaded to S3 for off-site redundancy. Old backups are cleaned up based on configurable retention (default: 7 days local, 30 days S3).
Task queue: backup-task-queue
Schedule: backup-daily (Temporal Schedule)
Image: backup-worker
Dependencies: Nomad CLI, Consul CLI, pg_dump/pg_dumpall, S3-compatible storage
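The bounded per-database fan-out described above can be sketched in plain Go. This is an illustration of the pattern only, not the worker's code: `dumpAll`, its `dump` callback, and `errDumpFailed` are hypothetical names, and it uses ordinary goroutines, whereas code running inside a Temporal workflow would need the SDK's deterministic concurrency primitives instead.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errDumpFailed is a hypothetical stand-in for a non-zero pg_dump exit.
var errDumpFailed = errors.New("pg_dump failed")

// dumpAll runs dump for every database with at most limit dumps in flight.
// Every database is attempted; failures are collected and joined at the end.
func dumpAll(dbs []string, limit int, dump func(db string) error) error {
	if limit < 1 {
		limit = 1
	}
	sem := make(chan struct{}, limit) // counting semaphore
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		errs []error
	)
	for _, db := range dbs {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit dumps are already running
		go func(db string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			if err := dump(db); err != nil {
				mu.Lock()
				errs = append(errs, fmt.Errorf("dump %s: %w", db, err))
				mu.Unlock()
			}
		}(db)
	}
	wg.Wait()
	return errors.Join(errs...) // nil when no dump failed
}
```

Collecting errors instead of returning on the first one keeps a single bad database from blocking backups of the others, while the joined error still fails the leg.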
Trivy Scan
Discovers all running Docker images from the Nomad API, scans them through the Trivy server with bounded concurrency, and stores CVE results in PostgreSQL. Transient errors (server down) are retried by Temporal; permanent errors (image not found) are recorded and skipped.
Task queue: trivy-task-queue
Schedule: trivy-daily (Temporal Schedule)
Image: trivy-scan-worker
Dependencies: Trivy CLI, Nomad API, PostgreSQL
Node Cleanup
SSHes to each Nomad client node, identifies job data directories that no longer correspond to running allocations, and removes those older than the grace period. Optionally prunes unused Docker images. Supports dry-run mode for safe previewing.
Task queue: cleanup-task-queue
Schedule: cleanup-daily (Temporal Schedule)
Image: cleanup-worker
Dependencies: SSH access to all Nomad client nodes, Nomad API
Registry Garbage Collection
Reclaims disk space from the Docker registry by running the registry’s garbage-collect against its bind-mounted storage. Because the registry tool requires the registry to be offline, the workflow scales the registry Nomad job to 0, waits for its allocations to drain, runs GC over SSH, then scales it back to 1. It reports blobs deleted and bytes reclaimed (before/after sizes). Runs on the cleanup worker, sharing its SSH and Nomad-client infrastructure. See Registry GC Saga for the compensation guarantees.
Task queue: cleanup-task-queue (shared with node cleanup)
Workflow: RegistryGC
Image: cleanup-worker
Dependencies: SSH access to the registry host, Nomad API (job scaling), Docker on the registry host
Registry GC Saga
The registry GC workflow is structured as a saga so the registry is never left stranded offline. The sequence decomposes into per-step activities, each with its own retry policy:
| Step | Activity | Retry |
|---|---|---|
| Locate registry host | FindRegistryNode | 3 attempts, exponential backoff |
| Measure storage (before) | MeasureRegistryDataDir | 3 attempts |
| Scale registry to 0 | ScaleRegistry | 3 attempts (idempotent) |
| Wait for allocs to drain | WaitRegistryAllocsDrained | bounded by timeout, heartbeats each poll |
| Run garbage-collect | RunRegistryGarbageCollect | 1 attempt (no retry on partial GC) |
| Measure storage (after) | MeasureRegistryDataDir | 3 attempts |
Once the scale-down to 0 succeeds, a compensation is registered with defer and workflow.NewDisconnectedContext. It scales the registry back to 1 and waits for a running allocation — and it always fires, even if GC fails, an activity times out, or the workflow is cancelled mid-flight. Scaling is idempotent, so re-issuing count=1 is a safe no-op on the happy path. If the scale-back itself fails, the workflow logs a CRITICAL recovery message and joins the error so it surfaces to the operator.
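A minimal sketch of the compensation shape, in plain Go without the Temporal SDK. `registryOps`, `runGC`, and `errSimulated` are hypothetical names; in the real workflow the deferred step would run its activities on a context from workflow.NewDisconnectedContext so that it still executes after cancellation.

```go
package main

import (
	"errors"
	"fmt"
)

// errSimulated is a stand-in failure for exercising the error paths.
var errSimulated = errors.New("simulated failure")

// registryOps is a hypothetical stand-in for the scaling and GC activities.
type registryOps struct {
	Scale          func(count int) error
	GarbageCollect func() error
}

// runGC scales the registry to 0, runs GC, and always scales back to 1.
func runGC(ops registryOps) (err error) {
	if err = ops.Scale(0); err != nil {
		return fmt.Errorf("scale down: %w", err)
	}
	// Registered only after the scale-down succeeded; runs on every exit path.
	defer func() {
		if cerr := ops.Scale(1); cerr != nil {
			err = errors.Join(err, fmt.Errorf("CRITICAL: registry left at 0: %w", cerr))
		}
	}()
	if err = ops.GarbageCollect(); err != nil {
		return fmt.Errorf("garbage-collect: %w", err)
	}
	return nil
}
```

Registering the compensation after the scale-down, rather than at the top of the function, means a failed scale-down does not trigger a scale-up of a registry that was never taken offline.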
| Config | Default | Description |
|---|---|---|
| JobName | registry | Nomad job for the registry |
| GroupName | (= JobName) | Task group to scale |
| RegistryDataDir | /mnt/gdrive/munchbox-data/registry | Host path bind-mounted as /var/lib/registry |
| RegistryImage | registry:3 | Image used for the one-shot GC run |
| DryRun | true (overridden by schedule input) | Report blobs that would be deleted without freeing space |
| DeleteUntagged | true | Also remove manifests not referenced by any tag |
Shared Infrastructure
The shared/ package provides common functionality used by all workers:
| File | Purpose |
|---|---|
| telemetry.go | OpenTelemetry tracer initialization with OTLP gRPC export to Tempo |
| logging.go | JSON slog logger wrapped for Temporal SDK compatibility |
| metrics.go | Prometheus metrics handler for Temporal SDK metrics (Tally bridge) |
| nomad.go | OTel-instrumented Nomad API client factory |
All workers use StartClientSpan with PeerServiceAttr to produce service graph edges in Tempo/Grafana for every external call (Nomad, Consul, PostgreSQL, S3, Trivy server).
Scheduling
Workflows fire on cron from Temporal Schedules, defined as code in infrastructure/terragrunt (the temporal-config module, applied via the global/temporal-config leaf). Each schedule starts one workflow on its task queue with a JSON input that deserializes into the workflow’s config struct. The workers themselves just poll their queues — nothing in this repo triggers them.
| Schedule | Workflow | Task Queue | Cron | Input |
|---|---|---|---|---|
| backup-daily | Backup | backup-task-queue | 0 1 * * * | BackupConfig (local/S3 days, dump concurrency) |
| trivy-daily | Scan | trivy-task-queue | 0 3 * * * | ScanConfig (scan concurrency) |
| cleanup-daily | Cleanup | cleanup-task-queue | 0 5 * * * | CleanupConfig (data dir, grace days, dry-run, docker prune) |
| registry-gc-weekly | RegistryGC | cleanup-task-queue | 0 2 * * 0 | RegistryGCConfig (job/dir/image, dry-run, delete-untagged) |
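As an illustration of how a schedule's JSON input might map onto a config struct, the sketch below decodes a RegistryGCConfig and applies the documented defaults. The field names follow the config table under Registry GC Saga; the JSON keys, the `decodeRegistryGCConfig` helper, and the way defaults are applied are assumptions, not the repo's actual code.

```go
package main

import "encoding/json"

// RegistryGCConfig mirrors the documented fields; the JSON keys are assumed.
type RegistryGCConfig struct {
	JobName         string `json:"job_name"`
	GroupName       string `json:"group_name"`
	RegistryDataDir string `json:"registry_data_dir"`
	RegistryImage   string `json:"registry_image"`
	DryRun          bool   `json:"dry_run"`
	DeleteUntagged  bool   `json:"delete_untagged"`
}

// decodeRegistryGCConfig starts from the documented defaults and lets the
// schedule input override any field it sets explicitly.
func decodeRegistryGCConfig(input []byte) (RegistryGCConfig, error) {
	cfg := RegistryGCConfig{
		JobName:         "registry",
		RegistryDataDir: "/mnt/gdrive/munchbox-data/registry",
		RegistryImage:   "registry:3",
		DryRun:          true,
		DeleteUntagged:  true,
	}
	if len(input) > 0 {
		// Fields absent from the JSON keep their default values.
		if err := json.Unmarshal(input, &cfg); err != nil {
			return RegistryGCConfig{}, err
		}
	}
	if cfg.GroupName == "" {
		cfg.GroupName = cfg.JobName // GroupName defaults to JobName
	}
	return cfg, nil
}
```

Under these assumptions, an input of `{"dry_run": false}` turns off dry-run and leaves every other field at its default.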
Retry and Error Handling
Activity Timeouts
Each activity has both a StartToCloseTimeout (max time for a single attempt) and a ScheduleToCloseTimeout (max total time including all retries). Quick operations (Nomad/Consul snapshots, globals dump, database listing, S3 uploads, cleanup) use a 5-minute StartToCloseTimeout and a 15-minute ScheduleToCloseTimeout. Long operations (per-database pg_dump, image scanning) use 30 and 60 minutes respectively; per-database dumps additionally heartbeat, with a 2-minute heartbeat timeout.
Retry Policy
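Unless a step overrides it (the registry GC steps listed under Registry GC Saga set their own attempt counts), activities share a common retry policy with exponential backoff: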
| Parameter | Value |
|---|---|
| Initial interval | 1 second |
| Backoff coefficient | 2.0 |
| Maximum interval | 1 minute |
| Maximum attempts | 3 |
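To make the arithmetic concrete, the sketch below computes the wait before each attempt from the four parameters in the table. The `retryPolicy` type and `delayBefore` method are illustrative names, not SDK types; the formula (initial interval times coefficient to the power of prior retries, capped at the maximum interval) is the standard exponential backoff rule.

```go
package main

import (
	"math"
	"time"
)

// retryPolicy holds the four parameters from the table above.
type retryPolicy struct {
	InitialInterval    time.Duration
	BackoffCoefficient float64
	MaximumInterval    time.Duration
	MaximumAttempts    int
}

var defaultPolicy = retryPolicy{
	InitialInterval:    time.Second,
	BackoffCoefficient: 2.0,
	MaximumInterval:    time.Minute,
	MaximumAttempts:    3,
}

// delayBefore returns how long to wait before the given attempt (1-based).
// The first attempt has no delay; ok is false once attempts are exhausted.
func (p retryPolicy) delayBefore(attempt int) (delay time.Duration, ok bool) {
	if attempt < 1 || attempt > p.MaximumAttempts {
		return 0, false
	}
	if attempt == 1 {
		return 0, true
	}
	// The wait before attempt n is initial * coefficient^(n-2), capped.
	d := float64(p.InitialInterval) * math.Pow(p.BackoffCoefficient, float64(attempt-2))
	if d > float64(p.MaximumInterval) {
		d = float64(p.MaximumInterval)
	}
	return time.Duration(d), true
}
```

With the documented values, attempts 2 and 3 wait 1 second and 2 seconds; the 1-minute cap would only come into play with a higher attempt limit.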
Error Classification (Trivy Scan)
Trivy scan activities distinguish between transient and permanent failures:
| Error Type | Examples | Behavior |
|---|---|---|
| Transient | Connection refused, timeout, connection reset | Returns error; Temporal retries automatically |
| Permanent | Image not found, manifest unknown | Returns NonRetryableApplicationError; Temporal stops immediately |
| Parse failure | Invalid trivy JSON output | Returns NonRetryableApplicationError |
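One way such a classification could look is sketched below. The substring markers are taken from the examples in the table; matching on Trivy's error text, the `classifyScanFailure` name, and treating every unrecognized failure as transient are all assumptions. In the worker, a permanent result would then be returned as a non-retryable application error so Temporal stops retrying.

```go
package main

import "strings"

type failureClass int

const (
	transient failureClass = iota // return the error as-is; Temporal retries
	permanent                     // return a non-retryable error; Temporal stops
)

// permanentMarkers are assumed substrings, based on the table's examples.
var permanentMarkers = []string{"image not found", "manifest unknown"}

// classifyScanFailure inspects the scanner's error text. Anything that does
// not match a known permanent marker is treated as transient (an assumption).
func classifyScanFailure(stderr string) failureClass {
	msg := strings.ToLower(stderr)
	for _, marker := range permanentMarkers {
		if strings.Contains(msg, marker) {
			return permanent
		}
	}
	return transient
}
```

Defaulting to transient errs on the side of retrying, which is bounded by the retry policy anyway; defaulting to permanent would silently skip images on any unfamiliar error.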
Backup Failure Behavior
| Step | On Failure |
|---|---|
| Nomad/Consul snapshot | Leg fails; workflow terminates with error |
| PostgreSQL globals dump or database listing | Leg fails; workflow terminates with error |
| Per-database dump | Leg fails after all databases are attempted; workflow terminates with error |
| S3 upload | Warning logged, workflow continues |
| S3 quota exceeded | Oldest backup evicted, upload retried (up to 3 evictions) |
| Local/S3 cleanup | Warning logged, workflow continues |
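The quota row above can be read as a small evict-and-retry loop. The following is a sketch under that reading: `uploadWithEviction`, `errQuotaExceeded`, and the two callbacks are hypothetical names, and how the real worker detects a quota error is not specified here.

```go
package main

import "errors"

// errQuotaExceeded is a hypothetical sentinel for an S3 quota rejection.
var errQuotaExceeded = errors.New("s3: quota exceeded")

// maxEvictions matches the documented limit of 3 evictions.
const maxEvictions = 3

// uploadWithEviction retries upload after evicting the oldest backup each
// time the store reports a quota error, giving up after maxEvictions.
func uploadWithEviction(upload, evictOldest func() error) error {
	for evictions := 0; ; evictions++ {
		err := upload()
		if !errors.Is(err, errQuotaExceeded) {
			return err // success (nil) or a non-quota failure
		}
		if evictions == maxEvictions {
			return err // still over quota after the allowed evictions
		}
		if evictErr := evictOldest(); evictErr != nil {
			return errors.Join(err, evictErr)
		}
	}
}
```

Three evictions allow up to four upload attempts in total. The cap keeps a persistently full bucket from consuming the whole off-site backup history.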
Cleanup Safety Features
- Dry-run mode (default: enabled) reports what would be deleted without removing anything
- Grace period (default: 7 days) prevents deletion of recently-used directories
- System directories (alloc, plugins, tmp, server, client) are always excluded
- Node failures are tracked; the workflow reports which nodes failed
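The selection rules above can be expressed as a pure function, which is also what makes dry-run cheap: the same candidate list is either reported or deleted. The sketch below is illustrative; `dataDir`, `orphaned`, matching directories to jobs by name, and measuring age in whole days are assumptions.

```go
package main

// dataDir is a hypothetical view of one directory under the job data root.
type dataDir struct {
	Name    string
	AgeDays int // days since the directory was last used
}

// systemDirs are never candidates for removal.
var systemDirs = map[string]bool{
	"alloc": true, "plugins": true, "tmp": true, "server": true, "client": true,
}

// orphaned returns directories that are not system directories, do not belong
// to a running job, and are older than the grace period.
func orphaned(dirs []dataDir, running map[string]bool, graceDays int) []string {
	var out []string
	for _, d := range dirs {
		switch {
		case systemDirs[d.Name]: // always excluded
		case running[d.Name]: // still backed by a running allocation
		case d.AgeDays <= graceDays: // inside the grace period
		default:
			out = append(out, d.Name)
		}
	}
	return out
}
```

In dry-run mode the caller would log this list and stop; otherwise it would remove each entry and record per-node failures.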
Configuration
Worker configuration is via environment variables, injected by Nomad job templates from Vault. Per-workflow settings arrive as schedule input (see Scheduling).
Common (all workers)
| Variable | Default | Description |
|---|---|---|
| TEMPORAL_ADDRESS | localhost:7233 | Temporal server endpoint |
| OTEL_EXPORTER_OTLP_ENDPOINT | tempo.service.consul:4317 | OTLP gRPC endpoint |
| METRICS_LISTEN | :9090 | Prometheus metrics listen address |
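A minimal sketch of reading the common variables with their documented defaults. The `commonConfig` struct and both function names are hypothetical, and treating an empty value as unset is an assumption; the lookup function is injected so the defaults can be checked without touching the process environment.

```go
package main

import "os"

// commonConfig holds the settings shared by all workers.
type commonConfig struct {
	TemporalAddress string
	OTLPEndpoint    string
	MetricsListen   string
}

// loadCommonConfig resolves each variable through lookup, falling back to the
// documented default when the variable is unset or empty.
func loadCommonConfig(lookup func(string) (string, bool)) commonConfig {
	get := func(key, def string) string {
		if v, ok := lookup(key); ok && v != "" {
			return v
		}
		return def
	}
	return commonConfig{
		TemporalAddress: get("TEMPORAL_ADDRESS", "localhost:7233"),
		OTLPEndpoint:    get("OTEL_EXPORTER_OTLP_ENDPOINT", "tempo.service.consul:4317"),
		MetricsListen:   get("METRICS_LISTEN", ":9090"),
	}
}

// commonConfigFromEnv reads the real process environment.
func commonConfigFromEnv() commonConfig {
	return loadCommonConfig(os.LookupEnv)
}
```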
Backup Worker
| Variable | Default | Description |
|---|---|---|
| S3_ENDPOINT | – | S3-compatible endpoint URL |
| S3_BUCKET | – | Target bucket name |
| S3_ACCESS_KEY | – | S3 access key ID |
| S3_SECRET_KEY | – | S3 secret access key |
| NOMAD_TOKEN | – | Nomad API token (snapshot permissions) |
| CONSUL_HTTP_TOKEN | – | Consul API token (snapshot permissions) |
| PG_HOST | postgres-primary.service.consul | PostgreSQL host for dumps |
| PG_USER | postgres | PostgreSQL user for dumps |
| PGPASSWORD | – | PostgreSQL password for pg_dump/pg_dumpall |
Trivy Scan Worker
| Variable | Default | Description |
|---|---|---|
| TRIVY_SERVER_ADDR | http://trivy-server.service.consul:4954 | Trivy server endpoint |
| TRIVY_DB_HOST | postgres-shared.service.consul | PostgreSQL host for scan results |
| TRIVY_DB_PORT | 5432 | PostgreSQL port |
| TRIVY_DB_USER | – | PostgreSQL username |
| TRIVY_DB_PASSWORD | – | PostgreSQL password |
| TRIVY_DB_NAME | trivy | Database name |
| DB_SSLMODE | verify-ca | PostgreSQL SSL mode |
| DB_SSLROOTCERT | – | Path to CA certificate |
| NOMAD_TOKEN | – | Nomad API token (read allocations) |
Cleanup Worker
| Variable | Default | Description |
|---|---|---|
| SSH_KEY_PATH | /root/.ssh/id_ed25519 | SSH private key path |
| SSH_CERT_PATH | /root/.ssh/id_ed25519-cert.pub | SSH client certificate path |
| SSH_HOST_CA_PATH | /root/.ssh/ssh-host-ca.pub | SSH host CA public key path |
| NOMAD_TOKEN | – | Nomad API token (read nodes/allocations) |
Registry GC
Registry GC runs on the cleanup worker, so it inherits the SSH and NOMAD_TOKEN settings above (the Nomad token additionally needs job-scale permission). Its workflow config (job name, data dir, image, dry-run, delete-untagged) comes from the registry-gc-weekly schedule input; see the config table under Registry GC Saga.
Observability
Tracing
All workers initialize OpenTelemetry with OTLP gRPC export. The Temporal SDK tracing interceptor automatically creates spans for workflow execution and activity dispatch. Activities create explicit client spans with peer.service attributes for service graph edges:
- backup-worker -> nomad, consul, postgres-primary, s3-orchestrator
- trivy-scan-worker -> nomad, trivy-server, postgres (via otelsql)
- cleanup-worker -> nomad
Metrics
Temporal SDK metrics are exposed via Prometheus on :9090/metrics. Key metrics include:
| Metric prefix | Description |
|---|---|
| temporal_workflow_* | Workflow execution counts, latency, failures |
| temporal_activity_* | Activity execution counts, latency, retries |
| temporal_task_queue_* | Task queue depth and poll latency |
Logging
JSON structured logs via log/slog to stdout. The Temporal SDK logger is wrapped via log.NewStructuredLogger so SDK-internal logs (task polling, activity dispatch, retries) share the same JSON format for Alloy/Loki collection.
Development
Each domain also has its own Makefile in its subdirectory for independent builds:
Versioning
A .version file holds the tag for the artifact built from its directory, and each is bumped independently:
| File | Tags | Bumped when |
|---|---|---|
| backup/.version | backup-worker image | the backup worker changes |
| trivyscan/.version | trivy-scan-worker image | the trivy worker changes |
| nodecleanup/.version | cleanup-worker image | the cleanup worker changes |
| ./.version (root) | git release tag + temporal-workers-web image | cutting a repo release |
make push-backup (and push-all) read each worker’s own .version — the root version is never used for worker images. The worker tags drift apart on purpose: rebuild only what changed, and an image tag tells you exactly which worker code it holds.
Deployment
Each domain is deployed as a separate Nomad service job. Workflows are started on cron by Temporal Schedules (Terraform-managed in infrastructure/terragrunt), not by Nomad jobs.
Nomad Jobs
| Job | Type | Image | Task Queue |
|---|---|---|---|
| backup-worker | service | backup-worker | backup-task-queue |
| trivy-scan-worker | service | trivy-scan-worker | trivy-task-queue |
| cleanup-worker | service | cleanup-worker | cleanup-task-queue |
Manual Runs
Schedules can be triggered on demand (temporal schedule trigger --schedule-id backup-daily), or a one-off workflow started directly with the Temporal CLI:
Project Structure
License
MIT