README

Temporal workflow workers for automated infrastructure operations. Each domain (backup, vulnerability scanning, node cleanup) runs as an independent worker with its own container image, task queue, and Nomad service job. The cleanup worker hosts two workflows — orphaned-data cleanup and Docker registry garbage collection — on a shared task queue. Workflows fire on cron from Temporal Schedules, managed as code in the infrastructure/terragrunt repo.

  Temporal Schedules            Temporal Server
  (Terraform-managed)           (temporal-server.service.consul:7233)
         |                               |
         | start workflow on cron        | Task Queue dispatch
         v                               v
                            +------------------------+
                            | backup-worker          |
                            | backup-task-queue      |
                            +------------------------+
                            | trivy-scan-worker      |
                            | trivy-task-queue       |
                            +------------------------+
                            | cleanup-worker         |
                            | cleanup-task-queue     |
                            | (cleanup + registry-gc)|
                            +------------------------+
                                        |
                                        v
                              Nomad, Consul, S3,
                              PostgreSQL, Trivy,
                              SSH (client nodes)

Each worker is a long-running Temporal worker process that polls its dedicated task queue. Workflows are pure orchestration; all I/O happens in activities. Activities are registered as struct methods, sharing pooled connections (DB, S3) across invocations.

Workflow Domains

Backup

Snapshots Nomad Raft state, Consul Raft state (includes Vault data), and PostgreSQL. The three legs run concurrently. The PostgreSQL leg dumps cluster-wide globals (roles, tablespaces, grants) once, enumerates the databases, then dumps each one to its own file with bounded concurrency (PG_DUMP_CONCURRENCY, default 4). Each artifact is stored locally on an NFS mount and uploaded to S3 for off-site redundancy. Old backups are cleaned up based on configurable retention (default: 7 days local, 30 days S3).

Task queue: backup-task-queue
Schedule: backup-daily (Temporal Schedule, cron 0 1 * * *)
Image: backup-worker
Dependencies: Nomad CLI, Consul CLI, pg_dump/pg_dumpall, S3-compatible storage
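The per-database fan-out described above can be sketched in plain Go. This is a minimal illustration under stated assumptions, not the worker's code: dumpAll, the dump callback, and errDumpFailed are hypothetical names, and the real workflow would use Temporal workflow primitives rather than goroutines. It shows a semaphore capping concurrency (the role PG_DUMP_CONCURRENCY plays) and errors being collected so that every database is attempted before the leg fails.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errDumpFailed is a hypothetical error used only by this sketch.
var errDumpFailed = errors.New("pg_dump exited non-zero")

// dumpAll runs dump for every database with at most limit running at once.
// It attempts all databases and returns the joined errors (nil if none failed).
func dumpAll(dbs []string, limit int, dump func(db string) error) error {
	if limit < 1 {
		limit = 1
	}
	sem := make(chan struct{}, limit) // counting semaphore
	var (
		mu   sync.Mutex
		errs []error
		wg   sync.WaitGroup
	)
	for _, db := range dbs {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit dumps are in flight
		go func(db string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			if err := dump(db); err != nil {
				mu.Lock()
				errs = append(errs, fmt.Errorf("dump %s: %w", db, err))
				mu.Unlock()
			}
		}(db)
	}
	wg.Wait()
	return errors.Join(errs...)
}

func main() {
	err := dumpAll([]string{"app", "metrics", "trivy"}, 4, func(db string) error {
		if db == "metrics" {
			return errDumpFailed
		}
		return nil
	})
	fmt.Println(err) // dump metrics: pg_dump exited non-zero
}
```

Joining the errors instead of returning on the first one matches the "leg fails after all databases are attempted" behavior listed under Backup Failure Behavior.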

Trivy Scan

Discovers all running Docker images from the Nomad API, scans them through the Trivy server with bounded concurrency, and stores CVE results in PostgreSQL. Transient errors (server down) are retried by Temporal; permanent errors (image not found) are recorded and skipped.

Task queue: trivy-task-queue
Schedule: trivy-daily (Temporal Schedule, cron 0 3 * * *)
Image: trivy-scan-worker
Dependencies: Trivy CLI, Nomad API, PostgreSQL

Node Cleanup

SSHes to each Nomad client node, identifies job data directories that no longer correspond to running allocations, and removes those older than the grace period. Optionally prunes unused Docker images. Supports dry-run mode for safe previewing.

Task queue: cleanup-task-queue
Schedule: cleanup-daily (Temporal Schedule, cron 0 5 * * *)
Image: cleanup-worker
Dependencies: SSH access to all Nomad client nodes, Nomad API

Registry Garbage Collection

Reclaims disk space from the Docker registry by running the registry’s garbage-collect against its bind-mounted storage. Because the registry tool requires the registry to be offline, the workflow scales the registry Nomad job to 0, waits for its allocations to drain, runs GC over SSH, then scales it back to 1. It reports blobs deleted and bytes reclaimed (before/after sizes). Runs on the cleanup worker, sharing its SSH and Nomad-client infrastructure. See Registry GC Saga for the compensation guarantees.

Task queue: cleanup-task-queue (shared with node cleanup)
Workflow: RegistryGC
Image: cleanup-worker
Dependencies: SSH access to the registry host, Nomad API (job scaling), Docker on the registry host

Registry GC Saga

The registry GC workflow is structured as a saga so the registry is never left stranded offline. The sequence decomposes into per-step activities, each with its own retry policy:

Step	Activity	Retry
Locate registry host	`FindRegistryNode`	3 attempts, exponential backoff
Measure storage (before)	`MeasureRegistryDataDir`	3 attempts
Scale registry to 0	`ScaleRegistry`	3 attempts (idempotent)
Wait for allocs to drain	`WaitRegistryAllocsDrained`	bounded by timeout, heartbeats each poll
Run garbage-collect	`RunRegistryGarbageCollect`	1 attempt (no retry on partial GC)
Measure storage (after)	`MeasureRegistryDataDir`	3 attempts
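The drain-wait step in the table (bounded by a timeout, heartbeating on each poll) follows a common polling shape. The sketch below is a plain-Go approximation with hypothetical names (waitDrained, errDrainTimeout); in the real activity the heartbeat callback would presumably be the Temporal SDK's heartbeat call and running would query the Nomad API.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errDrainTimeout is a hypothetical error for this sketch.
var errDrainTimeout = errors.New("allocations did not drain before timeout")

// waitDrained polls running() every interval until it reports zero
// allocations or the timeout would be exceeded. heartbeat is called once
// per poll so a supervisor can tell the loop is still alive.
func waitDrained(timeout, interval time.Duration, running func() int, heartbeat func()) error {
	deadline := time.Now().Add(timeout)
	for {
		heartbeat()
		if running() == 0 {
			return nil
		}
		// Give up if the next poll would land after the deadline.
		if time.Now().Add(interval).After(deadline) {
			return errDrainTimeout
		}
		time.Sleep(interval)
	}
}

func main() {
	remaining := 2 // pretend one allocation stops per poll
	err := waitDrained(time.Second, 10*time.Millisecond,
		func() int { remaining--; return remaining },
		func() { fmt.Println("heartbeat") },
	)
	fmt.Println("drained:", err == nil)
}
```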

Once the scale-down to 0 succeeds, a compensation is registered with defer and workflow.NewDisconnectedContext. It scales the registry back to 1 and waits for a running allocation — and it always fires, even if GC fails, an activity times out, or the workflow is cancelled mid-flight. Scaling is idempotent, so re-issuing count=1 is a safe no-op on the happy path. If the scale-back itself fails, the workflow logs a CRITICAL recovery message and joins the error so it surfaces to the operator.
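The compensation shape can be illustrated with a plain-Go analogue. This is a sketch only: registry, runGC, and the error values are hypothetical, and it deliberately leaves out the Temporal pieces. In the workflow itself the deferred function would run its activities under the context from workflow.NewDisconnectedContext so that cancellation cannot skip the scale-back.

```go
package main

import (
	"errors"
	"fmt"
)

// registry stands in for the Nomad job; count is its task group count.
type registry struct {
	count       int
	failScaleUp bool // simulate a failing scale-back
}

var (
	errGC      = errors.New("garbage-collect failed")
	errScaleUp = errors.New("scale to 1 failed")
)

// scale sets the group count; repeating the same count is a no-op.
func (r *registry) scale(n int) error {
	if n == 1 && r.failScaleUp {
		return errScaleUp
	}
	r.count = n
	return nil
}

// runGC mirrors the saga: once scale-down succeeds, a deferred compensation
// scales back to 1 on every exit path and joins its own failure into the
// returned error.
func runGC(r *registry, gc func() error) (err error) {
	if err := r.scale(0); err != nil {
		return err // nothing to compensate yet
	}
	defer func() {
		if cerr := r.scale(1); cerr != nil {
			err = errors.Join(err, fmt.Errorf("CRITICAL: registry left at count=0: %w", cerr))
		}
	}()
	return gc()
}

func main() {
	r := &registry{count: 1}
	err := runGC(r, func() error { return errGC })
	fmt.Println(r.count, err) // 1 garbage-collect failed
}
```

Registering the defer only after the scale-down succeeds keeps the compensation from firing when there is nothing to undo.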

Config	Default	Description
`JobName`	`registry`	Nomad job for the registry
`GroupName`	(= `JobName`)	Task group to scale
`RegistryDataDir`	`/mnt/gdrive/munchbox-data/registry`	Host path bind-mounted as `/var/lib/registry`
`RegistryImage`	`registry:3`	Image used for the one-shot GC run
`DryRun`	`true` (overridden by schedule input)	Report blobs that would be deleted without freeing space
`DeleteUntagged`	`true`	Also remove manifests not referenced by any tag

Shared Infrastructure

The shared/ package provides common functionality used by all workers:

File	Purpose
`telemetry.go`	OpenTelemetry tracer initialization with OTLP gRPC export to Tempo
`logging.go`	JSON slog logger wrapped for Temporal SDK compatibility
`metrics.go`	Prometheus metrics handler for Temporal SDK metrics (Tally bridge)
`nomad.go`	OTel-instrumented Nomad API client factory

All workers use StartClientSpan with PeerServiceAttr to produce service graph edges in Tempo/Grafana for every external call (Nomad, Consul, PostgreSQL, S3, Trivy server).

Scheduling

Workflows fire on cron from Temporal Schedules, defined as code in infrastructure/terragrunt (the temporal-config module, applied via the global/temporal-config leaf). Each schedule starts one workflow on its task queue with a JSON input that deserializes into the workflow’s config struct. The workers themselves just poll their queues — nothing in this repo triggers them.

Schedule	Workflow	Task Queue	Cron	Input
`backup-daily`	`Backup`	`backup-task-queue`	`0 1 * * *`	`BackupConfig` (local/S3 days, dump concurrency)
`trivy-daily`	`Scan`	`trivy-task-queue`	`0 3 * * *`	`ScanConfig` (scan concurrency)
`cleanup-daily`	`Cleanup`	`cleanup-task-queue`	`0 5 * * *`	`CleanupConfig` (data dir, grace days, dry-run, docker prune)
`registry-gc-weekly`	`RegistryGC`	`cleanup-task-queue`	`0 2 * * 0`	`RegistryGCConfig` (job/dir/image, dry-run, delete-untagged)
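To make the input-to-struct mapping concrete, here is a hedged sketch for the backup schedule. The JSON keys are taken from the manual-run example in this README; the Go field names, the parseBackupInput helper, and the way defaults are pre-filled are assumptions. In practice the Temporal SDK performs the decoding when it starts the workflow.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BackupConfig is a guess at the struct shape; only the JSON keys come
// from this README.
type BackupConfig struct {
	LocalDays       int `json:"local_days"`
	S3Days          int `json:"s3_days"`
	DumpConcurrency int `json:"dump_concurrency"`
}

// parseBackupInput decodes a schedule input payload. Keys that are absent
// keep the documented defaults (7 days local, 30 days S3, concurrency 4).
func parseBackupInput(raw string) (BackupConfig, error) {
	cfg := BackupConfig{LocalDays: 7, S3Days: 30, DumpConcurrency: 4}
	if err := json.Unmarshal([]byte(raw), &cfg); err != nil {
		return BackupConfig{}, err
	}
	return cfg, nil
}

func main() {
	cfg, err := parseBackupInput(`{"local_days":14}`)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", cfg) // {LocalDays:14 S3Days:30 DumpConcurrency:4}
}
```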

Retry and Error Handling

Activity Timeouts

Each activity has both a StartToCloseTimeout (max time for a single attempt) and a ScheduleToCloseTimeout (max total time including all retries). Quick operations (Nomad/Consul snapshots, globals dump, database listing, S3 uploads, cleanup) use 5/15 minute timeouts. Long operations (per-database pg_dump, image scanning) use 30/60 minute timeouts; per-database dumps additionally heartbeat with a 2 minute timeout.

Retry Policy

Unless an activity overrides it (see the per-step table under Registry GC Saga), activities share a common retry policy with exponential backoff:

Parameter	Value
Initial interval	1 second
Backoff coefficient	2.0
Maximum interval	1 minute
Maximum attempts	3
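As a worked example of what these values imply, the sketch below computes the wait before each retry. retryWaits is a hypothetical helper, not part of the Temporal SDK; it assumes the usual rule that each wait is the previous one multiplied by the coefficient, capped at the maximum interval, and that the maximum attempts count includes the first try.

```go
package main

import (
	"fmt"
	"time"
)

// retryWaits lists the wait before each retry. maxAttempts counts the
// first try, so there are maxAttempts-1 waits.
func retryWaits(initial time.Duration, coeff float64, maxInterval time.Duration, maxAttempts int) []time.Duration {
	var waits []time.Duration
	next := float64(initial)
	for i := 1; i < maxAttempts; i++ {
		if next > float64(maxInterval) {
			next = float64(maxInterval) // cap at the maximum interval
		}
		waits = append(waits, time.Duration(next))
		next *= coeff
	}
	return waits
}

func main() {
	fmt.Println(retryWaits(time.Second, 2.0, time.Minute, 3)) // [1s 2s]
}
```

With three attempts the 1 minute cap is never reached. For the 5/15 minute timeouts, three attempts that each ran the full 5 minutes plus 3 seconds of backoff would slightly overshoot 15 minutes, so ScheduleToCloseTimeout is the effective ceiling in that worst case.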

Error Classification (Trivy Scan)

Trivy scan activities distinguish between transient and permanent failures:

Error Type	Examples	Behavior
Transient	Connection refused, timeout, connection reset	Returns error; Temporal retries automatically
Permanent	Image not found, manifest unknown	Returns `NonRetryableApplicationError`; Temporal stops immediately
Parse failure	Invalid trivy JSON output	Returns `NonRetryableApplicationError`
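The classification in the table could be approximated as below. This is a sketch under assumptions: the marker substrings come from the examples in the table, the real matching rules may differ, and treating anything unrecognized as transient is a choice made here for illustration. In the activity, the permanent branch would wrap the error as a NonRetryableApplicationError rather than return a kind.

```go
package main

import (
	"fmt"
	"strings"
)

type failureKind int

const (
	transient failureKind = iota // return the error; Temporal retries
	permanent                    // mark non-retryable; Temporal stops
)

// permanentMarkers are substrings assumed to appear in scan errors.
var permanentMarkers = []string{"image not found", "manifest unknown"}

// classify maps an error message to a failure kind, defaulting to transient.
func classify(msg string) failureKind {
	lower := strings.ToLower(msg)
	for _, m := range permanentMarkers {
		if strings.Contains(lower, m) {
			return permanent
		}
	}
	return transient
}

func main() {
	fmt.Println(classify("MANIFEST_UNKNOWN: manifest unknown") == permanent) // true
	fmt.Println(classify("dial tcp: connection refused") == permanent)       // false
}
```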

Backup Failure Behavior

Step	On Failure
Nomad/Consul snapshot	Leg fails; workflow terminates with error
PostgreSQL globals dump or database listing	Leg fails; workflow terminates with error
Per-database dump	Leg fails after all databases are attempted; workflow terminates with error
S3 upload	Warning logged, workflow continues
S3 quota exceeded	Oldest backup evicted, upload retried (up to 3 evictions)
Local/S3 cleanup	Warning logged, workflow continues
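The quota row above describes an evict-and-retry loop. The following is a toy model of that loop, not the worker's S3 code: store, errQuota, and uploadWithEviction are hypothetical, and the in-memory slice stands in for a bucket listing ordered oldest first.

```go
package main

import (
	"errors"
	"fmt"
)

var errQuota = errors.New("quota exceeded")

// store is a toy bucket holding object names oldest-first with a fixed capacity.
type store struct {
	objects  []string
	capacity int
}

func (s *store) put(name string) error {
	if len(s.objects) >= s.capacity {
		return errQuota
	}
	s.objects = append(s.objects, name)
	return nil
}

// evictOldest removes the oldest object and reports whether one existed.
func (s *store) evictOldest() bool {
	if len(s.objects) == 0 {
		return false
	}
	s.objects = s.objects[1:]
	return true
}

// uploadWithEviction retries put after evicting the oldest object, giving up
// once maxEvictions have been spent or nothing is left to evict.
func uploadWithEviction(s *store, name string, maxEvictions int) (evicted int, err error) {
	for {
		err = s.put(name)
		if !errors.Is(err, errQuota) {
			return evicted, err // success or a non-quota failure
		}
		if evicted == maxEvictions || !s.evictOldest() {
			return evicted, err
		}
		evicted++
	}
}

func main() {
	s := &store{objects: []string{"mon", "tue"}, capacity: 2}
	n, err := uploadWithEviction(s, "wed", 3)
	fmt.Println(n, err, s.objects) // 1 <nil> [tue wed]
}
```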

Cleanup Safety Features

Dry-run mode (default: enabled) reports what would be deleted without removing anything
Grace period (default: 7 days) prevents deletion of recently-used directories
System directories (alloc, plugins, tmp, server, client) are always excluded
Node failures are tracked; the workflow reports which nodes failed
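Taken together, these rules amount to a small eligibility check. The sketch below is illustrative only: dataDir, shouldRemove, and cleanup are hypothetical names, and matching directories to running jobs by name is an assumption about how the worker correlates them.

```go
package main

import "fmt"

// systemDirs are always excluded, per the list above.
var systemDirs = map[string]bool{
	"alloc": true, "plugins": true, "tmp": true, "server": true, "client": true,
}

type dataDir struct {
	name    string
	ageDays int
}

// shouldRemove reports whether a directory is eligible: not a system
// directory, not tied to a running job, and older than the grace period.
func shouldRemove(d dataDir, running map[string]bool, graceDays int) bool {
	if systemDirs[d.name] || running[d.name] {
		return false
	}
	return d.ageDays > graceDays
}

// cleanup returns the directories it selected and only calls remove
// when dryRun is false.
func cleanup(dirs []dataDir, running map[string]bool, graceDays int, dryRun bool, remove func(string)) []string {
	var selected []string
	for _, d := range dirs {
		if !shouldRemove(d, running, graceDays) {
			continue
		}
		selected = append(selected, d.name)
		if !dryRun {
			remove(d.name)
		}
	}
	return selected
}

func main() {
	dirs := []dataDir{{"alloc", 90}, {"web", 30}, {"old-job", 30}, {"new-job", 2}}
	running := map[string]bool{"web": true}
	got := cleanup(dirs, running, 7, true, func(string) {})
	fmt.Println(got) // [old-job]
}
```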

Configuration

Worker configuration is via environment variables, injected by Nomad job templates from Vault. Per-run workflow settings come from the schedule input (see Scheduling).

Common (all workers)

Variable	Default	Description
`TEMPORAL_ADDRESS`	`localhost:7233`	Temporal server endpoint
`OTEL_EXPORTER_OTLP_ENDPOINT`	`tempo.service.consul:4317`	OTLP gRPC endpoint
`METRICS_LISTEN`	`:9090`	Prometheus metrics listen address
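Reading these variables with their defaults might look like the sketch below. The variable names and default values come from the table; envOr, commonConfig, and loadCommon are hypothetical, and treating an empty value the same as an unset one is an assumption.

```go
package main

import (
	"fmt"
	"os"
)

// envOr returns the value of key from lookup, or def when it is unset or empty.
func envOr(lookup func(string) (string, bool), key, def string) string {
	if v, ok := lookup(key); ok && v != "" {
		return v
	}
	return def
}

// commonConfig holds the settings shared by all workers.
type commonConfig struct {
	TemporalAddress string
	OTLPEndpoint    string
	MetricsListen   string
}

// loadCommon takes the lookup function as a parameter so it can be
// exercised without touching the process environment.
func loadCommon(lookup func(string) (string, bool)) commonConfig {
	return commonConfig{
		TemporalAddress: envOr(lookup, "TEMPORAL_ADDRESS", "localhost:7233"),
		OTLPEndpoint:    envOr(lookup, "OTEL_EXPORTER_OTLP_ENDPOINT", "tempo.service.consul:4317"),
		MetricsListen:   envOr(lookup, "METRICS_LISTEN", ":9090"),
	}
}

func main() {
	fmt.Printf("%+v\n", loadCommon(os.LookupEnv))
}
```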

Backup Worker

Variable	Default	Description
`S3_ENDPOINT`	–	S3-compatible endpoint URL
`S3_BUCKET`	–	Target bucket name
`S3_ACCESS_KEY`	–	S3 access key ID
`S3_SECRET_KEY`	–	S3 secret access key
`NOMAD_TOKEN`	–	Nomad API token (snapshot permissions)
`CONSUL_HTTP_TOKEN`	–	Consul API token (snapshot permissions)
`PG_HOST`	`postgres-primary.service.consul`	PostgreSQL host for dumps
`PG_USER`	`postgres`	PostgreSQL user for dumps
`PGPASSWORD`	–	PostgreSQL password for pg_dump/pg_dumpall

Trivy Scan Worker

Variable	Default	Description
`TRIVY_SERVER_ADDR`	`http://trivy-server.service.consul:4954`	Trivy server endpoint
`TRIVY_DB_HOST`	`postgres-shared.service.consul`	PostgreSQL host for scan results
`TRIVY_DB_PORT`	`5432`	PostgreSQL port
`TRIVY_DB_USER`	–	PostgreSQL username
`TRIVY_DB_PASSWORD`	–	PostgreSQL password
`TRIVY_DB_NAME`	`trivy`	Database name
`DB_SSLMODE`	`verify-ca`	PostgreSQL SSL mode
`DB_SSLROOTCERT`	–	Path to CA certificate
`NOMAD_TOKEN`	–	Nomad API token (read allocations)

Cleanup Worker

Variable	Default	Description
`SSH_KEY_PATH`	`/root/.ssh/id_ed25519`	SSH private key path
`SSH_CERT_PATH`	`/root/.ssh/id_ed25519-cert.pub`	SSH client certificate path
`SSH_HOST_CA_PATH`	`/root/.ssh/ssh-host-ca.pub`	SSH host CA public key path
`NOMAD_TOKEN`	–	Nomad API token (read nodes/allocations)

Registry GC

Registry GC runs on the cleanup worker, so it inherits the SSH and NOMAD_TOKEN settings above (the Nomad token additionally needs job-scale permission). Its workflow config (job name, data dir, image, dry-run, delete-untagged) comes from the registry-gc-weekly schedule input; see the config table under Registry GC Saga.

Observability

Tracing

All workers initialize OpenTelemetry with OTLP gRPC export. The Temporal SDK tracing interceptor automatically creates spans for workflow execution and activity dispatch. Activities create explicit client spans with peer.service attributes for service graph edges:

backup-worker -> nomad, consul, postgres-primary, s3-orchestrator
trivy-scan-worker -> nomad, trivy-server, postgres (via otelsql)
cleanup-worker -> nomad

Metrics

Temporal SDK metrics are exposed via Prometheus on :9090/metrics. Key metrics include:

Metric prefix	Description
`temporal_workflow_*`	Workflow execution counts, latency, failures
`temporal_activity_*`	Activity execution counts, latency, retries
`temporal_task_queue_*`	Task queue depth and poll latency

Logging

JSON structured logs via log/slog to stdout. The Temporal SDK logger is wrapped via log.NewStructuredLogger so SDK-internal logs (task polling, activity dispatch, retries) share the same JSON format for Alloy/Loki collection.

Development

make build

make test

make vet

make lint

make govulncheck

make push-all

make push-backup
make push-trivy
make push-cleanup

Each domain also has its own Makefile in its subdirectory for independent builds:

cd backup && make help
cd trivyscan && make help
cd nodecleanup && make help

Versioning

A .version file holds the tag for the artifact built from its directory, and each is bumped independently:

File	Tags	Bumped when
`backup/.version`	`backup-worker` image	the backup worker changes
`trivyscan/.version`	`trivy-scan-worker` image	the trivy worker changes
`nodecleanup/.version`	`cleanup-worker` image	the cleanup worker changes
`./.version` (root)	git release tag + `temporal-workers-web` image	cutting a repo release

make push-backup (and push-all) read each worker’s own .version — the root version is never used for worker images. The worker tags drift apart on purpose: rebuild only what changed, and an image tag tells you exactly which worker code it holds.

Deployment

Each domain is deployed as a separate Nomad service job. Workflows are started on cron by Temporal Schedules (Terraform-managed in infrastructure/terragrunt), not by Nomad jobs.

Nomad Jobs

Job	Type	Image	Task Queue
`backup-worker`	service	`backup-worker`	`backup-task-queue`
`trivy-scan-worker`	service	`trivy-scan-worker`	`trivy-task-queue`
`cleanup-worker`	service	`cleanup-worker`	`cleanup-task-queue`

Manual Runs

Schedules can be triggered on demand (temporal schedule trigger --schedule-id backup-daily), or a one-off workflow can be started directly with the Temporal CLI:

temporal workflow start --task-queue backup-task-queue --type Backup \
  --address temporal-server.service.consul:7233 \
  --input '{"local_days":7,"s3_days":30,"dump_concurrency":4}'

temporal workflow start --task-queue trivy-task-queue --type Scan \
  --address temporal-server.service.consul:7233 \
  --input '{"concurrency":10}'

temporal workflow start --task-queue cleanup-task-queue --type Cleanup \
  --address temporal-server.service.consul:7233 \
  --input '{"data_dir":"/opt/nomad/data","grace_days":7,"dry_run":true,"docker_prune":false}'

temporal workflow start --task-queue cleanup-task-queue --type RegistryGC \
  --address temporal-server.service.consul:7233 \
  --input '{"job_name":"registry","registry_data_dir":"/mnt/gdrive/munchbox-data/registry","registry_image":"registry:3","dry_run":true,"delete_untagged":true}'

Project Structure

nomad-temporal-jobs/
  .github/workflows/
    ci.yml                           CI: lint, test, vet, govulncheck, version check
  .gitignore                         Ignores coverage output and dist artifacts
  .golangci.yml                      Linter configuration (gocritic, misspell)
  .version                           Root version tag
  LICENSE                            MIT
  Makefile                           Root: build, test, lint, push-all targets
  README.md                          This file
  go.mod                             Module definition
  go.sum                             Dependency lock file
  shared/
    telemetry.go                     OTel tracer init, span helpers, peer.service attributes
    logging.go                       JSON slog logger with Temporal SDK adapter
    metrics.go                       Prometheus metrics handler via Tally bridge
    nomad.go                         OTel-instrumented Nomad API client factory
  backup/
    .version                         Image version tag
    Dockerfile                       Multi-stage build (Debian, Nomad/Consul/PG18 CLI)
    Makefile                         Build and push targets
    activities/
      activities.go                  Activity struct: snapshots, S3 upload, retention cleanup
    workflows/
      backup.go                      Concurrent snapshot legs + per-database PostgreSQL fan-out
    worker/
      main.go                       Worker entry point (tracing, slog, metrics, Temporal client)
  trivyscan/
    .version                         Image version tag
    Dockerfile                       Multi-stage build (Alpine, Trivy CLI)
    Makefile                         Build and push targets
    activities/
      activities.go                  Activity struct: Nomad image discovery, Trivy scan, DB save
    workflows/
      scan.go                        Bounded-concurrency scanning with per-image error handling
    worker/
      main.go                       Worker entry point (tracing, slog, metrics, Temporal client)
  nodecleanup/
    .version                         Image version tag
    Dockerfile                       Multi-stage build (Alpine, SSH only)
    Makefile                         Build and push targets
    activities/
      activities.go                  Activity struct: node discovery, SSH cleanup, script gen
      registry_gc.go                 Saga activities: find node, scale, drain, GC, measure
    workflows/
      cleanup.go                     Sequential per-node cleanup with failure tracking
      registry_gc.go                 Saga orchestration with deferred scale-back compensation
    worker/
      main.go                       Worker entry point (registers Cleanup + RegistryGC workflows)

License

MIT