nomad-temporal-jobs

Temporal workflow workers for automated infrastructure operations. Each domain (backup, vulnerability scanning, node cleanup) runs as an independent worker with its own container image, task queue, and Nomad service job. The cleanup worker hosts two workflows — orphaned-data cleanup and Docker registry garbage collection — on a shared task queue. Workflows fire on cron from Temporal Schedules, managed as code in the infrastructure/terragrunt repo.

  Temporal Schedules            Temporal Server
  (Terraform-managed)           (temporal-server.service.consul:7233)
         |                               |
         | start workflow on cron        | Task Queue dispatch
         v                               v
                            +------------------------+
                            | backup-worker          |
                            | backup-task-queue      |
                            +------------------------+
                            | trivy-scan-worker      |
                            | trivy-task-queue       |
                            +------------------------+
                            | cleanup-worker         |
                            | cleanup-task-queue     |
                            | (cleanup + registry-gc)|
                            +------------------------+
                                        |
                                        v
                              Nomad, Consul, S3,
                              PostgreSQL, Trivy,
                              SSH (client nodes)

Each worker is a long-running Temporal worker process that polls its dedicated task queue. Workflows are pure orchestration; all I/O happens in activities. Activities are registered as struct methods, sharing pooled connections (DB, S3) across invocations.

Workflow Domains

Backup

Snapshots Nomad Raft state, Consul Raft state (includes Vault data), and PostgreSQL. The three legs run concurrently. The PostgreSQL leg dumps cluster-wide globals (roles, tablespaces, grants) once, enumerates the databases, then dumps each one to its own file with bounded concurrency (PG_DUMP_CONCURRENCY, default 4). Each artifact is stored locally on an NFS mount and uploaded to S3 for off-site redundancy. Old backups are cleaned up based on configurable retention (default: 7 days local, 30 days S3).

Task queue: backup-task-queue
Schedule: backup-daily (Temporal Schedule)
Image: backup-worker
Dependencies: Nomad CLI, Consul CLI, pg_dump/pg_dumpall, S3-compatible storage
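
The bounded per-database fan-out can be sketched in plain Go as below. This is an illustration of the pattern only, not the repository's workflow code: dumpAll, errSimulated, and the database names are hypothetical, and inside a Temporal workflow the goroutines would be replaced by the SDK's deterministic primitives such as workflow.Go. The sketch attempts every database and reports all failures together, which matches the "fails after all databases are attempted" behavior described under Backup Failure Behavior.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errSimulated is a hypothetical failure used by the demo in main.
var errSimulated = errors.New("simulated failure")

// dumpAll runs dump for every database with at most limit dumps in flight.
// Every database is attempted; all failures are returned joined together.
func dumpAll(dbs []string, limit int, dump func(db string) error) error {
	if limit < 1 {
		limit = 1
	}
	sem := make(chan struct{}, limit) // counting semaphore
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		errs []error
	)
	for _, db := range dbs {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit dumps are already running
		go func(db string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			if err := dump(db); err != nil {
				mu.Lock()
				errs = append(errs, fmt.Errorf("dump %s: %w", db, err))
				mu.Unlock()
			}
		}(db)
	}
	wg.Wait()
	return errors.Join(errs...) // nil when no dump failed
}

func main() {
	dbs := []string{"app", "metrics", "auth"} // hypothetical database names
	err := dumpAll(dbs, 4, func(db string) error {
		if db == "metrics" {
			return errSimulated
		}
		return nil
	})
	fmt.Println(err)
}
```

A buffered channel is used as the semaphore because it needs no extra dependency; the limit of 4 mirrors the documented PG_DUMP_CONCURRENCY default.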

Trivy Scan

Discovers all running Docker images from the Nomad API, scans them through the Trivy server with bounded concurrency, and stores CVE results in PostgreSQL. Transient errors (server down) are retried by Temporal; permanent errors (image not found) are recorded and skipped.

Task queue: trivy-task-queue
Schedule: trivy-daily (Temporal Schedule)
Image: trivy-scan-worker
Dependencies: Trivy CLI, Nomad API, PostgreSQL

Node Cleanup

SSHes to each Nomad client node, identifies job data directories that no longer correspond to running allocations, and removes those older than the grace period. Optionally prunes unused Docker images. Supports dry-run mode for safe previewing.

Task queue: cleanup-task-queue
Schedule: cleanup-daily (Temporal Schedule)
Image: cleanup-worker
Dependencies: SSH access to all Nomad client nodes, Nomad API

Registry Garbage Collection

Reclaims disk space from the Docker registry by running the registry’s garbage-collect against its bind-mounted storage. Because the registry tool requires the registry to be offline, the workflow scales the registry Nomad job to 0, waits for its allocations to drain, runs GC over SSH, then scales it back to 1. It reports blobs deleted and bytes reclaimed (before/after sizes). Runs on the cleanup worker, sharing its SSH and Nomad-client infrastructure. See Registry GC Saga for the compensation guarantees.

Task queue: cleanup-task-queue (shared with node cleanup)
Workflow: RegistryGC
Image: cleanup-worker
Dependencies: SSH access to the registry host, Nomad API (job scaling), Docker on the registry host

Registry GC Saga

The registry GC workflow is structured as a saga so the registry is never left stranded offline. The sequence decomposes into per-step activities, each with its own retry policy:

Step                      Activity                   Retry
Locate registry host      FindRegistryNode           3 attempts, exponential backoff
Measure storage (before)  MeasureRegistryDataDir     3 attempts
Scale registry to 0       ScaleRegistry              3 attempts (idempotent)
Wait for allocs to drain  WaitRegistryAllocsDrained  bounded by timeout, heartbeats each poll
Run garbage-collect       RunRegistryGarbageCollect  1 attempt (no retry on partial GC)
Measure storage (after)   MeasureRegistryDataDir     3 attempts

Once the scale-down to 0 succeeds, a compensation is registered with defer and workflow.NewDisconnectedContext. It scales the registry back to 1 and waits for a running allocation — and it always fires, even if GC fails, an activity times out, or the workflow is cancelled mid-flight. Scaling is idempotent, so re-issuing count=1 is a safe no-op on the happy path. If the scale-back itself fails, the workflow logs a CRITICAL recovery message and joins the error so it surfaces to the operator.
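
The compensation pattern can be sketched in plain Go as follows. This is an analogy under stated assumptions, not the workflow itself: the scaler interface, fakeScaler, runGC, and demo are hypothetical names, and context.WithoutCancel (Go 1.21+) stands in for workflow.NewDisconnectedContext. The points it illustrates are that the deferred scale-back is registered only after the scale-down succeeds, that it runs on a context detached from cancellation, and that a failed scale-back is joined onto the original error.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// scaler is a hypothetical stand-in for the Nomad job-scaling activity.
type scaler interface {
	Scale(ctx context.Context, count int) error
}

// runGC scales the registry to 0, runs gc, and always scales back to 1.
func runGC(ctx context.Context, s scaler, gc func(context.Context) error) (err error) {
	if serr := s.Scale(ctx, 0); serr != nil {
		return fmt.Errorf("scale down: %w", serr) // nothing to compensate yet
	}
	// Registered only after the scale-down succeeded. The detached context
	// keeps the compensation alive even if ctx has already been cancelled.
	defer func() {
		if cerr := s.Scale(context.WithoutCancel(ctx), 1); cerr != nil {
			err = errors.Join(err, fmt.Errorf("CRITICAL: registry left at 0: %w", cerr))
		}
	}()
	return gc(ctx)
}

// fakeScaler records scale requests and refuses to act on a cancelled context.
type fakeScaler struct{ counts []int }

func (f *fakeScaler) Scale(ctx context.Context, count int) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	f.counts = append(f.counts, count)
	return nil
}

// demo cancels the context in the middle of the GC step.
func demo() ([]int, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	s := &fakeScaler{}
	err := runGC(ctx, s, func(ctx context.Context) error {
		cancel() // simulate cancellation mid-flight
		return ctx.Err()
	})
	return s.counts, err
}

func main() {
	counts, err := demo()
	fmt.Println(counts, err)
}
```

Even though the GC step returns a cancellation error, the recorded scale requests end with a return to count 1, because the deferred call does not inherit the cancellation.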

Config           Default                              Description
JobName          registry                             Nomad job for the registry
GroupName        (= JobName)                          Task group to scale
RegistryDataDir  /mnt/gdrive/munchbox-data/registry   Host path bind-mounted as /var/lib/registry
RegistryImage    registry:3                           Image used for the one-shot GC run
DryRun           true (overridden by schedule input)  Report blobs that would be deleted without freeing space
DeleteUntagged   true                                 Also remove manifests not referenced by any tag

Shared Infrastructure

The shared/ package provides common functionality used by all workers:

File          Purpose
telemetry.go  OpenTelemetry tracer initialization with OTLP gRPC export to Tempo
logging.go    JSON slog logger wrapped for Temporal SDK compatibility
metrics.go    Prometheus metrics handler for Temporal SDK metrics (Tally bridge)
nomad.go      OTel-instrumented Nomad API client factory

All workers use StartClientSpan with PeerServiceAttr to produce service graph edges in Tempo/Grafana for every external call (Nomad, Consul, PostgreSQL, S3, Trivy server).

Scheduling

Workflows fire on cron from Temporal Schedules, defined as code in infrastructure/terragrunt (the temporal-config module, applied via the global/temporal-config leaf). Each schedule starts one workflow on its task queue with a JSON input that deserializes into the workflow’s config struct. The workers themselves just poll their queues — nothing in this repo triggers them.

Schedule            Workflow    Task Queue          Cron       Input
backup-daily        Backup      backup-task-queue   0 1 * * *  BackupConfig (local/S3 days, dump concurrency)
trivy-daily         Scan        trivy-task-queue    0 3 * * *  ScanConfig (scan concurrency)
cleanup-daily       Cleanup     cleanup-task-queue  0 5 * * *  CleanupConfig (data dir, grace days, dry-run, docker prune)
registry-gc-weekly  RegistryGC  cleanup-task-queue  0 2 * * 0  RegistryGCConfig (job/dir/image, dry-run, delete-untagged)
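
As a rough illustration of how a schedule's JSON input maps onto a config struct, the sketch below decodes a backup input. The JSON keys are the ones used in the Manual Runs commands; the Go field names, the parseBackupConfig helper, and the idea of overlaying the input on the documented defaults are assumptions. In a running worker, the Temporal SDK's data converter does the decoding rather than an explicit json.Unmarshal call.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BackupConfig mirrors the JSON keys from the manual-run example.
// The Go field names are assumptions made for this sketch.
type BackupConfig struct {
	LocalDays       int `json:"local_days"`
	S3Days          int `json:"s3_days"`
	DumpConcurrency int `json:"dump_concurrency"`
}

// parseBackupConfig starts from the documented defaults (7 days local,
// 30 days S3, concurrency 4) and overlays whatever the input provides.
func parseBackupConfig(input []byte) (BackupConfig, error) {
	cfg := BackupConfig{LocalDays: 7, S3Days: 30, DumpConcurrency: 4}
	if err := json.Unmarshal(input, &cfg); err != nil {
		return BackupConfig{}, fmt.Errorf("decode schedule input: %w", err)
	}
	return cfg, nil
}

func main() {
	cfg, err := parseBackupConfig([]byte(`{"local_days":14}`))
	fmt.Println(cfg, err)
}
```

Keys absent from the input keep their default values, because json.Unmarshal only writes the fields it finds.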

Retry and Error Handling

Activity Timeouts

Each activity has both a StartToCloseTimeout (maximum time for a single attempt) and a ScheduleToCloseTimeout (maximum total time, including all retries). Quick operations (Nomad/Consul snapshots, globals dump, database listing, S3 uploads, cleanup) use a 5-minute StartToCloseTimeout and a 15-minute ScheduleToCloseTimeout. Long operations (per-database pg_dump, image scanning) use 30 and 60 minutes respectively; per-database dumps additionally heartbeat, with a 2-minute heartbeat timeout.

Retry Policy

Except where a workflow sets a per-step policy (see the table under Registry GC Saga), activities share a common retry policy with exponential backoff:

Parameter            Value
Initial interval     1 second
Backoff coefficient  2.0
Maximum interval     1 minute
Maximum attempts     3
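
To make the arithmetic concrete, the sketch below computes the wait before each retry for the values in the table. The retryPolicy type and its delays method are hypothetical and only model the backoff calculation (interval multiplied by the coefficient after each failure, capped at the maximum interval); they are not the Temporal SDK's RetryPolicy type. With 3 maximum attempts there are only 2 retries, so the 1-minute cap is never reached under the default policy.

```go
package main

import (
	"fmt"
	"time"
)

// retryPolicy is a hypothetical model of the table above.
type retryPolicy struct {
	Initial     time.Duration
	Coefficient float64
	MaxInterval time.Duration
	MaxAttempts int
}

// defaultPolicy holds the documented values.
var defaultPolicy = retryPolicy{
	Initial:     time.Second,
	Coefficient: 2.0,
	MaxInterval: time.Minute,
	MaxAttempts: 3,
}

// delays returns the wait before each retry.
// N attempts means N-1 retries, so N-1 waits.
func (p retryPolicy) delays() []time.Duration {
	var out []time.Duration
	next := float64(p.Initial)
	for attempt := 1; attempt < p.MaxAttempts; attempt++ {
		wait := time.Duration(next)
		if wait > p.MaxInterval {
			wait = p.MaxInterval // cap the backoff
		}
		out = append(out, wait)
		next *= p.Coefficient
	}
	return out
}

func main() {
	fmt.Println(defaultPolicy.delays()) // prints [1s 2s]
}
```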

Error Classification (Trivy Scan)

Trivy scan activities distinguish between transient and permanent failures:

Error Type     Examples                                       Behavior
Transient      Connection refused, timeout, connection reset  Returns error; Temporal retries automatically
Permanent      Image not found, manifest unknown              Returns NonRetryableApplicationError; Temporal stops immediately
Parse failure  Invalid trivy JSON output                      Returns NonRetryableApplicationError
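
One way such a classification could look is sketched below. The marker substrings are taken from the examples in the table, but matching on message text, the errorClass type, and the classify helpers are all assumptions; the actual activity may inspect errors differently. In the worker, a permanent result would be wrapped with the Temporal Go SDK's temporal.NewNonRetryableApplicationError before being returned, which this stdlib-only sketch omits.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

type errorClass int

const (
	transient errorClass = iota // return as-is; Temporal retries per policy
	permanent                   // wrap as non-retryable; Temporal stops
)

// permanentMarkers are assumed substrings based on the table's examples.
var permanentMarkers = []string{"image not found", "manifest unknown"}

// classifyMessage treats anything it does not recognize as transient,
// so unknown failures still get the benefit of a retry.
func classifyMessage(msg string) errorClass {
	msg = strings.ToLower(msg)
	for _, marker := range permanentMarkers {
		if strings.Contains(msg, marker) {
			return permanent
		}
	}
	return transient
}

// classify is a convenience wrapper for error values.
func classify(err error) errorClass {
	return classifyMessage(err.Error())
}

func main() {
	refused := errors.New("dial tcp: connection refused")
	missing := errors.New("MANIFEST UNKNOWN: tag does not exist")
	fmt.Println(classify(refused) == transient, classify(missing) == permanent)
}
```

Defaulting to transient is a deliberate choice in this sketch: the retry policy bounds the cost of a wrong guess at three attempts, whereas misclassifying a recoverable failure as permanent would skip the image outright.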

Backup Failure Behavior

Step                                         On Failure
Nomad/Consul snapshot                        Leg fails; workflow terminates with error
PostgreSQL globals dump or database listing  Leg fails; workflow terminates with error
Per-database dump                            Leg fails after all databases are attempted; workflow terminates with error
S3 upload                                    Warning logged, workflow continues
S3 quota exceeded                            Oldest backup evicted, upload retried (up to 3 evictions)
Local/S3 cleanup                             Warning logged, workflow continues
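
The quota-eviction row can be read as a small loop: upload, and on a quota error evict the oldest backup and try again, giving up after three evictions. The sketch below shows that loop under assumptions: the store interface, errQuota sentinel, fakeStore, and uploadWithEviction are hypothetical names, and how the real activity detects a quota error is not specified here.

```go
package main

import (
	"errors"
	"fmt"
)

// errQuota is a hypothetical sentinel for "bucket quota exceeded".
var errQuota = errors.New("quota exceeded")

// store is a hypothetical slice of the S3 client surface.
type store interface {
	Upload(key string) error
	EvictOldest() error
}

const maxEvictions = 3

// uploadWithEviction retries an upload after evicting the oldest backup,
// performing at most maxEvictions evictions.
func uploadWithEviction(s store, key string) error {
	for evictions := 0; ; evictions++ {
		err := s.Upload(key)
		if err == nil || !errors.Is(err, errQuota) {
			return err // success, or a failure that eviction cannot fix
		}
		if evictions == maxEvictions {
			return fmt.Errorf("upload %s: over quota after %d evictions: %w", key, maxEvictions, err)
		}
		if eerr := s.EvictOldest(); eerr != nil {
			return fmt.Errorf("evict oldest backup: %w", eerr)
		}
	}
}

// fakeStore rejects uploads until enough backups have been evicted.
type fakeStore struct {
	needed  int // evictions required before an upload fits
	evicted int
}

func (f *fakeStore) Upload(string) error {
	if f.evicted < f.needed {
		return errQuota
	}
	return nil
}

func (f *fakeStore) EvictOldest() error {
	f.evicted++
	return nil
}

func main() {
	s := &fakeStore{needed: 2}
	fmt.Println(uploadWithEviction(s, "backups/pg/app.dump"), s.evicted)
}
```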

Cleanup Safety Features

  • Dry-run mode (default: enabled) reports what would be deleted without removing anything
  • Grace period (default: 7 days) prevents deletion of recently-used directories
  • System directories (alloc, plugins, tmp, server, client) are always excluded
  • Node failures are tracked; the workflow reports which nodes failed
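
The selection rule implied by these safeguards can be sketched as a pure function. The excluded directory names come from the list above; everything else is an assumption made for illustration, including the dirInfo type, the use of modification time as the "recently used" signal, and matching directories to running jobs by name.

```go
package main

import (
	"fmt"
	"time"
)

// systemDirs are never candidates for removal (names from the list above).
var systemDirs = map[string]bool{
	"alloc": true, "plugins": true, "tmp": true, "server": true, "client": true,
}

// dirInfo is a hypothetical view of one directory under the data dir.
type dirInfo struct {
	Name    string
	ModTime time.Time
}

// orphans returns directories that are not system directories, do not belong
// to a running job, and were last modified before the grace period began.
func orphans(dirs []dirInfo, running map[string]bool, grace time.Duration, now time.Time) []string {
	cutoff := now.Add(-grace)
	var out []string
	for _, d := range dirs {
		if systemDirs[d.Name] || running[d.Name] {
			continue
		}
		if d.ModTime.Before(cutoff) {
			out = append(out, d.Name)
		}
	}
	return out
}

// demo evaluates a fixed sample with the default 7-day grace period.
func demo() []string {
	now := time.Date(2024, time.January, 31, 5, 0, 0, 0, time.UTC)
	day := 24 * time.Hour
	dirs := []dirInfo{
		{Name: "alloc", ModTime: now.Add(-30 * day)},     // system directory
		{Name: "grafana", ModTime: now.Add(-30 * day)},   // still running
		{Name: "old-job", ModTime: now.Add(-10 * day)},   // orphaned and stale
		{Name: "recent-job", ModTime: now.Add(-2 * day)}, // inside grace period
	}
	running := map[string]bool{"grafana": true}
	return orphans(dirs, running, 7*day, now)
}

func main() {
	fmt.Println(demo()) // prints [old-job]
}
```

In dry-run mode, the result of such a function would only be reported; deletion would happen only when dry-run is disabled.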

Configuration

All configuration is via environment variables, injected by Nomad job templates from Vault.

Common (all workers)

Variable                     Default                    Description
TEMPORAL_ADDRESS             localhost:7233             Temporal server endpoint
OTEL_EXPORTER_OTLP_ENDPOINT  tempo.service.consul:4317  OTLP gRPC endpoint
METRICS_LISTEN               :9090                      Prometheus metrics listen address
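
A worker might read these variables with a small helper that falls back to the documented default. The sketch below is an assumption about shape, not the repository's code: envOr, commonConfig, and loadCommon are hypothetical names, and treating an empty value the same as an unset one is a choice made for this example.

```go
package main

import (
	"fmt"
	"os"
)

// envOr returns the value of key, or def when the variable is unset or empty.
func envOr(key, def string) string {
	if v, ok := os.LookupEnv(key); ok && v != "" {
		return v
	}
	return def
}

// commonConfig groups the settings shared by all workers.
type commonConfig struct {
	TemporalAddress string
	OTLPEndpoint    string
	MetricsListen   string
}

// loadCommon applies the defaults from the table above.
func loadCommon() commonConfig {
	return commonConfig{
		TemporalAddress: envOr("TEMPORAL_ADDRESS", "localhost:7233"),
		OTLPEndpoint:    envOr("OTEL_EXPORTER_OTLP_ENDPOINT", "tempo.service.consul:4317"),
		MetricsListen:   envOr("METRICS_LISTEN", ":9090"),
	}
}

func main() {
	fmt.Printf("%+v\n", loadCommon())
}
```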

Backup Worker

Variable           Default                          Description
S3_ENDPOINT        (none)                           S3-compatible endpoint URL
S3_BUCKET          (none)                           Target bucket name
S3_ACCESS_KEY      (none)                           S3 access key ID
S3_SECRET_KEY      (none)                           S3 secret access key
NOMAD_TOKEN        (none)                           Nomad API token (snapshot permissions)
CONSUL_HTTP_TOKEN  (none)                           Consul API token (snapshot permissions)
PG_HOST            postgres-primary.service.consul  PostgreSQL host for dumps
PG_USER            postgres                         PostgreSQL user for dumps
PGPASSWORD         (none)                           PostgreSQL password for pg_dump/pg_dumpall

Trivy Scan Worker

Variable           Default                                  Description
TRIVY_SERVER_ADDR  http://trivy-server.service.consul:4954  Trivy server endpoint
TRIVY_DB_HOST      postgres-shared.service.consul           PostgreSQL host for scan results
TRIVY_DB_PORT      5432                                     PostgreSQL port
TRIVY_DB_USER      (none)                                   PostgreSQL username
TRIVY_DB_PASSWORD  (none)                                   PostgreSQL password
TRIVY_DB_NAME      trivy                                    Database name
DB_SSLMODE         verify-ca                                PostgreSQL SSL mode
DB_SSLROOTCERT     (none)                                   Path to CA certificate
NOMAD_TOKEN        (none)                                   Nomad API token (read allocations)

Cleanup Worker

Variable          Default                         Description
SSH_KEY_PATH      /root/.ssh/id_ed25519           SSH private key path
SSH_CERT_PATH     /root/.ssh/id_ed25519-cert.pub  SSH client certificate path
SSH_HOST_CA_PATH  /root/.ssh/ssh-host-ca.pub      SSH host CA public key path
NOMAD_TOKEN       (none)                          Nomad API token (read nodes/allocations)

Registry GC

Registry GC runs on the cleanup worker, so it inherits the SSH and NOMAD_TOKEN settings above (the Nomad token additionally needs job-scale permission). Its workflow config (job name, data dir, image, dry-run, delete-untagged) comes from the registry-gc-weekly schedule input; see the RegistryGCConfig table under Registry GC Saga.

Observability

Tracing

All workers initialize OpenTelemetry with OTLP gRPC export. The Temporal SDK tracing interceptor automatically creates spans for workflow execution and activity dispatch. Activities create explicit client spans with peer.service attributes for service graph edges:

  • backup-worker -> nomad, consul, postgres-primary, s3-orchestrator
  • trivy-scan-worker -> nomad, trivy-server, postgres (via otelsql)
  • cleanup-worker -> nomad

Metrics

Temporal SDK metrics are exposed via Prometheus on :9090/metrics. Key metrics include:

Metric prefix          Description
temporal_workflow_*    Workflow execution counts, latency, failures
temporal_activity_*    Activity execution counts, latency, retries
temporal_task_queue_*  Task queue depth and poll latency

Logging

JSON structured logs via log/slog to stdout. The Temporal SDK logger is wrapped via log.NewStructuredLogger so SDK-internal logs (task polling, activity dispatch, retries) share the same JSON format for Alloy/Loki collection.

Development

make build

make test

make vet

make lint

make govulncheck

make push-all

make push-backup
make push-trivy
make push-cleanup

Each domain also has its own Makefile in its subdirectory for independent builds:

cd backup && make help
cd trivyscan && make help
cd nodecleanup && make help

Versioning

A .version file holds the tag for the artifact built from its directory, and each is bumped independently:

File                  Tags                                          Bumped when
backup/.version       backup-worker image                           the backup worker changes
trivyscan/.version    trivy-scan-worker image                       the trivy worker changes
nodecleanup/.version  cleanup-worker image                          the cleanup worker changes
./.version (root)     git release tag + temporal-workers-web image  cutting a repo release

The push targets (make push-backup, make push-trivy, make push-cleanup, and make push-all) read each worker's own .version; the root version is never used for worker images. The worker tags drift apart on purpose: only what changed is rebuilt, and an image tag tells you exactly which worker code it holds.

Deployment

Each domain is deployed as a separate Nomad service job. Workflows are started on cron by Temporal Schedules (Terraform-managed in infrastructure/terragrunt), not by Nomad jobs.

Nomad Jobs

Job                Type     Image              Task Queue
backup-worker      service  backup-worker      backup-task-queue
trivy-scan-worker  service  trivy-scan-worker  trivy-task-queue
cleanup-worker     service  cleanup-worker     cleanup-task-queue

Manual Runs

Schedules can be triggered on demand (temporal schedule trigger --schedule-id backup-daily), or a one-off workflow can be started directly with the Temporal CLI:

temporal workflow start --task-queue backup-task-queue --type Backup \
  --address temporal-server.service.consul:7233 \
  --input '{"local_days":7,"s3_days":30,"dump_concurrency":4}'

temporal workflow start --task-queue trivy-task-queue --type Scan \
  --address temporal-server.service.consul:7233 \
  --input '{"concurrency":10}'

temporal workflow start --task-queue cleanup-task-queue --type Cleanup \
  --address temporal-server.service.consul:7233 \
  --input '{"data_dir":"/opt/nomad/data","grace_days":7,"dry_run":true,"docker_prune":false}'

temporal workflow start --task-queue cleanup-task-queue --type RegistryGC \
  --address temporal-server.service.consul:7233 \
  --input '{"job_name":"registry","registry_data_dir":"/mnt/gdrive/munchbox-data/registry","registry_image":"registry:3","dry_run":true,"delete_untagged":true}'

Project Structure

nomad-temporal-jobs/
  .github/workflows/
    ci.yml                           CI: lint, test, vet, govulncheck, version check
  .gitignore                         Ignores coverage output and dist artifacts
  .golangci.yml                      Linter configuration (gocritic, misspell)
  .version                           Root version tag
  LICENSE                            MIT
  Makefile                           Root: build, test, lint, push-all targets
  README.md                          This file
  go.mod                             Module definition
  go.sum                             Dependency lock file
  shared/
    telemetry.go                     OTel tracer init, span helpers, peer.service attributes
    logging.go                       JSON slog logger with Temporal SDK adapter
    metrics.go                       Prometheus metrics handler via Tally bridge
    nomad.go                         OTel-instrumented Nomad API client factory
  backup/
    .version                         Image version tag
    Dockerfile                       Multi-stage build (Debian, Nomad/Consul/PG18 CLI)
    Makefile                         Build and push targets
    activities/
      activities.go                  Activity struct: snapshots, S3 upload, retention cleanup
    workflows/
      backup.go                      Concurrent snapshot legs + per-database PostgreSQL fan-out
    worker/
      main.go                        Worker entry point (tracing, slog, metrics, Temporal client)
  trivyscan/
    .version                         Image version tag
    Dockerfile                       Multi-stage build (Alpine, Trivy CLI)
    Makefile                         Build and push targets
    activities/
      activities.go                  Activity struct: Nomad image discovery, Trivy scan, DB save
    workflows/
      scan.go                        Bounded-concurrency scanning with per-image error handling
    worker/
      main.go                        Worker entry point (tracing, slog, metrics, Temporal client)
  nodecleanup/
    .version                         Image version tag
    Dockerfile                       Multi-stage build (Alpine, SSH only)
    Makefile                         Build and push targets
    activities/
      activities.go                  Activity struct: node discovery, SSH cleanup, script gen
      registry_gc.go                 Saga activities: find node, scale, drain, GC, measure
    workflows/
      cleanup.go                     Sequential per-node cleanup with failure tracking
      registry_gc.go                 Saga orchestration with deferred scale-back compensation
    worker/
      main.go                        Worker entry point (registers Cleanup + RegistryGC workflows)

License

MIT