nomad-temporal-jobs

Temporal workflow workers for infrastructure automation

Three independent Temporal workers handle backup orchestration, container vulnerability scanning, and orphaned data cleanup across Nomad client nodes. The cleanup worker also reclaims Docker registry storage through a saga-style garbage collection run.

  • Automated nightly backups of Nomad, Consul, and PostgreSQL with S3 offsite replication and configurable retention
  • Vulnerability scanning of all running container images with parallel batched Trivy scans and CVE persistence
  • Orphaned data directory cleanup across Nomad nodes with dry-run safety and grace period filtering
  • Docker registry garbage collection that scales the registry offline, runs GC, and always scales it back via saga compensation

Key Features

Automated Backups

Nomad Raft, Consul Raft, and PostgreSQL snapshots with S3 upload and retention cleanup.

Runs as a Nomad periodic job. Snapshots are stored locally on NFS and then uploaded to S3. S3 uploads are non-fatal: the local backup still succeeds if S3 is unreachable. Retention is configurable and defaults to 7 days locally and 30 days in S3.

Vulnerability Scanning

Discover images from Nomad, batch parallel scans via Trivy, persist CVE results to PostgreSQL.

Automatically discovers all running Docker images from Nomad allocations. Scans in parallel batches of 10 using a Trivy server. Classifies errors as transient (retryable) or permanent (skipped). Results stored in PostgreSQL with per-vulnerability detail.

Node Cleanup

SSH to each Nomad client node, identify orphaned directories, and remove stale data safely.

Connects to every Nomad client node via SSH. Enumerates job data directories, cross-references against running jobs, and removes orphaned entries older than the grace period. Optional Docker image pruning. Dry-run mode enabled by default for safe preview.

Registry Garbage Collection

Reclaim Docker registry storage with a saga that never leaves the registry offline.

Scales the registry Nomad job to 0, waits for allocations to drain, runs garbage-collect over SSH, then scales back to 1. The scale-back is a deferred compensation on a disconnected context, so it fires even if GC fails or the workflow is cancelled. Reports blobs deleted and bytes reclaimed.

OpenTelemetry Tracing

Every activity traced end-to-end with Tempo export and service graph edges.

All workers initialize an OTLP gRPC exporter to Tempo. Client spans with peer.service attributes produce service graph edges in Grafana for every external call: Nomad, Consul, PostgreSQL, S3, and Trivy.

Prometheus Metrics

Temporal SDK metrics via Tally-Prometheus bridge exposed on :9090/metrics.

Exposes workflow and activity latency, task queue depth, retry counts, and failure rates. Each worker runs its own metrics HTTP handler. Scraped by Prometheus for dashboard and alerting integration.

Structured Logging

JSON slog output with service identity for Alloy/Loki collection.

Uses Go's log/slog with JSON handler writing to stdout. A custom adapter bridges Temporal SDK logging into slog. Service name injected as a default attribute for log correlation in Loki.