2026-04-20 23:48:12 +00:00
<!-- last - reviewed: 2c5fb6abc2b680cacf9a3c3e29dce9c3031fd535 -->
2026-04-16 07:54:06 +00:00
# nomad/ — Agent Instructions
Nomad + Vault HCL for the factory's single-node cluster. These files are
the source of truth that `lib/init/nomad/cluster-up.sh` copies onto a
factory box under `/etc/nomad.d/` and `/etc/vault.d/` at init time.
2026-04-18 09:55:21 +00:00
This directory covers the **Nomad+Vault migration (Steps 0– 5)** —
see issues #821 – #992 for the step breakdown.
2026-04-16 07:54:06 +00:00
## What lives here
2026-04-16 18:10:18 +00:00
| File/Dir | Deployed to | Owned by |
2026-04-16 07:54:06 +00:00
|---|---|---|
| `server.hcl` | `/etc/nomad.d/server.hcl` | agent role, bind, ports, `data_dir` (S0.2) |
2026-04-17 14:45:56 +00:00
| `client.hcl` | `/etc/nomad.d/client.hcl` | Docker driver cfg + `host_volume` declarations (S0.2); `allow_privileged = true` for woodpecker-agent Docker-in-Docker (S3-fix-5, #961 ) |
2026-04-16 07:54:06 +00:00
| `vault.hcl` | `/etc/vault.d/vault.hcl` | Vault storage, listener, UI, `disable_mlock` (S0.3) |
2026-04-16 18:10:18 +00:00
| `jobs/forgejo.hcl` | submitted via `lib/init/nomad/deploy.sh` | Forgejo job; reads creds from Vault via consul-template stanza (S2.4) |
2026-04-17 14:59:13 +00:00
| `jobs/woodpecker-server.hcl` | submitted via `lib/init/nomad/deploy.sh` | Woodpecker CI server; host networking, Vault KV for `WOODPECKER_AGENT_SECRET` + Forgejo OAuth creds (S3.1) |
2026-04-17 21:06:33 +00:00
| `jobs/woodpecker-agent.hcl` | submitted via `lib/init/nomad/deploy.sh` | Woodpecker CI agent; host networking, `docker.sock` mount, Vault KV for `WOODPECKER_AGENT_SECRET` ; `WOODPECKER_SERVER` uses `${attr.unique.network.ip-address}:9000` (Nomad interpolation) — port binds to LXC alloc IP, not localhost (S3.2, S3-fix-6, #964 ) |
| `jobs/agents.hcl` | submitted via `lib/init/nomad/deploy.sh` | All 7 agent roles (dev, review, gardener, planner, predictor, supervisor, architect) + llama variant; Vault-templated bot tokens via `service-agents` policy; `force_pull = false` — image is built locally by `bin/disinto --with agents` , no registry (S4.1, S4-fix-2, S4-fix-5, #955 , #972 , #978 ) |
2026-04-18 16:20:53 +00:00
| `jobs/staging.hcl` | submitted via `lib/init/nomad/deploy.sh` | Caddy file-server mounting `docker/` as `/srv/site:ro` ; no Vault integration; **dynamic host port** (no static 80 — edge owns 80/443, collision fixed in S5-fix-7 #1018 ); edge discovers via Nomad service registration (S5.2, #989 ) |
2026-04-20 23:48:12 +00:00
| `jobs/chat.hcl` | submitted via `lib/init/nomad/deploy.sh` | Claude chat UI; custom `disinto/chat:local` image; sandbox hardening (cap_drop ALL, **tmpfs via mount block** not `tmpfs=` arg — S5-fix-5 #1012 , pids_limit 128); Vault-templated OAuth secrets via `service-chat` policy (S5.2, #989 ); rate limiting removed (#1084 ); **workspace volume** `chat-workspace` host_volume bind-mounted to `/var/workspace` for Claude project access (#1027 ) — operator must register `host_volume "chat-workspace"` in `client.hcl` on each node |
2026-04-20 17:19:23 +00:00
| `jobs/edge.hcl` | submitted via `lib/init/nomad/deploy.sh` | Caddy reverse proxy + dispatcher sidecar; routes /forge, /woodpecker, /staging, /chat; uses `disinto/edge:local` image built by `bin/disinto --with edge` ; **both Caddy and dispatcher tasks use `network_mode = "host"`** — upstreams are `127.0.0.1:<port>` (forgejo :3000, woodpecker :8000, chat :8080), not Docker hostnames (#1031 , #1034 ); `FORGE_URL` rendered via Nomad service discovery template (not static env) to handle bridge vs. host network differences (#1034 ); dispatcher Vault secret path changed to `kv/data/disinto/shared/ops-repo` (#1041 ); Vault-templated ops-repo creds via `service-dispatcher` policy (S5.1, #988 ); `/staging/*` strips `/staging` prefix before proxying (#1079 ); WebSocket endpoint `/chat/ws` added for streaming (#1026 ) |
2026-04-16 07:54:06 +00:00
Nomad auto-merges every `*.hcl` under `-config=/etc/nomad.d/` , so the
split between `server.hcl` and `client.hcl` is for readability, not
semantics. The top-of-file header in each config documents which blocks
it owns.
2026-04-16 18:10:18 +00:00
## Vault ACL policies
`vault/policies/` holds one `.hcl` file per Vault policy; see
[`vault/policies/AGENTS.md` ](../vault/policies/AGENTS.md ) for the naming
convention, KV path summary, and JWT-auth role bindings (S2.1/S2.3).
## Not yet implemented
- **TLS, ACLs, gossip encryption** — deliberately absent for now; land
alongside multi-node support.
2026-04-16 07:54:06 +00:00
## Adding a jobspec (Step 1 and later)
2026-04-16 12:39:09 +00:00
1. Drop a file in `nomad/jobs/<service>.hcl` . The `.hcl` suffix is
load-bearing: `.woodpecker/nomad-validate.yml` globs on exactly that
suffix to auto-pick up new jobspecs (see step 2 in "How CI validates
these files" below). Anything else in `nomad/jobs/` is silently
skipped by CI.
2026-04-16 07:54:06 +00:00
2. If it needs persistent state, reference a `host_volume` already
declared in `client.hcl` — *don't* add ad-hoc host paths in the
jobspec. If a new volume is needed, add it to **both** :
- `nomad/client.hcl` — the `host_volume "<name>" { path = … }` block
- `lib/init/nomad/cluster-up.sh` — the `HOST_VOLUME_DIRS` array
The two must stay in sync or nomad fingerprinting will fail and the
fix: [nomad-step-1] S1.4 — extend Woodpecker CI to nomad job validate nomad/jobs/*.hcl (#843)
Step 2 of .woodpecker/nomad-validate.yml previously ran
`nomad job validate` against a single explicit path
(nomad/jobs/forgejo.nomad.hcl, wired up during the S1.1 review). Replace
that with a POSIX-sh loop over nomad/jobs/*.nomad.hcl so every jobspec
gets CI coverage automatically — no "edit the pipeline" step to forget
when the next jobspec (woodpecker, caddy, agents, …) lands.
Why reverse S1.1's explicit-line approach: the "no-ad-hoc-steps"
principle that drove the explicit list was about keeping step *classes*
enumerated, not about re-listing every file of the same class. Globbing
over `*.nomad.hcl` still encodes a single class ("jobspec validation")
and is strictly stricter — a dropped jobspec can't silently bypass CI
because someone forgot to add its line. The `.nomad.hcl` suffix (set as
convention by S1.1 review) is what keeps non-jobspec HCL out of this
loop.
Implementation notes:
- `[ -f "$f" ] || continue` guards the no-match case. POSIX sh has no
nullglob, so an empty jobs/ dir would otherwise leave the literal
glob in $f and fail nomad job validate with "no such file". Not
reachable today (forgejo.nomad.hcl exists), but keeps the step safe
against any transient empty state during future refactors.
- `set -e` inside the block ensures the first failing jobspec aborts
(default Woodpecker behavior, but explicit is cheap).
- Loop echoes the file being validated so CI logs point at the
specific jobspec on failure.
Docs (nomad/AGENTS.md):
- "How CI validates these files" now lists all *five* steps (the S1.1
review added step 2 but didn't update the doc; fixed in passing).
- Step 2 is documented with explicit scope: what offline validate
catches (unknown stanzas, missing required fields, wrong value
types, bad driver config) and what it does NOT catch (cross-file
host_volume name resolution against client.hcl — that's a
scheduling-time check; image reachability).
- "Adding a jobspec" step 4 updated: no pipeline edit required as
long as the file follows the `*.nomad.hcl` naming convention. The
suffix is now documented as load-bearing in step 1.
- Step 2 of the "Adding a jobspec" checklist cross-links the
host_volume scheduling-time check, so contributors know the
paired-write rule (client.hcl + cluster-up.sh) is the real
guardrail for that class of drift.
Acceptance criteria:
- Broken jobspec (typo in stanza, missing required field) fails step
2 with nomad's error message — covered by the loop over every file.
- Fixed jobspec passes — standard validate behavior.
- Step 1 (nomad config validate) untouched.
- No .sh changes, so no shellcheck impact; manual shellcheck pass
shown clean.
- Trigger path `nomad/**` already covers `nomad/jobs/**` (confirmed,
no change needed to `when:` block).
Refs: #843 (S1.4), #825 (S0.5 base pipeline), #840 (S1.1 first jobspec)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 10:32:08 +00:00
node stays in "initializing". Note that offline `nomad job validate`
will NOT catch a typo in the jobspec's `source = "..."` against the
client.hcl host_volume list (see step 2 below) — the scheduler
rejects the mismatch at placement time instead.
2026-04-16 07:54:06 +00:00
3. Pin image tags — `image = "forgejo/forgejo:1.22.5"` , not `:latest` .
fix: [nomad-step-1] S1.4 — extend Woodpecker CI to nomad job validate nomad/jobs/*.hcl (#843)
Step 2 of .woodpecker/nomad-validate.yml previously ran
`nomad job validate` against a single explicit path
(nomad/jobs/forgejo.nomad.hcl, wired up during the S1.1 review). Replace
that with a POSIX-sh loop over nomad/jobs/*.nomad.hcl so every jobspec
gets CI coverage automatically — no "edit the pipeline" step to forget
when the next jobspec (woodpecker, caddy, agents, …) lands.
Why reverse S1.1's explicit-line approach: the "no-ad-hoc-steps"
principle that drove the explicit list was about keeping step *classes*
enumerated, not about re-listing every file of the same class. Globbing
over `*.nomad.hcl` still encodes a single class ("jobspec validation")
and is strictly stricter — a dropped jobspec can't silently bypass CI
because someone forgot to add its line. The `.nomad.hcl` suffix (set as
convention by S1.1 review) is what keeps non-jobspec HCL out of this
loop.
Implementation notes:
- `[ -f "$f" ] || continue` guards the no-match case. POSIX sh has no
nullglob, so an empty jobs/ dir would otherwise leave the literal
glob in $f and fail nomad job validate with "no such file". Not
reachable today (forgejo.nomad.hcl exists), but keeps the step safe
against any transient empty state during future refactors.
- `set -e` inside the block ensures the first failing jobspec aborts
(default Woodpecker behavior, but explicit is cheap).
- Loop echoes the file being validated so CI logs point at the
specific jobspec on failure.
Docs (nomad/AGENTS.md):
- "How CI validates these files" now lists all *five* steps (the S1.1
review added step 2 but didn't update the doc; fixed in passing).
- Step 2 is documented with explicit scope: what offline validate
catches (unknown stanzas, missing required fields, wrong value
types, bad driver config) and what it does NOT catch (cross-file
host_volume name resolution against client.hcl — that's a
scheduling-time check; image reachability).
- "Adding a jobspec" step 4 updated: no pipeline edit required as
long as the file follows the `*.nomad.hcl` naming convention. The
suffix is now documented as load-bearing in step 1.
- Step 2 of the "Adding a jobspec" checklist cross-links the
host_volume scheduling-time check, so contributors know the
paired-write rule (client.hcl + cluster-up.sh) is the real
guardrail for that class of drift.
Acceptance criteria:
- Broken jobspec (typo in stanza, missing required field) fails step
2 with nomad's error message — covered by the loop over every file.
- Fixed jobspec passes — standard validate behavior.
- Step 1 (nomad config validate) untouched.
- No .sh changes, so no shellcheck impact; manual shellcheck pass
shown clean.
- Trigger path `nomad/**` already covers `nomad/jobs/**` (confirmed,
no change needed to `when:` block).
Refs: #843 (S1.4), #825 (S0.5 base pipeline), #840 (S1.1 first jobspec)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 10:32:08 +00:00
4. No pipeline edit required — step 2 of `nomad-validate.yml` globs
2026-04-16 12:39:09 +00:00
over `nomad/jobs/*.hcl` and validates every match. Just make sure
the existing `nomad/**` trigger path still covers your file (it
does for anything under `nomad/jobs/` ).
2026-04-16 07:54:06 +00:00
## How CI validates these files
fix: [nomad-step-1] S1.4 — extend Woodpecker CI to nomad job validate nomad/jobs/*.hcl (#843)
Step 2 of .woodpecker/nomad-validate.yml previously ran
`nomad job validate` against a single explicit path
(nomad/jobs/forgejo.nomad.hcl, wired up during the S1.1 review). Replace
that with a POSIX-sh loop over nomad/jobs/*.nomad.hcl so every jobspec
gets CI coverage automatically — no "edit the pipeline" step to forget
when the next jobspec (woodpecker, caddy, agents, …) lands.
Why reverse S1.1's explicit-line approach: the "no-ad-hoc-steps"
principle that drove the explicit list was about keeping step *classes*
enumerated, not about re-listing every file of the same class. Globbing
over `*.nomad.hcl` still encodes a single class ("jobspec validation")
and is strictly stricter — a dropped jobspec can't silently bypass CI
because someone forgot to add its line. The `.nomad.hcl` suffix (set as
convention by S1.1 review) is what keeps non-jobspec HCL out of this
loop.
Implementation notes:
- `[ -f "$f" ] || continue` guards the no-match case. POSIX sh has no
nullglob, so an empty jobs/ dir would otherwise leave the literal
glob in $f and fail nomad job validate with "no such file". Not
reachable today (forgejo.nomad.hcl exists), but keeps the step safe
against any transient empty state during future refactors.
- `set -e` inside the block ensures the first failing jobspec aborts
(default Woodpecker behavior, but explicit is cheap).
- Loop echoes the file being validated so CI logs point at the
specific jobspec on failure.
Docs (nomad/AGENTS.md):
- "How CI validates these files" now lists all *five* steps (the S1.1
review added step 2 but didn't update the doc; fixed in passing).
- Step 2 is documented with explicit scope: what offline validate
catches (unknown stanzas, missing required fields, wrong value
types, bad driver config) and what it does NOT catch (cross-file
host_volume name resolution against client.hcl — that's a
scheduling-time check; image reachability).
- "Adding a jobspec" step 4 updated: no pipeline edit required as
long as the file follows the `*.nomad.hcl` naming convention. The
suffix is now documented as load-bearing in step 1.
- Step 2 of the "Adding a jobspec" checklist cross-links the
host_volume scheduling-time check, so contributors know the
paired-write rule (client.hcl + cluster-up.sh) is the real
guardrail for that class of drift.
Acceptance criteria:
- Broken jobspec (typo in stanza, missing required field) fails step
2 with nomad's error message — covered by the loop over every file.
- Fixed jobspec passes — standard validate behavior.
- Step 1 (nomad config validate) untouched.
- No .sh changes, so no shellcheck impact; manual shellcheck pass
shown clean.
- Trigger path `nomad/**` already covers `nomad/jobs/**` (confirmed,
no change needed to `when:` block).
Refs: #843 (S1.4), #825 (S0.5 base pipeline), #840 (S1.1 first jobspec)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 10:32:08 +00:00
`.woodpecker/nomad-validate.yml` runs on every PR that touches `nomad/`
2026-04-16 18:15:03 +00:00
(including `nomad/jobs/` ), `lib/init/nomad/` , `bin/disinto` ,
`vault/policies/` , or `vault/roles.yaml` . Eight fail-closed steps:
2026-04-16 07:54:06 +00:00
1. ** `nomad config validate nomad/server.hcl nomad/client.hcl` **
— parses the HCL, fails on unknown blocks, bad port ranges, invalid
fix: [nomad-step-1] S1.4 — extend Woodpecker CI to nomad job validate nomad/jobs/*.hcl (#843)
Step 2 of .woodpecker/nomad-validate.yml previously ran
`nomad job validate` against a single explicit path
(nomad/jobs/forgejo.nomad.hcl, wired up during the S1.1 review). Replace
that with a POSIX-sh loop over nomad/jobs/*.nomad.hcl so every jobspec
gets CI coverage automatically — no "edit the pipeline" step to forget
when the next jobspec (woodpecker, caddy, agents, …) lands.
Why reverse S1.1's explicit-line approach: the "no-ad-hoc-steps"
principle that drove the explicit list was about keeping step *classes*
enumerated, not about re-listing every file of the same class. Globbing
over `*.nomad.hcl` still encodes a single class ("jobspec validation")
and is strictly stricter — a dropped jobspec can't silently bypass CI
because someone forgot to add its line. The `.nomad.hcl` suffix (set as
convention by S1.1 review) is what keeps non-jobspec HCL out of this
loop.
Implementation notes:
- `[ -f "$f" ] || continue` guards the no-match case. POSIX sh has no
nullglob, so an empty jobs/ dir would otherwise leave the literal
glob in $f and fail nomad job validate with "no such file". Not
reachable today (forgejo.nomad.hcl exists), but keeps the step safe
against any transient empty state during future refactors.
- `set -e` inside the block ensures the first failing jobspec aborts
(default Woodpecker behavior, but explicit is cheap).
- Loop echoes the file being validated so CI logs point at the
specific jobspec on failure.
Docs (nomad/AGENTS.md):
- "How CI validates these files" now lists all *five* steps (the S1.1
review added step 2 but didn't update the doc; fixed in passing).
- Step 2 is documented with explicit scope: what offline validate
catches (unknown stanzas, missing required fields, wrong value
types, bad driver config) and what it does NOT catch (cross-file
host_volume name resolution against client.hcl — that's a
scheduling-time check; image reachability).
- "Adding a jobspec" step 4 updated: no pipeline edit required as
long as the file follows the `*.nomad.hcl` naming convention. The
suffix is now documented as load-bearing in step 1.
- Step 2 of the "Adding a jobspec" checklist cross-links the
host_volume scheduling-time check, so contributors know the
paired-write rule (client.hcl + cluster-up.sh) is the real
guardrail for that class of drift.
Acceptance criteria:
- Broken jobspec (typo in stanza, missing required field) fails step
2 with nomad's error message — covered by the loop over every file.
- Fixed jobspec passes — standard validate behavior.
- Step 1 (nomad config validate) untouched.
- No .sh changes, so no shellcheck impact; manual shellcheck pass
shown clean.
- Trigger path `nomad/**` already covers `nomad/jobs/**` (confirmed,
no change needed to `when:` block).
Refs: #843 (S1.4), #825 (S0.5 base pipeline), #840 (S1.1 first jobspec)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 10:32:08 +00:00
driver config. Vault HCL is excluded (different tool). Jobspecs are
excluded too — agent-config and jobspec are disjoint HCL grammars;
running this step on a jobspec rejects it with "unknown block 'job'".
2026-04-16 12:39:09 +00:00
2. ** `nomad job validate nomad/jobs/*.hcl` ** (loop, one call per file)
fix: [nomad-step-1] S1.4 — extend Woodpecker CI to nomad job validate nomad/jobs/*.hcl (#843)
Step 2 of .woodpecker/nomad-validate.yml previously ran
`nomad job validate` against a single explicit path
(nomad/jobs/forgejo.nomad.hcl, wired up during the S1.1 review). Replace
that with a POSIX-sh loop over nomad/jobs/*.nomad.hcl so every jobspec
gets CI coverage automatically — no "edit the pipeline" step to forget
when the next jobspec (woodpecker, caddy, agents, …) lands.
Why reverse S1.1's explicit-line approach: the "no-ad-hoc-steps"
principle that drove the explicit list was about keeping step *classes*
enumerated, not about re-listing every file of the same class. Globbing
over `*.nomad.hcl` still encodes a single class ("jobspec validation")
and is strictly stricter — a dropped jobspec can't silently bypass CI
because someone forgot to add its line. The `.nomad.hcl` suffix (set as
convention by S1.1 review) is what keeps non-jobspec HCL out of this
loop.
Implementation notes:
- `[ -f "$f" ] || continue` guards the no-match case. POSIX sh has no
nullglob, so an empty jobs/ dir would otherwise leave the literal
glob in $f and fail nomad job validate with "no such file". Not
reachable today (forgejo.nomad.hcl exists), but keeps the step safe
against any transient empty state during future refactors.
- `set -e` inside the block ensures the first failing jobspec aborts
(default Woodpecker behavior, but explicit is cheap).
- Loop echoes the file being validated so CI logs point at the
specific jobspec on failure.
Docs (nomad/AGENTS.md):
- "How CI validates these files" now lists all *five* steps (the S1.1
review added step 2 but didn't update the doc; fixed in passing).
- Step 2 is documented with explicit scope: what offline validate
catches (unknown stanzas, missing required fields, wrong value
types, bad driver config) and what it does NOT catch (cross-file
host_volume name resolution against client.hcl — that's a
scheduling-time check; image reachability).
- "Adding a jobspec" step 4 updated: no pipeline edit required as
long as the file follows the `*.nomad.hcl` naming convention. The
suffix is now documented as load-bearing in step 1.
- Step 2 of the "Adding a jobspec" checklist cross-links the
host_volume scheduling-time check, so contributors know the
paired-write rule (client.hcl + cluster-up.sh) is the real
guardrail for that class of drift.
Acceptance criteria:
- Broken jobspec (typo in stanza, missing required field) fails step
2 with nomad's error message — covered by the loop over every file.
- Fixed jobspec passes — standard validate behavior.
- Step 1 (nomad config validate) untouched.
- No .sh changes, so no shellcheck impact; manual shellcheck pass
shown clean.
- Trigger path `nomad/**` already covers `nomad/jobs/**` (confirmed,
no change needed to `when:` block).
Refs: #843 (S1.4), #825 (S0.5 base pipeline), #840 (S1.1 first jobspec)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 10:32:08 +00:00
— parses each jobspec's HCL, fails on unknown stanzas, missing
required fields, wrong value types, invalid driver config. Runs
offline (no Nomad server needed) so CI exit 0 ≠ "this will schedule
successfully"; it means "the HCL itself is well-formed". What this
step does NOT catch:
- cross-file references (`source = "forgejo-data"` typo against the
`host_volume` list in `client.hcl` ) — that's a scheduling-time
check on the live cluster, not validate-time.
- image reachability — `image = "codeberg.org/forgejo/forgejo:11.0"`
is accepted even if the registry is down or the tag is wrong.
New jobspecs are picked up automatically by the glob — no pipeline
2026-04-16 12:39:09 +00:00
edit needed as long as the file is named `<name>.hcl` .
fix: [nomad-step-1] S1.4 — extend Woodpecker CI to nomad job validate nomad/jobs/*.hcl (#843)
Step 2 of .woodpecker/nomad-validate.yml previously ran
`nomad job validate` against a single explicit path
(nomad/jobs/forgejo.nomad.hcl, wired up during the S1.1 review). Replace
that with a POSIX-sh loop over nomad/jobs/*.nomad.hcl so every jobspec
gets CI coverage automatically — no "edit the pipeline" step to forget
when the next jobspec (woodpecker, caddy, agents, …) lands.
Why reverse S1.1's explicit-line approach: the "no-ad-hoc-steps"
principle that drove the explicit list was about keeping step *classes*
enumerated, not about re-listing every file of the same class. Globbing
over `*.nomad.hcl` still encodes a single class ("jobspec validation")
and is strictly stricter — a dropped jobspec can't silently bypass CI
because someone forgot to add its line. The `.nomad.hcl` suffix (set as
convention by S1.1 review) is what keeps non-jobspec HCL out of this
loop.
Implementation notes:
- `[ -f "$f" ] || continue` guards the no-match case. POSIX sh has no
nullglob, so an empty jobs/ dir would otherwise leave the literal
glob in $f and fail nomad job validate with "no such file". Not
reachable today (forgejo.nomad.hcl exists), but keeps the step safe
against any transient empty state during future refactors.
- `set -e` inside the block ensures the first failing jobspec aborts
(default Woodpecker behavior, but explicit is cheap).
- Loop echoes the file being validated so CI logs point at the
specific jobspec on failure.
Docs (nomad/AGENTS.md):
- "How CI validates these files" now lists all *five* steps (the S1.1
review added step 2 but didn't update the doc; fixed in passing).
- Step 2 is documented with explicit scope: what offline validate
catches (unknown stanzas, missing required fields, wrong value
types, bad driver config) and what it does NOT catch (cross-file
host_volume name resolution against client.hcl — that's a
scheduling-time check; image reachability).
- "Adding a jobspec" step 4 updated: no pipeline edit required as
long as the file follows the `*.nomad.hcl` naming convention. The
suffix is now documented as load-bearing in step 1.
- Step 2 of the "Adding a jobspec" checklist cross-links the
host_volume scheduling-time check, so contributors know the
paired-write rule (client.hcl + cluster-up.sh) is the real
guardrail for that class of drift.
Acceptance criteria:
- Broken jobspec (typo in stanza, missing required field) fails step
2 with nomad's error message — covered by the loop over every file.
- Fixed jobspec passes — standard validate behavior.
- Step 1 (nomad config validate) untouched.
- No .sh changes, so no shellcheck impact; manual shellcheck pass
shown clean.
- Trigger path `nomad/**` already covers `nomad/jobs/**` (confirmed,
no change needed to `when:` block).
Refs: #843 (S1.4), #825 (S0.5 base pipeline), #840 (S1.1 first jobspec)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 10:32:08 +00:00
3. ** `vault operator diagnose -config=nomad/vault.hcl -skip=storage -skip=listener` **
2026-04-16 07:54:06 +00:00
— Vault's equivalent syntax + schema check. `-skip=storage/listener`
disables the runtime checks (CI containers don't have
fix: [nomad-step-1] S1.4 — extend Woodpecker CI to nomad job validate nomad/jobs/*.hcl (#843)
Step 2 of .woodpecker/nomad-validate.yml previously ran
`nomad job validate` against a single explicit path
(nomad/jobs/forgejo.nomad.hcl, wired up during the S1.1 review). Replace
that with a POSIX-sh loop over nomad/jobs/*.nomad.hcl so every jobspec
gets CI coverage automatically — no "edit the pipeline" step to forget
when the next jobspec (woodpecker, caddy, agents, …) lands.
Why reverse S1.1's explicit-line approach: the "no-ad-hoc-steps"
principle that drove the explicit list was about keeping step *classes*
enumerated, not about re-listing every file of the same class. Globbing
over `*.nomad.hcl` still encodes a single class ("jobspec validation")
and is strictly stricter — a dropped jobspec can't silently bypass CI
because someone forgot to add its line. The `.nomad.hcl` suffix (set as
convention by S1.1 review) is what keeps non-jobspec HCL out of this
loop.
Implementation notes:
- `[ -f "$f" ] || continue` guards the no-match case. POSIX sh has no
nullglob, so an empty jobs/ dir would otherwise leave the literal
glob in $f and fail nomad job validate with "no such file". Not
reachable today (forgejo.nomad.hcl exists), but keeps the step safe
against any transient empty state during future refactors.
- `set -e` inside the block ensures the first failing jobspec aborts
(default Woodpecker behavior, but explicit is cheap).
- Loop echoes the file being validated so CI logs point at the
specific jobspec on failure.
Docs (nomad/AGENTS.md):
- "How CI validates these files" now lists all *five* steps (the S1.1
review added step 2 but didn't update the doc; fixed in passing).
- Step 2 is documented with explicit scope: what offline validate
catches (unknown stanzas, missing required fields, wrong value
types, bad driver config) and what it does NOT catch (cross-file
host_volume name resolution against client.hcl — that's a
scheduling-time check; image reachability).
- "Adding a jobspec" step 4 updated: no pipeline edit required as
long as the file follows the `*.nomad.hcl` naming convention. The
suffix is now documented as load-bearing in step 1.
- Step 2 of the "Adding a jobspec" checklist cross-links the
host_volume scheduling-time check, so contributors know the
paired-write rule (client.hcl + cluster-up.sh) is the real
guardrail for that class of drift.
Acceptance criteria:
- Broken jobspec (typo in stanza, missing required field) fails step
2 with nomad's error message — covered by the loop over every file.
- Fixed jobspec passes — standard validate behavior.
- Step 1 (nomad config validate) untouched.
- No .sh changes, so no shellcheck impact; manual shellcheck pass
shown clean.
- Trigger path `nomad/**` already covers `nomad/jobs/**` (confirmed,
no change needed to `when:` block).
Refs: #843 (S1.4), #825 (S0.5 base pipeline), #840 (S1.1 first jobspec)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 10:32:08 +00:00
`/var/lib/vault/data` or port 8200). Exit 2 (advisory warnings only,
e.g. TLS-disabled listener) is tolerated; exit 1 blocks merge.
2026-04-16 18:15:03 +00:00
4. ** `vault policy fmt` idempotence check on every `vault/policies/*.hcl` **
(S2.6) — `vault policy fmt` has no `-check` flag in 1.18.5, so the
step copies each file to `/tmp` , runs `vault policy fmt` on the copy,
and diffs against the original. Any non-empty diff means the
committed file would be rewritten by `fmt` and the step fails — the
author is pointed at `vault policy fmt <file>` to heal the drift.
5. ** `vault policy write` -based validation against an inline dev-mode Vault**
(S2.6) — Vault 1.18.5 has no offline `policy validate` subcommand;
the CI step starts a dev-mode server, loops `vault policy write
< basename > < file > ` over each ` vault/policies/*.hcl`, and aggregates
failures so one CI run surfaces every broken policy. The server is
ephemeral and torn down on step exit — no persistence, no real
secrets. Catches unknown capability names (e.g. `"frobnicate"` ),
malformed `path` blocks, and other semantic errors `fmt` does not.
6. ** `vault/roles.yaml` validator** (S2.6) — yamllint + a PyYAML-based
check that every role's `policy:` field matches a basename under
`vault/policies/` , and that every role entry carries all four
required fields (`name` , `policy` , `namespace` , `job_id` ). Drift
between the two directories is a scheduling-time "permission denied"
in production; this step turns it into a CI failure at PR time.
7. ** `shellcheck --severity=warning lib/init/nomad/*.sh bin/disinto` **
2026-04-16 07:54:06 +00:00
— all init/dispatcher shell clean. `bin/disinto` has no `.sh`
extension so the repo-wide shellcheck in `.woodpecker/ci.yml` skips
it — this is the one place it gets checked.
2026-04-16 18:15:03 +00:00
8. ** `bats tests/disinto-init-nomad.bats` **
2026-04-16 07:54:06 +00:00
— exercises the dispatcher: `disinto init --backend=nomad --dry-run` ,
`… --empty --dry-run` , and the `--backend=docker` regression guard.
2026-04-16 18:15:03 +00:00
**Secret-scan coverage.** Policy HCL files under `vault/policies/` are
already swept by the P11 secret-scan gate
(`.woodpecker/secret-scan.yml` , #798 ), whose `vault/**/*` trigger path
covers everything in this directory. `nomad-validate.yml` intentionally
does NOT duplicate that gate — one scanner, one source of truth.
2026-04-16 07:54:06 +00:00
If a PR breaks `nomad/server.hcl` (e.g. typo in a block name), step 1
fix: [nomad-step-1] S1.4 — extend Woodpecker CI to nomad job validate nomad/jobs/*.hcl (#843)
Step 2 of .woodpecker/nomad-validate.yml previously ran
`nomad job validate` against a single explicit path
(nomad/jobs/forgejo.nomad.hcl, wired up during the S1.1 review). Replace
that with a POSIX-sh loop over nomad/jobs/*.nomad.hcl so every jobspec
gets CI coverage automatically — no "edit the pipeline" step to forget
when the next jobspec (woodpecker, caddy, agents, …) lands.
Why reverse S1.1's explicit-line approach: the "no-ad-hoc-steps"
principle that drove the explicit list was about keeping step *classes*
enumerated, not about re-listing every file of the same class. Globbing
over `*.nomad.hcl` still encodes a single class ("jobspec validation")
and is strictly stricter — a dropped jobspec can't silently bypass CI
because someone forgot to add its line. The `.nomad.hcl` suffix (set as
convention by S1.1 review) is what keeps non-jobspec HCL out of this
loop.
Implementation notes:
- `[ -f "$f" ] || continue` guards the no-match case. POSIX sh has no
nullglob, so an empty jobs/ dir would otherwise leave the literal
glob in $f and fail nomad job validate with "no such file". Not
reachable today (forgejo.nomad.hcl exists), but keeps the step safe
against any transient empty state during future refactors.
- `set -e` inside the block ensures the first failing jobspec aborts
(default Woodpecker behavior, but explicit is cheap).
- Loop echoes the file being validated so CI logs point at the
specific jobspec on failure.
Docs (nomad/AGENTS.md):
- "How CI validates these files" now lists all *five* steps (the S1.1
review added step 2 but didn't update the doc; fixed in passing).
- Step 2 is documented with explicit scope: what offline validate
catches (unknown stanzas, missing required fields, wrong value
types, bad driver config) and what it does NOT catch (cross-file
host_volume name resolution against client.hcl — that's a
scheduling-time check; image reachability).
- "Adding a jobspec" step 4 updated: no pipeline edit required as
long as the file follows the `*.nomad.hcl` naming convention. The
suffix is now documented as load-bearing in step 1.
- Step 2 of the "Adding a jobspec" checklist cross-links the
host_volume scheduling-time check, so contributors know the
paired-write rule (client.hcl + cluster-up.sh) is the real
guardrail for that class of drift.
Acceptance criteria:
- Broken jobspec (typo in stanza, missing required field) fails step
2 with nomad's error message — covered by the loop over every file.
- Fixed jobspec passes — standard validate behavior.
- Step 1 (nomad config validate) untouched.
- No .sh changes, so no shellcheck impact; manual shellcheck pass
shown clean.
- Trigger path `nomad/**` already covers `nomad/jobs/**` (confirmed,
no change needed to `when:` block).
Refs: #843 (S1.4), #825 (S0.5 base pipeline), #840 (S1.1 first jobspec)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 10:32:08 +00:00
fails with a clear error; if it breaks a jobspec (e.g. misspells
`task` as `tsak` , or adds a `volume` stanza without a `source` ), step
2026-04-16 18:15:03 +00:00
2 fails; a typo in a `path "..."` block in a vault policy fails step 5
with the Vault parser's error; a `roles.yaml` entry that points at a
policy basename that does not exist fails step 6. PRs that don't touch
any of the trigger paths skip this pipeline entirely.
2026-04-16 07:54:06 +00:00
## Version pinning
Nomad + Vault versions are pinned in **two** places — bumping one
without the other is a CI-caught drift:
- `lib/init/nomad/install.sh` — the apt-installed versions on factory
boxes (`NOMAD_VERSION` , `VAULT_VERSION` ).
- `.woodpecker/nomad-validate.yml` — the `hashicorp/nomad:…` and
`hashicorp/vault:…` image tags used for static validation.
Bump both in the same PR. The CI pipeline will fail if the pinned
image's `config validate` rejects syntax the installed runtime would
accept (or vice versa).
## Related
- `lib/init/nomad/` — installer + systemd units + cluster-up orchestrator.
- `.woodpecker/nomad-validate.yml` — this directory's CI pipeline.
2026-04-16 18:15:03 +00:00
- `vault/policies/` — Vault ACL policy HCL files (S2.1); the
`vault-policy-fmt` / `vault-policy-validate` CI steps above enforce
their shape. See [`../vault/policies/AGENTS.md` ](../vault/policies/AGENTS.md )
for the policy lifecycle, CI enforcement details, and common failure
modes.
- `vault/roles.yaml` — JWT-auth role → policy bindings (S2.3); the
`vault-roles-validate` CI step above keeps it in lockstep with the
policies directory.
2026-04-16 07:54:06 +00:00
- Top-of-file headers in `server.hcl` / `client.hcl` / `vault.hcl`
document the per-file ownership contract.