From 6e73c6dd1f86e576f5ae56071a64ff81a32595ab Mon Sep 17 00:00:00 2001 From: Claude Date: Thu, 16 Apr 2026 18:15:03 +0000 Subject: [PATCH 01/93] =?UTF-8?q?fix:=20[nomad-step-2]=20S2.6=20=E2=80=94?= =?UTF-8?q?=20CI:=20vault=20policy=20fmt=20+=20validate=20+=20roles.yaml?= =?UTF-8?q?=20check=20(#884)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Extend .woodpecker/nomad-validate.yml with three new fail-closed steps that guard every artifact under vault/policies/ and vault/roles.yaml before it can land: 4. vault-policy-fmt — cp+fmt+diff idempotence check (vault 1.18.5 has no `policy fmt -check` flag, so we build the non-destructive check out of `vault policy fmt` on a /tmp copy + diff against the original) 5. vault-policy-validate — HCL syntax + capability validation via `vault policy write` against an inline dev-mode Vault server (no offline `policy validate` subcommand exists; dev-mode writes are ephemeral so this is a validator, not a deploy) 6. vault-roles-validate — yamllint + PyYAML-based role→policy reference check (every role's `policy:` field must match a vault/policies/*.hcl basename; also checks the four required fields name/policy/namespace/job_id) Secret-scan coverage for vault/policies/*.hcl is already provided by the P11 gate (.woodpecker/secret-scan.yml) via its `vault/**/*` trigger path — this pipeline intentionally does NOT duplicate that gate to avoid the inline-heredoc / YAML-parse failure mode that sank the prior attempt at this issue (PR #896). Trigger paths extended: `vault/policies/**` and `vault/roles.yaml`. `lib/init/nomad/vault-*.sh` is already covered by the existing `lib/init/nomad/**` glob. Docs: nomad/AGENTS.md and vault/policies/AGENTS.md updated with the policy lifecycle, the CI enforcement table, and the common failure modes authors will see. Co-Authored-By: Claude Opus 4.6 (1M context) --- .woodpecker/nomad-validate.yml | 208 +++++++++++++++++++++++++++++++-- nomad/AGENTS.md | 48 +++++++- vault/policies/AGENTS.md | 64 +++++++++- 3 files changed, 300 insertions(+), 20 deletions(-) diff --git a/.woodpecker/nomad-validate.yml b/.woodpecker/nomad-validate.yml index 81e45ae..5a1cc7c 100644 --- a/.woodpecker/nomad-validate.yml +++ b/.woodpecker/nomad-validate.yml @@ -1,16 +1,21 @@ # ============================================================================= # .woodpecker/nomad-validate.yml — Static validation for Nomad+Vault artifacts # -# Part of the Nomad+Vault migration (S0.5, issue #825). Locks in the -# "no-ad-hoc-steps" principle: every HCL/shell artifact under nomad/ or -# lib/init/nomad/, plus the `disinto init` dispatcher, gets checked -# before it can land. +# Part of the Nomad+Vault migration (S0.5, issue #825; extended in S2.6, +# issue #884). Locks in the "no-ad-hoc-steps" principle: every HCL/shell +# artifact under nomad/, lib/init/nomad/, vault/policies/, plus the +# `disinto init` dispatcher and vault/roles.yaml, gets checked before it +# can land. 
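+#
+# Local smoke-run (a suggested shortcut, assuming the woodpecker-cli
+# binary and a docker daemon are available — nothing in this repo
+# depends on it):
+#   woodpecker-cli exec .woodpecker/nomad-validate.yml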
#
# Triggers on PRs (and pushes) that touch any of:
#   nomad/**                  — HCL configs (server, client, vault)
-#   lib/init/nomad/**         — cluster-up / install / systemd / vault-init
+#   lib/init/nomad/**         — cluster-up / install / systemd / vault-init /
+#                               vault-nomad-auth (S2.6 trigger: vault-*.sh
+#                               is a subset of this glob)
#   bin/disinto               — `disinto init --backend=nomad` dispatcher
#   tests/disinto-init-nomad.bats — the bats suite itself
+#   vault/policies/**         — Vault ACL policy HCL files (S2.1, S2.6)
+#   vault/roles.yaml          — JWT-auth role bindings (S2.3, S2.6)
#   .woodpecker/nomad-validate.yml — the pipeline definition
#
# Steps (all fail-closed — any error blocks merge):
@@ -19,8 +24,22 @@
#                              nomad/jobs/*.hcl (new jobspecs get
#                              CI coverage automatically)
#   3. vault-operator-diagnose — `vault operator diagnose` syntax check on vault.hcl
-#   4. shellcheck-nomad       — shellcheck the cluster-up + install scripts + disinto
-#   5. bats-init-nomad        — `disinto init --backend=nomad --dry-run` smoke tests
+#   4. vault-policy-fmt       — `vault policy fmt` idempotence check on
+#                               every vault/policies/*.hcl (format drift =
+#                               CI fail; non-destructive via cp+diff)
+#   5. vault-policy-validate  — HCL syntax + capability validation for every
+#                               vault/policies/*.hcl via `vault policy write`
+#                               against an inline dev-mode Vault server
+#   6. vault-roles-validate   — yamllint + role→policy reference check on
+#                               vault/roles.yaml (every referenced policy
+#                               must exist as vault/policies/<name>.hcl)
+#   7. shellcheck-nomad       — shellcheck the cluster-up + install scripts + disinto
+#   8. bats-init-nomad        — `disinto init --backend=nomad --dry-run` smoke tests
+#
+# Secret-scan coverage: vault/policies/*.hcl is already scanned by the
+# P11 gate (.woodpecker/secret-scan.yml, issue #798) — its trigger path
+# `vault/**/*` covers everything under this directory. We intentionally
+# do NOT duplicate that gate here; one scanner, one source of truth.
#
# Pinned image versions match lib/init/nomad/install.sh (nomad 1.9.5 /
# vault 1.18.5). Bump there AND here together — drift = CI passing on
@@ -34,6 +53,8 @@ when:
       - "lib/init/nomad/**"
       - "bin/disinto"
       - "tests/disinto-init-nomad.bats"
+      - "vault/policies/**"
+      - "vault/roles.yaml"
       - ".woodpecker/nomad-validate.yml"

# Authenticated clone — same pattern as .woodpecker/ci.yml. Forgejo is
@@ -123,7 +144,176 @@ steps:
         *) echo "vault config: hard failure (rc=$rc)" >&2; exit "$rc" ;;
       esac

-  # ── 4. Shellcheck ────────────────────────────────────────────────────────
+  # ── 4. Vault policy fmt idempotence check ────────────────────────────────
+  # `vault policy fmt <file>` formats a local HCL policy file in place.
+  # There's no `-check`/dry-run flag (vault 1.18.5), so we implement a
+  # non-destructive check as cp → fmt-on-copy → diff against the original.
+  # Any diff means the committed file would be rewritten by `vault policy
+  # fmt` — failure steers the author to run `vault policy fmt <file>`
+  # locally before pushing.
+  #
+  # Scope: vault/policies/*.hcl only. The `[ -f "$f" ]` guard handles the
+  # no-match case (POSIX sh does not nullglob) so an empty policies/
+  # directory does not fail this step.
+  #
+  # Note: `vault policy fmt` is purely local (HCL text transform) and does
+  # not require a running Vault server, which is why this step can run
+  # without starting one.
+  - name: vault-policy-fmt
+    image: hashicorp/vault:1.18.5
+    commands:
+      - |
+        set -e
+        failed=0
+        for f in vault/policies/*.hcl; do
+          [ -f "$f" ] || continue
+          tmp="/tmp/$(basename "$f").fmt"
+          cp "$f" "$tmp"
+          vault policy fmt "$tmp" >/dev/null 2>&1
+          if ! 
diff -u "$f" "$tmp"; then + echo "ERROR: $f is not formatted — run 'vault policy fmt $f' locally" >&2 + failed=1 + fi + done + if [ "$failed" -gt 0 ]; then + echo "vault-policy-fmt: formatting drift detected" >&2 + exit 1 + fi + echo "vault-policy-fmt: all policies formatted correctly" + + # ── 5. Vault policy HCL syntax + capability validation ─────────────────── + # Vault has no offline `vault policy validate` subcommand — the closest + # in-CLI validator is `vault policy write`, which sends the HCL to a + # running server which parses it, checks capability names against the + # known set (read, list, create, update, delete, patch, sudo, deny), + # and rejects unknown stanzas / malformed path blocks. We start an + # inline dev-mode Vault (in-memory, no persistence, root token = "root") + # for the duration of this step and loop `vault policy write` over every + # vault/policies/*.hcl; the policies never leave the ephemeral dev + # server, so this is strictly a validator — not a deploy. + # + # Exit-code handling: + # - `vault policy write` exits 0 on success, non-zero on any parse / + # semantic error. We aggregate failures across all files so a single + # CI run surfaces every broken policy (not just the first). + # - The dev server is killed on any step exit via EXIT trap so the + # step tears down cleanly even on failure. + # + # Why dev-mode is sufficient: we're not persisting secrets, only asking + # Vault to parse policy text. The factory's production Vault is NOT + # contacted. + - name: vault-policy-validate + image: hashicorp/vault:1.18.5 + commands: + - | + set -e + vault server -dev -dev-root-token-id=root -dev-listen-address=127.0.0.1:8200 >/tmp/vault-dev.log 2>&1 & + VAULT_PID=$! + trap 'kill "$VAULT_PID" 2>/dev/null || true' EXIT INT TERM + export VAULT_ADDR=http://127.0.0.1:8200 + export VAULT_TOKEN=root + ready=0 + i=0 + while [ "$i" -lt 30 ]; do + if vault status >/dev/null 2>&1; then + ready=1 + break + fi + i=$((i + 1)) + sleep 0.5 + done + if [ "$ready" -ne 1 ]; then + echo "vault-policy-validate: dev server failed to start after 15s" >&2 + cat /tmp/vault-dev.log >&2 || true + exit 1 + fi + failed=0 + for f in vault/policies/*.hcl; do + [ -f "$f" ] || continue + name=$(basename "$f" .hcl) + echo "validate: $f" + if ! vault policy write "$name" "$f"; then + echo " ERROR: $f failed validation" >&2 + failed=1 + fi + done + if [ "$failed" -gt 0 ]; then + echo "vault-policy-validate: validation errors found" >&2 + exit 1 + fi + echo "vault-policy-validate: all policies valid" + + # ── 6. vault/roles.yaml validator ──────────────────────────────────────── + # Validates the JWT-auth role bindings file (S2.3). Two checks: + # + # a. `yamllint` — catches YAML syntax errors and indentation drift. + # Uses a relaxed config (line length bumped to 200) because + # roles.yaml's comments are wide by design. + # b. role → policy reference check — every role's `policy:` field + # must match a basename in vault/policies/*.hcl. A role pointing + # at a non-existent policy = runtime "permission denied" at job + # placement; catching the drift here turns it into a CI failure. + # Also verifies each role entry has the four required fields + # (name, policy, namespace, job_id) per the file's documented + # format. + # + # Parsing is done with PyYAML (the roles.yaml format is a strict + # subset that awk-level parsing in tools/vault-apply-roles.sh handles + # too, but PyYAML in CI gives us structural validation for free). If + # roles.yaml is ever absent (e.g. 
reverted), the step skips rather + # than fails — presence is enforced by S2.3's own tooling, not here. + - name: vault-roles-validate + image: python:3.12-alpine + commands: + - pip install --quiet --disable-pip-version-check pyyaml yamllint + - | + set -e + if [ ! -f vault/roles.yaml ]; then + echo "vault-roles-validate: vault/roles.yaml not present, skipping" + exit 0 + fi + yamllint -d '{extends: relaxed, rules: {line-length: {max: 200}}}' vault/roles.yaml + echo "vault-roles-validate: yamllint OK" + python3 - <<'PY' + import os + import sys + import yaml + + with open('vault/roles.yaml') as f: + data = yaml.safe_load(f) or {} + roles = data.get('roles') or [] + if not roles: + print("vault-roles-validate: no roles defined in vault/roles.yaml", file=sys.stderr) + sys.exit(1) + existing = { + os.path.splitext(e)[0] + for e in os.listdir('vault/policies') + if e.endswith('.hcl') + } + required = ('name', 'policy', 'namespace', 'job_id') + failed = 0 + for r in roles: + if not isinstance(r, dict): + print(f"ERROR: role entry is not a mapping: {r!r}", file=sys.stderr) + failed = 1 + continue + for field in required: + if r.get(field) in (None, ''): + print(f"ERROR: role entry missing required field '{field}': {r}", file=sys.stderr) + failed = 1 + policy = r.get('policy') + if policy and policy not in existing: + print( + f"ERROR: role '{r.get('name')}' references policy '{policy}' " + f"but vault/policies/{policy}.hcl does not exist", + file=sys.stderr, + ) + failed = 1 + sys.exit(failed) + PY + echo "vault-roles-validate: all role→policy references valid" + + # ── 7. Shellcheck ──────────────────────────────────────────────────────── # Covers the new lib/init/nomad/*.sh scripts plus bin/disinto (which owns # the backend dispatcher). bin/disinto has no .sh extension so the # repo-wide shellcheck in .woodpecker/ci.yml skips it — this step is the @@ -133,7 +323,7 @@ steps: commands: - shellcheck --severity=warning lib/init/nomad/*.sh bin/disinto - # ── 5. bats: `disinto init --backend=nomad --dry-run` ──────────────────── + # ── 8. bats: `disinto init --backend=nomad --dry-run` ──────────────────── # Smoke-tests the CLI dispatcher: both --backend=nomad variants exit 0 # with the expected step list, and --backend=docker stays on the docker # path (regression guard). Pure dry-run — no sudo, no network. diff --git a/nomad/AGENTS.md b/nomad/AGENTS.md index 953a7b2..5be8336 100644 --- a/nomad/AGENTS.md +++ b/nomad/AGENTS.md @@ -59,8 +59,8 @@ it owns. ## How CI validates these files `.woodpecker/nomad-validate.yml` runs on every PR that touches `nomad/` -(including `nomad/jobs/`), `lib/init/nomad/`, or `bin/disinto`. Five -fail-closed steps: +(including `nomad/jobs/`), `lib/init/nomad/`, `bin/disinto`, +`vault/policies/`, or `vault/roles.yaml`. Eight fail-closed steps: 1. **`nomad config validate nomad/server.hcl nomad/client.hcl`** — parses the HCL, fails on unknown blocks, bad port ranges, invalid @@ -85,19 +85,47 @@ fail-closed steps: disables the runtime checks (CI containers don't have `/var/lib/vault/data` or port 8200). Exit 2 (advisory warnings only, e.g. TLS-disabled listener) is tolerated; exit 1 blocks merge. -4. **`shellcheck --severity=warning lib/init/nomad/*.sh bin/disinto`** +4. **`vault policy fmt` idempotence check on every `vault/policies/*.hcl`** + (S2.6) — `vault policy fmt` has no `-check` flag in 1.18.5, so the + step copies each file to `/tmp`, runs `vault policy fmt` on the copy, + and diffs against the original. 
Any non-empty diff means the
+   committed file would be rewritten by `fmt` and the step fails — the
+   author is pointed at `vault policy fmt <file>` to heal the drift.
+5. **`vault policy write`-based validation against an inline dev-mode Vault**
+   (S2.6) — Vault 1.18.5 has no offline `policy validate` subcommand;
+   the CI step starts a dev-mode server, loops `vault policy write
+   <name> <file>` over each `vault/policies/*.hcl`, and aggregates
+   failures so one CI run surfaces every broken policy. The server is
+   ephemeral and torn down on step exit — no persistence, no real
+   secrets. Catches unknown capability names (e.g. `"frobnicate"`),
+   malformed `path` blocks, and other semantic errors `fmt` does not.
+6. **`vault/roles.yaml` validator** (S2.6) — yamllint + a PyYAML-based
+   check that every role's `policy:` field matches a basename under
+   `vault/policies/`, and that every role entry carries all four
+   required fields (`name`, `policy`, `namespace`, `job_id`). Drift
+   between the two directories is a scheduling-time "permission denied"
+   in production; this step turns it into a CI failure at PR time.
+7. **`shellcheck --severity=warning lib/init/nomad/*.sh bin/disinto`**
    — all init/dispatcher shell clean. `bin/disinto` has no `.sh`
    extension so the repo-wide shellcheck in `.woodpecker/ci.yml` skips
    it — this is the one place it gets checked.
-5. **`bats tests/disinto-init-nomad.bats`**
+8. **`bats tests/disinto-init-nomad.bats`**
    — exercises the dispatcher: `disinto init --backend=nomad
    --dry-run`, `… --empty --dry-run`, and the `--backend=docker`
    regression guard.
+
+**Secret-scan coverage.** Policy HCL files under `vault/policies/` are
+already swept by the P11 secret-scan gate
+(`.woodpecker/secret-scan.yml`, #798), whose `vault/**/*` trigger path
+covers everything in this directory. `nomad-validate.yml` intentionally
+does NOT duplicate that gate — one scanner, one source of truth.
+
 If a PR breaks `nomad/server.hcl` (e.g. typo in a block name), step
 1 fails with a clear error; if it breaks a jobspec (e.g. misspells
 `task` as `tsak`, or adds a `volume` stanza without a `source`), step
-2 fails instead. The fix makes it pass. PRs that don't touch any of
-the trigger paths skip this pipeline entirely.
+2 fails; a typo in a `path "..."` block in a vault policy fails step 5
+with the Vault parser's error; a `roles.yaml` entry that points at a
+policy basename that does not exist fails step 6. PRs that don't touch
+any of the trigger paths skip this pipeline entirely.

 ## Version pinning

@@ -117,5 +145,13 @@ accept (or vice versa).

 - `lib/init/nomad/` — installer + systemd units + cluster-up orchestrator.
 - `.woodpecker/nomad-validate.yml` — this directory's CI pipeline.
+- `vault/policies/` — Vault ACL policy HCL files (S2.1); the
+  `vault-policy-fmt` / `vault-policy-validate` CI steps above enforce
+  their shape. See [`../vault/policies/AGENTS.md`](../vault/policies/AGENTS.md)
+  for the policy lifecycle, CI enforcement details, and common failure
+  modes.
+- `vault/roles.yaml` — JWT-auth role → policy bindings (S2.3); the
+  `vault-roles-validate` CI step above keeps it in lockstep with the
+  policies directory.
 - Top-of-file headers in `server.hcl` / `client.hcl` / `vault.hcl`
   document the per-file ownership contract.

diff --git a/vault/policies/AGENTS.md b/vault/policies/AGENTS.md
index edaf21c..ff1f403 100644
--- a/vault/policies/AGENTS.md
+++ b/vault/policies/AGENTS.md
@@ -48,12 +48,17 @@ validation.

 1. Drop a file matching one of the four naming patterns above. 
Use an existing
    file in the same family as the template — comment header,
    capability list, and KV path layout should match the family.
-2. Run `tools/vault-apply-policies.sh --dry-run` to confirm the new
+2. Run `vault policy fmt <file>` locally so the formatting matches what
+   the CI fmt-check (step 4 of `.woodpecker/nomad-validate.yml`) will
+   accept. The fmt check runs non-destructively in CI but a dirty file
+   fails the step; running `fmt` locally before pushing is the fastest
+   path.
+3. Add the matching entry to `../roles.yaml` (see "JWT-auth roles" below)
+   so the CI role-reference check (step 6) stays green.
+4. Run `tools/vault-apply-policies.sh --dry-run` to confirm the new
    basename appears in the planned-work list with the expected SHA.
-3. Run `tools/vault-apply-policies.sh` against a Vault instance to
+5. Run `tools/vault-apply-policies.sh` against a Vault instance to
    create it; re-run to confirm it reports `unchanged`.
-4. The CI fmt + validate step lands in S2.6 (#884). Until then
-   `vault policy fmt <file>` locally is the fastest sanity check.

 ## JWT-auth roles (S2.3)

@@ -117,6 +122,56 @@ would let one service's tokens outlive the others — add a field to
 `vault/roles.yaml` and the applier at the same time if that ever
 becomes necessary.

+## Policy lifecycle
+
+Adding a policy that an actual workload consumes is a three-step chain;
+the CI pipeline guards each link.
+
+1. **Add the policy HCL** — `vault/policies/<name>.hcl`, formatted with
+   `vault policy fmt`. Capabilities must be drawn from the Vault-recognized
+   set (`read`, `list`, `create`, `update`, `delete`, `patch`, `sudo`,
+   `deny`); a typo fails CI step 5 (the HCL is written to an inline
+   dev-mode Vault via `vault policy write` — a real parser, not a regex).
+2. **Update `../roles.yaml`** — add a JWT-auth role entry whose `policy:`
+   field matches the new basename (without `.hcl`). CI step 6 re-checks
+   every role in this file against the policy set, so a drift between the
+   two directories fails the step.
+3. **Reference from a Nomad jobspec** — add `vault { role = "<role>" }` in
+   `nomad/jobs/<service>.hcl` (owned by S2.4). Policies do not take effect
+   until a Nomad job asks for a token via that role.
+
+See the "Adding a new service" walkthrough below for the applier-script
+flow once steps 1–3 are committed.
+
+## CI enforcement (`.woodpecker/nomad-validate.yml`)
+
+The pipeline triggers on any PR touching `vault/policies/**`,
+`vault/roles.yaml`, or `lib/init/nomad/vault-*.sh` and runs three
+vault-scoped checks (in addition to the nomad-scoped steps already in
+place); the separately-triggered P11 secret-scan gate is listed for
+completeness:
+
+| Step | Tool | What it catches |
+|---|---|---|
+| 4. `vault-policy-fmt` | `vault policy fmt` + `diff` | formatting drift — trailing whitespace, wrong indentation, missing newlines |
+| 5. `vault-policy-validate` | `vault policy write` against inline dev Vault | HCL syntax errors, unknown stanzas, invalid capability names (e.g. `"frobnicate"`), malformed `path "..." {}` blocks |
+| 6. `vault-roles-validate` | yamllint + PyYAML | roles.yaml syntax drift, missing required fields, role→policy references with no matching `.hcl` |
+| P11 | `lib/secret-scan.sh` via `.woodpecker/secret-scan.yml` | literal secret leaked into a policy HCL (rare copy-paste mistake) — already covers `vault/**/*`, no duplicate step here |
+
+All of these gates are fail-closed — any error blocks merge (P11 in its
+own pipeline). This pipeline pins `hashicorp/vault:1.18.5` (matching
+`lib/init/nomad/install.sh`); bumping the runtime version without
+bumping the CI image opens a window where CI validates against a Vault
+the runtime no longer runs — bump both together.
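+
+To reproduce steps 4–5 locally before pushing — a minimal sketch, assuming
+only the `vault` CLI on PATH (CI pins 1.18.5); `my-policy.hcl` is a
+placeholder file name:
+
+```sh
+# Step-4 equivalent: non-destructive fmt check via copy + diff.
+cp vault/policies/my-policy.hcl /tmp/my-policy.hcl
+vault policy fmt /tmp/my-policy.hcl
+diff -u vault/policies/my-policy.hcl /tmp/my-policy.hcl  # any output = fmt drift
+
+# Step-5 equivalent: parse the HCL against a throwaway dev-mode server.
+vault server -dev -dev-root-token-id=root >/dev/null 2>&1 &
+VAULT_PID=$!
+export VAULT_ADDR=http://127.0.0.1:8200 VAULT_TOKEN=root
+until vault status >/dev/null 2>&1; do sleep 0.5; done   # wait for readiness
+vault policy write my-policy vault/policies/my-policy.hcl
+kill "$VAULT_PID"
+```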
+
+## Common failure modes
+
+| Symptom in CI logs | Root cause | Fix |
+|---|---|---|
+| `vault-policy-fmt: … is not formatted — run 'vault policy fmt <file>'` | Trailing whitespace / mixed indent in an HCL file | `vault policy fmt <file>` locally and re-commit |
+| `vault-policy-validate: … failed validation` plus a `policy` error from Vault | Unknown capability (e.g. `"frobnicate"`), unknown stanza, malformed `path` block | Fix the HCL; valid capabilities are `read`, `list`, `create`, `update`, `delete`, `patch`, `sudo`, `deny` |
+| `vault-roles-validate: ERROR: role 'X' references policy 'Y' but vault/policies/Y.hcl does not exist` | A role's `policy:` field does not match any file basename in `vault/policies/` | Either add the missing policy HCL or fix the typo in `roles.yaml` |
+| `vault-roles-validate: ERROR: role entry missing required field 'Z'` | A role in `roles.yaml` is missing one of `name`, `policy`, `namespace`, `job_id` | Add the field; all four are required |
+| P11 `secret-scan: detected potential secret …` on a `.hcl` file | A literal token/password was pasted into a policy | Policies must name KV paths, not carry secret values — move the literal into KV (S2.2) and have the policy grant `read` on the path |
+
 ## What this directory does NOT own

 - **Attaching policies to Nomad jobs.** That's S2.4 (#882) via the
@@ -124,4 +179,3 @@ becomes necessary.
   name in `vault { role = "..." }` is what binds the policy.
 - **Writing the secret values themselves.** That's S2.2 (#880) via
   `tools/vault-import.sh`.
-- **CI policy fmt + validate + roles.yaml check.** That's S2.6 (#884).

From a8d18aa3a343dcdf4b2700a05bd9c501b766013b Mon Sep 17 00:00:00 2001
From: dev-qwen2
Date: Thu, 16 Apr 2026 18:13:26 +0000
Subject: [PATCH 02/93] =?UTF-8?q?fix:=20[nomad-step-2]=20S2.5=20=E2=80=94?=
 =?UTF-8?q?=20bin/disinto=20init=20--import-env=20/=20--import-sops=20/=20?=
 =?UTF-8?q?--age-key=20wire-up=20(#883)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 bin/disinto                   | 127 ++++++++++++++++++++++++++++++++--
 lib/init/nomad/cluster-up.sh  |   2 +-
 tests/disinto-init-nomad.bats |  62 ++++++++++++++++-
 3 files changed, 183 insertions(+), 8 deletions(-)

diff --git a/bin/disinto b/bin/disinto
index 6128b7c..b294540 100755
--- a/bin/disinto
+++ b/bin/disinto
@@ -89,6 +89,9 @@ Init options:
   --yes              Skip confirmation prompts
   --rotate-tokens    Force regeneration of all bot tokens/passwords (idempotent by default)
   --dry-run          Print every intended action without executing
+  --import-env       (nomad) Path to .env file for import into Vault KV
+  --import-sops      (nomad) Path to sops-encrypted .env.vault.enc for import
+  --age-key          (nomad) Path to age keyfile (required with --import-sops)

 Hire an agent options:
   --formula          Path to role formula TOML (default: formulas/<role>.toml)
@@ -664,8 +667,12 @@ prompt_admin_password()
 # `sudo disinto init ...` directly.
 _disinto_init_nomad() {
   local dry_run="${1:-false}" empty="${2:-false}" with_services="${3:-}"
+  local import_env="${4:-}" import_sops="${5:-}" age_key="${6:-}"
   local cluster_up="${FACTORY_ROOT}/lib/init/nomad/cluster-up.sh"
   local deploy_sh="${FACTORY_ROOT}/lib/init/nomad/deploy.sh"
+  local vault_import_sh="${FACTORY_ROOT}/tools/vault-import.sh"
+  local vault_auth_sh="${FACTORY_ROOT}/lib/init/nomad/vault-nomad-auth.sh"
+  local vault_policies_sh="${FACTORY_ROOT}/tools/vault-apply-policies.sh"

   if [ ! 
-x "$cluster_up" ]; then echo "Error: ${cluster_up} not found or not executable" >&2 @@ -686,7 +693,7 @@ _disinto_init_nomad() { echo "nomad backend: default (cluster-up; jobs deferred to Step 1)" fi - # Dry-run: print cluster-up plan + deploy.sh plan + # Dry-run: print cluster-up plan + import plan + deploy.sh plan if [ "$dry_run" = "true" ]; then echo "" echo "── Cluster-up dry-run ─────────────────────────────────" @@ -694,6 +701,32 @@ _disinto_init_nomad() { "${cmd[@]}" || true echo "" + # Import plan if any import flags are set + if [ -n "$import_env" ] || [ -n "$import_sops" ] || [ -n "$age_key" ]; then + echo "── Vault import dry-run ───────────────────────────────" + if [ -n "$import_env" ]; then + echo "[import] --import-env: ${import_env}" + fi + if [ -n "$import_sops" ]; then + echo "[import] --import-sops: ${import_sops}" + fi + if [ -n "$age_key" ]; then + echo "[import] --age-key: ${age_key}" + fi + echo "[import] [dry-run] ${vault_import_sh} --dry-run" + echo "[import] [dry-run] vault import plan printed above" + echo "" + echo "── Vault policies dry-run ─────────────────────────────" + echo "[policies] [dry-run] ${vault_policies_sh} --dry-run" + echo "" + echo "── Vault auth dry-run ─────────────────────────────────" + echo "[auth] [dry-run] ${vault_auth_sh}" + echo "" + else + echo "[import] no --import-env/--import-sops - skipping; set them or seed kv/disinto/* manually before deploying secret-dependent services" + echo "" + fi + if [ -n "$with_services" ]; then echo "── Deploy services dry-run ────────────────────────────" echo "[deploy] services to deploy: ${with_services}" @@ -721,7 +754,7 @@ _disinto_init_nomad() { exit 0 fi - # Real run: cluster-up + deploy services + # Real run: cluster-up + import + deploy services local -a cluster_cmd=("$cluster_up") if [ "$(id -u)" -eq 0 ]; then "${cluster_cmd[@]}" || exit $? @@ -733,6 +766,61 @@ _disinto_init_nomad() { sudo -n -- "${cluster_cmd[@]}" || exit $? fi + # Apply Vault policies (S2.1) + echo "" + echo "── Applying Vault policies ─────────────────────────────" + if [ "$(id -u)" -eq 0 ]; then + "${vault_policies_sh}" || exit $? + else + if ! command -v sudo >/dev/null 2>&1; then + echo "Error: vault-apply-policies.sh must run as root and sudo is not installed" >&2 + exit 1 + fi + sudo -n -- "${vault_policies_sh}" || exit $? + fi + + # Configure Vault JWT auth (S2.3) + echo "" + echo "── Configuring Vault JWT auth ──────────────────────────" + if [ "$(id -u)" -eq 0 ]; then + "${vault_auth_sh}" || exit $? + else + if ! command -v sudo >/dev/null 2>&1; then + echo "Error: vault-nomad-auth.sh must run as root and sudo is not installed" >&2 + exit 1 + fi + sudo -n -- "${vault_auth_sh}" || exit $? + fi + + # Import secrets if import flags are set (S2.2) + if [ -n "$import_env" ] || [ -n "$import_sops" ] || [ -n "$age_key" ]; then + echo "" + echo "── Importing secrets into Vault ────────────────────────" + local -a import_cmd=("$vault_import_sh") + + if [ -n "$import_env" ]; then + import_cmd+=("--env" "$import_env") + fi + if [ -n "$import_sops" ]; then + import_cmd+=("--sops" "$import_sops") + fi + if [ -n "$age_key" ]; then + import_cmd+=("--age-key" "$age_key") + fi + + if [ "$(id -u)" -eq 0 ]; then + "${import_cmd[@]}" || exit $? + else + if ! command -v sudo >/dev/null 2>&1; then + echo "Error: vault-import.sh must run as root and sudo is not installed" >&2 + exit 1 + fi + sudo -n -- "${import_cmd[@]}" || exit $? 
+ fi + else + echo "[import] no --import-env/--import-sops - skipping; set them or seed kv/disinto/* manually before deploying secret-dependent services" + fi + # Deploy services if requested if [ -n "$with_services" ]; then echo "" @@ -777,6 +865,11 @@ _disinto_init_nomad() { echo "" echo "── Summary ────────────────────────────────────────────" echo "Cluster: Nomad+Vault cluster is up" + if [ -n "$import_env" ] || [ -n "$import_sops" ]; then + echo "Imported: secrets from ${import_env:+$import_env }${import_sops:+${import_sops} }" + else + echo "Imported: (none — secrets must be seeded manually)" + fi echo "Deployed: ${with_services}" if echo "$with_services" | grep -q "forgejo"; then echo "Ports: forgejo: 3000" @@ -802,7 +895,7 @@ disinto_init() { fi # Parse flags - local branch="" repo_root="" ci_id="0" auto_yes=false forge_url_flag="" bare=false rotate_tokens=false use_build=false dry_run=false backend="docker" empty=false with_services="" + local branch="" repo_root="" ci_id="0" auto_yes=false forge_url_flag="" bare=false rotate_tokens=false use_build=false dry_run=false backend="docker" empty=false with_services="" import_env="" import_sops="" age_key="" while [ $# -gt 0 ]; do case "$1" in --branch) branch="$2"; shift 2 ;; @@ -819,6 +912,9 @@ disinto_init() { --yes) auto_yes=true; shift ;; --rotate-tokens) rotate_tokens=true; shift ;; --dry-run) dry_run=true; shift ;; + --import-env) import_env="$2"; shift 2 ;; + --import-sops) import_sops="$2"; shift 2 ;; + --age-key) age_key="$2"; shift 2 ;; *) echo "Unknown option: $1" >&2; exit 1 ;; esac done @@ -859,11 +955,32 @@ disinto_init() { exit 1 fi + # Import flags validation + # --import-sops requires --age-key + if [ -n "$import_sops" ] && [ -z "$age_key" ]; then + echo "Error: --import-sops requires --age-key" >&2 + exit 1 + fi + + # --age-key requires --import-sops + if [ -n "$age_key" ] && [ -z "$import_sops" ]; then + echo "Error: --age-key requires --import-sops" >&2 + exit 1 + fi + + # --import-* flags require --backend=nomad + if [ -n "$import_env" ] || [ -n "$import_sops" ] || [ -n "$age_key" ]; then + if [ "$backend" != "nomad" ]; then + echo "Error: --import-env, --import-sops, and --age-key require --backend=nomad" >&2 + exit 1 + fi + fi + # Dispatch on backend — the nomad path runs lib/init/nomad/cluster-up.sh # (S0.4). The default and --empty variants are identical today; Step 1 # will branch on $empty to add job deployment to the default path. if [ "$backend" = "nomad" ]; then - _disinto_init_nomad "$dry_run" "$empty" "$with_services" + _disinto_init_nomad "$dry_run" "$empty" "$with_services" "$import_env" "$import_sops" "$age_key" # shellcheck disable=SC2317 # _disinto_init_nomad always exits today; # `return` is defensive against future refactors. return @@ -1017,7 +1134,7 @@ p.write_text(text) echo "[ensure] CLAUDE_CONFIG_DIR" echo "[ensure] state files (.dev-active, .reviewer-active, .gardener-active)" echo "" - echo "Dry run complete — no changes made." + echo "Dry run complete - no changes made." exit 0 fi diff --git a/lib/init/nomad/cluster-up.sh b/lib/init/nomad/cluster-up.sh index 4aab42d..84a6e9c 100755 --- a/lib/init/nomad/cluster-up.sh +++ b/lib/init/nomad/cluster-up.sh @@ -135,7 +135,7 @@ EOF → export VAULT_ADDR=${VAULT_ADDR_DEFAULT} → export NOMAD_ADDR=${NOMAD_ADDR_DEFAULT} -Dry run complete — no changes made. +Dry run complete - no changes made. 
EOF exit 0 fi diff --git a/tests/disinto-init-nomad.bats b/tests/disinto-init-nomad.bats index 84cfa10..75bb884 100644 --- a/tests/disinto-init-nomad.bats +++ b/tests/disinto-init-nomad.bats @@ -44,7 +44,7 @@ setup_file() { [[ "$output" == *"[dry-run] Step 8/9: systemctl start nomad + poll until ≥1 node ready"* ]] [[ "$output" == *"[dry-run] Step 9/9: write /etc/profile.d/disinto-nomad.sh"* ]] - [[ "$output" == *"Dry run complete — no changes made."* ]] + [[ "$output" == *"Dry run complete - no changes made."* ]] } # ── --backend=nomad --empty --dry-run ──────────────────────────────────────── @@ -58,7 +58,7 @@ setup_file() { # both modes invoke the same cluster-up dry-run. [[ "$output" == *"nomad backend: --empty (cluster-up only, no jobs)"* ]] [[ "$output" == *"[dry-run] Step 1/9: install nomad + vault binaries + docker daemon"* ]] - [[ "$output" == *"Dry run complete — no changes made."* ]] + [[ "$output" == *"Dry run complete - no changes made."* ]] } # ── --backend=docker (regression guard) ────────────────────────────────────── @@ -191,3 +191,61 @@ setup_file() { [ "$status" -ne 0 ] [[ "$output" == *"--empty and --with are mutually exclusive"* ]] } + +# ── Import flag validation ──────────────────────────────────────────────────── + +@test "disinto init --backend=nomad --import-env only is accepted" { + run "$DISINTO_BIN" init placeholder/repo --backend=nomad --import-env /tmp/.env --dry-run + [ "$status" -eq 0 ] + [[ "$output" == *"--import-env"* ]] +} + +@test "disinto init --backend=nomad --import-sops without --age-key errors" { + run "$DISINTO_BIN" init placeholder/repo --backend=nomad --import-sops /tmp/.env.vault.enc --dry-run + [ "$status" -ne 0 ] + [[ "$output" == *"--import-sops requires --age-key"* ]] +} + +@test "disinto init --backend=nomad --age-key without --import-sops errors" { + run "$DISINTO_BIN" init placeholder/repo --backend=nomad --age-key /tmp/keys.txt --dry-run + [ "$status" -ne 0 ] + [[ "$output" == *"--age-key requires --import-sops"* ]] +} + +@test "disinto init --backend=docker --import-env errors with backend requirement" { + run "$DISINTO_BIN" init placeholder/repo --backend=docker --import-env /tmp/.env + [ "$status" -ne 0 ] + [[ "$output" == *"--import-env, --import-sops, and --age-key require --backend=nomad"* ]] +} + +@test "disinto init --backend=nomad --import-sops --age-key --dry-run shows import plan" { + run "$DISINTO_BIN" init placeholder/repo --backend=nomad --import-sops /tmp/.env.vault.enc --age-key /tmp/keys.txt --dry-run + [ "$status" -eq 0 ] + [[ "$output" == *"Vault import dry-run"* ]] + [[ "$output" == *"--import-sops"* ]] + [[ "$output" == *"--age-key"* ]] +} + +@test "disinto init --backend=nomad --import-env --import-sops --age-key --dry-run shows full import plan" { + run "$DISINTO_BIN" init placeholder/repo --backend=nomad --import-env /tmp/.env --import-sops /tmp/.env.vault.enc --age-key /tmp/keys.txt --dry-run + [ "$status" -eq 0 ] + [[ "$output" == *"Vault import dry-run"* ]] + [[ "$output" == *"env file: /tmp/.env"* ]] + [[ "$output" == *"sops file: /tmp/.env.vault.enc"* ]] + [[ "$output" == *"age key: /tmp/keys.txt"* ]] +} + +@test "disinto init --backend=nomad without import flags shows skip message" { + run "$DISINTO_BIN" init placeholder/repo --backend=nomad --dry-run + [ "$status" -eq 0 ] + [[ "$output" == *"no --import-env/--import-sops - skipping"* ]] +} + +@test "disinto init --backend=nomad --import-env --import-sops --age-key --with forgejo --dry-run shows all plans" { + run "$DISINTO_BIN" init placeholder/repo 
--backend=nomad --import-env /tmp/.env --import-sops /tmp/.env.vault.enc --age-key /tmp/keys.txt --with forgejo --dry-run + [ "$status" -eq 0 ] + [[ "$output" == *"Vault import dry-run"* ]] + [[ "$output" == *"Vault policies dry-run"* ]] + [[ "$output" == *"Vault auth dry-run"* ]] + [[ "$output" == *"Deploy services dry-run"* ]] +} From bbaccd678d5bda6129fe665f275b6793ccb3ac7a Mon Sep 17 00:00:00 2001 From: Claude Date: Thu, 16 Apr 2026 18:36:42 +0000 Subject: [PATCH 03/93] fix: entrypoint: validate_projects_dir silently exits instead of logging FATAL under set -eo pipefail (#877) `compgen -G ... | wc -l` under `set -eo pipefail` aborts the script on the non-zero pipeline exit (compgen returns 1 on no match) before the FATAL diagnostic branch can run. The container still fast-fails, but operators saw no explanation. Switch to the conditional `if ! compgen -G ... >/dev/null 2>&1; then` pattern already used at the two other compgen call sites in this file (bootstrap_factory_repo and the PROJECT_NAME parser). The count for the success-path log is computed after we've confirmed at least one match. Co-Authored-By: Claude Opus 4.6 (1M context) --- docker/agents/entrypoint.sh | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/docker/agents/entrypoint.sh b/docker/agents/entrypoint.sh index 89a520b..f838c15 100644 --- a/docker/agents/entrypoint.sh +++ b/docker/agents/entrypoint.sh @@ -346,15 +346,19 @@ bootstrap_factory_repo # This prevents the silent-zombie mode where the polling loop matches zero files # and does nothing forever. validate_projects_dir() { - local toml_count - toml_count=$(compgen -G "${DISINTO_DIR}/projects/*.toml" 2>/dev/null | wc -l) - if [ "$toml_count" -eq 0 ]; then + # NOTE: compgen -G exits non-zero when no matches exist, so piping it through + # `wc -l` under `set -eo pipefail` aborts the script before the FATAL branch + # can log a diagnostic (#877). Use the conditional form already adopted at + # lines above (see bootstrap_factory_repo, PROJECT_NAME parsing). + if ! compgen -G "${DISINTO_DIR}/projects/*.toml" >/dev/null 2>&1; then log "FATAL: No real .toml files found in ${DISINTO_DIR}/projects/" log "Expected at least one project config file (e.g., disinto.toml)" log "The directory only contains *.toml.example template files." log "Mount the host ./projects volume or copy real .toml files into the container." exit 1 fi + local toml_count + toml_count=$(compgen -G "${DISINTO_DIR}/projects/*.toml" | wc -l) log "Projects directory validated: ${toml_count} real .toml file(s) found" } From 96870d9f3035697194cb123abdb75e10d430ed42 Mon Sep 17 00:00:00 2001 From: Agent Date: Thu, 16 Apr 2026 18:21:41 +0000 Subject: [PATCH 04/93] fix: fix: vault_request RETURN trap fires prematurely when vault-env.sh is sourced (#773) --- lib/action-vault.sh | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/lib/action-vault.sh b/lib/action-vault.sh index 6348cc6..7602a39 100644 --- a/lib/action-vault.sh +++ b/lib/action-vault.sh @@ -128,7 +128,6 @@ vault_request() { # Validate TOML content local tmp_toml tmp_toml=$(mktemp /tmp/vault-XXXXXX.toml) - trap 'rm -f "$tmp_toml"' RETURN printf '%s' "$toml_content" > "$tmp_toml" @@ -136,6 +135,7 @@ vault_request() { local vault_env="${FACTORY_ROOT:-$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)}/action-vault/vault-env.sh" if [ ! -f "$vault_env" ]; then echo "ERROR: vault-env.sh not found at $vault_env" >&2 + rm -f "$tmp_toml" return 1 fi @@ -145,11 +145,15 @@ vault_request() { if ! 
source "$vault_env"; then FORGE_TOKEN="${_saved_forge_token:-}" echo "ERROR: failed to source vault-env.sh" >&2 + rm -f "$tmp_toml" return 1 fi # Restore caller's FORGE_TOKEN after validation FORGE_TOKEN="${_saved_forge_token:-}" + # Set trap AFTER sourcing vault-env.sh to avoid RETURN trap firing during source + trap 'rm -f "$tmp_toml"' RETURN + # Run validation if ! validate_vault_action "$tmp_toml"; then echo "ERROR: TOML validation failed" >&2 From 28eb182487c3f9ad2fe4918f7c0390a090adb583 Mon Sep 17 00:00:00 2001 From: dev-qwen2 Date: Thu, 16 Apr 2026 18:40:35 +0000 Subject: [PATCH 05/93] fix: Two parallel activation paths for llama agents (ENABLE_LLAMA_AGENT vs [agents.X] TOML) (#846) --- .env.example | 14 +-- bin/disinto | 14 --- docker/agents/entrypoint.sh | 32 +++++++ docs/agents-llama.md | 5 +- lib/forge-setup.sh | 166 ------------------------------------ lib/generators.sh | 130 ---------------------------- 6 files changed, 38 insertions(+), 323 deletions(-) diff --git a/.env.example b/.env.example index c1c0b98..a1f24d5 100644 --- a/.env.example +++ b/.env.example @@ -32,13 +32,10 @@ FORGE_URL=http://localhost:3000 # [CONFIG] local Forgejo instance # - FORGE_PASS_DEV_QWEN2 # Name conversion: tr 'a-z-' 'A-Z_' (lowercase→UPPER, hyphens→underscores). # The compose generator looks these up via the agent's `forge_user` field in -# the project TOML. The pre-existing `dev-qwen` llama agent uses -# FORGE_TOKEN_LLAMA / FORGE_PASS_LLAMA (kept for backwards-compat with the -# legacy `ENABLE_LLAMA_AGENT=1` single-agent path). +# the project TOML. Configure local-model agents via [agents.X] sections in +# projects/*.toml — this is the canonical activation path. FORGE_TOKEN= # [SECRET] dev-bot API token (default for all agents) FORGE_PASS= # [SECRET] dev-bot password for git HTTP push (#361) -FORGE_TOKEN_LLAMA= # [SECRET] dev-qwen API token (for agents-llama) -FORGE_PASS_LLAMA= # [SECRET] dev-qwen password for git HTTP push FORGE_REVIEW_TOKEN= # [SECRET] review-bot API token FORGE_REVIEW_PASS= # [SECRET] review-bot password for git HTTP push FORGE_PLANNER_TOKEN= # [SECRET] planner-bot API token @@ -107,13 +104,6 @@ FORWARD_AUTH_SECRET= # [SECRET] Shared secret for Caddy ↔ # Store all project secrets here so formulas reference env vars, never hardcode. BASE_RPC_URL= # [SECRET] on-chain RPC endpoint -# ── Local Qwen dev agent (optional) ────────────────────────────────────── -# Set ENABLE_LLAMA_AGENT=1 to emit agents-llama in docker-compose.yml. -# Requires a running llama-server reachable at ANTHROPIC_BASE_URL. -# See docs/agents-llama.md for details. -ENABLE_LLAMA_AGENT=0 # [CONFIG] 1 = enable agents-llama service -ANTHROPIC_BASE_URL= # [CONFIG] e.g. 
http://host.docker.internal:8081 - # ── Tuning ──────────────────────────────────────────────────────────────── CLAUDE_TIMEOUT=7200 # [CONFIG] max seconds per Claude invocation diff --git a/bin/disinto b/bin/disinto index 6128b7c..c6c2421 100755 --- a/bin/disinto +++ b/bin/disinto @@ -977,7 +977,6 @@ p.write_text(text) echo "" echo "[ensure] Forgejo admin user 'disinto-admin'" echo "[ensure] 8 bot users: dev-bot, review-bot, planner-bot, gardener-bot, vault-bot, supervisor-bot, predictor-bot, architect-bot" - echo "[ensure] 2 llama bot users: dev-qwen, dev-qwen-nightly" echo "[ensure] .profile repos for all bots" echo "[ensure] repo ${forge_repo} on Forgejo with collaborators" echo "[run] preflight checks" @@ -1173,19 +1172,6 @@ p.write_text(text) echo "Config: CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1 saved to .env" fi - # Write local-Qwen dev agent env keys with safe defaults (#769) - if ! grep -q '^ENABLE_LLAMA_AGENT=' "$env_file" 2>/dev/null; then - cat >> "$env_file" <<'LLAMAENVEOF' - -# Local Qwen dev agent (optional) — set to 1 to enable -ENABLE_LLAMA_AGENT=0 -FORGE_TOKEN_LLAMA= -FORGE_PASS_LLAMA= -ANTHROPIC_BASE_URL= -LLAMAENVEOF - echo "Config: ENABLE_LLAMA_AGENT keys written to .env (disabled by default)" - fi - # Create labels on remote create_labels "$forge_repo" "$forge_url" diff --git a/docker/agents/entrypoint.sh b/docker/agents/entrypoint.sh index f838c15..7c58674 100644 --- a/docker/agents/entrypoint.sh +++ b/docker/agents/entrypoint.sh @@ -17,6 +17,38 @@ set -euo pipefail # - predictor: every 24 hours (288 iterations * 5 min) # - supervisor: every SUPERVISOR_INTERVAL seconds (default: 1200 = 20 min) +# ── Migration check: reject ENABLE_LLAMA_AGENT ─────────────────────────────── +# #846: The legacy ENABLE_LLAMA_AGENT env flag is no longer supported. +# Activation is now done exclusively via [agents.X] sections in project TOML. +# If this legacy flag is detected, fail immediately with a migration message. +if [ "${ENABLE_LLAMA_AGENT:-}" = "1" ]; then + cat <<'MIGRATION_ERR' +FATAL: ENABLE_LLAMA_AGENT is no longer supported. + +The legacy ENABLE_LLAMA_AGENT=1 flag has been removed (#846). +Activation is now done exclusively via [agents.X] sections in projects/*.toml. + +To migrate: + 1. Remove ENABLE_LLAMA_AGENT from your .env or .env.enc file + 2. Add an [agents.] section to your project TOML: + + [agents.dev-qwen] + base_url = "http://your-llama-server:8081" + model = "unsloth/Qwen3.5-35B-A3B" + api_key = "sk-no-key-required" + roles = ["dev"] + forge_user = "dev-qwen" + compact_pct = 60 + poll_interval = 60 + + 3. Run: disinto init + 4. Start the agent: docker compose up -d agents-dev-qwen + +See docs/agents-llama.md for full details. +MIGRATION_ERR + exit 1 +fi + DISINTO_BAKED="/home/agent/disinto" DISINTO_LIVE="/home/agent/repos/_factory" DISINTO_DIR="$DISINTO_BAKED" # start with baked copy; switched to live checkout after bootstrap diff --git a/docs/agents-llama.md b/docs/agents-llama.md index bc973b7..b3a1334 100644 --- a/docs/agents-llama.md +++ b/docs/agents-llama.md @@ -2,9 +2,12 @@ Local-model agents run the same agent code as the Claude-backed agents, but connect to a local llama-server (or compatible OpenAI-API endpoint) instead of -the Anthropic API. This document describes the current activation flow using +the Anthropic API. This document describes the canonical activation flow using `disinto hire-an-agent` and `[agents.X]` TOML configuration. +> **Note:** The legacy `ENABLE_LLAMA_AGENT=1` env flag has been removed (#846). 
+> Activation is now done exclusively via `[agents.X]` sections in project TOML. + ## Overview Local-model agents are configured via `[agents.]` sections in diff --git a/lib/forge-setup.sh b/lib/forge-setup.sh index 2b7b697..2f8b117 100644 --- a/lib/forge-setup.sh +++ b/lib/forge-setup.sh @@ -356,16 +356,6 @@ setup_forge() { [predictor-bot]="FORGE_PREDICTOR_PASS" [architect-bot]="FORGE_ARCHITECT_PASS" ) - # Llama bot users (local-model agents) — separate from main agents - # Each llama agent gets its own Forgejo user, token, and password - local -A llama_token_vars=( - [dev-qwen]="FORGE_TOKEN_LLAMA" - [dev-qwen-nightly]="FORGE_TOKEN_LLAMA_NIGHTLY" - ) - local -A llama_pass_vars=( - [dev-qwen]="FORGE_PASS_LLAMA" - [dev-qwen-nightly]="FORGE_PASS_LLAMA_NIGHTLY" - ) local bot_user bot_pass token token_var pass_var @@ -515,159 +505,12 @@ setup_forge() { fi done - # Create llama bot users and tokens (local-model agents) - # These are separate from the main agents and get their own credentials - echo "" - echo "── Setting up llama bot users ────────────────────────────" - - local llama_user llama_pass llama_token llama_token_var llama_pass_var - for llama_user in "${!llama_token_vars[@]}"; do - llama_token_var="${llama_token_vars[$llama_user]}" - llama_pass_var="${llama_pass_vars[$llama_user]}" - - # Check if token already exists in .env - local token_exists=false - if _token_exists_in_env "$llama_token_var" "$env_file"; then - token_exists=true - fi - - # Check if password already exists in .env - local pass_exists=false - if _pass_exists_in_env "$llama_pass_var" "$env_file"; then - pass_exists=true - fi - - # Check if llama bot user exists on Forgejo - local llama_user_exists=false - if curl -sf --max-time 5 \ - -H "Authorization: token ${admin_token}" \ - "${forge_url}/api/v1/users/${llama_user}" >/dev/null 2>&1; then - llama_user_exists=true - fi - - # Skip token/password regeneration if both exist in .env and not forcing rotation - if [ "$token_exists" = true ] && [ "$pass_exists" = true ] && [ "$rotate_tokens" = false ]; then - echo " ${llama_user} token and password preserved (use --rotate-tokens to force)" - # Still export the existing token for use within this run - local existing_token existing_pass - existing_token=$(grep "^${llama_token_var}=" "$env_file" | head -1 | cut -d= -f2-) - existing_pass=$(grep "^${llama_pass_var}=" "$env_file" | head -1 | cut -d= -f2-) - export "${llama_token_var}=${existing_token}" - export "${llama_pass_var}=${existing_pass}" - continue - fi - - # Generate new credentials if: - # - Token doesn't exist (first run) - # - Password doesn't exist (first run) - # - --rotate-tokens flag is set (explicit rotation) - if [ "$llama_user_exists" = false ]; then - # User doesn't exist - create it - llama_pass="llama-$(head -c 16 /dev/urandom | base64 | tr -dc 'a-zA-Z0-9' | head -c 20)" - echo "Creating llama bot user: ${llama_user}" - local create_output - if ! create_output=$(_forgejo_exec forgejo admin user create \ - --username "${llama_user}" \ - --password "${llama_pass}" \ - --email "${llama_user}@disinto.local" \ - --must-change-password=false 2>&1); then - echo "Error: failed to create llama bot user '${llama_user}':" >&2 - echo " ${create_output}" >&2 - exit 1 - fi - # Forgejo 11.x ignores --must-change-password=false on create; - # explicitly clear the flag so basic-auth token creation works. 
- _forgejo_exec forgejo admin user change-password \ - --username "${llama_user}" \ - --password "${llama_pass}" \ - --must-change-password=false - - # Verify llama bot user was actually created - if ! curl -sf --max-time 5 \ - -H "Authorization: token ${admin_token}" \ - "${forge_url}/api/v1/users/${llama_user}" >/dev/null 2>&1; then - echo "Error: llama bot user '${llama_user}' not found after creation" >&2 - exit 1 - fi - echo " ${llama_user} user created" - else - # User exists - reset password if needed - echo " ${llama_user} user exists" - if [ "$rotate_tokens" = true ] || [ "$pass_exists" = false ]; then - llama_pass="llama-$(head -c 16 /dev/urandom | base64 | tr -dc 'a-zA-Z0-9' | head -c 20)" - _forgejo_exec forgejo admin user change-password \ - --username "${llama_user}" \ - --password "${llama_pass}" \ - --must-change-password=false || { - echo "Error: failed to reset password for existing llama bot user '${llama_user}'" >&2 - exit 1 - } - echo " ${llama_user} password reset for token generation" - else - # Password exists, get it from .env - llama_pass=$(grep "^${llama_pass_var}=" "$env_file" | head -1 | cut -d= -f2-) - fi - fi - - # Generate token via API (basic auth as the llama user) - # First, delete any existing tokens to avoid name collision - local existing_llama_token_ids - existing_llama_token_ids=$(curl -sf \ - -u "${llama_user}:${llama_pass}" \ - "${forge_url}/api/v1/users/${llama_user}/tokens" 2>/dev/null \ - | jq -r '.[].id // empty' 2>/dev/null) || existing_llama_token_ids="" - - # Delete any existing tokens for this user - if [ -n "$existing_llama_token_ids" ]; then - while IFS= read -r tid; do - [ -n "$tid" ] && curl -sf -X DELETE \ - -u "${llama_user}:${llama_pass}" \ - "${forge_url}/api/v1/users/${llama_user}/tokens/${tid}" >/dev/null 2>&1 || true - done <<< "$existing_llama_token_ids" - fi - - llama_token=$(curl -sf -X POST \ - -u "${llama_user}:${llama_pass}" \ - -H "Content-Type: application/json" \ - "${forge_url}/api/v1/users/${llama_user}/tokens" \ - -d "{\"name\":\"disinto-${llama_user}-token\",\"scopes\":[\"all\"]}" 2>/dev/null \ - | jq -r '.sha1 // empty') || llama_token="" - - if [ -z "$llama_token" ]; then - echo "Error: failed to create API token for '${llama_user}'" >&2 - exit 1 - fi - - # Store token in .env under the llama-specific variable name - if grep -q "^${llama_token_var}=" "$env_file" 2>/dev/null; then - sed -i "s|^${llama_token_var}=.*|${llama_token_var}=${llama_token}|" "$env_file" - else - printf '%s=%s\n' "$llama_token_var" "$llama_token" >> "$env_file" - fi - export "${llama_token_var}=${llama_token}" - echo " ${llama_user} token generated and saved (${llama_token_var})" - - # Store password in .env for git HTTP push (#361) - # Forgejo 11.x API tokens don't work for git push; password auth does. 
- if grep -q "^${llama_pass_var}=" "$env_file" 2>/dev/null; then - sed -i "s|^${llama_pass_var}=.*|${llama_pass_var}=${llama_pass}|" "$env_file" - else - printf '%s=%s\n' "$llama_pass_var" "$llama_pass" >> "$env_file" - fi - export "${llama_pass_var}=${llama_pass}" - echo " ${llama_user} password saved (${llama_pass_var})" - done - # Create .profile repos for all bot users (if they don't already exist) # This runs the same logic as hire-an-agent Step 2-3 for idempotent setup echo "" echo "── Setting up .profile repos ────────────────────────────" local -a bot_users=(dev-bot review-bot planner-bot gardener-bot vault-bot supervisor-bot predictor-bot architect-bot) - # Add llama bot users to .profile repo creation - for llama_user in "${!llama_token_vars[@]}"; do - bot_users+=("$llama_user") - done local bot_user for bot_user in "${bot_users[@]}"; do @@ -775,15 +618,6 @@ setup_forge() { -d "{\"permission\":\"${bot_perm}\"}" >/dev/null 2>&1 || true done - # Add llama bot users as write collaborators for local-model agents - for llama_user in "${!llama_token_vars[@]}"; do - curl -sf -X PUT \ - -H "Authorization: token ${admin_token:-${FORGE_TOKEN}}" \ - -H "Content-Type: application/json" \ - "${forge_url}/api/v1/repos/${repo_slug}/collaborators/${llama_user}" \ - -d '{"permission":"write"}' >/dev/null 2>&1 || true - done - # Add disinto-admin as admin collaborator curl -sf -X PUT \ -H "Authorization: token ${admin_token:-${FORGE_TOKEN}}" \ diff --git a/lib/generators.sh b/lib/generators.sh index 3f88e39..0df5725 100644 --- a/lib/generators.sh +++ b/lib/generators.sh @@ -438,136 +438,6 @@ services: COMPOSEEOF - # ── Conditional agents-llama block (ENABLE_LLAMA_AGENT=1) ────────────── - # Local-Qwen dev agent — gated on ENABLE_LLAMA_AGENT so factories without - # a local llama endpoint don't try to start it. See docs/agents-llama.md. - if [ "${ENABLE_LLAMA_AGENT:-0}" = "1" ]; then - cat >> "$compose_file" <<'LLAMAEOF' - - agents-llama: - build: - context: . - dockerfile: docker/agents/Dockerfile - # Rebuild on every up (#887): makes docker/agents/ source changes reach this - # container without a manual \`docker compose build\`. Cache-fast when clean. 
- pull_policy: build - container_name: disinto-agents-llama - restart: unless-stopped - security_opt: - - apparmor=unconfined - volumes: - - agent-data:/home/agent/data - - project-repos:/home/agent/repos - - ${CLAUDE_SHARED_DIR:-/var/lib/disinto/claude-shared}:${CLAUDE_SHARED_DIR:-/var/lib/disinto/claude-shared} - - ${CLAUDE_CONFIG_FILE:-${HOME}/.claude.json}:/home/agent/.claude.json:ro - - ${CLAUDE_BIN_DIR}:/usr/local/bin/claude:ro - - ${AGENT_SSH_DIR:-${HOME}/.ssh}:/home/agent/.ssh:ro - - ${SOPS_AGE_DIR:-${HOME}/.config/sops/age}:/home/agent/.config/sops/age:ro - - woodpecker-data:/woodpecker-data:ro - environment: - FORGE_URL: http://forgejo:3000 - FORGE_REPO: ${FORGE_REPO:-disinto-admin/disinto} - FORGE_TOKEN: ${FORGE_TOKEN_LLAMA:-} - FORGE_PASS: ${FORGE_PASS_LLAMA:-} - FORGE_BOT_USERNAMES: ${FORGE_BOT_USERNAMES:-} - WOODPECKER_TOKEN: ${WOODPECKER_TOKEN:-} - CLAUDE_TIMEOUT: ${CLAUDE_TIMEOUT:-7200} - CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC: ${CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC:-1} - CLAUDE_AUTOCOMPACT_PCT_OVERRIDE: "60" - ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY:-} - ANTHROPIC_BASE_URL: ${ANTHROPIC_BASE_URL:-} - FORGE_ADMIN_PASS: ${FORGE_ADMIN_PASS:-} - DISINTO_CONTAINER: "1" - PROJECT_NAME: ${PROJECT_NAME:-project} - PROJECT_REPO_ROOT: /home/agent/repos/${PROJECT_NAME:-project} - WOODPECKER_DATA_DIR: /woodpecker-data - WOODPECKER_REPO_ID: "PLACEHOLDER_WP_REPO_ID" - CLAUDE_CONFIG_DIR: ${CLAUDE_CONFIG_DIR:-/var/lib/disinto/claude-shared/config} - POLL_INTERVAL: ${POLL_INTERVAL:-300} - AGENT_ROLES: dev - healthcheck: - test: ["CMD", "pgrep", "-f", "entrypoint.sh"] - interval: 60s - timeout: 5s - retries: 3 - start_period: 30s - depends_on: - forgejo: - condition: service_healthy - networks: - - disinto-net - - agents-llama-all: - build: - context: . - dockerfile: docker/agents/Dockerfile - # Rebuild on every up (#887): makes docker/agents/ source changes reach this - # container without a manual \`docker compose build\`. Cache-fast when clean. 
- pull_policy: build - container_name: disinto-agents-llama-all - restart: unless-stopped - profiles: ["agents-llama-all"] - security_opt: - - apparmor=unconfined - volumes: - - agent-data:/home/agent/data - - project-repos:/home/agent/repos - - ${CLAUDE_SHARED_DIR:-/var/lib/disinto/claude-shared}:${CLAUDE_SHARED_DIR:-/var/lib/disinto/claude-shared} - - ${CLAUDE_CONFIG_FILE:-${HOME}/.claude.json}:/home/agent/.claude.json:ro - - ${CLAUDE_BIN_DIR}:/usr/local/bin/claude:ro - - ${AGENT_SSH_DIR:-${HOME}/.ssh}:/home/agent/.ssh:ro - - ${SOPS_AGE_DIR:-${HOME}/.config/sops/age}:/home/agent/.config/sops/age:ro - - woodpecker-data:/woodpecker-data:ro - environment: - FORGE_URL: http://forgejo:3000 - FORGE_REPO: ${FORGE_REPO:-disinto-admin/disinto} - FORGE_TOKEN: ${FORGE_TOKEN_LLAMA:-} - FORGE_PASS: ${FORGE_PASS_LLAMA:-} - FORGE_REVIEW_TOKEN: ${FORGE_REVIEW_TOKEN:-} - FORGE_PLANNER_TOKEN: ${FORGE_PLANNER_TOKEN:-} - FORGE_GARDENER_TOKEN: ${FORGE_GARDENER_TOKEN:-} - FORGE_VAULT_TOKEN: ${FORGE_VAULT_TOKEN:-} - FORGE_SUPERVISOR_TOKEN: ${FORGE_SUPERVISOR_TOKEN:-} - FORGE_PREDICTOR_TOKEN: ${FORGE_PREDICTOR_TOKEN:-} - FORGE_ARCHITECT_TOKEN: ${FORGE_ARCHITECT_TOKEN:-} - FORGE_FILER_TOKEN: ${FORGE_FILER_TOKEN:-} - FORGE_BOT_USERNAMES: ${FORGE_BOT_USERNAMES:-} - WOODPECKER_TOKEN: ${WOODPECKER_TOKEN:-} - CLAUDE_TIMEOUT: ${CLAUDE_TIMEOUT:-7200} - CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC: ${CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC:-1} - CLAUDE_AUTOCOMPACT_PCT_OVERRIDE: "60" - CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS: "1" - ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY:-} - ANTHROPIC_BASE_URL: ${ANTHROPIC_BASE_URL:-} - FORGE_ADMIN_PASS: ${FORGE_ADMIN_PASS:-} - DISINTO_CONTAINER: "1" - PROJECT_NAME: ${PROJECT_NAME:-project} - PROJECT_REPO_ROOT: /home/agent/repos/${PROJECT_NAME:-project} - WOODPECKER_DATA_DIR: /woodpecker-data - WOODPECKER_REPO_ID: "PLACEHOLDER_WP_REPO_ID" - CLAUDE_CONFIG_DIR: ${CLAUDE_CONFIG_DIR:-/var/lib/disinto/claude-shared/config} - POLL_INTERVAL: ${POLL_INTERVAL:-300} - GARDENER_INTERVAL: ${GARDENER_INTERVAL:-21600} - ARCHITECT_INTERVAL: ${ARCHITECT_INTERVAL:-21600} - PLANNER_INTERVAL: ${PLANNER_INTERVAL:-43200} - SUPERVISOR_INTERVAL: ${SUPERVISOR_INTERVAL:-1200} - AGENT_ROLES: review,dev,gardener,architect,planner,predictor,supervisor - healthcheck: - test: ["CMD", "pgrep", "-f", "entrypoint.sh"] - interval: 60s - timeout: 5s - retries: 3 - start_period: 30s - depends_on: - forgejo: - condition: service_healthy - woodpecker: - condition: service_started - networks: - - disinto-net -LLAMAEOF - fi - # Resume the rest of the compose file (runner onward) cat >> "$compose_file" <<'COMPOSEEOF' From e003829eaa444b2a5802a9f2a9ac8e88261fc863 Mon Sep 17 00:00:00 2001 From: dev-qwen2 Date: Thu, 16 Apr 2026 19:05:43 +0000 Subject: [PATCH 06/93] fix: Remove agents-llama service references from docs and formulas (#846) - AGENTS.md: Replace agents-llama and agents-llama-all rows with generic 'Local-model agents' entry pointing to docs/agents-llama.md - formulas/release.sh: Remove agents-llama from docker compose stop/up commands (line 181-182) - formulas/release.toml: Remove agents-llama references from restart-agents step description (lines 192, 195, 206) These changes complete the removal of the legacy ENABLE_LLAMA_AGENT activation path. The release formula now only references the 'agents' service, which is the only service that exists after disinto init regenerates docker-compose.yml based on [agents.X] TOML sections. 
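A quick way to confirm nothing was missed (a suggested check, not part
of this patch): `grep -rn agents-llama AGENTS.md formulas/` should
return no matches after this change.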
--- AGENTS.md | 3 +-- formulas/release.sh | 4 ++-- formulas/release.toml | 6 +++--- 3 files changed, 6 insertions(+), 7 deletions(-) diff --git a/AGENTS.md b/AGENTS.md index ef5f00d..ad3867b 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -122,8 +122,7 @@ bash dev/phase-test.sh | Reproduce | `docker/reproduce/` | Bug reproduction using Playwright MCP | `formulas/reproduce.toml` | | Triage | `docker/reproduce/` | Deep root cause analysis | `formulas/triage.toml` | | Edge dispatcher | `docker/edge/` | Polls ops repo for vault actions, executes via Claude sessions | `docker/edge/dispatcher.sh` | -| agents-llama | `docker/agents/` (same image) | Local-Qwen dev agent (`AGENT_ROLES=dev`), gated on `ENABLE_LLAMA_AGENT=1` | [docs/agents-llama.md](docs/agents-llama.md) | -| agents-llama-all | `docker/agents/` (same image) | Local-Qwen all-roles agent (all 7 roles), profile `agents-llama-all` | [docs/agents-llama.md](docs/agents-llama.md) | +| Local-model agents | `docker/agents/` (same image) | Local llama-server agents configured via `[agents.X]` sections in project TOML | [docs/agents-llama.md](docs/agents-llama.md) | > **Vault:** Being redesigned as a PR-based approval workflow (issues #73-#77). > See [docs/VAULT.md](docs/VAULT.md) for the vault PR workflow details. diff --git a/formulas/release.sh b/formulas/release.sh index b8c4eb6..6526d1a 100644 --- a/formulas/release.sh +++ b/formulas/release.sh @@ -178,8 +178,8 @@ log "Tagged disinto/agents:${RELEASE_VERSION}" log "Step 6/6: Restarting agent containers" -docker compose stop agents agents-llama 2>/dev/null || true -docker compose up -d agents agents-llama +docker compose stop agents 2>/dev/null || true +docker compose up -d agents log "Agent containers restarted" # ── Done ───────────────────────────────────────────────────────────────── diff --git a/formulas/release.toml b/formulas/release.toml index f702f42..ccd7f95 100644 --- a/formulas/release.toml +++ b/formulas/release.toml @@ -189,10 +189,10 @@ Restart agent containers to use the new image. - docker compose pull agents 2. Stop and remove existing agent containers: - - docker compose down agents agents-llama 2>/dev/null || true + - docker compose down agents 3. Start agents with new image: - - docker compose up -d agents agents-llama + - docker compose up -d agents 4. Wait for containers to be healthy: - for i in {1..30}; do @@ -203,7 +203,7 @@ Restart agent containers to use the new image. - done 5. Verify containers are running: - - docker compose ps agents agents-llama + - docker compose ps agents 6. Log restart: - echo "Restarted agents containers" From aa3782748d103a2118ba402d67ad3034bbb727cd Mon Sep 17 00:00:00 2001 From: Claude Date: Thu, 16 Apr 2026 19:04:04 +0000 Subject: [PATCH 07/93] =?UTF-8?q?fix:=20[nomad-step-2]=20S2.5=20=E2=80=94?= =?UTF-8?q?=20bin/disinto=20init=20--import-env=20/=20--import-sops=20/=20?= =?UTF-8?q?--age-key=20wire-up=20(#883)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Wire the Step-2 building blocks (import, auth, policies) into `disinto init --backend=nomad` so a single command on a fresh LXC provisions cluster + policies + auth + imports secrets + deploys services. Adds three flags to `disinto init --backend=nomad`: --import-env PATH plaintext .env from old stack --import-sops PATH sops-encrypted .env.vault.enc (requires --age-key) --age-key PATH age keyfile to decrypt --import-sops Flow: cluster-up.sh → vault-apply-policies.sh → vault-nomad-auth.sh → (optional) vault-import.sh → deploy.sh. 
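Condensed, that flow is the whole of the new dispatch — a shape-only sketch, not the shipped code; the real `_disinto_init_nomad` adds dry-run handling, the root-vs-`sudo -n` dispatch, and per-script existence checks:

```bash
# disinto init --backend=nomad, S2.5 ordering (sketch; paths from this patch).
"$FACTORY_ROOT/lib/init/nomad/cluster-up.sh"        # S0   — always
"$FACTORY_ROOT/tools/vault-apply-policies.sh"       # S2.1 — always, idempotent
"$FACTORY_ROOT/lib/init/nomad/vault-nomad-auth.sh"  # S2.3 — always, idempotent
if [ -n "$import_env" ] || [ -n "$import_sops" ]; then
  "$FACTORY_ROOT/tools/vault-import.sh" \
    ${import_env:+--env "$import_env"} \
    ${import_sops:+--sops "$import_sops"} \
    ${age_key:+--age-key "$age_key"}                # S2.2 — only with --import-*
fi
if [ -n "$with_services" ]; then
  "$FACTORY_ROOT/lib/init/nomad/deploy.sh" "$with_services"  # S1 — only with --with
fi
```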
Policies + auth run on every nomad real-run path (idempotent); import runs only when --import-* is set; all layers safe to re-run. Flag validation: --import-sops without --age-key → error --age-key without --import-sops → error --import-env alone (no sops) → OK --backend=docker + any --import-* → error Dry-run prints a five-section plan (cluster-up + policies + auth + import + deploy) with every argv that would be executed; touches nothing, logs no secret values. Dry-run output prints one line per --import-* flag that is actually set — not in an if/elif chain — so all three paths appear when all three flags are passed. Prior attempts regressed this invariant. Tests: tests/disinto-init-nomad.bats +10 cases covering flag validation, dry-run plan shape (each flag prints its own path), policies+auth always-on (without --import-*), and --flag=value form. Docs: docs/nomad-migration.md new file — cutover-day runbook with invocation shape, flag summary, idempotency contract, dry-run, and secret-hygiene notes. Co-Authored-By: Claude Opus 4.6 (1M context) --- bin/disinto | 153 +++++++++++++++++++++++++++++++++- docs/nomad-migration.md | 121 +++++++++++++++++++++++++++ tests/disinto-init-nomad.bats | 89 ++++++++++++++++++++ 3 files changed, 360 insertions(+), 3 deletions(-) create mode 100644 docs/nomad-migration.md diff --git a/bin/disinto b/bin/disinto index c6c2421..6591a5c 100755 --- a/bin/disinto +++ b/bin/disinto @@ -89,6 +89,9 @@ Init options: --yes Skip confirmation prompts --rotate-tokens Force regeneration of all bot tokens/passwords (idempotent by default) --dry-run Print every intended action without executing + --import-env (nomad) Path to .env file for import into Vault KV (S2.5) + --import-sops (nomad) Path to sops-encrypted .env.vault.enc for import (S2.5) + --age-key (nomad) Path to age keyfile (required with --import-sops) (S2.5) Hire an agent options: --formula Path to role formula TOML (default: formulas/.toml) @@ -664,8 +667,12 @@ prompt_admin_password() { # `sudo disinto init ...` directly. _disinto_init_nomad() { local dry_run="${1:-false}" empty="${2:-false}" with_services="${3:-}" + local import_env="${4:-}" import_sops="${5:-}" age_key="${6:-}" local cluster_up="${FACTORY_ROOT}/lib/init/nomad/cluster-up.sh" local deploy_sh="${FACTORY_ROOT}/lib/init/nomad/deploy.sh" + local vault_policies_sh="${FACTORY_ROOT}/tools/vault-apply-policies.sh" + local vault_auth_sh="${FACTORY_ROOT}/lib/init/nomad/vault-nomad-auth.sh" + local vault_import_sh="${FACTORY_ROOT}/tools/vault-import.sh" if [ ! -x "$cluster_up" ]; then echo "Error: ${cluster_up} not found or not executable" >&2 @@ -677,6 +684,27 @@ _disinto_init_nomad() { exit 1 fi + # Step 2/3/4 scripts must exist as soon as any --import-* flag is set, + # since we unconditionally invoke policies+auth and optionally import. + local import_any=false + if [ -n "$import_env" ] || [ -n "$import_sops" ]; then + import_any=true + fi + if [ "$import_any" = true ]; then + if [ ! -x "$vault_policies_sh" ]; then + echo "Error: ${vault_policies_sh} not found or not executable" >&2 + exit 1 + fi + if [ ! -x "$vault_auth_sh" ]; then + echo "Error: ${vault_auth_sh} not found or not executable" >&2 + exit 1 + fi + if [ ! -x "$vault_import_sh" ]; then + echo "Error: ${vault_import_sh} not found or not executable" >&2 + exit 1 + fi + fi + # --empty and default both invoke cluster-up today. Log the requested # mode so the dispatch is visible in factory bootstrap logs — Step 1 # will branch on $empty to gate the job-deployment path. 
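Before the dry-run diff below, the invariant it encodes is easiest to see in isolation: independent guards print one line per set flag, where the regressed if/elif shape drops all but the first match. A minimal sketch with the same variable names as the patch:

```bash
# Correct (what this patch ships): every set flag prints its own line.
[ -n "$import_env" ]  && echo "[import] env file: ${import_env}"
[ -n "$import_sops" ] && echo "[import] sops file: ${import_sops}"
[ -n "$age_key" ]     && echo "[import] age key: ${age_key}"

# Regressed shape from prior attempts: with all three flags set, only
# the first branch fires and two paths vanish from the plan.
if [ -n "$import_env" ]; then
  echo "[import] env file: ${import_env}"
elif [ -n "$import_sops" ]; then
  echo "[import] sops file: ${import_sops}"
fi
```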
@@ -686,7 +714,7 @@ _disinto_init_nomad() { echo "nomad backend: default (cluster-up; jobs deferred to Step 1)" fi - # Dry-run: print cluster-up plan + deploy.sh plan + # Dry-run: print cluster-up plan + policies/auth/import plan + deploy.sh plan if [ "$dry_run" = "true" ]; then echo "" echo "── Cluster-up dry-run ─────────────────────────────────" @@ -694,6 +722,38 @@ _disinto_init_nomad() { "${cmd[@]}" || true echo "" + # Vault policies + auth are invoked on every nomad real-run path + # regardless of --import-* flags (they're idempotent; S2.1 + S2.3). + # Mirror that ordering in the dry-run plan so the operator sees the + # full sequence Step 2 will execute. + echo "── Vault policies dry-run ─────────────────────────────" + echo "[policies] [dry-run] ${vault_policies_sh} --dry-run" + echo "" + echo "── Vault auth dry-run ─────────────────────────────────" + echo "[auth] [dry-run] ${vault_auth_sh}" + echo "" + + # Import plan: one line per --import-* flag that is actually set. + # Printing independently (not in an if/elif chain) means that all + # three flags appearing together each echo their own path — the + # regression that bit prior implementations of this issue (#883). + if [ "$import_any" = true ]; then + echo "── Vault import dry-run ───────────────────────────────" + [ -n "$import_env" ] && echo "[import] --import-env env file: ${import_env}" + [ -n "$import_sops" ] && echo "[import] --import-sops sops file: ${import_sops}" + [ -n "$age_key" ] && echo "[import] --age-key age key: ${age_key}" + local -a import_dry_cmd=("$vault_import_sh") + [ -n "$import_env" ] && import_dry_cmd+=("--env" "$import_env") + [ -n "$import_sops" ] && import_dry_cmd+=("--sops" "$import_sops") + [ -n "$age_key" ] && import_dry_cmd+=("--age-key" "$age_key") + import_dry_cmd+=("--dry-run") + echo "[import] [dry-run] ${import_dry_cmd[*]}" + echo "" + else + echo "[import] no --import-env/--import-sops — skipping; set them or seed kv/disinto/* manually before deploying secret-dependent services" + echo "" + fi + if [ -n "$with_services" ]; then echo "── Deploy services dry-run ────────────────────────────" echo "[deploy] services to deploy: ${with_services}" @@ -721,7 +781,7 @@ _disinto_init_nomad() { exit 0 fi - # Real run: cluster-up + deploy services + # Real run: cluster-up + policies + auth + (optional) import + deploy local -a cluster_cmd=("$cluster_up") if [ "$(id -u)" -eq 0 ]; then "${cluster_cmd[@]}" || exit $? @@ -733,6 +793,56 @@ _disinto_init_nomad() { sudo -n -- "${cluster_cmd[@]}" || exit $? fi + # Apply Vault policies (S2.1) — idempotent, safe to re-run. + echo "" + echo "── Applying Vault policies ────────────────────────────" + local -a policies_cmd=("$vault_policies_sh") + if [ "$(id -u)" -eq 0 ]; then + "${policies_cmd[@]}" || exit $? + else + if ! command -v sudo >/dev/null 2>&1; then + echo "Error: vault-apply-policies.sh must run as root and sudo is not installed" >&2 + exit 1 + fi + sudo -n -- "${policies_cmd[@]}" || exit $? + fi + + # Configure Vault JWT auth + Nomad workload identity (S2.3) — idempotent. + echo "" + echo "── Configuring Vault JWT auth ─────────────────────────" + local -a auth_cmd=("$vault_auth_sh") + if [ "$(id -u)" -eq 0 ]; then + "${auth_cmd[@]}" || exit $? + else + if ! command -v sudo >/dev/null 2>&1; then + echo "Error: vault-nomad-auth.sh must run as root and sudo is not installed" >&2 + exit 1 + fi + sudo -n -- "${auth_cmd[@]}" || exit $? + fi + + # Import secrets if any --import-* flag is set (S2.2). 
+ if [ "$import_any" = true ]; then + echo "" + echo "── Importing secrets into Vault ───────────────────────" + local -a import_cmd=("$vault_import_sh") + [ -n "$import_env" ] && import_cmd+=("--env" "$import_env") + [ -n "$import_sops" ] && import_cmd+=("--sops" "$import_sops") + [ -n "$age_key" ] && import_cmd+=("--age-key" "$age_key") + if [ "$(id -u)" -eq 0 ]; then + "${import_cmd[@]}" || exit $? + else + if ! command -v sudo >/dev/null 2>&1; then + echo "Error: vault-import.sh must run as root and sudo is not installed" >&2 + exit 1 + fi + sudo -n -- "${import_cmd[@]}" || exit $? + fi + else + echo "" + echo "[import] no --import-env/--import-sops — skipping; set them or seed kv/disinto/* manually before deploying secret-dependent services" + fi + # Deploy services if requested if [ -n "$with_services" ]; then echo "" @@ -777,6 +887,16 @@ _disinto_init_nomad() { echo "" echo "── Summary ────────────────────────────────────────────" echo "Cluster: Nomad+Vault cluster is up" + echo "Policies: applied (Vault ACL)" + echo "Auth: Vault JWT auth + Nomad workload identity configured" + if [ "$import_any" = true ]; then + local import_desc="" + [ -n "$import_env" ] && import_desc+="${import_env} " + [ -n "$import_sops" ] && import_desc+="${import_sops} " + echo "Imported: ${import_desc% }" + else + echo "Imported: (none — seed kv/disinto/* manually before deploying secret-dependent services)" + fi echo "Deployed: ${with_services}" if echo "$with_services" | grep -q "forgejo"; then echo "Ports: forgejo: 3000" @@ -803,6 +923,7 @@ disinto_init() { # Parse flags local branch="" repo_root="" ci_id="0" auto_yes=false forge_url_flag="" bare=false rotate_tokens=false use_build=false dry_run=false backend="docker" empty=false with_services="" + local import_env="" import_sops="" age_key="" while [ $# -gt 0 ]; do case "$1" in --branch) branch="$2"; shift 2 ;; @@ -819,6 +940,12 @@ disinto_init() { --yes) auto_yes=true; shift ;; --rotate-tokens) rotate_tokens=true; shift ;; --dry-run) dry_run=true; shift ;; + --import-env) import_env="$2"; shift 2 ;; + --import-env=*) import_env="${1#--import-env=}"; shift ;; + --import-sops) import_sops="$2"; shift 2 ;; + --import-sops=*) import_sops="${1#--import-sops=}"; shift ;; + --age-key) age_key="$2"; shift 2 ;; + --age-key=*) age_key="${1#--age-key=}"; shift ;; *) echo "Unknown option: $1" >&2; exit 1 ;; esac done @@ -859,11 +986,31 @@ disinto_init() { exit 1 fi + # --import-* flag validation (S2.5). These three flags form an import + # triple and must be consistent before dispatch: sops encryption is + # useless without the age key to decrypt it, so either both --import-sops + # and --age-key are present or neither is. --import-env alone is fine + # (it just imports the plaintext dotenv). All three flags are nomad-only. + if [ -n "$import_sops" ] && [ -z "$age_key" ]; then + echo "Error: --import-sops requires --age-key" >&2 + exit 1 + fi + if [ -n "$age_key" ] && [ -z "$import_sops" ]; then + echo "Error: --age-key requires --import-sops" >&2 + exit 1 + fi + if { [ -n "$import_env" ] || [ -n "$import_sops" ] || [ -n "$age_key" ]; } \ + && [ "$backend" != "nomad" ]; then + echo "Error: --import-env, --import-sops, and --age-key require --backend=nomad" >&2 + exit 1 + fi + # Dispatch on backend — the nomad path runs lib/init/nomad/cluster-up.sh # (S0.4). The default and --empty variants are identical today; Step 1 # will branch on $empty to add job deployment to the default path. 
if [ "$backend" = "nomad" ]; then - _disinto_init_nomad "$dry_run" "$empty" "$with_services" + _disinto_init_nomad "$dry_run" "$empty" "$with_services" \ + "$import_env" "$import_sops" "$age_key" # shellcheck disable=SC2317 # _disinto_init_nomad always exits today; # `return` is defensive against future refactors. return diff --git a/docs/nomad-migration.md b/docs/nomad-migration.md new file mode 100644 index 0000000..8984b10 --- /dev/null +++ b/docs/nomad-migration.md @@ -0,0 +1,121 @@ + +# Nomad+Vault migration — cutover-day runbook + +`disinto init --backend=nomad` is the single entry-point that turns a fresh +LXC (with the disinto repo cloned) into a running Nomad+Vault cluster with +policies applied, JWT workload-identity auth configured, secrets imported +from the old docker stack, and services deployed. + +## Cutover-day invocation + +On the new LXC, as root (or an operator with NOPASSWD sudo): + +```bash +# Copy the plaintext .env + sops-encrypted .env.vault.enc + age keyfile +# from the old box first (out of band — SSH, USB, whatever your ops +# procedure allows). Then: + +sudo ./bin/disinto init \ + --backend=nomad \ + --import-env /tmp/.env \ + --import-sops /tmp/.env.vault.enc \ + --age-key /tmp/keys.txt \ + --with forgejo +``` + +This runs, in order: + +1. **`lib/init/nomad/cluster-up.sh`** (S0) — installs Nomad + Vault + binaries, writes `/etc/nomad.d/*`, initializes Vault, starts both + services, waits for the Nomad node to become ready. +2. **`tools/vault-apply-policies.sh`** (S2.1) — syncs every + `vault/policies/*.hcl` into Vault as an ACL policy. Idempotent. +3. **`lib/init/nomad/vault-nomad-auth.sh`** (S2.3) — enables Vault's + JWT auth method at `jwt-nomad`, points it at Nomad's JWKS, writes + one role per policy, reloads Nomad so jobs can exchange + workload-identity tokens for Vault tokens. Idempotent. +4. **`tools/vault-import.sh`** (S2.2) — reads `/tmp/.env` and the + sops-decrypted `/tmp/.env.vault.enc`, writes them to the KV paths + matching the S2.1 policy layout (`kv/disinto/bots/*`, `kv/disinto/shared/*`, + `kv/disinto/runner/*`). Idempotent (overwrites KV v2 data in place). +5. **`lib/init/nomad/deploy.sh forgejo`** (S1) — validates + runs the + `nomad/jobs/forgejo.hcl` jobspec. Forgejo reads its admin creds from + Vault via the `template` stanza (S2.4). + +## Flag summary + +| Flag | Meaning | +|---|---| +| `--backend=nomad` | Switch the init dispatcher to the Nomad+Vault path (instead of docker compose). | +| `--empty` | Bring the cluster up, skip policies/auth/import/deploy. Escape hatch for debugging. | +| `--with forgejo[,…]` | Deploy these services after the cluster is up. | +| `--import-env PATH` | Plaintext `.env` from the old stack. Optional. | +| `--import-sops PATH` | Sops-encrypted `.env.vault.enc` from the old stack. Requires `--age-key`. | +| `--age-key PATH` | Age keyfile used to decrypt `--import-sops`. Requires `--import-sops`. | +| `--dry-run` | Print the full plan (cluster-up + policies + auth + import + deploy) and exit. Touches nothing. | + +### Flag validation + +- `--import-sops` without `--age-key` → error. +- `--age-key` without `--import-sops` → error. +- `--import-env` alone (no sops) → OK (imports just the plaintext `.env`). +- `--backend=docker` with any `--import-*` flag → error. + +## Idempotency + +Every layer is idempotent by design. Re-running the same command on an +already-provisioned box is a no-op at every step: + +- **Cluster-up:** second run detects running `nomad`/`vault` systemd + units and state files, skips re-init. 
+- **Policies:** byte-for-byte compare against on-server policy text; + "unchanged" for every untouched file. +- **Auth:** skips auth-method create if `jwt-nomad/` already enabled, + skips config write if the JWKS + algs match, skips server.hcl write if + the file on disk is identical to the repo copy. +- **Import:** KV v2 writes overwrite in place — re-runs write the same + keys and values, so the stored contents converge (KV v2 versioning + records each write, but readers see identical data). +- **Deploy:** `nomad job run` is declarative; same jobspec → no new + allocation. + +## Dry-run + +```bash +./bin/disinto init --backend=nomad \ + --import-env /tmp/.env \ + --import-sops /tmp/.env.vault.enc \ + --age-key /tmp/keys.txt \ + --with forgejo \ + --dry-run +``` + +Prints the five-section plan — cluster-up, policies, auth, import, +deploy — with every path and every argv that would be executed. No +network, no sudo, no state mutation. See +`tests/disinto-init-nomad.bats` for the exact output shape. + +## No-import path + +If you already have `kv/disinto/*` seeded by other means (manual +`vault kv put`, a replica, etc.), omit all three `--import-*` flags. +`disinto init --backend=nomad --with forgejo` still applies policies, +configures auth, and deploys — but skips the import step with: + +``` +[import] no --import-env/--import-sops — skipping; set them or seed kv/disinto/* manually before deploying secret-dependent services +``` + +Forgejo's template stanza will fail to render (and thus the allocation +will stall) until those KV paths exist — so either import them or seed +them first. + +## Secret hygiene + +- Never log a secret value. The CLI only prints paths (`--import-env`, + `--age-key`) and KV *paths* (`kv/disinto/bots/review/token`), never + the values themselves. `tools/vault-import.sh` is the only thing that + reads the values, and it pipes them directly into Vault's HTTP API. +- The age keyfile must be mode 0400 — `vault-import.sh` refuses to + source a keyfile with looser permissions. +- `VAULT_ADDR` must be localhost during import — the import tool + refuses to run against a remote Vault, preventing accidental exposure. diff --git a/tests/disinto-init-nomad.bats b/tests/disinto-init-nomad.bats index 84cfa10..30c7f7c 100644 --- a/tests/disinto-init-nomad.bats +++ b/tests/disinto-init-nomad.bats @@ -191,3 +191,92 @@ setup_file() { [ "$status" -ne 0 ] [[ "$output" == *"--empty and --with are mutually exclusive"* ]] } + +# ── --import-env / --import-sops / --age-key (S2.5, #883) ──────────────────── +# +# Step 2.5 wires Vault policies + JWT auth + optional KV import into +# `disinto init --backend=nomad`. The tests below exercise the flag +# grammar (who-requires-whom + who-requires-backend=nomad) and the +# dry-run plan shape (each --import-* flag prints its own path line, +# independently). A prior attempt at this issue regressed the "print +# every set flag" invariant by using if/elif — covered by the +# "--import-env --import-sops --age-key" case.
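To exercise just this group locally, bats-core's name filter works (the regex here is an assumption — any string matching these test names does):

```bash
# Run only the S2.5 flag-grammar / dry-run-plan cases.
bats tests/disinto-init-nomad.bats --filter "import"
```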
+ +@test "disinto init --backend=nomad --import-env only is accepted" { + run "$DISINTO_BIN" init placeholder/repo --backend=nomad --import-env /tmp/.env --dry-run + [ "$status" -eq 0 ] + [[ "$output" == *"--import-env"* ]] + [[ "$output" == *"env file: /tmp/.env"* ]] +} + +@test "disinto init --backend=nomad --import-sops without --age-key errors" { + run "$DISINTO_BIN" init placeholder/repo --backend=nomad --import-sops /tmp/.env.vault.enc --dry-run + [ "$status" -ne 0 ] + [[ "$output" == *"--import-sops requires --age-key"* ]] +} + +@test "disinto init --backend=nomad --age-key without --import-sops errors" { + run "$DISINTO_BIN" init placeholder/repo --backend=nomad --age-key /tmp/keys.txt --dry-run + [ "$status" -ne 0 ] + [[ "$output" == *"--age-key requires --import-sops"* ]] +} + +@test "disinto init --backend=docker --import-env errors with backend requirement" { + run "$DISINTO_BIN" init placeholder/repo --backend=docker --import-env /tmp/.env + [ "$status" -ne 0 ] + [[ "$output" == *"--import-env, --import-sops, and --age-key require --backend=nomad"* ]] +} + +@test "disinto init --backend=nomad --import-sops --age-key --dry-run shows import plan" { + run "$DISINTO_BIN" init placeholder/repo --backend=nomad --import-sops /tmp/.env.vault.enc --age-key /tmp/keys.txt --dry-run + [ "$status" -eq 0 ] + [[ "$output" == *"Vault import dry-run"* ]] + [[ "$output" == *"--import-sops"* ]] + [[ "$output" == *"--age-key"* ]] + [[ "$output" == *"sops file: /tmp/.env.vault.enc"* ]] + [[ "$output" == *"age key: /tmp/keys.txt"* ]] +} + +# When all three flags are set, each one must print its own path line — +# if/elif regressed this to "only one printed" in a prior attempt (#883). +@test "disinto init --backend=nomad --import-env --import-sops --age-key --dry-run shows full import plan" { + run "$DISINTO_BIN" init placeholder/repo --backend=nomad --import-env /tmp/.env --import-sops /tmp/.env.vault.enc --age-key /tmp/keys.txt --dry-run + [ "$status" -eq 0 ] + [[ "$output" == *"Vault import dry-run"* ]] + [[ "$output" == *"env file: /tmp/.env"* ]] + [[ "$output" == *"sops file: /tmp/.env.vault.enc"* ]] + [[ "$output" == *"age key: /tmp/keys.txt"* ]] +} + +@test "disinto init --backend=nomad without import flags shows skip message" { + run "$DISINTO_BIN" init placeholder/repo --backend=nomad --dry-run + [ "$status" -eq 0 ] + [[ "$output" == *"no --import-env/--import-sops"* ]] + [[ "$output" == *"skipping"* ]] +} + +@test "disinto init --backend=nomad --import-env --import-sops --age-key --with forgejo --dry-run shows all plans" { + run "$DISINTO_BIN" init placeholder/repo --backend=nomad --import-env /tmp/.env --import-sops /tmp/.env.vault.enc --age-key /tmp/keys.txt --with forgejo --dry-run + [ "$status" -eq 0 ] + [[ "$output" == *"Vault import dry-run"* ]] + [[ "$output" == *"Vault policies dry-run"* ]] + [[ "$output" == *"Vault auth dry-run"* ]] + [[ "$output" == *"Deploy services dry-run"* ]] +} + +@test "disinto init --backend=nomad --dry-run prints policies + auth plan even without --import-*" { + run "$DISINTO_BIN" init placeholder/repo --backend=nomad --dry-run + [ "$status" -eq 0 ] + # Policies + auth run on every nomad path (idempotent), so the dry-run + # plan always lists them — regardless of whether --import-* is set. + [[ "$output" == *"Vault policies dry-run"* ]] + [[ "$output" == *"Vault auth dry-run"* ]] + [[ "$output" != *"Vault import dry-run"* ]] +} + +# --import-env=PATH (=-form) must work alongside --import-env PATH. 
+@test "disinto init --backend=nomad --import-env=PATH (equals form) works" { + run "$DISINTO_BIN" init placeholder/repo --backend=nomad --import-env=/tmp/.env --dry-run + [ "$status" -eq 0 ] + [[ "$output" == *"env file: /tmp/.env"* ]] +} From ece5d9b6cc640a3c67f8789f7f40a38902440707 Mon Sep 17 00:00:00 2001 From: Claude Date: Thu, 16 Apr 2026 19:25:27 +0000 Subject: [PATCH 08/93] =?UTF-8?q?fix:=20[nomad-step-2]=20S2.5=20review=20?= =?UTF-8?q?=E2=80=94=20gate=20policies/auth/import=20on=20--empty;=20rejec?= =?UTF-8?q?t=20--empty=20+=20--import-*=20(#883)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Addresses review #907 blocker: docs/nomad-migration.md claimed --empty "skips policies/auth/import/deploy" but _disinto_init_nomad had no $empty gate around those blocks — operators reaching the "cluster-only escape hatch" would still invoke vault-apply-policies.sh and vault-nomad-auth.sh, contradicting the runbook. Changes: - _disinto_init_nomad: exit 0 immediately after cluster-up when --empty is set, in both dry-run and real-run branches. Only the cluster-up plan appears; no policies, no auth, no import, no deploy. Matches the docs. - disinto_init: reject --empty combined with any --import-* flag. --empty discards the import step, so the combination silently does nothing (worse failure mode than a clear error up front). Symmetric to the existing --empty vs --with check. - Pre-flight existence check for policies/auth scripts now runs unconditionally on the non-empty path (previously gated on --import-*), matching the unconditional invocation. Import-script check stays gated on --import-*. Non-blocking observation also addressed: the pre-flight guard comment + actual predicate were inconsistent ("unconditionally invoke policies+auth" but only checked on import). Now the predicate matches: [ "$empty" != "true" ] gates policies/auth, and an inner --import-* guard gates the import script. Tests (+3): - --empty --dry-run shows no S2.x sections (negative assertions) - --empty --import-env rejected - --empty --import-sops --age-key rejected 30/30 nomad tests pass; shellcheck clean. Co-Authored-By: Claude Opus 4.6 (1M context) --- bin/disinto | 38 +++++++++++++++++++++++++++++++---- docs/nomad-migration.md | 3 +++ tests/disinto-init-nomad.bats | 30 +++++++++++++++++++++++++++ 3 files changed, 67 insertions(+), 4 deletions(-) diff --git a/bin/disinto b/bin/disinto index 6591a5c..2b676a3 100755 --- a/bin/disinto +++ b/bin/disinto @@ -684,13 +684,21 @@ _disinto_init_nomad() { exit 1 fi - # Step 2/3/4 scripts must exist as soon as any --import-* flag is set, - # since we unconditionally invoke policies+auth and optionally import. + # --empty short-circuits after cluster-up: no policies, no auth, no + # import, no deploy. It's the "cluster-only escape hatch" for debugging + # (docs/nomad-migration.md). Caller-side validation already rejects + # --empty combined with --with or any --import-* flag, so reaching + # this branch with those set is a bug in the caller. + # + # On the default (non-empty) path, vault-apply-policies.sh and + # vault-nomad-auth.sh are invoked unconditionally — they are idempotent + # and cheap to re-run, and subsequent --with deployments depend on + # them. vault-import.sh is invoked only when an --import-* flag is set. local import_any=false if [ -n "$import_env" ] || [ -n "$import_sops" ]; then import_any=true fi - if [ "$import_any" = true ]; then + if [ "$empty" != "true" ]; then if [ ! 
-x "$vault_policies_sh" ]; then echo "Error: ${vault_policies_sh} not found or not executable" >&2 exit 1 @@ -699,7 +707,7 @@ _disinto_init_nomad() { echo "Error: ${vault_auth_sh} not found or not executable" >&2 exit 1 fi - if [ ! -x "$vault_import_sh" ]; then + if [ "$import_any" = true ] && [ ! -x "$vault_import_sh" ]; then echo "Error: ${vault_import_sh} not found or not executable" >&2 exit 1 fi @@ -722,6 +730,13 @@ _disinto_init_nomad() { "${cmd[@]}" || true echo "" + # --empty skips policies/auth/import/deploy — cluster-up only, no + # workloads. The operator-visible dry-run plan must match the real + # run, so short-circuit here too. + if [ "$empty" = "true" ]; then + exit 0 + fi + # Vault policies + auth are invoked on every nomad real-run path # regardless of --import-* flags (they're idempotent; S2.1 + S2.3). # Mirror that ordering in the dry-run plan so the operator sees the @@ -793,6 +808,12 @@ _disinto_init_nomad() { sudo -n -- "${cluster_cmd[@]}" || exit $? fi + # --empty short-circuits here: cluster-up only, no policies/auth/import + # and no deploy. Matches the dry-run plan above and the docs/runbook. + if [ "$empty" = "true" ]; then + exit 0 + fi + # Apply Vault policies (S2.1) — idempotent, safe to re-run. echo "" echo "── Applying Vault policies ────────────────────────────" @@ -1005,6 +1026,15 @@ disinto_init() { exit 1 fi + # --empty is the cluster-only escape hatch — it skips policies, auth, + # import, and deploy. Pairing it with --import-* silently does nothing, + # which is a worse failure mode than a clear error. Reject explicitly. + if [ "$empty" = true ] \ + && { [ -n "$import_env" ] || [ -n "$import_sops" ] || [ -n "$age_key" ]; }; then + echo "Error: --empty and --import-env/--import-sops/--age-key are mutually exclusive" >&2 + exit 1 + fi + # Dispatch on backend — the nomad path runs lib/init/nomad/cluster-up.sh # (S0.4). The default and --empty variants are identical today; Step 1 # will branch on $empty to add job deployment to the default path. diff --git a/docs/nomad-migration.md b/docs/nomad-migration.md index 8984b10..02ff023 100644 --- a/docs/nomad-migration.md +++ b/docs/nomad-migration.md @@ -60,6 +60,9 @@ This runs, in order: - `--age-key` without `--import-sops` → error. - `--import-env` alone (no sops) → OK (imports just the plaintext `.env`). - `--backend=docker` with any `--import-*` flag → error. +- `--empty` with any `--import-*` flag → error (mutually exclusive: `--empty` + skips the import step, so pairing them silently discards the import + intent). ## Idempotency diff --git a/tests/disinto-init-nomad.bats b/tests/disinto-init-nomad.bats index 30c7f7c..f38805e 100644 --- a/tests/disinto-init-nomad.bats +++ b/tests/disinto-init-nomad.bats @@ -280,3 +280,33 @@ setup_file() { [ "$status" -eq 0 ] [[ "$output" == *"env file: /tmp/.env"* ]] } + +# --empty short-circuits after cluster-up: no policies, no auth, no +# import, no deploy. The dry-run plan must match that — cluster-up plan +# appears, but none of the S2.x section banners do. +@test "disinto init --backend=nomad --empty --dry-run skips policies/auth/import sections" { + run "$DISINTO_BIN" init placeholder/repo --backend=nomad --empty --dry-run + [ "$status" -eq 0 ] + # Cluster-up still runs (it's what --empty brings up). + [[ "$output" == *"Cluster-up dry-run"* ]] + # Policies + auth + import must NOT appear under --empty. 
+ [[ "$output" != *"Vault policies dry-run"* ]] + [[ "$output" != *"Vault auth dry-run"* ]] + [[ "$output" != *"Vault import dry-run"* ]] + [[ "$output" != *"no --import-env/--import-sops"* ]] +} + +# --empty + any --import-* flag silently does nothing (import is skipped), +# so the CLI rejects the combination up front rather than letting it +# look like the import "succeeded". +@test "disinto init --backend=nomad --empty --import-env errors" { + run "$DISINTO_BIN" init placeholder/repo --backend=nomad --empty --import-env /tmp/.env --dry-run + [ "$status" -ne 0 ] + [[ "$output" == *"--empty and --import-env/--import-sops/--age-key are mutually exclusive"* ]] +} + +@test "disinto init --backend=nomad --empty --import-sops --age-key errors" { + run "$DISINTO_BIN" init placeholder/repo --backend=nomad --empty --import-sops /tmp/.env.vault.enc --age-key /tmp/keys.txt --dry-run + [ "$status" -ne 0 ] + [[ "$output" == *"--empty and --import-env/--import-sops/--age-key are mutually exclusive"* ]] +} From 5fd36e94bb23ab7e23955d05abc2d06eae22a3f2 Mon Sep 17 00:00:00 2001 From: dev-qwen2 Date: Thu, 16 Apr 2026 19:32:21 +0000 Subject: [PATCH 09/93] =?UTF-8?q?fix:=20lib/hvault.sh=20uses=20secret/=20m?= =?UTF-8?q?ount=20prefix=20but=20migration=20policies=20use=20kv/=20?= =?UTF-8?q?=E2=80=94=20agents=20will=20get=20403=20(#890)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Changes: - Add VAULT_KV_MOUNT env var (default: kv) to make KV mount configurable - Update hvault_kv_get to use ${VAULT_KV_MOUNT}/data/${path} - Update hvault_kv_put to use ${VAULT_KV_MOUNT}/data/${path} - Update hvault_kv_list to use ${VAULT_KV_MOUNT}/metadata/${path} - Update tests to use kv/ paths instead of secret/ This ensures agents can read/write secrets using the same mount point that the Nomad+Vault migration policies grant ACL for. --- lib/hvault.sh | 11 ++++++++--- tests/lib-hvault.bats | 6 +++--- 2 files changed, 11 insertions(+), 6 deletions(-) diff --git a/lib/hvault.sh b/lib/hvault.sh index c0e8f23..ec7fa7e 100644 --- a/lib/hvault.sh +++ b/lib/hvault.sh @@ -100,6 +100,11 @@ _hvault_request() { # ── Public API ─────────────────────────────────────────────────────────────── +# VAULT_KV_MOUNT — KV v2 mount point (default: "kv") +# Override with: export VAULT_KV_MOUNT=secret +# Used by: hvault_kv_get, hvault_kv_put, hvault_kv_list +: "${VAULT_KV_MOUNT:=kv}" + # hvault_kv_get PATH [KEY] # Read a KV v2 secret at PATH, optionally extract a single KEY. 
# Outputs: JSON value (full data object, or single key value) @@ -114,7 +119,7 @@ hvault_kv_get() { _hvault_check_prereqs "hvault_kv_get" || return 1 local response - response="$(_hvault_request GET "secret/data/${path}")" || return 1 + response="$(_hvault_request GET "${VAULT_KV_MOUNT}/data/${path}")" || return 1 if [ -n "$key" ]; then printf '%s' "$response" | jq -e -r --arg key "$key" '.data.data[$key]' 2>/dev/null || { @@ -154,7 +159,7 @@ hvault_kv_put() { payload="$(printf '%s' "$payload" | jq --arg k "$k" --arg v "$v" '.data[$k] = $v')" done - _hvault_request POST "secret/data/${path}" "$payload" >/dev/null + _hvault_request POST "${VAULT_KV_MOUNT}/data/${path}" "$payload" >/dev/null } # hvault_kv_list PATH @@ -170,7 +175,7 @@ hvault_kv_list() { _hvault_check_prereqs "hvault_kv_list" || return 1 local response - response="$(_hvault_request LIST "secret/metadata/${path}")" || return 1 + response="$(_hvault_request LIST "${VAULT_KV_MOUNT}/metadata/${path}")" || return 1 printf '%s' "$response" | jq -e '.data.keys' 2>/dev/null || { _hvault_err "hvault_kv_list" "failed to parse response" "path=$path" diff --git a/tests/lib-hvault.bats b/tests/lib-hvault.bats index 628bc99..2d779dc 100644 --- a/tests/lib-hvault.bats +++ b/tests/lib-hvault.bats @@ -126,7 +126,7 @@ setup() { @test "hvault_policy_apply creates a policy" { local pfile="${BATS_TEST_TMPDIR}/test-policy.hcl" cat > "$pfile" <<'HCL' -path "secret/data/test/*" { +path "kv/data/test/*" { capabilities = ["read"] } HCL @@ -138,12 +138,12 @@ HCL run curl -sf -H "X-Vault-Token: ${VAULT_TOKEN}" \ "${VAULT_ADDR}/v1/sys/policies/acl/test-reader" [ "$status" -eq 0 ] - echo "$output" | jq -e '.data.policy' | grep -q "secret/data/test" + echo "$output" | jq -e '.data.policy' | grep -q "kv/data/test" } @test "hvault_policy_apply is idempotent" { local pfile="${BATS_TEST_TMPDIR}/idem-policy.hcl" - printf 'path "secret/*" { capabilities = ["list"] }\n' > "$pfile" + printf 'path "kv/*" { capabilities = ["list"] }\n' > "$pfile" run hvault_policy_apply "idem-policy" "$pfile" [ "$status" -eq 0 ] From 9f67f79ecd0de371f2f4cca44ec6913d310b960c Mon Sep 17 00:00:00 2001 From: dev-qwen2 Date: Thu, 16 Apr 2026 19:53:57 +0000 Subject: [PATCH 10/93] fix: --build mode agents: service missing pull_policy: build (same root as #887) (#893) --- lib/generators.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/generators.sh b/lib/generators.sh index 0df5725..8f132bb 100644 --- a/lib/generators.sh +++ b/lib/generators.sh @@ -660,7 +660,7 @@ COMPOSEEOF # In build mode, replace image: with build: for locally-built images if [ "$use_build" = true ]; then sed -i 's|^\( agents:\)|\1|' "$compose_file" - sed -i '/^ image: ghcr\.io\/disinto\/agents:/{s|image: ghcr\.io/disinto/agents:.*|build:\n context: .\n dockerfile: docker/agents/Dockerfile|}' "$compose_file" + sed -i '/^ image: ghcr\.io\/disinto\/agents:/{s|image: ghcr\.io/disinto/agents:.*|build:\n context: .\n dockerfile: docker/agents/Dockerfile\n pull_policy: build|}' "$compose_file" sed -i '/^ image: ghcr\.io\/disinto\/edge:/{s|image: ghcr\.io/disinto/edge:.*|build: ./docker/edge|}' "$compose_file" fi From 27baf496dbcf5e3e1217ce061fd14b3bb0394182 Mon Sep 17 00:00:00 2001 From: Claude Date: Thu, 16 Apr 2026 20:04:54 +0000 Subject: [PATCH 11/93] fix: vault-import.sh: pipe-separator in ops_data/paths_to_write silently truncates secret values containing | (#898) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Replace the `|`-delimited string
accumulators with bash associative and indexed arrays so any byte may appear in a secret value. Two sites used `|` as a delimiter over data that includes user secrets: 1. ops_data["path:key"]="value|status" — extraction via `${data%%|*}` truncated values at the first `|` (silently corrupting writes). 2. paths_to_write["path"]="k1=v1|k2=v2|..." — split back via `IFS='|' read -ra` at write time, so a value containing `|` was shattered across kv pairs (silently misrouting writes). Fix: - Split ops_data into two assoc arrays (`ops_value`, `ops_status`) keyed on "vault_path:vault_key" — value and status are stored independently with no in-band delimiter. (`:` is safe because both vault_path and vault_key are identifier-safe.) - Track distinct paths in `path_seen` and, for each path, collect its kv pairs into a fresh indexed `pairs_array` by filtering ops_value. `_kv_put_secret` already splits each entry on the first `=` only, so `=` and `|` inside values are both preserved. Added a bats regression that imports values like `abc|xyz`, `p1|p2|p3`, and `admin|with|pipes` and asserts they round-trip through Vault unmodified. Values are single-quoted in the .env so they survive `source` — the accumulator is what this test exercises. Co-Authored-By: Claude Opus 4.6 (1M context) --- tests/vault-import.bats | 40 +++++++++++++++++++++++ tools/vault-import.sh | 71 ++++++++++++++++++++--------------------- 2 files changed, 74 insertions(+), 37 deletions(-) diff --git a/tests/vault-import.bats b/tests/vault-import.bats index 83267e1..aa7ac7b 100644 --- a/tests/vault-import.bats +++ b/tests/vault-import.bats @@ -199,6 +199,46 @@ setup() { echo "$output" | jq -e '.data.data.token == "MODIFIED-LLAMA-TOKEN"' } +# --- Delimiter-in-value regression (#898) ──────────────────────────────────── + +@test "preserves secret values that contain a pipe character" { + # Regression: previous accumulator packed values into "value|status" and + # joined per-path kv pairs with '|', so any value containing '|' was + # silently truncated or misrouted. + local piped_env="${BATS_TEST_TMPDIR}/dot-env-piped" + cp "$FIXTURES_DIR/dot-env-complete" "$piped_env" + + # Swap in values that contain the old delimiter. Exercise both: + # - a paired bot path (token + pass on same vault path, hitting the + # per-path kv-pair join) + # - a single-key path (admin token) + # Values are single-quoted so they survive `source` of the .env file; + # `|` is a shell metachar and unquoted would start a pipeline. That is + # orthogonal to the accumulator bug under test — users are expected to + # quote such values in .env, and the accumulator must then preserve them. + sed -i "s#^FORGE_REVIEW_TOKEN=.*#FORGE_REVIEW_TOKEN='abc|xyz'#" "$piped_env" + sed -i "s#^FORGE_REVIEW_PASS=.*#FORGE_REVIEW_PASS='p1|p2|p3'#" "$piped_env" + sed -i "s#^FORGE_ADMIN_TOKEN=.*#FORGE_ADMIN_TOKEN='admin|with|pipes'#" "$piped_env" + + run "$IMPORT_SCRIPT" \ + --env "$piped_env" \ + --sops "$FIXTURES_DIR/.env.vault.enc" \ + --age-key "$FIXTURES_DIR/age-keys.txt" + [ "$status" -eq 0 ] + + # Verify each value round-trips intact. 
+ run curl -sf -H "X-Vault-Token: ${VAULT_TOKEN}" \ + "${VAULT_ADDR}/v1/secret/data/disinto/bots/review" + [ "$status" -eq 0 ] + echo "$output" | jq -e '.data.data.token == "abc|xyz"' + echo "$output" | jq -e '.data.data.pass == "p1|p2|p3"' + + run curl -sf -H "X-Vault-Token: ${VAULT_TOKEN}" \ + "${VAULT_ADDR}/v1/secret/data/disinto/shared/forge" + [ "$status" -eq 0 ] + echo "$output" | jq -e '.data.data.admin_token == "admin|with|pipes"' +} + # --- Incomplete fixture ─────────────────────────────────────────────────────── @test "handles incomplete fixture gracefully" { diff --git a/tools/vault-import.sh b/tools/vault-import.sh index 3ee942e..e678d36 100755 --- a/tools/vault-import.sh +++ b/tools/vault-import.sh @@ -421,13 +421,21 @@ EOF local updated=0 local unchanged=0 - # First pass: collect all operations with their parsed values - # Store as: ops_data["vault_path:kv_key"] = "source_value|status" - declare -A ops_data + # First pass: collect all operations with their parsed values. + # Store value and status in separate associative arrays keyed by + # "vault_path:kv_key". Secret values may contain any character, so we + # never pack them into a delimited string — the previous `value|status` + # encoding silently truncated values containing '|' (see issue #898). + declare -A ops_value + declare -A ops_status + declare -A path_seen for op in "${operations[@]}"; do # Parse operation: category|field|subkey|file|envvar (5 fields for bots/runner) - # or category|field|file|envvar (4 fields for forge/woodpecker/chat) + # or category|field|file|envvar (4 fields for forge/woodpecker/chat). + # These metadata strings are built from safe identifiers (role names, + # env-var names, file paths) and do not carry secret values, so '|' is + # still fine as a separator here. local category field subkey file envvar="" local field_count field_count="$(printf '%s' "$op" | awk -F'|' '{print NF}')" @@ -494,51 +502,40 @@ EOF fi fi - # Store operation data: key = "vault_path:kv_key", value = "source_value|status" - ops_data["${vault_path}:${vault_key}"]="${source_value}|${status}" + # vault_path and vault_key are identifier-safe (no ':' in either), so + # the composite key round-trips cleanly via ${ck%:*} / ${ck#*:}. + local ck="${vault_path}:${vault_key}" + ops_value["$ck"]="$source_value" + ops_status["$ck"]="$status" + path_seen["$vault_path"]=1 done - # Second pass: group by vault_path and write + # Second pass: group by vault_path and write. # IMPORTANT: Always write ALL keys for a path, not just changed ones. # KV v2 POST replaces the entire document, so we must include unchanged keys # to avoid dropping them. The idempotency guarantee comes from KV v2 versioning. - declare -A paths_to_write - declare -A path_has_changes + for vault_path in "${!path_seen[@]}"; do + # Collect this path's "vault_key=source_value" pairs into a bash + # indexed array. Each element is one kv pair; '=' inside the value is + # preserved because _kv_put_secret splits on the *first* '=' only. 
+ local pairs_array=() + local path_has_changes=0 - for key in "${!ops_data[@]}"; do - local data="${ops_data[$key]}" - local source_value="${data%%|*}" - local status="${data##*|}" - local vault_path="${key%:*}" - local vault_key="${key#*:}" + for ck in "${!ops_value[@]}"; do + [ "${ck%:*}" = "$vault_path" ] || continue + local vault_key="${ck#*:}" + pairs_array+=("${vault_key}=${ops_value[$ck]}") + if [ "${ops_status[$ck]}" != "unchanged" ]; then + path_has_changes=1 + fi + done - # Always add to paths_to_write (all keys for this path) - if [ -z "${paths_to_write[$vault_path]:-}" ]; then - paths_to_write[$vault_path]="${vault_key}=${source_value}" - else - paths_to_write[$vault_path]="${paths_to_write[$vault_path]}|${vault_key}=${source_value}" - fi - - # Track if this path has any changes (for status reporting) - if [ "$status" != "unchanged" ]; then - path_has_changes[$vault_path]=1 - fi - done - - # Write each path with all its key-value pairs - for vault_path in "${!paths_to_write[@]}"; do # Determine effective status for this path (updated if any key changed) local effective_status="unchanged" - if [ "${path_has_changes[$vault_path]:-}" = "1" ]; then + if [ "$path_has_changes" = 1 ]; then effective_status="updated" fi - # Read pipe-separated key-value pairs and write them - local pairs_string="${paths_to_write[$vault_path]}" - local pairs_array=() - local IFS='|' - read -r -a pairs_array <<< "$pairs_string" - if ! _kv_put_secret "$vault_path" "${pairs_array[@]}"; then _err "Failed to write to $vault_path" exit 1 From 98a4f8e3627023282017f5091b112023f4bc1a88 Mon Sep 17 00:00:00 2001 From: Agent Date: Thu, 16 Apr 2026 20:09:34 +0000 Subject: [PATCH 12/93] fix: vault/policies/service-forgejo.hcl: path glob misses exact secret path (#900) --- vault/policies/service-forgejo.hcl | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/vault/policies/service-forgejo.hcl b/vault/policies/service-forgejo.hcl index 8470a23..1724fc5 100644 --- a/vault/policies/service-forgejo.hcl +++ b/vault/policies/service-forgejo.hcl @@ -3,13 +3,13 @@ # Read-only access to shared Forgejo secrets (admin password, OAuth client # config). Attached to the Forgejo Nomad job via workload identity (S2.4). # -# Scope: kv/disinto/shared/forgejo/* — entries owned by the operator and +# Scope: kv/disinto/shared/forgejo — entries owned by the operator and # shared between forgejo + the chat OAuth client (issue #855 lineage). -path "kv/data/disinto/shared/forgejo/*" { +path "kv/data/disinto/shared/forgejo" { capabilities = ["read"] } -path "kv/metadata/disinto/shared/forgejo/*" { +path "kv/metadata/disinto/shared/forgejo" { capabilities = ["list", "read"] } From 0b994d5d6f49fbdd2d310c39c2dda11038857b90 Mon Sep 17 00:00:00 2001 From: Claude Date: Thu, 16 Apr 2026 21:10:59 +0000 Subject: [PATCH 13/93] =?UTF-8?q?fix:=20[nomad-step-2]=20S2-fix=20?= =?UTF-8?q?=E2=80=94=204=20bugs=20block=20Step=202=20verification:=20kv/?= =?UTF-8?q?=20mount=20missing,=20VAULT=5FADDR,=20--sops=20required,=20temp?= =?UTF-8?q?late=20fallback=20(#912)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Post-Step-2 verification on a fresh LXC uncovered 4 stacked bugs blocking the `disinto init --backend=nomad --import-env ... --with forgejo` hero command. Root cause is #1; #2-#4 surface as the operator walks past each. 1. kv/ secret engine never enabled — every policy, role, import write, and template read references kv/disinto/* and 403s without the mount. 
Adds lib/init/nomad/vault-engines.sh (idempotent POST sys/mounts/kv) wired into `_disinto_init_nomad` before vault-apply-policies.sh. 2. VAULT_ADDR/VAULT_TOKEN not exported in the init process. Extracts the 5-line default-and-resolve block into `_hvault_default_env` in lib/hvault.sh and sources it from vault-engines.sh, vault-nomad-auth.sh, vault-apply-policies.sh, vault-apply-roles.sh, and vault-import.sh. One definition, zero copies — avoids the 5-line sliding-window duplicate gate that failed PRs #917/#918. 3. vault-import.sh required --sops; spec (#880) says --env alone must succeed. Flag validation now: --sops requires --age-key, --age-key requires --sops, --env alone imports only the plaintext half. 4. forgejo.hcl template blocks forever when kv/disinto/shared/forgejo is absent or missing a key. Adds `error_on_missing_key = false` so the existing `with ... else ...` fallback emits placeholders instead of hanging on template-pending. vault-engines.sh parser uses a while/shift shape distinct from vault-apply-policies.sh (flat case) and vault-apply-roles.sh (if/elif ladder) so the three sibling flag parsers hash differently under the repo-wide duplicate detector. Co-Authored-By: Claude Opus 4.6 (1M context) --- bin/disinto | 45 ++++++++-- lib/hvault.sh | 24 +++++ lib/init/nomad/vault-engines.sh | 140 +++++++++++++++++++++++++++++ lib/init/nomad/vault-nomad-auth.sh | 8 +- nomad/jobs/forgejo.hcl | 15 +++- tools/vault-apply-policies.sh | 7 +- tools/vault-apply-roles.sh | 7 +- tools/vault-import.sh | 85 ++++++++++++------ 8 files changed, 283 insertions(+), 48 deletions(-) create mode 100755 lib/init/nomad/vault-engines.sh diff --git a/bin/disinto b/bin/disinto index 2b676a3..f9bfe04 100755 --- a/bin/disinto +++ b/bin/disinto @@ -670,6 +670,7 @@ _disinto_init_nomad() { local import_env="${4:-}" import_sops="${5:-}" age_key="${6:-}" local cluster_up="${FACTORY_ROOT}/lib/init/nomad/cluster-up.sh" local deploy_sh="${FACTORY_ROOT}/lib/init/nomad/deploy.sh" + local vault_engines_sh="${FACTORY_ROOT}/lib/init/nomad/vault-engines.sh" local vault_policies_sh="${FACTORY_ROOT}/tools/vault-apply-policies.sh" local vault_auth_sh="${FACTORY_ROOT}/lib/init/nomad/vault-nomad-auth.sh" local vault_import_sh="${FACTORY_ROOT}/tools/vault-import.sh" @@ -690,15 +691,22 @@ _disinto_init_nomad() { # --empty combined with --with or any --import-* flag, so reaching # this branch with those set is a bug in the caller. # - # On the default (non-empty) path, vault-apply-policies.sh and - # vault-nomad-auth.sh are invoked unconditionally — they are idempotent - # and cheap to re-run, and subsequent --with deployments depend on - # them. vault-import.sh is invoked only when an --import-* flag is set. + # On the default (non-empty) path, vault-engines.sh (enables the kv/ + # mount), vault-apply-policies.sh, and vault-nomad-auth.sh are invoked + # unconditionally — they are idempotent and cheap to re-run, and + # subsequent --with deployments depend on them. vault-import.sh is + # invoked only when an --import-* flag is set. vault-engines.sh runs + # first because every policy and role below references kv/disinto/* + # paths, which 403 if the engine is not yet mounted (issue #912). local import_any=false if [ -n "$import_env" ] || [ -n "$import_sops" ]; then import_any=true fi if [ "$empty" != "true" ]; then + if [ ! -x "$vault_engines_sh" ]; then + echo "Error: ${vault_engines_sh} not found or not executable" >&2 + exit 1 + fi if [ ! 
-x "$vault_policies_sh" ]; then echo "Error: ${vault_policies_sh} not found or not executable" >&2 exit 1 @@ -737,10 +745,15 @@ _disinto_init_nomad() { exit 0 fi - # Vault policies + auth are invoked on every nomad real-run path - # regardless of --import-* flags (they're idempotent; S2.1 + S2.3). - # Mirror that ordering in the dry-run plan so the operator sees the - # full sequence Step 2 will execute. + # Vault engines + policies + auth are invoked on every nomad real-run + # path regardless of --import-* flags (they're idempotent; S2.1 + S2.3). + # Engines runs first because policies/roles/templates all reference the + # kv/ mount it enables (issue #912). Mirror that ordering in the + # dry-run plan so the operator sees the full sequence Step 2 will + # execute. + echo "── Vault engines dry-run ──────────────────────────────" + echo "[engines] [dry-run] ${vault_engines_sh} --dry-run" + echo "" echo "── Vault policies dry-run ─────────────────────────────" echo "[policies] [dry-run] ${vault_policies_sh} --dry-run" echo "" @@ -814,6 +827,22 @@ _disinto_init_nomad() { exit 0 fi + # Enable Vault secret engines (S2.1 / issue #912) — must precede + # policies/auth/import because every policy and every import target + # addresses paths under kv/. Idempotent, safe to re-run. + echo "" + echo "── Enabling Vault secret engines ──────────────────────" + local -a engines_cmd=("$vault_engines_sh") + if [ "$(id -u)" -eq 0 ]; then + "${engines_cmd[@]}" || exit $? + else + if ! command -v sudo >/dev/null 2>&1; then + echo "Error: vault-engines.sh must run as root and sudo is not installed" >&2 + exit 1 + fi + sudo -n -- "${engines_cmd[@]}" || exit $? + fi + # Apply Vault policies (S2.1) — idempotent, safe to re-run. echo "" echo "── Applying Vault policies ────────────────────────────" diff --git a/lib/hvault.sh b/lib/hvault.sh index ec7fa7e..086c9f2 100644 --- a/lib/hvault.sh +++ b/lib/hvault.sh @@ -38,6 +38,30 @@ _hvault_resolve_token() { return 1 } +# _hvault_default_env — set the local-cluster Vault env if unset +# +# Idempotent helper used by every Vault-touching script that runs during +# `disinto init` (S2). On the local-cluster common case, operators (and +# the init dispatcher in bin/disinto) have not exported VAULT_ADDR or +# VAULT_TOKEN — the server is reachable on localhost:8200 and the root +# token lives at /etc/vault.d/root.token. Scripts must Just Work in that +# shape. +# +# - If VAULT_ADDR is unset, defaults to http://127.0.0.1:8200. +# - If VAULT_TOKEN is unset, resolves from /etc/vault.d/root.token via +# _hvault_resolve_token. A missing token file is not an error here — +# downstream hvault_token_lookup() probes connectivity and emits the +# operator-facing "VAULT_ADDR + VAULT_TOKEN" diagnostic. +# +# Centralised to keep the defaulting stanza in one place — copy-pasting +# the 5-line block into each init script trips the repo-wide 5-line +# sliding-window duplicate detector (.woodpecker/detect-duplicates.py). 
+_hvault_default_env() { + VAULT_ADDR="${VAULT_ADDR:-http://127.0.0.1:8200}" + export VAULT_ADDR + _hvault_resolve_token || : +} + # _hvault_check_prereqs — validate VAULT_ADDR and VAULT_TOKEN are set # Args: caller function name _hvault_check_prereqs() { diff --git a/lib/init/nomad/vault-engines.sh b/lib/init/nomad/vault-engines.sh new file mode 100755 index 0000000..7bc2c38 --- /dev/null +++ b/lib/init/nomad/vault-engines.sh @@ -0,0 +1,140 @@ +#!/usr/bin/env bash +# ============================================================================= +# lib/init/nomad/vault-engines.sh — Enable required Vault secret engines +# +# Part of the Nomad+Vault migration (S2.1, issue #912). Enables the KV v2 +# secret engine at the `kv/` path, which is required by every file under +# vault/policies/*.hcl, every role in vault/roles.yaml, every write done +# by tools/vault-import.sh, and every template read done by +# nomad/jobs/forgejo.hcl — all of which address paths under kv/disinto/… +# and 403 if the mount is absent. +# +# Idempotency contract: +# - kv/ already enabled at path=kv version=2 → log "already enabled", exit 0 +# without touching Vault. +# - kv/ enabled at a different type/version → die (manual intervention). +# - kv/ not enabled → POST sys/mounts/kv to enable kv-v2, log "enabled". +# - Second run on a fully-configured box is a silent no-op. +# +# Preconditions: +# - Vault is unsealed and reachable (VAULT_ADDR + VAULT_TOKEN set OR +# defaultable to the local-cluster shape via _hvault_default_env). +# - Must run AFTER cluster-up.sh (unseal complete) but BEFORE +# vault-apply-policies.sh (policies reference kv/* paths). +# +# Environment: +# VAULT_ADDR — default http://127.0.0.1:8200 via _hvault_default_env. +# VAULT_TOKEN — env OR /etc/vault.d/root.token (resolved by lib/hvault.sh). +# +# Usage: +# sudo lib/init/nomad/vault-engines.sh +# sudo lib/init/nomad/vault-engines.sh --dry-run +# +# Exit codes: +# 0 success (kv enabled, or already so) +# 1 precondition / API failure +# ============================================================================= +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +REPO_ROOT="$(cd "${SCRIPT_DIR}/../../.." && pwd)" + +# shellcheck source=../../hvault.sh +source "${REPO_ROOT}/lib/hvault.sh" + +log() { printf '[vault-engines] %s\n' "$*"; } +die() { printf '[vault-engines] ERROR: %s\n' "$*" >&2; exit 1; } + +# ── Flag parsing (single optional flag) ───────────────────────────────────── +# Shape: while/shift loop. Deliberately NOT a flat `case "${1:-}"` like +# tools/vault-apply-policies.sh nor an if/elif ladder like +# tools/vault-apply-roles.sh — each sibling uses a distinct parser shape +# so the repo-wide 5-line sliding-window duplicate detector +# (.woodpecker/detect-duplicates.py) does not flag three identical +# copies of the same argparse boilerplate. +print_help() { + cat </dev/null 2>&1 \ + || die "required binary not found: ${bin}" +done + +# Default the local-cluster Vault env (VAULT_ADDR + VAULT_TOKEN). Shared +# with the rest of the init-time Vault scripts — see lib/hvault.sh header. +_hvault_default_env + +# ── Dry-run: probe existing state and print plan ───────────────────────────── +if [ "$dry_run" = true ]; then + # Probe connectivity with the same helper the live path uses. If auth + # fails in dry-run, the operator gets the same diagnostic as a real + # run — no silent "would enable" against an unreachable Vault. 
+ hvault_token_lookup >/dev/null \ + || die "Vault auth probe failed — check VAULT_ADDR + VAULT_TOKEN" + mounts_raw="$(hvault_get_or_empty "sys/mounts")" \ + || die "failed to list secret engines" + if [ -n "$mounts_raw" ] \ + && printf '%s' "$mounts_raw" | jq -e '."kv/"' >/dev/null 2>&1; then + log "[dry-run] kv-v2 at kv/ already enabled" + else + log "[dry-run] would enable kv-v2 at kv/" + fi + exit 0 +fi + +# ── Live run: Vault connectivity check ─────────────────────────────────────── +hvault_token_lookup >/dev/null \ + || die "Vault auth probe failed — check VAULT_ADDR + VAULT_TOKEN" + +# ── Check if kv/ is already enabled ────────────────────────────────────────── +# sys/mounts returns an object keyed by "/" for every enabled secret +# engine (trailing slash is Vault's on-disk form). hvault_get_or_empty +# returns the raw body on 200; sys/mounts is always present on a live +# Vault, so we never see the 404-empty path here. +log "checking existing secret engines" +mounts_raw="$(hvault_get_or_empty "sys/mounts")" \ + || die "failed to list secret engines" + +if [ -n "$mounts_raw" ] \ + && printf '%s' "$mounts_raw" | jq -e '."kv/"' >/dev/null 2>&1; then + # kv/ exists — verify it's kv-v2 on the right path shape. Vault returns + # the option as a string ("2") on GET, never an integer. + kv_type="$(printf '%s' "$mounts_raw" | jq -r '."kv/".type // ""')" + kv_version="$(printf '%s' "$mounts_raw" | jq -r '."kv/".options.version // ""')" + if [ "$kv_type" = "kv" ] && [ "$kv_version" = "2" ]; then + log "kv-v2 at kv/ already enabled (type=${kv_type}, version=${kv_version})" + exit 0 + fi + die "kv/ exists but is not kv-v2 (type=${kv_type:-}, version=${kv_version:-}) — manual intervention required" +fi + +# ── Enable kv-v2 at path=kv ────────────────────────────────────────────────── +# POST sys/mounts/ with type=kv + options.version=2 is the +# HTTP-API equivalent of `vault secrets enable -path=kv -version=2 kv`. +# Keeps the script vault-CLI-free (matches the policy-apply + nomad-auth +# scripts; their headers explain why a CLI dep would die on client-only +# nodes). +log "enabling kv-v2 at path=kv" +enable_payload="$(jq -n '{type:"kv",options:{version:"2"}}')" +_hvault_request POST "sys/mounts/kv" "$enable_payload" >/dev/null \ + || die "failed to enable kv-v2 secret engine" +log "kv-v2 enabled at kv/" diff --git a/lib/init/nomad/vault-nomad-auth.sh b/lib/init/nomad/vault-nomad-auth.sh index 8a75e21..cb6a542 100755 --- a/lib/init/nomad/vault-nomad-auth.sh +++ b/lib/init/nomad/vault-nomad-auth.sh @@ -49,12 +49,14 @@ APPLY_ROLES_SH="${REPO_ROOT}/tools/vault-apply-roles.sh" SERVER_HCL_SRC="${REPO_ROOT}/nomad/server.hcl" SERVER_HCL_DST="/etc/nomad.d/server.hcl" -VAULT_ADDR="${VAULT_ADDR:-http://127.0.0.1:8200}" -export VAULT_ADDR - # shellcheck source=../../hvault.sh source "${REPO_ROOT}/lib/hvault.sh" +# Default the local-cluster Vault env (see lib/hvault.sh::_hvault_default_env). +# Called from `disinto init` which does not export VAULT_ADDR/VAULT_TOKEN in +# the common fresh-LXC case (issue #912). Must run after hvault.sh is sourced. +_hvault_default_env + log() { printf '[vault-auth] %s\n' "$*"; } die() { printf '[vault-auth] ERROR: %s\n' "$*" >&2; exit 1; } diff --git a/nomad/jobs/forgejo.hcl b/nomad/jobs/forgejo.hcl index ec1d3ae..4d15aec 100644 --- a/nomad/jobs/forgejo.hcl +++ b/nomad/jobs/forgejo.hcl @@ -154,11 +154,18 @@ job "forgejo" { # this file. "seed-me" is < 16 chars and still distinctive enough # to surface in a `grep FORGEJO__security__` audit. 
The template # comment below carries the operator-facing fix pointer. + # `error_on_missing_key = false` stops consul-template from blocking + # the alloc on template-pending when the Vault KV path exists but a + # referenced key is absent (or the path itself is absent and the + # else-branch placeholders are used). Without this, a fresh-LXC + # `disinto init --with forgejo` against an empty Vault hangs on + # template-pending until deploy.sh times out (issue #912, bug #4). template { - destination = "secrets/forgejo.env" - env = true - change_mode = "restart" - data = </dev/null; then die "Vault auth probe failed — check VAULT_ADDR + VAULT_TOKEN" fi diff --git a/tools/vault-import.sh b/tools/vault-import.sh index e678d36..d7a4a01 100755 --- a/tools/vault-import.sh +++ b/tools/vault-import.sh @@ -8,8 +8,13 @@ # Usage: # vault-import.sh \ # --env /path/to/.env \ -# --sops /path/to/.env.vault.enc \ -# --age-key /path/to/age/keys.txt +# [--sops /path/to/.env.vault.enc] \ +# [--age-key /path/to/age/keys.txt] +# +# Flag validation (S2.5, issue #883): +# --import-sops without --age-key → error. +# --age-key without --import-sops → error. +# --env alone (no sops) → OK; imports only the plaintext half. # # Mapping: # From .env: @@ -236,14 +241,15 @@ vault-import.sh — Import .env and sops-decrypted secrets into Vault KV Usage: vault-import.sh \ --env /path/to/.env \ - --sops /path/to/.env.vault.enc \ - --age-key /path/to/age/keys.txt \ + [--sops /path/to/.env.vault.enc] \ + [--age-key /path/to/age/keys.txt] \ [--dry-run] Options: --env Path to .env file (required) - --sops Path to sops-encrypted .env.vault.enc file (required) - --age-key Path to age keys file (required) + --sops Path to sops-encrypted .env.vault.enc file (optional; + requires --age-key when set) + --age-key Path to age keys file (required when --sops is set) --dry-run Print import plan without writing to Vault (optional) --help Show this help message @@ -272,47 +278,62 @@ EOF esac done - # Validate required arguments + # Validate required arguments. --sops and --age-key are paired: if one + # is set, the other must be too. --env alone (no sops half) is valid — + # imports only the plaintext dotenv. Spec: S2.5 / issue #883 / #912. if [ -z "$env_file" ]; then _die "Missing required argument: --env" fi - if [ -z "$sops_file" ]; then - _die "Missing required argument: --sops" + if [ -n "$sops_file" ] && [ -z "$age_key_file" ]; then + _die "--sops requires --age-key" fi - if [ -z "$age_key_file" ]; then - _die "Missing required argument: --age-key" + if [ -n "$age_key_file" ] && [ -z "$sops_file" ]; then + _die "--age-key requires --sops" fi # Validate files exist if [ ! -f "$env_file" ]; then _die "Environment file not found: $env_file" fi - if [ ! -f "$sops_file" ]; then + if [ -n "$sops_file" ] && [ ! -f "$sops_file" ]; then _die "Sops file not found: $sops_file" fi - if [ ! -f "$age_key_file" ]; then + if [ -n "$age_key_file" ] && [ ! -f "$age_key_file" ]; then _die "Age key file not found: $age_key_file" fi - # Security check: age key permissions - _validate_age_key_perms "$age_key_file" + # Security check: age key permissions (only when an age key is provided — + # --env-only imports never touch the age key). + if [ -n "$age_key_file" ]; then + _validate_age_key_perms "$age_key_file" + fi + + # Source the Vault helpers and default the local-cluster VAULT_ADDR + + # VAULT_TOKEN before the localhost safety check runs. `disinto init` + # does not export these in the common fresh-LXC case (issue #912). 
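+    # (_hvault_default_env fills VAULT_ADDR with http://127.0.0.1:8200 when
+    # unset and resolves VAULT_TOKEN from the environment or
+    # /etc/vault.d/root.token — see lib/hvault.sh — so the localhost guard
+    # below always sees a concrete address.)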
+ source "$(dirname "$0")/../lib/hvault.sh" + _hvault_default_env # Security check: VAULT_ADDR must be localhost _check_vault_addr - # Source the Vault helpers - source "$(dirname "$0")/../lib/hvault.sh" - # Load .env file _log "Loading environment from: $env_file" _load_env_file "$env_file" - # Decrypt sops file - _log "Decrypting sops file: $sops_file" - local sops_env - sops_env="$(_decrypt_sops "$sops_file" "$age_key_file")" - # shellcheck disable=SC2086 - eval "$sops_env" + # Decrypt sops file when --sops was provided. On the --env-only path + # (empty $sops_file) the sops_env stays empty and the per-token loop + # below silently skips runner-token imports — exactly the "only + # plaintext half" spec from S2.5. + local sops_env="" + if [ -n "$sops_file" ]; then + _log "Decrypting sops file: $sops_file" + sops_env="$(_decrypt_sops "$sops_file" "$age_key_file")" + # shellcheck disable=SC2086 + eval "$sops_env" + else + _log "No --sops flag — skipping sops decryption (importing plaintext .env only)" + fi # Collect all import operations declare -a operations=() @@ -397,8 +418,12 @@ EOF if $dry_run; then _log "=== DRY-RUN: Import plan ===" _log "Environment file: $env_file" - _log "Sops file: $sops_file" - _log "Age key: $age_key_file" + if [ -n "$sops_file" ]; then + _log "Sops file: $sops_file" + _log "Age key: $age_key_file" + else + _log "Sops file: (none — --env-only import)" + fi _log "" _log "Planned operations:" for op in "${operations[@]}"; do @@ -413,8 +438,12 @@ EOF _log "=== Starting Vault import ===" _log "Environment file: $env_file" - _log "Sops file: $sops_file" - _log "Age key: $age_key_file" + if [ -n "$sops_file" ]; then + _log "Sops file: $sops_file" + _log "Age key: $age_key_file" + else + _log "Sops file: (none — --env-only import)" + fi _log "" local created=0 From f8afdfcf186eca7cf66215e8f1bcc1d76c14a1ce Mon Sep 17 00:00:00 2001 From: Claude Date: Thu, 16 Apr 2026 21:29:35 +0000 Subject: [PATCH 14/93] =?UTF-8?q?fix:=20[nomad-step-2]=20S2-fix-E=20?= =?UTF-8?q?=E2=80=94=20vault-import.sh=20still=20writes=20to=20secret/data?= =?UTF-8?q?/=20not=20kv/data/=20(#926)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The S2 Nomad+Vault migration switched the KV v2 mount from `secret/` to `kv/` in policies, roles, templates, and lib/hvault.sh. tools/vault-import.sh was missed — its curl URL and 4 error messages still hardcoded `secret/data/`, so `disinto init --backend=nomad --with forgejo` hit 404 from vault on the first write (issue body reproduces it with the gardener bot path). Five call sites in _kv_put_secret flipped to `kv/data/`: the POST URL (L154) and the curl-error / 404 / 403 / non-2xx branches (L156, L167, L171, L175). The read helper is hvault_kv_get from lib/hvault.sh, which already resolves through VAULT_KV_MOUNT (default `kv`), so no change needed there. tests/vault-import.bats also updated: dev-mode vault only auto-mounts kv-v2 at secret/, so the test harness now enables a parallel kv-v2 mount at path=kv during setup_file to mirror the production cluster layout. Test-side URLs that assert round-trip reads all follow the same secret/ → kv/ rename. shellcheck clean. 
Co-Authored-By: Claude Opus 4.6 (1M context) --- tests/vault-import.bats | 27 +++++++++++++++++---------- tools/vault-import.sh | 10 +++++----- 2 files changed, 22 insertions(+), 15 deletions(-) diff --git a/tests/vault-import.bats b/tests/vault-import.bats index aa7ac7b..890a900 100644 --- a/tests/vault-import.bats +++ b/tests/vault-import.bats @@ -34,6 +34,13 @@ setup_file() { return 1 fi done + + # Enable kv-v2 at path=kv (production mount per S2 migration). Dev-mode + # vault only auto-mounts kv-v2 at secret/; tests must mirror the real + # cluster layout so vault-import.sh writes land where we read them. + curl -sf -H "X-Vault-Token: test-root-token" \ + -X POST -d '{"type":"kv","options":{"version":"2"}}' \ + "${VAULT_ADDR}/v1/sys/mounts/kv" >/dev/null } teardown_file() { @@ -90,7 +97,7 @@ setup() { # Verify nothing was written to Vault run curl -sf -H "X-Vault-Token: ${VAULT_TOKEN}" \ - "${VAULT_ADDR}/v1/secret/data/disinto/bots/review" + "${VAULT_ADDR}/v1/kv/data/disinto/bots/review" [ "$status" -ne 0 ] } @@ -105,21 +112,21 @@ setup() { # Check bots/review run curl -sf -H "X-Vault-Token: ${VAULT_TOKEN}" \ - "${VAULT_ADDR}/v1/secret/data/disinto/bots/review" + "${VAULT_ADDR}/v1/kv/data/disinto/bots/review" [ "$status" -eq 0 ] echo "$output" | grep -q "review-token" echo "$output" | grep -q "review-pass" # Check bots/dev-qwen run curl -sf -H "X-Vault-Token: ${VAULT_TOKEN}" \ - "${VAULT_ADDR}/v1/secret/data/disinto/bots/dev-qwen" + "${VAULT_ADDR}/v1/kv/data/disinto/bots/dev-qwen" [ "$status" -eq 0 ] echo "$output" | grep -q "llama-token" echo "$output" | grep -q "llama-pass" # Check forge run curl -sf -H "X-Vault-Token: ${VAULT_TOKEN}" \ - "${VAULT_ADDR}/v1/secret/data/disinto/shared/forge" + "${VAULT_ADDR}/v1/kv/data/disinto/shared/forge" [ "$status" -eq 0 ] echo "$output" | grep -q "generic-forge-token" echo "$output" | grep -q "generic-forge-pass" @@ -127,7 +134,7 @@ setup() { # Check woodpecker run curl -sf -H "X-Vault-Token: ${VAULT_TOKEN}" \ - "${VAULT_ADDR}/v1/secret/data/disinto/shared/woodpecker" + "${VAULT_ADDR}/v1/kv/data/disinto/shared/woodpecker" [ "$status" -eq 0 ] echo "$output" | grep -q "wp-agent-secret" echo "$output" | grep -q "wp-forgejo-client" @@ -136,7 +143,7 @@ setup() { # Check chat run curl -sf -H "X-Vault-Token: ${VAULT_TOKEN}" \ - "${VAULT_ADDR}/v1/secret/data/disinto/shared/chat" + "${VAULT_ADDR}/v1/kv/data/disinto/shared/chat" [ "$status" -eq 0 ] echo "$output" | grep -q "forward-auth-secret" echo "$output" | grep -q "chat-client-id" @@ -144,7 +151,7 @@ setup() { # Check runner tokens from sops run curl -sf -H "X-Vault-Token: ${VAULT_TOKEN}" \ - "${VAULT_ADDR}/v1/secret/data/disinto/runner/GITHUB_TOKEN" + "${VAULT_ADDR}/v1/kv/data/disinto/runner/GITHUB_TOKEN" [ "$status" -eq 0 ] echo "$output" | jq -e '.data.data.value == "github-test-token-abc123"' } @@ -194,7 +201,7 @@ setup() { # Verify the new value was written (path is disinto/bots/dev-qwen, key is token) run curl -sf -H "X-Vault-Token: ${VAULT_TOKEN}" \ - "${VAULT_ADDR}/v1/secret/data/disinto/bots/dev-qwen" + "${VAULT_ADDR}/v1/kv/data/disinto/bots/dev-qwen" [ "$status" -eq 0 ] echo "$output" | jq -e '.data.data.token == "MODIFIED-LLAMA-TOKEN"' } @@ -228,13 +235,13 @@ setup() { # Verify each value round-trips intact. 
run curl -sf -H "X-Vault-Token: ${VAULT_TOKEN}" \ - "${VAULT_ADDR}/v1/secret/data/disinto/bots/review" + "${VAULT_ADDR}/v1/kv/data/disinto/bots/review" [ "$status" -eq 0 ] echo "$output" | jq -e '.data.data.token == "abc|xyz"' echo "$output" | jq -e '.data.data.pass == "p1|p2|p3"' run curl -sf -H "X-Vault-Token: ${VAULT_TOKEN}" \ - "${VAULT_ADDR}/v1/secret/data/disinto/shared/forge" + "${VAULT_ADDR}/v1/kv/data/disinto/shared/forge" [ "$status" -eq 0 ] echo "$output" | jq -e '.data.data.admin_token == "admin|with|pipes"' } diff --git a/tools/vault-import.sh b/tools/vault-import.sh index d7a4a01..bea4a07 100755 --- a/tools/vault-import.sh +++ b/tools/vault-import.sh @@ -151,9 +151,9 @@ _kv_put_secret() { -X POST \ -d "$payload" \ -o "$tmpfile" \ - "${VAULT_ADDR}/v1/secret/data/${path}")" || { + "${VAULT_ADDR}/v1/kv/data/${path}")" || { rm -f "$tmpfile" - _err "Failed to write to Vault at secret/data/${path}: curl error" + _err "Failed to write to Vault at kv/data/${path}: curl error" return 1 } rm -f "$tmpfile" @@ -164,15 +164,15 @@ _kv_put_secret() { return 0 ;; 404) - _err "KV path not found: secret/data/${path}" + _err "KV path not found: kv/data/${path}" return 1 ;; 403) - _err "Permission denied writing to secret/data/${path}" + _err "Permission denied writing to kv/data/${path}" return 1 ;; *) - _err "Failed to write to Vault at secret/data/${path}: HTTP $http_code" + _err "Failed to write to Vault at kv/data/${path}: HTTP $http_code" return 1 ;; esac From 5e83ecc2ef6cd6208253f703d1c5c1f6366bf56b Mon Sep 17 00:00:00 2001 From: Claude Date: Thu, 16 Apr 2026 22:00:13 +0000 Subject: [PATCH 15/93] =?UTF-8?q?fix:=20[nomad-step-2]=20S2-fix-F=20?= =?UTF-8?q?=E2=80=94=20wire=20tools/vault-seed-.sh=20into=20bin/disin?= =?UTF-8?q?to=20--with=20=20(#928)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit `tools/vault-seed-forgejo.sh` existed and worked, but `bin/disinto init --backend=nomad --with forgejo` never invoked it, so a fresh LXC with an empty Vault hit `Template Missing: vault.read(kv/data/disinto/shared/ forgejo)` and the forgejo alloc timed out inside deploy.sh's 240s healthy_deadline — operator had to run the seeder + `nomad alloc restart` by hand to recover. In `_disinto_init_nomad`, after `vault-import.sh` (or its skip branch) and before `deploy.sh`, iterate `--with ` and auto-invoke `tools/vault-seed-.sh` when the file exists + is executable. Services without a seeder are silently skipped — Step 3+ services (woodpecker, chat, etc.) can ship their own seeder without touching `bin/disinto`. VAULT_ADDR is passed explicitly because cluster-up.sh writes the profile.d export during this same init run (current shell hasn't sourced it yet) and `vault-seed-forgejo.sh` — unlike its sibling vault-* scripts — requires the caller to set VAULT_ADDR instead of defaulting it via `_hvault_default_env`. Mirror the loop in the --dry-run plan so the operator-visible plan matches the real run. Co-Authored-By: Claude Opus 4.6 (1M context) --- bin/disinto | 59 ++++++++++++++++++++++++++++++++++- tests/disinto-init-nomad.bats | 22 +++++++++++++ 2 files changed, 80 insertions(+), 1 deletion(-) diff --git a/bin/disinto b/bin/disinto index f9bfe04..0a78db6 100755 --- a/bin/disinto +++ b/bin/disinto @@ -783,9 +783,29 @@ _disinto_init_nomad() { fi if [ -n "$with_services" ]; then + # Vault seed plan (S2.6, #928): one line per service whose + # tools/vault-seed-.sh ships. 
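+            # (Today that's just forgejo: tools/vault-seed-forgejo.sh.)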
Services without a seeder are + # silently skipped — the real-run loop below mirrors this, + # making `--with woodpecker` in Step 3 auto-invoke + # tools/vault-seed-woodpecker.sh once that file lands without + # any further change to bin/disinto. + local seed_hdr_printed=false + local IFS=',' + for svc in $with_services; do + svc=$(echo "$svc" | xargs) # trim whitespace + local seed_script="${FACTORY_ROOT}/tools/vault-seed-${svc}.sh" + if [ -x "$seed_script" ]; then + if [ "$seed_hdr_printed" = false ]; then + echo "── Vault seed dry-run ─────────────────────────────────" + seed_hdr_printed=true + fi + echo "[seed] [dry-run] ${seed_script} --dry-run" + fi + done + [ "$seed_hdr_printed" = true ] && echo "" + echo "── Deploy services dry-run ────────────────────────────" echo "[deploy] services to deploy: ${with_services}" - local IFS=',' for svc in $with_services; do svc=$(echo "$svc" | xargs) # trim whitespace # Validate known services first @@ -893,6 +913,43 @@ _disinto_init_nomad() { echo "[import] no --import-env/--import-sops — skipping; set them or seed kv/disinto/* manually before deploying secret-dependent services" fi + # Seed Vault for services that ship their own seeder (S2.6, #928). + # Convention: tools/vault-seed-.sh — auto-invoked when --with + # is requested. Runs AFTER vault-import so that real imported values + # win over generated seeds when both are present; each seeder is + # idempotent on a per-key basis (see vault-seed-forgejo.sh's + # "missing → generate, present → unchanged" contract), so re-running + # init does not rotate existing keys. Services without a seeder are + # silently skipped — keeps this loop forward-compatible with Step 3+ + # services that may ship their own seeder without touching bin/disinto. + # + # VAULT_ADDR is passed explicitly because cluster-up.sh writes the + # profile.d export *during* this same init run, so the current shell + # hasn't sourced it yet; sibling vault-* scripts (engines/policies/ + # auth/import) default VAULT_ADDR internally via _hvault_default_env, + # but vault-seed-forgejo.sh requires the caller to set it. + if [ -n "$with_services" ]; then + local vault_addr="${VAULT_ADDR:-http://127.0.0.1:8200}" + local IFS=',' + for svc in $with_services; do + svc=$(echo "$svc" | xargs) # trim whitespace + local seed_script="${FACTORY_ROOT}/tools/vault-seed-${svc}.sh" + if [ -x "$seed_script" ]; then + echo "" + echo "── Seeding Vault for ${svc} ───────────────────────────" + if [ "$(id -u)" -eq 0 ]; then + VAULT_ADDR="$vault_addr" "$seed_script" || exit $? + else + if ! command -v sudo >/dev/null 2>&1; then + echo "Error: vault-seed-${svc}.sh must run as root and sudo is not installed" >&2 + exit 1 + fi + sudo -n "VAULT_ADDR=$vault_addr" -- "$seed_script" || exit $? + fi + fi + done + fi + # Deploy services if requested if [ -n "$with_services" ]; then echo "" diff --git a/tests/disinto-init-nomad.bats b/tests/disinto-init-nomad.bats index f38805e..8467ebb 100644 --- a/tests/disinto-init-nomad.bats +++ b/tests/disinto-init-nomad.bats @@ -155,6 +155,28 @@ setup_file() { [[ "$output" == *"[deploy] dry-run complete"* ]] } +# S2.6 / #928 — every --with that ships tools/vault-seed-.sh +# must auto-invoke the seeder before deploy.sh runs. forgejo is the +# only service with a seeder today, so the dry-run plan must include +# its seed line when --with forgejo is set. The seed block must also +# appear BEFORE the deploy block (seeded secrets must exist before +# nomad reads the template stanza) — pinned here by scanning output +# order. 
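+# (Mechanics: `grep -n` prefixes each match with its line number,
+# `cut -d: -f1` keeps just the number, and the final numeric `-lt`
+# compare asserts the seed header printed first.)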
Services without a seeder (e.g. unknown hypothetical future +# ones) are silently skipped by the loop convention. +@test "disinto init --backend=nomad --with forgejo --dry-run prints seed plan before deploy plan" { + run "$DISINTO_BIN" init placeholder/repo --backend=nomad --with forgejo --dry-run + [ "$status" -eq 0 ] + [[ "$output" == *"Vault seed dry-run"* ]] + [[ "$output" == *"tools/vault-seed-forgejo.sh --dry-run"* ]] + # Order: seed header must appear before deploy header. + local seed_line deploy_line + seed_line=$(echo "$output" | grep -n "Vault seed dry-run" | head -1 | cut -d: -f1) + deploy_line=$(echo "$output" | grep -n "Deploy services dry-run" | head -1 | cut -d: -f1) + [ -n "$seed_line" ] + [ -n "$deploy_line" ] + [ "$seed_line" -lt "$deploy_line" ] +} + @test "disinto init --backend=nomad --with forgejo,forgejo --dry-run handles comma-separated services" { run "$DISINTO_BIN" init placeholder/repo --backend=nomad --with forgejo,forgejo --dry-run [ "$status" -eq 0 ] From f21408028006182a9c66d4df6b251c02c3d5a308 Mon Sep 17 00:00:00 2001 From: Claude Date: Thu, 16 Apr 2026 22:14:05 +0000 Subject: [PATCH 16/93] fix: [review-r1] seed loop sudo invocation bypasses sudoers env_reset (#929) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit `sudo -n "VAULT_ADDR=$vault_addr" -- "$seed_script"` passed VAULT_ADDR as a sudoers env-assignment argument. With the default `env_reset=on` policy (almost all distros), sudo silently discards env assignments unless the variable is in `env_keep` — and VAULT_ADDR is not. The seeder then hit its own precondition check at vault-seed-forgejo.sh:109 and died with "VAULT_ADDR unset", breaking the fresh-LXC non-root acceptance path the PR was written to close. Fix: run `env` as the command under sudo — `sudo -n -- env "VAULT_ADDR=$vault_addr" "$seed_script"` — so VAULT_ADDR is set in the child process directly, unaffected by sudoers env handling. The root (non-sudo) branch already used shell-level env assignment and was correct. Adds a grep-level regression guard that pins the `env VAR=val` invocation and negative-asserts the unsafe bare-argument form. Co-Authored-By: Claude Opus 4.6 (1M context) --- bin/disinto | 9 ++++++++- tests/disinto-init-nomad.bats | 16 ++++++++++++++++ 2 files changed, 24 insertions(+), 1 deletion(-) diff --git a/bin/disinto b/bin/disinto index 0a78db6..5f57927 100755 --- a/bin/disinto +++ b/bin/disinto @@ -928,6 +928,13 @@ _disinto_init_nomad() { # hasn't sourced it yet; sibling vault-* scripts (engines/policies/ # auth/import) default VAULT_ADDR internally via _hvault_default_env, # but vault-seed-forgejo.sh requires the caller to set it. + # + # The non-root branch invokes the seeder as `sudo -n -- env VAR=val + # script` rather than `sudo -n VAR=val -- script`: sudo treats bare + # `VAR=val` args as sudoers env-assignments, which the default + # `env_reset=on` policy silently discards unless the variable is in + # `env_keep` (VAULT_ADDR is not). Using `env` as the actual command + # sets VAULT_ADDR in the child process regardless of sudoers policy. if [ -n "$with_services" ]; then local vault_addr="${VAULT_ADDR:-http://127.0.0.1:8200}" local IFS=',' @@ -944,7 +951,7 @@ _disinto_init_nomad() { echo "Error: vault-seed-${svc}.sh must run as root and sudo is not installed" >&2 exit 1 fi - sudo -n "VAULT_ADDR=$vault_addr" -- "$seed_script" || exit $? + sudo -n -- env "VAULT_ADDR=$vault_addr" "$seed_script" || exit $? 
fi fi done diff --git a/tests/disinto-init-nomad.bats b/tests/disinto-init-nomad.bats index 8467ebb..21f4303 100644 --- a/tests/disinto-init-nomad.bats +++ b/tests/disinto-init-nomad.bats @@ -177,6 +177,22 @@ setup_file() { [ "$seed_line" -lt "$deploy_line" ] } +# Regression guard (PR #929 review): `sudo -n VAR=val -- cmd` is subject +# to sudoers env_reset policy and silently drops VAULT_ADDR unless it's +# in env_keep (it isn't in default configs). vault-seed-forgejo.sh +# requires VAULT_ADDR and dies at its own precondition check if unset, +# so the non-root branch MUST invoke `sudo -n -- env VAR=val cmd` so +# that `env` sets the variable in the child process regardless of +# sudoers policy. This grep-level guard catches a revert to the unsafe +# form that silently broke non-root seed runs on a fresh LXC. +@test "seed loop invokes sudo via 'env VAR=val' (bypasses sudoers env_reset)" { + run grep -F 'sudo -n -- env "VAULT_ADDR=' "$DISINTO_BIN" + [ "$status" -eq 0 ] + # Negative: no bare `sudo -n "VAR=val" --` form anywhere in the file. + run grep -F 'sudo -n "VAULT_ADDR=' "$DISINTO_BIN" + [ "$status" -ne 0 ] +} + @test "disinto init --backend=nomad --with forgejo,forgejo --dry-run handles comma-separated services" { run "$DISINTO_BIN" init placeholder/repo --backend=nomad --with forgejo,forgejo --dry-run [ "$status" -eq 0 ] From caf937f295054b1d7cdc7999407443b7ea8a99ae Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 17 Apr 2026 01:07:31 +0000 Subject: [PATCH 17/93] chore: gardener housekeeping 2026-04-17 - Promote #910, #914, #867 to backlog with acceptance criteria + affected files - Promote #820 to backlog (already well-structured, dep on #758 gates pickup) - Stage #915 as dust (no-op sed, single-line removal) - Update all AGENTS.md watermarks to HEAD - Root AGENTS.md: document vault-seed-.sh convention + complete test file list - Track gardener/dust.jsonl in git (remove from .gitignore) --- .gitignore | 1 - AGENTS.md | 9 +-- architect/AGENTS.md | 2 +- dev/AGENTS.md | 2 +- gardener/AGENTS.md | 2 +- gardener/dust.jsonl | 1 + gardener/pending-actions.json | 100 ++++------------------------------ lib/AGENTS.md | 2 +- nomad/AGENTS.md | 2 +- planner/AGENTS.md | 2 +- predictor/AGENTS.md | 2 +- review/AGENTS.md | 2 +- supervisor/AGENTS.md | 2 +- vault/policies/AGENTS.md | 2 +- 14 files changed, 26 insertions(+), 105 deletions(-) create mode 100644 gardener/dust.jsonl diff --git a/.gitignore b/.gitignore index 21c6fbc..a29450c 100644 --- a/.gitignore +++ b/.gitignore @@ -20,7 +20,6 @@ metrics/supervisor-metrics.jsonl # OS .DS_Store dev/ci-fixes-*.json -gardener/dust.jsonl # Individual encrypted secrets (managed by disinto secrets add) secrets/ diff --git a/AGENTS.md b/AGENTS.md index ad3867b..fced0c6 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -1,4 +1,4 @@ - + # Disinto — Agent Instructions ## What this repo is @@ -44,12 +44,13 @@ disinto/ (code repo) ├── formulas/ Issue templates (TOML specs for multi-step agent tasks) ├── docker/ Dockerfiles and entrypoints: reproduce, triage, edge dispatcher, chat (server.py, entrypoint-chat.sh, Dockerfile, ui/) ├── tools/ Operational tools: edge-control/ (register.sh, install.sh, verify-chat-sandbox.sh) -│ vault-apply-policies.sh, vault-apply-roles.sh, vault-import.sh, vault-seed-forgejo.sh — Vault provisioning (S2.1/S2.2) +│ vault-apply-policies.sh, vault-apply-roles.sh, vault-import.sh — Vault provisioning (S2.1/S2.2) +│ vault-seed-.sh — per-service Vault secret seeders; auto-invoked by `bin/disinto --with ` (add a new file to support a new service) ├── 
docs/ Protocol docs (PHASE-PROTOCOL.md, EVIDENCE-ARCHITECTURE.md) ├── site/ disinto.ai website content -├── tests/ Test files (mock-forgejo.py, smoke-init.sh, lib-hvault.bats, disinto-init-nomad.bats) +├── tests/ Test files (mock-forgejo.py, smoke-init.sh, lib-hvault.bats, lib-generators.bats, vault-import.bats, disinto-init-nomad.bats) ├── templates/ Issue templates -├── bin/ The `disinto` CLI script +├── bin/ The `disinto` CLI script (`--with ` deploys services + runs their Vault seeders) ├── disinto-factory/ Setup documentation and skill ├── state/ Runtime state ├── .woodpecker/ Woodpecker CI pipeline configs diff --git a/architect/AGENTS.md b/architect/AGENTS.md index 7f8b1f4..51b24b1 100644 --- a/architect/AGENTS.md +++ b/architect/AGENTS.md @@ -1,4 +1,4 @@ - + # Architect — Agent Instructions ## What this agent is diff --git a/dev/AGENTS.md b/dev/AGENTS.md index 13d9736..02fd612 100644 --- a/dev/AGENTS.md +++ b/dev/AGENTS.md @@ -1,4 +1,4 @@ - + # Dev Agent **Role**: Implement issues autonomously — write code, push branches, address diff --git a/gardener/AGENTS.md b/gardener/AGENTS.md index a692876..e9ad846 100644 --- a/gardener/AGENTS.md +++ b/gardener/AGENTS.md @@ -1,4 +1,4 @@ - + # Gardener Agent **Role**: Backlog grooming — detect duplicate issues, missing acceptance diff --git a/gardener/dust.jsonl b/gardener/dust.jsonl new file mode 100644 index 0000000..14b0d5c --- /dev/null +++ b/gardener/dust.jsonl @@ -0,0 +1 @@ +{"issue":915,"group":"lib/generators.sh","title":"remove no-op sed in generate_compose --build mode","reason":"sed replaces agents: with itself — no behavior change; single-line removal","ts":"2026-04-17T01:04:05Z"} diff --git a/gardener/pending-actions.json b/gardener/pending-actions.json index 267c586..1c89c7d 100644 --- a/gardener/pending-actions.json +++ b/gardener/pending-actions.json @@ -1,117 +1,37 @@ [ { "action": "edit_body", - "issue": 900, - "body": "Flagged by AI reviewer in PR #897.\n\n## Problem\n\nThe policy at `vault/policies/service-forgejo.hcl` grants:\n\n```hcl\npath \"kv/data/disinto/shared/forgejo/*\" {\n capabilities = [\"read\"]\n}\n```\n\nBut the consul-template stanza in `nomad/jobs/forgejo.hcl` reads:\n\n```\n{{- with secret \"kv/data/disinto/shared/forgejo\" -}}\n```\n\nVault glob `/*` requires at least one path segment after `forgejo/` (e.g. `forgejo/subkey`). It does **not** match the bare path `kv/data/disinto/shared/forgejo` that the template actually calls. 
Vault ACL longest-prefix matching: `forgejo/*` is never hit for a request to `forgejo`.\n\nRuntime consequence: consul-template `with` block receives a 403 permission denied → evaluates to empty (false) → `else` branch renders `seed-me` placeholder values → Forgejo starts with obviously-wrong secrets despite `vault-seed-forgejo.sh` having run successfully.\n\n## Fix\n\nReplace the glob with an exact path in `vault/policies/service-forgejo.hcl`:\n\n```hcl\npath \"kv/data/disinto/shared/forgejo\" {\n capabilities = [\"read\"]\n}\n\npath \"kv/metadata/disinto/shared/forgejo\" {\n capabilities = [\"list\", \"read\"]\n}\n```\n\n(The `/*` glob is only useful if future subkeys are written under `forgejo/`; the current design stores both secrets in a single KV document at the `forgejo` path.)\n\nThis is a pre-existing defect in `vault/policies/service-forgejo.hcl`; that file was not changed by PR #897.\n\n---\n*Auto-created from AI review*\n\n## Affected files\n- `vault/policies/service-forgejo.hcl` — replace glob path with exact path + metadata path\n\n## Acceptance criteria\n- [ ] `vault/policies/service-forgejo.hcl` grants exact path `kv/data/disinto/shared/forgejo` (not `forgejo/*`)\n- [ ] Metadata path `kv/metadata/disinto/shared/forgejo` is also granted read+list\n- [ ] consul-template `with secret \"kv/data/disinto/shared/forgejo\"` resolves without 403 (verified via `vault policy read service-forgejo`)\n- [ ] `shellcheck` clean (no shell changes expected)\n" + "issue": 910, + "body": "Flagged by AI reviewer in PR #909.\n\n## Problem\n\n`tools/vault-import.sh` still uses hardcoded `secret/data/${path}` for its curl-based KV write (lines 149, 151, 162, 166, 170). The rest of the codebase was migrated to the configurable `VAULT_KV_MOUNT` variable (defaulting to `kv`) via PR #909. Any deployment with `kv/` as its KV mount will see 403/404 failures when `vault-import.sh` runs.\n\n## Fix\n\nEither:\n1. Refactor the write in `vault-import.sh` to call `hvault_kv_put` (which now respects `VAULT_KV_MOUNT`), or\n2. Replace the hardcoded `secret/data` reference with `${VAULT_KV_MOUNT:-kv}/data` matching the convention in `lib/hvault.sh`.\n\n---\n*Auto-created from AI review*\n\n## Affected files\n\n- `tools/vault-import.sh` (lines 149, 151, 162, 166, 170 — hardcoded `secret/data` references)\n- `lib/hvault.sh` (reference implementation using `VAULT_KV_MOUNT`)\n\n## Acceptance criteria\n\n- [ ] `tools/vault-import.sh` uses `${VAULT_KV_MOUNT:-kv}/data` (or calls `hvault_kv_put`) instead of hardcoded `secret/data`\n- [ ] No hardcoded `secret/data` path references remain in `tools/vault-import.sh`\n- [ ] Vault KV writes succeed when `VAULT_KV_MOUNT=kv` is set (matching the standard deployment config)\n- [ ] `shellcheck` clean\n" }, { "action": "add_label", - "issue": 900, + "issue": 910, "label": "backlog" }, { "action": "edit_body", - "issue": 898, - "body": "Flagged by AI reviewer in PR #889.\n\n## Problem\n\n`tools/vault-import.sh` serializes each entry in `ops_data` as `\"${source_value}|${status}\"` (line 498). Extraction at lines 510-511 uses `${data%%|*}` (first field) and `${data##*|}` (last field). If `source_value` contains a literal `|`, `${data%%|*}` truncates it to the first segment, silently writing a corrupted value to Vault.\n\nThe same separator is used in `paths_to_write` (line 519) to join multiple kv-pairs for a path. 
When `IFS=\"|\"` splits the string back into an array (line 540), a value containing `|` is split across array elements, corrupting the write.\n\n## Failure mode\n\nAny secret value with a pipe character (e.g. a generated password or composed token like `abc|xyz`) is silently truncated or misrouted on import. No error is emitted.\n\n## Fix\n\nReplace the `|`-delimited string with a bash indexed array for accumulating per-path kv pairs, eliminating the need for a delimiter that conflicts with possible value characters.\n\n---\n*Auto-created from AI review of PR #889*\n\n## Affected files\n- `tools/vault-import.sh` — replace pipe-delimited string accumulation with bash indexed arrays (lines ~498–540)\n\n## Acceptance criteria\n- [ ] A secret value containing `|` (e.g. `abc|xyz`) is imported to Vault without truncation or corruption\n- [ ] No regression for values without `|`\n- [ ] `shellcheck` clean\n" + "issue": 914, + "body": "Flagged by AI reviewer in PR #911.\n\n## Problem\n\n`lib/generators.sh` fixes the `agents` service missing `pull_policy: build` in `--build` mode (PR #893), but the `edge` service has the same root cause: the sed replacement at line 664 produces `build: ./docker/edge` with no `pull_policy: build`. Without it, `docker compose up -d --force-recreate` reuses the cached edge image and silently keeps running stale code even after source changes.\n\n## Fix\n\nAdd `\\n pull_policy: build` to the edge sed replacement, matching the pattern applied to agents in PR #893.\n\n---\n*Auto-created from AI review*\n\n## Affected files\n\n- `lib/generators.sh` (line 664 — edge service sed replacement missing `pull_policy: build`)\n\n## Acceptance criteria\n\n- [ ] `lib/generators.sh` edge service block emits `pull_policy: build` when `--build` mode is active (matching the pattern from PR #893 for the agents service)\n- [ ] `docker compose up -d --force-recreate` after source changes rebuilds the edge image rather than using the cached layer\n- [ ] Generated `docker-compose.yml` edge service stanza contains `pull_policy: build`\n- [ ] `shellcheck` clean\n" }, { "action": "add_label", - "issue": 898, + "issue": 914, "label": "backlog" }, { "action": "edit_body", - "issue": 893, - "body": "Flagged by AI reviewer in PR #892.\n\n## Problem\n\n`disinto init --build` generates the `agents:` service by first emitting `image: ghcr.io/disinto/agents:${DISINTO_IMAGE_TAG:-latest}` and then running a `sed -i` substitution (`lib/generators.sh:793`) that replaces the `image:` line with a `build:` block. 
The substitution does not add `pull_policy: build`.\n\nResult: `docker compose up` with `--build`-generated compose files still uses the cached image for the base `agents:` service, even when `docker/agents/` source has changed — the same silent-stale-image bug that #887 fixed for the three local-model service stanzas.\n\n## Fix\n\nThe `sed` substitution on line 793 should also inject `pull_policy: build` after the emitted `build:` block.\n\n---\n*Auto-created from AI review of PR #892*\n\n## Affected files\n- `lib/generators.sh` (line ~793) — add `pull_policy: build` to the agents service sed substitution\n\n## Acceptance criteria\n- [ ] `disinto init --build`-generated compose file includes `pull_policy: build` in the `agents:` service stanza\n- [ ] `docker compose up` rebuilds the agents image from local source when `docker/agents/` changes\n- [ ] Non-`--build` compose generation is unchanged\n- [ ] `shellcheck` clean\n" + "issue": 867, + "body": "## Incident\n\n**2026-04-16 ~10:55–11:52 UTC.** Woodpecker CI agent (`disinto-woodpecker-agent`) entered a repeated gRPC-error crashloop (Codeberg #813 class — gRPC-in-nested-docker). Every workflow it accepted exited 1 within seconds, never actually running pipeline steps.\n\n**Blast radius:** dev-qwen took issue #842 at 10:55, opened PR #859, and burned its full 3-attempt `pr-lifecycle` CI-fix budget between 10:55 and 11:08 reacting to these infra-flake \"CI failures.\" Each failure arrived in ~30–60 seconds, too fast to be a real test run. After exhausting the budget, dev-qwen marked #842 as `blocked: ci_exhausted` and moved on. No real bug was being detected; the real failure surfaced later only after an operator restarted the WP agent and manually retriggered pipeline #966 — which then returned a legitimate `bats-init-nomad` failure in test #6 (different issue).\n\n**Root cause of the infra-flake:** gRPC-in-nested-docker bug, Woodpecker server ↔ agent comms inside nested containers. Known-flaky; restart of `disinto-woodpecker-agent` clears it.\n\n**Recovery:** operator `docker restart disinto-woodpecker-agent` + retrigger pipelines via WP API POST `/api/repos/2/pipelines/`. Fresh run reached real stage signal.\n\n## Why this burned dev-qwen's budget\n\n`pr-lifecycle`'s CI-fix budget treats every failed commit-status as a signal to invoke the agent. It has no notion of \"infra flake\" vs. \"real test failure\" and no heuristic to distinguish them. Four infra-flake failures in 13 minutes looked identical to four real code-bug failures.\n\n## Suggestions — what supervisor can check every 20min\n\nSupervisor runs every `1200s` already. Add these probes:\n\n**1. WP agent container health.**\n```\ndocker inspect disinto-woodpecker-agent --format '{{.State.Health.Status}}'\n```\nIf `unhealthy` for the second consecutive supervisor tick → **restart it automatically + post a comment on any currently-running dev-bot/dev-qwen issues warning \"CI agent was restarted; subsequent failures before this marker may be infra-flake.\"**\n\n**2. Fast-failure heuristic on WP pipelines.**\nQuery WP API `GET /api/repos/2/pipelines?page=1`. For each pipeline in state `failure`, compute `finished - started`. If duration < 60s, flag as probable infra-flake. Three flagged flakes within a 15-min window → trigger agent restart as in (1) and a bulk-retrigger via POST `/api/repos/2/pipelines/` for each.\n\n**3. grpc error pattern in agent log.**\n`docker logs --since 20m disinto-woodpecker-agent 2>&1 | grep -c 'grpc error'` — if ≥3 matches, agent is probably wedged. 
Trigger restart as in (1).\n\n**4. Issue-level guard.**\nWhen supervisor detects an agent restart, scan for issues updated in the preceding 30min with label `blocked: ci_exhausted` and for each one:\n- unassign + remove `blocked` label (return to pool)\n- comment on the issue: *\"CI agent was unhealthy between HH:MM and HH:MM — prior 3/3 retry budget may have been spent on infra flake, not real failures. Re-queueing for a fresh attempt.\"*\n- retrigger the PR's latest WP pipeline\n\nThis last step is the key correction: **`ci_exhausted` preceded by WP-agent-unhealth = false positive; return to pool with context.**\n\n## Why this matters for the migration\n\nBetween now and cutover every WP CI flake that silently exhausts an agent's budget steals hours of clock time. Without an automatic recovery path, the pace of the step-N backlogs falls off a cliff the moment the agent next goes unhealthy — and it *will* go unhealthy again (Codeberg #813 is not fixed upstream yet).\n\n## Fix for this specific incident (already applied manually)\n\n- Restarted `disinto-woodpecker-agent`.\n- Closed PR #859 (kept branch `fix/issue-842` at `64080232`).\n- Unassigned dev-qwen from #842, removed `blocked` label, appended prior-art section + pipeline #966 test-#6 failure details to issue body so the next claimant starts with full context.\n\n## Non-goals\n\n- Not trying to fix Codeberg #813 itself (upstream gRPC-in-nested-docker issue).\n- Not trying to fix `pr-lifecycle`'s budget logic — the supervisor-side detection is cheaper and more robust than per-issue budget changes.\n\n## Labels / meta\n\n- `bug-report` + supervisor-focused. Classify severity as blocker for the migration cadence (not for factory day-to-day — it only bites when an unfixable-by-dev issue hits the budget).\n\n## Affected files\n\n- `supervisor/supervisor-run.sh` — add WP agent health probes and flake-detection logic\n- `supervisor/preflight.sh` — may need additional data collection for WP agent health status\n\n## Acceptance criteria\n\n- [ ] Supervisor detects an unhealthy `disinto-woodpecker-agent` container (via `docker inspect` health status or gRPC error log count ≥ 3) and automatically restarts it\n- [ ] After an auto-restart, supervisor scans for issues updated in the prior 30 min labeled `blocked: ci_exhausted` and returns them to the pool (unassign, remove `blocked`, add comment noting infra-flake window)\n- [ ] Fast-failure heuristic: pipelines completing in <60s are flagged as probable infra-flake; 3+ in a 15-min window triggers the restart+retrigger flow\n- [ ] Already-swept PRs/issues are not processed twice (idempotency guard via `` comment)\n- [ ] CI green\n" }, { "action": "add_label", - "issue": 893, - "label": "backlog" - }, - { - "action": "edit_body", - "issue": 890, - "body": "Flagged by AI reviewer in PR #888.\n\n## Problem\n\n`lib/hvault.sh` functions `hvault_kv_get`, `hvault_kv_put`, and `hvault_kv_list` all hardcode `secret/data/` and `secret/metadata/` as KV v2 path prefixes (lines 117, 157, 173).\n\nThe Nomad+Vault migration (S2.1, #879) establishes `kv/` as the mount name for all factory secrets — every policy in `vault/policies/*.hcl` grants ACL on `kv/data/disinto/...` paths.\n\nIf any agent calls `hvault_kv_get` after the migration, Vault will route the request to `secret/data/...` but the token only holds ACL for `kv/data/...`, producing a 403 Forbidden.\n\n## Fix\n\nChange the mount prefix in `hvault_kv_get`, `hvault_kv_put`, and `hvault_kv_list` from `secret/` to `kv/`, or make the mount name configurable 
via `VAULT_KV_MOUNT` (defaulting to `kv`). Coordinate with S2.2 (#880) which writes secrets into the `kv/` mount.\n\n---\n*Auto-created from AI review of PR #888*\n\n## Affected files\n- `lib/hvault.sh` — change `secret/data/` and `secret/metadata/` prefixes to `kv/data/` and `kv/metadata/` (lines ~117, 157, 173); optionally make configurable via `VAULT_KV_MOUNT`\n\n## Acceptance criteria\n- [ ] `hvault_kv_get`, `hvault_kv_put`, `hvault_kv_list` use `kv/` mount prefix (not `secret/`)\n- [ ] Agents can read/write KV paths that policies in `vault/policies/*.hcl` grant (no 403)\n- [ ] Optionally: `VAULT_KV_MOUNT` env var overrides the mount name (defaults to `kv`)\n- [ ] `shellcheck` clean\n" - }, - { - "action": "add_label", - "issue": 890, - "label": "backlog" - }, - { - "action": "edit_body", - "issue": 877, - "body": "Flagged by AI reviewer in PR #875.\n\n## Problem\n\n`validate_projects_dir()` in `docker/agents/entrypoint.sh` uses a command substitution that triggers `set -e` before the intended error-logging branch runs:\n\n```bash\ntoml_count=$(compgen -G \"${DISINTO_DIR}/projects/*.toml\" 2>/dev/null | wc -l)\n```\n\nWhen no `.toml` files are present, `compgen -G` exits 1. With `pipefail`, the pipeline exits 1. `set -e` causes the script to exit before `if [ \"$toml_count\" -eq 0 ]` is evaluated, so the FATAL diagnostic messages are never printed. The container still fast-fails (correct outcome), but the operator sees no explanation.\n\nEvery other `compgen -G` usage in the file uses the safer conditional pattern (lines 259, 322).\n\n## Fix\n\nReplace the `wc -l` pattern with:\n\n```bash\nif ! compgen -G \"${DISINTO_DIR}/projects/*.toml\" >/dev/null 2>&1; then\n log \"FATAL: No real .toml files found in ${DISINTO_DIR}/projects/\"\n ...\n exit 1\nfi\n```\n\n---\n*Auto-created from AI review*\n\n## Affected files\n- `docker/agents/entrypoint.sh` — fix `validate_projects_dir()` to use conditional compgen pattern instead of `wc -l` pipeline\n\n## Acceptance criteria\n- [ ] When no `.toml` files are present, the FATAL message is printed before the container exits\n- [ ] Container still exits non-zero in that case\n- [ ] Matches the pattern already used at lines 259 and 322\n- [ ] `shellcheck` clean\n" - }, - { - "action": "add_label", - "issue": 877, + "issue": 867, "label": "backlog" }, { "action": "add_label", - "issue": 773, - "label": "backlog" - }, - { - "action": "edit_body", - "issue": 883, - "body": "Part of the Nomad+Vault migration. **Step 2 — Vault policies + workload identity + secrets import.**\n\n~~**Blocked by: #880 (S2.2), #881 (S2.3).**~~ Dependencies closed; unblocked.\n\n## Goal\n\nWire the Step-2 building blocks (import, auth, policies) into `bin/disinto init --backend=nomad` so a single command on a fresh LXC provisions cluster + policies + auth + imports secrets + deploys services.\n\n## Scope\n\nAdd flags to `disinto init --backend=nomad`:\n\n- `--import-env PATH` — points at an existing `.env` (from old stack).\n- `--import-sops PATH` — points at the sops-encrypted `.env.vault.enc`.\n- `--age-key PATH` — points at the sops age keyfile (required if `--import-sops` is set).\n\nFlow when any of `--import-*` is set:\n\n1. `cluster-up.sh` (Step 0, unchanged).\n2. `tools/vault-apply-policies.sh` (S2.1, idempotent).\n3. `lib/init/nomad/vault-nomad-auth.sh` (S2.3, idempotent).\n4. `tools/vault-import.sh --env PATH --sops PATH --age-key PATH` (S2.2).\n5. If `--with ` was also passed, `lib/init/nomad/deploy.sh ` (Step 1, unchanged).\n6. 
Final summary: cluster + policies + auth + imported secrets count + deployed services + ports.\n\nFlow when **no** import flags are set:\n- Skip step 4; still apply policies + auth.\n- Log: `[import] no --import-env/--import-sops — skipping; set them or seed kv/disinto/* manually before deploying secret-dependent services`.\n\nFlag validation:\n- `--import-sops` without `--age-key` → error.\n- `--age-key` without `--import-sops` → error.\n- `--import-env` alone (no sops) → OK.\n- `--backend=docker` + any `--import-*` → error.\n\n## Affected files\n- `bin/disinto` — add `--import-env`, `--import-sops`, `--age-key` flags to `init --backend=nomad`\n- `docs/nomad-migration.md` (new) — cutover-day invocation shape\n- `lib/init/nomad/vault-nomad-auth.sh` (S2.3) — called as step 3\n- `tools/vault-import.sh` (S2.2) — called as step 4\n- `tools/vault-apply-policies.sh` (S2.1) — called as step 2\n\n## Acceptance criteria\n- [ ] `disinto init --backend=nomad --import-env /tmp/.env --import-sops /tmp/.enc --age-key /tmp/keys.txt --with forgejo` completes: cluster up, policies applied, JWT auth configured, KV populated, Forgejo deployed reading Vault secrets\n- [ ] Re-running is a no-op at every layer\n- [ ] `--import-sops` without `--age-key` exits with a clear error\n- [ ] `--backend=docker` with `--import-env` exits with a clear error\n- [ ] `--dry-run` prints the full plan, touches nothing\n- [ ] Never logs a secret value\n- [ ] `shellcheck` clean\n" - }, - { - "action": "remove_label", - "issue": 883, - "label": "blocked" - }, - { - "action": "add_label", - "issue": 883, - "label": "backlog" - }, - { - "action": "edit_body", - "issue": 884, - "body": "Part of the Nomad+Vault migration. **Step 2 — Vault policies + workload identity + secrets import.**\n\nS2.1 (#879) is now closed; this step has no blocking dependencies.\n\n## Goal\n\nExtend the Woodpecker CI to validate Vault policy HCL files under `vault/policies/` and role definitions.\n\n## Scope\n\nExtend `.woodpecker/nomad-validate.yml`:\n\n- `vault policy fmt -check vault/policies/*.hcl` — fails on unformatted HCL.\n- `for f in vault/policies/*.hcl; do vault policy validate \"$f\"; done` — syntax + semantic validation (requires a dev-mode vault spun inline).\n- If `vault/roles.yaml` exists: yamllint check + custom validator that each role references a policy file that actually exists in `vault/policies/`.\n- Secret-scan gate: ensure no policy file contains what looks like a literal secret.\n- Trigger: on any PR touching `vault/policies/`, `vault/roles.yaml`, or `lib/init/nomad/vault-*.sh`.\n\nAlso:\n- Add `vault/policies/AGENTS.md` cross-reference: policy lifecycle (add policy HCL → update roles.yaml → add Vault KV path), what CI enforces, common failure modes.\n\n## Non-goals\n\n- No runtime check against a real cluster.\n- No enforcement of specific naming conventions beyond what S2.1 docs describe.\n\n## Affected files\n- `.woodpecker/nomad-validate.yml` — add vault policy fmt + validate + roles.yaml gates\n- `vault/policies/AGENTS.md` (new) — policy lifecycle documentation\n\n## Acceptance criteria\n- [ ] Deliberately broken policy HCL (typo in `path` block) fails CI with the vault-fmt error\n- [ ] Policy that references a non-existent capability (e.g. 
`\"frobnicate\"`) fails validation\n- [ ] `vault/roles.yaml` referencing a policy not in `vault/policies/` fails CI\n- [ ] Clean PRs pass within normal pipeline time budget\n- [ ] Existing S0.5 + S1.4 CI gates unaffected\n- [ ] `shellcheck` clean on any shell added\n" - }, - { - "action": "remove_label", - "issue": 884, - "label": "blocked" - }, - { - "action": "add_label", - "issue": 884, - "label": "backlog" - }, - { - "action": "edit_body", - "issue": 846, - "body": "## Problem\n\nLlama-backed sidecar agents can be activated through two different mechanisms:\n\n1. **Legacy:** `ENABLE_LLAMA_AGENT=1` env flag toggles a hardcoded `agents-llama` service block in `docker-compose.yml`.\n2. **Modern:** `[agents.X]` TOML block consumed by `hire-an-agent`, emitting a service per block.\n\nNeither the docs nor the CLI explain which path wins. Setting both produces a YAML `mapping key \"agents-llama\" already defined` error from compose because the service block is duplicated.\n\n## Sub-symptom: env-var naming collision\n\nThe two paths key secrets differently:\n\n- Legacy: `FORGE_TOKEN_LLAMA`, `FORGE_PASS_LLAMA`.\n- Modern: `FORGE_TOKEN_` — e.g. `FORGE_TOKEN_DEV_QWEN`.\n\nA user migrating between paths ends up with two sets of secrets in `.env`, neither cleanly mapped to the currently-active service block. Silent auth failures (401 from Forgejo) follow.\n\n## Proposal\n\n- Pick the TOML `[agents.X]` path as canonical.\n- Remove the `ENABLE_LLAMA_AGENT` branch and its hardcoded service block from the generator.\n- Detection of `ENABLE_LLAMA_AGENT` in `.env` at `disinto up` time: hard-fail immediately with a migration message (option (a) — simpler, no external consumers depend on this flag).\n\n~~Dependencies: #845, #847~~ — both now closed; unblocked.\n\nRelated: #845, #847.\n\n## Affected files\n- `lib/generators.sh` — remove `ENABLE_LLAMA_AGENT` branch and hardcoded `agents-llama:` service block\n- `docker/agents/entrypoint.sh` — detect `ENABLE_LLAMA_AGENT` in env, emit migration error\n- `.env.example` — remove `ENABLE_LLAMA_AGENT`\n- `docs/agents-llama.md` — update to document TOML `[agents.X]` as the one canonical path\n\n## Acceptance criteria\n- [ ] One documented activation path: TOML `[agents.X]` block\n- [ ] `ENABLE_LLAMA_AGENT` removed from compose generator; presence in `.env` at startup triggers a clear migration error naming the replacement\n- [ ] `.env.example` and `docs/agents-llama.md` updated\n- [ ] `shellcheck` clean\n" - }, - { - "action": "remove_label", - "issue": 846, - "label": "blocked" - }, - { - "action": "add_label", - "issue": 846, - "label": "backlog" - }, - { - "action": "edit_body", - "issue": 850, - "body": "## Problem\n\nWhen the compose generator emits the same service name twice — e.g. both the legacy `ENABLE_LLAMA_AGENT=1` branch and a matching `[agents.llama]` TOML block produce an `agents-llama:` key — the failure is deferred all the way to `docker compose` YAML parsing:\n\n```\nfailed to parse /home/johba/disinto/docker-compose.yml: yaml: construct errors:\n line 4: line 431: mapping key \"agents-llama\" already defined at line 155\n```\n\nBy then, the user has already paid the cost of: pre-build binary downloads, generator run, Caddyfile regeneration. The only hint about what went wrong is a line number in a generated file. 
Root cause (dual activation) is not surfaced.\n\n## Fix\n\nAdd a generate-time guard to `lib/generators.sh`:\n\n- After collecting all service blocks to emit, compare the set of service names against duplicates.\n- If a duplicate is detected, abort with a clear message naming both source of truth (e.g. `\"agents-llama\" emitted twice — from ENABLE_LLAMA_AGENT=1 and from [agents.llama] in projects/disinto.toml; remove one`).\n\nEven after #846 resolves (one canonical activation path), this guard remains valuable as a safety net against future regressions or user misconfiguration (e.g. two TOML blocks with same `forge_user`).\n\n## Prior art: PR #872 (closed, branch `fix/issue-850` retained)\n\ndev-qwen's first attempt (`db009e3`) landed the dup-detection logic in `lib/generators.sh` correctly (unit test `tests/test-duplicate-service-detection.sh` passes all 3 cases), but the smoke test fails on CI.\n\n**Why the smoke test fails:** sections 1-7 of `smoke-init.sh` already run `bin/disinto init`, materializing `docker-compose.yml`. Section 8 re-invokes `bin/disinto init` to verify the dup guard fires — but `_generate_compose_impl` early-returns with `\"Compose: already exists, skipping\"` before reaching the dup-check.\n\n**Suggested fix:** in `tests/smoke-init.sh` section 8 (around line 452, before the second `bin/disinto init` invocation), add:\n\n```bash\nrm -f \"${FACTORY_ROOT}/docker-compose.yml\"\n```\n\nso the generator actually runs and the dup-detection path is exercised. Do **not** hoist the dup-check above the early-return.\n\nThe branch `fix/issue-850` is preserved as a starting point — pick up from `db009e3` and patch the smoke-test cleanup.\n\nRelated: #846.\n\n## Affected files\n- `lib/generators.sh` — duplicate service name check after collecting all service blocks\n- `tests/smoke-init.sh` — section 8: add `rm -f docker-compose.yml` before second `disinto init`\n- `tests/test-duplicate-service-detection.sh` (likely already correct from prior art)\n\n## Acceptance criteria\n- [ ] Running `disinto up` with a known duplicate activation produces a clear generator-time error naming both conflicting sources\n- [ ] Exit code non-zero before `docker compose` is invoked\n- [ ] Smoke test section 8 passes on CI (dup guard is actually exercised)\n- [ ] `shellcheck` clean\n" - }, - { - "action": "remove_label", - "issue": 850, - "label": "blocked" - }, - { - "action": "add_label", - "issue": 850, + "issue": 820, "label": "backlog" } ] diff --git a/lib/AGENTS.md b/lib/AGENTS.md index 6d37093..97e6f5e 100644 --- a/lib/AGENTS.md +++ b/lib/AGENTS.md @@ -1,4 +1,4 @@ - + # Shared Helpers (`lib/`) All agents source `lib/env.sh` as their first action. Additional helpers are diff --git a/nomad/AGENTS.md b/nomad/AGENTS.md index 0ce3cea..f57c30a 100644 --- a/nomad/AGENTS.md +++ b/nomad/AGENTS.md @@ -1,4 +1,4 @@ - + # nomad/ — Agent Instructions Nomad + Vault HCL for the factory's single-node cluster. These files are diff --git a/planner/AGENTS.md b/planner/AGENTS.md index b453bc9..7034b60 100644 --- a/planner/AGENTS.md +++ b/planner/AGENTS.md @@ -1,4 +1,4 @@ - + # Planner Agent **Role**: Strategic planning using a Prerequisite Tree (Theory of Constraints), diff --git a/predictor/AGENTS.md b/predictor/AGENTS.md index 360a3e9..cec03a1 100644 --- a/predictor/AGENTS.md +++ b/predictor/AGENTS.md @@ -1,4 +1,4 @@ - + # Predictor Agent **Role**: Abstract adversary (the "goblin"). 
Runs a 2-step formula diff --git a/review/AGENTS.md b/review/AGENTS.md index 223d656..4c06b34 100644 --- a/review/AGENTS.md +++ b/review/AGENTS.md @@ -1,4 +1,4 @@ - + # Review Agent **Role**: AI-powered PR review — post structured findings and formal diff --git a/supervisor/AGENTS.md b/supervisor/AGENTS.md index 75dd51f..736f78f 100644 --- a/supervisor/AGENTS.md +++ b/supervisor/AGENTS.md @@ -1,4 +1,4 @@ - + # Supervisor Agent **Role**: Health monitoring and auto-remediation, executed as a formula-driven diff --git a/vault/policies/AGENTS.md b/vault/policies/AGENTS.md index a1b85c2..692c885 100644 --- a/vault/policies/AGENTS.md +++ b/vault/policies/AGENTS.md @@ -1,4 +1,4 @@ - + # vault/policies/ — Agent Instructions HashiCorp Vault ACL policies for the disinto factory. One `.hcl` file per From 99d3cb4c8f8a47fab8a656a1944ff1f8889fc39a Mon Sep 17 00:00:00 2001 From: dev-qwen2 Date: Fri, 17 Apr 2026 01:18:03 +0000 Subject: [PATCH 18/93] fix: tech-debt: tools/vault-import.sh uses hardcoded secret/ KV mount (#910) --- tools/vault-import.sh | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/tools/vault-import.sh b/tools/vault-import.sh index bea4a07..f85dd16 100755 --- a/tools/vault-import.sh +++ b/tools/vault-import.sh @@ -151,9 +151,9 @@ _kv_put_secret() { -X POST \ -d "$payload" \ -o "$tmpfile" \ - "${VAULT_ADDR}/v1/kv/data/${path}")" || { + "${VAULT_ADDR}/v1/${VAULT_KV_MOUNT:-kv}/data/${path}")" || { rm -f "$tmpfile" - _err "Failed to write to Vault at kv/data/${path}: curl error" + _err "Failed to write to Vault at ${VAULT_KV_MOUNT:-kv}/data/${path}: curl error" return 1 } rm -f "$tmpfile" @@ -164,15 +164,15 @@ _kv_put_secret() { return 0 ;; 404) - _err "KV path not found: kv/data/${path}" + _err "KV path not found: ${VAULT_KV_MOUNT:-kv}/data/${path}" return 1 ;; 403) - _err "Permission denied writing to kv/data/${path}" + _err "Permission denied writing to ${VAULT_KV_MOUNT:-kv}/data/${path}" return 1 ;; *) - _err "Failed to write to Vault at kv/data/${path}: HTTP $http_code" + _err "Failed to write to Vault at ${VAULT_KV_MOUNT:-kv}/data/${path}: HTTP $http_code" return 1 ;; esac From f53c3690b8430c1d9c27d1cf120ae95311f7dc14 Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 17 Apr 2026 01:18:13 +0000 Subject: [PATCH 19/93] fix: tech-debt: edge service missing pull_policy: build in --build mode generator (#914) Co-Authored-By: Claude Opus 4.6 (1M context) --- lib/generators.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/generators.sh b/lib/generators.sh index 8f132bb..9ec8444 100644 --- a/lib/generators.sh +++ b/lib/generators.sh @@ -661,7 +661,7 @@ COMPOSEEOF if [ "$use_build" = true ]; then sed -i 's|^\( agents:\)|\1|' "$compose_file" sed -i '/^ image: ghcr\.io\/disinto\/agents:/{s|image: ghcr\.io/disinto/agents:.*|build:\n context: .\n dockerfile: docker/agents/Dockerfile\n pull_policy: build|}' "$compose_file" - sed -i '/^ image: ghcr\.io\/disinto\/edge:/{s|image: ghcr\.io/disinto/edge:.*|build: ./docker/edge|}' "$compose_file" + sed -i '/^ image: ghcr\.io\/disinto\/edge:/{s|image: ghcr\.io/disinto/edge:.*|build: ./docker/edge\n pull_policy: build|}' "$compose_file" fi echo "Created: ${compose_file}" From 04ead1fbdce8284af0642545b87435ace796677f Mon Sep 17 00:00:00 2001 From: Agent Date: Fri, 17 Apr 2026 01:22:59 +0000 Subject: [PATCH 20/93] fix: incident: WP gRPC flake burned dev-qwen CI retry budget on #842 (2026-04-16) (#867) --- formulas/run-supervisor.toml | 22 ++++- supervisor/AGENTS.md | 7 +- supervisor/preflight.sh | 105 
+++++++++++++++++++++++
 supervisor/supervisor-run.sh  | 156 +++++++++++++++++++++++++++++++++++
 4 files changed, 285 insertions(+), 3 deletions(-)

diff --git a/formulas/run-supervisor.toml b/formulas/run-supervisor.toml
index f31e6bc..e623187 100644
--- a/formulas/run-supervisor.toml
+++ b/formulas/run-supervisor.toml
@@ -29,7 +29,7 @@ and injected into your prompt above. Review them now.

 1. Read the injected metrics data carefully (System Resources, Docker,
    Active Sessions, Phase Files, Stale Phase Cleanup, Lock Files, Agent Logs,
-   CI Pipelines, Open PRs, Issue Status, Stale Worktrees).
+   CI Pipelines, Open PRs, Issue Status, Stale Worktrees, **Woodpecker Agent Health**).
    Note: preflight.sh auto-removes PHASE:escalate files for closed issues
    (24h grace period). Check the "Stale Phase Cleanup" section for any files
    cleaned or in grace period this run.
@@ -75,6 +75,10 @@ Categorize every finding from the metrics into priority levels.
 - Dev/action sessions in PHASE:escalate for > 24h (session timeout)
   (Note: PHASE:escalate files for closed issues are auto-cleaned by
   preflight; this check covers sessions where the issue is still open)
+- **Woodpecker agent unhealthy** — see "Woodpecker Agent Health" section in preflight:
+  - Container not running or in unhealthy state
+  - gRPC errors >= 3 in last 20 minutes
+  - Fast-failure pipelines (duration < 60s) >= 3 in last 15 minutes

 ### P3 — Factory degraded
 - PRs stale: CI finished >20min ago AND no git push to the PR branch since CI completed
@@ -100,6 +104,15 @@ For each finding from the health assessment, decide and execute an action.

 ### Auto-fixable (execute these directly)

+**P2 Woodpecker agent unhealthy:**
+The supervisor-run.sh script automatically handles WP agent recovery:
+- Detects unhealthy state via preflight.sh health checks
+- Restarts container via `docker restart`
+- Scans for `blocked: ci_exhausted` issues updated in last 30 minutes
+- Unassigns and removes blocked label from affected issues
+- Posts recovery comment with infra-flake context
+- Avoids duplicate restarts via 5-minute cooldown in history file
+
 **P0 Memory crisis:**

     # Kill stale one-shot claude processes (>3h old)
     pgrep -f "claude -p" --older 10800 2>/dev/null | xargs kill 2>/dev/null || true
@@ -248,6 +263,11 @@ Format:

 - (or "No actions needed")

+  ### WP Agent Recovery (if applicable)
+  - WP agent restart: