`sudo -n "VAULT_ADDR=$vault_addr" -- "$seed_script"` passed
VAULT_ADDR as a sudoers env-assignment argument. With the default
`env_reset=on` policy (almost all distros), sudo silently discards
env assignments unless the variable is in `env_keep` — and
VAULT_ADDR is not. The seeder then hit its own precondition check
at vault-seed-forgejo.sh:109 and died with "VAULT_ADDR unset",
breaking the fresh-LXC non-root acceptance path the PR was written
to close.
Fix: run `env` as the command under sudo — `sudo -n -- env
"VAULT_ADDR=$vault_addr" "$seed_script"` — so VAULT_ADDR is set in
the child process directly, unaffected by sudoers env handling.
The root (non-sudo) branch already used shell-level env assignment
and was correct.
Adds a grep-level regression guard that pins the `env VAR=val`
invocation and negative-asserts the unsafe bare-argument form.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
`tools/vault-seed-forgejo.sh` existed and worked, but `bin/disinto init
--backend=nomad --with forgejo` never invoked it, so a fresh LXC with an
empty Vault hit `Template Missing: vault.read(kv/data/disinto/shared/
forgejo)` and the forgejo alloc timed out inside deploy.sh's 240s
healthy_deadline — operator had to run the seeder + `nomad alloc
restart` by hand to recover.
In `_disinto_init_nomad`, after `vault-import.sh` (or its skip branch)
and before `deploy.sh`, iterate `--with <svc>` and auto-invoke
`tools/vault-seed-<svc>.sh` when the file exists + is executable.
Services without a seeder are silently skipped — Step 3+ services
(woodpecker, chat, etc.) can ship their own seeder without touching
`bin/disinto`. VAULT_ADDR is passed explicitly because cluster-up.sh
writes the profile.d export during this same init run (current shell
hasn't sourced it yet) and `vault-seed-forgejo.sh` — unlike its
sibling vault-* scripts — requires the caller to set VAULT_ADDR
instead of defaulting it via `_hvault_default_env`. Mirror the loop in
the --dry-run plan so the operator-visible plan matches the real run.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The S2 Nomad+Vault migration switched the KV v2 mount from `secret/` to
`kv/` in policies, roles, templates, and lib/hvault.sh. tools/vault-import.sh
was missed — its curl URL and 4 error messages still hardcoded `secret/data/`,
so `disinto init --backend=nomad --with forgejo` hit 404 from vault on the
first write (issue body reproduces it with the gardener bot path).
Five call sites in _kv_put_secret flipped to `kv/data/`: the POST URL (L154)
and the curl-error / 404 / 403 / non-2xx branches (L156, L167, L171, L175).
The read helper is hvault_kv_get from lib/hvault.sh, which already resolves
through VAULT_KV_MOUNT (default `kv`), so no change needed there.
tests/vault-import.bats also updated: dev-mode vault only auto-mounts kv-v2
at secret/, so the test harness now enables a parallel kv-v2 mount at path=kv
during setup_file to mirror the production cluster layout. Test-side URLs
that assert round-trip reads all follow the same secret/ → kv/ rename.
shellcheck clean.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Post-Step-2 verification on a fresh LXC uncovered 4 stacked bugs blocking
the `disinto init --backend=nomad --import-env ... --with forgejo` hero
command. Root cause is #1; #2-#4 surface as the operator walks past each.
1. kv/ secret engine never enabled — every policy, role, import write,
and template read references kv/disinto/* and 403s without the mount.
Adds lib/init/nomad/vault-engines.sh (idempotent POST sys/mounts/kv)
wired into `_disinto_init_nomad` before vault-apply-policies.sh.
2. VAULT_ADDR/VAULT_TOKEN not exported in the init process. Extracts the
5-line default-and-resolve block into `_hvault_default_env` in
lib/hvault.sh and sources it from vault-engines.sh, vault-nomad-auth.sh,
vault-apply-policies.sh, vault-apply-roles.sh, and vault-import.sh. One
definition, zero copies — avoids the 5-line sliding-window duplicate
gate that failed PRs #917/#918.
3. vault-import.sh required --sops; spec (#880) says --env alone must
succeed. Flag validation now: --sops requires --age-key, --age-key
requires --sops, --env alone imports only the plaintext half.
4. forgejo.hcl template blocks forever when kv/disinto/shared/forgejo is
absent or missing a key. Adds `error_on_missing_key = false` so the
existing `with ... else ...` fallback emits placeholders instead of
hanging on template-pending.
vault-engines.sh parser uses a while/shift shape distinct from
vault-apply-policies.sh (flat case) and vault-apply-roles.sh (if/elif
ladder) so the three sibling flag parsers hash differently under the
repo-wide duplicate detector.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the `|`-delimited string accumulators with bash associative and
indexed arrays so any byte may appear in a secret value.
Two sites used `|` as a delimiter over data that includes user secrets:
1. ops_data["path:key"]="value|status" — extraction via `${data%%|*}`
truncated values at the first `|` (silently corrupting writes).
2. paths_to_write["path"]="k1=v1|k2=v2|..." — split back via
`IFS='|' read -ra` at write time, so a value containing `|` was
shattered across kv pairs (silently misrouting writes).
Fix:
- Split ops_data into two assoc arrays (`ops_value`, `ops_status`) keyed
on "vault_path:vault_key" — value and status are stored independently
with no in-band delimiter. (`:` is safe because both vault_path and
vault_key are identifier-safe.)
- Track distinct paths in `path_seen` and, for each path, collect its
kv pairs into a fresh indexed `pairs_array` by filtering ops_value.
`_kv_put_secret` already splits each entry on the first `=` only, so
`=` and `|` inside values are both preserved.
Added a bats regression that imports values like `abc|xyz`, `p1|p2|p3`,
and `admin|with|pipes` and asserts they round-trip through Vault
unmodified. Values are single-quoted in the .env so they survive
`source` — the accumulator is what this test exercises.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Changes:
- Add VAULT_KV_MOUNT env var (default: kv) to make KV mount configurable
- Update hvault_kv_get to use ${VAULT_KV_MOUNT}/data/${path}
- Update hvault_kv_put to use ${VAULT_KV_MOUNT}/data/${path}
- Update hvault_kv_list to use ${VAULT_KV_MOUNT}/metadata/${path}
- Update tests to use kv/ paths instead of secret/
This ensures agents can read/write secrets using the same mount point
that the Nomad+Vault migration policies grant ACL for.
Addresses review #907 blocker: docs/nomad-migration.md claimed
--empty "skips policies/auth/import/deploy" but _disinto_init_nomad
had no $empty gate around those blocks — operators reaching the
"cluster-only escape hatch" would still invoke vault-apply-policies.sh
and vault-nomad-auth.sh, contradicting the runbook.
Changes:
- _disinto_init_nomad: exit 0 immediately after cluster-up when
--empty is set, in both dry-run and real-run branches. Only the
cluster-up plan appears; no policies, no auth, no import, no
deploy. Matches the docs.
- disinto_init: reject --empty combined with any --import-* flag.
--empty discards the import step, so the combination silently
does nothing (worse failure mode than a clear error up front).
Symmetric to the existing --empty vs --with check.
- Pre-flight existence check for policies/auth scripts now runs
unconditionally on the non-empty path (previously gated on
--import-*), matching the unconditional invocation. Import-script
check stays gated on --import-*.
Non-blocking observation also addressed: the pre-flight guard
comment + actual predicate were inconsistent ("unconditionally
invoke policies+auth" but only checked on import). Now the
predicate matches: [ "$empty" != "true" ] gates policies/auth,
and an inner --import-* guard gates the import script.
Tests (+3):
- --empty --dry-run shows no S2.x sections (negative assertions)
- --empty --import-env rejected
- --empty --import-sops --age-key rejected
30/30 nomad tests pass; shellcheck clean.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wire the Step-2 building blocks (import, auth, policies) into
`disinto init --backend=nomad` so a single command on a fresh LXC
provisions cluster + policies + auth + imports secrets + deploys
services.
Adds three flags to `disinto init --backend=nomad`:
--import-env PATH plaintext .env from old stack
--import-sops PATH sops-encrypted .env.vault.enc (requires --age-key)
--age-key PATH age keyfile to decrypt --import-sops
Flow: cluster-up.sh → vault-apply-policies.sh → vault-nomad-auth.sh →
(optional) vault-import.sh → deploy.sh. Policies + auth run on every
nomad real-run path (idempotent); import runs only when --import-* is
set; all layers safe to re-run.
Flag validation:
--import-sops without --age-key → error
--age-key without --import-sops → error
--import-env alone (no sops) → OK
--backend=docker + any --import-* → error
Dry-run prints a five-section plan (cluster-up + policies + auth +
import + deploy) with every argv that would be executed; touches
nothing, logs no secret values.
Dry-run output prints one line per --import-* flag that is actually
set — not in an if/elif chain — so all three paths appear when all
three flags are passed. Prior attempts regressed this invariant.
Tests:
tests/disinto-init-nomad.bats +10 cases covering flag validation,
dry-run plan shape (each flag prints its own path), policies+auth
always-on (without --import-*), and --flag=value form.
Docs: docs/nomad-migration.md new file — cutover-day runbook with
invocation shape, flag summary, idempotency contract, dry-run, and
secret-hygiene notes.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- AGENTS.md: Replace agents-llama and agents-llama-all rows with generic
'Local-model agents' entry pointing to docs/agents-llama.md
- formulas/release.sh: Remove agents-llama from docker compose stop/up
commands (line 181-182)
- formulas/release.toml: Remove agents-llama references from restart-agents
step description (lines 192, 195, 206)
These changes complete the removal of the legacy ENABLE_LLAMA_AGENT activation
path. The release formula now only references the 'agents' service, which is
the only service that exists after disinto init regenerates docker-compose.yml
based on [agents.X] TOML sections.
`compgen -G ... | wc -l` under `set -eo pipefail` aborts the script on
the non-zero pipeline exit (compgen returns 1 on no match) before the
FATAL diagnostic branch can run. The container still fast-fails, but
operators saw no explanation.
Switch to the conditional `if ! compgen -G ... >/dev/null 2>&1; then`
pattern already used at the two other compgen call sites in this file
(bootstrap_factory_repo and the PROJECT_NAME parser). The count for the
success-path log is computed after we've confirmed at least one match.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extend .woodpecker/nomad-validate.yml with three new fail-closed steps
that guard every artifact under vault/policies/ and vault/roles.yaml
before it can land:
4. vault-policy-fmt — cp+fmt+diff idempotence check (vault 1.18.5
has no `policy fmt -check` flag, so we
build the non-destructive check out of
`vault policy fmt` on a /tmp copy + diff
against the original)
5. vault-policy-validate — HCL syntax + capability validation via
`vault policy write` against an inline
dev-mode Vault server (no offline
`policy validate` subcommand exists;
dev-mode writes are ephemeral so this is
a validator, not a deploy)
6. vault-roles-validate — yamllint + PyYAML-based role→policy
reference check (every role's `policy:`
field must match a vault/policies/*.hcl
basename; also checks the four required
fields name/policy/namespace/job_id)
Secret-scan coverage for vault/policies/*.hcl is already provided by
the P11 gate (.woodpecker/secret-scan.yml) via its `vault/**/*` trigger
path — this pipeline intentionally does NOT duplicate that gate to
avoid the inline-heredoc / YAML-parse failure mode that sank the prior
attempt at this issue (PR #896).
Trigger paths extended: `vault/policies/**` and `vault/roles.yaml`.
`lib/init/nomad/vault-*.sh` is already covered by the existing
`lib/init/nomad/**` glob.
Docs: nomad/AGENTS.md and vault/policies/AGENTS.md updated with the
policy lifecycle, the CI enforcement table, and the common failure
modes authors will see.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The lib/secret-scan.sh `(SECRET|TOKEN|...)=<16+ non-space chars>`
rule flagged the long `INTERNAL_TOKEN=VAULT-EMPTY-run-tools-vault-
seed-forgejo-sh` placeholder as a plaintext secret, failing CI's
secret-scan workflow on every PR that touched nomad/jobs/forgejo.hcl.
Shorten both placeholders to `seed-me` (<16 chars) — still visible in
a `grep FORGEJO__security__` audit, still obviously broken. The
operator-facing fix pointer moves to the `# WARNING` comment line in
the rendered env and to a new block comment above the template stanza.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Upgrade nomad/jobs/forgejo.hcl to read SECRET_KEY + INTERNAL_TOKEN from
Vault via a template stanza using the service-forgejo role (S2.3).
Non-secret config (DB, ports, ROOT_URL, registration lockdown) stays
inline. An empty-Vault fallback (`with ... else ...`) renders visible
placeholder env vars so a fresh LXC still brings forgejo up — the
operator sees the warning instead of forgejo silently regenerating
SECRET_KEY on every restart.
Add tools/vault-seed-forgejo.sh — idempotent seeder that ensures the
kv/ mount is KV v2 and populates kv/data/disinto/shared/forgejo with
random secret_key (32B hex) + internal_token (64B hex) on a clean
install. Existing non-empty values are left untouched; partial paths
are filled in atomically. Parser shape is positional-arity case
dispatch to stay structurally distinct from the two sibling vault-*.sh
tools and avoid the 5-line sliding-window dup detector.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Review feedback from PR #895 round 1:
- lib/AGENTS.md (hvault.sh row): add hvault_get_or_empty(PATH) to the
public-function list; replace the "not sourced at runtime yet" note
with the three actual callers (vault-apply-policies.sh,
vault-apply-roles.sh, vault-nomad-auth.sh).
- lib/AGENTS.md (lib/init/nomad/ row): add a one-line description of
vault-nomad-auth.sh (Step 2, this PR); relabel the row header from
"Step 0 installer scripts" to "installer scripts" since it now spans
Step 0 + Step 2.
- lib/init/nomad/vault-nomad-auth.sh: drop the `vault` CLI from the
binary precondition check — hvault.sh's helpers are all curl-based,
so the CLI is never invoked. The precondition would spuriously die on
a Nomad-client-only node that has Vault server reachable but no
`vault` binary installed. Inline comment preserves the rationale.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wires Nomad → Vault via workload identity so jobs can exchange their
short-lived JWT for a Vault token carrying the policies in
vault/policies/ — no shared VAULT_TOKEN in job env.
- `lib/init/nomad/vault-nomad-auth.sh` — idempotent script: enable jwt
auth at path `jwt-nomad`, config JWKS/algs, apply roles, install
server.hcl + SIGHUP nomad on change.
- `tools/vault-apply-roles.sh` — companion sync script (S2.1 sibling);
reads vault/roles.yaml and upserts each Vault role under
auth/jwt-nomad/role/<name> with created/updated/unchanged semantics.
- `vault/roles.yaml` — declarative role→policy→bound_claims map; one
entry per vault/policies/*.hcl. Keeps S2.1 policies and S2.3 role
bindings visible side-by-side at review time.
- `nomad/server.hcl` — adds vault stanza (enabled, address,
default_identity.aud=["vault.io"], ttl=1h).
- `lib/hvault.sh` — new `hvault_get_or_empty` helper shared between
vault-apply-policies.sh, vault-apply-roles.sh, and vault-nomad-auth.sh;
reads a Vault endpoint and distinguishes 200 / 404 / other.
- `vault/policies/AGENTS.md` — extends S2.1 docs with JWT-auth role
naming convention, token shape, and the "add new service" flow.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add `pull_policy: build` to every agent service emitted by the generator
that shares `docker/agents/Dockerfile` as its build context. Without it,
`docker compose up -d --force-recreate agents-<name>` reuses the cached
`disinto/agents:latest` image and silently keeps running stale
`docker/agents/entrypoint.sh` code even after the repo is updated. This
masked PR #864 (and likely earlier merges) — the fix landed on disk but
never reached the container.
#853 already paired `build:` with `image:` on hired-agent stanzas, which
was enough for first-time ups but not for re-ups. `pull_policy: build`
tells Compose to rebuild the image on every up; BuildKit's layer cache
makes the no-change case near-instant, and the change case picks up the
new source automatically. This covers:
- TOML-driven `agents-<name>` hired via `disinto hire-an-agent` — primary
target of the issue.
- Legacy `agents-llama` and `agents-llama-all` stanzas — same Dockerfile,
same staleness problem.
`bin/disinto up` already passed `--build`, so operators on the supported
UX path were already covered; this closes the gap for the direct
`docker compose` path the issue explicitly names in its acceptance.
Regression test added to `tests/lib-generators.bats` to pin the directive
alongside the existing #853 build/image invariants.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CI's duplicate-detection step (sliding 5-line window) flagged 4 new
duplicate blocks shared with lib/init/nomad/cluster-up.sh — both used
the same `dry_run=false; while [ $# -gt 0 ]; do case "$1" in --dry-run)
... -h|--help) ... *) die "unknown flag: $1" ;; esac done` shape.
vault-apply-policies.sh has exactly one optional flag, so a flat
single-arg case with an `'')` no-op branch is shorter and structurally
distinct from the multi-flag while-loop parsers elsewhere in the repo.
The --help text now uses printf instead of a heredoc, which avoids the
EOF/exit/;;/die anchor that was the other half of the duplicate window.
DIFF_BASE=main .woodpecker/detect-duplicates.py now reports 0 new
duplicate blocks. Behavior unchanged: --dry-run, --help, --bogus, and
no-arg invocations all verified locally.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>