`sudo -n "VAULT_ADDR=$vault_addr" -- "$seed_script"` passed
VAULT_ADDR as a sudoers env-assignment argument. With the default
`env_reset=on` policy (almost all distros), sudo silently discards
env assignments unless the variable is in `env_keep` — and
VAULT_ADDR is not. The seeder then hit its own precondition check
at vault-seed-forgejo.sh:109 and died with "VAULT_ADDR unset",
breaking the fresh-LXC non-root acceptance path the PR was written
to close.
Fix: run `env` as the command under sudo — `sudo -n -- env
"VAULT_ADDR=$vault_addr" "$seed_script"` — so VAULT_ADDR is set in
the child process directly, unaffected by sudoers env handling.
The root (non-sudo) branch already used shell-level env assignment
and was correct.
Adds a grep-level regression guard that pins the `env VAR=val`
invocation and negative-asserts the unsafe bare-argument form.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
`tools/vault-seed-forgejo.sh` existed and worked, but `bin/disinto init
--backend=nomad --with forgejo` never invoked it, so a fresh LXC with an
empty Vault hit `Template Missing: vault.read(kv/data/disinto/shared/
forgejo)` and the forgejo alloc timed out inside deploy.sh's 240s
healthy_deadline — operator had to run the seeder + `nomad alloc
restart` by hand to recover.
In `_disinto_init_nomad`, after `vault-import.sh` (or its skip branch)
and before `deploy.sh`, iterate `--with <svc>` and auto-invoke
`tools/vault-seed-<svc>.sh` when the file exists + is executable.
Services without a seeder are silently skipped — Step 3+ services
(woodpecker, chat, etc.) can ship their own seeder without touching
`bin/disinto`. VAULT_ADDR is passed explicitly because cluster-up.sh
writes the profile.d export during this same init run (current shell
hasn't sourced it yet) and `vault-seed-forgejo.sh` — unlike its
sibling vault-* scripts — requires the caller to set VAULT_ADDR
instead of defaulting it via `_hvault_default_env`. Mirror the loop in
the --dry-run plan so the operator-visible plan matches the real run.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The S2 Nomad+Vault migration switched the KV v2 mount from `secret/` to
`kv/` in policies, roles, templates, and lib/hvault.sh. tools/vault-import.sh
was missed — its curl URL and 4 error messages still hardcoded `secret/data/`,
so `disinto init --backend=nomad --with forgejo` hit 404 from vault on the
first write (issue body reproduces it with the gardener bot path).
Five call sites in _kv_put_secret flipped to `kv/data/`: the POST URL (L154)
and the curl-error / 404 / 403 / non-2xx branches (L156, L167, L171, L175).
The read helper is hvault_kv_get from lib/hvault.sh, which already resolves
through VAULT_KV_MOUNT (default `kv`), so no change needed there.
tests/vault-import.bats also updated: dev-mode vault only auto-mounts kv-v2
at secret/, so the test harness now enables a parallel kv-v2 mount at path=kv
during setup_file to mirror the production cluster layout. Test-side URLs
that assert round-trip reads all follow the same secret/ → kv/ rename.
shellcheck clean.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the `|`-delimited string accumulators with bash associative and
indexed arrays so any byte may appear in a secret value.
Two sites used `|` as a delimiter over data that includes user secrets:
1. ops_data["path:key"]="value|status" — extraction via `${data%%|*}`
truncated values at the first `|` (silently corrupting writes).
2. paths_to_write["path"]="k1=v1|k2=v2|..." — split back via
`IFS='|' read -ra` at write time, so a value containing `|` was
shattered across kv pairs (silently misrouting writes).
Fix:
- Split ops_data into two assoc arrays (`ops_value`, `ops_status`) keyed
on "vault_path:vault_key" — value and status are stored independently
with no in-band delimiter. (`:` is safe because both vault_path and
vault_key are identifier-safe.)
- Track distinct paths in `path_seen` and, for each path, collect its
kv pairs into a fresh indexed `pairs_array` by filtering ops_value.
`_kv_put_secret` already splits each entry on the first `=` only, so
`=` and `|` inside values are both preserved.
Added a bats regression that imports values like `abc|xyz`, `p1|p2|p3`,
and `admin|with|pipes` and asserts they round-trip through Vault
unmodified. Values are single-quoted in the .env so they survive
`source` — the accumulator is what this test exercises.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Changes:
- Add VAULT_KV_MOUNT env var (default: kv) to make KV mount configurable
- Update hvault_kv_get to use ${VAULT_KV_MOUNT}/data/${path}
- Update hvault_kv_put to use ${VAULT_KV_MOUNT}/data/${path}
- Update hvault_kv_list to use ${VAULT_KV_MOUNT}/metadata/${path}
- Update tests to use kv/ paths instead of secret/
This ensures agents can read/write secrets using the same mount point
that the Nomad+Vault migration policies grant ACL for.
Addresses review #907 blocker: docs/nomad-migration.md claimed
--empty "skips policies/auth/import/deploy" but _disinto_init_nomad
had no $empty gate around those blocks — operators reaching the
"cluster-only escape hatch" would still invoke vault-apply-policies.sh
and vault-nomad-auth.sh, contradicting the runbook.
Changes:
- _disinto_init_nomad: exit 0 immediately after cluster-up when
--empty is set, in both dry-run and real-run branches. Only the
cluster-up plan appears; no policies, no auth, no import, no
deploy. Matches the docs.
- disinto_init: reject --empty combined with any --import-* flag.
--empty discards the import step, so the combination silently
does nothing (worse failure mode than a clear error up front).
Symmetric to the existing --empty vs --with check.
- Pre-flight existence check for policies/auth scripts now runs
unconditionally on the non-empty path (previously gated on
--import-*), matching the unconditional invocation. Import-script
check stays gated on --import-*.
Non-blocking observation also addressed: the pre-flight guard
comment + actual predicate were inconsistent ("unconditionally
invoke policies+auth" but only checked on import). Now the
predicate matches: [ "$empty" != "true" ] gates policies/auth,
and an inner --import-* guard gates the import script.
Tests (+3):
- --empty --dry-run shows no S2.x sections (negative assertions)
- --empty --import-env rejected
- --empty --import-sops --age-key rejected
30/30 nomad tests pass; shellcheck clean.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wire the Step-2 building blocks (import, auth, policies) into
`disinto init --backend=nomad` so a single command on a fresh LXC
provisions cluster + policies + auth + imports secrets + deploys
services.
Adds three flags to `disinto init --backend=nomad`:
--import-env PATH plaintext .env from old stack
--import-sops PATH sops-encrypted .env.vault.enc (requires --age-key)
--age-key PATH age keyfile to decrypt --import-sops
Flow: cluster-up.sh → vault-apply-policies.sh → vault-nomad-auth.sh →
(optional) vault-import.sh → deploy.sh. Policies + auth run on every
nomad real-run path (idempotent); import runs only when --import-* is
set; all layers safe to re-run.
Flag validation:
--import-sops without --age-key → error
--age-key without --import-sops → error
--import-env alone (no sops) → OK
--backend=docker + any --import-* → error
Dry-run prints a five-section plan (cluster-up + policies + auth +
import + deploy) with every argv that would be executed; touches
nothing, logs no secret values.
Dry-run output prints one line per --import-* flag that is actually
set — not in an if/elif chain — so all three paths appear when all
three flags are passed. Prior attempts regressed this invariant.
Tests:
tests/disinto-init-nomad.bats +10 cases covering flag validation,
dry-run plan shape (each flag prints its own path), policies+auth
always-on (without --import-*), and --flag=value form.
Docs: docs/nomad-migration.md new file — cutover-day runbook with
invocation shape, flag summary, idempotency contract, dry-run, and
secret-hygiene notes.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add `pull_policy: build` to every agent service emitted by the generator
that shares `docker/agents/Dockerfile` as its build context. Without it,
`docker compose up -d --force-recreate agents-<name>` reuses the cached
`disinto/agents:latest` image and silently keeps running stale
`docker/agents/entrypoint.sh` code even after the repo is updated. This
masked PR #864 (and likely earlier merges) — the fix landed on disk but
never reached the container.
#853 already paired `build:` with `image:` on hired-agent stanzas, which
was enough for first-time ups but not for re-ups. `pull_policy: build`
tells Compose to rebuild the image on every up; BuildKit's layer cache
makes the no-change case near-instant, and the change case picks up the
new source automatically. This covers:
- TOML-driven `agents-<name>` hired via `disinto hire-an-agent` — primary
target of the issue.
- Legacy `agents-llama` and `agents-llama-all` stanzas — same Dockerfile,
same staleness problem.
`bin/disinto up` already passed `--build`, so operators on the supported
UX path were already covered; this closes the gap for the direct
`docker compose` path the issue explicitly names in its acceptance.
Regression test added to `tests/lib-generators.bats` to pin the directive
alongside the existing #853 build/image invariants.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Nomad's docker task driver reports Healthy=false without a running
dockerd. On the factory dev box docker was pre-installed so Step 0's
cluster-up passed silently, but a fresh ubuntu:24.04 LXC hit "missing
drivers" placement failures the moment Step 1 tried to deploy forgejo
(the first docker-driver consumer).
Fix install.sh to also install docker.io + enable --now docker.service
when absent, and add a poll for the nomad self-node's docker driver
Detected+Healthy before declaring Step 8 done — otherwise the race
between dockerd startup and nomad driver fingerprinting lets the node
reach "ready" while docker is still unhealthy.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The TOML-driven hired-agent services (`_generate_local_model_services` in
`lib/generators.sh`) were emitting `image: ghcr.io/disinto/agents:<tag>`
for every hired agent. The ghcr image is not publicly pullable and
deployments don't carry ghcr credentials, so `docker compose up` failed
with `denied` on every new hire. The legacy `agents-llama` stanza dodged
this because it uses the registry-less local name plus a `build:` fallback.
Fix: match the legacy stanza — emit `build: { context: ., dockerfile:
docker/agents/Dockerfile }` paired with `image: disinto/agents:<tag>`.
Hosts that built locally with `disinto init --build` will find the image;
hosts without one will build it. No ghcr auth required either way.
Added a regression test that guards both the absence of the ghcr prefix
and the presence of the build directive.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Acceptance items 1-4 landed previously: the primary compose emission
(FORGE_BOT_USER_*) was fixed in #849 by re-keying on forge_user via
`tr 'a-z-' 'A-Z_'`, and the load-project.sh AGENT_* Python emitter was
normalized via `.upper().replace('-', '_')` in #862. Together they
produce `FORGE_BOT_USER_DEV_QWEN2` and `AGENT_DEV_QWEN2_BASE_URL` for
`[agents.dev-qwen2]` with `forge_user = "dev-qwen2"`.
This patch closes acceptance item 5 — the defence-in-depth warn-and-skip
in load-project.sh's two export loops. Hire-agent's up-front reject is
the primary line of defence (a validated `^[a-z]([a-z0-9]|-[a-z0-9])*$`
agent name can't produce a bad identifier), but a hand-edited TOML can
still smuggle invalid keys through:
- `[mirrors] my-mirror = "…"` — the `MIRROR_<NAME>` emitter only
upper-cases, so `MY-MIRROR` retains its dash and fails `export`.
- `[agents."weird name"]` — quoted TOML keys bypass the bare-key
grammar entirely, so spaces and other disallowed shell chars reach
the export loop unchanged.
Before this change, either case would abort load-project.sh under
`set -euo pipefail` — the exact failure mode the original #852
crash-loop was diagnosed from. Now each loop validates `$_key` against
`^[A-Za-z_][A-Za-z0-9_]*$` and warn-skips offenders so siblings still
load.
- `lib/load-project.sh` — regex guard + WARNING on stderr in both
`_PROJECT_VARS` and `_AGENT_VARS` export loops.
- `tests/lib-load-project.bats` — two regressions: dashed mirror key,
quoted agent section with space. Both assert (a) the load does not
abort and (b) sane siblings still load.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Key `FORGE_BOT_USER_*` on `$user_upper` (forge_user normalized with
`tr 'a-z-' 'A-Z_'`) instead of `${service_name^^}`, matching the
`FORGE_TOKEN_<FORGE_USER>` / `FORGE_PASS_<FORGE_USER>` convention two
lines above in the same emitted block. For `[agents.llama]` with
`forge_user = "dev-qwen"` this emits `FORGE_BOT_USER_DEV_QWEN: "dev-qwen"`
instead of the legacy `FORGE_BOT_USER_LLAMA`.
No external consumers read `FORGE_BOT_USER_*` today (verified via grep),
so no fallback/deprecation shim is needed — this is purely a one-site
fix at the sole producer.
Adds `tests/lib-generators.bats` as a regression guard. Follows the
existing `tests/lib-*.bats` pattern (developer-run, not CI-wired).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TOML allows dashes in bare keys, so `[agents.dev-qwen2]` is a valid
section. Before this fix, load-project.sh derived bash var names via
Python `.upper()` alone, which kept the dash and produced
`AGENT_DEV-QWEN2_BASE_URL` — an invalid shell identifier. Under
`set -euo pipefail` the subsequent `export` aborted the whole file,
silently taking the factory down on the N+1 run after a dashed agent
was hired via `disinto hire-an-agent`.
Normalize via `.upper().replace('-', '_')` to match the
`tr 'a-z-' 'A-Z_'` convention already used by hire-agent.sh (#834)
and generators.sh (#852). Also harden hire-agent.sh to reject invalid
agent names at hire time (before any Forgejo side effects), so
unparseable TOML sections never land on disk.
- `lib/load-project.sh` — dash-to-underscore in emitted shell var names
- `lib/hire-agent.sh` — validate agent name against
`^[a-z]([a-z0-9]|-[a-z0-9])*$` up front
- `tests/lib-load-project.bats` — regression guard covering the parse
path and the hire-time reject path
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
hire-an-agent now adds the new Forgejo user as a `write` collaborator on
`$FORGE_REPO` right after the token step, mirroring the collaborator setup
lib/forge-setup.sh applies to the canonical bot users. Without this, a
freshly hired agent's PATCH to assign itself an issue returned 403 Forbidden
and the dev-agent polled forever logging "claim lost to <none>".
issue_claim() now captures the PATCH HTTP status via `-w '%{http_code}'`
instead of swallowing failures with `curl -sf ... || return 1`. A 403 (or
any non-2xx) now surfaces a distinct log line naming the code — the missing
collaborator root cause would have been diagnosable in seconds instead of
minutes.
Also updates the lib-issue-claim bats mock to handle the new `-w` flag and
adds a regression test covering the HTTP-error log surfacing path.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Why: disinto_init() consumed $1 as repo_url before the argparse loop ran,
so `disinto init --backend=nomad --empty` had --backend=nomad swallowed
into repo_url, backend stayed at its "docker" default, and the --empty
validation then produced the nonsense "--empty is only valid with
--backend=nomad" error — flagged during S0.1 end-to-end verification on
a fresh LXC. nomad backend takes no positional anyway; the LXC already
has the repo cloned by the operator.
Change: only consume $1 as repo_url if it doesn't start with "--", then
defer the "repo URL required" check to after argparse (so the docker
path still errors with a helpful message on a missing positional, not
"Unknown option: --backend=docker").
Verified acceptance criteria:
1. init --backend=nomad --empty → dispatches to nomad
2. init --backend=nomad --empty --dry-run → 9-step plan, exit 0
3. init <repo-url> → docker path unchanged
4. init → "repo URL required"
5. init --backend=docker → "repo URL required"
(not "Unknown option")
6. shellcheck clean
Tests: 4 new regression cases in tests/disinto-init-nomad.bats covering
flag-first nomad invocation (both --flag=value and --flag value forms),
no-args docker default, and --backend=docker missing-positional error
path. Full suite: 10/10 pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Forgejo's assignees PATCH is last-write-wins, so two dev agents polling
concurrently could both observe .assignee == null at the pre-check, both
PATCH, and the loser would silently "succeed" and proceed to implement
the same issue — colliding at the PR/branch stage.
Re-read the assignee after the PATCH and bail out if it isn't self.
Label writes are moved AFTER this verification so a losing claim leaves
no stray in-progress label to roll back.
Adds tests/lib-issue-claim.bats covering the three paths:
- happy path (single agent, re-read confirms self)
- lost race (re-read shows another agent — returns 1, no labels added)
- pre-check skip (initial GET already shows another agent)
Prerequisite for the LLAMA_BOTS parametric refactor that will run N
dev containers against the same project.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The bin/disinto flag loop has separate cases for `--backend value`
(space-separated) and `--backend=value`; a regression in either would
silently route to the docker default path. Per the "stub-first dispatch"
lesson, silent misrouting during a migration is the worst failure mode —
covering both forms closes that gap.
Also triggers a retry of the smoke-init pipeline step, which hit a known
Forgejo branch-indexing flake on pipeline #913 (same flake cleared on
retry for PR #829 pipelines #906 → #908); unrelated to the nomad-validate
changes, which went all-green in #913.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Locks in static validation for every Nomad+Vault artifact before it can
merge. Four fail-closed steps in .woodpecker/nomad-validate.yml, gated
to PRs touching nomad/, lib/init/nomad/, or bin/disinto:
1. nomad config validate nomad/server.hcl nomad/client.hcl
2. vault operator diagnose -config=nomad/vault.hcl -skip=storage -skip=listener
3. shellcheck --severity=warning lib/init/nomad/*.sh bin/disinto
4. bats tests/disinto-init-nomad.bats — dispatcher smoke tests
bin/disinto picks up pre-existing SC2120 warnings on three passthrough
wrappers (generate_agent_docker, generate_caddyfile, generate_staging_index);
annotated with shellcheck disable=SC2120 so the new pipeline is clean
without narrowing the warning for future code.
Pinned image versions (hashicorp/nomad:1.9.5, hashicorp/vault:1.18.5)
match lib/init/nomad/install.sh — bump both or neither.
nomad/AGENTS.md documents the stack layout, how to add a jobspec in
Step 1, how CI validates it, and the two-place version pinning rule.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Make `disinto init` safe to re-run on the same box:
- Store admin token as FORGE_ADMIN_TOKEN in .env; preserve on re-run
(previously deleted and recreated every run, churning DB state)
- Fix human token creation: use admin_pass for basic-auth since
human_user == admin_user (previously used a random password that
never matched the actual user password, so HUMAN_TOKEN was never
created successfully)
- Preserve HUMAN_TOKEN in .env on re-run (same pattern as bot tokens)
- Bot tokens were already idempotent (preserved unless --rotate-tokens)
Add --dry-run flag that reports every intended action (file writes,
API calls, docker commands) based on current state, then exits 0
without touching state. Useful for CI gating and cutover confidence.
Update smoke test:
- Add dry-run test (verifies exit 0 and no .env modification)
- Add idempotency state diff (verifies .env is unchanged on re-run)
- Verify FORGE_ADMIN_TOKEN and HUMAN_TOKEN are stored in .env
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix off-by-one in mock admin/users/{username}/repos path extraction
(parts[4] was 'users', not the username — should be parts[5])
- Change _install_cron_impl to return 1 instead of exit 1 when crontab
is missing, so cron failure doesn't abort entire init
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add cleanup trap to smoke-init.sh that kills all mock-forgejo.py processes
on exit (success or failure). Also ensure cleanup at test start removes
any leftover processes from prior runs.
In .woodpecker/smoke-init.yml:
- Store the PID of the mock-forgejo.py background process
- Kill the process after smoke test completes
This prevents accumulation of orphaned Python processes that caused
OOM issues (2881 processes consuming 7.45GB RAM).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix collaborator PUT: use parts[7] instead of parts[6]
- Fix user/repos: store username in token object and use it for lookup
- Fix /mock/shutdown: strip leading slash unconditionally
- Fix SIGTERM: call server.shutdown() in a thread
- Use socket module constants for setsockopt
- Remove duplicate pattern
The mock docker in smoke-init.sh only handled 'admin user create' and
'admin user list'. Add a 'change-password' handler that PATCHes the
user via the Forgejo admin API to clear must_change_password.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The mock crontab file was not being created despite PATH precedence
working correctly. Replace the mock with the real BusyBox crontab
already available in the Forgejo Alpine image. Verify cron entries
via 'crontab -l' output instead of checking a mock state file.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>