Why: disinto_init() consumed $1 as repo_url before the argparse loop ran,
so `disinto init --backend=nomad --empty` had --backend=nomad swallowed
into repo_url, backend stayed at its "docker" default, and the --empty
validation then produced the nonsense "--empty is only valid with
--backend=nomad" error — flagged during S0.1 end-to-end verification on
a fresh LXC. nomad backend takes no positional anyway; the LXC already
has the repo cloned by the operator.
Change: only consume $1 as repo_url if it doesn't start with "--", then
defer the "repo URL required" check to after argparse (so the docker
path still errors with a helpful message on a missing positional, not
"Unknown option: --backend=docker").
Verified acceptance criteria:
1. init --backend=nomad --empty → dispatches to nomad
2. init --backend=nomad --empty --dry-run → 9-step plan, exit 0
3. init <repo-url> → docker path unchanged
4. init → "repo URL required"
5. init --backend=docker → "repo URL required"
(not "Unknown option")
6. shellcheck clean
Tests: 4 new regression cases in tests/disinto-init-nomad.bats covering
flag-first nomad invocation (both --flag=value and --flag value forms),
no-args docker default, and --backend=docker missing-positional error
path. Full suite: 10/10 pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Hiring a second llama-backed dev agent (e.g. `dev-qwen2`) alongside
`dev-qwen` tripped four defects that prevented safe parallel operation.
Gap 1 — hire-agent keyed per-agent token as FORGE_<ROLE>_TOKEN, so two
dev-role agents overwrote each other's token in .env. Re-key by agent
name via `tr 'a-z-' 'A-Z_'`: FORGE_TOKEN_<AGENT_UPPER>.
Gap 2 — hire-agent generated a random FORGE_PASS but never wrote it to
.env. The container's git credential helper needs both token and pass
to push over HTTPS (#361). Persist FORGE_PASS_<AGENT_UPPER> with the
same update-in-place idempotency as the token.
Gap 3 — _generate_local_model_services hardcoded FORGE_TOKEN_LLAMA for
every local-model service, forcing all hired llama agents to share one
Forgejo identity. Derive USER_UPPER from the TOML's `forge_user` field
and emit \${FORGE_TOKEN_<USER_UPPER>:-} per service.
Gap 4 — every local-model service mounted the shared `project-repos`
volume, so concurrent llama devs collided on /_factory worktree and
state/.dev-active. Switch to per-agent `project-repos-<service_name>`
and emit the matching top-level volume. Also escape embedded newlines
in `$all_vols` before the sed insertion so multi-agent volume lists
don't unterminate the substitute command.
.env.example documents the new FORGE_TOKEN_<AGENT> / FORGE_PASS_<AGENT>
naming convention (and preserves the legacy FORGE_TOKEN_LLAMA path used
by the ENABLE_LLAMA_AGENT=1 singleton build).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Forgejo's assignees PATCH is last-write-wins, so two dev agents polling
concurrently could both observe .assignee == null at the pre-check, both
PATCH, and the loser would silently "succeed" and proceed to implement
the same issue — colliding at the PR/branch stage.
Re-read the assignee after the PATCH and bail out if it isn't self.
Label writes are moved AFTER this verification so a losing claim leaves
no stray in-progress label to roll back.
Adds tests/lib-issue-claim.bats covering the three paths:
- happy path (single agent, re-read confirms self)
- lost race (re-read shows another agent — returns 1, no labels added)
- pre-check skip (initial GET already shows another agent)
Prerequisite for the LLAMA_BOTS parametric refactor that will run N
dev containers against the same project.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The bin/disinto flag loop has separate cases for `--backend value`
(space-separated) and `--backend=value`; a regression in either would
silently route to the docker default path. Per the "stub-first dispatch"
lesson, silent misrouting during a migration is the worst failure mode —
covering both forms closes that gap.
Also triggers a retry of the smoke-init pipeline step, which hit a known
Forgejo branch-indexing flake on pipeline #913 (same flake cleared on
retry for PR #829 pipelines #906 → #908); unrelated to the nomad-validate
changes, which went all-green in #913.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pipeline #911 on PR #833 failed because `vault operator diagnose -config=
nomad/vault.hcl -skip=storage -skip=listener` returns exit code 2 — not
on a hard failure, but because our factory dev-box vault.hcl deliberately
runs TLS-disabled on a localhost-only listener (documented in the file
header), which triggers an advisory "Check Listener TLS" warning.
The -skip flag disables runtime sub-checks (storage access, listener
bind) but does NOT suppress the advisory checks on the parsed config, so
a valid dev-box config with documented-and-intentional warnings still
exits non-zero under strict CI.
Fix: wrap the command in a case on exit code. Treat rc=0 (all green)
and rc=2 (advisory warnings only — config still parses) as success, and
fail hard on rc=1 (real HCL/schema/storage failure) or any other rc.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Locks in static validation for every Nomad+Vault artifact before it can
merge. Four fail-closed steps in .woodpecker/nomad-validate.yml, gated
to PRs touching nomad/, lib/init/nomad/, or bin/disinto:
1. nomad config validate nomad/server.hcl nomad/client.hcl
2. vault operator diagnose -config=nomad/vault.hcl -skip=storage -skip=listener
3. shellcheck --severity=warning lib/init/nomad/*.sh bin/disinto
4. bats tests/disinto-init-nomad.bats — dispatcher smoke tests
bin/disinto picks up pre-existing SC2120 warnings on three passthrough
wrappers (generate_agent_docker, generate_caddyfile, generate_staging_index);
annotated with shellcheck disable=SC2120 so the new pipeline is clean
without narrowing the warning for future code.
Pinned image versions (hashicorp/nomad:1.9.5, hashicorp/vault:1.18.5)
match lib/init/nomad/install.sh — bump both or neither.
nomad/AGENTS.md documents the stack layout, how to add a jobspec in
Step 1, how CI validates it, and the two-place version pinning rule.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CI duplicate-detection flagged the in-line vault + nomad polling loops
in cluster-up.sh as matching a 5-line window in vault-init.sh (the
`ready=1 / break / fi / sleep 1 / done` boilerplate).
Extracts the repeated pattern into three helpers at the top of the
file:
- nomad_has_ready_node wrapper so poll_until_healthy can take a
bare command name.
- _die_with_service_status shared "log + dump systemctl status +
die" path (factored out of the two
callsites + the timeout branch).
- poll_until_healthy ticks once per second up to TIMEOUT,
fail-fasts on systemd "failed" state,
and returns 0 on first successful check.
Step 7 (vault unseal) and Step 8 (nomad ready node) each collapse from
~15 lines of explicit for-loop bookkeeping to a one-line call. No
behavioural change: same tick cadence, same fail-fast, same status
dump on timeout. Local detect-duplicates.py run against main confirms
no new duplicates introduced.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wires S0.1–S0.3 into a single idempotent bring-up script and replaces
the S0.1 stub in _disinto_init_nomad so `disinto init --backend=nomad
--empty` produces a running empty single-node cluster on a fresh box.
lib/init/nomad/cluster-up.sh (new):
1. install.sh (nomad + vault binaries)
2. systemd-nomad.sh (unit + enable, not started)
3. systemd-vault.sh (unit + vault.hcl + enable)
4. host-volume dirs under /srv/disinto/* (matching nomad/client.hcl)
5. /etc/nomad.d/{server,client}.hcl (content-compare before write)
6. vault-init.sh (first-run init + unseal + persist keys)
7. systemctl start vault (poll until unsealed; fail-fast on
is-failed)
8. systemctl start nomad (poll until ≥1 node ready)
9. /etc/profile.d/disinto-nomad.sh (VAULT_ADDR + NOMAD_ADDR for
interactive shells)
Re-running on a healthy box is a no-op — each sub-step is itself
idempotent and steps 7/8 fast-path when already active + healthy.
`--dry-run` prints the full step list and exits 0.
bin/disinto:
- _disinto_init_nomad: replaces the S0.1 stub. Invokes cluster-up.sh
directly (as root) or via `sudo -n` otherwise. Both `--empty` and
the default (no flag) call cluster-up.sh today; Step 1 will branch
on $empty to gate job deployment. --dry-run forwards through.
- disinto_init: adds `--empty` flag parsing; rejects `--empty`
combined with `--backend=docker` explicitly instead of silently
ignoring it.
- usage: documents `--empty` and drops the "stub, S0.1" annotation
from --backend.
Closes#824.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds the Vault half of the factory-dev-box bringup, landed but not started
(per the install-but-don't-start pattern used for nomad in #822):
- lib/init/nomad/install.sh — now also installs vault from the shared
HashiCorp apt repo. VAULT_VERSION pinned (1.18.5). Fast-path skips apt
entirely when both binaries are at their pins; partial upgrades only
touch the package that drifted.
- nomad/vault.hcl — single-node config: file storage backend at
/var/lib/vault/data, localhost listener on :8200, ui on, mlock kept on.
No TLS / HA / audit yet; those land in later steps.
- lib/init/nomad/systemd-vault.sh — writes /etc/systemd/system/vault.service
(Type=notify, ExecStartPost auto-unseals from /etc/vault.d/unseal.key,
CAP_IPC_LOCK granted for mlock), deploys nomad/vault.hcl to
/etc/vault.d/, creates /var/lib/vault/data (0700 root), enables the
unit without starting it. Idempotent via content-compare.
- lib/init/nomad/vault-init.sh — first-run init: spawns a temporary
`vault server` if not already reachable, runs operator-init with
key-shares=1/threshold=1, persists unseal.key + root.token (0400 root),
unseals once in-process, shuts down the temp server. Re-run detects
initialized + unseal.key present → no-op. Initialized but key missing
is a hard failure (can't recover).
lib/hvault.sh already defaults VAULT_TOKEN to /etc/vault.d/root.token
when the env var is absent, so no change needed there.
Seal model: the single unseal key lives on disk; seal-key theft equals
vault theft. Factory-dev-box-acceptable tradeoff — avoids running a
second Vault to auto-unseal the first.
Blocks S0.4 (#824).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Lands the Nomad install + baseline HCL config for the single-node factory
dev box. Nothing is wired into `disinto init` yet — S0.4 does that.
- lib/init/nomad/install.sh: idempotent apt install pinned to
NOMAD_VERSION (default 1.9.5). Adds HashiCorp apt keyring and sources
list only if absent; fast-paths when the pinned version is already
installed.
- lib/init/nomad/systemd-nomad.sh: writes /etc/systemd/system/nomad.service
(rewrites only when content differs), creates /etc/nomad.d and
/var/lib/nomad, runs `systemctl enable nomad` WITHOUT starting.
- nomad/server.hcl: single-node combined server+client role. bootstrap_expect=1,
localhost bind, default ports pinned explicitly, UI enabled. No TLS/ACL —
factory dev box baseline.
- nomad/client.hcl: Docker task driver (allow_privileged=false, volumes
enabled) and host_volume pre-wiring for forgejo-data, woodpecker-data,
agent-data, project-repos, caddy-data, chat-history, ops-repo under
/srv/disinto/*.
Verified: `nomad config validate nomad/*.hcl` reports "Configuration is
valid!" (with expected TLS/bootstrap warnings for a dev box). Shellcheck
clean across the repo.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Lands the dispatch entry point for the Nomad+Vault migration. The docker
path remains the default and is byte-for-byte unchanged. The new
`--backend=nomad` value routes to a `_disinto_init_nomad` stub that fails
loud (exit 99) so no silent misrouting can happen while S0.2–S0.5 fill in
the real implementation. With `--dry-run --backend=nomad` the stub reports
status and exits 0 so dry-run callers (P7) don't see a hard failure.
- New `--backend <value>` flag (accepts `docker` | `nomad`); supports
both `--backend nomad` and `--backend=nomad` forms.
- Invalid backend values are rejected with a clear error.
- `_disinto_init_nomad` lives next to `disinto_init` so future S0.x
issues only need to fill in this function — flag parsing and dispatch
stay frozen.
- `--help` lists the flag and both values.
- `shellcheck bin/disinto` introduces no new findings beyond the
pre-existing baseline.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Bump AGENTS.md watermarks to HEAD (c363ee0) across all 9 per-directory files
- supervisor/AGENTS.md: document dual-container trigger (agents + edge) and SUPERVISOR_INTERVAL env var added by P1/#801
- lib/AGENTS.md: document agents-llama-all compose service (all 7 roles) added to generators.sh by P1/#801
- pending-actions.json: comment #623 (all deps now closed, ready for planner decomposition), comment #758 (needs human Forgejo admin action to unblock ops repo writes)
Replace write_result's direct filesystem write with commit_result_via_git,
which clones the ops repo into a scratch directory, writes the result file,
commits as vault-bot, and pushes. This removes the requirement for a shared
bind-mount between the dispatcher container and the host ops-repo clone.
- Idempotent: skips if result.json already exists upstream
- Retry loop: handles push conflicts with rebase-and-push (up to 3 attempts)
- Scratch dir: cleaned up via RETURN trap regardless of outcome
- Works identically under docker and future nomad backends
Prevents infinite retry loops when secret resolution or mount alias
validation fails before the docker run is attempted.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add supervisor role to entrypoint.sh polling loop (SUPERVISOR_INTERVAL,
default 20 min) and include it in default AGENT_ROLES
- Add agents-llama-all compose service (profile: agents-llama-all) with
all 7 roles: review, dev, gardener, architect, planner, predictor, supervisor
- Add agents-llama-all to lib/generators.sh for disinto init generation
- Update docs/agents-llama.md with profile table and usage instructions
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Make `disinto init` safe to re-run on the same box:
- Store admin token as FORGE_ADMIN_TOKEN in .env; preserve on re-run
(previously deleted and recreated every run, churning DB state)
- Fix human token creation: use admin_pass for basic-auth since
human_user == admin_user (previously used a random password that
never matched the actual user password, so HUMAN_TOKEN was never
created successfully)
- Preserve HUMAN_TOKEN in .env on re-run (same pattern as bot tokens)
- Bot tokens were already idempotent (preserved unless --rotate-tokens)
Add --dry-run flag that reports every intended action (file writes,
API calls, docker commands) based on current state, then exits 0
without touching state. Useful for CI gating and cutover confidence.
Update smoke test:
- Add dry-run test (verifies exit 0 and no .env modification)
- Add idempotency state diff (verifies .env is unchanged on re-run)
- Verify FORGE_ADMIN_TOKEN and HUMAN_TOKEN are stored in .env
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- _hvault_err: use jq instead of printf to produce valid JSON on all inputs
- hvault_kv_get: use jq --arg for key lookup to prevent filter injection
- hvault_kv_put: build payload entirely via jq to properly escape keys
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sites touched:
- lib/generators.sh: WOODPECKER_BACKEND_DOCKER_NETWORK now reads from
${WOODPECKER_CI_NETWORK:-disinto_disinto-net} so nomad jobspecs can
override the compose-generated network name.
- lib/forge-setup.sh: bare-mode _forgejo_exec() and setup_forge() use
${FORGEJO_CONTAINER_NAME:-disinto-forgejo} instead of hardcoding the
container name. Compose mode is unaffected (uses service name).
Documented exceptions (container_name directives in generators.sh
compose template output): these define names inside docker-compose.yml,
which is compose-specific output. Under nomad the generator is not used.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- FORGE_API is repo-scoped; /repos/migrate needs the global FORGE_API_BASE
- Use jq -n --arg for safe JSON construction (no shell interpolation)
- Update docs to reference FORGE_API_BASE
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace hardcoded host-side bind-mount paths with env vars so Nomad
jobspecs can reuse the same variables at cutover:
- CLAUDE_BIN_DIR: path to claude CLI binary (resolved at init time)
- CLAUDE_CONFIG_FILE: path to .claude.json (default ${HOME}/.claude.json)
- CLAUDE_DIR: path to .claude directory (default ${HOME}/.claude)
- AGENT_SSH_DIR: path to SSH keys (default ${HOME}/.ssh)
- SOPS_AGE_DIR: path to SOPS age keys (default ${HOME}/.config/sops/age)
generators.sh now writes CLAUDE_BIN_DIR to .env instead of sed-replacing
CLAUDE_BIN_PLACEHOLDER in docker-compose.yml.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>