Add `pull_policy: build` to every agent service emitted by the generator
that shares `docker/agents/Dockerfile` as its build context. Without it,
`docker compose up -d --force-recreate agents-<name>` reuses the cached
`disinto/agents:latest` image and silently keeps running stale
`docker/agents/entrypoint.sh` code even after the repo is updated. This
masked PR #864 (and likely earlier merges) — the fix landed on disk but
never reached the container.
#853 already paired `build:` with `image:` on hired-agent stanzas, which
was enough for first-time ups but not for re-ups. `pull_policy: build`
tells Compose to rebuild the image on every up; BuildKit's layer cache
makes the no-change case near-instant, and the change case picks up the
new source automatically. This covers:
- TOML-driven `agents-<name>` hired via `disinto hire-an-agent` — primary
target of the issue.
- Legacy `agents-llama` and `agents-llama-all` stanzas — same Dockerfile,
same staleness problem.
`bin/disinto up` already passed `--build`, so operators on the supported
UX path were already covered; this closes the gap for the direct
`docker compose` path the issue explicitly names in its acceptance.
Regression test added to `tests/lib-generators.bats` to pin the directive
alongside the existing #853 build/image invariants.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CI's duplicate-detection step (sliding 5-line window) flagged 4 new
duplicate blocks shared with lib/init/nomad/cluster-up.sh — both used
the same `dry_run=false; while [ $# -gt 0 ]; do case "$1" in --dry-run)
... -h|--help) ... *) die "unknown flag: $1" ;; esac done` shape.
vault-apply-policies.sh has exactly one optional flag, so a flat
single-arg case with an `'')` no-op branch is shorter and structurally
distinct from the multi-flag while-loop parsers elsewhere in the repo.
The --help text now uses printf instead of a heredoc, which avoids the
EOF/exit/;;/die anchor that was the other half of the duplicate window.
DIFF_BASE=main .woodpecker/detect-duplicates.py now reports 0 new
duplicate blocks. Behavior unchanged: --dry-run, --help, --bogus, and
no-arg invocations all verified locally.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Land the Vault ACL policies and an idempotent apply script. 18 policies:
service-{forgejo,woodpecker}, bot-{dev,review,gardener,architect,planner,
predictor,supervisor,vault,dev-qwen}, runner-{GITHUB,CODEBERG,CLAWHUB,
NPM,DOCKER_HUB}_TOKEN + runner-DEPLOY_KEY, and dispatcher.
tools/vault-apply-policies.sh diffs each file against the on-server
policy text before calling hvault_policy_apply, reporting created /
updated / unchanged per file. --dry-run prints planned names + SHA256
and makes no Vault calls.
vault/policies/AGENTS.md documents the naming convention (service-/
bot-/runner-/dispatcher), the KV path each policy grants, the rationale
for one-policy-per-runner-secret (AD-006 least-privilege at dispatch
time), and what lands in later S2.* issues (#880-#884).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Nomad's docker task driver reports Healthy=false without a running
dockerd. On the factory dev box docker was pre-installed so Step 0's
cluster-up passed silently, but a fresh ubuntu:24.04 LXC hit "missing
drivers" placement failures the moment Step 1 tried to deploy forgejo
(the first docker-driver consumer).
Fix install.sh to also install docker.io + enable --now docker.service
when absent, and add a poll for the nomad self-node's docker driver
Detected+Healthy before declaring Step 8 done — otherwise the race
between dockerd startup and nomad driver fingerprinting lets the node
reach "ready" while docker is still unhealthy.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
In _generate_local_model_services:
- Add FACTORY_REPO environment variable to enable factory bootstrap
- Add volume mounts for ./projects, ./.env, and ./state to provide real project TOMLs
In entrypoint.sh:
- Add validate_projects_dir() function that fails loudly if no real .toml files
are found in the projects directory (prevents silent-zombie mode where the
polling loop matches zero files and does nothing forever)
This fixes the issue where hired agents (via hire-an-agent) ran forever without
picking up any work because they were pinned to the baked /home/agent/disinto
directory with only *.toml.example files.
The TOML-driven hired-agent services (`_generate_local_model_services` in
`lib/generators.sh`) were emitting `image: ghcr.io/disinto/agents:<tag>`
for every hired agent. The ghcr image is not publicly pullable and
deployments don't carry ghcr credentials, so `docker compose up` failed
with `denied` on every new hire. The legacy `agents-llama` stanza dodged
this because it uses the registry-less local name plus a `build:` fallback.
Fix: match the legacy stanza — emit `build: { context: ., dockerfile:
docker/agents/Dockerfile }` paired with `image: disinto/agents:<tag>`.
Hosts that built locally with `disinto init --build` will find the image;
hosts without one will build it. No ghcr auth required either way.
Added a regression test that guards both the absence of the ghcr prefix
and the presence of the build directive.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Acceptance items 1-4 landed previously: the primary compose emission
(FORGE_BOT_USER_*) was fixed in #849 by re-keying on forge_user via
`tr 'a-z-' 'A-Z_'`, and the load-project.sh AGENT_* Python emitter was
normalized via `.upper().replace('-', '_')` in #862. Together they
produce `FORGE_BOT_USER_DEV_QWEN2` and `AGENT_DEV_QWEN2_BASE_URL` for
`[agents.dev-qwen2]` with `forge_user = "dev-qwen2"`.
This patch closes acceptance item 5 — the defence-in-depth warn-and-skip
in load-project.sh's two export loops. Hire-agent's up-front reject is
the primary line of defence (a validated `^[a-z]([a-z0-9]|-[a-z0-9])*$`
agent name can't produce a bad identifier), but a hand-edited TOML can
still smuggle invalid keys through:
- `[mirrors] my-mirror = "…"` — the `MIRROR_<NAME>` emitter only
upper-cases, so `MY-MIRROR` retains its dash and fails `export`.
- `[agents."weird name"]` — quoted TOML keys bypass the bare-key
grammar entirely, so spaces and other disallowed shell chars reach
the export loop unchanged.
Before this change, either case would abort load-project.sh under
`set -euo pipefail` — the exact failure mode the original #852
crash-loop was diagnosed from. Now each loop validates `$_key` against
`^[A-Za-z_][A-Za-z0-9_]*$` and warn-skips offenders so siblings still
load.
- `lib/load-project.sh` — regex guard + WARNING on stderr in both
`_PROJECT_VARS` and `_AGENT_VARS` export loops.
- `tests/lib-load-project.bats` — two regressions: dashed mirror key,
quoted agent section with space. Both assert (a) the load does not
abort and (b) sane siblings still load.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Key `FORGE_BOT_USER_*` on `$user_upper` (forge_user normalized with
`tr 'a-z-' 'A-Z_'`) instead of `${service_name^^}`, matching the
`FORGE_TOKEN_<FORGE_USER>` / `FORGE_PASS_<FORGE_USER>` convention two
lines above in the same emitted block. For `[agents.llama]` with
`forge_user = "dev-qwen"` this emits `FORGE_BOT_USER_DEV_QWEN: "dev-qwen"`
instead of the legacy `FORGE_BOT_USER_LLAMA`.
No external consumers read `FORGE_BOT_USER_*` today (verified via grep),
so no fallback/deprecation shim is needed — this is purely a one-site
fix at the sole producer.
Adds `tests/lib-generators.bats` as a regression guard. Follows the
existing `tests/lib-*.bats` pattern (developer-run, not CI-wired).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addresses review blocker on PR #868: the S1.3 PR renamed
nomad/jobs/forgejo.nomad.hcl → forgejo.hcl and changed the CI glob
from *.nomad.hcl to *.hcl, but nomad/AGENTS.md — the canonical spec
for the jobspec naming convention — still documented the old suffix
in six places. An agent following it would create <svc>.nomad.hcl
files (which match *.hcl and stay green) but the stated convention
would be wrong.
Updated all five references to use the new *.hcl / <service>.hcl
convention. Acceptance signal: `grep .nomad.hcl nomad/AGENTS.md`
returns zero matches.
- bin/disinto: Remove '[ -n "$base_url" ] || continue' guard that caused
all Anthropic-backend agents to be silently skipped during validation.
The base_url check is now scoped only to backend-credential selection.
- lib/hire-agent.sh: Add sed escaping for ANTHROPIC_BASE_URL value before
sed substitution (same pattern as ANTHROPIC_API_KEY at line 256).
Fixes AI review BLOCKER and MINOR issues on PR #866.
Picks up from abandoned PR #859 (branch fix/issue-842 @ 6408023). Two
bugs in the prior art:
1. The `--empty is only valid with --backend=nomad` guard was removed
when the `--with`/mutually-exclusive guards were added. This regressed
test #6 in tests/disinto-init-nomad.bats:102 — `disinto init
--backend=docker --empty --dry-run` was exiting 0 instead of failing.
Restored alongside the new guards.
2. `_disinto_init_nomad` unconditionally appended `--dry-run` to the
real-run deploy_cmd, so even `disinto init --backend=nomad --with
forgejo` (no --dry-run) would only echo the deploy plan instead of
actually running nomad job run. That violates the issue's acceptance
criteria ("Forgejo job deploys", "curl http://localhost:3000/api/v1/version
returns 200"). Removed.
All 17 tests in tests/disinto-init-nomad.bats now pass; shellcheck clean.
TOML allows dashes in bare keys, so `[agents.dev-qwen2]` is a valid
section. Before this fix, load-project.sh derived bash var names via
Python `.upper()` alone, which kept the dash and produced
`AGENT_DEV-QWEN2_BASE_URL` — an invalid shell identifier. Under
`set -euo pipefail` the subsequent `export` aborted the whole file,
silently taking the factory down on the N+1 run after a dashed agent
was hired via `disinto hire-an-agent`.
Normalize via `.upper().replace('-', '_')` to match the
`tr 'a-z-' 'A-Z_'` convention already used by hire-agent.sh (#834)
and generators.sh (#852). Also harden hire-agent.sh to reject invalid
agent names at hire time (before any Forgejo side effects), so
unparseable TOML sections never land on disk.
- `lib/load-project.sh` — dash-to-underscore in emitted shell var names
- `lib/hire-agent.sh` — validate agent name against
`^[a-z]([a-z0-9]|-[a-z0-9])*$` up front
- `tests/lib-load-project.bats` — regression guard covering the parse
path and the hire-time reject path
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TOML-driven agent services (emitted by `_generate_local_model_services`
for every `[agents.X]` entry) carried `profiles: ["agents-<name>"]`.
With `docker compose up -d --remove-orphans` and no `COMPOSE_PROFILES`
set, compose treated the hired agent container as an orphan and removed
it on every subsequent `disinto up` — silently killing dev-qwen and any
other TOML-declared local-model agent.
The profile gate was vestigial: the `[agents.X]` TOML entry is already
the activation gate — its presence is what drives emission of the
service block in the first place (#846). Drop the profile from emitted
services so they land in the default profile and survive `disinto up`.
Also update the "To start the agent, run" hint in `hire-an-agent` from
`docker compose --profile … up -d …` to `disinto up`, matching the new
activation model.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
hire-an-agent now adds the new Forgejo user as a `write` collaborator on
`$FORGE_REPO` right after the token step, mirroring the collaborator setup
lib/forge-setup.sh applies to the canonical bot users. Without this, a
freshly hired agent's PATCH to assign itself an issue returned 403 Forbidden
and the dev-agent polled forever logging "claim lost to <none>".
issue_claim() now captures the PATCH HTTP status via `-w '%{http_code}'`
instead of swallowing failures with `curl -sf ... || return 1`. A 403 (or
any non-2xx) now surfaces a distinct log line naming the code — the missing
collaborator root cause would have been diagnosable in seconds instead of
minutes.
Also updates the lib-issue-claim bats mock to handle the new `-w` flag and
adds a regression test covering the HTTP-error log surfacing path.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Step 2 of .woodpecker/nomad-validate.yml previously ran
`nomad job validate` against a single explicit path
(nomad/jobs/forgejo.nomad.hcl, wired up during the S1.1 review). Replace
that with a POSIX-sh loop over nomad/jobs/*.nomad.hcl so every jobspec
gets CI coverage automatically — no "edit the pipeline" step to forget
when the next jobspec (woodpecker, caddy, agents, …) lands.
Why reverse S1.1's explicit-line approach: the "no-ad-hoc-steps"
principle that drove the explicit list was about keeping step *classes*
enumerated, not about re-listing every file of the same class. Globbing
over `*.nomad.hcl` still encodes a single class ("jobspec validation")
and is strictly stricter — a dropped jobspec can't silently bypass CI
because someone forgot to add its line. The `.nomad.hcl` suffix (set as
convention by S1.1 review) is what keeps non-jobspec HCL out of this
loop.
Implementation notes:
- `[ -f "$f" ] || continue` guards the no-match case. POSIX sh has no
nullglob, so an empty jobs/ dir would otherwise leave the literal
glob in $f and fail nomad job validate with "no such file". Not
reachable today (forgejo.nomad.hcl exists), but keeps the step safe
against any transient empty state during future refactors.
- `set -e` inside the block ensures the first failing jobspec aborts
(default Woodpecker behavior, but explicit is cheap).
- Loop echoes the file being validated so CI logs point at the
specific jobspec on failure.
Docs (nomad/AGENTS.md):
- "How CI validates these files" now lists all *five* steps (the S1.1
review added step 2 but didn't update the doc; fixed in passing).
- Step 2 is documented with explicit scope: what offline validate
catches (unknown stanzas, missing required fields, wrong value
types, bad driver config) and what it does NOT catch (cross-file
host_volume name resolution against client.hcl — that's a
scheduling-time check; image reachability).
- "Adding a jobspec" step 4 updated: no pipeline edit required as
long as the file follows the `*.nomad.hcl` naming convention. The
suffix is now documented as load-bearing in step 1.
- Step 2 of the "Adding a jobspec" checklist cross-links the
host_volume scheduling-time check, so contributors know the
paired-write rule (client.hcl + cluster-up.sh) is the real
guardrail for that class of drift.
Acceptance criteria:
- Broken jobspec (typo in stanza, missing required field) fails step
2 with nomad's error message — covered by the loop over every file.
- Fixed jobspec passes — standard validate behavior.
- Step 1 (nomad config validate) untouched.
- No .sh changes, so no shellcheck impact; manual shellcheck pass
shown clean.
- Trigger path `nomad/**` already covers `nomad/jobs/**` (confirmed,
no change needed to `when:` block).
Refs: #843 (S1.4), #825 (S0.5 base pipeline), #840 (S1.1 first jobspec)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two blockers from the #844 review:
1. Rename nomad/jobs/forgejo.hcl → nomad/jobs/forgejo.nomad.hcl to match
the convention documented in nomad/AGENTS.md:38 (*.nomad.hcl suffix).
First jobspec sets the pattern for all future ones; keeps any glob-
based tooling over nomad/jobs/*.nomad.hcl working.
2. Add a dedicated `nomad-job-validate` step to .woodpecker/nomad-validate.yml.
`nomad config validate` (step 1) parses agent configs only — it rejects
jobspec HCL as "unknown block 'job'". `nomad job validate` is the
correct offline validator for jobspec HCL. Per the Hashicorp docs it
does not require a running agent (exit 0 clean, 1 on syntax/semantic
error). New jobspecs will add an explicit line alongside forgejo's,
matching step 1's enumeration pattern and this file's "no-ad-hoc-steps"
principle.
Also updated the file header comment and the pipeline's top-of-file step
index to reflect the new step ordering (2. nomad-job-validate inserted;
old 2-4 renumbered to 3-5).
Refs: #840 (S1.1), PR #844
First Nomad jobspec to land under nomad/jobs/ as part of the Nomad+Vault
migration. Proves the docker driver + host_volume plumbing wired up in
Step 0 (client.hcl) by defining a real factory service:
- job type=service, datacenters=["dc1"], 1 group × 1 task
- docker driver, image pinned to codeberg.org/forgejo/forgejo:11.0
(matches docker-compose.yml)
- network port "http" static=3000, to=3000 (same host:port as compose,
so agents/woodpecker/caddy reach forgejo unchanged across cutover)
- mounts the forgejo-data host_volume from nomad/client.hcl at /data
- non-secret env subset from docker-compose's forgejo service (DB
type, ROOT_URL, HTTP_PORT, INSTALL_LOCK, DISABLE_REGISTRATION,
webhook allow-list); OAuth/secret env vars land in Step 2 via Vault
- Nomad-native service discovery (provider="nomad", no Consul) with
HTTP check on /api/v1/version (10s interval, 3s timeout). No
initial_status override — Nomad waits for first probe to pass.
- restart: 3 attempts / 5m / 15s delay / mode=delay
- resources: cpu=300 memory=512 baseline
No changes to docker-compose.yml — the docker stack remains the
factory's runtime until cutover. CI integration (`nomad job validate`)
is tracked by #843.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Why: disinto_init() consumed $1 as repo_url before the argparse loop ran,
so `disinto init --backend=nomad --empty` had --backend=nomad swallowed
into repo_url, backend stayed at its "docker" default, and the --empty
validation then produced the nonsense "--empty is only valid with
--backend=nomad" error — flagged during S0.1 end-to-end verification on
a fresh LXC. nomad backend takes no positional anyway; the LXC already
has the repo cloned by the operator.
Change: only consume $1 as repo_url if it doesn't start with "--", then
defer the "repo URL required" check to after argparse (so the docker
path still errors with a helpful message on a missing positional, not
"Unknown option: --backend=docker").
Verified acceptance criteria:
1. init --backend=nomad --empty → dispatches to nomad
2. init --backend=nomad --empty --dry-run → 9-step plan, exit 0
3. init <repo-url> → docker path unchanged
4. init → "repo URL required"
5. init --backend=docker → "repo URL required"
(not "Unknown option")
6. shellcheck clean
Tests: 4 new regression cases in tests/disinto-init-nomad.bats covering
flag-first nomad invocation (both --flag=value and --flag value forms),
no-args docker default, and --backend=docker missing-positional error
path. Full suite: 10/10 pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Hiring a second llama-backed dev agent (e.g. `dev-qwen2`) alongside
`dev-qwen` tripped four defects that prevented safe parallel operation.
Gap 1 — hire-agent keyed per-agent token as FORGE_<ROLE>_TOKEN, so two
dev-role agents overwrote each other's token in .env. Re-key by agent
name via `tr 'a-z-' 'A-Z_'`: FORGE_TOKEN_<AGENT_UPPER>.
Gap 2 — hire-agent generated a random FORGE_PASS but never wrote it to
.env. The container's git credential helper needs both token and pass
to push over HTTPS (#361). Persist FORGE_PASS_<AGENT_UPPER> with the
same update-in-place idempotency as the token.
Gap 3 — _generate_local_model_services hardcoded FORGE_TOKEN_LLAMA for
every local-model service, forcing all hired llama agents to share one
Forgejo identity. Derive USER_UPPER from the TOML's `forge_user` field
and emit \${FORGE_TOKEN_<USER_UPPER>:-} per service.
Gap 4 — every local-model service mounted the shared `project-repos`
volume, so concurrent llama devs collided on /_factory worktree and
state/.dev-active. Switch to per-agent `project-repos-<service_name>`
and emit the matching top-level volume. Also escape embedded newlines
in `$all_vols` before the sed insertion so multi-agent volume lists
don't unterminate the substitute command.
.env.example documents the new FORGE_TOKEN_<AGENT> / FORGE_PASS_<AGENT>
naming convention (and preserves the legacy FORGE_TOKEN_LLAMA path used
by the ENABLE_LLAMA_AGENT=1 singleton build).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>