diff --git a/.env.example b/.env.example
index a1f24d5..c1c0b98 100644
--- a/.env.example
+++ b/.env.example
@@ -32,10 +32,13 @@ FORGE_URL=http://localhost:3000 # [CONFIG] local Forgejo instance
 # - FORGE_PASS_DEV_QWEN2
 # Name conversion: tr 'a-z-' 'A-Z_' (lowercase→UPPER, hyphens→underscores).
 # The compose generator looks these up via the agent's `forge_user` field in
-# the project TOML. Configure local-model agents via [agents.X] sections in
-# projects/*.toml — this is the canonical activation path.
+# the project TOML. The pre-existing `dev-qwen` llama agent uses
+# FORGE_TOKEN_LLAMA / FORGE_PASS_LLAMA (kept for backwards-compat with the
+# legacy `ENABLE_LLAMA_AGENT=1` single-agent path).
 FORGE_TOKEN= # [SECRET] dev-bot API token (default for all agents)
 FORGE_PASS= # [SECRET] dev-bot password for git HTTP push (#361)
+FORGE_TOKEN_LLAMA= # [SECRET] dev-qwen API token (for agents-llama)
+FORGE_PASS_LLAMA= # [SECRET] dev-qwen password for git HTTP push
 FORGE_REVIEW_TOKEN= # [SECRET] review-bot API token
 FORGE_REVIEW_PASS= # [SECRET] review-bot password for git HTTP push
 FORGE_PLANNER_TOKEN= # [SECRET] planner-bot API token
@@ -104,6 +107,13 @@ FORWARD_AUTH_SECRET= # [SECRET] Shared secret for Caddy ↔
 
 # Store all project secrets here so formulas reference env vars, never hardcode.
 BASE_RPC_URL= # [SECRET] on-chain RPC endpoint
+
+# ── Local Qwen dev agent (optional) ──────────────────────────────────────
+# Set ENABLE_LLAMA_AGENT=1 to emit agents-llama in docker-compose.yml.
+# Requires a running llama-server reachable at ANTHROPIC_BASE_URL.
+# See docs/agents-llama.md for details.
+ENABLE_LLAMA_AGENT=0 # [CONFIG] 1 = enable agents-llama service
+ANTHROPIC_BASE_URL= # [CONFIG] e.g. http://host.docker.internal:8081
 
 # ── Tuning ────────────────────────────────────────────────────────────────
 CLAUDE_TIMEOUT=7200 # [CONFIG] max seconds per Claude invocation
diff --git a/.gitignore b/.gitignore
index a29450c..21c6fbc 100644
--- a/.gitignore
+++ b/.gitignore
@@ -20,6 +20,7 @@ metrics/supervisor-metrics.jsonl
 # OS
 .DS_Store
 dev/ci-fixes-*.json
+gardener/dust.jsonl
 
 # Individual encrypted secrets (managed by disinto secrets add)
 secrets/
diff --git a/.woodpecker/nomad-validate.yml b/.woodpecker/nomad-validate.yml
index 5a1cc7c..81e45ae 100644
--- a/.woodpecker/nomad-validate.yml
+++ b/.woodpecker/nomad-validate.yml
@@ -1,21 +1,16 @@
 # =============================================================================
 # .woodpecker/nomad-validate.yml — Static validation for Nomad+Vault artifacts
 #
-# Part of the Nomad+Vault migration (S0.5, issue #825; extended in S2.6,
-# issue #884). Locks in the "no-ad-hoc-steps" principle: every HCL/shell
-# artifact under nomad/, lib/init/nomad/, vault/policies/, plus the
-# `disinto init` dispatcher and vault/roles.yaml, gets checked before it
-# can land.
+# Part of the Nomad+Vault migration (S0.5, issue #825). Locks in the
+# "no-ad-hoc-steps" principle: every HCL/shell artifact under nomad/ or
+# lib/init/nomad/, plus the `disinto init` dispatcher, gets checked
+# before it can land.
# # Triggers on PRs (and pushes) that touch any of: # nomad/** — HCL configs (server, client, vault) -# lib/init/nomad/** — cluster-up / install / systemd / vault-init / -# vault-nomad-auth (S2.6 trigger: vault-*.sh -# is a subset of this glob) +# lib/init/nomad/** — cluster-up / install / systemd / vault-init # bin/disinto — `disinto init --backend=nomad` dispatcher # tests/disinto-init-nomad.bats — the bats suite itself -# vault/policies/** — Vault ACL policy HCL files (S2.1, S2.6) -# vault/roles.yaml — JWT-auth role bindings (S2.3, S2.6) # .woodpecker/nomad-validate.yml — the pipeline definition # # Steps (all fail-closed — any error blocks merge): @@ -24,22 +19,8 @@ # nomad/jobs/*.hcl (new jobspecs get # CI coverage automatically) # 3. vault-operator-diagnose — `vault operator diagnose` syntax check on vault.hcl -# 4. vault-policy-fmt — `vault policy fmt` idempotence check on -# every vault/policies/*.hcl (format drift = -# CI fail; non-destructive via cp+diff) -# 5. vault-policy-validate — HCL syntax + capability validation for every -# vault/policies/*.hcl via `vault policy write` -# against an inline dev-mode Vault server -# 6. vault-roles-validate — yamllint + role→policy reference check on -# vault/roles.yaml (every referenced policy -# must exist as vault/policies/.hcl) -# 7. shellcheck-nomad — shellcheck the cluster-up + install scripts + disinto -# 8. bats-init-nomad — `disinto init --backend=nomad --dry-run` smoke tests -# -# Secret-scan coverage: vault/policies/*.hcl is already scanned by the -# P11 gate (.woodpecker/secret-scan.yml, issue #798) — its trigger path -# `vault/**/*` covers everything under this directory. We intentionally -# do NOT duplicate that gate here; one scanner, one source of truth. +# 4. shellcheck-nomad — shellcheck the cluster-up + install scripts + disinto +# 5. bats-init-nomad — `disinto init --backend=nomad --dry-run` smoke tests # # Pinned image versions match lib/init/nomad/install.sh (nomad 1.9.5 / # vault 1.18.5). 
Bump there AND here together — drift = CI passing on @@ -53,8 +34,6 @@ when: - "lib/init/nomad/**" - "bin/disinto" - "tests/disinto-init-nomad.bats" - - "vault/policies/**" - - "vault/roles.yaml" - ".woodpecker/nomad-validate.yml" # Authenticated clone — same pattern as .woodpecker/ci.yml. Forgejo is @@ -144,176 +123,7 @@ steps: *) echo "vault config: hard failure (rc=$rc)" >&2; exit "$rc" ;; esac - # ── 4. Vault policy fmt idempotence check ──────────────────────────────── - # `vault policy fmt ` formats a local HCL policy file in place. - # There's no `-check`/dry-run flag (vault 1.18.5), so we implement a - # non-destructive check as cp → fmt-on-copy → diff against original. - # Any diff means the committed file would be rewritten by `vault policy - # fmt` — failure steers the author to run `vault policy fmt ` - # locally before pushing. - # - # Scope: vault/policies/*.hcl only. The `[ -f "$f" ]` guard handles the - # no-match case (POSIX sh does not nullglob) so an empty policies/ - # directory does not fail this step. - # - # Note: `vault policy fmt` is purely local (HCL text transform) and does - # not require a running Vault server, which is why this step can run - # without starting one. - - name: vault-policy-fmt - image: hashicorp/vault:1.18.5 - commands: - - | - set -e - failed=0 - for f in vault/policies/*.hcl; do - [ -f "$f" ] || continue - tmp="/tmp/$(basename "$f").fmt" - cp "$f" "$tmp" - vault policy fmt "$tmp" >/dev/null 2>&1 - if ! diff -u "$f" "$tmp"; then - echo "ERROR: $f is not formatted — run 'vault policy fmt $f' locally" >&2 - failed=1 - fi - done - if [ "$failed" -gt 0 ]; then - echo "vault-policy-fmt: formatting drift detected" >&2 - exit 1 - fi - echo "vault-policy-fmt: all policies formatted correctly" - - # ── 5. 
Vault policy HCL syntax + capability validation ─────────────────── - # Vault has no offline `vault policy validate` subcommand — the closest - # in-CLI validator is `vault policy write`, which sends the HCL to a - # running server which parses it, checks capability names against the - # known set (read, list, create, update, delete, patch, sudo, deny), - # and rejects unknown stanzas / malformed path blocks. We start an - # inline dev-mode Vault (in-memory, no persistence, root token = "root") - # for the duration of this step and loop `vault policy write` over every - # vault/policies/*.hcl; the policies never leave the ephemeral dev - # server, so this is strictly a validator — not a deploy. - # - # Exit-code handling: - # - `vault policy write` exits 0 on success, non-zero on any parse / - # semantic error. We aggregate failures across all files so a single - # CI run surfaces every broken policy (not just the first). - # - The dev server is killed on any step exit via EXIT trap so the - # step tears down cleanly even on failure. - # - # Why dev-mode is sufficient: we're not persisting secrets, only asking - # Vault to parse policy text. The factory's production Vault is NOT - # contacted. - - name: vault-policy-validate - image: hashicorp/vault:1.18.5 - commands: - - | - set -e - vault server -dev -dev-root-token-id=root -dev-listen-address=127.0.0.1:8200 >/tmp/vault-dev.log 2>&1 & - VAULT_PID=$! - trap 'kill "$VAULT_PID" 2>/dev/null || true' EXIT INT TERM - export VAULT_ADDR=http://127.0.0.1:8200 - export VAULT_TOKEN=root - ready=0 - i=0 - while [ "$i" -lt 30 ]; do - if vault status >/dev/null 2>&1; then - ready=1 - break - fi - i=$((i + 1)) - sleep 0.5 - done - if [ "$ready" -ne 1 ]; then - echo "vault-policy-validate: dev server failed to start after 15s" >&2 - cat /tmp/vault-dev.log >&2 || true - exit 1 - fi - failed=0 - for f in vault/policies/*.hcl; do - [ -f "$f" ] || continue - name=$(basename "$f" .hcl) - echo "validate: $f" - if ! 
vault policy write "$name" "$f"; then - echo " ERROR: $f failed validation" >&2 - failed=1 - fi - done - if [ "$failed" -gt 0 ]; then - echo "vault-policy-validate: validation errors found" >&2 - exit 1 - fi - echo "vault-policy-validate: all policies valid" - - # ── 6. vault/roles.yaml validator ──────────────────────────────────────── - # Validates the JWT-auth role bindings file (S2.3). Two checks: - # - # a. `yamllint` — catches YAML syntax errors and indentation drift. - # Uses a relaxed config (line length bumped to 200) because - # roles.yaml's comments are wide by design. - # b. role → policy reference check — every role's `policy:` field - # must match a basename in vault/policies/*.hcl. A role pointing - # at a non-existent policy = runtime "permission denied" at job - # placement; catching the drift here turns it into a CI failure. - # Also verifies each role entry has the four required fields - # (name, policy, namespace, job_id) per the file's documented - # format. - # - # Parsing is done with PyYAML (the roles.yaml format is a strict - # subset that awk-level parsing in tools/vault-apply-roles.sh handles - # too, but PyYAML in CI gives us structural validation for free). If - # roles.yaml is ever absent (e.g. reverted), the step skips rather - # than fails — presence is enforced by S2.3's own tooling, not here. - - name: vault-roles-validate - image: python:3.12-alpine - commands: - - pip install --quiet --disable-pip-version-check pyyaml yamllint - - | - set -e - if [ ! 
-f vault/roles.yaml ]; then - echo "vault-roles-validate: vault/roles.yaml not present, skipping" - exit 0 - fi - yamllint -d '{extends: relaxed, rules: {line-length: {max: 200}}}' vault/roles.yaml - echo "vault-roles-validate: yamllint OK" - python3 - <<'PY' - import os - import sys - import yaml - - with open('vault/roles.yaml') as f: - data = yaml.safe_load(f) or {} - roles = data.get('roles') or [] - if not roles: - print("vault-roles-validate: no roles defined in vault/roles.yaml", file=sys.stderr) - sys.exit(1) - existing = { - os.path.splitext(e)[0] - for e in os.listdir('vault/policies') - if e.endswith('.hcl') - } - required = ('name', 'policy', 'namespace', 'job_id') - failed = 0 - for r in roles: - if not isinstance(r, dict): - print(f"ERROR: role entry is not a mapping: {r!r}", file=sys.stderr) - failed = 1 - continue - for field in required: - if r.get(field) in (None, ''): - print(f"ERROR: role entry missing required field '{field}': {r}", file=sys.stderr) - failed = 1 - policy = r.get('policy') - if policy and policy not in existing: - print( - f"ERROR: role '{r.get('name')}' references policy '{policy}' " - f"but vault/policies/{policy}.hcl does not exist", - file=sys.stderr, - ) - failed = 1 - sys.exit(failed) - PY - echo "vault-roles-validate: all role→policy references valid" - - # ── 7. Shellcheck ──────────────────────────────────────────────────────── + # ── 4. Shellcheck ──────────────────────────────────────────────────────── # Covers the new lib/init/nomad/*.sh scripts plus bin/disinto (which owns # the backend dispatcher). bin/disinto has no .sh extension so the # repo-wide shellcheck in .woodpecker/ci.yml skips it — this step is the @@ -323,7 +133,7 @@ steps: commands: - shellcheck --severity=warning lib/init/nomad/*.sh bin/disinto - # ── 8. bats: `disinto init --backend=nomad --dry-run` ──────────────────── + # ── 5. 
bats: `disinto init --backend=nomad --dry-run` ──────────────────── # Smoke-tests the CLI dispatcher: both --backend=nomad variants exit 0 # with the expected step list, and --backend=docker stays on the docker # path (regression guard). Pure dry-run — no sudo, no network. diff --git a/AGENTS.md b/AGENTS.md index fced0c6..eec058c 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -1,4 +1,4 @@ - + # Disinto — Agent Instructions ## What this repo is @@ -39,18 +39,15 @@ disinto/ (code repo) │ hooks/ — Claude Code session hooks (on-compact-reinject, on-idle-stop, on-phase-change, on-pretooluse-guard, on-session-end, on-stop-failure) │ init/nomad/ — cluster-up.sh, install.sh, vault-init.sh, lib-systemd.sh (Nomad+Vault Step 0 installers, #821-#825) ├── nomad/ server.hcl, client.hcl, vault.hcl — HCL configs deployed to /etc/nomad.d/ and /etc/vault.d/ by lib/init/nomad/cluster-up.sh -│ jobs/ — Nomad jobspecs (forgejo.hcl reads Vault secrets via template stanza, S2.4) ├── projects/ *.toml.example — templates; *.toml — local per-box config (gitignored) ├── formulas/ Issue templates (TOML specs for multi-step agent tasks) ├── docker/ Dockerfiles and entrypoints: reproduce, triage, edge dispatcher, chat (server.py, entrypoint-chat.sh, Dockerfile, ui/) ├── tools/ Operational tools: edge-control/ (register.sh, install.sh, verify-chat-sandbox.sh) -│ vault-apply-policies.sh, vault-apply-roles.sh, vault-import.sh — Vault provisioning (S2.1/S2.2) -│ vault-seed-.sh — per-service Vault secret seeders; auto-invoked by `bin/disinto --with ` (add a new file to support a new service) ├── docs/ Protocol docs (PHASE-PROTOCOL.md, EVIDENCE-ARCHITECTURE.md) ├── site/ disinto.ai website content -├── tests/ Test files (mock-forgejo.py, smoke-init.sh, lib-hvault.bats, lib-generators.bats, vault-import.bats, disinto-init-nomad.bats) +├── tests/ Test files (mock-forgejo.py, smoke-init.sh, lib-hvault.bats, disinto-init-nomad.bats) ├── templates/ Issue templates -├── bin/ The `disinto` CLI script (`--with ` 
deploys services + runs their Vault seeders) +├── bin/ The `disinto` CLI script ├── disinto-factory/ Setup documentation and skill ├── state/ Runtime state ├── .woodpecker/ Woodpecker CI pipeline configs @@ -123,7 +120,8 @@ bash dev/phase-test.sh | Reproduce | `docker/reproduce/` | Bug reproduction using Playwright MCP | `formulas/reproduce.toml` | | Triage | `docker/reproduce/` | Deep root cause analysis | `formulas/triage.toml` | | Edge dispatcher | `docker/edge/` | Polls ops repo for vault actions, executes via Claude sessions | `docker/edge/dispatcher.sh` | -| Local-model agents | `docker/agents/` (same image) | Local llama-server agents configured via `[agents.X]` sections in project TOML | [docs/agents-llama.md](docs/agents-llama.md) | +| agents-llama | `docker/agents/` (same image) | Local-Qwen dev agent (`AGENT_ROLES=dev`), gated on `ENABLE_LLAMA_AGENT=1` | [docs/agents-llama.md](docs/agents-llama.md) | +| agents-llama-all | `docker/agents/` (same image) | Local-Qwen all-roles agent (all 7 roles), profile `agents-llama-all` | [docs/agents-llama.md](docs/agents-llama.md) | > **Vault:** Being redesigned as a PR-based approval workflow (issues #73-#77). > See [docs/VAULT.md](docs/VAULT.md) for the vault PR workflow details. @@ -194,7 +192,9 @@ Humans write these. Agents read and enforce them. ## Phase-Signaling Protocol -When running as a persistent tmux session, Claude must signal the orchestrator at each phase boundary by writing to a phase file (e.g. `/tmp/dev-session-{project}-{issue}.phase`). +When running as a persistent tmux session, Claude must signal the orchestrator +at each phase boundary by writing to a phase file (e.g. +`/tmp/dev-session-{project}-{issue}.phase`). Key phases: `PHASE:awaiting_ci` → `PHASE:awaiting_review` → `PHASE:done`. Also: `PHASE:escalate` (needs human input), `PHASE:failed`. See [docs/PHASE-PROTOCOL.md](docs/PHASE-PROTOCOL.md) for the complete spec, orchestrator reaction matrix, sequence diagram, and crash recovery. 
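The phase-file handshake AGENTS.md describes above can be sketched as a tiny shell helper. This is a sketch only — `signal_phase` is a hypothetical name, not one of the real hooks under `lib/hooks/`; the path template and `PHASE:` prefix come from the protocol text:

```shell
# Sketch of the phase-signaling write (hypothetical helper; the real hook
# scripts live in lib/hooks/ and are not shown in this diff).
signal_phase() {
  project="$1"; issue="$2"; phase="$3"
  phase_file="/tmp/dev-session-${project}-${issue}.phase"
  # Write via a temp file + mv so the orchestrator never reads a
  # half-written phase line.
  printf 'PHASE:%s\n' "$phase" > "${phase_file}.tmp"
  mv "${phase_file}.tmp" "$phase_file"
}

signal_phase demo 42 awaiting_ci
cat /tmp/dev-session-demo-42.phase   # → PHASE:awaiting_ci
```

The orchestrator side would poll or inotify-watch the same path and dispatch on the `PHASE:` value (`awaiting_ci`, `awaiting_review`, `done`, `escalate`, `failed`).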
diff --git a/architect/AGENTS.md b/architect/AGENTS.md index 51b24b1..9582b03 100644 --- a/architect/AGENTS.md +++ b/architect/AGENTS.md @@ -1,4 +1,4 @@ - + # Architect — Agent Instructions ## What this agent is diff --git a/bin/disinto b/bin/disinto index 5f57927..1d5e01e 100755 --- a/bin/disinto +++ b/bin/disinto @@ -60,7 +60,7 @@ Usage: Read CI logs from Woodpecker SQLite disinto release Create vault PR for release (e.g., v1.2.0) disinto hire-an-agent [--formula ] [--local-model ] [--model ] - Hire a new agent (create user + .profile repo; re-run to rotate credentials) + Hire a new agent (create user + .profile repo) disinto agent Manage agent state (enable/disable) disinto edge [options] Manage edge tunnel registrations @@ -89,9 +89,6 @@ Init options: --yes Skip confirmation prompts --rotate-tokens Force regeneration of all bot tokens/passwords (idempotent by default) --dry-run Print every intended action without executing - --import-env (nomad) Path to .env file for import into Vault KV (S2.5) - --import-sops (nomad) Path to sops-encrypted .env.vault.enc for import (S2.5) - --age-key (nomad) Path to age keyfile (required with --import-sops) (S2.5) Hire an agent options: --formula Path to role formula TOML (default: formulas/.toml) @@ -667,13 +664,8 @@ prompt_admin_password() { # `sudo disinto init ...` directly. _disinto_init_nomad() { local dry_run="${1:-false}" empty="${2:-false}" with_services="${3:-}" - local import_env="${4:-}" import_sops="${5:-}" age_key="${6:-}" local cluster_up="${FACTORY_ROOT}/lib/init/nomad/cluster-up.sh" local deploy_sh="${FACTORY_ROOT}/lib/init/nomad/deploy.sh" - local vault_engines_sh="${FACTORY_ROOT}/lib/init/nomad/vault-engines.sh" - local vault_policies_sh="${FACTORY_ROOT}/tools/vault-apply-policies.sh" - local vault_auth_sh="${FACTORY_ROOT}/lib/init/nomad/vault-nomad-auth.sh" - local vault_import_sh="${FACTORY_ROOT}/tools/vault-import.sh" if [ ! 
-x "$cluster_up" ]; then echo "Error: ${cluster_up} not found or not executable" >&2 @@ -685,42 +677,6 @@ _disinto_init_nomad() { exit 1 fi - # --empty short-circuits after cluster-up: no policies, no auth, no - # import, no deploy. It's the "cluster-only escape hatch" for debugging - # (docs/nomad-migration.md). Caller-side validation already rejects - # --empty combined with --with or any --import-* flag, so reaching - # this branch with those set is a bug in the caller. - # - # On the default (non-empty) path, vault-engines.sh (enables the kv/ - # mount), vault-apply-policies.sh, and vault-nomad-auth.sh are invoked - # unconditionally — they are idempotent and cheap to re-run, and - # subsequent --with deployments depend on them. vault-import.sh is - # invoked only when an --import-* flag is set. vault-engines.sh runs - # first because every policy and role below references kv/disinto/* - # paths, which 403 if the engine is not yet mounted (issue #912). - local import_any=false - if [ -n "$import_env" ] || [ -n "$import_sops" ]; then - import_any=true - fi - if [ "$empty" != "true" ]; then - if [ ! -x "$vault_engines_sh" ]; then - echo "Error: ${vault_engines_sh} not found or not executable" >&2 - exit 1 - fi - if [ ! -x "$vault_policies_sh" ]; then - echo "Error: ${vault_policies_sh} not found or not executable" >&2 - exit 1 - fi - if [ ! -x "$vault_auth_sh" ]; then - echo "Error: ${vault_auth_sh} not found or not executable" >&2 - exit 1 - fi - if [ "$import_any" = true ] && [ ! -x "$vault_import_sh" ]; then - echo "Error: ${vault_import_sh} not found or not executable" >&2 - exit 1 - fi - fi - # --empty and default both invoke cluster-up today. Log the requested # mode so the dispatch is visible in factory bootstrap logs — Step 1 # will branch on $empty to gate the job-deployment path. 
@@ -730,7 +686,7 @@ _disinto_init_nomad() { echo "nomad backend: default (cluster-up; jobs deferred to Step 1)" fi - # Dry-run: print cluster-up plan + policies/auth/import plan + deploy.sh plan + # Dry-run: print cluster-up plan + deploy.sh plan if [ "$dry_run" = "true" ]; then echo "" echo "── Cluster-up dry-run ─────────────────────────────────" @@ -738,74 +694,10 @@ _disinto_init_nomad() { "${cmd[@]}" || true echo "" - # --empty skips policies/auth/import/deploy — cluster-up only, no - # workloads. The operator-visible dry-run plan must match the real - # run, so short-circuit here too. - if [ "$empty" = "true" ]; then - exit 0 - fi - - # Vault engines + policies + auth are invoked on every nomad real-run - # path regardless of --import-* flags (they're idempotent; S2.1 + S2.3). - # Engines runs first because policies/roles/templates all reference the - # kv/ mount it enables (issue #912). Mirror that ordering in the - # dry-run plan so the operator sees the full sequence Step 2 will - # execute. - echo "── Vault engines dry-run ──────────────────────────────" - echo "[engines] [dry-run] ${vault_engines_sh} --dry-run" - echo "" - echo "── Vault policies dry-run ─────────────────────────────" - echo "[policies] [dry-run] ${vault_policies_sh} --dry-run" - echo "" - echo "── Vault auth dry-run ─────────────────────────────────" - echo "[auth] [dry-run] ${vault_auth_sh}" - echo "" - - # Import plan: one line per --import-* flag that is actually set. - # Printing independently (not in an if/elif chain) means that all - # three flags appearing together each echo their own path — the - # regression that bit prior implementations of this issue (#883). 
- if [ "$import_any" = true ]; then - echo "── Vault import dry-run ───────────────────────────────" - [ -n "$import_env" ] && echo "[import] --import-env env file: ${import_env}" - [ -n "$import_sops" ] && echo "[import] --import-sops sops file: ${import_sops}" - [ -n "$age_key" ] && echo "[import] --age-key age key: ${age_key}" - local -a import_dry_cmd=("$vault_import_sh") - [ -n "$import_env" ] && import_dry_cmd+=("--env" "$import_env") - [ -n "$import_sops" ] && import_dry_cmd+=("--sops" "$import_sops") - [ -n "$age_key" ] && import_dry_cmd+=("--age-key" "$age_key") - import_dry_cmd+=("--dry-run") - echo "[import] [dry-run] ${import_dry_cmd[*]}" - echo "" - else - echo "[import] no --import-env/--import-sops — skipping; set them or seed kv/disinto/* manually before deploying secret-dependent services" - echo "" - fi - if [ -n "$with_services" ]; then - # Vault seed plan (S2.6, #928): one line per service whose - # tools/vault-seed-.sh ships. Services without a seeder are - # silently skipped — the real-run loop below mirrors this, - # making `--with woodpecker` in Step 3 auto-invoke - # tools/vault-seed-woodpecker.sh once that file lands without - # any further change to bin/disinto. 
- local seed_hdr_printed=false - local IFS=',' - for svc in $with_services; do - svc=$(echo "$svc" | xargs) # trim whitespace - local seed_script="${FACTORY_ROOT}/tools/vault-seed-${svc}.sh" - if [ -x "$seed_script" ]; then - if [ "$seed_hdr_printed" = false ]; then - echo "── Vault seed dry-run ─────────────────────────────────" - seed_hdr_printed=true - fi - echo "[seed] [dry-run] ${seed_script} --dry-run" - fi - done - [ "$seed_hdr_printed" = true ] && echo "" - echo "── Deploy services dry-run ────────────────────────────" echo "[deploy] services to deploy: ${with_services}" + local IFS=',' for svc in $with_services; do svc=$(echo "$svc" | xargs) # trim whitespace # Validate known services first @@ -829,7 +721,7 @@ _disinto_init_nomad() { exit 0 fi - # Real run: cluster-up + policies + auth + (optional) import + deploy + # Real run: cluster-up + deploy services local -a cluster_cmd=("$cluster_up") if [ "$(id -u)" -eq 0 ]; then "${cluster_cmd[@]}" || exit $? @@ -841,122 +733,6 @@ _disinto_init_nomad() { sudo -n -- "${cluster_cmd[@]}" || exit $? fi - # --empty short-circuits here: cluster-up only, no policies/auth/import - # and no deploy. Matches the dry-run plan above and the docs/runbook. - if [ "$empty" = "true" ]; then - exit 0 - fi - - # Enable Vault secret engines (S2.1 / issue #912) — must precede - # policies/auth/import because every policy and every import target - # addresses paths under kv/. Idempotent, safe to re-run. - echo "" - echo "── Enabling Vault secret engines ──────────────────────" - local -a engines_cmd=("$vault_engines_sh") - if [ "$(id -u)" -eq 0 ]; then - "${engines_cmd[@]}" || exit $? - else - if ! command -v sudo >/dev/null 2>&1; then - echo "Error: vault-engines.sh must run as root and sudo is not installed" >&2 - exit 1 - fi - sudo -n -- "${engines_cmd[@]}" || exit $? - fi - - # Apply Vault policies (S2.1) — idempotent, safe to re-run. 
- echo "" - echo "── Applying Vault policies ────────────────────────────" - local -a policies_cmd=("$vault_policies_sh") - if [ "$(id -u)" -eq 0 ]; then - "${policies_cmd[@]}" || exit $? - else - if ! command -v sudo >/dev/null 2>&1; then - echo "Error: vault-apply-policies.sh must run as root and sudo is not installed" >&2 - exit 1 - fi - sudo -n -- "${policies_cmd[@]}" || exit $? - fi - - # Configure Vault JWT auth + Nomad workload identity (S2.3) — idempotent. - echo "" - echo "── Configuring Vault JWT auth ─────────────────────────" - local -a auth_cmd=("$vault_auth_sh") - if [ "$(id -u)" -eq 0 ]; then - "${auth_cmd[@]}" || exit $? - else - if ! command -v sudo >/dev/null 2>&1; then - echo "Error: vault-nomad-auth.sh must run as root and sudo is not installed" >&2 - exit 1 - fi - sudo -n -- "${auth_cmd[@]}" || exit $? - fi - - # Import secrets if any --import-* flag is set (S2.2). - if [ "$import_any" = true ]; then - echo "" - echo "── Importing secrets into Vault ───────────────────────" - local -a import_cmd=("$vault_import_sh") - [ -n "$import_env" ] && import_cmd+=("--env" "$import_env") - [ -n "$import_sops" ] && import_cmd+=("--sops" "$import_sops") - [ -n "$age_key" ] && import_cmd+=("--age-key" "$age_key") - if [ "$(id -u)" -eq 0 ]; then - "${import_cmd[@]}" || exit $? - else - if ! command -v sudo >/dev/null 2>&1; then - echo "Error: vault-import.sh must run as root and sudo is not installed" >&2 - exit 1 - fi - sudo -n -- "${import_cmd[@]}" || exit $? - fi - else - echo "" - echo "[import] no --import-env/--import-sops — skipping; set them or seed kv/disinto/* manually before deploying secret-dependent services" - fi - - # Seed Vault for services that ship their own seeder (S2.6, #928). - # Convention: tools/vault-seed-.sh — auto-invoked when --with - # is requested. 
Runs AFTER vault-import so that real imported values - # win over generated seeds when both are present; each seeder is - # idempotent on a per-key basis (see vault-seed-forgejo.sh's - # "missing → generate, present → unchanged" contract), so re-running - # init does not rotate existing keys. Services without a seeder are - # silently skipped — keeps this loop forward-compatible with Step 3+ - # services that may ship their own seeder without touching bin/disinto. - # - # VAULT_ADDR is passed explicitly because cluster-up.sh writes the - # profile.d export *during* this same init run, so the current shell - # hasn't sourced it yet; sibling vault-* scripts (engines/policies/ - # auth/import) default VAULT_ADDR internally via _hvault_default_env, - # but vault-seed-forgejo.sh requires the caller to set it. - # - # The non-root branch invokes the seeder as `sudo -n -- env VAR=val - # script` rather than `sudo -n VAR=val -- script`: sudo treats bare - # `VAR=val` args as sudoers env-assignments, which the default - # `env_reset=on` policy silently discards unless the variable is in - # `env_keep` (VAULT_ADDR is not). Using `env` as the actual command - # sets VAULT_ADDR in the child process regardless of sudoers policy. - if [ -n "$with_services" ]; then - local vault_addr="${VAULT_ADDR:-http://127.0.0.1:8200}" - local IFS=',' - for svc in $with_services; do - svc=$(echo "$svc" | xargs) # trim whitespace - local seed_script="${FACTORY_ROOT}/tools/vault-seed-${svc}.sh" - if [ -x "$seed_script" ]; then - echo "" - echo "── Seeding Vault for ${svc} ───────────────────────────" - if [ "$(id -u)" -eq 0 ]; then - VAULT_ADDR="$vault_addr" "$seed_script" || exit $? - else - if ! command -v sudo >/dev/null 2>&1; then - echo "Error: vault-seed-${svc}.sh must run as root and sudo is not installed" >&2 - exit 1 - fi - sudo -n -- env "VAULT_ADDR=$vault_addr" "$seed_script" || exit $? 
-        fi
-      fi
-    done
-  fi
-
   # Deploy services if requested
   if [ -n "$with_services" ]; then
     echo ""
@@ -986,6 +762,6 @@ _disinto_init_nomad() {
       fi
       deploy_cmd+=("$svc")
     done
 
     if [ "$(id -u)" -eq 0 ]; then
       "${deploy_cmd[@]}" || exit $?
@@ -1001,16 +778,6 @@ _disinto_init_nomad() {
   echo ""
   echo "── Summary ────────────────────────────────────────────"
   echo "Cluster: Nomad+Vault cluster is up"
-  echo "Policies: applied (Vault ACL)"
-  echo "Auth: Vault JWT auth + Nomad workload identity configured"
-  if [ "$import_any" = true ]; then
-    local import_desc=""
-    [ -n "$import_env" ] && import_desc+="${import_env} "
-    [ -n "$import_sops" ] && import_desc+="${import_sops} "
-    echo "Imported: ${import_desc% }"
-  else
-    echo "Imported: (none — seed kv/disinto/* manually before deploying secret-dependent services)"
-  fi
   echo "Deployed: ${with_services}"
   if echo "$with_services" | grep -q "forgejo"; then
     echo "Ports: forgejo: 3000"
@@ -1037,7 +804,6 @@ disinto_init() {
   # Parse flags
   local branch="" repo_root="" ci_id="0" auto_yes=false forge_url_flag="" bare=false rotate_tokens=false use_build=false dry_run=false backend="docker" empty=false with_services=""
-  local import_env="" import_sops="" age_key=""
   while [ $# -gt 0 ]; do
     case "$1" in
       --branch) branch="$2"; shift 2 ;;
@@ -1054,12 +820,6 @@ disinto_init() {
       --yes) auto_yes=true; shift ;;
       --rotate-tokens) rotate_tokens=true; shift ;;
       --dry-run) dry_run=true; shift ;;
-      --import-env) import_env="$2"; shift 2 ;;
-      --import-env=*) import_env="${1#--import-env=}"; shift ;;
-      --import-sops) import_sops="$2"; shift 2 ;;
-      --import-sops=*) import_sops="${1#--import-sops=}"; shift ;;
-      --age-key) age_key="$2"; shift 2 ;;
-      --age-key=*) age_key="${1#--age-key=}"; shift ;;
       *) echo "Unknown option: $1" >&2; exit 1 ;;
     esac
   done
@@ -1080,14 +840,6 @@ disinto_init() {
     exit 1
   fi
 
-  # --empty is nomad-only today (the docker path has no concept of an
-  # "empty cluster"). Reject explicitly rather than letting it silently
-  # do nothing on --backend=docker.
-  if [ "$empty" = true ] && [ "$backend" != "nomad" ]; then
-    echo "Error: --empty is only valid with --backend=nomad" >&2
-    exit 1
-  fi
-
   # --with requires --backend=nomad
   if [ -n "$with_services" ] && [ "$backend" != "nomad" ]; then
     echo "Error: --with requires --backend=nomad" >&2
     exit 1
   fi
@@ -1100,40 +852,11 @@ disinto_init() {
     exit 1
   fi
 
-  # --import-* flag validation (S2.5). These three flags form an import
-  # triple and must be consistent before dispatch: sops encryption is
-  # useless without the age key to decrypt it, so either both --import-sops
-  # and --age-key are present or neither is. --import-env alone is fine
-  # (it just imports the plaintext dotenv). All three flags are nomad-only.
-  if [ -n "$import_sops" ] && [ -z "$age_key" ]; then
-    echo "Error: --import-sops requires --age-key" >&2
-    exit 1
-  fi
-  if [ -n "$age_key" ] && [ -z "$import_sops" ]; then
-    echo "Error: --age-key requires --import-sops" >&2
-    exit 1
-  fi
-  if { [ -n "$import_env" ] || [ -n "$import_sops" ] || [ -n "$age_key" ]; } \
-    && [ "$backend" != "nomad" ]; then
-    echo "Error: --import-env, --import-sops, and --age-key require --backend=nomad" >&2
-    exit 1
-  fi
-
-  # --empty is the cluster-only escape hatch — it skips policies, auth,
-  # import, and deploy. Pairing it with --import-* silently does nothing,
-  # which is a worse failure mode than a clear error. Reject explicitly.
-  if [ "$empty" = true ] \
-    && { [ -n "$import_env" ] || [ -n "$import_sops" ] || [ -n "$age_key" ]; }; then
-    echo "Error: --empty and --import-env/--import-sops/--age-key are mutually exclusive" >&2
-    exit 1
-  fi
-
   # Dispatch on backend — the nomad path runs lib/init/nomad/cluster-up.sh
   # (S0.4). The default and --empty variants are identical today; Step 1
   # will branch on $empty to add job deployment to the default path.
if [ "$backend" = "nomad" ]; then - _disinto_init_nomad "$dry_run" "$empty" "$with_services" \ - "$import_env" "$import_sops" "$age_key" + _disinto_init_nomad "$dry_run" "$empty" "$with_services" # shellcheck disable=SC2317 # _disinto_init_nomad always exits today; # `return` is defensive against future refactors. return @@ -1247,6 +970,7 @@ p.write_text(text) echo "" echo "[ensure] Forgejo admin user 'disinto-admin'" echo "[ensure] 8 bot users: dev-bot, review-bot, planner-bot, gardener-bot, vault-bot, supervisor-bot, predictor-bot, architect-bot" + echo "[ensure] 2 llama bot users: dev-qwen, dev-qwen-nightly" echo "[ensure] .profile repos for all bots" echo "[ensure] repo ${forge_repo} on Forgejo with collaborators" echo "[run] preflight checks" @@ -1442,6 +1166,19 @@ p.write_text(text) echo "Config: CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1 saved to .env" fi + # Write local-Qwen dev agent env keys with safe defaults (#769) + if ! grep -q '^ENABLE_LLAMA_AGENT=' "$env_file" 2>/dev/null; then + cat >> "$env_file" <<'LLAMAENVEOF' + +# Local Qwen dev agent (optional) — set to 1 to enable +ENABLE_LLAMA_AGENT=0 +FORGE_TOKEN_LLAMA= +FORGE_PASS_LLAMA= +ANTHROPIC_BASE_URL= +LLAMAENVEOF + echo "Config: ENABLE_LLAMA_AGENT keys written to .env (disabled by default)" + fi + # Create labels on remote create_labels "$forge_repo" "$forge_url" @@ -2108,118 +1845,6 @@ _regen_file() { fi } -# Validate that required environment variables are present for all services -# that reference them in docker-compose.yml -_validate_env_vars() { - local env_file="${FACTORY_ROOT}/.env" - local errors=0 - local -a missing_vars=() - - # Load env vars from .env file into associative array - declare -A env_vars - if [ -f "$env_file" ]; then - while IFS='=' read -r key value; do - # Skip empty lines and comments - [[ -z "$key" || "$key" =~ ^[[:space:]]*# ]] && continue - env_vars["$key"]="$value" - done < "$env_file" - fi - - # Check for local-model agent services - # Each [agents.*] section in 
projects/*.toml requires:
-    #   - FORGE_TOKEN_<USER_UPPER>
-    #   - FORGE_PASS_<USER_UPPER>
-    #   - ANTHROPIC_BASE_URL (local model) OR ANTHROPIC_API_KEY (Anthropic backend)
-
-    # Parse projects/*.toml for [agents.*] sections
-    local projects_dir="${FACTORY_ROOT}/projects"
-    for toml in "${projects_dir}"/*.toml; do
-        [ -f "$toml" ] || continue
-
-        # Extract agent config using Python
-        while IFS='|' read -r service_name forge_user base_url _api_key; do
-            [ -n "$service_name" ] || continue
-            [ -n "$forge_user" ] || continue
-
-            # Derive variable names (user -> USER_UPPER)
-            local user_upper
-            user_upper=$(echo "$forge_user" | tr 'a-z-' 'A-Z_')
-            local token_var="FORGE_TOKEN_${user_upper}"
-            local pass_var="FORGE_PASS_${user_upper}"
-
-            # Check token
-            if [ -z "${env_vars[$token_var]:-}" ]; then
-                missing_vars+=("$token_var (for agent ${service_name}/${forge_user})")
-                errors=$((errors + 1))
-            fi
-
-            # Check password
-            if [ -z "${env_vars[$pass_var]:-}" ]; then
-                missing_vars+=("$pass_var (for agent ${service_name}/${forge_user})")
-                errors=$((errors + 1))
-            fi
-
-            # Check backend URL or API key (conditional based on base_url presence)
-            if [ -n "$base_url" ]; then
-                # Local model: needs ANTHROPIC_BASE_URL
-                if [ -z "${env_vars[ANTHROPIC_BASE_URL]:-}" ]; then
-                    missing_vars+=("ANTHROPIC_BASE_URL (for agent ${service_name})")
-                    errors=$((errors + 1))
-                fi
-            else
-                # Anthropic backend: needs ANTHROPIC_API_KEY
-                if [ -z "${env_vars[ANTHROPIC_API_KEY]:-}" ]; then
-                    missing_vars+=("ANTHROPIC_API_KEY (for agent ${service_name})")
-                    errors=$((errors + 1))
-                fi
-            fi
-
-        done < <(python3 -c '
-import sys, tomllib, re
-
-with open(sys.argv[1], "rb") as f:
-    cfg = tomllib.load(f)
-
-agents = cfg.get("agents", {})
-for name, config in agents.items():
-    if not isinstance(config, dict):
-        continue
-
-    base_url = config.get("base_url", "")
-    model = config.get("model", "")
-    api_key = config.get("api_key", "")
-    forge_user = config.get("forge_user", f"{name}-bot")
-
-    safe_name = name.lower()
-    safe_name = re.sub(r"[^a-z0-9]", "-", safe_name)
-
-    print(f"{safe_name}|{forge_user}|{base_url}|{api_key}")
-' "$toml" 2>/dev/null)
-    done
-
-    # Check for legacy ENABLE_LLAMA_AGENT services
-    if [ "${env_vars[ENABLE_LLAMA_AGENT]:-0}" = "1" ]; then
-        if [ -z "${env_vars[FORGE_TOKEN_LLAMA]:-}" ]; then
-            missing_vars+=("FORGE_TOKEN_LLAMA (ENABLE_LLAMA_AGENT=1)")
-            errors=$((errors + 1))
-        fi
-        if [ -z "${env_vars[FORGE_PASS_LLAMA]:-}" ]; then
-            missing_vars+=("FORGE_PASS_LLAMA (ENABLE_LLAMA_AGENT=1)")
-            errors=$((errors + 1))
-        fi
-    fi
-
-    if [ "$errors" -gt 0 ]; then
-        echo "Error: missing required environment variables:" >&2
-        for var in "${missing_vars[@]}"; do
-            echo "  - $var" >&2
-        done
-        echo "" >&2
-        echo "Run 'disinto hire-an-agent <name> <role>' to create the agent and write credentials to .env" >&2
-        exit 1
-    fi
-}
-
 disinto_up() {
     local compose_file="${FACTORY_ROOT}/docker-compose.yml"
     local caddyfile="${FACTORY_ROOT}/docker/Caddyfile"
@@ -2229,9 +1854,6 @@ disinto_up() {
         exit 1
     fi
 
-    # Validate environment variables before proceeding
-    _validate_env_vars
-
     # Parse --no-regen flag; remaining args pass through to docker compose
     local no_regen=false
     local -a compose_args=()
diff --git a/dev/AGENTS.md b/dev/AGENTS.md
index 02fd612..481bb1f 100644
--- a/dev/AGENTS.md
+++ b/dev/AGENTS.md
@@ -1,4 +1,4 @@
-
+
 # Dev Agent
 
 **Role**: Implement issues autonomously — write code, push branches, address
diff --git a/dev/dev-agent.sh b/dev/dev-agent.sh
index 913a2a7..cd8d390 100755
--- a/dev/dev-agent.sh
+++ b/dev/dev-agent.sh
@@ -254,11 +254,7 @@ agent_recover_session
 # WORKTREE SETUP
 # =============================================================================
 status "setting up worktree"
-if ! cd "$REPO_ROOT"; then
-  log "ERROR: REPO_ROOT=${REPO_ROOT} does not exist — cannot cd"
-  log "Check PROJECT_REPO_ROOT vs compose PROJECT_NAME vs TOML name mismatch"
-  exit 1
-fi
+cd "$REPO_ROOT"
 
 # Determine forge remote by matching FORGE_URL host against git remotes
 _forge_host=$(printf '%s' "$FORGE_URL" | sed 's|https\?://||; s|/.*||')
diff --git a/docker/agents/Dockerfile b/docker/agents/Dockerfile
index 1bcba89..2939230 100644
--- a/docker/agents/Dockerfile
+++ b/docker/agents/Dockerfile
@@ -2,7 +2,7 @@ FROM debian:bookworm-slim
 
 RUN apt-get update && apt-get install -y --no-install-recommends \
     bash curl git jq tmux python3 python3-pip openssh-client ca-certificates age shellcheck procps gosu \
-    && pip3 install --break-system-packages networkx tomlkit \
+    && pip3 install --break-system-packages networkx \
     && rm -rf /var/lib/apt/lists/*
 
 # Pre-built binaries (copied from docker/agents/bin/)
diff --git a/docker/agents/entrypoint.sh b/docker/agents/entrypoint.sh
index 7c58674..b7593a2 100644
--- a/docker/agents/entrypoint.sh
+++ b/docker/agents/entrypoint.sh
@@ -17,38 +17,6 @@ set -euo pipefail
 #   - predictor: every 24 hours (288 iterations * 5 min)
 #   - supervisor: every SUPERVISOR_INTERVAL seconds (default: 1200 = 20 min)
 
-# ── Migration check: reject ENABLE_LLAMA_AGENT ───────────────────────────────
-# #846: The legacy ENABLE_LLAMA_AGENT env flag is no longer supported.
-# Activation is now done exclusively via [agents.X] sections in project TOML.
-# If this legacy flag is detected, fail immediately with a migration message.
-if [ "${ENABLE_LLAMA_AGENT:-}" = "1" ]; then
-  cat <<'MIGRATION_ERR'
-FATAL: ENABLE_LLAMA_AGENT is no longer supported.
-
-The legacy ENABLE_LLAMA_AGENT=1 flag has been removed (#846).
-Activation is now done exclusively via [agents.X] sections in projects/*.toml.
-
-To migrate:
-  1. Remove ENABLE_LLAMA_AGENT from your .env or .env.enc file
-  2. Add an [agents.<name>] section to your project TOML:
-
-     [agents.dev-qwen]
-     base_url = "http://your-llama-server:8081"
-     model = "unsloth/Qwen3.5-35B-A3B"
-     api_key = "sk-no-key-required"
-     roles = ["dev"]
-     forge_user = "dev-qwen"
-     compact_pct = 60
-     poll_interval = 60
-
-  3. Run: disinto init
-  4. Start the agent: docker compose up -d agents-dev-qwen
-
-See docs/agents-llama.md for full details.
-MIGRATION_ERR
-  exit 1
-fi
-
 DISINTO_BAKED="/home/agent/disinto"
 DISINTO_LIVE="/home/agent/repos/_factory"
 DISINTO_DIR="$DISINTO_BAKED"  # start with baked copy; switched to live checkout after bootstrap
@@ -347,24 +315,6 @@ _setup_git_creds
 configure_git_identity
 configure_tea_login
 
-# Parse first available project TOML to get the project name for cloning.
-# This ensures PROJECT_NAME matches the TOML 'name' field, not the compose
-# default of 'project'. The clone will land at /home/agent/repos/<name>
-# and subsequent env exports in the main loop will be consistent.
-if compgen -G "${DISINTO_DIR}/projects/*.toml" >/dev/null 2>&1; then
-  _first_toml=$(compgen -G "${DISINTO_DIR}/projects/*.toml" | head -1)
-  _pname=$(python3 -c "
-import sys, tomllib
-with open(sys.argv[1], 'rb') as f:
-    print(tomllib.load(f).get('name', ''))
-" "$_first_toml" 2>/dev/null) || _pname=""
-  if [ -n "$_pname" ]; then
-    export PROJECT_NAME="$_pname"
-    export PROJECT_REPO_ROOT="/home/agent/repos/${_pname}"
-    log "Parsed PROJECT_NAME=${PROJECT_NAME} from ${_first_toml}"
-  fi
-fi
-
 # Clone project repo on first run (makes agents self-healing, #589)
 ensure_project_clone
 
@@ -374,32 +324,9 @@ bootstrap_ops_repos
 
 # Bootstrap factory repo — switch DISINTO_DIR to live checkout (#593)
 bootstrap_factory_repo
 
-# Validate that projects directory has at least one real .toml file (not .example)
-# This prevents the silent-zombie mode where the polling loop matches zero files
-# and does nothing forever.
-validate_projects_dir() { - # NOTE: compgen -G exits non-zero when no matches exist, so piping it through - # `wc -l` under `set -eo pipefail` aborts the script before the FATAL branch - # can log a diagnostic (#877). Use the conditional form already adopted at - # lines above (see bootstrap_factory_repo, PROJECT_NAME parsing). - if ! compgen -G "${DISINTO_DIR}/projects/*.toml" >/dev/null 2>&1; then - log "FATAL: No real .toml files found in ${DISINTO_DIR}/projects/" - log "Expected at least one project config file (e.g., disinto.toml)" - log "The directory only contains *.toml.example template files." - log "Mount the host ./projects volume or copy real .toml files into the container." - exit 1 - fi - local toml_count - toml_count=$(compgen -G "${DISINTO_DIR}/projects/*.toml" | wc -l) - log "Projects directory validated: ${toml_count} real .toml file(s) found" -} - # Initialize state directory for check_active guards init_state_dir -# Validate projects directory before entering polling loop -validate_projects_dir - # Parse AGENT_ROLES env var (default: all agents) # Expected format: comma-separated list like "review,dev,gardener" AGENT_ROLES="${AGENT_ROLES:-review,dev,gardener,architect,planner,predictor,supervisor}" diff --git a/docs/agents-llama.md b/docs/agents-llama.md index b3a1334..88622a7 100644 --- a/docs/agents-llama.md +++ b/docs/agents-llama.md @@ -1,194 +1,59 @@ -# Local-Model Agents +# agents-llama — Local-Qwen Agents -Local-model agents run the same agent code as the Claude-backed agents, but -connect to a local llama-server (or compatible OpenAI-API endpoint) instead of -the Anthropic API. This document describes the canonical activation flow using -`disinto hire-an-agent` and `[agents.X]` TOML configuration. +The `agents-llama` service is an optional compose service that runs agents +backed by a local llama-server instance (e.g. Qwen) instead of the Anthropic +API. 
It uses the same Docker image as the main `agents` service but connects to
+a local inference endpoint via `ANTHROPIC_BASE_URL`.
 
-> **Note:** The legacy `ENABLE_LLAMA_AGENT=1` env flag has been removed (#846).
-> Activation is now done exclusively via `[agents.X]` sections in project TOML.
+Two profiles are available:
 
-## Overview
+| Profile | Service | Roles | Use case |
+|---------|---------|-------|----------|
+| _(default)_ | `agents-llama` | `dev` only | Conservative: single-role soak test |
+| `agents-llama-all` | `agents-llama-all` | all 7 (review, dev, gardener, architect, planner, predictor, supervisor) | Pre-migration: validate every role on llama before Nomad cutover |
 
-Local-model agents are configured via `[agents.<name>]` sections in
-`projects/<project>.toml`. Each agent gets:
-- Its own Forgejo bot user with dedicated API token and password
-- A dedicated compose service `agents-<name>`
-- Isolated credentials stored as `FORGE_TOKEN_<USER>` and `FORGE_PASS_<USER>` in `.env`
+
+## Enabling
+
+Set `ENABLE_LLAMA_AGENT=1` in `.env` (or `.env.enc`) and provide the required
+credentials:
+
+```env
+ENABLE_LLAMA_AGENT=1
+FORGE_TOKEN_LLAMA=
+FORGE_PASS_LLAMA=
+ANTHROPIC_BASE_URL=http://host.docker.internal:8081  # llama-server endpoint
+```
+
+Then regenerate the compose file (`disinto init ...`) and bring the stack up.
+
+### Running all 7 roles (agents-llama-all)
+
+```bash
+docker compose --profile agents-llama-all up -d
+```
+
+This starts the `agents-llama-all` container with all 7 bot roles against the
+local llama endpoint. The per-role forge tokens (`FORGE_REVIEW_TOKEN`,
+`FORGE_GARDENER_TOKEN`, etc.) must be set in `.env` — they are the same tokens
+used by the Claude-backed `agents` container.
 
 ## Prerequisites
 
 - **llama-server** (or compatible OpenAI-API endpoint) running on the host,
-  reachable from inside Docker at the URL you will configure.
-- A disinto factory already initialized (`disinto init` completed).
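The `FORGE_TOKEN_*` / `FORGE_PASS_*` names used here follow the conversion noted in `.env.example` (`tr 'a-z-' 'A-Z_'`: lowercase to UPPER, hyphens to underscores). A minimal sketch of that derivation in plain shell:

```shell
# Derive per-agent env var names from a Forgejo bot username.
# Conversion per .env.example: lowercase -> UPPER, hyphens -> underscores.
forge_user="dev-qwen"
user_upper=$(printf '%s' "$forge_user" | tr 'a-z-' 'A-Z_')
echo "FORGE_TOKEN_${user_upper}"  # FORGE_TOKEN_DEV_QWEN
echo "FORGE_PASS_${user_upper}"   # FORGE_PASS_DEV_QWEN
```

The same mapping is what the compose generator uses to look up credentials from the agent's `forge_user` field.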
-
-## Hiring a local-model agent
-
-Use `disinto hire-an-agent` with `--local-model` to create a bot user and
-configure the agent:
-
-```bash
-# Hire a local-model agent for the dev role
-disinto hire-an-agent dev-qwen dev \
-  --local-model http://10.10.10.1:8081 \
-  --model unsloth/Qwen3.5-35B-A3B
-```
-
-The command performs these steps:
-
-1. **Creates a Forgejo user** `dev-qwen` with a random password
-2. **Generates an API token** for the user
-3. **Writes credentials to `.env`**:
-   - `FORGE_TOKEN_DEV_QWEN` — the API token
-   - `FORGE_PASS_DEV_QWEN` — the password
-   - `ANTHROPIC_BASE_URL` — the llama endpoint (required by the agent)
-4. **Writes `[agents.dev-qwen]` to `projects/<project>.toml`** with:
-   - `base_url`, `model`, `api_key`
-   - `roles = ["dev"]`
-   - `forge_user = "dev-qwen"`
-   - `compact_pct = 60`
-   - `poll_interval = 60`
-5. **Regenerates `docker-compose.yml`** to include the `agents-dev-qwen` service
-
-### Anthropic backend agents
-
-For agents that use the Anthropic API instead of a local model, omit `--local-model`:
-
-```bash
-# Anthropic backend agent (requires ANTHROPIC_API_KEY in environment)
-export ANTHROPIC_API_KEY="sk-..."
-disinto hire-an-agent dev-claude dev
-```
-
-This writes `ANTHROPIC_API_KEY` to `.env` instead of `ANTHROPIC_BASE_URL`.
-
-## Activation and running
-
-Once hired, the agent service is added to `docker-compose.yml`. Start the
-service with `docker compose up -d`:
-
-```bash
-# Start all agent services
-docker compose up -d
-
-# Start a single named agent service
-docker compose up -d agents-dev-qwen
-
-# Start multiple named agent services
-docker compose up -d agents-dev-qwen agents-planner
-```
-
-### Stopping agents
-
-```bash
-# Stop a specific agent service
-docker compose down agents-dev-qwen
-
-# Stop all agent services
-docker compose down
-```
-
-## Credential rotation
-
-Re-running `disinto hire-an-agent <name> <role>` with the same parameters rotates
-credentials idempotently:
-
-```bash
-# Re-hire the same agent to rotate token and password
-disinto hire-an-agent dev-qwen dev \
-  --local-model http://10.10.10.1:8081 \
-  --model unsloth/Qwen3.5-35B-A3B
-
-# The command will:
-# 1. Detect the user already exists
-# 2. Reset the password to a new random value
-# 3. Create a new API token
-# 4. Update .env with the new credentials
-```
-
-This is the recommended way to rotate agent credentials. The `.env` file is
-updated in place, so no manual editing is required.
-
-If you need to manually rotate credentials:
-1. Generate a new token in the Forgejo admin UI
-2. Edit `.env` and replace `FORGE_TOKEN_<USER>` and `FORGE_PASS_<USER>`
-3. Restart the agent service: `docker compose restart agents-<name>`
-
-## Configuration reference
-
-### Environment variables (`.env`)
-
-| Variable | Description | Example |
-|----------|-------------|---------|
-| `FORGE_TOKEN_<USER>` | Forgejo API token for the bot user | `FORGE_TOKEN_DEV_QWEN` |
-| `FORGE_PASS_<USER>` | Forgejo password for the bot user | `FORGE_PASS_DEV_QWEN` |
-| `ANTHROPIC_BASE_URL` | Local llama endpoint (local model agents) | `http://host.docker.internal:8081` |
-| `ANTHROPIC_API_KEY` | Anthropic API key (Anthropic backend agents) | `sk-...` |
-
-### Project TOML (`[agents.<name>]` section)
-
-```toml
-[agents.dev-qwen]
-base_url = "http://10.10.10.1:8081"
-model = "unsloth/Qwen3.5-35B-A3B"
-api_key = "sk-no-key-required"
-roles = ["dev"]
-forge_user = "dev-qwen"
-compact_pct = 60
-poll_interval = 60
-```
-
-| Field | Description |
-|-------|-------------|
-| `base_url` | llama-server endpoint |
-| `model` | Model name (for logging/identification) |
-| `api_key` | Required by API; set to placeholder for llama |
-| `roles` | Agent roles this instance handles |
-| `forge_user` | Forgejo bot username |
-| `compact_pct` | Context compaction threshold (lower = more aggressive) |
-| `poll_interval` | Seconds between polling cycles |
+  reachable from inside Docker at the URL set in `ANTHROPIC_BASE_URL`.
+- A Forgejo bot user (e.g. `dev-qwen`) with its own API token and password,
+  stored as `FORGE_TOKEN_LLAMA` / `FORGE_PASS_LLAMA`.
 
 ## Behaviour
 
-- Each agent runs with `AGENT_ROLES` set to its configured roles
+- `agents-llama`: `AGENT_ROLES=dev` — only picks up dev work.
+- `agents-llama-all`: `AGENT_ROLES=review,dev,gardener,architect,planner,predictor,supervisor` — runs all 7 roles.
 - `CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=60` — more aggressive compaction for smaller
-  context windows
-- Agents serialize on the llama-server's single KV cache (AD-002)
+  context windows.
+- Serialises on the llama-server's single KV cache (AD-002).
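The role lists above use the same comma-separated `AGENT_ROLES` format the entrypoint parses. A small illustrative sketch (not the entrypoint's actual code) of splitting such a value, assuming bash:

```shell
# Split a comma-separated AGENT_ROLES value into an array of roles.
AGENT_ROLES="review,dev,gardener,architect,planner,predictor,supervisor"
IFS=',' read -r -a roles <<< "$AGENT_ROLES"
echo "${#roles[@]}"          # 7
printf '%s\n' "${roles[@]}"  # one role per line
```

Setting `IFS` only for the `read` keeps the shell's default word splitting untouched elsewhere.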
-## Troubleshooting
+## Disabling
 
-### Agent service not starting
-
-Check that the service was created by `disinto hire-an-agent`:
-
-```bash
-docker compose config | grep -A5 "agents-dev-qwen"
-```
-
-If the service is missing, re-run `disinto hire-an-agent dev-qwen dev` to
-regenerate `docker-compose.yml`.
-
-### Model endpoint unreachable
-
-Verify llama-server is accessible from inside Docker:
-
-```bash
-docker compose -f docker-compose.yml exec agents curl -sf http://host.docker.internal:8081/health
-```
-
-If using a custom host IP, update `ANTHROPIC_BASE_URL` in `.env`:
-
-```bash
-# Update the base URL
-sed -i 's|^ANTHROPIC_BASE_URL=.*|ANTHROPIC_BASE_URL=http://192.168.1.100:8081|' .env
-
-# Restart the agent
-docker compose restart agents-dev-qwen
-```
-
-### Invalid agent name
-
-Agent names must match `^[a-z]([a-z0-9]|-[a-z0-9])*$` (lowercase letters, digits,
-hyphens; starts with a letter, ends with an alphanumeric). Names like
-`dev-qwen2` are valid (a trailing digit is OK); invalid names like `dev--qwen`
-(consecutive hyphens) will be rejected.
+Set `ENABLE_LLAMA_AGENT=0` (or leave it unset) and regenerate. The service
+block is omitted entirely from `docker-compose.yml`; the stack starts cleanly
+without it.
diff --git a/docs/nomad-migration.md b/docs/nomad-migration.md
deleted file mode 100644
index 02ff023..0000000
--- a/docs/nomad-migration.md
+++ /dev/null
@@ -1,124 +0,0 @@
-
-# Nomad+Vault migration — cutover-day runbook
-
-`disinto init --backend=nomad` is the single entry-point that turns a fresh
-LXC (with the disinto repo cloned) into a running Nomad+Vault cluster with
-policies applied, JWT workload-identity auth configured, secrets imported
-from the old docker stack, and services deployed.
-
-## Cutover-day invocation
-
-On the new LXC, as root (or an operator with NOPASSWD sudo):
-
-```bash
-# Copy the plaintext .env + sops-encrypted .env.vault.enc + age keyfile
-# from the old box first (out of band — SSH, USB, whatever your ops
-# procedure allows). Then:
-
-sudo ./bin/disinto init \
-  --backend=nomad \
-  --import-env /tmp/.env \
-  --import-sops /tmp/.env.vault.enc \
-  --age-key /tmp/keys.txt \
-  --with forgejo
-```
-
-This runs, in order:
-
-1. **`lib/init/nomad/cluster-up.sh`** (S0) — installs Nomad + Vault
-   binaries, writes `/etc/nomad.d/*`, initializes Vault, starts both
-   services, waits for the Nomad node to become ready.
-2. **`tools/vault-apply-policies.sh`** (S2.1) — syncs every
-   `vault/policies/*.hcl` into Vault as an ACL policy. Idempotent.
-3. **`lib/init/nomad/vault-nomad-auth.sh`** (S2.3) — enables Vault's
-   JWT auth method at `jwt-nomad`, points it at Nomad's JWKS, writes
-   one role per policy, reloads Nomad so jobs can exchange
-   workload-identity tokens for Vault tokens. Idempotent.
-4. **`tools/vault-import.sh`** (S2.2) — reads `/tmp/.env` and the
-   sops-decrypted `/tmp/.env.vault.enc`, writes them to the KV paths
-   matching the S2.1 policy layout (`kv/disinto/bots/*`, `kv/disinto/shared/*`,
-   `kv/disinto/runner/*`). Idempotent (overwrites KV v2 data in place).
-5. **`lib/init/nomad/deploy.sh forgejo`** (S1) — validates + runs the
-   `nomad/jobs/forgejo.hcl` jobspec. Forgejo reads its admin creds from
-   Vault via the `template` stanza (S2.4).
-
-## Flag summary
-
-| Flag | Meaning |
-|---|---|
-| `--backend=nomad` | Switch the init dispatcher to the Nomad+Vault path (instead of docker compose). |
-| `--empty` | Bring the cluster up, skip policies/auth/import/deploy. Escape hatch for debugging. |
-| `--with forgejo[,…]` | Deploy these services after the cluster is up. |
-| `--import-env PATH` | Plaintext `.env` from the old stack. Optional. |
-| `--import-sops PATH` | Sops-encrypted `.env.vault.enc` from the old stack. Requires `--age-key`. |
-| `--age-key PATH` | Age keyfile used to decrypt `--import-sops`. Requires `--import-sops`. |
-| `--dry-run` | Print the full plan (cluster-up + policies + auth + import + deploy) and exit. Touches nothing. |
-
-### Flag validation
-
-- `--import-sops` without `--age-key` → error.
-- `--age-key` without `--import-sops` → error.
-- `--import-env` alone (no sops) → OK (imports just the plaintext `.env`).
-- `--backend=docker` with any `--import-*` flag → error.
-- `--empty` with any `--import-*` flag → error (mutually exclusive: `--empty`
-  skips the import step, so pairing them silently discards the import
-  intent).
-
-## Idempotency
-
-Every layer is idempotent by design. Re-running the same command on an
-already-provisioned box is a no-op at every step:
-
-- **Cluster-up:** second run detects running `nomad`/`vault` systemd
-  units and state files, skips re-init.
-- **Policies:** byte-for-byte compare against on-server policy text;
-  "unchanged" for every untouched file.
-- **Auth:** skips auth-method create if `jwt-nomad/` already enabled,
-  skips config write if the JWKS + algs match, skips server.hcl write if
-  the file on disk is identical to the repo copy.
-- **Import:** KV v2 writes overwrite in place (same path, same keys,
-  same values → no new version).
-- **Deploy:** `nomad job run` is declarative; same jobspec → no new
-  allocation.
-
-## Dry-run
-
-```bash
-./bin/disinto init --backend=nomad \
-  --import-env /tmp/.env \
-  --import-sops /tmp/.env.vault.enc \
-  --age-key /tmp/keys.txt \
-  --with forgejo \
-  --dry-run
-```
-
-Prints the five-section plan — cluster-up, policies, auth, import,
-deploy — with every path and every argv that would be executed. No
-network, no sudo, no state mutation. See
-`tests/disinto-init-nomad.bats` for the exact output shape.
-
-## No-import path
-
-If you already have `kv/disinto/*` seeded by other means (manual
-`vault kv put`, a replica, etc.), omit all three `--import-*` flags.
-`disinto init --backend=nomad --with forgejo` still applies policies,
-configures auth, and deploys — but skips the import step with:
-
-```
-[import] no --import-env/--import-sops — skipping; set them or seed kv/disinto/* manually before deploying secret-dependent services
-```
-
-Forgejo's template stanza will fail to render (and thus the allocation
-will stall) until those KV paths exist — so either import them or seed
-them first.
-
-## Secret hygiene
-
-- Never log a secret value. The CLI only prints paths (`--import-env`,
-  `--age-key`) and KV *paths* (`kv/disinto/bots/review/token`), never
-  the values themselves. `tools/vault-import.sh` is the only thing that
-  reads the values, and it pipes them directly into Vault's HTTP API.
-- The age keyfile must be mode 0400 — `vault-import.sh` refuses to
-  source a keyfile with looser permissions.
-- `VAULT_ADDR` must be localhost during import — the import tool
-  refuses to run against a remote Vault, preventing accidental exposure.
diff --git a/formulas/release.sh b/formulas/release.sh
index 6526d1a..b8c4eb6 100644
--- a/formulas/release.sh
+++ b/formulas/release.sh
@@ -178,8 +178,8 @@ log "Tagged disinto/agents:${RELEASE_VERSION}"
 
 log "Step 6/6: Restarting agent containers"
 
-docker compose stop agents 2>/dev/null || true
-docker compose up -d agents
+docker compose stop agents agents-llama 2>/dev/null || true
+docker compose up -d agents agents-llama
 
 log "Agent containers restarted"
 
 # ── Done ─────────────────────────────────────────────────────────────────
diff --git a/formulas/release.toml b/formulas/release.toml
index ccd7f95..f702f42 100644
--- a/formulas/release.toml
+++ b/formulas/release.toml
@@ -189,10 +189,10 @@ Restart agent containers to use the new image.
    - docker compose pull agents
 
 2. Stop and remove existing agent containers:
-   - docker compose down agents
+   - docker compose down agents agents-llama 2>/dev/null || true
 
 3. 
Start agents with new image: - - docker compose up -d agents + - docker compose up -d agents agents-llama 4. Wait for containers to be healthy: - for i in {1..30}; do @@ -203,7 +203,7 @@ Restart agent containers to use the new image. - done 5. Verify containers are running: - - docker compose ps agents + - docker compose ps agents agents-llama 6. Log restart: - echo "Restarted agents containers" diff --git a/formulas/run-supervisor.toml b/formulas/run-supervisor.toml index 4101252..f31e6bc 100644 --- a/formulas/run-supervisor.toml +++ b/formulas/run-supervisor.toml @@ -29,7 +29,7 @@ and injected into your prompt above. Review them now. 1. Read the injected metrics data carefully (System Resources, Docker, Active Sessions, Phase Files, Stale Phase Cleanup, Lock Files, Agent Logs, - CI Pipelines, Open PRs, Issue Status, Stale Worktrees, **Woodpecker Agent Health**). + CI Pipelines, Open PRs, Issue Status, Stale Worktrees). Note: preflight.sh auto-removes PHASE:escalate files for closed issues (24h grace period). Check the "Stale Phase Cleanup" section for any files cleaned or in grace period this run. @@ -75,10 +75,6 @@ Categorize every finding from the metrics into priority levels. - Dev/action sessions in PHASE:escalate for > 24h (session timeout) (Note: PHASE:escalate files for closed issues are auto-cleaned by preflight; this check covers sessions where the issue is still open) -- **Woodpecker agent unhealthy** — see "Woodpecker Agent Health" section in preflight: - - Container not running or in unhealthy state - - gRPC errors >= 3 in last 20 minutes - - Fast-failure pipelines (duration < 60s) >= 3 in last 15 minutes ### P3 — Factory degraded - PRs stale: CI finished >20min ago AND no git push to the PR branch since CI completed @@ -104,15 +100,6 @@ For each finding from the health assessment, decide and execute an action. 
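The P0–P3 buckets described in the formula amount to a lookup from finding type to priority. A hypothetical shell sketch of that triage (finding names are illustrative; the actual formula is a prompt, not code):

```shell
# Map a supervisor finding to its priority bucket (illustrative names only).
triage() {
  case "$1" in
    memory_crisis)         echo "P0" ;;
    session_escalate_24h)  echo "P2" ;;
    pr_stale)              echo "P3" ;;
    *)                     echo "P3" ;;  # unknown findings land in the lowest bucket
  esac
}

triage memory_crisis   # P0
triage pr_stale        # P3
```

Defaulting unknown findings to the lowest bucket keeps a new metric from triggering drastic auto-fix actions.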
### Auto-fixable (execute these directly) -**P2 Woodpecker agent unhealthy:** -The supervisor-run.sh script automatically handles WP agent recovery: -- Detects unhealthy state via preflight.sh health checks -- Restarts container via `docker restart` -- Scans for `blocked: ci_exhausted` issues updated in last 30 minutes -- Unassigns and removes blocked label from affected issues -- Posts recovery comment with infra-flake context -- Avoids duplicate restarts via 5-minute cooldown in history file - **P0 Memory crisis:** # Kill stale one-shot claude processes (>3h old) pgrep -f "claude -p" --older 10800 2>/dev/null | xargs kill 2>/dev/null || true @@ -261,11 +248,6 @@ Format: - (or "No actions needed") - ### WP Agent Recovery (if applicable) - - WP agent restart: