bug: disinto-edge hard-fails on missing age key / secrets even when collect-engagement feature is not configured #1038

Open
opened 2026-04-19 09:14:39 +00:00 by disinto-admin · 0 comments

Problem

disinto-edge crashloops on any deployment that hasn't opted into the age-encrypted secret store (#777), because the edge entrypoint treats four secrets as unconditionally required:

FATAL: age key (/home/agent/.config/sops/age/keys.txt) or secrets dir (/opt/disinto/secrets) not found — cannot load required secrets

Observed on disinto-dev-box (container disinto-edge, restarting every ~30s), which blocks PR #1033 (edge-subpath smoke test) and any other work that depends on a running edge.

Root cause

docker/edge/entrypoint-edge.sh:176-205 requires:

  • ~/.config/sops/age/keys.txt
  • /opt/disinto/secrets/ with .enc files for CADDY_SSH_KEY, CADDY_SSH_HOST, CADDY_SSH_USER, CADDY_ACCESS_LOG.

These four secrets feed exactly one feature: the daily 23:50 UTC collect-engagement.sh cron (#745), which SCPs Caddy access logs from a remote production edge host for engagement parsing. On a local factory box (disinto-dev-box) or any deployment that hasn't set up a remote edge, this code path has no target to fetch from — yet its absence kills the whole edge container.

Pre-#777 this was soft-degrade. #777 turned it into a hard-fail at startup as part of the broader "single granular secrets store" consolidation — the hard-fail fit the general secrets-required model but is wrong for this specific, optional feature.

Fix

Make the secrets block optional. When the age key or the secrets directory is missing, or when any of the four CADDY_ secrets fails to decrypt, log a warning and skip the collect-engagement cron loop. Caddy itself does not depend on these secrets and should start normally.

Concrete edit to docker/edge/entrypoint-edge.sh (around lines 176-205):

# ── Load optional engagement-collection secrets from secrets/*.enc (#777, #XXXX) ──
# These secrets feed the nightly collect-engagement SCP (#745). They are only
# relevant on production edges that fetch logs from a remote Caddy. Missing
# secrets are NOT fatal — we warn and skip the collect-engagement cron.
_AGE_KEY_FILE="${HOME}/.config/sops/age/keys.txt"
_SECRETS_DIR="/opt/disinto/secrets"
EDGE_ENGAGEMENT_SECRETS="CADDY_SSH_KEY CADDY_SSH_HOST CADDY_SSH_USER CADDY_ACCESS_LOG"
EDGE_ENGAGEMENT_READY=0

_edge_decrypt_secret() {
  local enc_path="${_SECRETS_DIR}/${1}.enc"
  [ -f "$enc_path" ] || return 1
  age -d -i "$_AGE_KEY_FILE" "$enc_path" 2>/dev/null
}

if [ -f "$_AGE_KEY_FILE" ] && [ -d "$_SECRETS_DIR" ]; then
  _missing=""
  for _secret_name in $EDGE_ENGAGEMENT_SECRETS; do
    _val=$(_edge_decrypt_secret "$_secret_name") || { _missing="${_missing} ${_secret_name}"; continue; }
    export "$_secret_name=$_val"
  done
  if [ -n "$_missing" ]; then
    echo "edge: engagement-collection disabled — missing secrets:${_missing}" >&2
    echo "  Run 'disinto secrets add <NAME>' for each to enable." >&2
  else
    echo "edge: loaded engagement secrets: ${EDGE_ENGAGEMENT_SECRETS}" >&2
    EDGE_ENGAGEMENT_READY=1
  fi
else
  echo "edge: engagement-collection disabled — age key or secrets dir not found" >&2
  echo "  ($_AGE_KEY_FILE / $_SECRETS_DIR)" >&2
fi

Then gate the collect-engagement cron loop on EDGE_ENGAGEMENT_READY:

if [ "$EDGE_ENGAGEMENT_READY" = "1" ]; then
  (while true; do
    # ... existing collect-engagement cron body unchanged ...
  done) &
else
  echo "edge: skipping collect-engagement cron (secrets not configured)" >&2
fi
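
To check the decision logic without rebuilding the container, the readiness rule can be pulled into a small function and exercised against a scratch directory. This is only a sketch of the rule described above: `edge_engagement_ready` and `edge_decrypt` are hypothetical names that do not exist in `entrypoint-edge.sh`, and the decryptor is a separate function so it can be replaced with a stub on machines where `age` is not installed.

```bash
#!/usr/bin/env bash
# Sketch only: both function names below are hypothetical and are not part
# of docker/edge/entrypoint-edge.sh.

# Default decryptor: the same age invocation the proposed entrypoint uses.
# Redefine this function in a test to avoid depending on age.
edge_decrypt() {
  age -d -i "$1" "$2"
}

# Prints 1 when the key file exists, the secrets dir exists, and every named
# secret has an .enc file that decrypts; prints 0 in every other case.
# Usage: edge_engagement_ready KEY_FILE SECRETS_DIR NAME...
edge_engagement_ready() {
  local key_file="$1" secrets_dir="$2" name enc
  shift 2
  if [ ! -f "$key_file" ] || [ ! -d "$secrets_dir" ]; then
    echo 0
    return 0
  fi
  for name in "$@"; do
    enc="${secrets_dir}/${name}.enc"
    # A missing file and a failed decrypt are treated the same way: not ready.
    if [ ! -f "$enc" ] || ! edge_decrypt "$key_file" "$enc" >/dev/null 2>&1; then
      echo 0
      return 0
    fi
  done
  echo 1
}
```

Calling `edge_engagement_ready "$HOME/.config/sops/age/keys.txt" /opt/disinto/secrets CADDY_SSH_KEY CADDY_SSH_HOST CADDY_SSH_USER CADDY_ACCESS_LOG` should then mirror the value the entrypoint would assign to `EDGE_ENGAGEMENT_READY`. Unlike the entrypoint edit, it stops at the first failing secret and does not collect the full missing list.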

Acceptance criteria

  • disinto-edge starts and stays healthy on a box with no age key and no secrets/*.enc files
  • Caddy starts and responds normally (confirmed via curl -fsS http://localhost:2019/config/, Caddy's admin API, in the container)
  • Log output contains a clear single-line warning that engagement collection is disabled (no FATAL, no exit)
  • When all four secrets are present and decrypt cleanly, behavior is unchanged (cron scheduled, existing success log line printed)
  • When some but not all secrets are present, cron is skipped and the per-secret missing list is logged once
  • PR #1033 (edge-subpath smoke) proceeds past the edge-boot gate
  • shellcheck docker/edge/entrypoint-edge.sh clean
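
The log-related criteria above can be checked mechanically. The following is a minimal sketch, assuming the warning text matches the strings in the proposed edit (`engagement-collection disabled`); `check_edge_log` is a hypothetical helper, not an existing script in the repo.

```bash
#!/usr/bin/env bash
# Sketch only: check_edge_log is a hypothetical helper. It reads log text on
# stdin and assumes the warning wording from the proposed entrypoint edit.
check_edge_log() {
  local log
  log=$(cat)
  # Any FATAL line means the entrypoint still hard-fails.
  if printf '%s\n' "$log" | grep -q 'FATAL'; then
    echo "fail: FATAL present"
    return 1
  fi
  # The disabled warning must be present on a box without secrets.
  if ! printf '%s\n' "$log" | grep -q 'engagement-collection disabled'; then
    echo "fail: no disabled warning"
    return 1
  fi
  echo ok
}
```

On a box without secrets, `docker logs disinto-edge 2>&1 | check_edge_log` would be expected to print `ok` once the fix is in place. It only makes sense for the no-secrets case; a production edge with all four secrets should not emit the warning at all.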

Test

Manual verification on disinto-dev-box:

# Confirm no age key / secrets
ls /home/johba/.config/sops/age/ /home/johba/disinto/secrets/
# Rebuild + restart edge
cd /home/johba/disinto && docker compose up -d --build edge
# Verify healthy (not restarting)
docker ps --filter name=disinto-edge --format '{{.Status}}'
docker logs disinto-edge 2>&1 | grep -E 'engagement|FATAL|skipping'
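
A single `docker ps` status can catch the container between restarts, so it may help to compare the restart counter across an interval longer than the ~30s crashloop period. This is a sketch under that assumption: `edge_is_stable` is a hypothetical helper, and it relies on the container being restarted by Docker's restart policy, which is what increments `RestartCount`.

```bash
#!/usr/bin/env bash
# Sketch only: edge_is_stable is a hypothetical helper. It samples the
# container's restart counter twice and succeeds only if it did not change.
# Usage: edge_is_stable [CONTAINER] [WAIT_SECONDS]
edge_is_stable() {
  local name="${1:-disinto-edge}" wait_s="${2:-45}" before after
  before=$(docker inspect --format '{{.RestartCount}}' "$name") || return 2
  # Default wait is longer than the observed ~30s restart interval.
  sleep "$wait_s"
  after=$(docker inspect --format '{{.RestartCount}}' "$name") || return 2
  [ "$before" = "$after" ]
}
```

Running `edge_is_stable && echo "edge stable"` after the rebuild gives a yes/no answer instead of a snapshot. Exit status 2 means `docker inspect` itself failed, for example because the container does not exist.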

Non-goals

  • Adding an EDGE_REQUIRE_SECRETS=1 opt-in strict mode (can be a follow-up; production edges currently work because they set up the secrets anyway).
  • Migrating anything to age encryption on dev boxes.
  • Touching the collect-engagement script itself or its cron timing.

Affected files

  • docker/edge/entrypoint-edge.sh — soften secrets check, gate cron on ready flag

Related

  • #777 — introduced the hard requirement
  • #745 — owns the collect-engagement feature
  • #1033 — blocked by this crashloop
disinto-admin added the blocked and bug-report labels 2026-04-19 09:14:39 +00:00
Reference: disinto-admin/disinto#1038