# Lessons learned

## Debugging & Diagnostics

**Map the environment before changing code.** Silent failures often stem from runtime assumptions: missing paths, wrong user context, or unmet prerequisites. Verify the actual environment first.

**Silent termination is a logging failure.** When a script exits non-zero with no output, the bug is in error handling, not the command. Log at operation entry points, not just on success.
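
One way to log at operation entry points is a small wrapper around each step. This is a minimal sketch, assuming bash; `log` and `run_step` are hypothetical helper names, not functions from this project.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for illustration: log() and run_step().

log() {
  # Timestamped message on stderr so it survives stdout redirection.
  printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >&2
}

run_step() {
  # Log before the command runs, then log the outcome with its exit status.
  local name=$1 rc=0
  shift
  log "start: ${name}"
  "$@" || rc=$?
  if [ "$rc" -ne 0 ]; then
    log "FAILED: ${name} (exit ${rc}): $*"
  else
    log "ok: ${name}"
  fi
  return "$rc"
}

run_step "noop" true
run_step "always fails" false || true
```

Because the "start" line is written before the command runs, a step that dies without output still leaves a trace of where the script was.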

**Pipefail is not a silver bullet.** It propagates exit codes but doesn't guarantee visibility. Pair with explicit error logging for external commands (git, curl, etc.).
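
The sketch below pairs `pipefail` with explicit logging of which pipeline stage failed. It assumes bash (for `PIPESTATUS`); `producer` is a hypothetical stand-in for an external command such as git or curl.

```bash
#!/usr/bin/env bash
set -o pipefail

# Hypothetical stand-in for an external command: prints partial output,
# then fails with status 3.
producer() {
  printf 'b\na\n'
  return 3
}

run_pipeline() {
  producer | sort
  # PIPESTATUS is overwritten by the next command, so copy it immediately.
  local -a stages=("${PIPESTATUS[@]}")
  if [ "${stages[0]}" -ne 0 ]; then
    echo "error: producer exited ${stages[0]}; output above may be incomplete" >&2
    return "${stages[0]}"
  fi
  if [ "${stages[1]}" -ne 0 ]; then
    echo "error: sort exited ${stages[1]}" >&2
    return "${stages[1]}"
  fi
}

run_pipeline || echo "pipeline failed with status $?" >&2
```

With `pipefail` alone the caller would only see a non-zero status; the explicit message says which stage failed and that the output may be partial.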

**Debug the pattern, not the symptom.** If one HTTP call fails with 403, audit all similar calls. If one script has a bug, find where the same code is duplicated.

## Shell Scripting Patterns

**Exit codes don't indicate output.** Commands like `grep -c` exit 1 when count is 0 but still output a number. Test both output and exit status independently.
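
A minimal sketch of testing output and exit status separately for `grep -c`, assuming bash; `count_matches` is a hypothetical helper name.

```bash
#!/usr/bin/env bash

# count_matches PATTERN FILE
# Prints the match count and returns 0 for both "some matches" and "no matches".
# Returns non-zero only when grep itself failed (status 2 or higher, for
# example an unreadable file).
count_matches() {
  local pattern=$1 file=$2 count rc=0
  count=$(grep -c -- "$pattern" "$file") || rc=$?
  if [ "$rc" -gt 1 ]; then
    echo "error: grep failed on ${file} (exit ${rc})" >&2
    return "$rc"
  fi
  printf '%s\n' "$count"
}

tmp=$(mktemp)
printf 'alpha\nbeta\n' > "$tmp"
count_matches alpha "$tmp"   # one matching line
count_matches gamma "$tmp"   # no match: grep exits 1, the count is still printed
rm -f "$tmp"
```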

**The `||` fallback pattern is fragile.** In `$(cmd || echo 0)`, the fallback's output is appended to whatever `cmd` already printed; it does not replace it. Use command grouping or conditionals when output clarity matters.
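
The following sketch reproduces the problem with `grep -c` and shows one conditional form that avoids it. It assumes bash; the variable names are illustrative.

```bash
#!/usr/bin/env bash

tmp=$(mktemp)
printf 'alpha\n' > "$tmp"

# Fragile: grep -c prints "0" AND exits 1, so the fallback adds a second "0".
fragile=$(grep -c gamma "$tmp" || echo 0)

# Safer: keep the command's own output and handle the status separately.
if safe=$(grep -c gamma "$tmp"); then
  : # at least one match, count is already in $safe
else
  safe=${safe:-0}   # fall back only if nothing was printed
fi

printf 'fragile=[%s]\nsafe=[%s]\n' "$fragile" "$safe"
rm -f "$tmp"
```

The fragile variable ends up holding two lines, which is exactly the kind of value that later breaks a numeric comparison.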

**Arithmetic contexts are unforgiving.** `(( ))` raises a syntax error on any value that is not a clean integer expression. A stray newline or an extra token (such as a doubled `0`) is enough to break the script.
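
One defensive option is to validate values before they reach `(( ))`. This is a sketch under the assumption that only non-negative decimal integers are expected; `is_uint` and `safe_gt` are hypothetical helper names.

```bash
#!/usr/bin/env bash

# is_uint VALUE: succeeds only for a non-empty string of decimal digits.
is_uint() {
  [[ $1 =~ ^[0-9]+$ ]]
}

# safe_gt VALUE THRESHOLD
# Returns 0 if VALUE > THRESHOLD, 1 if not, and 2 for malformed input,
# instead of letting (( )) raise a syntax error.
safe_gt() {
  local value=$1 threshold=$2
  if ! is_uint "$value" || ! is_uint "$threshold"; then
    printf 'error: not an integer: %q / %q\n' "$value" "$threshold" >&2
    return 2
  fi
  # 10# forces base 10 so values like 08 are not parsed as octal.
  (( 10#$value > 10#$threshold ))
}

safe_gt 5 3 && echo "5 > 3"
safe_gt $'0\n0' 0 || echo "rejected or not greater (status $?)"
```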

**Source file boundaries matter.** Variables defined in a sourced file are visible in the sourcing shell but not in child processes unless exported. Trace the lifecycle: definition → export → usage.
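
A small demonstration of the export boundary, assuming bash and that neither variable is already set in the environment; `PLAIN_VAR` and `EXPORTED_VAR` are hypothetical names.

```bash
#!/usr/bin/env bash

conf=$(mktemp)
cat > "$conf" <<'EOF'
PLAIN_VAR=from-conf
export EXPORTED_VAR=from-conf
EOF

# Source the file into the current shell.
. "$conf"
rm -f "$conf"

# The sourcing shell sees both variables.
in_parent="${PLAIN_VAR}/${EXPORTED_VAR}"

# A child process sees only the exported one.
in_child=$(bash -c 'printf "%s/%s" "${PLAIN_VAR:-unset}" "${EXPORTED_VAR:-unset}"')

printf 'parent: %s\nchild:  %s\n' "$in_parent" "$in_child"
```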

## Environment & Deployment

**User context matters at every layer.** When using `gosu`/`su-exec`, ensure all file operations occur under the target user. Create resources with explicit `chown` before dropping privileges.
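
A sketch of preparing a directory for the target user before privileges are dropped. `prepare_owned_dir` is a hypothetical helper, and the user `app` and path `/data` in the comment are placeholders.

```bash
#!/usr/bin/env bash

# prepare_owned_dir DIR OWNER
# Creates DIR if needed and hands it to OWNER (user or user:group) so the
# unprivileged process can write to it later.
prepare_owned_dir() {
  local dir=$1 owner=$2
  mkdir -p "$dir" || return 1
  chown -R "$owner" "$dir" || {
    echo "error: cannot chown ${dir} to ${owner}" >&2
    return 1
  }
}

# In an entrypoint that starts as root (placeholder names):
#   prepare_owned_dir /data app:app || exit 1
#   exec gosu app "$@"
```

Anything created after the `exec gosu` line is already owned by the target user; the risk is in files created as root before it.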

**Test under final runtime conditions.** Reproduce the exact user context the application will run under, not just "container runs."

**Fail fast with actionable diagnostics.** Entrypoints should exit immediately on dependency failures with clear messages explaining *why* and *what to do*.
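
A minimal sketch of dependency checks that say why they failed and what to do, assuming bash. `require_cmd`, `require_env`, and `API_TOKEN` are hypothetical names.

```bash
#!/usr/bin/env bash

# require_cmd NAME HINT: check that a command is on PATH; on failure report
# what is missing and how to fix it.
require_cmd() {
  local name=$1 hint=$2
  if ! command -v "$name" >/dev/null 2>&1; then
    echo "fatal: required command '${name}' not found on PATH. ${hint}" >&2
    return 1
  fi
}

# require_env NAME HINT: same idea for environment variables.
require_env() {
  local name=$1 hint=$2
  if [ -z "${!name:-}" ]; then
    echo "fatal: environment variable ${name} is empty or unset. ${hint}" >&2
    return 1
  fi
}

# In an entrypoint, exit immediately instead of continuing half-configured:
#   require_cmd git "Install git in the image." || exit 1
#   require_env API_TOKEN "Set it in the service environment." || exit 1
```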

**Throttle retry loops.** Infinite retries without backoff mask underlying problems and look identical to healthy startups.
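
One possible shape for a bounded retry with exponential backoff, assuming bash and whole-second delays; `retry_with_backoff` is a hypothetical helper name.

```bash
#!/usr/bin/env bash

# retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Retries with a doubling delay and gives up loudly after MAX_ATTEMPTS.
retry_with_backoff() {
  local max=$1 delay=$2 attempt=1 rc=0
  shift 2
  while true; do
    rc=0
    "$@" || rc=$?
    if [ "$rc" -eq 0 ]; then
      return 0
    fi
    if [ "$attempt" -ge "$max" ]; then
      echo "error: giving up after ${attempt} attempts (last exit ${rc}): $*" >&2
      return "$rc"
    fi
    echo "warn: attempt ${attempt}/${max} failed (exit ${rc}); retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}

# Example: up to 5 attempts, waiting 1s, 2s, 4s, 8s between them.
#   retry_with_backoff 5 1 some_command --with-args
```

Each failed attempt is logged, so a stuck dependency shows up in the logs instead of looking like a slow but healthy startup.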

## API & Integration

**Validate semantic types, not just names.** Don't infer resource type from naming conventions. Explicitly resolve whether an identifier is a user, org, or team before constructing URLs.
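
A sketch of building a URL from an explicitly resolved owner kind. The function name `repos_url` and the URL paths are illustrative assumptions, not the API of any specific service; the point is that the kind is an input the caller must resolve, not something guessed from the name.

```bash
#!/usr/bin/env bash

# repos_url BASE KIND NAME
# KIND must come from an explicit lookup (for example a field in an API
# response), never from how NAME happens to look.
repos_url() {
  local base=$1 kind=$2 name=$3
  case "$kind" in
    user) printf '%s/users/%s/repos\n' "$base" "$name" ;;
    org)  printf '%s/orgs/%s/repos\n' "$base" "$name" ;;
    *)
      echo "error: unknown owner kind '${kind}' for '${name}'; resolve it before building URLs" >&2
      return 1
      ;;
  esac
}

repos_url "https://forge.example/api" org acme
```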

**403 errors can signal semantic mismatches.** When debugging auth failures, consider whether the request is going to the wrong resource type.

**Auth failures are rarely isolated.** If one endpoint requires credentials, scan for other unauthenticated calls. Environment assumptions about public access commonly break.

**Test against the most restrictive environment first.** If it works on a locked-down instance, it is far more likely to work everywhere else.

## State & Configuration

**Idempotency requires state awareness.** Distinguish "needs setup" from "already configured." A naive always-rotate approach breaks reproducibility.

**Audit the full dependency chain.** When modifying shared resources, trace all consumers. Embedded tokens create hidden coupling.

**Check validity, not just existence.** Never assume a credential is valid just because it exists. Verify expiry, permissions, or other validity criteria.
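
The state and validity checks above can be combined into one idempotent setup step. This is a sketch under a hypothetical state file format (`<token> <expiry-epoch-seconds>` on one line); `token_state` and `ensure_token` are illustrative names, and the token generation is a placeholder for the real provisioning call.

```bash
#!/usr/bin/env bash

# token_state FILE [NOW]: prints missing, expired, or valid.
token_state() {
  local file=$1 now=${2:-$(date +%s)} token expiry
  if [ ! -s "$file" ]; then
    echo missing
    return 0
  fi
  read -r token expiry < "$file" || true
  if [[ ! $expiry =~ ^[0-9]+$ ]] || [ "$expiry" -le "$now" ]; then
    echo expired
  else
    echo valid
  fi
}

# ensure_token FILE TTL_SECONDS: rotate only when the current token is unusable.
ensure_token() {
  local file=$1 ttl=$2 now
  now=$(date +%s)
  case "$(token_state "$file" "$now")" in
    valid) return 0 ;;   # already configured: leave it alone
    *)     printf 'tok-%s %s\n' "$RANDOM" "$((now + ttl))" > "$file" ;;
  esac
}
```

Running `ensure_token` twice in a row leaves the first token in place, while a missing or expired one is replaced.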

**Conservative defaults become problematic defaults.** Timeouts and limits should reflect real-world expectations, not worst-case scenarios. When in doubt, start aggressive and fail fast.

**Documentation and defaults must stay in sync.** When a default changes, docs should immediately reflect why.

## Validation & Testing

**Add validation after critical operations.** If a migration creates N commits, verify N commits exist afterward. The extra lines are cheaper than debugging incomplete work.
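
A minimal sketch of a post-operation check, assuming bash; `assert_count` is a hypothetical helper, and `base_ref` and `n` in the comment are placeholders for the caller's values.

```bash
#!/usr/bin/env bash

# assert_count LABEL EXPECTED ACTUAL
# Fails with a message that names what was checked and both values.
assert_count() {
  local label=$1 expected=$2 actual=$3
  if [ "$actual" != "$expected" ]; then
    echo "error: ${label}: expected ${expected}, found ${actual}" >&2
    return 1
  fi
}

# Example use after a migration that should have created n commits on top of
# a known base:
#   actual=$(git rev-list --count "${base_ref}..HEAD")
#   assert_count "migrated commits" "$n" "$actual" || exit 1
```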

**Integration tests should cover both paths.** Test org and user scenarios, empty inputs, and edge cases explicitly.

**Reproduce with minimal examples.** Running the exact pipeline with test cases that trigger edge conditions catches bugs early.

**Treat "works locally but not in production" as environmental, not code.** The bug is in assumptions about the runtime, not the logic itself.