
# Claude Code OAuth Concurrency Model

## TL;DR

The factory runs N+1 concurrent Claude Code processes across containers
(`disinto-agents` plus every transient container spawned by
`docker/edge/dispatcher.sh`) that all share `~/.claude` via bind mount.
To avoid OAuth refresh races,
they MUST be serialized by the external `flock` on
`${HOME}/.claude/session.lock` in `lib/agent-sdk.sh`. Claude Code's
internal OAuth refresh lock does **not** work across containers in our
mount layout. Do not remove the external flock without also fixing the
lockfile placement upstream.

## What we run

| Container | Claude Code processes | Mount of `~/.claude` |
|---|---|---|
| `disinto-agents` (persistent) | polling-loop agents via `lib/agent-sdk.sh::agent_run` | `/home/johba/.claude` → `/home/agent/.claude` (rw) |
| `disinto-edge` (persistent) | none directly — spawns transient containers via `docker/edge/dispatcher.sh` | n/a |
| transient containers spawned by `dispatcher.sh` | one-shot `claude` per invocation | same mount, same path |

All N+1 processes can hit the OAuth refresh window concurrently when
the access token nears expiry.

## The race

OAuth access tokens are short-lived; refresh tokens rotate on each
refresh. If two processes both POST the same refresh token to
Anthropic's token endpoint simultaneously, only one wins: the rotation
invalidates that token, so the other gets `invalid_grant` and the
operator is forced to re-login.

Historically this manifested as "agents losing auth, frequent re-logins",
which is the original reason `lib/agent-sdk.sh` introduced the external
flock. The current shape (post-#606 watchdog work) is at
`lib/agent-sdk.sh:139,144`:

```bash
local lock_file="${HOME}/.claude/session.lock"
...
output=$(cd "$run_dir" && ( flock -w 600 9 || exit 1;
  claude_run_with_watchdog claude "${args[@]}" ) 9>"$lock_file" ...)
```

This serializes every `claude` invocation across every process that
shares `${HOME}/.claude/`.
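
The serialization can be observed in a minimal sketch that does not involve `claude` at all. It assumes util-linux `flock` is installed; the `worker` function, the temporary directory, and the log file are hypothetical stand-ins for `agent_run` and the shared `~/.claude/`. Each worker takes fd 9 on the same lock file, as the snippet above does, and writes a start and an end marker. If the lock works, the markers of different workers should never interleave.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-ins for the shared ~/.claude directory and its lock file.
demo_dir=$(mktemp -d)
lock_file="$demo_dir/session.lock"
log_file="$demo_dir/order.log"

# Each worker opens the lock file on fd 9 and blocks until it holds the lock.
worker() {
  local name=$1
  (
    flock -w 10 9 || exit 1
    echo "start $name" >>"$log_file"
    sleep 0.2                      # stand-in for the claude call
    echo "end $name" >>"$log_file"
  ) 9>"$lock_file"
}

# Launch three workers concurrently and wait for all of them.
worker a &
worker b &
worker c &
wait

cat "$log_file"
```

The order in which the three workers run is not deterministic, but each `start` line should be followed directly by the matching `end` line.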

## Why Claude Code's internal lock does not save us

`src/utils/auth.ts:1491` (read from a leaked TS source — current as of
April 2026) calls:

```typescript
release = await lockfile.lock(claudeDir)
```

with no `lockfilePath` option. `proper-lockfile` defaults to creating
the lock at `<target>.lock` as a **sibling**, so for
`claudeDir = /home/agent/.claude`, the lockfile is created at
`/home/agent/.claude.lock`.

`/home/agent/.claude` is bind-mounted from the host, but `/home/agent/`
itself is part of each container's local overlay filesystem. So each
container creates its own private `/home/agent/.claude.lock` — they
never see each other's locks. The internal cross-process lock is a
no-op across our containers.

Verified empirically:

```
$ docker exec disinto-agents findmnt /home/agent/.claude
TARGET              SOURCE                                  FSTYPE
/home/agent/.claude /dev/loop15[/...rootfs/home/johba/.claude] btrfs

$ docker exec disinto-agents findmnt /home/agent
(blank — not a mount, container-local overlay)

$ docker exec disinto-agents touch /home/agent/test-marker
$ docker exec disinto-edge ls /home/agent/test-marker
ls: cannot access '/home/agent/test-marker': No such file or directory
```

(Compare with `src/services/mcp/auth.ts:2097`, which does it correctly
by passing `lockfilePath: join(claudeDir, "mcp-refresh-X.lock")` — that
lockfile lives inside the bind-mounted directory and IS shared. The
OAuth refresh path is an upstream oversight worth filing once we have
bandwidth.)
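
The difference between the two placements comes down to path arithmetic, sketched below in bash. The helper names `default_lock_path` and `is_inside` are hypothetical, and the check is a pure string comparison under the assumption that only `/home/agent/.claude` is bind-mounted; it does not inspect real mounts.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Default placement as described above: "<target>.lock", a sibling of the target.
default_lock_path() {
  local target=${1%/}
  printf '%s.lock\n' "$target"
}

# Succeeds if path lies inside dir (string check only, no filesystem access).
is_inside() {
  local path=$1 dir=${2%/}
  [[ $path == "$dir"/* ]]
}

claude_dir=/home/agent/.claude
sibling=$(default_lock_path "$claude_dir")   # where the internal lock lands
explicit="$claude_dir/session.lock"          # where the external flock lives

for p in "$sibling" "$explicit"; do
  if is_inside "$p" "$claude_dir"; then
    echo "$p: inside the bind mount, shared"
  else
    echo "$p: outside the bind mount, container-local"
  fi
done
```

Under that assumption, the sibling path falls outside the mounted directory while the explicit path falls inside it, which matches the `findmnt` evidence above.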

## How the external flock fixes it

The lock file path `${HOME}/.claude/session.lock` is **inside**
`~/.claude/`, which IS shared via the bind mount. All containers see
the same inode and serialize correctly via `flock`. This is a
sledgehammer (it serializes the entire `claude -p` call, not just the
refresh window) but it works.

## Decision matrix for new claude-using containers

When adding a new container that runs Claude Code:

1. **If the container is a batch / agent context** (long-running calls,
   tolerant of serialization): mount the same `~/.claude` and route
   all `claude` calls through `lib/agent-sdk.sh::agent_run` so they
   take the external flock.

2. **If the container is interactive** (chat, REPL, anything where the
   operator is waiting on a response): do NOT join the external flock.
   An interactive session starved by the agent loop would be unusable:
   chat messages would block until the current agent's `claude -p`
   call finishes, which can take minutes, and the 10-minute
   `flock -w 600` timeout would frequently expire under a busy loop.
   Instead, pick one of:
   - **Separate OAuth identity**: new `~/.claude-chat/` on the host with
     its own `claude auth login`, mounted to the new container's
     `/home/agent/.claude`. Independent refresh state.
   - **`ANTHROPIC_API_KEY` fallback**: the codebase already supports it
     in `docker/agents/entrypoint.sh:119-125`. Different billing track
     but trivial config and zero coupling to the agents' OAuth.

3. **Never** mount the parent directory `/home/agent/` instead of just
   `.claude/` to "fix" the lockfile placement: doing so exposes too
   much host state to the container.
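
The starvation described in item 2 can be reproduced in miniature. This is a sketch under stated assumptions: util-linux `flock` is installed, a 2-second hold stands in for a minutes-long `claude -p` call, and a 1-second `-w` budget stands in for `-w 600`. The temporary directory and variable names are hypothetical.

```bash
#!/usr/bin/env bash
set -euo pipefail

demo_dir=$(mktemp -d)
lock_file="$demo_dir/session.lock"

# Stand-in for a long agent call: take the lock and hold it for 2 seconds.
( flock 9; sleep 2 ) 9>"$lock_file" &
holder=$!

sleep 0.5   # give the holder time to acquire the lock

# Stand-in for an interactive request with a short wait budget.
if ( flock -w 1 9 ) 9>"$lock_file"; then
  chat_result="acquired"
else
  chat_result="timed out"
fi
echo "interactive caller: $chat_result"

wait "$holder"
```

Because the holder keeps the lock longer than the caller is willing to wait, the interactive caller should give up without ever running. A separate OAuth identity or an API key avoids contending for this lock in the first place.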

## Future fix

The right long-term fix is upstream: file an issue against Anthropic's
claude-code repo asking that `src/utils/auth.ts:1491` be changed to
follow the pattern at `src/services/mcp/auth.ts:2097` and pass an
explicit `lockfilePath` inside `claudeDir`. Once that lands and we
upgrade, the external flock can become a fast-path no-op or be removed
entirely.

## See also

- `lib/agent-sdk.sh:139,144` — the external flock
- `docker/agents/entrypoint.sh:119-125` — the `ANTHROPIC_API_KEY` fallback
- Issue #623 — chat container, auth strategy (informed by this doc)
