## Summary
Adds `docs/CLAUDE-AUTH-CONCURRENCY.md`, documenting why the external `flock` on `${HOME}/.claude/session.lock` in `lib/agent-sdk.sh` is load-bearing rather than belt-and-suspenders, and providing a decision matrix for adding new containers that run Claude Code.
Pure docs change. No code touched.
## Why
The factory runs N+1 concurrent Claude Code processes across containers (`disinto-agents` plus every transient container spawned by `docker/edge/dispatcher.sh`), all sharing `~/.claude` via bind mount. The historical "agents losing auth, frequent re-logins" issue that motivated the original `session.lock` flock is the OAuth refresh race — and the flock is the only thing currently protecting against it.
A reasonable assumption when looking at Claude Code is that its internal `proper-lockfile.lock(claudeDir)` (in `src/utils/auth.ts:1491` of the leaked TS source) handles the refresh race, making the external flock redundant. **It does not**, in our specific bind-mount layout. Empirically verified:
- `proper-lockfile` defaults to `<target>.lock` as a sibling file when no `lockfilePath` is given
- For `claudeDir = /home/agent/.claude`, the lock lands at `/home/agent/.claude.lock`
- `/home/agent/` is **not** bind-mounted in our setup — it is the container's local overlay filesystem
- Each container creates its own private `.claude.lock`, none shared
- Cross-container OAuth refresh race is therefore unprotected by Claude Code's internal lock
The external flock works because the lock file path `${HOME}/.claude/session.lock` is **inside** the bind-mounted directory, so all containers see the same inode.
This came up during design discussion of the chat container in #623, where the temptation was to mount the existing `~/.claude` and skip the external flock for interactive responsiveness. The doc captures the analysis so future implementers don't take that shortcut.
## Changes
- New file: `docs/CLAUDE-AUTH-CONCURRENCY.md` (~135 lines): rationale, empirical evidence, decision matrix for new containers, pointer to the upstream fix
- `lib/AGENTS.md`: one-line **Concurrency** addendum to the `lib/agent-sdk.sh` row pointing at the new doc
## Test plan
- [ ] Markdown renders correctly in Forgejo
- [ ] Relative link from `lib/AGENTS.md` to `docs/CLAUDE-AUTH-CONCURRENCY.md` resolves (`../docs/CLAUDE-AUTH-CONCURRENCY.md`)
- [ ] Code references in the doc still match the current state of `lib/agent-sdk.sh:139,144` and `docker/agents/entrypoint.sh:119-125`
## Refs
- #623 — chat container, the issue that drove this analysis; it has a comment with the same analysis that will point back here once merged
Co-authored-by: Claude <noreply@anthropic.com>
Reviewed-on: #637
Co-authored-by: dev-bot <dev-bot@disinto.local>
Co-committed-by: dev-bot <dev-bot@disinto.local>
# Claude Code OAuth Concurrency Model

## TL;DR

The factory runs N+1 concurrent Claude Code processes across containers that all share `~/.claude` via bind mount. To avoid OAuth refresh races, they MUST be serialized by the external `flock` on `${HOME}/.claude/session.lock` in `lib/agent-sdk.sh`. Claude Code's internal OAuth refresh lock does not work across containers in our mount layout. Do not remove the external flock without also fixing the lockfile placement upstream.
## What we run

| Container | Claude Code processes | Mount of `~/.claude` |
|---|---|---|
| `disinto-agents` (persistent) | polling-loop agents via `lib/agent-sdk.sh::agent_run` | `/home/johba/.claude` → `/home/agent/.claude` (rw) |
| `disinto-edge` (persistent) | none directly — spawns transient containers via `docker/edge/dispatcher.sh` | n/a |
| transient containers spawned by `dispatcher.sh` | one-shot `claude` per invocation | same mount, same path |
All N+1 processes can hit the OAuth refresh window concurrently when the access token nears expiry.
## The race

OAuth access tokens are short-lived; refresh tokens rotate on each refresh. If two processes both POST the same refresh token to Anthropic's token endpoint simultaneously, only one wins — the other gets `invalid_grant` and the operator is forced to re-login.
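The rotation semantics can be modeled with a toy store — no real OAuth here; a temp file stands in for the server-side current refresh token, and all names are illustrative:

```shell
#!/usr/bin/env bash
# Toy model of refresh-token rotation (illustrative only, no real OAuth).
# The "server" accepts a refresh only if the presented token is still current,
# then rotates it -- so of two clients holding the same token, one must lose.
set -euo pipefail

store=$(mktemp)
echo "rt-1" > "$store"          # server-side current refresh token

refresh() {
  if [ "$1" = "$(cat "$store")" ]; then
    echo "rt-$(( ${1#rt-} + 1 ))" > "$store"   # rotate the token
    echo "granted"
  else
    echo "invalid_grant"
  fi
}

tok=$(cat "$store")             # both processes read rt-1 before refreshing
refresh "$tok"                  # first refresh: granted, token rotates to rt-2
refresh "$tok"                  # second presents the stale rt-1: invalid_grant
```

The second call fails exactly the way the real endpoint does: the loser is left with a token the server no longer recognizes.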
Historically this manifested as "agents losing auth, frequent re-logins", which is the original reason `lib/agent-sdk.sh` introduced the external flock. The current shape (post-#606 watchdog work) is at `lib/agent-sdk.sh:139,144`:

```bash
local lock_file="${HOME}/.claude/session.lock"
...
output=$(cd "$run_dir" && ( flock -w 600 9 || exit 1;
  claude_run_with_watchdog claude "${args[@]}" ) 9>"$lock_file" ...)
```

This serializes every `claude` invocation across every process that shares `${HOME}/.claude/`.
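The pattern can be exercised standalone. A minimal sketch — paths are hypothetical temp-dir stand-ins, the timeout is shortened from the real `flock -w 600`, and a `sleep` stands in for the `claude` call whose refresh must not race:

```shell
#!/usr/bin/env bash
# Standalone sketch of the serialization pattern above. A temp dir stands in
# for the bind-mounted ~/.claude; each worker mimics one claude invocation.
set -euo pipefail

shared=$(mktemp -d)
lock_file="$shared/session.lock"
out="$shared/out.txt"

worker() {
  (
    flock -w 10 9 || exit 1        # wait up to 10s for the shared lock
    echo "start $1" >> "$out"
    sleep 0.2                      # critical section (refresh + API call)
    echo "end $1" >> "$out"
  ) 9>"$lock_file"
}

worker a & worker b &
wait
cat "$out"   # start/end pairs never interleave: the winner runs alone
```

Because both workers open the same `session.lock` inode, `flock` serializes them; give each worker its own lock path and the guarantee disappears, which is the sibling-lockfile failure analyzed in the next section.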
## Why Claude Code's internal lock does not save us

`src/utils/auth.ts:1491` (read from a leaked TS source — current as of April 2026) calls:

```ts
release = await lockfile.lock(claudeDir)
```

with no `lockfilePath` option. `proper-lockfile` defaults to creating the lock at `<target>.lock` as a sibling, so for `claudeDir = /home/agent/.claude`, the lockfile is created at `/home/agent/.claude.lock`.

`/home/agent/.claude` is bind-mounted from the host, but `/home/agent/` itself is part of each container's local overlay filesystem. So each container creates its own private `/home/agent/.claude.lock` — they never see each other's locks. The internal cross-process lock is a no-op across our containers.
Verified empirically:

```console
$ docker exec disinto-agents findmnt /home/agent/.claude
TARGET              SOURCE                                     FSTYPE
/home/agent/.claude /dev/loop15[/...rootfs/home/johba/.claude] btrfs

$ docker exec disinto-agents findmnt /home/agent
(blank — not a mount, container-local overlay)

$ docker exec disinto-agents touch /home/agent/test-marker
$ docker exec disinto-edge ls /home/agent/test-marker
ls: cannot access '/home/agent/test-marker': No such file or directory
```
(Compare with `src/services/mcp/auth.ts:2097`, which does it correctly by passing `lockfilePath: join(claudeDir, "mcp-refresh-X.lock")` — that lockfile lives inside the bind-mounted directory and IS shared. The OAuth refresh path is an upstream oversight worth filing once we have bandwidth.)
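The placement difference is reproducible without Docker. In this sketch a symlink stands in for the bind mount (the sibling-vs-inside distinction is the same) and all paths are temp-dir stand-ins:

```shell
#!/usr/bin/env bash
# Model: two "container overlays" (private temp dirs) each see one shared
# host directory through a symlink standing in for the ~/.claude bind mount.
set -euo pipefail

shared=$(mktemp -d)                 # host-side ~/.claude
c1=$(mktemp -d); c2=$(mktemp -d)    # container 1 and 2 overlay filesystems
ln -s "$shared" "$c1/.claude"
ln -s "$shared" "$c2/.claude"

# proper-lockfile's default: sibling <target>.lock -> lands in the overlay
touch "$c1/.claude.lock"
[ -e "$c2/.claude.lock" ] || echo "sibling lock: private to container 1"

# the external flock's path: inside the shared directory -> visible everywhere
touch "$c1/.claude/session.lock"
[ -e "$c2/.claude/session.lock" ] && echo "session.lock: shared"
```

Same target, two lock placements: the sibling lock exists only in the overlay that created it, while the lock inside the shared directory is the same file for everyone.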
## How the external flock fixes it

The lock file path `${HOME}/.claude/session.lock` is inside `~/.claude/`, which IS shared via the bind mount. All containers see the same inode and serialize correctly via `flock`. This is a sledgehammer (it serializes the entire `claude -p` call, not just the refresh window), but it works.
## Decision matrix for new claude-using containers

When adding a new container that runs Claude Code:

- If the container is a **batch / agent context** (long-running calls, tolerant of serialization): mount the same `~/.claude` and route all `claude` calls through `lib/agent-sdk.sh::agent_run` so they take the external flock.
- If the container is **interactive** (chat, REPL, anything where the operator is waiting on a response): do NOT join the external flock. Interactive starvation under the agent loop would be unusable — chat messages would block waiting for the current agent's `claude -p` call to finish, which can be minutes, and the 10-min `flock -w 600` would frequently expire under a busy loop. Instead, pick one of:
  - Separate OAuth identity: a new `~/.claude-chat/` on the host with its own `claude auth login`, mounted to the new container's `/home/agent/.claude`. Independent refresh state.
  - `ANTHROPIC_API_KEY` fallback: the codebase already supports it in `docker/agents/entrypoint.sh:119-125`. Different billing track but trivial config and zero coupling to the agents' OAuth.
- **Never** mount the parent directory `/home/agent/` instead of just `.claude/` to "fix" the lockfile placement — it exposes too much host state to the container.
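The fallback selection can be sketched as a tiny helper. This is hypothetical — the real logic lives in `docker/agents/entrypoint.sh:119-125` and may be shaped differently; `pick_auth` and the example key are made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of auth-mode selection; the real implementation is in
# docker/agents/entrypoint.sh:119-125 and may differ.
set -euo pipefail

pick_auth() {                  # $1: value of ANTHROPIC_API_KEY (may be empty)
  if [ -n "${1:-}" ]; then
    echo "api-key"             # independent billing, no shared refresh state
  else
    echo "oauth"               # shared ~/.claude: must take session.lock
  fi
}

pick_auth ""                   # chat container without a key: oauth path
pick_auth "sk-ant-example"     # with a key: decoupled from the agents' OAuth
```

The point of the matrix above is exactly this decoupling: anything on the `api-key` branch never touches the shared refresh token, so it needs no lock at all.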
## Future fix

The right long-term fix is upstream: file an issue against Anthropic's claude-code repo asking that `src/utils/auth.ts:1491` be changed to follow the pattern at `src/services/mcp/auth.ts:2097` and pass an explicit `lockfilePath` inside `claudeDir`. Once that lands and we upgrade, the external flock can become a fast-path no-op or be removed entirely.
## See also

- `lib/agent-sdk.sh:139,144` — the external flock
- `docker/agents/entrypoint.sh:119-125` — the `ANTHROPIC_API_KEY` fallback
- Issue #623 — chat container, auth strategy (informed by this doc)