disinto/docs/CLAUDE-AUTH-CONCURRENCY.md
fix: docs/CLAUDE-AUTH-CONCURRENCY.md: rewrite for shared CLAUDE_CONFIG_DIR approach (#646)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 21:16:18 +00:00

Claude Code OAuth Concurrency Model

Problem statement

The factory runs multiple concurrent Claude Code processes across containers. OAuth access tokens are short-lived; refresh tokens rotate on each use. If two processes POST the same refresh token to Anthropic's token endpoint simultaneously, only one wins — the other gets invalid_grant and the operator is forced to re-login.

Claude Code already serializes OAuth refreshes internally using proper-lockfile (src/utils/auth.ts:1485-1491):

release = await lockfile.lock(claudeDir)

proper-lockfile creates a lockfile via an atomic mkdir(${path}.lock) call — a cross-process primitive that works across any number of processes on the same filesystem. The problem was never the lock implementation; it was that our old per-container bind-mount layout (~/.claude mounted but /home/agent/ container-local) caused each container to compute a different lockfile path, so the locks never coordinated.
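The mkdir primitive is easy to see in isolation. A minimal shell sketch of the same idea (illustrative path and helper name, not factory code; proper-lockfile implements this one level up in Node):

```shell
#!/usr/bin/env sh
# mkdir is atomic: it either creates the directory (lock acquired) or
# fails because it already exists (lock held by another process).
LOCK_DIR="${TMPDIR:-/tmp}/claude-refresh-demo.$$.lock"

acquire() { mkdir "$1" 2>/dev/null; }

if acquire "$LOCK_DIR"; then
  echo "lock acquired"
  # critical section: the token refresh would run here
  rmdir "$LOCK_DIR"   # release
else
  echo "lock held elsewhere"
fi
```

Because the kernel guarantees mkdir's create-or-fail is atomic, this works across any number of processes sharing the filesystem, which is exactly the property the shared-mount fix relies on.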

The fix: shared CLAUDE_CONFIG_DIR

CLAUDE_CONFIG_DIR is an officially supported env var in Claude Code (src/utils/envUtils.ts). It controls where Claude resolves its config directory instead of the default ~/.claude.

By setting CLAUDE_CONFIG_DIR to a path on a shared bind mount, every container computes the same lockfile location. proper-lockfile's atomic mkdir(${CLAUDE_CONFIG_DIR}.lock) then provides cross-container serialization for free — no external wrapper needed.

Current layout

Host filesystem:
  /var/lib/disinto/claude-shared/          ← CLAUDE_SHARED_DIR
  └── config/                              ← CLAUDE_CONFIG_DIR
      ├── credentials.json
      ├── settings.json
      └── ...

Inside every container:
  Same absolute path: /var/lib/disinto/claude-shared/config
  Env: CLAUDE_CONFIG_DIR=/var/lib/disinto/claude-shared/config

The shared directory is mounted at the same absolute path inside every container, so proper-lockfile resolves an identical lock path everywhere.
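Since everything hinges on the lock path resolving identically, a preflight check can assert the layout before agents start. This is a hypothetical helper (function name and messages are illustrative, not part of the factory); note the lock directory is created *beside* the config dir, so the parent must be writable too:

```shell
#!/usr/bin/env sh
# Hypothetical preflight: verify the config dir exists and that
# proper-lockfile will be able to mkdir "<config dir>.lock" next to it.
preflight_config_dir() {
  cfg="${1:?usage: preflight_config_dir <config-dir>}"
  parent=$(dirname "$cfg")
  [ -d "$cfg" ]    || { echo "missing config dir: $cfg"; return 1; }
  [ -w "$parent" ] || { echo "lock parent not writable: $parent"; return 1; }
  echo "ok: $cfg (lock would be $cfg.lock)"
}
```

Running it with the documented path (`preflight_config_dir /var/lib/disinto/claude-shared/config`) inside each container would catch a missing or misaligned mount before the first refresh races.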

Where these values are defined

| What | Where |
| --- | --- |
| Defaults for CLAUDE_SHARED_DIR, CLAUDE_CONFIG_DIR | lib/env.sh:138-140 |
| .env documentation | .env.example:92-99 |
| Container mounts + env passthrough (edge dispatcher) | docker/edge/dispatcher.sh:446-448 (and analogous blocks for reproduce, triage, verify) |
| Auth detection using CLAUDE_CONFIG_DIR | docker/agents/entrypoint.sh:101-102 |
| Bootstrap / migration during disinto init | lib/claude-config.sh:setup_claude_config_dir(), bin/disinto:952-962 |

Migration for existing dev boxes

For operators upgrading from the old ~/.claude bind-mount layout, disinto init handles the migration interactively (or with --yes). The manual equivalent is:

# 1. Stop the factory
disinto down

# 2. Create the shared directory
mkdir -p /var/lib/disinto/claude-shared

# 3. Move existing config
mv "$HOME/.claude" /var/lib/disinto/claude-shared/config

# 4. Create a back-compat symlink so host-side claude still works
ln -sfn /var/lib/disinto/claude-shared/config "$HOME/.claude"

# 5. Export the env var (add to shell rc for persistence)
export CLAUDE_CONFIG_DIR=/var/lib/disinto/claude-shared/config

# 6. Start the factory
disinto up
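The manual steps above can be condensed into a re-runnable function. This is an illustrative sketch (disinto init remains the supported path; the function name is not factory code), parameterized so it can be rehearsed against a scratch directory before touching the real host:

```shell
#!/usr/bin/env sh
# Sketch of the manual migration as an idempotent function.
# $1 = home dir (default $HOME), $2 = shared dir (default as documented).
migrate_claude_config() {
  home="${1:-$HOME}"
  shared="${2:-/var/lib/disinto/claude-shared}"
  config="$shared/config"

  mkdir -p "$shared"

  # Move the legacy ~/.claude only on first run: it must be a real
  # directory (not already the back-compat symlink) and nothing may
  # have been migrated into the shared location yet.
  if [ -d "$home/.claude" ] && [ ! -L "$home/.claude" ] && [ ! -e "$config" ]; then
    mv "$home/.claude" "$config"
  fi

  mkdir -p "$config"
  # Back-compat symlink so host-side claude keeps working (safe to re-run).
  ln -sfn "$config" "$home/.claude"

  echo "$config"
}
```

With the defaults, running it requires write access to /var/lib; afterwards `export CLAUDE_CONFIG_DIR="$(migrate_claude_config)"` covers steps 2 through 5 in one shot.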

Verification

Watch for these analytics events during concurrent agent runs:

| Event | Meaning |
| --- | --- |
| tengu_oauth_token_refresh_lock_acquiring | A process is attempting to acquire the refresh lock |
| tengu_oauth_token_refresh_lock_acquired | Lock acquired; refresh proceeding |
| tengu_oauth_token_refresh_lock_retry | Lock is held by another process; retrying |
| tengu_oauth_token_refresh_lock_race_resolved | Contention detected and resolved normally |
| tengu_oauth_token_refresh_lock_retry_limit_reached | Lock acquisition failed after all retries |

Healthy: _race_resolved appearing during contention windows — this means multiple processes tried to refresh simultaneously and the lock correctly serialized them.

Bad: _lock_retry_limit_reached — indicates the lock is stuck or the shared mount is not working. Verify that CLAUDE_CONFIG_DIR resolves to the same path in all containers and that the filesystem supports mkdir atomicity (any POSIX filesystem does).
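When _lock_retry_limit_reached shows up, the lock can also be inspected by hand: per the layout above it is a directory named after the config dir plus a .lock suffix, and proper-lockfile periodically touches its mtime while held, so an old mtime suggests the holder died without releasing. A hypothetical diagnostic (the function name and the 10-minute threshold are illustrative choices, not Claude Code constants):

```shell
#!/usr/bin/env sh
# Hypothetical check for a stuck refresh lock beside the config dir.
check_refresh_lock() {
  cfg="${1:-${CLAUDE_CONFIG_DIR:-$HOME/.claude}}"
  lock="$cfg.lock"
  if [ ! -d "$lock" ]; then
    echo "no lock held"
  elif [ -n "$(find "$lock" -maxdepth 0 -mmin +10)" ]; then
    # mtime not refreshed for >10 min: holder likely died mid-refresh
    echo "stale lock: $lock (untouched for >10 min)"
  else
    echo "lock currently held: $lock"
  fi
}
```

A stale lock directory can be removed with rmdir once you have confirmed no refresh is actually in flight.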

The deferred external flock wrapper

lib/agent-sdk.sh:139,144 still wraps every claude invocation in an external flock on ${HOME}/.claude/session.lock:

local lock_file="${HOME}/.claude/session.lock"
...
output=$(cd "$run_dir" && ( flock -w 600 9 || exit 1;
  claude_run_with_watchdog claude "${args[@]}" ) 9>"$lock_file" ...)

With the CLAUDE_CONFIG_DIR fix in place, this external lock is redundant but harmless: proper-lockfile serializes the refresh internally, and flock serializes the entire invocation externally. The external flock remains as a defense-in-depth measure; removal is tracked as a separate vision-tier issue.
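The fd-9 flock idiom quoted above is easy to exercise standalone. A minimal sketch (illustrative function, not factory code; assumes util-linux flock): four concurrent workers do a read-modify-write that would race without the lock, and the exclusive flock serializes them so the counter reaches exactly 4:

```shell
#!/usr/bin/env sh
# Demonstrate the agent-sdk.sh locking idiom: open the lock file on
# fd 9 for the subshell, then take an exclusive flock on that fd.
demo_flock() {
  dir=$(mktemp -d)
  lock_file="$dir/session.lock"
  log="$dir/counter"
  for _ in 1 2 3 4; do
    (
      flock -w 10 9 || exit 1
      # critical section: read-modify-write is safe under the lock
      n=$(cat "$log" 2>/dev/null || echo 0)
      echo $((n + 1)) > "$log"
    ) 9>"$lock_file" &
  done
  wait
  cat "$log"
  rm -rf "$dir"
}
```

Dropping the flock line turns the counter into a lost-update race, which is the same failure mode the refresh lock prevents for OAuth tokens.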

See also

  • lib/env.sh:138-140 — CLAUDE_SHARED_DIR / CLAUDE_CONFIG_DIR defaults
  • lib/claude-config.sh — migration helper used by disinto init
  • lib/agent-sdk.sh:139,144 — the external flock wrapper (deferred removal)
  • docker/agents/entrypoint.sh:101-102 — CLAUDE_CONFIG_DIR auth detection
  • .env.example:92-99 — operator-facing documentation of the env vars
  • Issue #623 — chat container auth strategy