disinto/docs/CLAUDE-AUTH-CONCURRENCY.md

# Claude Code OAuth Concurrency Model

## Problem statement

The factory runs multiple concurrent Claude Code processes across
containers. OAuth access tokens are short-lived; refresh tokens rotate
on each use. If two processes POST the same refresh token to Anthropic's
token endpoint simultaneously, only one wins — the other gets
`invalid_grant` and the operator is forced to re-login.

Claude Code already serializes OAuth refreshes internally using
`proper-lockfile` (`src/utils/auth.ts:1485-1491`):

```typescript
release = await lockfile.lock(claudeDir)
```

`proper-lockfile` creates a lockfile via an atomic `mkdir(${path}.lock)`
call — a cross-process primitive that works across any number of
processes on the same filesystem. The problem was never the lock
implementation; it was that our old per-container bind-mount layout
(`~/.claude` mounted but `/home/agent/` container-local) caused each
container to compute a different lockfile path, so the locks never
coordinated.

## The fix: shared `CLAUDE_CONFIG_DIR`

`CLAUDE_CONFIG_DIR` is an officially supported env var in Claude Code
(`src/utils/envUtils.ts`). It controls where Claude resolves its config
directory instead of the default `~/.claude`.

By setting `CLAUDE_CONFIG_DIR` to a path on a shared bind mount, every
container computes the **same** lockfile location. `proper-lockfile`'s
atomic `mkdir(${CLAUDE_CONFIG_DIR}.lock)` then gives free cross-container
serialization — no external wrapper needed.

## Current layout

```
Host filesystem:
  /var/lib/disinto/claude-shared/          ← CLAUDE_SHARED_DIR
  └── config/                              ← CLAUDE_CONFIG_DIR
      ├── credentials.json
      ├── settings.json
      └── ...

Inside every container:
  Same absolute path: /var/lib/disinto/claude-shared/config
  Env: CLAUDE_CONFIG_DIR=/var/lib/disinto/claude-shared/config
```

The shared directory is mounted at the **same absolute path** inside
every container, so `proper-lockfile` resolves an identical lock path
everywhere.

### Where these values are defined

| What | Where |
|------|-------|
| Defaults for `CLAUDE_SHARED_DIR`, `CLAUDE_CONFIG_DIR` | `lib/env.sh:138-140` |
| `.env` documentation | `.env.example:92-99` |
| Container mounts + env passthrough (edge dispatcher) | `docker/edge/dispatcher.sh:446-448` (and analogous blocks for reproduce, triage, verify) |
| Auth detection using `CLAUDE_CONFIG_DIR` | `docker/agents/entrypoint.sh:101-102` |
| Bootstrap / migration during `disinto init` | `lib/claude-config.sh:setup_claude_config_dir()`, `bin/disinto:952-962` |

## Migration for existing dev boxes

For operators upgrading from the old `~/.claude` bind-mount layout,
`disinto init` handles the migration interactively (or with `--yes`).
The manual equivalent is:

```bash
# 1. Stop the factory
disinto down

# 2. Create the shared directory
mkdir -p /var/lib/disinto/claude-shared

# 3. Move existing config
mv "$HOME/.claude" /var/lib/disinto/claude-shared/config

# 4. Create a back-compat symlink so host-side claude still works
ln -sfn /var/lib/disinto/claude-shared/config "$HOME/.claude"

# 5. Export the env var (add to shell rc for persistence)
export CLAUDE_CONFIG_DIR=/var/lib/disinto/claude-shared/config

# 6. Start the factory
disinto up
```

## Verification

Watch for these analytics events during concurrent agent runs:

| Event | Meaning |
|-------|---------|
| `tengu_oauth_token_refresh_lock_acquiring` | A process is attempting to acquire the refresh lock |
| `tengu_oauth_token_refresh_lock_acquired` | Lock acquired; refresh proceeding |
| `tengu_oauth_token_refresh_lock_retry` | Lock is held by another process; retrying |
| `tengu_oauth_token_refresh_lock_race_resolved` | Contention detected and resolved normally |
| `tengu_oauth_token_refresh_lock_retry_limit_reached` | Lock acquisition failed after all retries |

**Healthy:** `_race_resolved` appearing during contention windows — this
means multiple processes tried to refresh simultaneously and the lock
correctly serialized them.

**Bad:** `_lock_retry_limit_reached` — indicates the lock is stuck or
the shared mount is not working. Verify that `CLAUDE_CONFIG_DIR` resolves
to the same path in all containers and that the filesystem supports
`mkdir` atomicity (any POSIX filesystem does).

## The deferred external `flock` wrapper

`lib/agent-sdk.sh:139,144` still wraps every `claude` invocation in an
external `flock` on `${HOME}/.claude/session.lock`:

```bash
local lock_file="${HOME}/.claude/session.lock"
...
output=$(cd "$run_dir" && ( flock -w 600 9 || exit 1;
  claude_run_with_watchdog claude "${args[@]}" ) 9>"$lock_file" ...)
```

With the `CLAUDE_CONFIG_DIR` fix in place, this external lock is
**redundant but harmless** — `proper-lockfile` serializes the refresh
internally, and `flock` serializes the entire invocation externally.
The external flock remains as a defense-in-depth measure; removal is
tracked as a separate vision-tier issue.

## See also

- `lib/env.sh:138-140` — `CLAUDE_SHARED_DIR` / `CLAUDE_CONFIG_DIR` defaults
- `lib/claude-config.sh` — migration helper used by `disinto init`
- `lib/agent-sdk.sh:139,144` — the external `flock` wrapper (deferred removal)
- `docker/agents/entrypoint.sh:101-102` — `CLAUDE_CONFIG_DIR` auth detection
- `.env.example:92-99` — operator-facing documentation of the env vars
- Issue #623 — chat container auth strategy