AX-Copilot-Codex/docs/CODE_CONTEXT_RELIABILITY_PLAN.md

# Code Context Reliability Plan

Update: 2026-04-16 06:41 (KST)

- Added `src/AxCopilot/Services/Agent/AgentLoopLlmDispatchStageService.cs` so the LLM dispatch path is now split into:
  - history/query assembly
  - pre-LLM stage planning
  - dispatch/stream stage execution
  - tool execution and recovery
- `AgentLoopService` no longer owns the inline stream preview callback or the direct `SendWithToolsWithRecoveryAsync(...)` setup for the primary loop.
- `StreamingToolExecutionCoordinator.cs` was also normalized to English-only active-path status strings so the staged dispatch path no longer reintroduces mojibake text during wait/retry handling.
- Remaining structural gap versus the target `claw-code` shape:
  - the `NotSupportedException` / `ToolCallNotSupportedException` fallback branch still lives in `AgentLoopService`
  - the next extraction target should be a narrower fallback policy stage so the main loop keeps shrinking toward a pure orchestrator

Update: 2026-04-16 01:37 (KST)

## Background

Recent Code tab runs show that the LLM request payload is still growing over time. In the `2026-04-16 00:46:26` to `00:50:52` run, the request size grew from `messages=7` to `messages=125`. That means the failure mode is not "context does not grow at all." The real problem is context fidelity: detailed evidence that the model still needs is being replaced too quickly by previews, repair notes, and low-signal summaries.

The same log window repeatedly shows:

- `tool_calls/tool mismatch detected - flattening assistant message`
- `orphan tool message detected - converting to user`
- repeated rereads of nearby files after build failures
- shifting build failures such as `MC3089` followed by `CS0017` without a stable working set that preserves what was already changed and what remains broken

In short, the current system grows the raw message count but does not preserve a stable working set for long-running code tasks.

## Current Findings

### 1. Workspace context bootstrap is weak on first load

- AX targets:
  - `src/AxCopilot/Views/ChatWindow.UtilityPresentation.cs`
  - `src/AxCopilot/Services/Agent/WorkspaceContextGenerator.cs`
- Finding:
  - When `.ax-context.md` is missing, the first Code request can return before background workspace-context generation becomes useful.
- Impact:
  - Empty-workspace and fresh-project tasks start without a reliable folder or project summary in the early loops.

### 2. Build and file evidence is compacted too aggressively

- AX targets:
  - `src/AxCopilot/Services/Agent/AgentToolResultBudget.cs`
  - `src/AxCopilot/Services/Agent/ContextCondenser.cs`
- Current values:
  - `DefaultSoftCharLimit = 900`
  - `DefaultAggregateBudgetChars = 7_500`
  - `RecentKeepCount = 6`
- Impact:
  - Code tasks lose detailed build, test, and file-read evidence too early and fall back to previews instead of actionable context.

### 3. Session learning is not a durable code working set

- AX targets:
  - `src/AxCopilot/Services/Agent/SessionLearningCollector.cs`
  - `src/AxCopilot/Services/Agent/AgentLoopService.cs`
  - `src/AxCopilot/Views/ChatWindow.SystemPromptBuilder.cs`
- Finding:
  - Session learnings are injected every loop, but they are not structured strongly enough to lock in:
    - current goal
    - current architecture
    - changed files
    - latest build or test failure
    - next repair target
- Impact:
  - The model must repeatedly reconstruct project state from noisy history instead of reading a stable code-task memory layer.

### 4. Tool-trace invariant repairs are too common

- AX targets:
  - `src/AxCopilot/Services/LlmService.ToolUse.cs`
  - `src/AxCopilot/Services/Agent/AgentMessageInvariantHelper.cs`
  - `src/AxCopilot/Services/Agent/AgentLoopService.cs`
- Finding:
  - The recent logs show repeated mismatch and orphan corrections.
- Impact:
  - Even if the total message count grows, the semantic chain between assistant reasoning, tool call, and tool result becomes less reliable.

### 5. There is no Code-specific working-set layer

- AX targets:
  - new service required
  - injection path should go through:
    - `AgentLoopLlmRequestPreparationService`
    - `AgentQueryContextBuilder`
    - `AgentLoopService`
- Finding:
  - The current request mixes raw chat history, session learnings, project context, and workspace context, but it does not maintain a dedicated code-task state ledger.
- Impact:
  - Long-running runs become increasingly inconsistent because the model keeps rediscovering facts that should already be fixed in memory.

## External Research Notes

### Anthropic Claude Code memory docs

- Claude Code explicitly documents memory files that are auto-loaded at startup and inspectable via `/memory`.
- Planning implication:
  - AX should have a clearly observable memory hierarchy for Code tasks, including what was auto-loaded and why.
- Source:
  - [Anthropic Claude Code memory docs](https://docs.anthropic.com/zh-CN/docs/claude-code/memory)

### OpenAI practical guide to building agents

- The guide emphasizes observability, eval baselines, and explicit tool and system design before optimizing agent behavior.
- Planning implication:
  - AX should log the exact context sections that enter each Code request, including what was compacted and why.
- Source:
  - [OpenAI practical guide to building agents](https://cdn.openai.com/business-guides-and-resources/a-practical-guide-to-building-agents.pdf)

### SWE-Pruner

- The paper argues that task-aware adaptive pruning outperforms naive fixed truncation for coding agents.
- Planning implication:
  - AX should protect code-task evidence such as latest build failures and changed-file summaries instead of applying mostly size-based pruning.
- Source:
  - [SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents](https://arxiv.org/abs/2601.16746)

## `claude-code` Reference Points

Reference targets:

- `claw-code/claw-code-f5a40b86dede580f6543bf8926c9af017eea9409/src/query.ts`
- `claw-code/claw-code-f5a40b86dede580f6543bf8926c9af017eea9409/src/history.ts`
- `claw-code/en/concepts/memory-context.md`

Observed direction:

- `claude-code` builds a dedicated `messagesForQuery` window.
- It stages compaction through boundary filtering, tool-result budgeting, snip, microcompact, and autocompact.
- It treats memory and post-compaction query windows as first-class parts of the request path.

AX already has similar mechanisms, but the Code flow still lacks stronger working-set preservation and cleaner invariant handling.

## Remediation Plan

### Phase 1. Context observability and bootstrap repair

- Reference targets:
  - `claw-code/.../src/query.ts`
  - `claw-code/en/concepts/memory-context.md`
- AX targets:
  - `src/AxCopilot/Views/ChatWindow.UtilityPresentation.cs`
  - `src/AxCopilot/Services/Agent/WorkspaceContextGenerator.cs`
  - `src/AxCopilot/Services/Agent/AgentQueryContextBuilder.cs`
  - `src/AxCopilot/Services/Agent/AgentLoopLlmRequestPreparationService.cs`
- Work items:
  - guarantee workspace-context generation starts even on first miss
  - log the exact context sections injected into each request
  - add diagnostics for omitted sections
- Done criteria:
  - empty-workspace runs show workspace context generation by loop 2
  - logs show section names, sizes, and compaction status
- Quality scenario:
  - a fresh `E:\code` WPF scaffolding run should show folder and project context in the first two request cycles

### Phase 2. Code working-set memory layer

- Reference targets:
  - `claw-code/.../src/query.ts`
  - `claw-code/.../src/history.ts`
  - Anthropic memory docs
- AX targets:
  - new `CodeTaskWorkingSetService`
  - `src/AxCopilot/Services/Agent/SessionLearningCollector.cs`
  - `src/AxCopilot/Services/Agent/AgentLoopService.cs`
  - `src/AxCopilot/Services/Agent/AgentQueryContextBuilder.cs`
- Work items:
  - maintain a stable structured ledger with:
    - current goal
    - selected architecture
    - changed files
    - latest successful writes
    - open diagnostics
    - next repair target
  - inject it only when changed
  - replace superseded failures with the latest active issue
- Done criteria:
  - long Code runs keep a single coherent working-set block without noisy duplication
  - build and test failures are preserved as part of the working set
- Quality scenario:
  - after fixing `MC3089`, the run should still remember the earlier structure change while focusing on the new `CS0017` entry-point failure

### Phase 3. Task-aware pruning and protected evidence

- Reference targets:
  - `claw-code/.../src/query.ts`
  - SWE-Pruner
- AX targets:
  - `src/AxCopilot/Services/Agent/AgentToolResultBudget.cs`
  - `src/AxCopilot/Services/Agent/ContextCondenser.cs`
  - `src/AxCopilot/Services/Agent/AgentQueryContextBuilder.cs`
- Work items:
  - protect:
    - latest build error block
    - latest test failure block
    - current plan or working set
    - latest folder tree snapshot
    - last N write diffs
  - move from pure char-based truncation toward semantic snapshots
  - tune compaction rules specifically for Code tasks
- Done criteria:
  - active repair evidence survives across loops until superseded
  - older noise shrinks without losing the current failure context
- Quality scenario:
  - a 30-plus-loop Code run should still preserve the latest failure and target files in the request payload

### Phase 4. Tool-trace invariant hardening

- Reference targets:
  - `claw-code/.../src/query.ts`
  - `claw-code/.../src/history.ts`
- AX targets:
  - `src/AxCopilot/Services/LlmService.ToolUse.cs`
  - `src/AxCopilot/Services/Agent/AgentMessageInvariantHelper.cs`
  - `src/AxCopilot/Services/Agent/AgentLoopService.cs`
- Work items:
  - shift from after-the-fact flattening to pre-request validation and normalization
  - classify mismatch and orphan causes and lock them with regression tests
  - add a final integrity pass before query submission
- Done criteria:
  - standard Code runs approach zero mismatch or orphan repair logs
  - assistant, tool, and tool_result chains remain intact end to end
- Quality scenario:
  - a 50-loop Code run should complete without repeated tool-trace repair events

### Phase 5. Encoding hygiene and prompt cleanup

- Reference targets:
  - Anthropic memory docs
  - OpenAI practical guide eval and observability recommendations
- AX targets:
  - `src/AxCopilot/Views/ChatWindow.SystemPromptBuilder.cs`
  - `src/AxCopilot/Services/Agent/SessionLearningCollector.cs`
  - active status, prompt, and catalog files
  - `AGENTS.md`
- Work items:
  - enforce English-only comments in code files
  - rewrite mojibake strings in active prompt paths into English
  - add long-run Code evals to catch prompt and status encoding regressions
- Done criteria:
  - no broken strings remain in active prompt or status paths
  - touched code files keep English comments only
- Quality scenario:
  - Windows Korean environments should show readable build, test, and status output without mojibake feedback loops

## Priority

1. Phase 1: bootstrap and observability
2. Phase 2: working-set memory
3. Phase 3: task-aware pruning
4. Phase 4: tool-trace invariants
5. Phase 5: encoding and prompt cleanup

## Expected Outcome

- fewer repeated build-failure loops
- better structural consistency for project generation and large edits
- less drift in long-running Code tasks
- fewer quality losses caused by broken strings and low-signal context replacements

## Latest Delivery

Updated: 2026-04-16 01:41 (KST)

- Delivered in this pass:
  - Phase 1 foundation:
    - `ChatWindow.UtilityPresentation.cs` now bootstraps workspace context generation on first access and returns language-workflow fallback hints while `.ax-context.md` is still being generated.
    - `AgentLoopService.cs` now records `query_context` workflow transitions with query-window, budget, supplemental-context, and working-set summaries.
  - Phase 2 foundation:
    - `CodeTaskWorkingSetService.cs` adds a Code-only structured ledger for:
      - goal
      - selected scaffold/profile
      - created directories
      - recent reads/writes
      - latest diagnostics
      - next repair focus
    - the working set is injected into each Code request as a supplemental `code_working_set` system message.
  - Phase 3 foundation:
    - `AgentToolResultBudget.cs` and `AgentQueryContextBuilder.cs` now expose a `code` query profile with a larger protected-recent window and larger retained budgets for `build_run`, `test_loop`, `process`, `file_read`, `multi_read`, `lsp_code_intel`, and `git_tool`.
  - Phase 4 observability step:
    - `LlmService.ToolUse.cs` now logs sanitization counts for flattened assistant tool traces and converted orphan tool messages, so tool-trace repair frequency can be measured per run.
- Remaining follow-up:
  - extend pre-request tool-trace validation so the flattening/orphan repair count trends toward zero rather than being logged after repair
  - replace more mojibake prompt/status strings in active Code execution paths with English equivalents

Updated: 2026-04-16 01:57 (KST)

- Delivered in this pass:
  - Phase 4 partial delivery:
    - `AgentMessageInvariantHelper.cs` now normalizes historical tool traces before the request leaves the agent loop.
    - structured assistant tool-call messages without matching `tool_result` now flatten into plain assistant transcript text.
    - orphan `tool_result` messages now flatten into plain user transcript text instead of relying only on late OpenAI payload repair.
    - `AgentLoopLlmRequestPreparationService.cs` clones the query window first, then applies normalization, so request cleanup does not mutate stored conversation history.
    - `AgentLoopContextReliability.cs` now logs tool-trace repair counts inside the `query_context` transition for run-by-run observability.
  - Phase 5 partial delivery:
    - `SessionLearningCollector.cs` was rewritten with English-only comments and English injection text.
    - `AgentLoopDiagnosticsFormatter.cs` no longer emits mojibake-prone compaction status text in active Code paths.
- Remaining follow-up:
  - measure whether `tool_trace_repair` counts keep trending down in long Code runs after this preflight normalization
  - continue replacing older mojibake strings outside the active Code execution path

Updated: 2026-04-16 02:05 (KST)

- Delivered in this pass:
  - structural alignment step:
    - `AgentLoopQueryAssemblyService.cs` now owns the staged query/history assembly path.
    - `PrepareHistory(...)` handles session-learning refresh plus queued-command/query-window preparation.
    - `PrepareRequest(...)` handles Code working-set supplemental context and request-message assembly before dispatch.
    - `AgentLoopService.cs` now delegates those responsibilities instead of manually stitching them together inline.
  - test and encoding hygiene step:
    - `AgentLoopQueryAssemblyServiceTests.cs` locks the new staged assembly behavior.
    - `SessionLearningCollectorTests.cs` was rewritten to English-only comments and assertions to match the new repository rule.
- Remaining follow-up:
  - keep extracting more inline AgentLoop responsibilities into smaller staged services where it improves observability or retry correctness
  - continue measuring long Code runs against claw-code-style continuity scenarios

Updated: 2026-04-16 02:13 (KST)

- Delivered in this pass:
  - structural alignment step:
    - `AgentLoopPreLlmStageService.cs` now owns the iteration decisions immediately before the LLM call.
    - the service centralizes:
      - thinking-summary selection
      - Gemini free-tier delay planning
      - user-prompt submit hook fingerprint/payload planning
      - missing-tool guard shaping
      - request assembly handoff
    - `AgentLoopService.cs` now consumes that stage result instead of computing those branches inline.
  - test coverage step:
    - `AgentLoopPreLlmStageServiceTests.cs` now locks the new pre-LLM decision layer.
- Remaining follow-up:
  - continue extracting the actual LLM dispatch / streaming callback branch into a narrower execution service
  - compare long-running Code traces against claw-code-style staged transitions and keep reducing inline loop logic