# Code Context Reliability Plan Update: 2026-04-16 01:37 (KST) ## Background Recent Code tab runs show that the LLM request payload is still growing over time. In the `2026-04-16 00:46:26` to `00:50:52` run, the request size grew from `messages=7` to `messages=125`. That means the failure mode is not "context does not grow at all." The real problem is context fidelity: detailed evidence that the model still needs is being replaced too quickly by previews, repair notes, and low-signal summaries. The same log window repeatedly shows: - `tool_calls/tool mismatch detected - flattening assistant message` - `orphan tool message detected - converting to user` - repeated rereads of nearby files after build failures - shifting build failures such as `MC3089` followed by `CS0017` without a stable working set that preserves what was already changed and what remains broken In short, the current system grows the raw message count but does not preserve a stable working set for long-running code tasks. ## Current Findings ### 1. Workspace context bootstrap is weak on first load - AX targets: - `src/AxCopilot/Views/ChatWindow.UtilityPresentation.cs` - `src/AxCopilot/Services/Agent/WorkspaceContextGenerator.cs` - Finding: - When `.ax-context.md` is missing, the first Code request can return before background workspace-context generation becomes useful. - Impact: - Empty-workspace and fresh-project tasks start without a reliable folder or project summary in the early loops. ### 2. Build and file evidence is compacted too aggressively - AX targets: - `src/AxCopilot/Services/Agent/AgentToolResultBudget.cs` - `src/AxCopilot/Services/Agent/ContextCondenser.cs` - Current values: - `DefaultSoftCharLimit = 900` - `DefaultAggregateBudgetChars = 7_500` - `RecentKeepCount = 6` - Impact: - Code tasks lose detailed build, test, and file-read evidence too early and fall back to previews instead of actionable context. ### 3. Session learning is not a durable code working set - AX targets: - `src/AxCopilot/Services/Agent/SessionLearningCollector.cs` - `src/AxCopilot/Services/Agent/AgentLoopService.cs` - `src/AxCopilot/Views/ChatWindow.SystemPromptBuilder.cs` - Finding: - Session learnings are injected every loop, but they are not structured strongly enough to lock in: - current goal - current architecture - changed files - latest build or test failure - next repair target - Impact: - The model must repeatedly reconstruct project state from noisy history instead of reading a stable code-task memory layer. ### 4. Tool-trace invariant repairs are too common - AX targets: - `src/AxCopilot/Services/LlmService.ToolUse.cs` - `src/AxCopilot/Services/Agent/AgentMessageInvariantHelper.cs` - `src/AxCopilot/Services/Agent/AgentLoopService.cs` - Finding: - The recent logs show repeated mismatch and orphan corrections. - Impact: - Even if the total message count grows, the semantic chain between assistant reasoning, tool call, and tool result becomes less reliable. ### 5. There is no Code-specific working-set layer - AX targets: - new service required - injection path should go through: - `AgentLoopLlmRequestPreparationService` - `AgentQueryContextBuilder` - `AgentLoopService` - Finding: - The current request mixes raw chat history, session learnings, project context, and workspace context, but it does not maintain a dedicated code-task state ledger. - Impact: - Long-running runs become increasingly inconsistent because the model keeps rediscovering facts that should already be fixed in memory. ## External Research Notes ### Anthropic Claude Code memory docs - Claude Code explicitly documents memory files that are auto-loaded at startup and inspectable via `/memory`. - Planning implication: - AX should have a clearly observable memory hierarchy for Code tasks, including what was auto-loaded and why. - Source: - [Anthropic Claude Code memory docs](https://docs.anthropic.com/zh-CN/docs/claude-code/memory) ### OpenAI practical guide to building agents - The guide emphasizes observability, eval baselines, and explicit tool and system design before optimizing agent behavior. - Planning implication: - AX should log the exact context sections that enter each Code request, including what was compacted and why. - Source: - [OpenAI practical guide to building agents](https://cdn.openai.com/business-guides-and-resources/a-practical-guide-to-building-agents.pdf) ### SWE-Pruner - The paper argues that task-aware adaptive pruning outperforms naive fixed truncation for coding agents. - Planning implication: - AX should protect code-task evidence such as latest build failures and changed-file summaries instead of applying mostly size-based pruning. - Source: - [SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents](https://arxiv.org/abs/2601.16746) ## `claude-code` Reference Points Reference targets: - `claw-code/claw-code-f5a40b86dede580f6543bf8926c9af017eea9409/src/query.ts` - `claw-code/claw-code-f5a40b86dede580f6543bf8926c9af017eea9409/src/history.ts` - `claw-code/en/concepts/memory-context.md` Observed direction: - `claude-code` builds a dedicated `messagesForQuery` window. - It stages compaction through boundary filtering, tool-result budgeting, snip, microcompact, and autocompact. - It treats memory and post-compaction query windows as first-class parts of the request path. AX already has similar mechanisms, but the Code flow still lacks stronger working-set preservation and cleaner invariant handling. ## Remediation Plan ### Phase 1. Context observability and bootstrap repair - Reference targets: - `claw-code/.../src/query.ts` - `claw-code/en/concepts/memory-context.md` - AX targets: - `src/AxCopilot/Views/ChatWindow.UtilityPresentation.cs` - `src/AxCopilot/Services/Agent/WorkspaceContextGenerator.cs` - `src/AxCopilot/Services/Agent/AgentQueryContextBuilder.cs` - `src/AxCopilot/Services/Agent/AgentLoopLlmRequestPreparationService.cs` - Work items: - guarantee workspace-context generation starts even on first miss - log the exact context sections injected into each request - add diagnostics for omitted sections - Done criteria: - empty-workspace runs show workspace context generation by loop 2 - logs show section names, sizes, and compaction status - Quality scenario: - a fresh `E:\code` WPF scaffolding run should show folder and project context in the first two request cycles ### Phase 2. Code working-set memory layer - Reference targets: - `claw-code/.../src/query.ts` - `claw-code/.../src/history.ts` - Anthropic memory docs - AX targets: - new `CodeTaskWorkingSetService` - `src/AxCopilot/Services/Agent/SessionLearningCollector.cs` - `src/AxCopilot/Services/Agent/AgentLoopService.cs` - `src/AxCopilot/Services/Agent/AgentQueryContextBuilder.cs` - Work items: - maintain a stable structured ledger with: - current goal - selected architecture - changed files - latest successful writes - open diagnostics - next repair target - inject it only when changed - replace superseded failures with the latest active issue - Done criteria: - long Code runs keep a single coherent working-set block without noisy duplication - build and test failures are preserved as part of the working set - Quality scenario: - after fixing `MC3089`, the run should still remember the earlier structure change while focusing on the new `CS0017` entry-point failure ### Phase 3. Task-aware pruning and protected evidence - Reference targets: - `claw-code/.../src/query.ts` - SWE-Pruner - AX targets: - `src/AxCopilot/Services/Agent/AgentToolResultBudget.cs` - `src/AxCopilot/Services/Agent/ContextCondenser.cs` - `src/AxCopilot/Services/Agent/AgentQueryContextBuilder.cs` - Work items: - protect: - latest build error block - latest test failure block - current plan or working set - latest folder tree snapshot - last N write diffs - move from pure char-based truncation toward semantic snapshots - tune compaction rules specifically for Code tasks - Done criteria: - active repair evidence survives across loops until superseded - older noise shrinks without losing the current failure context - Quality scenario: - a 30-plus-loop Code run should still preserve the latest failure and target files in the request payload ### Phase 4. Tool-trace invariant hardening - Reference targets: - `claw-code/.../src/query.ts` - `claw-code/.../src/history.ts` - AX targets: - `src/AxCopilot/Services/LlmService.ToolUse.cs` - `src/AxCopilot/Services/Agent/AgentMessageInvariantHelper.cs` - `src/AxCopilot/Services/Agent/AgentLoopService.cs` - Work items: - shift from after-the-fact flattening to pre-request validation and normalization - classify mismatch and orphan causes and lock them with regression tests - add a final integrity pass before query submission - Done criteria: - standard Code runs approach zero mismatch or orphan repair logs - assistant, tool, and tool_result chains remain intact end to end - Quality scenario: - a 50-loop Code run should complete without repeated tool-trace repair events ### Phase 5. Encoding hygiene and prompt cleanup - Reference targets: - Anthropic memory docs - OpenAI practical guide eval and observability recommendations - AX targets: - `src/AxCopilot/Views/ChatWindow.SystemPromptBuilder.cs` - `src/AxCopilot/Services/Agent/SessionLearningCollector.cs` - active status, prompt, and catalog files - `AGENTS.md` - Work items: - enforce English-only comments in code files - rewrite mojibake strings in active prompt paths into English - add long-run Code evals to catch prompt and status encoding regressions - Done criteria: - no broken strings remain in active prompt or status paths - touched code files keep English comments only - Quality scenario: - Windows Korean environments should show readable build, test, and status output without mojibake feedback loops ## Priority 1. Phase 1: bootstrap and observability 2. Phase 2: working-set memory 3. Phase 3: task-aware pruning 4. Phase 4: tool-trace invariants 5. Phase 5: encoding and prompt cleanup ## Expected Outcome - fewer repeated build-failure loops - better structural consistency for project generation and large edits - less drift in long-running Code tasks - fewer quality losses caused by broken strings and low-signal context replacements ## Latest Delivery Updated: 2026-04-16 01:41 (KST) - Delivered in this pass: - Phase 1 foundation: - `ChatWindow.UtilityPresentation.cs` now bootstraps workspace context generation on first access and returns language-workflow fallback hints while `.ax-context.md` is still being generated. - `AgentLoopService.cs` now records `query_context` workflow transitions with query-window, budget, supplemental-context, and working-set summaries. - Phase 2 foundation: - `CodeTaskWorkingSetService.cs` adds a Code-only structured ledger for: - goal - selected scaffold/profile - created directories - recent reads/writes - latest diagnostics - next repair focus - the working set is injected into each Code request as a supplemental `code_working_set` system message. - Phase 3 foundation: - `AgentToolResultBudget.cs` and `AgentQueryContextBuilder.cs` now expose a `code` query profile with a larger protected-recent window and larger retained budgets for `build_run`, `test_loop`, `process`, `file_read`, `multi_read`, `lsp_code_intel`, and `git_tool`. - Phase 4 observability step: - `LlmService.ToolUse.cs` now logs sanitization counts for flattened assistant tool traces and converted orphan tool messages, so tool-trace repair frequency can be measured per run. - Remaining follow-up: - extend pre-request tool-trace validation so the flattening/orphan repair count trends toward zero rather than being logged after repair - replace more mojibake prompt/status strings in active Code execution paths with English equivalents