Reflect the code context stabilization plan in the guidelines and docs

- Add rules to AGENTS.md for English-only comments in code files and for cleaning up encoding-damaged strings.
- Re-analyze recent Code tab run logs and summarize why context fidelity degrades even as the message count grows.
- Add a long-term remediation plan document covering Code working set, task-aware pruning, tool trace invariant, and bootstrap observability.
- Record the analysis results as of 2026-04-16 01:28 KST and the follow-up plan in the README and DEVELOPMENT docs.
- Verification: dotnet build src\\AxCopilot\\AxCopilot.csproj -c Release -v minimal -p:OutputPath=bin\\verify_context_plan_docs\\ -p:IntermediateOutputPath=obj\\verify_context_plan_docs\\ (0 warnings / 0 errors)
2026-04-16 01:20:14 +09:00
parent e2278eec24
commit eb884e9263
4 changed files with 277 additions and 0 deletions
--- a/docs/CODE_CONTEXT_RELIABILITY_PLAN.md
+++ b/docs/CODE_CONTEXT_RELIABILITY_PLAN.md
@@ -0,0 +1,249 @@
+# Code Context Reliability Plan
+
+Update: 2026-04-16 01:37 (KST)
+
+## Background
+
+Recent Code tab runs show that the LLM request payload is still growing over time. In the `2026-04-16 00:46:26` to `00:50:52` run, the request size grew from `messages=7` to `messages=125`. That means the failure mode is not "context does not grow at all." The real problem is context fidelity: detailed evidence that the model still needs is being replaced too quickly by previews, repair notes, and low-signal summaries.
+
+The same log window repeatedly shows:
+
+- `tool_calls/tool mismatch detected - flattening assistant message`
+- `orphan tool message detected - converting to user`
+- repeated rereads of nearby files after build failures
+- shifting build failures such as `MC3089` followed by `CS0017` without a stable working set that preserves what was already changed and what remains broken
+
+In short, the current system grows the raw message count but does not preserve a stable working set for long-running code tasks.
+
+## Current Findings
+
+### 1. Workspace context bootstrap is weak on first load
+
+- AX targets:
+  - `src/AxCopilot/Views/ChatWindow.UtilityPresentation.cs`
+  - `src/AxCopilot/Services/Agent/WorkspaceContextGenerator.cs`
+- Finding:
+  - When `.ax-context.md` is missing, the first Code request can return before background workspace-context generation becomes useful.
+- Impact:
+  - Empty-workspace and fresh-project tasks start without a reliable folder or project summary in the early loops.
+
+### 2. Build and file evidence is compacted too aggressively
+
+- AX targets:
+  - `src/AxCopilot/Services/Agent/AgentToolResultBudget.cs`
+  - `src/AxCopilot/Services/Agent/ContextCondenser.cs`
+- Current values:
+  - `DefaultSoftCharLimit = 900`
+  - `DefaultAggregateBudgetChars = 7_500`
+  - `RecentKeepCount = 6`
+- Impact:
+  - Code tasks lose detailed build, test, and file-read evidence too early and fall back to previews instead of actionable context.
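A minimal sketch of how purely size-based compaction can erase a build log. It assumes the three values above act as a per-result soft cap, a total character budget, and a count of most recent results exempt from the soft cap. Those semantics and the `compact_by_size` helper are assumptions for illustration; the actual `AgentToolResultBudget` and `ContextCondenser` logic is C# and may differ. Python is used only to keep the example short.

```python
SOFT_CHAR_LIMIT = 900            # mirrors DefaultSoftCharLimit
AGGREGATE_BUDGET_CHARS = 7_500   # mirrors DefaultAggregateBudgetChars
RECENT_KEEP_COUNT = 6            # mirrors RecentKeepCount


def compact_by_size(tool_results):
    """Apply an assumed size-only budget to a list of tool result strings."""
    recent_start = max(0, len(tool_results) - RECENT_KEEP_COUNT)
    compacted = []
    for index, text in enumerate(tool_results):
        # Older results are cut to a preview; the most recent ones are kept whole.
        if index < recent_start and len(text) > SOFT_CHAR_LIMIT:
            text = text[:SOFT_CHAR_LIMIT]
        compacted.append(text)

    # Enforce the aggregate budget by trimming the oldest entries first.
    total = sum(len(text) for text in compacted)
    for index in range(len(compacted)):
        if total <= AGGREGATE_BUDGET_CHARS:
            break
        cut = min(total - AGGREGATE_BUDGET_CHARS, len(compacted[index]))
        compacted[index] = compacted[index][: len(compacted[index]) - cut]
        total -= cut
    return compacted
```

Under these assumptions, a 4,000-character build log that is seven results old shrinks to 900 characters, and once the aggregate budget is exceeded the oldest results are trimmed first, whether or not one of them holds the only copy of the active build error.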
+
+### 3. Session learning is not a durable code working set
+
+- AX targets:
+  - `src/AxCopilot/Services/Agent/SessionLearningCollector.cs`
+  - `src/AxCopilot/Services/Agent/AgentLoopService.cs`
+  - `src/AxCopilot/Views/ChatWindow.SystemPromptBuilder.cs`
+- Finding:
+  - Session learnings are injected every loop, but they are not structured strongly enough to lock in:
+    - current goal
+    - current architecture
+    - changed files
+    - latest build or test failure
+    - next repair target
+- Impact:
+  - The model must repeatedly reconstruct project state from noisy history instead of reading a stable code-task memory layer.
+
+### 4. Tool-trace invariant repairs are too common
+
+- AX targets:
+  - `src/AxCopilot/Services/LlmService.ToolUse.cs`
+  - `src/AxCopilot/Services/Agent/AgentMessageInvariantHelper.cs`
+  - `src/AxCopilot/Services/Agent/AgentLoopService.cs`
+- Finding:
+  - The recent logs show repeated mismatch and orphan corrections.
+- Impact:
+  - Even if the total message count grows, the semantic chain between assistant reasoning, tool call, and tool result becomes less reliable.
+
+### 5. There is no Code-specific working-set layer
+
+- AX targets:
+  - new service required
+  - injection path should go through:
+    - `AgentLoopLlmRequestPreparationService`
+    - `AgentQueryContextBuilder`
+    - `AgentLoopService`
+- Finding:
+  - The current request mixes raw chat history, session learnings, project context, and workspace context, but it does not maintain a dedicated code-task state ledger.
+- Impact:
+  - Long runs become increasingly inconsistent because the model keeps rediscovering facts that should already be pinned in a working set.
+
+## External Research Notes
+
+### Anthropic Claude Code memory docs
+
+- Claude Code explicitly documents memory files that are auto-loaded at startup and inspectable via `/memory`.
+- Planning implication:
+  - AX should have a clearly observable memory hierarchy for Code tasks, including what was auto-loaded and why.
+- Source:
+  - [Anthropic Claude Code memory docs](https://docs.anthropic.com/zh-CN/docs/claude-code/memory)
+
+### OpenAI practical guide to building agents
+
+- The guide emphasizes observability, eval baselines, and explicit tool and system design before optimizing agent behavior.
+- Planning implication:
+  - AX should log the exact context sections that enter each Code request, including what was compacted and why.
+- Source:
+  - [OpenAI practical guide to building agents](https://cdn.openai.com/business-guides-and-resources/a-practical-guide-to-building-agents.pdf)
+
+### SWE-Pruner
+
+- The paper argues that task-aware adaptive pruning outperforms naive fixed truncation for coding agents.
+- Planning implication:
+  - AX should protect code-task evidence such as latest build failures and changed-file summaries instead of applying mostly size-based pruning.
+- Source:
+  - [SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents](https://arxiv.org/abs/2601.16746)
+
+## `claude-code` Reference Points
+
+Reference targets:
+
+- `claw-code/claw-code-f5a40b86dede580f6543bf8926c9af017eea9409/src/query.ts`
+- `claw-code/claw-code-f5a40b86dede580f6543bf8926c9af017eea9409/src/history.ts`
+- `claw-code/en/concepts/memory-context.md`
+
+Observed direction:
+
+- `claude-code` builds a dedicated `messagesForQuery` window.
+- It stages compaction through boundary filtering, tool-result budgeting, snip, microcompact, and autocompact.
+- It treats memory and post-compaction query windows as first-class parts of the request path.
+
+AX already has similar mechanisms, but the Code flow still lacks stronger working-set preservation and cleaner invariant handling.
+
+## Remediation Plan
+
+### Phase 1. Context observability and bootstrap repair
+
+- Reference targets:
+  - `claw-code/.../src/query.ts`
+  - `claw-code/en/concepts/memory-context.md`
+- AX targets:
+  - `src/AxCopilot/Views/ChatWindow.UtilityPresentation.cs`
+  - `src/AxCopilot/Services/Agent/WorkspaceContextGenerator.cs`
+  - `src/AxCopilot/Services/Agent/AgentQueryContextBuilder.cs`
+  - `src/AxCopilot/Services/Agent/AgentLoopLlmRequestPreparationService.cs`
+- Work items:
+  - guarantee workspace-context generation starts even on first miss
+  - log the exact context sections injected into each request
+  - add diagnostics for omitted sections
+- Done criteria:
+  - empty-workspace runs show workspace context generation by loop 2
+  - logs show section names, sizes, and compaction status
+- Quality scenario:
+  - a fresh `E:\code` WPF scaffolding run should show folder and project context in the first two request cycles
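The Phase 1 done criteria (section names, sizes, compaction status, and diagnostics for omitted sections) could be met by a log line per context section. The sketch below is one possible shape, not the AX implementation; `ContextSection`, `describe_sections`, and the log field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ContextSection:
    name: str
    text: str = ""
    compacted: bool = False
    omitted_reason: str = ""  # empty string means the section was injected


def describe_sections(sections):
    """Build one log line per context section of an outgoing request."""
    lines = []
    for section in sections:
        if section.omitted_reason:
            lines.append(
                f"section={section.name} status=omitted reason={section.omitted_reason}"
            )
        else:
            status = "compacted" if section.compacted else "full"
            lines.append(
                f"section={section.name} chars={len(section.text)} status={status}"
            )
    return lines
```

Logging omitted sections explicitly is the part that would make a first-load miss of `.ax-context.md` visible, since an absent section otherwise leaves no trace in the request log.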
+
+### Phase 2. Code working-set memory layer
+
+- Reference targets:
+  - `claw-code/.../src/query.ts`
+  - `claw-code/.../src/history.ts`
+  - Anthropic memory docs
+- AX targets:
+  - new `CodeTaskWorkingSetService`
+  - `src/AxCopilot/Services/Agent/SessionLearningCollector.cs`
+  - `src/AxCopilot/Services/Agent/AgentLoopService.cs`
+  - `src/AxCopilot/Services/Agent/AgentQueryContextBuilder.cs`
+- Work items:
+  - maintain a stable structured ledger with:
+    - current goal
+    - selected architecture
+    - changed files
+    - latest successful writes
+    - open diagnostics
+    - next repair target
+  - inject it only when changed
+  - replace superseded failures with the latest active issue
+- Done criteria:
+  - long Code runs keep a single coherent working-set block without noisy duplication
+  - build and test failures are preserved as part of the working set
+- Quality scenario:
+  - after fixing `MC3089`, the run should still remember the earlier structure change while focusing on the new `CS0017` entry-point failure
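A rough sketch of the ledger described in the Phase 2 work items, assuming it is rendered as one text block and re-injected only when its content changes. The class name, field names, and rendering format are hypothetical and are not the planned `CodeTaskWorkingSetService` API; the diagnostic messages in the usage example are placeholders.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class CodeWorkingSet:
    current_goal: str = ""
    selected_architecture: str = ""
    changed_files: list = field(default_factory=list)
    latest_successful_writes: list = field(default_factory=list)
    open_diagnostics: dict = field(default_factory=dict)  # code -> message
    next_repair_target: str = ""
    _last_injected_digest: str = field(default="", repr=False)

    def record_write(self, path):
        """Remember a successful write without duplicating changed files."""
        if path not in self.changed_files:
            self.changed_files.append(path)
        self.latest_successful_writes.append(path)

    def replace_diagnostics(self, diagnostics):
        """The latest build or test result supersedes earlier failures."""
        self.open_diagnostics = dict(diagnostics)

    def render(self):
        payload = {
            "current_goal": self.current_goal,
            "selected_architecture": self.selected_architecture,
            "changed_files": self.changed_files,
            "latest_successful_writes": self.latest_successful_writes,
            "open_diagnostics": self.open_diagnostics,
            "next_repair_target": self.next_repair_target,
        }
        return json.dumps(payload, indent=2, sort_keys=True)

    def block_if_changed(self):
        """Return the rendered block, or None if nothing changed since last call."""
        text = self.render()
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest == self._last_injected_digest:
            return None
        self._last_injected_digest = digest
        return text
```

In the quality scenario above, calling `replace_diagnostics({"CS0017": "..."})` after the `MC3089` fix drops the old failure from the block while `changed_files` still lists the earlier structural edits.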
+
+### Phase 3. Task-aware pruning and protected evidence
+
+- Reference targets:
+  - `claw-code/.../src/query.ts`
+  - SWE-Pruner
+- AX targets:
+  - `src/AxCopilot/Services/Agent/AgentToolResultBudget.cs`
+  - `src/AxCopilot/Services/Agent/ContextCondenser.cs`
+  - `src/AxCopilot/Services/Agent/AgentQueryContextBuilder.cs`
+- Work items:
+  - protect:
+    - latest build error block
+    - latest test failure block
+    - current plan or working set
+    - latest folder tree snapshot
+    - last N write diffs
+  - move from pure char-based truncation toward semantic snapshots
+  - tune compaction rules specifically for Code tasks
+- Done criteria:
+  - active repair evidence survives across loops until superseded
+  - older noise shrinks without losing the current failure context
+- Quality scenario:
+  - a 30-plus-loop Code run should still preserve the latest failure and target files in the request payload
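One way to read "protected evidence" is that the newest entry of each protected kind, plus the last N write diffs, is exempt from pruning until superseded. The sketch below illustrates that reading only; the `Entry` type, the kind labels, and the `prune` function are hypothetical, and it is not the SWE-Pruner algorithm.

```python
from dataclasses import dataclass

# Kinds whose most recent entry must survive pruning (assumed labels).
PROTECTED_KINDS = {"build_error", "test_failure", "working_set", "folder_tree"}


@dataclass
class Entry:
    kind: str
    text: str


def prune(entries, budget_chars, keep_write_diffs=2):
    """Drop the oldest unprotected entries until the budget holds."""
    protected = set()
    seen_kinds = set()
    diffs_kept = 0
    # Walk newest to oldest so the first hit per kind is the latest one.
    for index in range(len(entries) - 1, -1, -1):
        kind = entries[index].kind
        if kind in PROTECTED_KINDS and kind not in seen_kinds:
            seen_kinds.add(kind)
            protected.add(index)
        elif kind == "write_diff" and diffs_kept < keep_write_diffs:
            diffs_kept += 1
            protected.add(index)

    total = sum(len(entry.text) for entry in entries)
    kept = []
    for index, entry in enumerate(entries):
        if total > budget_chars and index not in protected:
            total -= len(entry.text)
            continue
        kept.append(entry)
    return kept
```

Note that in this sketch a superseded build error loses protection as soon as a newer one arrives, and protected entries are kept even if they alone exceed the budget; a real implementation would need a policy for that case.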
+
+### Phase 4. Tool-trace invariant hardening
+
+- Reference targets:
+  - `claw-code/.../src/query.ts`
+  - `claw-code/.../src/history.ts`
+- AX targets:
+  - `src/AxCopilot/Services/LlmService.ToolUse.cs`
+  - `src/AxCopilot/Services/Agent/AgentMessageInvariantHelper.cs`
+  - `src/AxCopilot/Services/Agent/AgentLoopService.cs`
+- Work items:
+  - shift from after-the-fact flattening to pre-request validation and normalization
+  - classify mismatch and orphan causes and lock them with regression tests
+  - add a final integrity pass before query submission
+- Done criteria:
+  - standard Code runs approach zero mismatch or orphan repair logs
+  - assistant, tool, and tool_result chains remain intact end to end
+- Quality scenario:
+  - a 50-loop Code run should complete without repeated tool-trace repair events
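The "final integrity pass" could be a read-only check that reports violations before submission instead of flattening messages afterwards. This sketch assumes OpenAI-style chat messages, where an assistant message carries `tool_calls` with `id` fields and each following `tool` message carries a matching `tool_call_id`; whether AX uses exactly this shape internally is an assumption, and `find_trace_violations` is a hypothetical name.

```python
def find_trace_violations(messages):
    """Return (index, reason) pairs for broken assistant/tool pairings."""
    violations = []
    pending = set()  # tool call ids still waiting for a tool result
    for index, message in enumerate(messages):
        role = message.get("role")
        if role == "tool":
            call_id = message.get("tool_call_id")
            if call_id in pending:
                pending.discard(call_id)
            else:
                violations.append((index, "orphan_tool_message"))
            continue
        # Any non-tool message closes the previous assistant's tool window.
        if pending:
            violations.append((index, "missing_tool_result"))
            pending = set()
        if role == "assistant":
            pending = {call["id"] for call in (message.get("tool_calls") or [])}
    if pending:
        violations.append((len(messages), "missing_tool_result"))
    return violations
```

Returning reasons rather than repairing in place also gives the classification needed for the regression tests named in the work items.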
+
+### Phase 5. Encoding hygiene and prompt cleanup
+
+- Reference targets:
+  - Anthropic memory docs
+  - OpenAI practical guide eval and observability recommendations
+- AX targets:
+  - `src/AxCopilot/Views/ChatWindow.SystemPromptBuilder.cs`
+  - `src/AxCopilot/Services/Agent/SessionLearningCollector.cs`
+  - active status, prompt, and catalog files
+  - `AGENTS.md`
+- Work items:
+  - enforce English-only comments in code files
+  - rewrite mojibake strings in active prompt paths into English
+  - add long-run Code evals to catch prompt and status encoding regressions
+- Done criteria:
+  - no broken strings remain in active prompt or status paths
+  - touched code files keep English comments only
+- Quality scenario:
+  - Windows Korean environments should show readable build, test, and status output without mojibake feedback loops
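An encoding regression check for prompt and status strings might start from a simple heuristic like the one below. It flags only the Unicode replacement character (U+FFFD) and C1 control characters, which typically appear when UTF-8 bytes are decoded as Latin-1; it will miss other kinds of mojibake, such as UTF-8 bytes decoded as CP949 that happen to form valid characters. The function names are hypothetical.

```python
def looks_garbled(text):
    """Heuristic: replacement characters or C1 controls suggest a bad decode."""
    for ch in text:
        code = ord(ch)
        if code == 0xFFFD or 0x80 <= code <= 0x9F:
            return True
    return False


def find_garbled(strings):
    """Return the keys of a name -> text mapping whose values look garbled."""
    return [name for name, value in strings.items() if looks_garbled(value)]
```

Legitimate Hangul text passes this check, so it can run over Korean status output without flagging every non-ASCII string.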
+
+## Priority
+
+1. Phase 1: bootstrap and observability
+2. Phase 2: working-set memory
+3. Phase 3: task-aware pruning
+4. Phase 4: tool-trace invariants
+5. Phase 5: encoding and prompt cleanup
+
+## Expected Outcome
+
+- fewer repeated build-failure loops
+- better structural consistency for project generation and large edits
+- less drift in long-running Code tasks
+- fewer quality losses caused by broken strings and low-signal context replacements