Files
AX-Copilot-Codex/docs/CODE_CONTEXT_RELIABILITY_PLAN.md
lacvet 2e1c7be8c3 코드탭 query/history 조립 구조를 단계형 서비스로 분리
- AgentLoopQueryAssemblyService를 추가해 session learning refresh, queued command/query window 준비, code working set supplemental context 부착을 단계형으로 정리함

- AgentLoopService는 orchestration 중심으로 단순화하고 claw-code의 staged query/history 흐름과 비슷하게 책임을 재배치함

- AgentLoopQueryAssemblyServiceTests를 추가하고 SessionLearningCollectorTests를 영어 기준으로 정리했으며 dotnet build 및 targeted dotnet test(56 통과, 경고/오류 0)로 검증함
2026-04-16 02:07:26 +09:00

14 KiB

Code Context Reliability Plan

Update: 2026-04-16 01:37 (KST)

Background

Recent Code tab runs show that the LLM request payload is still growing over time. In the 2026-04-16 00:46:26 to 00:50:52 run, the request size grew from messages=7 to messages=125. That means the failure mode is not "context does not grow at all." The real problem is context fidelity: detailed evidence that the model still needs is being replaced too quickly by previews, repair notes, and low-signal summaries.

The same log window repeatedly shows:

  • tool_calls/tool mismatch detected - flattening assistant message
  • orphan tool message detected - converting to user
  • repeated rereads of nearby files after build failures
  • shifting build failures such as MC3089 followed by CS0017 without a stable working set that preserves what was already changed and what remains broken

In short, the current system grows the raw message count but does not preserve a stable working set for long-running code tasks.

Current Findings

1. Workspace context bootstrap is weak on first load

  • AX targets:
    • src/AxCopilot/Views/ChatWindow.UtilityPresentation.cs
    • src/AxCopilot/Services/Agent/WorkspaceContextGenerator.cs
  • Finding:
    • When .ax-context.md is missing, the first Code request can return before background workspace-context generation becomes useful.
  • Impact:
    • Empty-workspace and fresh-project tasks start without a reliable folder or project summary in the early loops.

2. Build and file evidence is compacted too aggressively

  • AX targets:
    • src/AxCopilot/Services/Agent/AgentToolResultBudget.cs
    • src/AxCopilot/Services/Agent/ContextCondenser.cs
  • Current values:
    • DefaultSoftCharLimit = 900
    • DefaultAggregateBudgetChars = 7_500
    • RecentKeepCount = 6
  • Impact:
    • Code tasks lose detailed build, test, and file-read evidence too early and fall back to previews instead of actionable context.

3. Session learning is not a durable code working set

  • AX targets:
    • src/AxCopilot/Services/Agent/SessionLearningCollector.cs
    • src/AxCopilot/Services/Agent/AgentLoopService.cs
    • src/AxCopilot/Views/ChatWindow.SystemPromptBuilder.cs
  • Finding:
    • Session learnings are injected every loop, but they are not structured strongly enough to lock in:
      • current goal
      • current architecture
      • changed files
      • latest build or test failure
      • next repair target
  • Impact:
    • The model must repeatedly reconstruct project state from noisy history instead of reading a stable code-task memory layer.

4. Tool-trace invariant repairs are too common

  • AX targets:
    • src/AxCopilot/Services/LlmService.ToolUse.cs
    • src/AxCopilot/Services/Agent/AgentMessageInvariantHelper.cs
    • src/AxCopilot/Services/Agent/AgentLoopService.cs
  • Finding:
    • The recent logs show repeated mismatch and orphan corrections.
  • Impact:
    • Even if the total message count grows, the semantic chain between assistant reasoning, tool call, and tool result becomes less reliable.

5. There is no Code-specific working-set layer

  • AX targets:
    • new service required
    • injection path should go through:
      • AgentLoopLlmRequestPreparationService
      • AgentQueryContextBuilder
      • AgentLoopService
  • Finding:
    • The current request mixes raw chat history, session learnings, project context, and workspace context, but it does not maintain a dedicated code-task state ledger.
  • Impact:
    • Long-running runs become increasingly inconsistent because the model keeps rediscovering facts that should already be fixed in memory.

External Research Notes

Anthropic Claude Code memory docs

  • Claude Code explicitly documents memory files that are auto-loaded at startup and inspectable via /memory.
  • Planning implication:
    • AX should have a clearly observable memory hierarchy for Code tasks, including what was auto-loaded and why.
  • Source:

OpenAI practical guide to building agents

  • The guide emphasizes observability, eval baselines, and explicit tool and system design before optimizing agent behavior.
  • Planning implication:
    • AX should log the exact context sections that enter each Code request, including what was compacted and why.
  • Source:

SWE-Pruner

  • The paper argues that task-aware adaptive pruning outperforms naive fixed truncation for coding agents.
  • Planning implication:
    • AX should protect code-task evidence such as latest build failures and changed-file summaries instead of applying mostly size-based pruning.
  • Source:

claude-code Reference Points

Reference targets:

  • claw-code/claw-code-f5a40b86dede580f6543bf8926c9af017eea9409/src/query.ts
  • claw-code/claw-code-f5a40b86dede580f6543bf8926c9af017eea9409/src/history.ts
  • claw-code/en/concepts/memory-context.md

Observed direction:

  • claude-code builds a dedicated messagesForQuery window.
  • It stages compaction through boundary filtering, tool-result budgeting, snip, microcompact, and autocompact.
  • It treats memory and post-compaction query windows as first-class parts of the request path.

AX already has similar mechanisms, but the Code flow still lacks stronger working-set preservation and cleaner invariant handling.

Remediation Plan

Phase 1. Context observability and bootstrap repair

  • Reference targets:
    • claw-code/.../src/query.ts
    • claw-code/en/concepts/memory-context.md
  • AX targets:
    • src/AxCopilot/Views/ChatWindow.UtilityPresentation.cs
    • src/AxCopilot/Services/Agent/WorkspaceContextGenerator.cs
    • src/AxCopilot/Services/Agent/AgentQueryContextBuilder.cs
    • src/AxCopilot/Services/Agent/AgentLoopLlmRequestPreparationService.cs
  • Work items:
    • guarantee workspace-context generation starts even on first miss
    • log the exact context sections injected into each request
    • add diagnostics for omitted sections
  • Done criteria:
    • empty-workspace runs show workspace context generation by loop 2
    • logs show section names, sizes, and compaction status
  • Quality scenario:
    • a fresh E:\code WPF scaffolding run should show folder and project context in the first two request cycles

Phase 2. Code working-set memory layer

  • Reference targets:
    • claw-code/.../src/query.ts
    • claw-code/.../src/history.ts
    • Anthropic memory docs
  • AX targets:
    • new CodeTaskWorkingSetService
    • src/AxCopilot/Services/Agent/SessionLearningCollector.cs
    • src/AxCopilot/Services/Agent/AgentLoopService.cs
    • src/AxCopilot/Services/Agent/AgentQueryContextBuilder.cs
  • Work items:
    • maintain a stable structured ledger with:
      • current goal
      • selected architecture
      • changed files
      • latest successful writes
      • open diagnostics
      • next repair target
    • inject it only when changed
    • replace superseded failures with the latest active issue
  • Done criteria:
    • long Code runs keep a single coherent working-set block without noisy duplication
    • build and test failures are preserved as part of the working set
  • Quality scenario:
    • after fixing MC3089, the run should still remember the earlier structure change while focusing on the new CS0017 entry-point failure

Phase 3. Task-aware pruning and protected evidence

  • Reference targets:
    • claw-code/.../src/query.ts
    • SWE-Pruner
  • AX targets:
    • src/AxCopilot/Services/Agent/AgentToolResultBudget.cs
    • src/AxCopilot/Services/Agent/ContextCondenser.cs
    • src/AxCopilot/Services/Agent/AgentQueryContextBuilder.cs
  • Work items:
    • protect:
      • latest build error block
      • latest test failure block
      • current plan or working set
      • latest folder tree snapshot
      • last N write diffs
    • move from pure char-based truncation toward semantic snapshots
    • tune compaction rules specifically for Code tasks
  • Done criteria:
    • active repair evidence survives across loops until superseded
    • older noise shrinks without losing the current failure context
  • Quality scenario:
    • a 30-plus-loop Code run should still preserve the latest failure and target files in the request payload

Phase 4. Tool-trace invariant hardening

  • Reference targets:
    • claw-code/.../src/query.ts
    • claw-code/.../src/history.ts
  • AX targets:
    • src/AxCopilot/Services/LlmService.ToolUse.cs
    • src/AxCopilot/Services/Agent/AgentMessageInvariantHelper.cs
    • src/AxCopilot/Services/Agent/AgentLoopService.cs
  • Work items:
    • shift from after-the-fact flattening to pre-request validation and normalization
    • classify mismatch and orphan causes and lock them with regression tests
    • add a final integrity pass before query submission
  • Done criteria:
    • standard Code runs approach zero mismatch or orphan repair logs
    • assistant, tool, and tool_result chains remain intact end to end
  • Quality scenario:
    • a 50-loop Code run should complete without repeated tool-trace repair events

Phase 5. Encoding hygiene and prompt cleanup

  • Reference targets:
    • Anthropic memory docs
    • OpenAI practical guide eval and observability recommendations
  • AX targets:
    • src/AxCopilot/Views/ChatWindow.SystemPromptBuilder.cs
    • src/AxCopilot/Services/Agent/SessionLearningCollector.cs
    • active status, prompt, and catalog files
    • AGENTS.md
  • Work items:
    • enforce English-only comments in code files
    • rewrite mojibake strings in active prompt paths into English
    • add long-run Code evals to catch prompt and status encoding regressions
  • Done criteria:
    • no broken strings remain in active prompt or status paths
    • touched code files keep English comments only
  • Quality scenario:
    • Windows Korean environments should show readable build, test, and status output without mojibake feedback loops

Priority

  1. Phase 1: bootstrap and observability
  2. Phase 2: working-set memory
  3. Phase 3: task-aware pruning
  4. Phase 4: tool-trace invariants
  5. Phase 5: encoding and prompt cleanup

Expected Outcome

  • fewer repeated build-failure loops
  • better structural consistency for project generation and large edits
  • less drift in long-running Code tasks
  • fewer quality losses caused by broken strings and low-signal context replacements

Latest Delivery

Updated: 2026-04-16 01:41 (KST)

  • Delivered in this pass:
    • Phase 1 foundation:
      • ChatWindow.UtilityPresentation.cs now bootstraps workspace context generation on first access and returns language-workflow fallback hints while .ax-context.md is still being generated.
      • AgentLoopService.cs now records query_context workflow transitions with query-window, budget, supplemental-context, and working-set summaries.
    • Phase 2 foundation:
      • CodeTaskWorkingSetService.cs adds a Code-only structured ledger for:
        • goal
        • selected scaffold/profile
        • created directories
        • recent reads/writes
        • latest diagnostics
        • next repair focus
      • the working set is injected into each Code request as a supplemental code_working_set system message.
    • Phase 3 foundation:
      • AgentToolResultBudget.cs and AgentQueryContextBuilder.cs now expose a code query profile with a larger protected-recent window and larger retained budgets for build_run, test_loop, process, file_read, multi_read, lsp_code_intel, and git_tool.
    • Phase 4 observability step:
      • LlmService.ToolUse.cs now logs sanitization counts for flattened assistant tool traces and converted orphan tool messages, so tool-trace repair frequency can be measured per run.
  • Remaining follow-up:
    • extend pre-request tool-trace validation so the flattening/orphan repair count trends toward zero rather than being logged after repair
    • replace more mojibake prompt/status strings in active Code execution paths with English equivalents

Updated: 2026-04-16 01:57 (KST)

  • Delivered in this pass:
    • Phase 4 partial delivery:
      • AgentMessageInvariantHelper.cs now normalizes historical tool traces before the request leaves the agent loop.
      • structured assistant tool-call messages without matching tool_result now flatten into plain assistant transcript text.
      • orphan tool_result messages now flatten into plain user transcript text instead of relying only on late OpenAI payload repair.
      • AgentLoopLlmRequestPreparationService.cs clones the query window first, then applies normalization, so request cleanup does not mutate stored conversation history.
      • AgentLoopContextReliability.cs now logs tool-trace repair counts inside the query_context transition for run-by-run observability.
    • Phase 5 partial delivery:
      • SessionLearningCollector.cs was rewritten with English-only comments and English injection text.
      • AgentLoopDiagnosticsFormatter.cs no longer emits mojibake-prone compaction status text in active Code paths.
  • Remaining follow-up:
    • measure whether tool_trace_repair counts keep trending down in long Code runs after this preflight normalization
    • continue replacing older mojibake strings outside the active Code execution path

Updated: 2026-04-16 02:05 (KST)

  • Delivered in this pass:
    • structural alignment step:
      • AgentLoopQueryAssemblyService.cs now owns the staged query/history assembly path.
      • PrepareHistory(...) handles session-learning refresh plus queued-command/query-window preparation.
      • PrepareRequest(...) handles Code working-set supplemental context and request-message assembly before dispatch.
      • AgentLoopService.cs now delegates those responsibilities instead of manually stitching them together inline.
    • test and encoding hygiene step:
      • AgentLoopQueryAssemblyServiceTests.cs locks the new staged assembly behavior.
      • SessionLearningCollectorTests.cs was rewritten to English-only comments and assertions to match the new repository rule.
  • Remaining follow-up:
    • keep extracting more inline AgentLoop responsibilities into smaller staged services where it improves observability or retry correctness
    • continue measuring long Code runs against claw-code-style continuity scenarios