feat(harness): unified harness surface — foundation (span derivation, delivery adapters, emitter)#412
Open
declan-scale wants to merge 26 commits into
Open
feat(harness): unified harness surface — foundation (span derivation, delivery adapters, emitter)#412declan-scale wants to merge 26 commits into
declan-scale wants to merge 26 commits into
Conversation
Approach A (Agentex event stream as canonical source of truth): one tap per harness feeds shared yield/auto-send delivery adapters and a span-deriving tracing tap. Additive backwards-compat, stacked PRs <1000 lines, conformance + live-matrix testing (3 test agents per harness: sync/async/temporal). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… golden-agent integration - Make tracing-tap span derivation explicit (tool open on Done of a ToolRequestContent index, close on matching ToolResponseContent by tool_call_id; parallel-safe; reasoning start->done). Flag missing is_error on ToolResponseContent as an additive upstream decision. - Add first-class TurnUsage/TurnResult shape (aligned to llm_metrics token taxonomy) attached to the turn span via span(data=) and reused for metrics. - Document golden-agent integration: all SGP/sandbox/secret/MCP coupling stays in the agent; only parsing/streaming/tracing/usage move to SDK taps + emitter; sandbox-setup events chain before the harness stream. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… 1-3) Bite-sized TDD tasks: foundation types, pure SpanDeriver, SpanTracer adapter, yield + auto_send delivery, UnifiedEmitter facade, conformance scaffold + CI job. Migration/parser PRs (4-9) listed as follow-on plans. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… signals Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… handling in SpanDeriver Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…sts for SpanTracer Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…on early close Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…reaming + tracing) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… + cover error/finally paths in auto_send Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…send_turn + doc tracer modes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…egistry semantics Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…he package Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…or consistency Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
d21c54a to
ebc468d
Compare
…reportImplicitOverride) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…MissingImports) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor
Author
|
@greptile review |
…ast-segment, created_at (AGX1-377, AGX1-378) auto_send.py: - Replace single current_ctx with ctx_map[index] so parallel streams route correctly - Open a streaming context for ALL content types on Start (not just text/reasoning), fixing tool_request/tool_response stream delivery (AGX1-377) - Reset final_text_parts on each new Start(TextContent) and on Full(TextContent) so multi-step turns return the LAST text segment, not the full accumulation - Add created_at: datetime | None param; forward to every streaming_task_message_context call (AGX1-378) span_derivation.py: - _on_full: handle Full(ToolRequestContent) by opening a tool span keyed by tool_call_id if not already open; adds LangGraph full-event harness support Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ting, last-segment, created_at, Full ToolRequest spans test_auto_send.py: - Fix test 2: remove bare Start(ToolRequestContent) from events (old behavior was that Start did not open a ctx; new behavior does, so test was updated to use Full-only events that still verify the two-context behavior) - Extend _FakeStreaming to record created_at on each context call - Add test 6: streamed tool_request opens a ctx + routes deltas (AGX1-377 core) - Add test 7: interleaved indexes route deltas to correct per-index contexts - Add test 8: multi-step turns return the LAST text segment only - Add test 9: Full(TextContent) contributes its content to final_text - Add test 10: created_at is forwarded to every streaming context call (AGX1-378) test_span_derivation.py: - Add test_full_tool_request_opens_span: Full(ToolRequestContent) opens a span - Add test_full_tool_request_and_response_paired: paired Full request+response produces a complete OpenSpan+CloseSpan - Add test_full_tool_request_does_not_double_open: idempotent; a Full for an already-open tool_call_id is a no-op Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment on lines
+59
to
+67
| async def auto_send_turn(self, turn: HarnessTurn) -> TurnResult: | ||
| """Async/temporal delivery: push to the task stream, return TurnResult.""" | ||
| return await auto_send( | ||
| turn.events, | ||
| task_id=self.task_id, | ||
| tracer=self.tracer, | ||
| streaming=self._streaming, | ||
| usage=turn.usage(), | ||
| ) |
There was a problem hiding this comment.
turn.usage() called before events are exhausted
turn.usage() is evaluated as a Python argument before auto_send is awaited, meaning it runs before turn.events is ever iterated. The HarnessTurn protocol explicitly states that usage() is "valid only after events is exhausted." Any real implementation that accumulates token counts or cost while iterating events will always return stale/zero usage here. The test passes only because _Turn.usage() returns a pre-set value regardless of iteration state.
The fix is to call turn.usage() after auto_send has drained the stream.
Suggested change
| async def auto_send_turn(self, turn: HarnessTurn) -> TurnResult: | |
| """Async/temporal delivery: push to the task stream, return TurnResult.""" | |
| return await auto_send( | |
| turn.events, | |
| task_id=self.task_id, | |
| tracer=self.tracer, | |
| streaming=self._streaming, | |
| usage=turn.usage(), | |
| ) | |
| async def auto_send_turn(self, turn: HarnessTurn) -> TurnResult: | |
| """Async/temporal delivery: push to the task stream, return TurnResult.""" | |
| result = await auto_send( | |
| turn.events, | |
| task_id=self.task_id, | |
| tracer=self.tracer, | |
| streaming=self._streaming, | |
| ) | |
| usage = turn.usage() # valid now: events exhausted by auto_send | |
| return TurnResult(final_text=result.final_text, usage=usage) |
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/agentex/lib/core/harness/emitter.py
Line: 59-67
Comment:
**`turn.usage()` called before events are exhausted**
`turn.usage()` is evaluated as a Python argument before `auto_send` is awaited, meaning it runs before `turn.events` is ever iterated. The `HarnessTurn` protocol explicitly states that `usage()` is "valid only after `events` is exhausted." Any real implementation that accumulates token counts or cost while iterating events will always return stale/zero usage here. The test passes only because `_Turn.usage()` returns a pre-set value regardless of iteration state.
The fix is to call `turn.usage()` after `auto_send` has drained the stream.
```suggestion
async def auto_send_turn(self, turn: HarnessTurn) -> TurnResult:
"""Async/temporal delivery: push to the task stream, return TurnResult."""
result = await auto_send(
turn.events,
task_id=self.task_id,
tracer=self.tracer,
streaming=self._streaming,
)
usage = turn.usage() # valid now: events exhausted by auto_send
return TurnResult(final_text=result.final_text, usage=usage)
```
How can I resolve this? If you propose a fix, please make it concise.…n (AGX1-378) So migration helpers can restore the deterministic first-message timestamp on the temporal path. Default None preserves current behavior. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor
Author
|
@greptile review |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this is
Foundation (PRs 1–3 of the rollout) for a unified harness tracing/message-emitting surface: the Agentex
StreamTaskMessage*stream is the single source of truth, and shared harness-independent machinery derives spans from it and delivers it over both channels:adk.streaming(async + temporal agents, from inside an activity),with tracing on by default (derived from the same stream) and overridable, and a unified
TurnUsage/TurnResultshape for per-harness usage normalization.Design:
docs/superpowers/specs/2026-06-18-unified-harness-surface-design.mdPlan:
docs/superpowers/plans/2026-06-18-unified-harness-surface-foundation.mdWhat's in
src/agentex/lib/core/harness/types.py—StreamTaskMessage,OpenSpan/CloseSpan/SpanSignal,TurnUsage,TurnResult,HarnessTurnprotocol.span_derivation.py—SpanDeriver: pure reducer (noadkdep), canonical stream → span signals. Tool span opens on theDoneof aToolRequestContentindex, closes on the matchingToolResponseContentbytool_call_id; reasoning span open-on-Start / close-on-Done; parallel-safe;flush()closes unclosed spans.tracer.py—SpanTracer: best-effort adapter from span signals toadk.tracing(never raises; overridable; guardedmake_logger).yield_delivery.py/auto_send.py— the two delivery adapters (both feed the sameSpanDeriver/SpanTracer;finally-flush on early close/error).emitter.py—UnifiedEmitter: ties trace context + delivery + usage; default-on/overridable tracing; injectable tracing/streaming backends.conformance/— shared conformance scaffold each future harness tap registers fixtures with..github/workflows/harness-integration.yml— conformance CI job (via./scripts/test) + anif: falselive-matrixplaceholder enabled by the migration PRs.Scope / what's NOT here
Per-harness migration (pydantic-ai / langgraph / openai) and parser taps (claude-code / codex), plus their 3 e2e test agents each (sync/async/temporal), are future migration PRs (4–8) — not in this branch.
Quality gates
./scripts/test tests/lib/core/harness/).# type: ignorein the package.Follow-ups (filed)
Fulltool-message wire shape (blocks migration backward-compat claims).adkfacade before the first consumer migration.pathstoagentex.types;SpanTracerduplicate-open guard.is_erroronToolResponseContent(tool-span error status).🤖 Generated with Claude Code
Greptile Summary
This PR introduces the foundation layer for a unified harness tracing and message-emitting surface —
SpanDeriver,SpanTracer,yield_delivery,auto_send,UnifiedEmitter, and shared conformance scaffolding — so that every future harness tap gets streaming, tracing, and usage normalization from a single set of shared components.SpanDeriveris a pure,adk-free reducer that maps the canonicalStreamTaskMessage*stream toOpenSpan/CloseSpansignals; both delivery adapters (yield_eventsandauto_send) feed it identically, with afinally-flush covering early exits and errors.auto_senddelivery: Now correctly routes events by index viactx_map, handlesStart(ToolRequestContent)(opens a streaming context), supports last-segmentfinal_textsemantics (resets on each newTextContentstart), and accumulates text fromFull(TextContent)— all verified by 10 dedicated tests.turn.usage()evaluated before events drain), wire-shape consistency forFulltool messages, and the conformance suite's yield-vs-auto-send equivalence assertion are explicitly tracked in AGX1-373/374/375/377 for resolution in the per-harness migration PRs.Confidence Score: 4/5
Safe to merge as a foundation layer; the deferred usage-ordering behaviour in the emitter means that real harness taps accumulating token counts during iteration will receive stale usage until the migration PRs align the call site.
The core event routing, span derivation, and delivery logic are correct and well-tested. The one known semantic gap —
turn.usage()is evaluated as an argument beforeauto_senddrainsturn.events, so any real tap that accumulates usage during iteration returns stale data — is a present defect in the changed emitter, deliberately deferred to the migration PRs where real implementations will be provided.emitter.py — usage collection ordering; span_derivation.py —
_on_donedouble-open guard asymmetryImportant Files Changed
_on_fullguards against double-open but_on_donedoes not, risking orphaned spans on malformed mixed streams.ctx_mapcorrectly handles interleaved streams, ToolRequestContent Start opens contexts, last-segment final_text semantics are in place, andfinally-flush covers early exits. Previous review gaps have been addressed.turn.usage()is evaluated as a positional argument beforeauto_senddrainsturn.events, which violates theHarnessTurnprotocol contract; deferred to migration PRs.adk.tracing. Never raises, gracefully handles missing spans. Duplicate-open orphan behavior is documented and tracked (AGX1-376).finally-flush correctly handles early consumer cancellation.StreamTaskMessageunion,OpenSpan/CloseSpansignals,TurnUsage,TurnResult,HarnessTurnprotocol.derive_allidempotency, not correctness of emitted signals; true yield-vs-auto-send equivalence assertion is deferred to AGX1-373../scripts/testfor isolated uv environment. Narrowpathsfilter (missingagentex.types) is a known gap filed as AGX1-376. Placeholderlive-matrixjob kept asif: falseuntil migration PRs land.Sequence Diagram
%%{init: {'theme': 'neutral'}}%% sequenceDiagram participant HT as HarnessTurn participant UE as UnifiedEmitter participant YD as yield_events / auto_send participant SD as SpanDeriver participant ST as SpanTracer participant S as adk.streaming UE->>YD: pass turn.events + tracer loop each StreamTaskMessage HT-->>YD: event (Start / Delta / Done / Full) YD->>SD: observe(event) SD-->>YD: [] or [OpenSpan / CloseSpan] opt signal emitted YD->>ST: handle(signal) ST->>S: start_span / end_span end alt yield mode YD-->>UE: yield event else auto_send mode YD->>S: streaming_task_message_context + stream_update + close end end Note over YD,ST: finally: deriver.flush() closes unclosed spans YD-->>UE: TurnResult(final_text, usage)%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%% sequenceDiagram participant HT as HarnessTurn participant UE as UnifiedEmitter participant YD as yield_events / auto_send participant SD as SpanDeriver participant ST as SpanTracer participant S as adk.streaming UE->>YD: pass turn.events + tracer loop each StreamTaskMessage HT-->>YD: event (Start / Delta / Done / Full) YD->>SD: observe(event) SD-->>YD: [] or [OpenSpan / CloseSpan] opt signal emitted YD->>ST: handle(signal) ST->>S: start_span / end_span end alt yield mode YD-->>UE: yield event else auto_send mode YD->>S: streaming_task_message_context + stream_update + close end end Note over YD,ST: finally: deriver.flush() closes unclosed spans YD-->>UE: TurnResult(final_text, usage)Comments Outside Diff (1)
General comment
handle()method toUnifiedEmitter(..., tracer=custom_tracer)did not use that object. Instead, the emitter fell through to default construction ofSpanTracer, which attempted to import the real ADK stack and failed in this environment withModuleNotFoundError: No module named 'temporalio'. This contradicts the requested override behavior for tracing in the unified emitter surface.src/agentex/lib/core/harness/emitter.pyonly accepts overrides whenisinstance(tracer, SpanTracer)is true. A valid injected/duck-typed tracer object is ignored, causing fallback toSpanTracer(...)whenevertrace_idis truthy.handle(signal)contract, or define and use a runtime-checkable tracer Protocol instead of requiringisinstance(tracer, SpanTracer).Reviews (7): Last reviewed commit: "feat(harness): thread created_at through..." | Re-trigger Greptile