
feat(harness): unified harness surface — foundation (span derivation, delivery adapters, emitter)#412

Open
declan-scale wants to merge 26 commits into next from declan-scale/unified-harness-surface

declan-scale (Contributor) commented Jun 18, 2026

What this is

Foundation (PRs 1–3 of the rollout) for a unified harness tracing/message-emitting surface: the Agentex StreamTaskMessage* stream is the single source of truth, and shared harness-independent machinery derives spans from it and delivers it over both channels:

  • yield — pass the canonical stream through to the caller (sync HTTP ACP agents),
  • auto-send — push to the task stream via adk.streaming (async + temporal agents, from inside an activity),

with tracing on by default (derived from the same stream) and overridable, and a unified TurnUsage/TurnResult shape for per-harness usage normalization.

Design: docs/superpowers/specs/2026-06-18-unified-harness-surface-design.md
Plan: docs/superpowers/plans/2026-06-18-unified-harness-surface-foundation.md

What's in src/agentex/lib/core/harness/

  • types.py: StreamTaskMessage, OpenSpan/CloseSpan/SpanSignal, TurnUsage, TurnResult, HarnessTurn protocol.
  • span_derivation.py: SpanDeriver, a pure reducer (no adk dep), canonical stream → span signals. Tool span opens on the Done of a ToolRequestContent index, closes on the matching ToolResponseContent by tool_call_id; reasoning span open-on-Start / close-on-Done; parallel-safe; flush() closes unclosed spans.
  • tracer.py: SpanTracer, a best-effort adapter from span signals to adk.tracing (never raises; overridable; guarded make_logger).
  • yield_delivery.py / auto_send.py: the two delivery adapters (both feed the same SpanDeriver/SpanTracer; finally-flush on early close/error).
  • emitter.py: UnifiedEmitter, which ties trace context + delivery + usage; default-on/overridable tracing; injectable tracing/streaming backends.
  • conformance/: shared conformance scaffold each future harness tap registers fixtures with.
  • .github/workflows/harness-integration.yml: conformance CI job (via ./scripts/test) + an if: false live-matrix placeholder enabled by the migration PRs.
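
The span rules listed for span_derivation.py can be sketched as a small pure reducer. This is an illustration only, not the package's code: `Event`, `ToolRequest`, `ToolResponse`, `Reasoning`, and `MiniDeriver` are hypothetical stand-ins for the real StreamTaskMessage* and content types, and the tool response is assumed to arrive as a single "full" event.

```python
from dataclasses import dataclass, field
from typing import Optional


# Hypothetical stand-ins for the canonical stream events and content types.
@dataclass(frozen=True)
class ToolRequest:
    tool_call_id: str
    name: str


@dataclass(frozen=True)
class ToolResponse:
    tool_call_id: str


@dataclass(frozen=True)
class Reasoning:
    summary: str = ""


@dataclass(frozen=True)
class Event:
    kind: str  # "start", "done", or "full"
    index: int
    content: Optional[object] = None


@dataclass(frozen=True)
class OpenSpan:
    key: str
    name: str


@dataclass(frozen=True)
class CloseSpan:
    key: str


@dataclass
class MiniDeriver:
    """Pure reducer: one event in, zero or more span signals out."""

    _started: dict = field(default_factory=dict)  # index -> content seen on start
    _open: set = field(default_factory=set)  # keys of currently open spans

    def observe(self, event: Event) -> list:
        if event.kind == "start":
            self._started[event.index] = event.content
            if isinstance(event.content, Reasoning):
                return self._open_span(f"reasoning:{event.index}", "reasoning")
        elif event.kind == "done":
            content = self._started.pop(event.index, None)
            if isinstance(content, ToolRequest):
                # The tool span opens on Done, once the request is complete.
                return self._open_span(content.tool_call_id, content.name)
            if isinstance(content, Reasoning):
                return self._close_span(f"reasoning:{event.index}")
        elif event.kind == "full" and isinstance(event.content, ToolResponse):
            # The tool span closes on the response with the same tool_call_id.
            return self._close_span(event.content.tool_call_id)
        return []

    def flush(self) -> list:
        """Close every span still open, e.g. after an early exit or error."""
        signals = [CloseSpan(key) for key in sorted(self._open)]
        self._open.clear()
        return signals

    def _open_span(self, key: str, name: str) -> list:
        self._open.add(key)
        return [OpenSpan(key, name)]

    def _close_span(self, key: str) -> list:
        if key not in self._open:
            return []
        self._open.remove(key)
        return [CloseSpan(key)]


# Two parallel tool calls; only call_a gets a response before the turn ends.
deriver = MiniDeriver()
events = [
    Event("start", 0, ToolRequest("call_a", "search")),
    Event("start", 1, ToolRequest("call_b", "fetch")),
    Event("done", 1),
    Event("done", 0),
    Event("full", 2, ToolResponse("call_a")),
]
signals = [signal for event in events for signal in deriver.observe(event)]
leftover = deriver.flush()
```

Keying tool spans by tool_call_id rather than by stream index is what keeps the sketch parallel-safe: the response for call_a closes only its own span, and flush() picks up call_b.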

Scope / what's NOT here

Per-harness migration (pydantic-ai / langgraph / openai) and parser taps (claude-code / codex), plus their 3 e2e test agents each (sync/async/temporal), are future migration PRs (4–8) — not in this branch.

Quality gates

  • 30 tests passing on Python 3.12 + 3.13 (via ./scripts/test tests/lib/core/harness/).
  • pyright clean (0 errors/warnings), no # type: ignore in the package.
  • Each task spec- + quality-reviewed; final whole-branch review passed with no Critical issues.

Follow-ups (filed)

  • AGX1-373 (High) — make conformance assert true yield-vs-auto-send equivalence + reconcile Full tool-message wire shape (blocks migration backward-compat claims).
  • AGX1-374 (Medium) — auto_send reasoning + mixed-ordering tests.
  • AGX1-375 (Medium) — expose the surface via the public adk facade before the first consumer migration.
  • AGX1-376 (Low) — widen CI paths to agentex.types; SpanTracer duplicate-open guard.
  • AGX1-371 — deferred optional is_error on ToolResponseContent (tool-span error status).

Note: total diff is ~3k lines but ~1.6k of that is the spec + plan docs; the package code + tests + CI is ~1.4k. Reviewable per-commit (one commit per plan task).

🤖 Generated with Claude Code

Greptile Summary

This PR introduces the foundation layer for a unified harness tracing and message-emitting surface — SpanDeriver, SpanTracer, yield_delivery, auto_send, UnifiedEmitter, and shared conformance scaffolding — so that every future harness tap gets streaming, tracing, and usage normalization from a single set of shared components.

  • Core event pipeline: SpanDeriver is a pure, adk-free reducer that maps the canonical StreamTaskMessage* stream to OpenSpan/CloseSpan signals; both delivery adapters (yield_events and auto_send) feed it identically, with a finally-flush covering early exits and errors.
  • auto_send delivery: Now correctly routes events by index via ctx_map, handles Start(ToolRequestContent) (opens a streaming context), supports last-segment final_text semantics (resets on each new TextContent start), and accumulates text from Full(TextContent) — all verified by 10 dedicated tests.
  • Deferred gaps: Turn-usage ordering in the emitter (turn.usage() evaluated before events drain), wire-shape consistency for Full tool messages, and the conformance suite's yield-vs-auto-send equivalence assertion are explicitly tracked in AGX1-373/374/375/377 for resolution in the per-harness migration PRs.
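
A rough sketch of the index-keyed routing and last-segment final_text behaviour summarized for auto_send. The event classes and the `route` function are hypothetical simplifications: the real adapter opens streaming contexts through adk.streaming, whereas this version only collects text in memory, and it assumes at most one text segment is streaming at a time.

```python
from dataclasses import dataclass


# Hypothetical, simplified events; the real stream uses StreamTaskMessage* types.
@dataclass(frozen=True)
class Start:
    index: int
    kind: str  # "text" or "tool_request"


@dataclass(frozen=True)
class Delta:
    index: int
    text: str


@dataclass(frozen=True)
class Done:
    index: int


def route(events):
    """Route deltas by index and keep only the last text segment as final_text."""
    ctx_map = {}  # index -> (kind, list of delta chunks) for open contexts
    delivered = []  # (kind, full text) per closed context, in close order
    final_text_parts = []
    for event in events:
        if isinstance(event, Start):
            ctx_map[event.index] = (event.kind, [])
            if event.kind == "text":
                final_text_parts = []  # a new text segment resets final_text
        elif isinstance(event, Delta):
            kind, chunks = ctx_map[event.index]
            chunks.append(event.text)
            if kind == "text":
                final_text_parts.append(event.text)
        elif isinstance(event, Done):
            kind, chunks = ctx_map.pop(event.index)
            delivered.append((kind, "".join(chunks)))
    return delivered, "".join(final_text_parts)


# Text at index 0 and a tool request at index 1 interleave, then a second
# text segment at index 2 follows.
events = [
    Start(0, "text"), Delta(0, "Let me check. "),
    Start(1, "tool_request"), Delta(1, '{"q": '), Delta(0, "One moment."),
    Delta(1, '"x"}'), Done(0), Done(1),
    Start(2, "text"), Delta(2, "The answer is 4."), Done(2),
]
delivered, final_text = route(events)
```

With a single current context instead of `ctx_map`, the interleaved deltas for indexes 0 and 1 would land in whichever context was opened last.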

Confidence Score: 4/5

Safe to merge as a foundation layer; the deferred usage-ordering behaviour in the emitter means that real harness taps accumulating token counts during iteration will receive stale usage until the migration PRs align the call site.

The core event routing, span derivation, and delivery logic are correct and well-tested. The one known semantic gap — turn.usage() is evaluated as an argument before auto_send drains turn.events, so any real tap that accumulates usage during iteration returns stale data — is a present defect in the changed emitter, deliberately deferred to the migration PRs where real implementations will be provided.

emitter.py — usage collection ordering; span_derivation.py — _on_done double-open guard asymmetry

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/agentex/lib/core/harness/span_derivation.py | Pure reducer from stream events to span signals. Handles Start+Done and Full paths for tool/reasoning spans. Minor asymmetry: _on_full guards against double-open but _on_done does not, risking orphaned spans on malformed mixed streams. |
| src/agentex/lib/core/harness/auto_send.py | Async/temporal delivery adapter. Index-keyed ctx_map correctly handles interleaved streams, ToolRequestContent Start opens contexts, last-segment final_text semantics are in place, and finally-flush covers early exits. Previous review gaps have been addressed. |
| src/agentex/lib/core/harness/emitter.py | UnifiedEmitter facade wiring trace context, delivery, and usage. turn.usage() is evaluated as a positional argument before auto_send drains turn.events, which violates the HarnessTurn protocol contract; deferred to migration PRs. |
| src/agentex/lib/core/harness/tracer.py | Best-effort adapter over adk.tracing. Never raises, gracefully handles missing spans. Duplicate-open orphan behavior is documented and tracked (AGX1-376). |
| src/agentex/lib/core/harness/yield_delivery.py | Pure passthrough async generator with side-effect tracing. finally-flush correctly handles early consumer cancellation. |
| src/agentex/lib/core/harness/types.py | Clean type definitions: StreamTaskMessage union, OpenSpan/CloseSpan signals, TurnUsage, TurnResult, HarnessTurn protocol. |
| tests/lib/core/harness/conformance/runner.py | Process-global fixture registry with a documented cross-module ordering hazard. Correct for the current single-module setup; future harness taps are advised to parametrize their own fixtures. |
| tests/lib/core/harness/conformance/test_conformance.py | Conformance test only asserts derive_all idempotency, not correctness of emitted signals; true yield-vs-auto-send equivalence assertion is deferred to AGX1-373. |
| .github/workflows/harness-integration.yml | Conformance CI job correctly defers to ./scripts/test for isolated uv environment. Narrow paths filter (missing agentex.types) is a known gap filed as AGX1-376. Placeholder live-matrix job kept as if: false until migration PRs land. |
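
The finally-flush behaviour noted for yield_delivery.py can be illustrated with a passthrough async generator. This is a sketch under assumptions: `RecordingDeriver` is a hypothetical stand-in that only records calls, and the `yield_events` shown here is a simplified version that skips the tracer entirely.

```python
import asyncio


class RecordingDeriver:
    """Hypothetical stand-in for a span deriver; records calls only."""

    def __init__(self):
        self.observed = []
        self.flushed = False

    def observe(self, event):
        self.observed.append(event)

    def flush(self):
        self.flushed = True


async def yield_events(events, deriver):
    """Pass events through unchanged; always flush the deriver on exit."""
    try:
        async for event in events:
            deriver.observe(event)
            yield event
    finally:
        # Runs on normal exhaustion, on errors, and when the consumer closes early.
        deriver.flush()


async def source():
    for i in range(5):
        yield i


async def consume_two():
    deriver = RecordingDeriver()
    stream = yield_events(source(), deriver)
    seen = []
    async for event in stream:
        seen.append(event)
        if len(seen) == 2:
            break
    await stream.aclose()  # explicit close runs the finally block promptly
    return seen, deriver


seen, deriver = asyncio.run(consume_two())
```

Breaking out of `async for` leaves the generator suspended at its `yield`; calling `aclose()` raises GeneratorExit there, which is what runs the `finally` block without waiting for garbage collection.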

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant HT as HarnessTurn
    participant UE as UnifiedEmitter
    participant YD as yield_events / auto_send
    participant SD as SpanDeriver
    participant ST as SpanTracer
    participant S as adk.streaming

    UE->>YD: pass turn.events + tracer
    loop each StreamTaskMessage
        HT-->>YD: event (Start / Delta / Done / Full)
        YD->>SD: observe(event)
        SD-->>YD: []  or  [OpenSpan / CloseSpan]
        opt signal emitted
            YD->>ST: handle(signal)
            ST->>S: start_span / end_span
        end
        alt yield mode
            YD-->>UE: yield event
        else auto_send mode
            YD->>S: streaming_task_message_context + stream_update + close
        end
    end
    Note over YD,ST: finally: deriver.flush() closes unclosed spans
    YD-->>UE: TurnResult(final_text, usage)

Comments Outside Diff (1)

  1. General comment

    P1 UnifiedEmitter ignores duck-typed tracer override and constructs real SpanTracer

    • Bug
      • Passing a custom tracer object with an async handle() method to UnifiedEmitter(..., tracer=custom_tracer) did not use that object. Instead, the emitter fell through to default construction of SpanTracer, which attempted to import the real ADK stack and failed in this environment with ModuleNotFoundError: No module named 'temporalio'. This contradicts the requested override behavior for tracing in the unified emitter surface.
    • Cause
      • src/agentex/lib/core/harness/emitter.py only accepts overrides when isinstance(tracer, SpanTracer) is true. A valid injected/duck-typed tracer object is ignored, causing fallback to SpanTracer(...) whenever trace_id is truthy.
    • Fix
      • Relax the override branch to accept any non-None, non-False tracer object that implements the expected handle(signal) contract, or define and use a runtime-checkable tracer Protocol instead of requiring isinstance(tracer, SpanTracer).
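
    One way the Protocol-based fix could look, as a sketch only. `TracerLike`, `DefaultTracer`, `ListTracer`, and `resolve_tracer` are hypothetical names; `DefaultTracer` stands in for the real SpanTracer, which is not imported here, and the emitter's actual constructor logic may differ.

```python
import asyncio
from typing import Any, Optional, Protocol, runtime_checkable


@runtime_checkable
class TracerLike(Protocol):
    """Hypothetical protocol: anything with a handle(signal) method qualifies."""

    async def handle(self, signal: Any) -> None: ...


class DefaultTracer:
    """Stand-in for the real SpanTracer, which is not imported here."""

    async def handle(self, signal: Any) -> None:
        pass


class ListTracer:
    """Duck-typed override: does not subclass DefaultTracer."""

    def __init__(self) -> None:
        self.signals: list = []

    async def handle(self, signal: Any) -> None:
        self.signals.append(signal)


def resolve_tracer(tracer: Any, trace_id: Optional[str]) -> Optional[TracerLike]:
    """Pick the tracer: explicit override, tracing disabled, or the default."""
    if tracer is False:
        return None  # tracing explicitly turned off
    if tracer is not None:
        if not isinstance(tracer, TracerLike):
            raise TypeError("tracer must implement handle(signal)")
        return tracer  # honor any object that satisfies the protocol
    return DefaultTracer() if trace_id else None


custom = ListTracer()
chosen = resolve_tracer(custom, trace_id="trace-1")
asyncio.run(chosen.handle("open"))
```

    Note that `isinstance` against a runtime-checkable Protocol only checks that a `handle` attribute exists; it does not verify the signature or that the method is async.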

    T-Rex Ran code and verified through T-Rex

Reviews (7): Last reviewed commit: "feat(harness): thread created_at through..."

Comment thread src/agentex/lib/core/harness/auto_send.py
Comment thread src/agentex/lib/core/harness/auto_send.py Outdated
declan-scale and others added 21 commits June 18, 2026 13:28
Approach A (Agentex event stream as canonical source of truth): one tap per
harness feeds shared yield/auto-send delivery adapters and a span-deriving
tracing tap. Additive backwards-compat, stacked PRs <1000 lines, conformance +
live-matrix testing (3 test agents per harness: sync/async/temporal).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… golden-agent integration

- Make tracing-tap span derivation explicit (tool open on Done of a
  ToolRequestContent index, close on matching ToolResponseContent by
  tool_call_id; parallel-safe; reasoning start->done). Flag missing
  is_error on ToolResponseContent as an additive upstream decision.
- Add first-class TurnUsage/TurnResult shape (aligned to llm_metrics token
  taxonomy) attached to the turn span via span(data=) and reused for metrics.
- Document golden-agent integration: all SGP/sandbox/secret/MCP coupling
  stays in the agent; only parsing/streaming/tracing/usage move to SDK taps +
  emitter; sandbox-setup events chain before the harness stream.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… 1-3)

Bite-sized TDD tasks: foundation types, pure SpanDeriver, SpanTracer adapter,
yield + auto_send delivery, UnifiedEmitter facade, conformance scaffold + CI
job. Migration/parser PRs (4-9) listed as follow-on plans.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… signals

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… handling in SpanDeriver

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…sts for SpanTracer

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…on early close

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…reaming + tracing)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… + cover error/finally paths in auto_send

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…send_turn + doc tracer modes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…egistry semantics

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…he package

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…or consistency

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@declan-scale declan-scale force-pushed the declan-scale/unified-harness-surface branch from d21c54a to ebc468d Compare June 18, 2026 17:29
Comment thread src/agentex/lib/core/harness/span_derivation.py
Comment thread src/agentex/lib/core/harness/auto_send.py
declan-scale and others added 2 commits June 18, 2026 15:21
…reportImplicitOverride)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…MissingImports)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread src/agentex/lib/core/harness/auto_send.py Outdated
Comment thread src/agentex/lib/core/harness/auto_send.py
@declan-scale

Contributor Author

@greptile review

declan-scale and others added 2 commits June 18, 2026 16:51
…ast-segment, created_at (AGX1-377, AGX1-378)

auto_send.py:
- Replace single current_ctx with ctx_map[index] so parallel streams route correctly
- Open a streaming context for ALL content types on Start (not just text/reasoning),
  fixing tool_request/tool_response stream delivery (AGX1-377)
- Reset final_text_parts on each new Start(TextContent) and on Full(TextContent)
  so multi-step turns return the LAST text segment, not the full accumulation
- Add created_at: datetime | None param; forward to every
  streaming_task_message_context call (AGX1-378)

span_derivation.py:
- _on_full: handle Full(ToolRequestContent) by opening a tool span keyed by
  tool_call_id if not already open; adds LangGraph full-event harness support

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ting, last-segment, created_at, Full ToolRequest spans

test_auto_send.py:
- Fix test 2: remove bare Start(ToolRequestContent) from events (old behavior was
  that Start did not open a ctx; new behavior does, so test was updated to use
  Full-only events that still verify the two-context behavior)
- Extend _FakeStreaming to record created_at on each context call
- Add test 6: streamed tool_request opens a ctx + routes deltas (AGX1-377 core)
- Add test 7: interleaved indexes route deltas to correct per-index contexts
- Add test 8: multi-step turns return the LAST text segment only
- Add test 9: Full(TextContent) contributes its content to final_text
- Add test 10: created_at is forwarded to every streaming context call (AGX1-378)

test_span_derivation.py:
- Add test_full_tool_request_opens_span: Full(ToolRequestContent) opens a span
- Add test_full_tool_request_and_response_paired: paired Full request+response
  produces a complete OpenSpan+CloseSpan
- Add test_full_tool_request_does_not_double_open: idempotent; a Full for an
  already-open tool_call_id is a no-op

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread src/agentex/lib/core/harness/emitter.py Outdated
Comment on lines +59 to +67
    async def auto_send_turn(self, turn: HarnessTurn) -> TurnResult:
        """Async/temporal delivery: push to the task stream, return TurnResult."""
        return await auto_send(
            turn.events,
            task_id=self.task_id,
            tracer=self.tracer,
            streaming=self._streaming,
            usage=turn.usage(),
        )


P1 turn.usage() called before events are exhausted

turn.usage() is evaluated as a Python argument before auto_send is awaited, meaning it runs before turn.events is ever iterated. The HarnessTurn protocol explicitly states that usage() is "valid only after events is exhausted." Any real implementation that accumulates token counts or cost while iterating events will always return stale/zero usage here. The test passes only because _Turn.usage() returns a pre-set value regardless of iteration state.

The fix is to call turn.usage() after auto_send has drained the stream.

Suggested change
-    async def auto_send_turn(self, turn: HarnessTurn) -> TurnResult:
-        """Async/temporal delivery: push to the task stream, return TurnResult."""
-        return await auto_send(
-            turn.events,
-            task_id=self.task_id,
-            tracer=self.tracer,
-            streaming=self._streaming,
-            usage=turn.usage(),
-        )
+    async def auto_send_turn(self, turn: HarnessTurn) -> TurnResult:
+        """Async/temporal delivery: push to the task stream, return TurnResult."""
+        result = await auto_send(
+            turn.events,
+            task_id=self.task_id,
+            tracer=self.tracer,
+            streaming=self._streaming,
+        )
+        usage = turn.usage()  # valid now: events exhausted by auto_send
+        return TurnResult(final_text=result.final_text, usage=usage)
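
The ordering problem in this comment can be reproduced in isolation. The sketch below is not the emitter's code: `CountingTurn` and `drain` are hypothetical, and `events` is modeled as a method returning an async generator, which may differ from the real HarnessTurn protocol.

```python
import asyncio


class CountingTurn:
    """Hypothetical turn that accumulates usage while its events are iterated."""

    def __init__(self, chunks):
        self._chunks = chunks
        self._tokens = 0

    async def events(self):
        for chunk in self._chunks:
            self._tokens += len(chunk.split())
            yield chunk

    def usage(self):
        return self._tokens


async def drain(events, usage=None):
    """Stand-in for a delivery function: consumes the stream, returns text."""
    parts = [chunk async for chunk in events]
    return "".join(parts), usage


async def eager(turn):
    # usage() runs while building the argument list, before drain() iterates.
    _, usage = await drain(turn.events(), usage=turn.usage())
    return usage


async def deferred(turn):
    await drain(turn.events())
    return turn.usage()  # read only after the stream is exhausted


stale = asyncio.run(eager(CountingTurn(["a b ", "c"])))
fresh = asyncio.run(deferred(CountingTurn(["a b ", "c"])))
```

Python evaluates all call arguments before invoking the callee, so `eager` reports the counter's initial value while `deferred` sees the count accumulated during iteration.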

…n (AGX1-378)

So migration helpers can restore the deterministic first-message timestamp on
the temporal path. Default None preserves current behavior.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@declan-scale

Contributor Author

@greptile review
