feat(core): Add eval + trim builder prompt (no-changelog)#32484
PR review overview
Based on ownership of the 24 changed files in this PR:
2 issues found across 44 files
Codecov Report
✅ All modified and coverable lines are covered by tests.
Instance AI Discovery Eval ✅
Branch: Eval output
Instance AI Workflow Eval
Important: This eval does not re-run on new commits. ▶ Re-run this eval (then Re-run jobs) when you're ready to merge.
Caution: 🔴 1 regression · 0 likely regressions · 1 worth watching
Aggregate: 86.1% PR vs 80.8% baseline, +5.3pp ↑
Regressions (1), high-confidence
| Scenario | PR | Baseline | Δ |
|---|---|---|---|
| workflow-data-table/happy-path | 3/3 (100%) | 4/10 (40%) | +60pp ↑ |
Improvements (1)
| Scenario | PR | Baseline | Δ | p |
|---|---|---|---|---|
| weather-alert/happy-path | 2/3 (67%) | 0/10 (0%) | +67pp ↑ | 0.038 |
p = Fisher's exact one-sided p-value. Lower = stronger evidence of a real change.
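The p-value defined above can be reproduced with a short hypergeometric tail sum. This is a minimal sketch, assuming the tested direction is "the PR passes at least this often" and that the eval uses the standard Fisher's exact test; the report does not show its implementation, and the function names here are illustrative.

```typescript
// log(n!) computed as a sum of logs, which is plenty for small run counts.
function logFactorial(n: number): number {
  let sum = 0;
  for (let i = 2; i <= n; i++) sum += Math.log(i);
  return sum;
}

// log of the binomial coefficient C(n, k).
function logChoose(n: number, k: number): number {
  return logFactorial(n) - logFactorial(k) - logFactorial(n - k);
}

// One-sided Fisher's exact p-value: the probability of seeing at least
// `prPass` passes in the PR group if PR and baseline runs shared one
// underlying pass rate (upper tail of the hypergeometric distribution).
function fisherOneSidedGreater(
  prPass: number,
  prTotal: number,
  basePass: number,
  baseTotal: number,
): number {
  const totalPass = prPass + basePass;
  const totalRuns = prTotal + baseTotal;
  const maxPrPass = Math.min(prTotal, totalPass);
  let p = 0;
  for (let x = prPass; x <= maxPrPass; x++) {
    // Skip splits that would need more baseline passes than baseline runs.
    if (totalPass - x > baseTotal) continue;
    p += Math.exp(
      logChoose(prTotal, x) +
        logChoose(baseTotal, totalPass - x) -
        logChoose(totalRuns, totalPass),
    );
  }
  return p;
}

// weather-alert/happy-path: 2/3 on the PR vs 0/10 on the baseline.
console.log(fisherOneSidedGreater(2, 3, 0, 10).toFixed(3)); // prints "0.038"
```

With only 3 PR runs per scenario, even a large jump in pass rate gives modest evidence, which is why the table reports p alongside Δ.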
Failure breakdown
| Category | PR | Baseline | Δ | |
|---|---|---|---|---|
| builder_issue | 5 (13.9%) | 111 (27.8%) | -13.9pp ↓ | notable |
| mock_issue | 0 (0.0%) | 10 (2.5%) | -2.5pp ↓ | |
| framework_issue | 0 (0.0%) | 1 (0.3%) | -0.3pp ↓ | |
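The PR-side percentages in the breakdown line up with the aggregate pass rate if the PR column covers 36 runs. That total is an inference (12 scenarios named in this report, 3 runs each) and is not stated in the report; the helper names below are illustrative.

```typescript
// Share of `count` in `total`, as a percentage rounded to one decimal place.
function sharePct(count: number, total: number): number {
  return Math.round((count / total) * 1000) / 10;
}

// Difference between two percentages, in percentage points (one decimal).
function ppDelta(prPct: number, baselinePct: number): number {
  return Math.round((prPct - baselinePct) * 10) / 10;
}

const prRuns = 36; // assumption: 12 scenarios x 3 runs, inferred from the report
const prFailures = 5; // builder_issue count in the PR column

console.log(sharePct(prFailures, prRuns)); // 13.9
console.log(sharePct(prRuns - prFailures, prRuns)); // 86.1
console.log(ppDelta(13.9, 27.8)); // -13.9
```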
Per-test-case results (6)
| Workflow | Built | pass@3 | pass^3 |
|---|---|---|---|
| airtable-split-to-slack | 3/3 | 100% | 100% |
| notification-router | 3/3 | 100% | 100% |
| rest-api-data-pipeline | 3/3 | 67% | 43% |
| telegram-chatbot-memory-session | 3/3 | 100% | 100% |
| weather-alert | 3/3 | 100% | 65% |
| workflow-data-table | 3/3 | 100% | 100% |
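The report does not define pass@3 and pass^3. One reading that reproduces the table is: pass@3 is the share of a workflow's scenarios with at least one passing run out of 3, and pass^3 is the per-scenario pass rate cubed (the estimated chance that all 3 runs pass), averaged over scenarios. The sketch below assumes that reading and 3 runs per scenario; the per-scenario counts come from the failure details and stable list in this report, and all names are illustrative.

```typescript
interface ScenarioResult {
  scenario: string;
  passed: number; // runs that passed
  runs: number; // total runs for this scenario
}

// Share of scenarios where at least one run passed.
function passAtK(results: ScenarioResult[]): number {
  const hits = results.filter((r) => r.passed > 0).length;
  return hits / results.length;
}

// Mean over scenarios of (pass rate)^k: the estimated chance that
// k independent runs of a scenario all pass.
function passHatK(results: ScenarioResult[], k: number): number {
  const sum = results.reduce((acc, r) => acc + Math.pow(r.passed / r.runs, k), 0);
  return sum / results.length;
}

const toPct = (x: number): string => `${Math.round(x * 100)}%`;

// rest-api-data-pipeline: empty-response failed 3 runs, all-filtered failed 1.
const restApiDataPipeline: ScenarioResult[] = [
  { scenario: "happy-path", passed: 3, runs: 3 },
  { scenario: "all-filtered", passed: 2, runs: 3 },
  { scenario: "empty-response", passed: 0, runs: 3 },
];

console.log(toPct(passAtK(restApiDataPipeline)), toPct(passHatK(restApiDataPipeline, 3))); // 67% 43%
```

Under the same assumptions, weather-alert (happy-path 2/3, rain-not-expected 3/3) gives 100% and 65%, matching its row.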
Workflow checks
Scored over 18 successful build(s). N/A = check did not apply to that workflow.
| Dimension | Check | Kind | Pass | Fail | N/A | Pass rate |
|---|---|---|---|---|---|---|
| parameter_correctness | correct_node_operations | llm | 17 | 1 | 0 | 94% |
| intent_match | fulfills_user_request | llm | 17 | 1 | 0 | 94% |
| nodes_craftsmanship | response_matches_workflow_changes | llm | 17 | 1 | 0 | 94% |
All workflow checks (3 failing of 32 checks)
| Dimension | Check | Kind | Pass | Fail | N/A | Pass rate |
|---|---|---|---|---|---|---|
| structure | has_nodes | deterministic | 18 | 0 | 0 | 100% |
| structure | has_start_node | deterministic | 18 | 0 | 0 | 100% |
| structure | has_trigger | deterministic | 18 | 0 | 0 | 100% |
| structure | no_disabled_nodes | deterministic | 18 | 0 | 0 | 100% |
| connection_topology | all_nodes_connected | deterministic | 18 | 0 | 0 | 100% |
| connection_topology | error_routes_consistent | deterministic | 18 | 0 | 0 | 100% |
| connection_topology | handles_multiple_items | llm | 18 | 0 | 0 | 100% |
| connection_topology | no_unreachable_nodes | deterministic | 18 | 0 | 0 | 100% |
| connection_topology | switch_fallback_output_enabled | deterministic | 3 | 0 | 15 | 100% |
| parameter_correctness | correct_node_operations | llm | 17 | 1 | 0 | 94% |
| parameter_correctness | expressions_reference_existing_nodes | deterministic | 3 | 0 | 15 | 100% |
| parameter_correctness | http_generic_auth_type_matches_prompt | deterministic | 5 | 0 | 13 | 100% |
| parameter_correctness | item_flow_independent_source_execute_once | deterministic | 0 | 0 | 18 | — |
| parameter_correctness | item_flow_paired_item_references | deterministic | 0 | 0 | 18 | — |
| parameter_correctness | no_empty_set_nodes | deterministic | 0 | 0 | 18 | — |
| parameter_correctness | no_invalid_from_ai | deterministic | 0 | 0 | 18 | — |
| parameter_correctness | valid_data_flow | llm | 18 | 0 | 0 | 100% |
| parameter_correctness | valid_field_references | deterministic | 18 | 0 | 0 | 100% |
| parameter_correctness | valid_node_config | deterministic | 18 | 0 | 0 | 100% |
| intent_match | fulfills_user_request | llm | 17 | 1 | 0 | 94% |
| ai_nodes | agent_has_dynamic_prompt | deterministic | 3 | 0 | 15 | 100% |
| ai_nodes | agent_has_language_model | deterministic | 3 | 0 | 15 | 100% |
| ai_nodes | memory_properly_connected | deterministic | 3 | 0 | 15 | 100% |
| ai_nodes | memory_session_key_expression | deterministic | 3 | 0 | 15 | 100% |
| ai_nodes | tools_have_parameters | deterministic | 0 | 0 | 18 | — |
| ai_nodes | vector_store_has_embeddings | deterministic | 0 | 0 | 18 | — |
| nodes_craftsmanship | code_node_no_http_requests | deterministic | 8 | 0 | 10 | 100% |
| nodes_craftsmanship | descriptive_node_names | llm | 18 | 0 | 0 | 100% |
| nodes_craftsmanship | no_unnecessary_code_nodes | llm | 18 | 0 | 0 | 100% |
| nodes_craftsmanship | response_matches_workflow_changes | llm | 17 | 1 | 0 | 94% |
| security | inbound_trigger_auth_defaults | deterministic | 3 | 0 | 15 | 100% |
| security | no_hardcoded_credentials | deterministic | 10 | 0 | 8 | 100% |
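The Pass rate column is consistent with dividing passes by the builds where the check applied, leaving N/A builds out, and showing a dash when the check applied to none. A minimal sketch of that rule, assuming whole-percent rounding; the type and function names are illustrative and not taken from the eval code.

```typescript
interface CheckTally {
  pass: number;
  fail: number;
  notApplicable: number;
}

// Pass rate over applicable builds only. Returns null when the check
// applied to no build, which the report renders as a dash.
function passRateLabel(tally: CheckTally): string | null {
  const applicable = tally.pass + tally.fail;
  if (applicable === 0) return null;
  return `${Math.round((tally.pass / applicable) * 100)}%`;
}

console.log(passRateLabel({ pass: 17, fail: 1, notApplicable: 0 })); // 94%
console.log(passRateLabel({ pass: 3, fail: 0, notApplicable: 15 })); // 100%
console.log(passRateLabel({ pass: 0, fail: 0, notApplicable: 18 })); // null
```

Each row's Pass, Fail and N/A counts sum to 18, the number of successful builds scored.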
Other findings: 9 stable
Stable (9):
airtable-split-to-slack/empty-records, airtable-split-to-slack/happy-path, notification-router/high-priority, notification-router/low-priority, notification-router/medium-priority, rest-api-data-pipeline/all-filtered, rest-api-data-pipeline/happy-path, telegram-chatbot-memory-session/distinct-telegram-chat, weather-alert/rain-not-expected.
Failure details
rest-api-data-pipeline/empty-response — 3 failed
Run [builder_issue]: The Fetch Posts node returned an empty array (0 items), which caused Filter and Build Message and Post to Slack to not run at all. In n8n, when 0 items flow into a node, that node does not execute. Th
Run [builder_issue]: The Fetch Posts node returned an empty array [], which produced 0 output items. With 0 items flowing into Filter & Build Message, that node never executed (n8n's default behavior: 0 items = node doesn
Run [builder_issue]: The Fetch Posts node returned an empty array [], which produced 0 items. With 0 items flowing into 'Drop Titles With "qui"', that Filter node did not run. With 0 items flowing into 'Build Digest', tha
rest-api-data-pipeline/all-filtered — 1 failed
Run [builder_issue]: All 3 posts were correctly filtered out by the 'Drop Titles With "qui"' node (all titles contain 'qui'), producing 0 items on branch 0. However, when 0 items flow into Build Digest, n8n does not execu
weather-alert/happy-path — 1 failed
Run [builder_issue]: The workflow fails for two independent reasons. First, the 'Detect Rain Today' code node returned rainExpected: false despite the mock data containing a rain entry (id=500, rain.3h=0.5 at 2024-06-12 1
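The empty-response and all-filtered failures describe the same behaviour: once a node outputs 0 items, the nodes after it do not execute. The toy model below only illustrates that rule for a linear chain. It is not n8n's execution engine, the types and `runChain` are hypothetical, and the node names are borrowed from the first failure message.

```typescript
interface Post {
  title: string;
}

interface Step {
  name: string;
  run: (items: Post[]) => Post[];
}

// Runs a linear chain and returns the names of the nodes that executed.
// The source node always runs; each later node runs only if it receives
// at least one item.
function runChain(sourceName: string, fetch: () => Post[], downstream: Step[]): string[] {
  const executed = [sourceName];
  let items = fetch();
  for (const step of downstream) {
    if (items.length === 0) break; // 0 items in: this node and everything after it is skipped
    items = step.run(items);
    executed.push(step.name);
  }
  return executed;
}

const downstream: Step[] = [
  // Drops posts whose title contains "qui", as in the eval scenario.
  { name: "Filter and Build Message", run: (items) => items.filter((p) => !p.title.includes("qui")) },
  { name: "Post to Slack", run: (items) => items },
];

// Empty API response: only the source node is listed as executed.
console.log(runChain("Fetch Posts", () => [], downstream));
```

In this model an all-filtered input behaves the same way one step later: the filter runs, emits 0 items, and "Post to Slack" never executes.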
1 issue found across 24 files
Architecture diagram
```mermaid
sequenceDiagram
    participant Agent as AI Agent
    participant KB as Knowledge Base
    participant Tools as Tool Registry
    participant Nodes as Nodes Tool
    participant Build as Build Workflow Tool
    participant Forms as Form Node
    participant OpenAI as OpenAI Node
    Note over Agent,OpenAI: Workflow Builder – Eval + Prompt Trimming Flow
    Agent->>KB: Request skill instructions + reference docs
    KB-->>Agent: Updated SKILL.md (trimmed tool surface, repair strategy, placeholders, working memory)
    KB-->>Agent: NEW: open-ai-output-shape.md reference
    Agent->>Agent: Compose system prompt from skill + KB + shared prompts
    Note over Agent: Shared prompt trimmed: PLACEHOLDERS_RULE removed. Working memory section removed.
    Agent->>Tools: Discover available tools with updated descriptions
    Tools->>Nodes: NEW: enriched action descriptions (suggested, search, type-definition, explore-resources)
    Tools->>Build: NEW: "Primary workflow-builder tool" prefix, patch-mode details
    Tools-->>Agent: Tool set with builder-specific surface hints
    Agent->>Nodes: Get node type definitions
    Nodes-->>Agent: OpenAI v2 text/response: builderHint marks 'message' as invalid
    Nodes-->>Agent: Form common: builderHint lists valid field types, notes no 'time' type
    Agent->>KB: Reference open-ai-output-shape.md for downstream field access
    KB-->>Agent: $json.output[0].content[0].text (v2+ response), not $json.text
    alt Build workflow with Google Sheets + OpenAI
        Agent->>Build: Submit SDK code (form trigger → OpenAI → Google Sheets → Form Ending)
        Build-->>Agent: Workflow created
        Agent->>Agent: NEW: Judge against eval case: form-booking.json
        Note over Agent: Eval checks: correct OpenAI field mapping, Google Sheets 7 columns, Form Ending exact text
    else Verify existing workflow
        Agent->>Build: Patch mode with workflowId + old_str/new_str
        Build-->>Agent: Patched workflow
    end
    opt User has Google Sheets with unknown document ID
        Agent->>Agent: Use placeholder() in documentId __rl.value with cachedResultName
        Note over Agent: Never empty string, follow trimmed placeholder guidance
    end
    Agent->>Agent: Report verification verdict (only in checkpoint follow-up turns)
    Note over Agent: Tool descriptions enforce: complete-checkpoint, report-verification-verdict, verify-built-workflow all scoped to specific turns
```
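The diagram notes that downstream nodes should read the OpenAI v2+ response text at `$json.output[0].content[0].text` rather than `$json.text`. The sketch below mirrors that path in plain TypeScript. `OpenAiResponseJson` is a hypothetical minimal shape inferred only from that expression, not the node's full output schema, and `readResponseText` is an illustrative helper.

```typescript
// Hypothetical minimal shape: only the fields named in the diagram's expression.
interface OpenAiResponseJson {
  output?: Array<{ content?: Array<{ text?: string }> }>;
  text?: string; // older top-level field that the diagram says not to rely on
}

// Follows output[0].content[0].text and returns undefined if any level is missing.
function readResponseText(json: OpenAiResponseJson): string | undefined {
  return json.output?.[0]?.content?.[0]?.text;
}

const sample: OpenAiResponseJson = {
  output: [{ content: [{ text: "Booking confirmed" }] }],
};

console.log(readResponseText(sample)); // Booking confirmed
console.log(readResponseText({ text: "only a top-level field" })); // undefined
```

An item that carries only a top-level `text` yields `undefined` here, which is the kind of silent mismatch the new reference doc is meant to prevent.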
Summary
Adds a failing case from prod traces, creates an eval from it, and trims the builder prompt by moving tool descriptions out of the builder skill and into the tools' own descriptions.
Related Linear tickets, GitHub issues, and Community forum posts
RESOLVES INS-574
Review / Merge checklist
Backport to Beta, Backport to Stable, or Backport to v1 (if the PR is an urgent fix that needs to be backported)