feat(core): Add eval + trim builder prompt (no-changelog)#32484

Open
riqwan wants to merge 7 commits into master from ins-574-eval-1

@riqwan riqwan commented Jun 17, 2026

Summary

Adds a failing case from prod traces, creates an eval from it, and trims the builder prompt by moving tool descriptions out of the builder skill and into the tools' own descriptions.

Related Linear tickets, Github issues, and Community forum posts

RESOLVES INS-574

Review / Merge checklist

  • I have seen this code, I have run this code, and I take responsibility for this code.
  • PR title and summary are descriptive. (conventions)
  • Docs updated or follow-up ticket created.
  • Tests included.
  • PR Labeled with Backport to Beta, Backport to Stable, or Backport to v1 (if the PR is an urgent fix that needs to be backported)

@riqwan riqwan marked this pull request as ready for review June 17, 2026 12:50

n8n-assistant Bot commented Jun 17, 2026

PR review overview

Based on ownership of the 24 changed files in this PR:

| Ownership | Files owned | Share | Source code | Test files | Misc |
| --- | --- | --- | --- | --- | --- |
| @n8n-io/instance-ai | 22 | 92% | +125 / -38 | +7 / -1 | +92 / -72 |
| @n8n-io/ai | 1 | 4% | +4 / -0 | +0 / -0 | +0 / -0 |
| @n8n-io/nodes | 1 | 4% | +4 / -0 | +0 / -0 | +0 / -0 |
| Total | 24 | 100% | +133 / -38 | +7 / -1 | +92 / -72 |

@riqwan riqwan marked this pull request as draft June 17, 2026 12:55
@n8n-assistant n8n-assistant Bot added the n8n team Authored by the n8n team label Jun 17, 2026

@cubic-dev-ai cubic-dev-ai Bot left a comment

2 issues found across 44 files

Comment thread packages/@n8n/instance-ai/evaluations/langsmith/dataset-sync.ts Outdated
Comment thread packages/@n8n/instance-ai/src/tools/workflows/upstream-field-references.ts Outdated
@riqwan riqwan changed the title feat(core): add failed eval + trim builder prompt feat(core): Add eval + trim builder prompt (no-changelog) Jun 17, 2026

codecov Bot commented Jun 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


github-actions Bot commented Jun 17, 2026

Instance AI Discovery Eval ✅

Branch: ins-574-eval-1 · Commit: ccca1009d7afafbc561507f4d41d09b9a183852d

Eval output

> @n8n/instance-ai@1.12.0 eval:discovery /home/runner/_work/n8n/n8n/packages/@n8n/instance-ai
> tsx evaluations/discovery/cli.ts --trials 3 --fail-on-zero-pass

Running 9 discovery scenario(s) × 3 trial(s) (model: anthropic/claude-sonnet-4-6, concurrency: 3).

▸ data-table-natural-list-skill-loading ... ✓ 3/3 passed (100%)
▸ data-table-skill-loading ... ✓ 3/3 passed (100%)
▸ data-table-workflow-skill-loading ... ✓ 3/3 passed (100%)
▸ google-oauth-credential-setup ... ✓ 3/3 passed (100%)
▸ http-node-config-no-browser ... ✓ 3/3 passed (100%)
▸ oauth-with-computer-use-disabled ... ✓ 3/3 passed (100%)
▸ screenshot-dashboard ... ✓ 3/3 passed (100%)
▸ slack-oauth-credential-setup ... ✓ 3/3 passed (100%)
▸ workflow-builder-no-credential-ask ... ✓ 2/3 passed (67%)

=== Summary ===
Scenarios: 9/9 above threshold (67%)
Trials: 26/27 passed (96%)
Total time: 673.5s
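
The summary figures above follow from the per-scenario counts. A minimal sketch of that roll-up, assuming the 67% threshold means "at least two of three trials pass" (the `summarize` helper and its types are illustrative, not the eval CLI's actual code):

```typescript
// Hypothetical shape for one scenario's trial results.
interface ScenarioTrials {
  name: string;
  passed: number;
  trials: number;
}

// Roll per-scenario counts up into the figures shown in the summary.
// `threshold` is a pass-rate fraction; 2/3 is an assumption based on the "67%" label.
function summarize(results: ScenarioTrials[], threshold: number) {
  const aboveThreshold = results.filter((r) => r.passed / r.trials >= threshold).length;
  const passedTrials = results.reduce((n, r) => n + r.passed, 0);
  const totalTrials = results.reduce((n, r) => n + r.trials, 0);
  return {
    aboveThreshold,
    scenarios: results.length,
    passedTrials,
    totalTrials,
    trialPassRatePct: Math.round((passedTrials / totalTrials) * 100),
  };
}

// Counts copied from the eval output above.
const discoveryResults: ScenarioTrials[] = [
  { name: 'data-table-natural-list-skill-loading', passed: 3, trials: 3 },
  { name: 'data-table-skill-loading', passed: 3, trials: 3 },
  { name: 'data-table-workflow-skill-loading', passed: 3, trials: 3 },
  { name: 'google-oauth-credential-setup', passed: 3, trials: 3 },
  { name: 'http-node-config-no-browser', passed: 3, trials: 3 },
  { name: 'oauth-with-computer-use-disabled', passed: 3, trials: 3 },
  { name: 'screenshot-dashboard', passed: 3, trials: 3 },
  { name: 'slack-oauth-credential-setup', passed: 3, trials: 3 },
  { name: 'workflow-builder-no-credential-ask', passed: 2, trials: 3 },
];

const discoverySummary = summarize(discoveryResults, 2 / 3);
console.log(discoverySummary);
```

Under these assumptions the sketch gives 9 of 9 scenarios above threshold and 26 of 27 trials passed, matching the reported summary.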

github-actions Bot commented Jun 17, 2026

Instance AI Workflow Eval

Important

This eval does not re-run on new commits. ▶ Re-run this eval (then Re-run jobs) when you're ready to merge.

Caution

🔴 1 regression · 0 likely regressions · 1 worth watching
1 improvement · 9 stable · pass rate +5.3pp vs baseline

Aggregate: 86.1% PR vs 80.8% baseline — +5.3pp ↑
12 scenarios · N=3 (PR) vs N=10 (baseline) · baseline: instance-ai-baseline-cba67a7c
Partial: 28 baseline scenarios not run by PR.

Regressions (1) — high-confidence

| Scenario | PR | Baseline | Δ | p |
| --- | --- | --- | --- | --- |
| rest-api-data-pipeline/empty-response | 0/3 (0%) | 10/10 (100%) | -100pp ↓ | 0.003 |

rest-api-data-pipeline/empty-response — 3 of 3 failed · 3× builder_issue

Run 1 [builder_issue]: The Fetch Posts node returned an empty array (0 items), which caused Filter and Build Message and Post to Slack to not run at all. In n8n, when 0 items flow into a node, that node does not execute. The Code node's logic correctly handles an empty posts array (it would produce a message '0 posts rema

Run 2 [builder_issue]: The Fetch Posts node returned an empty array [], which produced 0 output items. With 0 items flowing into Filter & Build Message, that node never executed (n8n's default behavior: 0 items = node doesn't run). As a result, Post to Slack also never ran. No Slack message was posted stating '0 posts rem

Run 3 [builder_issue]: The Fetch Posts node returned an empty array [], which produced 0 items. With 0 items flowing into 'Drop Titles With "qui"', that Filter node did not run. With 0 items flowing into 'Build Digest', that Code node did not run. With 0 items flowing into 'Post to Slack', that Slack node did not run. No
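
The three runs describe the same mechanism: once a node emits 0 items, nothing downstream executes. A toy model of that item flow, as a sketch only (the `ToyNode` type and `runChain` function are hypothetical and are not n8n's execution engine):

```typescript
type Item = Record<string, unknown>;

// Hypothetical stand-in for a workflow node: takes items, returns items.
interface ToyNode {
  name: string;
  run: (items: Item[]) => Item[];
}

// Runs nodes in order and records which ones executed.
// Models the behavior described in the runs: a node that receives 0 items does not run.
function runChain(nodes: ToyNode[], input: Item[]): string[] {
  const executed: string[] = [];
  let items = input;
  for (const node of nodes) {
    if (items.length === 0) break; // 0 items in, so this node and everything after it is skipped
    executed.push(node.name);
    items = node.run(items);
  }
  return executed;
}

// Node names follow Run 2; the bodies are placeholders.
const emptyResponseChain: ToyNode[] = [
  { name: 'Fetch Posts', run: () => [] }, // mocked API returns an empty array
  { name: 'Filter & Build Message', run: (items) => [{ text: `${items.length} posts` }] },
  { name: 'Post to Slack', run: (items) => items },
];

const executedNodes = runChain(emptyResponseChain, [{}]);
console.log(executedNodes);
```

In this model only `Fetch Posts` executes, so the message-building step never gets the chance to report an empty result, which is the failure the judge describes.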

Worth watching (1) — large change, not flagged as a regression

| Scenario | PR | Baseline | Δ |
| --- | --- | --- | --- |
| workflow-data-table/happy-path | 3/3 (100%) | 4/10 (40%) | +60pp ↑ |

Improvements (1)

| Scenario | PR | Baseline | Δ | p |
| --- | --- | --- | --- | --- |
| weather-alert/happy-path | 2/3 (67%) | 0/10 (0%) | +67pp ↑ | 0.038 |

p = Fisher's exact one-sided p-value. Lower = stronger evidence of a real change.
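
The reported p-values are consistent with a one-sided Fisher's exact test on the 2×2 pass/fail table (PR vs. baseline) with fixed margins. A minimal sketch under that assumption; the function names are illustrative and not taken from the eval framework:

```typescript
// log of the binomial coefficient C(n, k), summed in log space to avoid overflow.
function logChoose(n: number, k: number): number {
  if (k < 0 || k > n) return -Infinity;
  let sum = 0;
  for (let i = 1; i <= k; i++) sum += Math.log(n - k + i) - Math.log(i);
  return sum;
}

// Hypergeometric probability of exactly `x` passes among the PR trials,
// given the total number of passes across PR and baseline.
function hypergeom(x: number, prN: number, totalPass: number, total: number): number {
  return Math.exp(
    logChoose(totalPass, x) + logChoose(total - totalPass, prN - x) - logChoose(total, prN),
  );
}

// One-sided p-value: 'less' tests for a regression (PR passes <= observed),
// 'greater' tests for an improvement (PR passes >= observed).
function fisherOneSided(
  prPass: number,
  prN: number,
  basePass: number,
  baseN: number,
  direction: 'less' | 'greater',
): number {
  const total = prN + baseN;
  const totalPass = prPass + basePass;
  const lo = Math.max(0, prN - (total - totalPass));
  const hi = Math.min(prN, totalPass);
  let p = 0;
  for (let x = lo; x <= hi; x++) {
    const inTail = direction === 'less' ? x <= prPass : x >= prPass;
    if (inTail) p += hypergeom(x, prN, totalPass, total);
  }
  return Math.min(1, p);
}

// rest-api-data-pipeline/empty-response: 0/3 (PR) vs 10/10 (baseline)
const pRegression = fisherOneSided(0, 3, 10, 10, 'less');
// weather-alert/happy-path: 2/3 (PR) vs 0/10 (baseline)
const pImprovement = fisherOneSided(2, 3, 0, 10, 'greater');

console.log(pRegression.toFixed(3), pImprovement.toFixed(3)); // prints "0.003 0.038"
```

With N=3 on the PR side, the smallest achievable p against a 10-trial baseline is 1/286, so a single re-run can move a scenario in or out of the flagged set.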

Failure breakdown

| Category | PR | Baseline | Δ |
| --- | --- | --- | --- |
| builder_issue | 5 (13.9%) | 111 (27.8%) | -13.9pp ↓ notable |
| mock_issue | 0 (0.0%) | 10 (2.5%) | -2.5pp ↓ |
| framework_issue | 0 (0.0%) | 1 (0.3%) | -0.3pp ↓ |

Per-test-case results (6)

| Workflow | Built | pass@3 | pass^3 |
| --- | --- | --- | --- |
| airtable-split-to-slack | 3/3 | 100% | 100% |
| notification-router | 3/3 | 100% | 100% |
| rest-api-data-pipeline | 3/3 | 67% | 43% |
| telegram-chatbot-memory-session | 3/3 | 100% | 100% |
| weather-alert | 3/3 | 100% | 65% |
| workflow-data-table | 3/3 | 100% | 100% |
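
The pass@3 and pass^3 columns are not defined in this report. The figures are reproduced by the following reading, which is an assumption: pass@3 is 1 for a scenario if at least one of its three trials passed, pass^3 is the scenario pass rate cubed, and both are averaged over the scenarios of a test case. A sketch with illustrative helper names:

```typescript
// Hypothetical shape for one scenario's PR results.
interface ScenarioResult {
  scenario: string;
  passed: number;
  trials: number;
}

// pass@k with k equal to the number of trials: did any trial pass?
function passAtK(s: ScenarioResult): number {
  return s.passed > 0 ? 1 : 0;
}

// pass^k: estimated probability that k independent trials all pass.
function passHatK(s: ScenarioResult, k: number): number {
  return Math.pow(s.passed / s.trials, k);
}

// Mean over scenarios, rounded to a whole percentage.
function meanPct(values: number[]): number {
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  return Math.round(mean * 100);
}

// Per-scenario counts inferred from the regression, stable, and failure-detail sections.
const restApi: ScenarioResult[] = [
  { scenario: 'empty-response', passed: 0, trials: 3 },
  { scenario: 'all-filtered', passed: 2, trials: 3 },
  { scenario: 'happy-path', passed: 3, trials: 3 },
];
const weatherAlert: ScenarioResult[] = [
  { scenario: 'happy-path', passed: 2, trials: 3 },
  { scenario: 'rain-not-expected', passed: 3, trials: 3 },
];

const restApiPassAt3 = meanPct(restApi.map(passAtK));
const restApiPassHat3 = meanPct(restApi.map((s) => passHatK(s, 3)));
const weatherPassAt3 = meanPct(weatherAlert.map(passAtK));
const weatherPassHat3 = meanPct(weatherAlert.map((s) => passHatK(s, 3)));

console.log(restApiPassAt3, restApiPassHat3, weatherPassAt3, weatherPassHat3); // prints "67 43 100 65"
```

pass^3 is the stricter of the two: weather-alert scores 100% on pass@3 but only 65% on pass^3 because one of its scenarios passed two of three trials.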

Workflow checks

Scored over 18 successful build(s). N/A = check did not apply to that workflow.

| Dimension | Check | Kind | Pass | Fail | N/A | Pass rate |
| --- | --- | --- | --- | --- | --- | --- |
| parameter_correctness | correct_node_operations | llm | 17 | 1 | 0 | 94% |
| intent_match | fulfills_user_request | llm | 17 | 1 | 0 | 94% |
| nodes_craftsmanship | response_matches_workflow_changes | llm | 17 | 1 | 0 | 94% |

All workflow checks (3 failing of 32 checks)

| Dimension | Check | Kind | Pass | Fail | N/A | Pass rate |
| --- | --- | --- | --- | --- | --- | --- |
| structure | has_nodes | deterministic | 18 | 0 | 0 | 100% |
| structure | has_start_node | deterministic | 18 | 0 | 0 | 100% |
| structure | has_trigger | deterministic | 18 | 0 | 0 | 100% |
| structure | no_disabled_nodes | deterministic | 18 | 0 | 0 | 100% |
| connection_topology | all_nodes_connected | deterministic | 18 | 0 | 0 | 100% |
| connection_topology | error_routes_consistent | deterministic | 18 | 0 | 0 | 100% |
| connection_topology | handles_multiple_items | llm | 18 | 0 | 0 | 100% |
| connection_topology | no_unreachable_nodes | deterministic | 18 | 0 | 0 | 100% |
| connection_topology | switch_fallback_output_enabled | deterministic | 3 | 0 | 15 | 100% |
| parameter_correctness | correct_node_operations | llm | 17 | 1 | 0 | 94% |
| parameter_correctness | expressions_reference_existing_nodes | deterministic | 3 | 0 | 15 | 100% |
| parameter_correctness | http_generic_auth_type_matches_prompt | deterministic | 5 | 0 | 13 | 100% |
| parameter_correctness | item_flow_independent_source_execute_once | deterministic | 0 | 0 | 18 | |
| parameter_correctness | item_flow_paired_item_references | deterministic | 0 | 0 | 18 | |
| parameter_correctness | no_empty_set_nodes | deterministic | 0 | 0 | 18 | |
| parameter_correctness | no_invalid_from_ai | deterministic | 0 | 0 | 18 | |
| parameter_correctness | valid_data_flow | llm | 18 | 0 | 0 | 100% |
| parameter_correctness | valid_field_references | deterministic | 18 | 0 | 0 | 100% |
| parameter_correctness | valid_node_config | deterministic | 18 | 0 | 0 | 100% |
| intent_match | fulfills_user_request | llm | 17 | 1 | 0 | 94% |
| ai_nodes | agent_has_dynamic_prompt | deterministic | 3 | 0 | 15 | 100% |
| ai_nodes | agent_has_language_model | deterministic | 3 | 0 | 15 | 100% |
| ai_nodes | memory_properly_connected | deterministic | 3 | 0 | 15 | 100% |
| ai_nodes | memory_session_key_expression | deterministic | 3 | 0 | 15 | 100% |
| ai_nodes | tools_have_parameters | deterministic | 0 | 0 | 18 | |
| ai_nodes | vector_store_has_embeddings | deterministic | 0 | 0 | 18 | |
| nodes_craftsmanship | code_node_no_http_requests | deterministic | 8 | 0 | 10 | 100% |
| nodes_craftsmanship | descriptive_node_names | llm | 18 | 0 | 0 | 100% |
| nodes_craftsmanship | no_unnecessary_code_nodes | llm | 18 | 0 | 0 | 100% |
| nodes_craftsmanship | response_matches_workflow_changes | llm | 17 | 1 | 0 | 94% |
| security | inbound_trigger_auth_defaults | deterministic | 3 | 0 | 15 | 100% |
| security | no_hardcoded_credentials | deterministic | 10 | 0 | 8 | 100% |

Other findings: 9 stable

Stable (9):
airtable-split-to-slack/empty-records, airtable-split-to-slack/happy-path, notification-router/high-priority, notification-router/low-priority, notification-router/medium-priority, rest-api-data-pipeline/all-filtered, rest-api-data-pipeline/happy-path, telegram-chatbot-memory-session/distinct-telegram-chat, weather-alert/rain-not-expected.

Failure details

rest-api-data-pipeline/empty-response — 3 failed

Run [builder_issue]: The Fetch Posts node returned an empty array (0 items), which caused Filter and Build Message and Post to Slack to not run at all. In n8n, when 0 items flow into a node, that node does not execute. Th
Run [builder_issue]: The Fetch Posts node returned an empty array [], which produced 0 output items. With 0 items flowing into Filter & Build Message, that node never executed (n8n's default behavior: 0 items = node doesn
Run [builder_issue]: The Fetch Posts node returned an empty array [], which produced 0 items. With 0 items flowing into 'Drop Titles With "qui"', that Filter node did not run. With 0 items flowing into 'Build Digest', tha

rest-api-data-pipeline/all-filtered — 1 failed

Run [builder_issue]: All 3 posts were correctly filtered out by the 'Drop Titles With "qui"' node (all titles contain 'qui'), producing 0 items on branch 0. However, when 0 items flow into Build Digest, n8n does not execu

weather-alert/happy-path — 1 failed

Run [builder_issue]: The workflow fails for two independent reasons. First, the 'Detect Rain Today' code node returned rainExpected: false despite the mock data containing a rain entry (id=500, rain.3h=0.5 at 2024-06-12 1

@riqwan riqwan marked this pull request as ready for review June 17, 2026 13:53

@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 24 files

Architecture diagram
sequenceDiagram
    participant Agent as AI Agent
    participant KB as Knowledge Base
    participant Tools as Tool Registry
    participant Nodes as Nodes Tool
    participant Build as Build Workflow Tool
    participant Forms as Form Node
    participant OpenAI as OpenAI Node

    Note over Agent,OpenAI: Workflow Builder – Eval + Prompt Trimming Flow

    Agent->>KB: Request skill instructions + reference docs
    KB-->>Agent: Updated SKILL.md (trimmed tool surface, repair strategy, placeholders, working memory)
    KB-->>Agent: NEW: open-ai-output-shape.md reference

    Agent->>Agent: Compose system prompt from skill + KB + shared prompts
    Note over Agent: Shared prompt trimmed: PLACEHOLDERS_RULE removed. Working memory section removed.

    Agent->>Tools: Discover available tools with updated descriptions
    Tools->>Nodes: NEW: enriched action descriptions (suggested, search, type-definition, explore-resources)
    Tools->>Build: NEW: "Primary workflow-builder tool" prefix, patch-mode details
    Tools-->>Agent: Tool set with builder-specific surface hints

    Agent->>Nodes: Get node type definitions
    Nodes-->>Agent: OpenAI v2 text/response: builderHint marks 'message' as invalid
    Nodes-->>Agent: Form common: builderHint lists valid field types, notes no 'time' type
    Agent->>KB: Reference open-ai-output-shape.md for downstream field access
    KB-->>Agent: $json.output[0].content[0].text (v2+ response), not $json.text

    alt Build workflow with Google Sheets + OpenAI
        Agent->>Build: Submit SDK code (form trigger → OpenAI → Google Sheets → Form Ending)
        Build-->>Agent: Workflow created
        Agent->>Agent: NEW: Judge against eval case — form-booking.json
        Note over Agent: Eval checks: correct OpenAI field mapping, Google Sheets 7 columns, Form Ending exact text
    else Verify existing workflow
        Agent->>Build: Patch mode with workflowId + old_str/new_str
        Build-->>Agent: Patched workflow
    end

    opt User has Google Sheets with unknown document ID
        Agent->>Agent: Use placeholder() in documentId __rl.value with cachedResultName
        Note over Agent: Never empty string — follow trimmed placeholder guidance
    end

    Agent->>Agent: Report verification verdict (only in checkpoint follow-up turns)
    Note over Agent: Tool descriptions enforce: complete-checkpoint, report-verification-verdict, verify-built-workflow all scoped to specific turns
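
The diagram notes that downstream nodes should read `$json.output[0].content[0].text` for v2+ responses rather than `$json.text`. A small sketch of that access path with optional chaining; the `OpenAiResponseJson` type and `readResponseText` helper are hypothetical and only mirror the path named in the diagram, not the node's full output schema:

```typescript
// Hypothetical, partial shape: only the fields on the path named in the diagram.
interface OpenAiResponseJson {
  output?: Array<{ content?: Array<{ text?: string }> }>;
  text?: string; // the field the diagram says NOT to rely on for v2+ responses
}

// Reads the generated text from the nested path, returning undefined
// instead of throwing when any level is missing.
function readResponseText(json: OpenAiResponseJson): string | undefined {
  return json.output?.[0]?.content?.[0]?.text;
}

const v2Shaped: OpenAiResponseJson = {
  output: [{ content: [{ text: 'Booking confirmed' }] }],
};
const flatShaped: OpenAiResponseJson = { text: 'Booking confirmed' };

console.log(readResponseText(v2Shaped)); // prints "Booking confirmed"
console.log(readResponseText(flatShaped)); // prints "undefined"
```

A workflow that maps `$json.text` against the nested shape would get an empty value in the same way the second call does here, which is the kind of field-mapping error the eval case checks for.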

Comment thread packages/@n8n/instance-ai/src/tools/nodes/suggested-nodes-data.ts
@riqwan riqwan requested a review from aalises June 17, 2026 22:58

Labels

cla-signed n8n team Authored by the n8n team
