feat(core): Add eval + trim builder prompt (no-changelog) by riqwan · Pull Request #32484 · n8n-io/n8n

riqwan · 2026-06-17T12:49:53Z

Summary

Adds a failing case from production traces, creates an eval from it, and trims the builder prompt by moving tool descriptions out of the builder skill and into the tools' own descriptions.

Related Linear tickets, GitHub issues, and Community forum posts

RESOLVES INS-574

Review / Merge checklist

I have seen this code, I have run this code, and I take responsibility for this code.
PR title and summary are descriptive. (conventions)
Docs updated or follow-up ticket created.
Tests included.
PR Labeled with Backport to Beta, Backport to Stable, or Backport to v1 (if the PR is an urgent fix that needs to be backported)

n8n-assistant · 2026-06-17T12:50:53Z

PR review overview

Based on ownership of the 24 changed files in this PR:

Ownership	Files owned	Share	Source code	Test files	Misc
@n8n-io/instance-ai	22	92%	+125 / -38	+7 / -1	+92 / -72
@n8n-io/ai	1	4%	+4 / -0	+0 / -0	+0 / -0
@n8n-io/nodes	1	4%	+4 / -0	+0 / -0	+0 / -0
Total	24	100%	+133 / -38	+7 / -1	+92 / -72

cubic-dev-ai

2 issues found across 44 files

codecov · 2026-06-17T13:18:29Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

github-actions · 2026-06-17T13:21:29Z

Instance AI Discovery Eval ✅

Branch: ins-574-eval-1 · Commit: ccca1009d7afafbc561507f4d41d09b9a183852d

Eval output


> @n8n/instance-ai@1.12.0 eval:discovery /home/runner/_work/n8n/n8n/packages/@n8n/instance-ai
> tsx evaluations/discovery/cli.ts --trials 3 --fail-on-zero-pass

Running 9 discovery scenario(s) × 3 trial(s) (model: anthropic/claude-sonnet-4-6, concurrency: 3).

▸ data-table-natural-list-skill-loading ... ✓ 3/3 passed (100%)
▸ data-table-skill-loading ... ✓ 3/3 passed (100%)
▸ data-table-workflow-skill-loading ... ✓ 3/3 passed (100%)
▸ google-oauth-credential-setup ... ✓ 3/3 passed (100%)
▸ http-node-config-no-browser ... ✓ 3/3 passed (100%)
▸ oauth-with-computer-use-disabled ... ✓ 3/3 passed (100%)
▸ screenshot-dashboard ... ✓ 3/3 passed (100%)
▸ slack-oauth-credential-setup ... ✓ 3/3 passed (100%)
▸ workflow-builder-no-credential-ask ... ✓ 2/3 passed (67%)

=== Summary ===
Scenarios: 9/9 above threshold (67%)
Trials: 26/27 passed (96%)
Total time: 673.5s

github-actions · 2026-06-17T13:23:11Z

Instance AI Workflow Eval

Important

This eval does not re-run on new commits — ▶ Re-run this eval (then Re-run jobs) when you're ready to merge.

Caution

🔴 1 regression · 0 likely regressions · 1 worth watching
1 improvement · 9 stable · pass rate +5.3pp vs baseline

Aggregate: 86.1% PR vs 80.8% baseline — +5.3pp ↑
12 scenarios · N=3 (PR) vs N=10 (baseline) · baseline: instance-ai-baseline-cba67a7c
Partial: 28 baseline scenarios not run by PR.

Regressions (1) — high-confidence

Scenario	PR	Baseline	Δ	p
`rest-api-data-pipeline/empty-response`	0/3 (0%)	10/10 (100%)	-100pp ↓	0.003

rest-api-data-pipeline/empty-response — 3 of 3 failed · 3× builder_issue

Run 1 [builder_issue]: The Fetch Posts node returned an empty array (0 items), which caused Filter and Build Message and Post to Slack to not run at all. In n8n, when 0 items flow into a node, that node does not execute. The Code node's logic correctly handles an empty posts array (it would produce a message '0 posts rema

Run 2 [builder_issue]: The Fetch Posts node returned an empty array [], which produced 0 output items. With 0 items flowing into Filter & Build Message, that node never executed (n8n's default behavior: 0 items = node doesn't run). As a result, Post to Slack also never ran. No Slack message was posted stating '0 posts rem

Run 3 [builder_issue]: The Fetch Posts node returned an empty array [], which produced 0 items. With 0 items flowing into 'Drop Titles With "qui"', that Filter node did not run. With 0 items flowing into 'Build Digest', that Code node did not run. With 0 items flowing into 'Post to Slack', that Slack node did not run. No
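
The three runs above describe the same mechanism: once a node emits 0 items, every node after it is skipped. The sketch below is a toy model of that behaviour, not n8n's execution engine. `ToyNode`, `runChain` and `buildChain` are hypothetical names, and the `alwaysOutputData` flag is only loosely modelled on n8n's "Always Output Data" node setting. Whether the builder should rely on that setting for this scenario is not stated in this PR.

```typescript
type Item = Record<string, unknown>;

// Hypothetical node shape for this toy model only.
interface ToyNode {
  name: string;
  run: (items: Item[]) => Item[];
  // Loosely modelled on n8n's "Always Output Data" setting (assumption).
  alwaysOutputData?: boolean;
}

function runChain(nodes: ToyNode[], input: Item[]): { executed: string[]; output: Item[] } {
  const executed: string[] = [];
  let items = input;
  for (const node of nodes) {
    // 0 items flowing in: this node and everything after it never runs.
    if (items.length === 0) break;
    executed.push(node.name);
    items = node.run(items);
    // With the flag set, an empty result becomes a single empty item.
    if (items.length === 0 && node.alwaysOutputData) items = [{}];
  }
  return { executed, output: items };
}

function buildChain(fetchAlwaysOutputs: boolean): ToyNode[] {
  return [
    // Simulates the empty-response scenario: the API returns [].
    { name: "Fetch Posts", run: () => [], alwaysOutputData: fetchAlwaysOutputs },
    {
      name: "Build Message",
      run: (items) => {
        // Count only real posts; the empty placeholder item has no title.
        const posts = items.filter((item) => typeof item.title === "string");
        return [{ text: `posts: ${posts.length}` }];
      },
    },
    { name: "Post to Slack", run: (items) => items },
  ];
}

const skipped = runChain(buildChain(false), [{}]);
const forced = runChain(buildChain(true), [{}]);
console.log(skipped.executed); // only the fetch node ran
console.log(forced.executed, forced.output); // all three nodes ran
```

In the toy model the downstream nodes only run when the fetch node is forced to emit an item, which is the gap the eval's empty-response case is probing.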

Worth watching (1) — large change, not flagged as a regression

Scenario	PR	Baseline	Δ
`workflow-data-table/happy-path`	3/3 (100%)	4/10 (40%)	+60pp ↑

Improvements (1)

Scenario	PR	Baseline	Δ	p
`weather-alert/happy-path`	2/3 (67%)	0/10 (0%)	+67pp ↑	0.038

p = Fisher's exact one-sided p-value. Lower = stronger evidence of a real change.
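
For readers who want to check the p-values, here is a minimal sketch of a one-sided Fisher's exact test on pass/fail counts. The function names are hypothetical and this is not the eval harness's own code. It sums hypergeometric probabilities over outcomes at least as extreme as the observed PR pass count, in the direction being tested.

```typescript
// Binomial coefficient C(n, k), computed multiplicatively.
function choose(n: number, k: number): number {
  if (k < 0 || k > n) return 0;
  let result = 1;
  for (let i = 1; i <= k; i++) {
    result = (result * (n - k + i)) / i;
  }
  return result;
}

// Probability that the PR group holds exactly x of the pooled passes.
function hypergeom(x: number, prN: number, totalPass: number, totalN: number): number {
  return (choose(totalPass, x) * choose(totalN - totalPass, prN - x)) / choose(totalN, prN);
}

type Direction = "regression" | "improvement";

function fisherOneSided(
  prPass: number,
  prN: number,
  basePass: number,
  baseN: number,
  direction: Direction,
): number {
  const totalPass = prPass + basePass;
  const totalN = prN + baseN;
  const lo = Math.max(0, prN - (totalN - totalPass));
  const hi = Math.min(prN, totalPass);
  let p = 0;
  for (let x = lo; x <= hi; x++) {
    // Regression: PR passes as few or fewer; improvement: as many or more.
    const inTail = direction === "regression" ? x <= prPass : x >= prPass;
    if (inTail) p += hypergeom(x, prN, totalPass, totalN);
  }
  return p;
}

// rest-api-data-pipeline/empty-response: 0/3 (PR) vs 10/10 (baseline)
console.log(fisherOneSided(0, 3, 10, 10, "regression").toFixed(3));
// weather-alert/happy-path: 2/3 (PR) vs 0/10 (baseline)
console.log(fisherOneSided(2, 3, 0, 10, "improvement").toFixed(3));
```

With the counts from the two tables above, this gives 1/286 ≈ 0.003 for the regression and 11/286 ≈ 0.038 for the improvement, matching the reported values.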

Failure breakdown

Category	PR	Baseline	Δ
`builder_issue`	5 (13.9%)	111 (27.8%)	-13.9pp ↓	notable
`mock_issue`	0 (0.0%)	10 (2.5%)	-2.5pp ↓
`framework_issue`	0 (0.0%)	1 (0.3%)	-0.3pp ↓

Per-test-case results (6)

Workflow	Built	pass@3	pass^3
`airtable-split-to-slack`	3/3	100%	100%
`notification-router`	3/3	100%	100%
`rest-api-data-pipeline`	3/3	67%	43%
`telegram-chatbot-memory-session`	3/3	100%	100%
`weather-alert`	3/3	100%	65%
`workflow-data-table`	3/3	100%	100%
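
The report does not say how pass@3 and pass^3 are computed. One reading that reproduces the figures above, offered as an assumption: per scenario, pass@k uses the unbiased estimator 1 - C(n-c, k)/C(n, k) and pass^k uses (c/n)^k, and each workflow's value is the mean over its scenarios. The function names below are hypothetical. The per-scenario counts come from this report (for example, `rest-api-data-pipeline` has 0/3, 2/3 and 3/3 passes across its three scenarios).

```typescript
// Binomial coefficient C(n, k); returns 0 when k is out of range.
function binomial(n: number, k: number): number {
  if (k < 0 || k > n) return 0;
  let result = 1;
  for (let i = 1; i <= k; i++) {
    result = (result * (n - k + i)) / i;
  }
  return result;
}

// pass@k: chance that at least one of k sampled trials passes (n trials, c passes).
function passAtK(n: number, c: number, k: number): number {
  return 1 - binomial(n - c, k) / binomial(n, k);
}

// pass^k: chance that all k independent trials pass, estimated as (c/n)^k.
function passHatK(n: number, c: number, k: number): number {
  return Math.pow(c / n, k);
}

function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

interface ScenarioResult {
  n: number; // trials run
  c: number; // trials passed
}

function summarize(scenarios: ScenarioResult[], k: number): { passAt: number; passHat: number } {
  return {
    passAt: Math.round(mean(scenarios.map((s) => passAtK(s.n, s.c, k))) * 100),
    passHat: Math.round(mean(scenarios.map((s) => passHatK(s.n, s.c, k))) * 100),
  };
}

// rest-api-data-pipeline: empty-response 0/3, all-filtered 2/3, happy-path 3/3
const restApi = summarize([{ n: 3, c: 0 }, { n: 3, c: 2 }, { n: 3, c: 3 }], 3);
// weather-alert: happy-path 2/3, rain-not-expected 3/3
const weather = summarize([{ n: 3, c: 2 }, { n: 3, c: 3 }], 3);
console.log(restApi, weather);
```

Under these assumptions `rest-api-data-pipeline` comes out at 67% / 43% and `weather-alert` at 100% / 65%, in line with the table. The gap between the two columns shows how a scenario that passes only some of the time still counts fully toward pass@3 but drags pass^3 down.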

Workflow checks

Scored over 18 successful build(s). N/A = check did not apply to that workflow.

Dimension	Check	Kind	Pass	Fail	Pass rate
`parameter_correctness`	`correct_node_operations`	llm	17	1	94%
`intent_match`	`fulfills_user_request`	llm	17	1	94%
`nodes_craftsmanship`	`response_matches_workflow_changes`	llm	17	1	94%

All workflow checks (3 failing of 32 checks)

Dimension	Check	Kind	Pass	Fail	N/A	Pass rate
`structure`	`has_nodes`	deterministic	18	0	0	100%
`structure`	`has_start_node`	deterministic	18	0	0	100%
`structure`	`has_trigger`	deterministic	18	0	0	100%
`structure`	`no_disabled_nodes`	deterministic	18	0	0	100%
`connection_topology`	`all_nodes_connected`	deterministic	18	0	0	100%
`connection_topology`	`error_routes_consistent`	deterministic	18	0	0	100%
`connection_topology`	`handles_multiple_items`	llm	18	0	0	100%
`connection_topology`	`no_unreachable_nodes`	deterministic	18	0	0	100%
`connection_topology`	`switch_fallback_output_enabled`	deterministic	3	0	15	100%
`parameter_correctness`	`correct_node_operations`	llm	17	1	0	94%
`parameter_correctness`	`expressions_reference_existing_nodes`	deterministic	3	0	15	100%
`parameter_correctness`	`http_generic_auth_type_matches_prompt`	deterministic	5	0	13	100%
`parameter_correctness`	`item_flow_independent_source_execute_once`	deterministic	0	0	18	—
`parameter_correctness`	`item_flow_paired_item_references`	deterministic	0	0	18	—
`parameter_correctness`	`no_empty_set_nodes`	deterministic	0	0	18	—
`parameter_correctness`	`no_invalid_from_ai`	deterministic	0	0	18	—
`parameter_correctness`	`valid_data_flow`	llm	18	0	0	100%
`parameter_correctness`	`valid_field_references`	deterministic	18	0	0	100%
`parameter_correctness`	`valid_node_config`	deterministic	18	0	0	100%
`intent_match`	`fulfills_user_request`	llm	17	1	0	94%
`ai_nodes`	`agent_has_dynamic_prompt`	deterministic	3	0	15	100%
`ai_nodes`	`agent_has_language_model`	deterministic	3	0	15	100%
`ai_nodes`	`memory_properly_connected`	deterministic	3	0	15	100%
`ai_nodes`	`memory_session_key_expression`	deterministic	3	0	15	100%
`ai_nodes`	`tools_have_parameters`	deterministic	0	0	18	—
`ai_nodes`	`vector_store_has_embeddings`	deterministic	0	0	18	—
`nodes_craftsmanship`	`code_node_no_http_requests`	deterministic	8	0	10	100%
`nodes_craftsmanship`	`descriptive_node_names`	llm	18	0	0	100%
`nodes_craftsmanship`	`no_unnecessary_code_nodes`	llm	18	0	0	100%
`nodes_craftsmanship`	`response_matches_workflow_changes`	llm	17	1	0	94%
`security`	`inbound_trigger_auth_defaults`	deterministic	3	0	15	100%
`security`	`no_hardcoded_credentials`	deterministic	10	0	8	100%

Other findings: 9 stable

Stable (9):
airtable-split-to-slack/empty-records, airtable-split-to-slack/happy-path, notification-router/high-priority, notification-router/low-priority, notification-router/medium-priority, rest-api-data-pipeline/all-filtered, rest-api-data-pipeline/happy-path, telegram-chatbot-memory-session/distinct-telegram-chat, weather-alert/rain-not-expected.

Failure details

rest-api-data-pipeline/empty-response — 3 failed

Run [builder_issue]: The Fetch Posts node returned an empty array (0 items), which caused Filter and Build Message and Post to Slack to not run at all. In n8n, when 0 items flow into a node, that node does not execute. Th
Run [builder_issue]: The Fetch Posts node returned an empty array [], which produced 0 output items. With 0 items flowing into Filter & Build Message, that node never executed (n8n's default behavior: 0 items = node doesn
Run [builder_issue]: The Fetch Posts node returned an empty array [], which produced 0 items. With 0 items flowing into 'Drop Titles With "qui"', that Filter node did not run. With 0 items flowing into 'Build Digest', tha

rest-api-data-pipeline/all-filtered — 1 failed

Run [builder_issue]: All 3 posts were correctly filtered out by the 'Drop Titles With "qui"' node (all titles contain 'qui'), producing 0 items on branch 0. However, when 0 items flow into Build Digest, n8n does not execu

weather-alert/happy-path — 1 failed

Run [builder_issue]: The workflow fails for two independent reasons. First, the 'Detect Rain Today' code node returned rainExpected: false despite the mock data containing a rain entry (id=500, rain.3h=0.5 at 2024-06-12 1

cubic-dev-ai

1 issue found across 24 files

Architecture diagram

sequenceDiagram
    participant Agent as AI Agent
    participant KB as Knowledge Base
    participant Tools as Tool Registry
    participant Nodes as Nodes Tool
    participant Build as Build Workflow Tool
    participant Forms as Form Node
    participant OpenAI as OpenAI Node

    Note over Agent,OpenAI: Workflow Builder – Eval + Prompt Trimming Flow

    Agent->>KB: Request skill instructions + reference docs
    KB-->>Agent: Updated SKILL.md (trimmed tool surface, repair strategy, placeholders, working memory)
    KB-->>Agent: NEW: open-ai-output-shape.md reference

    Agent->>Agent: Compose system prompt from skill + KB + shared prompts
    Note over Agent: Shared prompt trimmed: PLACEHOLDERS_RULE removed. Working memory section removed.

    Agent->>Tools: Discover available tools with updated descriptions
    Tools->>Nodes: NEW: enriched action descriptions (suggested, search, type-definition, explore-resources)
    Tools->>Build: NEW: "Primary workflow-builder tool" prefix, patch-mode details
    Tools-->>Agent: Tool set with builder-specific surface hints

    Agent->>Nodes: Get node type definitions
    Nodes-->>Agent: OpenAI v2 text/response: builderHint marks 'message' as invalid
    Nodes-->>Agent: Form common: builderHint lists valid field types, notes no 'time' type
    Agent->>KB: Reference open-ai-output-shape.md for downstream field access
    KB-->>Agent: $json.output[0].content[0].text (v2+ response), not $json.text

    alt Build workflow with Google Sheets + OpenAI
        Agent->>Build: Submit SDK code (form trigger → OpenAI → Google Sheets → Form Ending)
        Build-->>Agent: Workflow created
        Agent->>Agent: NEW: Judge against eval case — form-booking.json
        Note over Agent: Eval checks: correct OpenAI field mapping, Google Sheets 7 columns, Form Ending exact text
    else Verify existing workflow
        Agent->>Build: Patch mode with workflowId + old_str/new_str
        Build-->>Agent: Patched workflow
    end

    opt User has Google Sheets with unknown document ID
        Agent->>Agent: Use placeholder() in documentId __rl.value with cachedResultName
        Note over Agent: Never empty string — follow trimmed placeholder guidance
    end

    Agent->>Agent: Report verification verdict (only in checkpoint follow-up turns)
    Note over Agent: Tool descriptions enforce: complete-checkpoint, report-verification-verdict, verify-built-workflow all scoped to specific turns
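
The diagram's note about downstream field access can be illustrated with a small sketch. The type below is an assumption inferred only from the expression `$json.output[0].content[0].text` shown in the diagram. It is not the OpenAI node's documented output schema, and `OpenAiNodeJson` and `readResponseText` are hypothetical names.

```typescript
// Assumed item shape, inferred from the expression in the diagram.
interface OpenAiNodeJson {
  output?: Array<{ content?: Array<{ text?: string }> }>;
  // Flat field that the diagram says downstream nodes should not use.
  text?: string;
}

// Mirrors $json.output[0].content[0].text with safe navigation.
function readResponseText(json: OpenAiNodeJson): string | undefined {
  return json.output?.[0]?.content?.[0]?.text;
}

const v2Item: OpenAiNodeJson = { output: [{ content: [{ text: "Booked for 3pm" }] }] };
const flatItem: OpenAiNodeJson = { text: "Booked for 3pm" };

console.log(readResponseText(v2Item)); // the nested text is found
console.log(readResponseText(flatItem)); // undefined: no nested output array
```

A downstream expression that reads `$json.text` against the nested shape would resolve to nothing, which is the kind of mapping error the new reference doc and the `form-booking.json` eval case are meant to catch.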

[MOVE AROUND] LLM visualization

01bcd3e

riqwan marked this pull request as ready for review June 17, 2026 12:50

n8n-assistant Bot added the cla-signed label Jun 17, 2026

riqwan marked this pull request as draft June 17, 2026 12:55

n8n-assistant Bot added the n8n team label Jun 17, 2026

cubic-dev-ai Bot reviewed Jun 17, 2026

Comment thread packages/@n8n/instance-ai/evaluations/langsmith/dataset-sync.ts Outdated

Comment thread packages/@n8n/instance-ai/src/tools/workflows/upstream-field-references.ts Outdated

riqwan changed the title from "feat(core): add failed eval + trim builder prompt" to "feat(core): Add eval + trim builder prompt (no-changelog)" Jun 17, 2026

chore: add eval, trim builder skill

0d66024

riqwan force-pushed the ins-574-eval-1 branch from 54ba78e to 0d66024 Compare June 17, 2026 13:22

riqwan added 2 commits June 17, 2026 15:41

chore: cleanup

5f92f89

chore: cleanup

04af6d1

riqwan marked this pull request as ready for review June 17, 2026 13:53

cubic-dev-ai Bot reviewed Jun 17, 2026

Comment thread packages/@n8n/instance-ai/src/tools/nodes/suggested-nodes-data.ts

riqwan and others added 3 commits June 17, 2026 16:23

chore: fix skill

3ea45a6

chore: fix skill

69d1419

Merge branch 'master' into ins-574-eval-1

123f71e

riqwan requested a review from aalises June 17, 2026 22:58

riqwan wants to merge 7 commits into master from ins-574-eval-1
