LCORE-2080: Added E2E Steps for Agent Skills#1941
Walkthrough
Adds e2e skill fixtures, compose mounts, and Lightspeed stack configs for server and library modes. Updates streaming response helpers to capture tool calls and results, and expands the skills feature scenarios to use the new load/read skill flows.
Changes
Skills e2e wiring
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 5 passed
c201e27 to
fe7754f
Compare
|
|
||
| @SkillsConfig | ||
| @SkillsConfig @skip | ||
| Scenario: Skill tools are registered when skills are configured |
TODO: Need to reflect skill tools (list_skills, load_skill, read_skill_resource) in /tools.
| """ | ||
| And The token metrics have increased | ||
|
|
||
| # --- Error handling: unknown skill --- |
The "Error Paths" scenarios will have to be skipped for now: the skill tools do fail and produce a result, but it is a different part type that the response-building code silently discards.
Below is the relevant part of a conversation with Claude about the issue.
Pydantic-ai catches ModelRetry and wraps the error in a RetryPromptPart (not a ToolReturnPart). The FunctionToolResultEvent.part is typed as ToolReturnPart | RetryPromptPart — it can be either.
Where LCS drops it:
In the non-streaming path, build_turn_summary_from_agent_run only processes ToolReturnPart:
query.py
Lines 266-269
elif isinstance(message, ModelRequest):
    for request_part in message.parts:
        if isinstance(request_part, ToolReturnPart):
            process_function_tool_result(state, request_part)
In the streaming path, the same filter exists:
streaming.py
Lines 522-524
part = event.part
if not isinstance(part, ToolReturnPart):
    return None
Both paths explicitly ignore RetryPromptPart, so the retry/error message for load_skill is never surfaced as a tool_result in the API response.
The result:
- Both tool calls appear (because both ToolCallPart instances from the ModelResponse are processed)
- Only the list_skills result appears (because it succeeded and produced a ToolReturnPart)
- The load_skill result is missing (because it raised ModelRetry → became a RetryPromptPart → silently dropped)
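One possible way to stop dropping the failure is sketched below. It uses stand-in dataclasses rather than the real pydantic-ai classes (only the fields used here are mirrored), and the output dict shape and the `status` values are assumptions for illustration, not the LCS response schema.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class ToolReturnPart:
    """Stand-in for pydantic-ai's ToolReturnPart (a successful tool result)."""

    tool_name: str
    content: str
    tool_call_id: str


@dataclass
class RetryPromptPart:
    """Stand-in for pydantic-ai's RetryPromptPart (the tool raised ModelRetry)."""

    tool_name: str
    content: str
    tool_call_id: str


def summarize_tool_result(
    part: Union[ToolReturnPart, RetryPromptPart],
) -> Optional[dict]:
    """Map either part type to a tool_result-like dict (hypothetical shape)."""
    if isinstance(part, ToolReturnPart):
        status = "success"
    elif isinstance(part, RetryPromptPart):
        # This branch is the one the current filters skip: surface the retry
        # message as a failed tool result instead of returning nothing.
        status = "failure"
    else:
        return None
    return {
        "id": part.tool_call_id,
        "name": part.tool_name,
        "status": status,
        "content": part.content,
    }


parts = [
    ToolReturnPart("list_skills", "echo, summarize", "call-1"),
    RetryPromptPart("load_skill", "Unknown skill: nope", "call-2"),
]
results = [summarize_tool_result(p) for p in parts]
```

With both branches handled, each tool call gets a matching result entry, which is what the skipped error-path scenarios would need to assert on.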
| ] | ||
| """ | ||
|
|
||
| # --- Full progressive disclosure flow --- |
This will likely be quite flaky: the LLM is given only the names of the skills (through an addition to the system prompt, I think), so it will sometimes use just load_skill and read_skill_resource and skip list_skills entirely.
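One way to reduce that flakiness might be to assert only on the tool calls the flow cannot work without, and treat list_skills as optional. The helper below is a minimal sketch; `missing_tool_calls` and `REQUIRED` are hypothetical names, not existing step code.

```python
from typing import Iterable


def missing_tool_calls(actual: Iterable[str], required: Iterable[str]) -> list[str]:
    """Return the required tool names that never appear in the observed calls."""
    seen = set(actual)
    return [name for name in required if name not in seen]


# list_skills is deliberately left out: the model may skip it when the skill
# names are already present in the system prompt.
REQUIRED = ["load_skill", "read_skill_resource"]

with_listing = ["list_skills", "load_skill", "read_skill_resource"]
without_listing = ["load_skill", "read_skill_resource"]
```

Both observed sequences pass this check, while a run that never loads the skill still fails.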
bd2b990 to
159e8ae
Compare
|
Please Review: |
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/e2e/configuration/server-mode/lightspeed-stack-skills-directory.yaml`:
- Around line 24-26: The skills discovery config is using a relative path in the
`skills.paths` entry, which can break startup when the working directory
changes. Update the YAML to use the absolute mounted path expected by the stack,
and keep the change localized to the `skills` block in
`lightspeed-stack-skills-directory.yaml` so startup consistently finds the
skills directory.
In `@tests/e2e/configuration/server-mode/lightspeed-stack-skills.yaml`:
- Around line 24-26: The skills path in the stack config is CWD-sensitive and
should be pinned to the mounted absolute location instead. Update the
`skills.paths` entry in the YAML so it points to `/app-root/skills/echo` rather
than the relative `skills/echo`, keeping the `skills` configuration
deterministic under the compose mount.
In `@tests/e2e/features/steps/common_http.py`:
- Around line 334-335: The expected JSON in the step implementation still parses
context.text directly, so placeholder tokens like {MODEL} are not substituted
before validation. Update the relevant step in common_http.py to apply the same
placeholder resolution used by the existing partial-body handling before calling
json.loads and validate_json_partially. Keep the fix localized to the step that
consumes context.text and ensure the parsed expected_value reflects substituted
placeholders first.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 2beba6a7-1aff-4350-92f7-60524e66a1c4
📒 Files selected for processing (14)
- docker-compose-library.yaml
- docker-compose.yaml
- tests/e2e/configuration/library-mode/lightspeed-stack-skills-directory.yaml
- tests/e2e/configuration/library-mode/lightspeed-stack-skills.yaml
- tests/e2e/configuration/server-mode/lightspeed-stack-skills-directory.yaml
- tests/e2e/configuration/server-mode/lightspeed-stack-skills.yaml
- tests/e2e/features/skills.feature
- tests/e2e/features/steps/common_http.py
- tests/e2e/features/steps/llm_query_response.py
- tests/e2e/skills/echo/SKILL.md
- tests/e2e/skills/echo/references/guide.md
- tests/e2e/skills/summarize/SKILL.md
- tests/e2e/skills/summarize/references/guide.md
- tests/e2e/test_list.txt
📜 Review details
⏰ Context from checks skipped due to timeout. (2)
- GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
- GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-0-6-on-pull-request
🧰 Additional context used
📓 Path-based instructions (2)
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest for all unit and integration tests; do not use unittest
Use `pytest.mark.asyncio` marker for async tests
Files:
- tests/e2e/features/steps/common_http.py
- tests/e2e/features/steps/llm_query_response.py
tests/e2e/**/*.{py,feature}
📄 CodeRabbit inference engine (AGENTS.md)
Use behave (BDD) framework for end-to-end testing with Gherkin feature files
Files:
- tests/e2e/features/steps/common_http.py
- tests/e2e/features/steps/llm_query_response.py
- tests/e2e/features/skills.feature
🧠 Learnings (4)
📚 Learning: 2026-05-20T08:09:30.641Z
Learnt from: max-svistunov
Repo: lightspeed-core/lightspeed-stack PR: 1580
File: docs/design/llama-stack-config-merge/poc-results/library-mode/synthesized-run.yaml:107-110
Timestamp: 2026-05-20T08:09:30.641Z
Learning: In Llama-stack config YAMLs, when defining a Llama Guard safety shield entry, set `provider_shield_id` to the *guard model identifier* (e.g., `meta-llama/Llama-Guard-3-8B`). Do not use a chat/generative model id (e.g., `openai/gpt-4o-mini`): a chat-model id (or `native_override`) indicates only an override landed and does **not** mean the safety shield is actually gating queries. Ensure any E2E coverage for the related implementation (JIRA/E2E tests) exercises a real Llama Guard model to verify that the shield is effective.
Applied to files:
- docker-compose-library.yaml
- tests/e2e/configuration/server-mode/lightspeed-stack-skills-directory.yaml
- tests/e2e/configuration/server-mode/lightspeed-stack-skills.yaml
- tests/e2e/configuration/library-mode/lightspeed-stack-skills.yaml
- docker-compose.yaml
- tests/e2e/configuration/library-mode/lightspeed-stack-skills-directory.yaml
📚 Learning: 2026-04-07T09:20:26.590Z
Learnt from: radofuchs
Repo: lightspeed-core/lightspeed-stack PR: 1467
File: tests/e2e/features/steps/common.py:36-49
Timestamp: 2026-04-07T09:20:26.590Z
Learning: For Behave-based Python tests, rely on Behave’s Context layered stack for attribute lifecycle: Behave pushes a new Context layer when entering feature scope (before_feature) and again for scenario scope (before_scenario). Attributes assigned inside given/when/then steps live on the current scenario layer and are automatically removed when the scenario ends. As a result, step-set attributes should not be expected to persist across scenarios or features, and manual cleanup in after_scenario/after_feature is generally unnecessary for attributes set in step functions. Only perform manual cleanup for attributes that you set explicitly in before_feature/before_scenario, since those live on the respective feature/scenario layers.
Applied to files:
- tests/e2e/features/steps/common_http.py
- tests/e2e/features/steps/llm_query_response.py
📚 Learning: 2026-04-13T13:39:54.963Z
Learnt from: radofuchs
Repo: lightspeed-core/lightspeed-stack PR: 1490
File: tests/e2e/features/environment.py:206-211
Timestamp: 2026-04-13T13:39:54.963Z
Learning: In lightspeed-stack E2E tests under tests/e2e/features, it is intentional to set context.feature_config inside Background/step functions (scenario-scoped Behave layer). The environment.py after_scenario restore logic should only restore configuration when context.scenario_lightspeed_override_active is True; this flag is set by configure_service only when a real config switch occurs (so restore does not run for scenarios without a switch). Additionally, steps/common.py’s module-level _active_lightspeed_stack_config_basename is used to prevent re-applying the same config across subsequent scenarios, ensuring scenario_lightspeed_override_active stays False after the first apply. Therefore, reviewers should not “fix” this flow as if feature_config were incorrectly scoped or if after_scenario restoration is missing—config switching and restoration are meant to happen exactly once per actual switch, not redundantly per scenario.
Applied to files:
- tests/e2e/features/steps/common_http.py
- tests/e2e/features/steps/llm_query_response.py
📚 Learning: 2026-06-24T13:45:37.249Z
Learnt from: Jdubrick
Repo: lightspeed-core/lightspeed-stack PR: 1971
File: src/utils/markdown_repair.py:31-36
Timestamp: 2026-06-24T13:45:37.249Z
Learning: In the lightspeed-stack repository, docstrings must use the section header name "Parameters:" (not "Args:") for function arguments, even if the project references Google Python docstring conventions. Ensure docstrings follow the project’s established "Parameters:" header format for any documented function parameters.
Applied to files:
- tests/e2e/features/steps/common_http.py
- tests/e2e/features/steps/llm_query_response.py
🪛 LanguageTool
tests/e2e/skills/echo/SKILL.md
[style] ~17-~17: Using “back” with the verb “return” may be redundant.
Context: ...r's input text 2. Return the exact text back to the user without modification For f...
(RETURN_BACK)
🔇 Additional comments (8)
docker-compose-library.yaml (1)
23-23: LGTM!
docker-compose.yaml (1)
90-90: LGTM!
tests/e2e/skills/echo/SKILL.md (1)
1-19: LGTM!
tests/e2e/skills/echo/references/guide.md (1)
1-20: LGTM!
tests/e2e/skills/summarize/SKILL.md (1)
1-22: LGTM!
tests/e2e/skills/summarize/references/guide.md (1)
1-21: LGTM!
tests/e2e/configuration/library-mode/lightspeed-stack-skills-directory.yaml (1)
1-26: LGTM!
tests/e2e/configuration/library-mode/lightspeed-stack-skills.yaml (1)
1-26: LGTM!
| skills: | ||
| paths: | ||
| - skills |
🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win
Use absolute skills path to avoid CWD-dependent startup failures.
skills is relative; if the service working directory changes, skills discovery can fail at startup. Use /app-root/skills to match the compose mount explicitly.
Proposed change
 skills:
   paths:
-    - skills
+    - /app-root/skills
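The CWD sensitivity behind this finding can be illustrated with plain path joining. This is only a sketch: `effective_path` is a hypothetical helper, and it assumes a relative entry is resolved against the process working directory, which the stack's actual loader may or may not do in exactly this way.

```python
from pathlib import PurePosixPath


def effective_path(cwd: str, configured: str) -> PurePosixPath:
    """Join the working directory with a configured path.

    Joining with an absolute path discards the left-hand side, so an
    absolute entry yields the same result regardless of cwd.
    """
    return PurePosixPath(cwd) / configured


relative_ok = effective_path("/app-root", "skills")
relative_moved = effective_path("/tmp", "skills")
absolute_moved = effective_path("/tmp", "/app-root/skills")
```

The relative entry only lands on the compose mount when the service happens to start in /app-root, while the absolute entry always does.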
| skills: | ||
| paths: | ||
| - skills/echo |
🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win
Pin the skill path to the mounted absolute location.
skills/echo is CWD-sensitive. Prefer /app-root/skills/echo for deterministic resolution against the compose mount.
Proposed change
 skills:
   paths:
-    - skills/echo
+    - /app-root/skills/echo
| expected_value = json.loads(context.text) | ||
| validate_json_partially(actual_value, expected_value) |
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Apply placeholder substitution before parsing expected JSON.
At Line 334, this step parses context.text directly, so placeholders like {MODEL} won’t be resolved here (unlike the existing partial-body step). That can cause false failures in scenario assertions.
Proposed fix
- expected_value = json.loads(context.text)
+ json_str = replace_placeholders(context, context.text)
+ expected_value = json.loads(json_str)
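Why the order matters can be shown with a small self-contained sketch. `substitute_placeholders` and `matches_partially` are hypothetical stand-ins for the repository's `replace_placeholders` and `validate_json_partially`, whose real signatures and behaviour are not shown in this thread; the model name is illustrative only.

```python
import json


def substitute_placeholders(text: str, values: dict[str, str]) -> str:
    """Replace literal {NAME} tokens with their values.

    str.format is avoided on purpose: JSON is full of braces.
    """
    for name, value in values.items():
        text = text.replace("{" + name + "}", value)
    return text


def matches_partially(actual, expected) -> bool:
    """True when every key/value in expected is present in actual."""
    if isinstance(expected, dict):
        return isinstance(actual, dict) and all(
            key in actual and matches_partially(actual[key], value)
            for key, value in expected.items()
        )
    return actual == expected


raw_expected = '{"model": "{MODEL}"}'
actual = {"model": "gpt-4o-mini", "provider": "openai"}

# Substitute first, then parse: the expected value now carries the real model.
expected = json.loads(substitute_placeholders(raw_expected, {"MODEL": "gpt-4o-mini"}))
```

Parsing `raw_expected` directly still succeeds, because `{MODEL}` is a valid JSON string, so the failure would only show up later as a confusing value mismatch.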
anik120
left a comment
PS: squashing commits so that a PR has a single commit is the hygienic thing to do (unless multiple commits are by design, in which case the question becomes "why aren't they separate PRs instead?").
Otherwise they show up as
"fix"
"fix"
"address code rabbit"
when someone is searching through git history trying to figure out what changes were made.
Here's an article I highly recommend reading: https://medium.com/@madhav2002/git-hygiene-commits-branching-and-rewriting-history-bc6dee5f953f
refined E2E tests for skills and added necessary step implementations. close: LCORE-2080
radofuchs
left a comment
LGTM overall, just a few details.
| actual_value = response_body[field] | ||
| if not context.text: |
| @then("The response is the last streamed fragment") | ||
| def response_is_last_streamed_fragment(context: Context) -> None: |
This logic is already in the "wait for response to be completed" step. If you need use_streaming_response_data, set it there.
|
Just a conceptual question: Is the skill invocation really so strict that when you prompt to run a non-existing skill, the LLM really tries to execute it and ends up with failure? |
|
@asimurka when u prompt the LLM to use a skill, if u are direct enough, it will try to use the INPUT OUTPUT Does that answer your question? |
|
Is it possible that this is just model-specific behavior? I think you shouldn't be able to influence the model's behavior like this (with a bare prompt).
Description
Added the missing E2E steps for testing agent skills.