Agent skill extraction can dead-letter when SkillClusterUpdated fires before agent_case is visible in LanceDB #275

@rendigua2025-gif

Summary

extract_agent_skill can enter dead_letter when SkillClusterUpdated is emitted before the newly written agent_case is visible through LanceDB.

In an isolated Linux/Docker run with a synthetic agent trajectory, EverOS successfully generated and indexed an agent_case. The downstream extract_agent_skill strategy, however, ran before the case was visible in LanceDB, and every retry failed with _CaseNotYetIndexedError until the retry budget was exhausted.

Why this matters

Agent memory is one of EverOS's differentiating capabilities. In this case:

  • extract_agent_case succeeded.
  • trigger_skill_clustering succeeded.
  • agent_case Markdown existed.
  • agent_case was later visible in LanceDB.
  • /search could retrieve the agent_case.
  • But extract_agent_skill still ended in dead_letter because it checked LanceDB before the case was visible.

That means the case-to-skill chain can fail even though the case itself is valid and eventually searchable.

Reproduction shape

Synthetic agent trajectory:

  1. User reports a focused test failure.
  2. Assistant/agent runs a focused test.
  3. Tool returns a trace.
  4. Assistant identifies root cause.
  5. Assistant patches the minimal code path.
  6. Tool reports focused test pass.
  7. Assistant runs the broader test module.
  8. Assistant summarizes the reusable debugging lesson.

The sample was sufficient to generate an agent_case with quality score 0.98.

Observed evidence

agent_case result:

agent_case: 1
agent_skill: 0
agent_case search: keyword/vector/hybrid all retrieved the case

OME records:

extract_agent_case: success
trigger_skill_clustering: success
extract_agent_skill: dead_letter

Failure message:

_CaseNotYetIndexedError: AgentCase entry_id=ac_20260609_00000001 not in LanceDB yet; retrying

Relevant code locations:

src/everos/memory/strategies/trigger_skill_clustering.py
src/everos/memory/strategies/extract_agent_skill.py

trigger_skill_clustering emits SkillClusterUpdated, and extract_agent_skill then calls _load_target_case, which raises _CaseNotYetIndexedError if the case is not yet visible in LanceDB.

Expected behavior

One of the following would make the chain more reliable:

  • Ensure SkillClusterUpdated is emitted only after agent_case is indexed and visible.
  • Let extract_agent_skill load the target case from Markdown or the event payload when LanceDB is not yet caught up.
  • Increase or rework retry scheduling so that the retry window actually spans the time LanceDB needs to make the case visible.
  • Treat _CaseNotYetIndexedError as a delayed dependency rather than a normal retry that can quickly exhaust into dead_letter.

Environment

EverOS: 1.0.0 source checkout
Runtime: Docker Linux runtime
Python: 3.12
Data: synthetic agent trajectory only

No real user memory or secrets were used.
