Skip to content

<feature>[zbs]: vhost self-heal + host env prep + review fixes#4342

Open
ZStack-Robot wants to merge 2 commits into
feature-zbs-vhostfrom
sync/jin.ma/fix/ZSTAC-85515-zbsadm-vhost@@2
Open

<feature>[zbs]: vhost self-heal + host env prep + review fixes#4342
ZStack-Robot wants to merge 2 commits into
feature-zbs-vhostfrom
sync/jin.ma/fix/ZSTAC-85515-zbsadm-vhost@@2

Conversation

@ZStack-Robot

Copy link
Copy Markdown
Collaborator

跟进已合并的 zbsadm vhost MR,本分支 2 个新 commit。

commit A — review 修复 (903cb56)

  • 删死代码 DESTROY_VHOST path/cmd、VHOST_CONTROLLER_NAME_PREFIX
  • VhostHostCmd.sshPassword 加 @nologging 屏蔽
  • vhost activate 复用 ZbsHelper.getPoolFromVolumePath
  • 聚合为 Connected 时把 host status reason 置 null
  • applyHostStatus -> aggregateAndApplyHostStatus
  • 保留 targetHealthy 变量名;cannot found -> cannot find

commit B — self-heal + 主机环境准备 (a70d340)

  • 周期 ping 检到 vhost target down 时 best-effort 重部署,失败窗口翻倍 backoff(仿 kvm ping-skip)
  • PREPARE_VHOST_TARGET_ENV flow:deploy 前调计算节点 kvmagent 确保 hugepages + docker(zbsadm 只检查不安装)
  • ZbsVhostVolumeCase 覆盖 backoff + prepare 步骤

配套

zstack-utility 同名分支 @@2 新 MR(kvmagent prepare_vhost_target_env + ensure_docker + 2G hugepage)。

测试

  • 全量编译 / kvmagent 单测 16 / groovy test-compile
  • IT ZbsVhostVolumeCase 待补跑(CI 8989 串行占用)

已知限制(非本 MR、ZBS 侧)

真机 vhost VM 启动还依赖 ZBS 侧 vhost 镜像与 zbsadm 版本对齐 + 计算节点 registry/DNS onboarding;大页/docker 本 MR 已自动处理。

Resolves: ZSTAC-85515

sync from gitlab !10303

Address round-1 review on the zbsadm vhost MR:
- drop dead DESTROY_VHOST path/cmd and VHOST_CONTROLLER_NAME_PREFIX
- mask host SSH password (@nologging) on VhostHostCmd
- reuse ZbsHelper.getPoolFromVolumePath in vhost activate
- null the host status reason when the aggregate is Connected
- rename applyHostStatus -> aggregateAndApplyHostStatus
- keep targetHealthy var name; fix "cannot found" -> "cannot find"

Resolves: ZSTAC-85515

Change-Id: Ifd059b63893fdaf343b7222dd45859a7b4b0bd00
@MatheMatrix MatheMatrix force-pushed the sync/jin.ma/fix/ZSTAC-85515-zbsadm-vhost@@2 branch from a70d340 to fc08d45 Compare June 26, 2026 09:46
@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown

Review Change Stack

Walkthrough

新增了外部主存储协议恢复回退逻辑,ZBS vhost 增加目标环境预处理与健康上报调整,并同步修改主机状态聚合的原因处理和相关集成测试。

Changes

主存储协议恢复与 vhost 流程

Layer / File(s) Summary
外部主存储协议恢复
plugin/externalStorage/src/main/java/org/zstack/externalStorage/primary/kvm/ExternalPrimaryStorageKvmFactory.java
checkHostStatus 成功路径在协议状态更新后触发 recoverDisconnectedProtocols(...),并新增按失败次数与跳过窗口控制的恢复状态。
vhost 部署与健康上报
plugin/zbs/src/main/java/org/zstack/storage/zbs/ZbsStorageController.java
新增 PREPARE_VHOST_TARGET_ENV_PATHPrepareVhostTargetEnvCmd 和 best-effort 预处理 flow,调整激活/停用路径解析、健康上报逻辑,并给 sshPassword 加上 @NoLogging
主机状态原因处理
storage/src/main/java/org/zstack/storage/addon/primary/ExternalPrimaryStorage.java
主机状态更新改为 aggregateAndApplyHostStatus,聚合为 Connected 时清空 reason
vhost 恢复测试
test/src/test/groovy/org/zstack/test/integration/storage/primary/addon/zbs/ZbsVhostVolumeCase.groovy
补充 prepare flow 模拟、down 期间 redeploy 记录、回退断言和 testProtocolRecoveryBackoff()

Sequence Diagram(s)

sequenceDiagram
  participant ZbsStorageController as ZbsStorageController
  participant KVMHostAsyncHttpCallMsg as KVMHostAsyncHttpCallMsg
  participant KVMHost as KVM 主机
  ZbsStorageController->>KVMHostAsyncHttpCallMsg: 发送 PREPARE_VHOST_TARGET_ENV_PATH 请求
  KVMHostAsyncHttpCallMsg->>KVMHost: PrepareVhostTargetEnvCmd
  KVMHost-->>KVMHostAsyncHttpCallMsg: AgentResponse
  KVMHostAsyncHttpCallMsg-->>ZbsStorageController: 告警后继续 deploy-vhost-target
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 我蹦进云里看一眼,
协议断线也能再连。
vhost 先暖好小窝,
失败回退慢慢转圈。
啃一口胡萝卜,继续向前。

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 14.29% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed 标题采用了 feature[zbs] 形式,且准确概括了 vhost 自愈、主机环境准备和 review 修复。
Description check ✅ Passed 描述与变更一致,涵盖了 vhost 自愈、环境预处理、回归修复和测试情况。
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch sync/jin.ma/fix/ZSTAC-85515-zbsadm-vhost@@2

Comment @coderabbitai help to get the list of available commands.

@MatheMatrix MatheMatrix force-pushed the sync/jin.ma/fix/ZSTAC-85515-zbsadm-vhost@@2 branch from fc08d45 to fbf00c5 Compare June 26, 2026 10:30
- recovery: the external primary storage health check (kvm ping
  cadence) re-runs deployClient for a host's disconnected protocols,
  on a per-(ps,host,protocol) failure-doubling backoff, so a down
  vhost target redeploys without a host reconnect. cadence and
  backoff are owned by the framework; the controller adds no new api
- new PREPARE_VHOST_TARGET_ENV flow runs the host kvmagent to
  ensure hugepages + docker before the zbsadm deploy
- ZbsVhostVolumeCase covers the recovery backoff and the prepare step

Resolves: ZSTAC-85515

Change-Id: Id11467eb56c9ef31601b6ea5577cc1a29e3bed2f
@MatheMatrix MatheMatrix force-pushed the sync/jin.ma/fix/ZSTAC-85515-zbsadm-vhost@@2 branch from fbf00c5 to 82c4882 Compare June 26, 2026 11:56

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@plugin/externalStorage/src/main/java/org/zstack/externalStorage/primary/kvm/ExternalPrimaryStorageKvmFactory.java`:
- Around line 331-339: Update dueForRecovery so all ProtocolRecovery state
changes for a given key happen inside protocolRecoveryByKey.compute, instead of
mutating the shared ProtocolRecovery returned by computeIfAbsent. In
ExternalPrimaryStorageKvmFactory.dueForRecovery, atomically read/update skip and
failures together to avoid lost updates or duplicate recovery triggers under
concurrent pings. Also cap the failures value before calculating 1 << failures,
and keep the resulting skip bounded by PROTOCOL_RECOVERY_MAX_SKIP to prevent
overflow and preserve backoff behavior.

In
`@test/src/test/groovy/org/zstack/test/integration/storage/primary/addon/zbs/ZbsVhostVolumeCase.groovy`:
- Around line 561-566: The redeploy tracking in the Zbs vhost test is being
cleared after the storage target is marked down, which can erase a concurrent
ping-triggered redeploy recorded by the DEPLOY_VHOST_PATH simulator. Update the
test logic around vhostTargetHealthy and redeployedWhileDown so the reset
happens before switching to the down state, and only set redeployedWhileDown
inside the deploy simulator when the target is actually down; use the existing
DEPLOY_VHOST_PATH handler and the vhostTargetHealthy/redeployedWhileDown flags
to keep the window race-free.
- Around line 779-788: The recovery check in ZbsVhostVolumeCase only verifies
that Vhost returns to Connected, but it does not prove the backoff entry was
cleared. In the existing recovery block around the vhostDown/retryInSecs
assertion, add a follow-up down cycle that resets the same host state after
reconnection, then assert redeploy happens again within a short window rather
than reusing an old large backoff; use the existing vhostDown, retryInSecs, and
ExternalPrimaryStorageHostProtocolRefVO/PrimaryStorageHostStatus checks to
locate and extend this test.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

Run ID: 21a023ca-68a2-467d-bd7a-8f230703aaef

📥 Commits

Reviewing files that changed from the base of the PR and between 6613666 and 82c4882.

📒 Files selected for processing (5)
  • plugin/externalStorage/src/main/java/org/zstack/externalStorage/primary/kvm/ExternalPrimaryStorageKvmFactory.java
  • plugin/zbs/src/main/java/org/zstack/storage/zbs/ZbsConstants.java
  • plugin/zbs/src/main/java/org/zstack/storage/zbs/ZbsStorageController.java
  • storage/src/main/java/org/zstack/storage/addon/primary/ExternalPrimaryStorage.java
  • test/src/test/groovy/org/zstack/test/integration/storage/primary/addon/zbs/ZbsVhostVolumeCase.groovy
💤 Files with no reviewable changes (1)
  • plugin/zbs/src/main/java/org/zstack/storage/zbs/ZbsConstants.java

Comment on lines +331 to +339
private boolean dueForRecovery(String key) {
ProtocolRecovery state = protocolRecoveryByKey.computeIfAbsent(key, k -> new ProtocolRecovery());
if (state.skip > 0) {
state.skip--;
return false;
}
state.failures++;
state.skip = Math.min(1 << state.failures, PROTOCOL_RECOVERY_MAX_SKIP);
return true;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# 预期:输出 dueForRecovery 中的 computeIfAbsent、skip/failures 直接修改,以及 `1 << state.failures`。
fd -i 'ExternalPrimaryStorageKvmFactory.java' -x sed -n '274,341p' {}

Repository: MatheMatrix/zstack

Length of output: 3438


🏁 Script executed:

#!/bin/bash
set -e
file='plugin/externalStorage/src/main/java/org/zstack/externalStorage/primary/kvm/ExternalPrimaryStorageKvmFactory.java'

# Show surrounding class context and call sites of dueForRecovery / protocolRecoveryByKey
sed -n '1,220p' "$file" | nl -ba | sed -n '1,220p'
printf '\n--- CALL SITES ---\n'
rg -n 'dueForRecovery\(|protocolRecoveryByKey|recoverDisconnectedProtocols\(' "$file"

Repository: MatheMatrix/zstack

Length of output: 781


🏁 Script executed:

#!/bin/bash
set -e
file='plugin/externalStorage/src/main/java/org/zstack/externalStorage/primary/kvm/ExternalPrimaryStorageKvmFactory.java'

# Show surrounding class context and call sites of dueForRecovery / protocolRecoveryByKey
sed -n '1,220p' "$file" | nl -ba | sed -n '1,220p'
printf '\n--- CALL SITES ---\n'
rg -n 'dueForRecovery\(|protocolRecoveryByKey|recoverDisconnectedProtocols\(' "$file"

Repository: MatheMatrix/zstack

Length of output: 781


🏁 Script executed:

#!/bin/bash
set -e
file='plugin/externalStorage/src/main/java/org/zstack/externalStorage/primary/kvm/ExternalPrimaryStorageKvmFactory.java'

# Inspect the caller context around the only call site and the recovery state code
sed -n '210,340p' "$file"

Repository: MatheMatrix/zstack

Length of output: 6839


🏁 Script executed:

#!/bin/bash
set -e
file='plugin/externalStorage/src/main/java/org/zstack/externalStorage/primary/kvm/ExternalPrimaryStorageKvmFactory.java'

# Inspect the caller context around the only call site and the recovery state code
sed -n '210,340p' "$file"

Repository: MatheMatrix/zstack

Length of output: 6839


🏁 Script executed:

#!/bin/bash
set -e
file='plugin/externalStorage/src/main/java/org/zstack/externalStorage/primary/kvm/ExternalPrimaryStorageKvmFactory.java'

# Inspect the caller context around the only call site and the recovery state code
sed -n '210,340p' "$file"

Repository: MatheMatrix/zstack

Length of output: 6839


dueForRecovery 的恢复状态更新收敛到 ConcurrentHashMap.compute 里。
ProtocolRecovery 现在是共享可变对象,skip/failures 的读改写不原子;同一 key 并发 ping 时会丢更新或重复触发恢复。1 << state.failures 也要对失败次数封顶,避免长期失败后移位溢出让退避失效。

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@plugin/externalStorage/src/main/java/org/zstack/externalStorage/primary/kvm/ExternalPrimaryStorageKvmFactory.java`
around lines 331 - 339, Update dueForRecovery so all ProtocolRecovery state
changes for a given key happen inside protocolRecoveryByKey.compute, instead of
mutating the shared ProtocolRecovery returned by computeIfAbsent. In
ExternalPrimaryStorageKvmFactory.dueForRecovery, atomically read/update skip and
failures together to avoid lost updates or duplicate recovery triggers under
concurrent pings. Also cap the failures value before calculating 1 << failures,
and keep the resulting skip bounded by PROTOCOL_RECOVERY_MAX_SKIP to prevent
overflow and preserve backoff behavior.

Comment on lines 561 to +566
env.simulator(ZbsStorageController.DEPLOY_VHOST_PATH) { HttpEntity<String> e, EnvSpec spec ->
def cmd = JSONObjectUtil.toObject(e.body, ZbsStorageController.DeployVhostCmd.class)
assert cmd.hostIp != null : "deploy cmd missing target host IP for zbsadm SSH"
ensureCalled.set(true)
redeployedWhileDown.set(true)
return new ZbsStorageController.AgentResponse()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

先清零再切换到 down,避免丢掉并发 ping 触发的 redeploy。

当前 Line 644 先把 vhostTargetHealthy 置为 false,Line 645 再清 redeployedWhileDown;如果 ping 线程刚好在这两行之间触发 DEPLOY_VHOST_PATH,Line 565 写入的 true 会被清掉,后续断言可能误报失败。建议先 reset,再进入 down 窗口,并且只在目标确实 down 时记录。

建议调整
 env.simulator(ZbsStorageController.DEPLOY_VHOST_PATH) { HttpEntity<String> e, EnvSpec spec ->
     def cmd = JSONObjectUtil.toObject(e.body, ZbsStorageController.DeployVhostCmd.class)
     assert cmd.hostIp != null : "deploy cmd missing target host IP for zbsadm SSH"
     ensureCalled.set(true)
-    redeployedWhileDown.set(true)
+    if (!vhostTargetHealthy.get()) {
+        redeployedWhileDown.set(true)
+    }
     return new ZbsStorageController.AgentResponse()
 }
-rehostTargetHealthy.set(false)
-redeployedWhileDown.set(false)
+redeployedWhileDown.set(false)
+vhostTargetHealthy.set(false)

Also applies to: 644-645

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@test/src/test/groovy/org/zstack/test/integration/storage/primary/addon/zbs/ZbsVhostVolumeCase.groovy`
around lines 561 - 566, The redeploy tracking in the Zbs vhost test is being
cleared after the storage target is marked down, which can erase a concurrent
ping-triggered redeploy recorded by the DEPLOY_VHOST_PATH simulator. Update the
test logic around vhostTargetHealthy and redeployedWhileDown so the reset
happens before switching to the down state, and only set redeployedWhileDown
inside the deploy simulator when the target is actually down; use the existing
DEPLOY_VHOST_PATH handler and the vhostTargetHealthy/redeployedWhileDown flags
to keep the window race-free.

Comment on lines +779 to +788
// target recovers: health flips Vhost back to Connected and the backoff entry clears
vhostDown.set(false)
retryInSecs(30) {
assert Q.New(ExternalPrimaryStorageHostProtocolRefVO.class)
.eq(ExternalPrimaryStorageHostProtocolRefVO_.primaryStorageUuid, ps.uuid)
.eq(ExternalPrimaryStorageHostProtocolRefVO_.hostUuid, host.uuid)
.eq(ExternalPrimaryStorageHostProtocolRefVO_.protocol, VolumeProtocol.Vhost.toString())
.eq(ExternalPrimaryStorageHostProtocolRefVO_.status, PrimaryStorageHostStatus.Connected)
.isExists()
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

补一段重新 down 的断言来证明 backoff 状态已清理。

这里的注释说恢复后会清理 backoff entry,但当前只验证协议回到 Connected;如果实现遗漏清理,测试仍会通过,下一次 down 可能继续沿用旧的大窗口。建议在恢复为 Connected 后清空 redeploySeconds、再次置 down,并断言短时间内能重新触发 redeploy。

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@test/src/test/groovy/org/zstack/test/integration/storage/primary/addon/zbs/ZbsVhostVolumeCase.groovy`
around lines 779 - 788, The recovery check in ZbsVhostVolumeCase only verifies
that Vhost returns to Connected, but it does not prove the backoff entry was
cleared. In the existing recovery block around the vhostDown/retryInSecs
assertion, add a follow-up down cycle that resets the same host state after
reconnection, then assert redeploy happens again within a short window rather
than reusing an old large backoff; use the existing vhostDown, retryInSecs, and
ExternalPrimaryStorageHostProtocolRefVO/PrimaryStorageHostStatus checks to
locate and extend this test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants