<feature>[zbs]: vhost self-heal + host env prep + review fixes by ZStack-Robot · Pull Request #4342 · MatheMatrix/zstack

ZStack-Robot · 2026-06-26T09:22:30Z

跟进已合并的 zbsadm vhost MR，本分支 2 个新 commit。

commit A — review 修复 (`903cb56`)

删死代码 DESTROY_VHOST path/cmd、VHOST_CONTROLLER_NAME_PREFIX
VhostHostCmd.sshPassword 加 @nologging 屏蔽
vhost activate 复用 ZbsHelper.getPoolFromVolumePath
聚合为 Connected 时把 host status reason 置 null
applyHostStatus -> aggregateAndApplyHostStatus
保留 targetHealthy 变量名；cannot found -> cannot find

commit B — self-heal + 主机环境准备 (`a70d340`)

周期 ping 检到 vhost target down 时 best-effort 重部署，失败窗口翻倍 backoff（仿 kvm ping-skip）
新 PREPARE_VHOST_TARGET_ENV flow：deploy 前调计算节点 kvmagent 确保 hugepages + docker（zbsadm 只检查不安装）
ZbsVhostVolumeCase 覆盖 backoff + prepare 步骤

配套

zstack-utility 同名分支 @@2 新 MR（kvmagent prepare_vhost_target_env + ensure_docker + 2G hugepage）。

测试

全量编译 / kvmagent 单测 16 / groovy test-compile
IT ZbsVhostVolumeCase 待补跑（CI 8989 串行占用）

已知限制（非本 MR、ZBS 侧）

真机 vhost VM 启动还依赖 ZBS 侧 vhost 镜像与 zbsadm 版本对齐 + 计算节点 registry/DNS onboarding；大页/docker 本 MR 已自动处理。

Resolves: ZSTAC-85515

sync from gitlab !10303

@nologging

Address round-1 review on the zbsadm vhost MR: - drop dead DESTROY_VHOST path/cmd and VHOST_CONTROLLER_NAME_PREFIX - mask host SSH password (@nologging) on VhostHostCmd - reuse ZbsHelper.getPoolFromVolumePath in vhost activate - null the host status reason when the aggregate is Connected - rename applyHostStatus -> aggregateAndApplyHostStatus - keep targetHealthy var name; fix "cannot found" -> "cannot find" Resolves: ZSTAC-85515 Change-Id: Ifd059b63893fdaf343b7222dd45859a7b4b0bd00

coderabbitai · 2026-06-26T09:46:20Z

Walkthrough

新增了外部主存储协议恢复回退逻辑，ZBS vhost 增加目标环境预处理与健康上报调整，并同步修改主机状态聚合的原因处理和相关集成测试。

Changes

主存储协议恢复与 vhost 流程

Layer / File(s)	Summary
外部主存储协议恢复 `plugin/externalStorage/src/main/java/org/zstack/externalStorage/primary/kvm/ExternalPrimaryStorageKvmFactory.java`	`checkHostStatus` 成功路径在协议状态更新后触发 `recoverDisconnectedProtocols(...)`，并新增按失败次数与跳过窗口控制的恢复状态。
vhost 部署与健康上报 `plugin/zbs/src/main/java/org/zstack/storage/zbs/ZbsStorageController.java`	新增 `PREPARE_VHOST_TARGET_ENV_PATH`、`PrepareVhostTargetEnvCmd` 和 best-effort 预处理 flow，调整激活/停用路径解析、健康上报逻辑，并给 `sshPassword` 加上 `@NoLogging`。
主机状态原因处理 `storage/src/main/java/org/zstack/storage/addon/primary/ExternalPrimaryStorage.java`	主机状态更新改为 `aggregateAndApplyHostStatus`，聚合为 `Connected` 时清空 `reason`。
vhost 恢复测试 `test/src/test/groovy/org/zstack/test/integration/storage/primary/addon/zbs/ZbsVhostVolumeCase.groovy`	补充 prepare flow 模拟、down 期间 redeploy 记录、回退断言和 `testProtocolRecoveryBackoff()`。

Sequence Diagram(s)

sequenceDiagram
  participant ZbsStorageController as ZbsStorageController
  participant KVMHostAsyncHttpCallMsg as KVMHostAsyncHttpCallMsg
  participant KVMHost as KVM 主机
  ZbsStorageController->>KVMHostAsyncHttpCallMsg: 发送 PREPARE_VHOST_TARGET_ENV_PATH 请求
  KVMHostAsyncHttpCallMsg->>KVMHost: PrepareVhostTargetEnvCmd
  KVMHost-->>KVMHostAsyncHttpCallMsg: AgentResponse
  KVMHostAsyncHttpCallMsg-->>ZbsStorageController: 告警后继续 deploy-vhost-target

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 我蹦进云里看一眼，
协议断线也能再连。
vhost 先暖好小窝，
失败回退慢慢转圈。
啃一口胡萝卜，继续向前。

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 14.29% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	标题采用了 feature[zbs] 形式，且准确概括了 vhost 自愈、主机环境准备和 review 修复。
Description check	✅ Passed	描述与变更一致，涵盖了 vhost 自愈、环境预处理、回归修复和测试情况。
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch sync/jin.ma/fix/ZSTAC-85515-zbsadm-vhost@@2

_{Comment @coderabbitai help to get the list of available commands.}

- recovery: the external primary storage health check (kvm ping cadence) re-runs deployClient for a host's disconnected protocols, on a per-(ps,host,protocol) failure-doubling backoff, so a down vhost target redeploys without a host reconnect. cadence and backoff are owned by the framework; the controller adds no new api - new PREPARE_VHOST_TARGET_ENV flow runs the host kvmagent to ensure hugepages + docker before the zbsadm deploy - ZbsVhostVolumeCase covers the recovery backoff and the prepare step Resolves: ZSTAC-85515 Change-Id: Id11467eb56c9ef31601b6ea5577cc1a29e3bed2f

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@plugin/externalStorage/src/main/java/org/zstack/externalStorage/primary/kvm/ExternalPrimaryStorageKvmFactory.java`:
- Around line 331-339: Update dueForRecovery so all ProtocolRecovery state
changes for a given key happen inside protocolRecoveryByKey.compute, instead of
mutating the shared ProtocolRecovery returned by computeIfAbsent. In
ExternalPrimaryStorageKvmFactory.dueForRecovery, atomically read/update skip and
failures together to avoid lost updates or duplicate recovery triggers under
concurrent pings. Also cap the failures value before calculating 1 << failures,
and keep the resulting skip bounded by PROTOCOL_RECOVERY_MAX_SKIP to prevent
overflow and preserve backoff behavior.

In
`@test/src/test/groovy/org/zstack/test/integration/storage/primary/addon/zbs/ZbsVhostVolumeCase.groovy`:
- Around line 561-566: The redeploy tracking in the Zbs vhost test is being
cleared after the storage target is marked down, which can erase a concurrent
ping-triggered redeploy recorded by the DEPLOY_VHOST_PATH simulator. Update the
test logic around vhostTargetHealthy and redeployedWhileDown so the reset
happens before switching to the down state, and only set redeployedWhileDown
inside the deploy simulator when the target is actually down; use the existing
DEPLOY_VHOST_PATH handler and the vhostTargetHealthy/redeployedWhileDown flags
to keep the window race-free.
- Around line 779-788: The recovery check in ZbsVhostVolumeCase only verifies
that Vhost returns to Connected, but it does not prove the backoff entry was
cleared. In the existing recovery block around the vhostDown/retryInSecs
assertion, add a follow-up down cycle that resets the same host state after
reconnection, then assert redeploy happens again within a short window rather
than reusing an old large backoff; use the existing vhostDown, retryInSecs, and
ExternalPrimaryStorageHostProtocolRefVO/PrimaryStorageHostStatus checks to
locate and extend this test.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

Run ID: 21a023ca-68a2-467d-bd7a-8f230703aaef

📥 Commits

Reviewing files that changed from the base of the PR and between 6613666 and 82c4882.

📒 Files selected for processing (5)

plugin/externalStorage/src/main/java/org/zstack/externalStorage/primary/kvm/ExternalPrimaryStorageKvmFactory.java
plugin/zbs/src/main/java/org/zstack/storage/zbs/ZbsConstants.java
plugin/zbs/src/main/java/org/zstack/storage/zbs/ZbsStorageController.java
storage/src/main/java/org/zstack/storage/addon/primary/ExternalPrimaryStorage.java
test/src/test/groovy/org/zstack/test/integration/storage/primary/addon/zbs/ZbsVhostVolumeCase.groovy

💤 Files with no reviewable changes (1)

plugin/zbs/src/main/java/org/zstack/storage/zbs/ZbsConstants.java

coderabbitai · 2026-06-26T12:06:13Z

+    private boolean dueForRecovery(String key) {
+        ProtocolRecovery state = protocolRecoveryByKey.computeIfAbsent(key, k -> new ProtocolRecovery());
+        if (state.skip > 0) {
+            state.skip--;
+            return false;
+        }
+        state.failures++;
+        state.skip = Math.min(1 << state.failures, PROTOCOL_RECOVERY_MAX_SKIP);
+        return true;


🩺 Stability & Availability | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # 预期：输出 dueForRecovery 中的 computeIfAbsent、skip/failures 直接修改，以及 `1 << state.failures`。 fd -i 'ExternalPrimaryStorageKvmFactory.java' -x sed -n '274,341p' {}

Repository: MatheMatrix/zstack

Length of output: 3438

🏁 Script executed:

#!/bin/bash set -e file='plugin/externalStorage/src/main/java/org/zstack/externalStorage/primary/kvm/ExternalPrimaryStorageKvmFactory.java' # Show surrounding class context and call sites of dueForRecovery / protocolRecoveryByKey sed -n '1,220p' "$file" | nl -ba | sed -n '1,220p' printf '\n--- CALL SITES ---\n' rg -n 'dueForRecovery\(|protocolRecoveryByKey|recoverDisconnectedProtocols\(' "$file"

Repository: MatheMatrix/zstack

Length of output: 781

🏁 Script executed:

#!/bin/bash set -e file='plugin/externalStorage/src/main/java/org/zstack/externalStorage/primary/kvm/ExternalPrimaryStorageKvmFactory.java' # Show surrounding class context and call sites of dueForRecovery / protocolRecoveryByKey sed -n '1,220p' "$file" | nl -ba | sed -n '1,220p' printf '\n--- CALL SITES ---\n' rg -n 'dueForRecovery\(|protocolRecoveryByKey|recoverDisconnectedProtocols\(' "$file"

Repository: MatheMatrix/zstack

Length of output: 781

🏁 Script executed:

#!/bin/bash set -e file='plugin/externalStorage/src/main/java/org/zstack/externalStorage/primary/kvm/ExternalPrimaryStorageKvmFactory.java' # Inspect the caller context around the only call site and the recovery state code sed -n '210,340p' "$file"

Repository: MatheMatrix/zstack

Length of output: 6839

🏁 Script executed:

#!/bin/bash set -e file='plugin/externalStorage/src/main/java/org/zstack/externalStorage/primary/kvm/ExternalPrimaryStorageKvmFactory.java' # Inspect the caller context around the only call site and the recovery state code sed -n '210,340p' "$file"

Repository: MatheMatrix/zstack

Length of output: 6839

🏁 Script executed:

#!/bin/bash set -e file='plugin/externalStorage/src/main/java/org/zstack/externalStorage/primary/kvm/ExternalPrimaryStorageKvmFactory.java' # Inspect the caller context around the only call site and the recovery state code sed -n '210,340p' "$file"

Repository: MatheMatrix/zstack

Length of output: 6839

把 dueForRecovery 的恢复状态更新收敛到 ConcurrentHashMap.compute 里。
ProtocolRecovery 现在是共享可变对象，skip/failures 的读改写不原子；同一 key 并发 ping 时会丢更新或重复触发恢复。1 << state.failures 也要对失败次数封顶，避免长期失败后移位溢出让退避失效。

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugin/externalStorage/src/main/java/org/zstack/externalStorage/primary/kvm/ExternalPrimaryStorageKvmFactory.java` around lines 331 - 339, Update dueForRecovery so all ProtocolRecovery state changes for a given key happen inside protocolRecoveryByKey.compute, instead of mutating the shared ProtocolRecovery returned by computeIfAbsent. In ExternalPrimaryStorageKvmFactory.dueForRecovery, atomically read/update skip and failures together to avoid lost updates or duplicate recovery triggers under concurrent pings. Also cap the failures value before calculating 1 << failures, and keep the resulting skip bounded by PROTOCOL_RECOVERY_MAX_SKIP to prevent overflow and preserve backoff behavior.

coderabbitai · 2026-06-26T12:06:13Z

        env.simulator(ZbsStorageController.DEPLOY_VHOST_PATH) { HttpEntity<String> e, EnvSpec spec ->
            def cmd = JSONObjectUtil.toObject(e.body, ZbsStorageController.DeployVhostCmd.class)
            assert cmd.hostIp != null : "deploy cmd missing target host IP for zbsadm SSH"
            ensureCalled.set(true)
+            redeployedWhileDown.set(true)
+            return new ZbsStorageController.AgentResponse()


🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

先清零再切换到 down，避免丢掉并发 ping 触发的 redeploy。

当前 Line 644 先把 vhostTargetHealthy 置为 false，Line 645 再清 redeployedWhileDown；如果 ping 线程刚好在这两行之间触发 DEPLOY_VHOST_PATH，Line 565 写入的 true 会被清掉，后续断言可能误报失败。建议先 reset，再进入 down 窗口，并且只在目标确实 down 时记录。

建议调整

env.simulator(ZbsStorageController.DEPLOY_VHOST_PATH) { HttpEntity<String> e, EnvSpec spec -> def cmd = JSONObjectUtil.toObject(e.body, ZbsStorageController.DeployVhostCmd.class) assert cmd.hostIp != null : "deploy cmd missing target host IP for zbsadm SSH" ensureCalled.set(true) - redeployedWhileDown.set(true) + if (!vhostTargetHealthy.get()) { + redeployedWhileDown.set(true) + } return new ZbsStorageController.AgentResponse() }

-rehostTargetHealthy.set(false) -redeployedWhileDown.set(false) +redeployedWhileDown.set(false) +vhostTargetHealthy.set(false)

Also applies to: 644-645

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/src/test/groovy/org/zstack/test/integration/storage/primary/addon/zbs/ZbsVhostVolumeCase.groovy` around lines 561 - 566, The redeploy tracking in the Zbs vhost test is being cleared after the storage target is marked down, which can erase a concurrent ping-triggered redeploy recorded by the DEPLOY_VHOST_PATH simulator. Update the test logic around vhostTargetHealthy and redeployedWhileDown so the reset happens before switching to the down state, and only set redeployedWhileDown inside the deploy simulator when the target is actually down; use the existing DEPLOY_VHOST_PATH handler and the vhostTargetHealthy/redeployedWhileDown flags to keep the window race-free.

coderabbitai · 2026-06-26T12:06:13Z

+        // target recovers: health flips Vhost back to Connected and the backoff entry clears
+        vhostDown.set(false)
+        retryInSecs(30) {
+            assert Q.New(ExternalPrimaryStorageHostProtocolRefVO.class)
+                    .eq(ExternalPrimaryStorageHostProtocolRefVO_.primaryStorageUuid, ps.uuid)
+                    .eq(ExternalPrimaryStorageHostProtocolRefVO_.hostUuid, host.uuid)
+                    .eq(ExternalPrimaryStorageHostProtocolRefVO_.protocol, VolumeProtocol.Vhost.toString())
+                    .eq(ExternalPrimaryStorageHostProtocolRefVO_.status, PrimaryStorageHostStatus.Connected)
+                    .isExists()
+        }


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

补一段重新 down 的断言来证明 backoff 状态已清理。

这里的注释说恢复后会清理 backoff entry，但当前只验证协议回到 Connected；如果实现遗漏清理，测试仍会通过，下一次 down 可能继续沿用旧的大窗口。建议在恢复为 Connected 后清空 redeploySeconds、再次置 down，并断言短时间内能重新触发 redeploy。

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/src/test/groovy/org/zstack/test/integration/storage/primary/addon/zbs/ZbsVhostVolumeCase.groovy` around lines 779 - 788, The recovery check in ZbsVhostVolumeCase only verifies that Vhost returns to Connected, but it does not prove the backoff entry was cleared. In the existing recovery block around the vhostDown/retryInSecs assertion, add a follow-up down cycle that resets the same host state after reconnection, then assert redeploy happens again within a short window rather than reusing an old large backoff; use the existing vhostDown, retryInSecs, and ExternalPrimaryStorageHostProtocolRefVO/PrimaryStorageHostStatus checks to locate and extend this test.

MatheMatrix force-pushed the sync/jin.ma/fix/ZSTAC-85515-zbsadm-vhost@@2 branch from a70d340 to fc08d45 Compare June 26, 2026 09:46

MatheMatrix force-pushed the sync/jin.ma/fix/ZSTAC-85515-zbsadm-vhost@@2 branch from fc08d45 to fbf00c5 Compare June 26, 2026 10:30

MatheMatrix force-pushed the sync/jin.ma/fix/ZSTAC-85515-zbsadm-vhost@@2 branch from fbf00c5 to 82c4882 Compare June 26, 2026 11:56

coderabbitai Bot reviewed Jun 26, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

<feature>[zbs]: vhost self-heal + host env prep + review fixes#4342

<feature>[zbs]: vhost self-heal + host env prep + review fixes#4342
ZStack-Robot wants to merge 2 commits into
feature-zbs-vhostfrom
sync/jin.ma/fix/ZSTAC-85515-zbsadm-vhost@@2

ZStack-Robot commented Jun 26, 2026

Uh oh!

coderabbitai Bot commented Jun 26, 2026 •

edited

Loading

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 26, 2026

Uh oh!

coderabbitai Bot Jun 26, 2026

Uh oh!

coderabbitai Bot Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ZStack-Robot commented Jun 26, 2026

commit A — review 修复 (903cb56)

commit B — self-heal + 主机环境准备 (a70d340)

配套

测试

已知限制（非本 MR、ZBS 侧）

Uh oh!

coderabbitai Bot commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

commit A — review 修复 (`903cb56`)

commit B — self-heal + 主机环境准备 (`a70d340`)

coderabbitai Bot commented Jun 26, 2026 •

edited

Loading