<feature>[zbs]: vhost self-heal + host env prep + review fixes#4342
<feature>[zbs]: vhost self-heal + host env prep + review fixes#4342ZStack-Robot wants to merge 2 commits into
Conversation
Address round-1 review on the zbsadm vhost MR: - drop dead DESTROY_VHOST path/cmd and VHOST_CONTROLLER_NAME_PREFIX - mask host SSH password (@nologging) on VhostHostCmd - reuse ZbsHelper.getPoolFromVolumePath in vhost activate - null the host status reason when the aggregate is Connected - rename applyHostStatus -> aggregateAndApplyHostStatus - keep targetHealthy var name; fix "cannot found" -> "cannot find" Resolves: ZSTAC-85515 Change-Id: Ifd059b63893fdaf343b7222dd45859a7b4b0bd00
a70d340 to
fc08d45
Compare
Walkthrough新增了外部主存储协议恢复回退逻辑,ZBS vhost 增加目标环境预处理与健康上报调整,并同步修改主机状态聚合的原因处理和相关集成测试。 Changes主存储协议恢复与 vhost 流程
Sequence Diagram(s)sequenceDiagram
participant ZbsStorageController as ZbsStorageController
participant KVMHostAsyncHttpCallMsg as KVMHostAsyncHttpCallMsg
participant KVMHost as KVM 主机
ZbsStorageController->>KVMHostAsyncHttpCallMsg: 发送 PREPARE_VHOST_TARGET_ENV_PATH 请求
KVMHostAsyncHttpCallMsg->>KVMHost: PrepareVhostTargetEnvCmd
KVMHost-->>KVMHostAsyncHttpCallMsg: AgentResponse
KVMHostAsyncHttpCallMsg-->>ZbsStorageController: 告警后继续 deploy-vhost-target
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
fc08d45 to
fbf00c5
Compare
- recovery: the external primary storage health check (kvm ping cadence) re-runs deployClient for a host's disconnected protocols, on a per-(ps,host,protocol) failure-doubling backoff, so a down vhost target redeploys without a host reconnect. cadence and backoff are owned by the framework; the controller adds no new api - new PREPARE_VHOST_TARGET_ENV flow runs the host kvmagent to ensure hugepages + docker before the zbsadm deploy - ZbsVhostVolumeCase covers the recovery backoff and the prepare step Resolves: ZSTAC-85515 Change-Id: Id11467eb56c9ef31601b6ea5577cc1a29e3bed2f
fbf00c5 to
82c4882
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@plugin/externalStorage/src/main/java/org/zstack/externalStorage/primary/kvm/ExternalPrimaryStorageKvmFactory.java`:
- Around line 331-339: Update dueForRecovery so all ProtocolRecovery state
changes for a given key happen inside protocolRecoveryByKey.compute, instead of
mutating the shared ProtocolRecovery returned by computeIfAbsent. In
ExternalPrimaryStorageKvmFactory.dueForRecovery, atomically read/update skip and
failures together to avoid lost updates or duplicate recovery triggers under
concurrent pings. Also cap the failures value before calculating 1 << failures,
and keep the resulting skip bounded by PROTOCOL_RECOVERY_MAX_SKIP to prevent
overflow and preserve backoff behavior.
In
`@test/src/test/groovy/org/zstack/test/integration/storage/primary/addon/zbs/ZbsVhostVolumeCase.groovy`:
- Around line 561-566: The redeploy tracking in the Zbs vhost test is being
cleared after the storage target is marked down, which can erase a concurrent
ping-triggered redeploy recorded by the DEPLOY_VHOST_PATH simulator. Update the
test logic around vhostTargetHealthy and redeployedWhileDown so the reset
happens before switching to the down state, and only set redeployedWhileDown
inside the deploy simulator when the target is actually down; use the existing
DEPLOY_VHOST_PATH handler and the vhostTargetHealthy/redeployedWhileDown flags
to keep the window race-free.
- Around line 779-788: The recovery check in ZbsVhostVolumeCase only verifies
that Vhost returns to Connected, but it does not prove the backoff entry was
cleared. In the existing recovery block around the vhostDown/retryInSecs
assertion, add a follow-up down cycle that resets the same host state after
reconnection, then assert redeploy happens again within a short window rather
than reusing an old large backoff; use the existing vhostDown, retryInSecs, and
ExternalPrimaryStorageHostProtocolRefVO/PrimaryStorageHostStatus checks to
locate and extend this test.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)
Review profile: CHILL
Plan: Pro
Run ID: 21a023ca-68a2-467d-bd7a-8f230703aaef
📒 Files selected for processing (5)
plugin/externalStorage/src/main/java/org/zstack/externalStorage/primary/kvm/ExternalPrimaryStorageKvmFactory.javaplugin/zbs/src/main/java/org/zstack/storage/zbs/ZbsConstants.javaplugin/zbs/src/main/java/org/zstack/storage/zbs/ZbsStorageController.javastorage/src/main/java/org/zstack/storage/addon/primary/ExternalPrimaryStorage.javatest/src/test/groovy/org/zstack/test/integration/storage/primary/addon/zbs/ZbsVhostVolumeCase.groovy
💤 Files with no reviewable changes (1)
- plugin/zbs/src/main/java/org/zstack/storage/zbs/ZbsConstants.java
| private boolean dueForRecovery(String key) { | ||
| ProtocolRecovery state = protocolRecoveryByKey.computeIfAbsent(key, k -> new ProtocolRecovery()); | ||
| if (state.skip > 0) { | ||
| state.skip--; | ||
| return false; | ||
| } | ||
| state.failures++; | ||
| state.skip = Math.min(1 << state.failures, PROTOCOL_RECOVERY_MAX_SKIP); | ||
| return true; |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# 预期:输出 dueForRecovery 中的 computeIfAbsent、skip/failures 直接修改,以及 `1 << state.failures`。
fd -i 'ExternalPrimaryStorageKvmFactory.java' -x sed -n '274,341p' {}Repository: MatheMatrix/zstack
Length of output: 3438
🏁 Script executed:
#!/bin/bash
set -e
file='plugin/externalStorage/src/main/java/org/zstack/externalStorage/primary/kvm/ExternalPrimaryStorageKvmFactory.java'
# Show surrounding class context and call sites of dueForRecovery / protocolRecoveryByKey
sed -n '1,220p' "$file" | nl -ba | sed -n '1,220p'
printf '\n--- CALL SITES ---\n'
rg -n 'dueForRecovery\(|protocolRecoveryByKey|recoverDisconnectedProtocols\(' "$file"Repository: MatheMatrix/zstack
Length of output: 781
🏁 Script executed:
#!/bin/bash
set -e
file='plugin/externalStorage/src/main/java/org/zstack/externalStorage/primary/kvm/ExternalPrimaryStorageKvmFactory.java'
# Show surrounding class context and call sites of dueForRecovery / protocolRecoveryByKey
sed -n '1,220p' "$file" | nl -ba | sed -n '1,220p'
printf '\n--- CALL SITES ---\n'
rg -n 'dueForRecovery\(|protocolRecoveryByKey|recoverDisconnectedProtocols\(' "$file"Repository: MatheMatrix/zstack
Length of output: 781
🏁 Script executed:
#!/bin/bash
set -e
file='plugin/externalStorage/src/main/java/org/zstack/externalStorage/primary/kvm/ExternalPrimaryStorageKvmFactory.java'
# Inspect the caller context around the only call site and the recovery state code
sed -n '210,340p' "$file"Repository: MatheMatrix/zstack
Length of output: 6839
🏁 Script executed:
#!/bin/bash
set -e
file='plugin/externalStorage/src/main/java/org/zstack/externalStorage/primary/kvm/ExternalPrimaryStorageKvmFactory.java'
# Inspect the caller context around the only call site and the recovery state code
sed -n '210,340p' "$file"Repository: MatheMatrix/zstack
Length of output: 6839
🏁 Script executed:
#!/bin/bash
set -e
file='plugin/externalStorage/src/main/java/org/zstack/externalStorage/primary/kvm/ExternalPrimaryStorageKvmFactory.java'
# Inspect the caller context around the only call site and the recovery state code
sed -n '210,340p' "$file"Repository: MatheMatrix/zstack
Length of output: 6839
把 dueForRecovery 的恢复状态更新收敛到 ConcurrentHashMap.compute 里。
ProtocolRecovery 现在是共享可变对象,skip/failures 的读改写不原子;同一 key 并发 ping 时会丢更新或重复触发恢复。1 << state.failures 也要对失败次数封顶,避免长期失败后移位溢出让退避失效。
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@plugin/externalStorage/src/main/java/org/zstack/externalStorage/primary/kvm/ExternalPrimaryStorageKvmFactory.java`
around lines 331 - 339, Update dueForRecovery so all ProtocolRecovery state
changes for a given key happen inside protocolRecoveryByKey.compute, instead of
mutating the shared ProtocolRecovery returned by computeIfAbsent. In
ExternalPrimaryStorageKvmFactory.dueForRecovery, atomically read/update skip and
failures together to avoid lost updates or duplicate recovery triggers under
concurrent pings. Also cap the failures value before calculating 1 << failures,
and keep the resulting skip bounded by PROTOCOL_RECOVERY_MAX_SKIP to prevent
overflow and preserve backoff behavior.
| env.simulator(ZbsStorageController.DEPLOY_VHOST_PATH) { HttpEntity<String> e, EnvSpec spec -> | ||
| def cmd = JSONObjectUtil.toObject(e.body, ZbsStorageController.DeployVhostCmd.class) | ||
| assert cmd.hostIp != null : "deploy cmd missing target host IP for zbsadm SSH" | ||
| ensureCalled.set(true) | ||
| redeployedWhileDown.set(true) | ||
| return new ZbsStorageController.AgentResponse() |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win
先清零再切换到 down,避免丢掉并发 ping 触发的 redeploy。
当前 Line 644 先把 vhostTargetHealthy 置为 false,Line 645 再清 redeployedWhileDown;如果 ping 线程刚好在这两行之间触发 DEPLOY_VHOST_PATH,Line 565 写入的 true 会被清掉,后续断言可能误报失败。建议先 reset,再进入 down 窗口,并且只在目标确实 down 时记录。
建议调整
env.simulator(ZbsStorageController.DEPLOY_VHOST_PATH) { HttpEntity<String> e, EnvSpec spec ->
def cmd = JSONObjectUtil.toObject(e.body, ZbsStorageController.DeployVhostCmd.class)
assert cmd.hostIp != null : "deploy cmd missing target host IP for zbsadm SSH"
ensureCalled.set(true)
- redeployedWhileDown.set(true)
+ if (!vhostTargetHealthy.get()) {
+ redeployedWhileDown.set(true)
+ }
return new ZbsStorageController.AgentResponse()
}-rehostTargetHealthy.set(false)
-redeployedWhileDown.set(false)
+redeployedWhileDown.set(false)
+vhostTargetHealthy.set(false)Also applies to: 644-645
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@test/src/test/groovy/org/zstack/test/integration/storage/primary/addon/zbs/ZbsVhostVolumeCase.groovy`
around lines 561 - 566, The redeploy tracking in the Zbs vhost test is being
cleared after the storage target is marked down, which can erase a concurrent
ping-triggered redeploy recorded by the DEPLOY_VHOST_PATH simulator. Update the
test logic around vhostTargetHealthy and redeployedWhileDown so the reset
happens before switching to the down state, and only set redeployedWhileDown
inside the deploy simulator when the target is actually down; use the existing
DEPLOY_VHOST_PATH handler and the vhostTargetHealthy/redeployedWhileDown flags
to keep the window race-free.
| // target recovers: health flips Vhost back to Connected and the backoff entry clears | ||
| vhostDown.set(false) | ||
| retryInSecs(30) { | ||
| assert Q.New(ExternalPrimaryStorageHostProtocolRefVO.class) | ||
| .eq(ExternalPrimaryStorageHostProtocolRefVO_.primaryStorageUuid, ps.uuid) | ||
| .eq(ExternalPrimaryStorageHostProtocolRefVO_.hostUuid, host.uuid) | ||
| .eq(ExternalPrimaryStorageHostProtocolRefVO_.protocol, VolumeProtocol.Vhost.toString()) | ||
| .eq(ExternalPrimaryStorageHostProtocolRefVO_.status, PrimaryStorageHostStatus.Connected) | ||
| .isExists() | ||
| } |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
补一段重新 down 的断言来证明 backoff 状态已清理。
这里的注释说恢复后会清理 backoff entry,但当前只验证协议回到 Connected;如果实现遗漏清理,测试仍会通过,下一次 down 可能继续沿用旧的大窗口。建议在恢复为 Connected 后清空 redeploySeconds、再次置 down,并断言短时间内能重新触发 redeploy。
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@test/src/test/groovy/org/zstack/test/integration/storage/primary/addon/zbs/ZbsVhostVolumeCase.groovy`
around lines 779 - 788, The recovery check in ZbsVhostVolumeCase only verifies
that Vhost returns to Connected, but it does not prove the backoff entry was
cleared. In the existing recovery block around the vhostDown/retryInSecs
assertion, add a follow-up down cycle that resets the same host state after
reconnection, then assert redeploy happens again within a short window rather
than reusing an old large backoff; use the existing vhostDown, retryInSecs, and
ExternalPrimaryStorageHostProtocolRefVO/PrimaryStorageHostStatus checks to
locate and extend this test.
跟进已合并的 zbsadm vhost MR,本分支 2 个新 commit。
commit A — review 修复 (903cb56)
cannot found->cannot findcommit B — self-heal + 主机环境准备 (a70d340)
PREPARE_VHOST_TARGET_ENVflow:deploy 前调计算节点 kvmagent 确保 hugepages + docker(zbsadm 只检查不安装)配套
zstack-utility 同名分支
@@2新 MR(kvmagentprepare_vhost_target_env+ensure_docker+ 2G hugepage)。测试
ZbsVhostVolumeCase待补跑(CI 8989 串行占用)已知限制(非本 MR、ZBS 侧)
真机 vhost VM 启动还依赖 ZBS 侧 vhost 镜像与 zbsadm 版本对齐 + 计算节点 registry/DNS onboarding;大页/docker 本 MR 已自动处理。
Resolves: ZSTAC-85515
sync from gitlab !10303