Skip to content

fix[vm]: reject destroy VM when host is Disconnected and preserve original state#4097

Closed
zstack-robot-2 wants to merge 11 commits into
5.5.22from
sync/songtao.liu/fix/ZSTAC-84948
Closed

fix[vm]: reject destroy VM when host is Disconnected and preserve original state#4097
zstack-robot-2 wants to merge 11 commits into
5.5.22from
sync/songtao.liu/fix/ZSTAC-84948

Conversation

@zstack-robot-2

Copy link
Copy Markdown
Collaborator

Root Cause

Destroying a VM during host disconnect left the VM in Unknown state with no recovery path.

Fix

  • Reject destroy VM when host is Disconnected (fail fast)
  • Preserve original VM state on destroy failure instead of transitioning to Unknown

Resolves: ZSTAC-84948

sync from gitlab !9947

zstack and others added 4 commits May 21, 2026 20:57
DestroyVmOnHypervisorFlow now checks host status before sending destroy
command. If host is Disconnected, the flow fails immediately with
HOST_IS_DISCONNECTED error instead of silently creating a GC job and
returning success. This prevents VM stuck in Unknown state while API
reports success.

Resolves: ZSTAC-84948
…nknown

Resolves: ZSTAC-84948

Change-Id: Ib3df33d1a1cdd46877ac5570e3f393ae2b9fdab3
Correct err() call missing global error code (ORG_ZSTACK_COMPUTE_VM_10331).
Extract isHostDisconnected() for testability without Spring/AspectJ.
Add VmDestroyOnHypervisorFlowTest: 5 tests covering host status check
and stopped-VM skip logic.

Resolves: ZSTAC-84948

Change-Id: I947ace96c530f80eb6be37f2dab289822cdadd87
Replace dbf.findByUuid() with Q.New() for efficient column-level query.
Add en_US and zh_CN elaborations for ORG_ZSTACK_COMPUTE_VM_10331.
Make isHostDisconnected public for testability via Mockito.spy().

Resolves: ZSTAC-84948

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented May 26, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@MatheMatrix, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 16 minutes and 53 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

Run ID: a54287f6-58b2-478b-b5a3-02674116e853

📥 Commits

Reviewing files that changed from the base of the PR and between 53fa5dc and c887191.

⛔ Files ignored due to path filters (2)
  • conf/i18n/globalErrorCodeMapping/global-error-en_US.json is excluded by !**/*.json
  • conf/i18n/globalErrorCodeMapping/global-error-zh_CN.json is excluded by !**/*.json
📒 Files selected for processing (3)
  • compute/src/main/java/org/zstack/compute/vm/VmDestroyOnHypervisorFlow.java
  • compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java
  • test/src/test/groovy/org/zstack/test/integration/kvm/vm/VmDestroyOnDisconnectedHostCase.groovy

走查

本PR为VM销毁工作流增加主机连接状态前置检查机制,若主机已断连则直接拒绝销毁请求;同时改进销毁失败时的状态回滚,确保VM状态恢复准确。涵盖工作流逻辑、错误处理与完整单元测试。

变更内容

主机断连检查与工作流改进

层级 / 文件 说明
错误码与主机断连检测基础
utils/src/main/java/org/zstack/utils/clouderrorcode/CloudOperationsErrorCode.java, compute/src/main/java/org/zstack/compute/vm/VmDestroyOnHypervisorFlow.java
新增错误码常量 ORG_ZSTACK_COMPUTE_VM_10331;导入数据库查询工具 Q 及主机状态相关类 (HostVO, HostStatus, HostVO_);实现 isHostDisconnected(hostUuid) 方法,通过UUID查询主机状态并判断是否断连。
工作流中的主机连接检查集成
compute/src/main/java/org/zstack/compute/vm/VmDestroyOnHypervisorFlow.java
run(...) 方法中集成断连检查,若目标主机已断连则调用 chain.fail(...) 返回错误,阻止后续销毁消息发送。
销毁失败时的状态回滚改进
compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java
改进 destroy(...) 方法的失败回滚逻辑,使用 SQLBatch 重新加载VM实体并将 statehostUuidlastHostUuid 恢复至原始快照,而非统一置为 unknown
单元测试覆盖
test/src/test/java/org/zstack/test/compute/vm/VmDestroyOnHypervisorFlowTest.java
新增测试类覆盖三个关键分支:Stopped状态VM直接跳过销毁、断连主机拒绝销毁请求、正常连接主机允许继续处理;通过Mockito mock依赖和反射注入实现测试隔离。

🎯 3 (Moderate) | ⏱️ ~25 分钟

🐰 小兔说:断连主机莫销毁,
检查状态早预防,
状态回滚更精准,
三方分支皆覆盖。
工作流稳妥,
错误处理无遗漏!


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)

Check name Status Explanation Resolution
Title check ❌ Error PR 标题长度为 80 个字符,超过了要求的 72 个字符限制。 将标题缩短至 72 个字符以内,例如:'fix[vm]: reject destroy when host disconnected, preserve state'
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Description check ✅ Passed PR 描述清楚地说明了根本原因、修复方案以及相关的任务号,与代码变更高度相关。
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch sync/songtao.liu/fix/ZSTAC-84948

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java`:
- Around line 239-248: The SQLBatch block directly mutates and merges the VM
(findByUuid/ setState/ setHostUuid/ setLastHostUuid/ merge) and thereby bypasses
the proper event/extension chain; replace that direct DB write with a call to
the VM state-change helper so listeners and extensions are notified: inside the
SQLBatch (to preserve transactional semantics) invoke changeVmStateInDb(...) (or
the VM state-change method used elsewhere in VmInstanceBase) to restore the
state/host/lastHost from originalCopy for self.getUuid(), and remove the manual
setState/setHostUuid/setLastHostUuid/merge sequence so the state transition and
extension callbacks are emitted.

In
`@test/src/test/java/org/zstack/test/compute/vm/VmDestroyOnHypervisorFlowTest.java`:
- Around line 143-152: In proceedsNormallyWhenHostConnected in
VmDestroyOnHypervisorFlowTest replace the broad catch(Throwable e) around
flow.run(trigger, data) with a targeted catch for the expected NoSuchMethodError
(or the exact AspectJ weaving error) and in that catch allow the comment/ignore;
for any other Exception or Error rethrow or call Assert.fail(...) with
e.toString() so real failures are not swallowed; additionally assert that
wentNext (and/or failError.get() == null) after flow.run to ensure the flow
actually proceeded; reference flow.run(trigger, data), failError and wentNext in
your changes.
- Around line 93-116: The test currently relies on catching a
NullPointerException and has an empty FlowTrigger.fail()/setError(), which fails
to verify the disconnected-host branch; modify the anonymous FlowTrigger used in
rejectsDestroyWhenHostDisconnected to store the ErrorCode into an
AtomicReference<ErrorCode> (implement both fail(ErrorCode) and
setError(ErrorCode) to set it) and remove using NPE as the pass condition; after
flow.run(...) assert that the recorded ErrorCode.getGlobalErrorCode() equals
ORG_ZSTACK_COMPUTE_VM_10331 and that
ErrorCode.isError(HostErrors.HOST_IS_DISCONNECTED) is true, and still assert
wentNext.get() is false to ensure next() was not called.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

Run ID: 2d21bf0b-5a91-4f3c-831d-8139b6a84ca3

📥 Commits

Reviewing files that changed from the base of the PR and between 96cad7e and 53fa5dc.

⛔ Files ignored due to path filters (2)
  • conf/i18n/globalErrorCodeMapping/global-error-en_US.json is excluded by !**/*.json
  • conf/i18n/globalErrorCodeMapping/global-error-zh_CN.json is excluded by !**/*.json
📒 Files selected for processing (4)
  • compute/src/main/java/org/zstack/compute/vm/VmDestroyOnHypervisorFlow.java
  • compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java
  • test/src/test/java/org/zstack/test/compute/vm/VmDestroyOnHypervisorFlowTest.java
  • utils/src/main/java/org/zstack/utils/clouderrorcode/CloudOperationsErrorCode.java

Comment thread compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java Outdated
Comment on lines +93 to +116
FlowTrigger trigger = new FlowTrigger() {
@Override
public void fail(ErrorCode errorCode) {
}

@Override
public void next() {
wentNext.set(true);
}

@Override
public void setError(ErrorCode errorCode) {
}
};

try {
flow.run(trigger, data);
} catch (NullPointerException e) {
// expected — Platform.err() requires Spring context in unit test,
// but the NPE proves we passed the isHostDisconnected check
}

Assert.assertFalse("should not go next when host Disconnected", wentNext.get());
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="test/src/test/java/org/zstack/test/compute/vm/VmDestroyOnHypervisorFlowTest.java"
echo "== File: $FILE =="
ls -l "$FILE"
echo

# Show lines around 70-140 with numbers
nl -ba "$FILE" | sed -n '60,160p'

# Also locate any references to HOST_IS_DISCONNECTED or ORG_ZSTACK_COMPUTE_VM_10331 in this file
rg -n "HOST_IS_DISCONNECTED|ORG_ZSTACK_COMPUTE_VM_10331|Disconnected|host Disconnected|fail\\(" "$FILE" || true

Repository: MatheMatrix/zstack

Length of output: 348


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="test/src/test/java/org/zstack/test/compute/vm/VmDestroyOnHypervisorFlowTest.java"
echo "== File: $FILE =="
ls -l "$FILE"
echo

echo "== Lines 60-180 (numbered via cat -n) =="
cat -n "$FILE" | sed -n '60,180p'

echo
echo "== Search patterns =="
rg -n "HOST_IS_DISCONNECTED|ORG_ZSTACK_COMPUTE_VM_10331|host Disconnected|fail\\(|Disconnected" "$FILE" || true

Repository: MatheMatrix/zstack

Length of output: 5030


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n "HOST_IS_DISCONNECTED|ORG_ZSTACK_COMPUTE_VM_10331" . || true

Repository: MatheMatrix/zstack

Length of output: 3170


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="compute/src/main/java/org/zstack/compute/vm/VmDestroyOnHypervisorFlow.java"
echo "== File: $FILE =="
ls -l "$FILE"
echo

# Print around the referenced lines
cat -n "$FILE" | sed -n '1,200p'

# Narrow to around the chain.fail usage we saw in rg
rg -n "ORG_ZSTACK_COMPUTE_VM_10331|isHostDisconnected|chain\\.fail\\(" "$FILE"

Repository: MatheMatrix/zstack

Length of output: 8347


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n "isError\\(HostErrors\\.HOST_IS_DISCONNECTED\\)|isError\\(HostErrors\\.|HOST_IS_DISCONNECTED\\)\\." -S . || true

Repository: MatheMatrix/zstack

Length of output: 2475


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find Platform.err implementation
rg -n "static .* err\\(" -S "core/src/main/java|header/src/main/java|utils/src/main/java|compute/src/main/java|.*" . || true

# Better: search specifically for "class Platform" and then err method
rg -n "class Platform" . || true
rg -n "Platform\\.err\\(" . || true

# Locate Platform.java
FILE=$(fd -a "Platform.java" . | head -n 20 | tr '\n' ' ')
echo "Possible Platform.java files: $FILE"

Repository: MatheMatrix/zstack

Length of output: 9092


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect Platform.err implementation
PLATFORM="core/src/main/java/org/zstack/core/Platform.java"
echo "== Platform.err snippet =="
sed -n '930,1020p' "$PLATFORM" | cat -n

echo
echo "== Locate FlowTrigger, NoRollbackFlow =="
rg -n "interface FlowTrigger|class NoRollbackFlow|void fail\\(" core/src/main/java header/src/main/java compute/src/main/java test/src/test/java  || true

echo
echo "== Read FlowTrigger definition =="
FLOWTRIGGER=$(fd -a "FlowTrigger.java" . | head -n 20 | head -n 1)
echo "FlowTrigger file: $FLOWTRIGGER"
sed -n '1,200p' "$FLOWTRIGGER" | cat -n

echo
echo "== Read NoRollbackFlow definition =="
NOROLLBACK=$(fd -a "NoRollbackFlow.java" . | head -n 20 | head -n 1)
echo "NoRollbackFlow file: $NOROLLBACK"
sed -n '1,220p' "$NOROLLBACK" | cat -n

Repository: MatheMatrix/zstack

Length of output: 46776


🏁 Script executed:

#!/bin/bash
set -euo pipefail

PLATFORM="core/src/main/java/org/zstack/core/Platform.java"
echo "== Platform.err snippet =="
sed -n '930,1020p' "$PLATFORM" | cat -n

echo
echo "== Locate FlowTrigger, NoRollbackFlow =="
rg -n "interface FlowTrigger|class NoRollbackFlow|void fail\\(" core/src/main/java header/src/main/java compute/src/main/java test/src/test/java || true

echo
echo "== Read FlowTrigger definition =="
FLOWTRIGGER=$(fd -a "FlowTrigger.java" . | head -n 1)
echo "FlowTrigger file: $FLOWTRIGGER"
sed -n '1,220p' "$FLOWTRIGGER" | cat -n

echo
echo "== Read NoRollbackFlow definition =="
NOROLLBACK=$(fd -a "NoRollbackFlow.java" . | head -n 1)
echo "NoRollbackFlow file: $NOROLLBACK"
sed -n '1,260p' "$NOROLLBACK" | cat -n

Repository: MatheMatrix/zstack

Length of output: 46776


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="test/src/test/java/org/zstack/test/compute/vm/VmDestroyOnHypervisorFlowTest.java"
cat -n "$FILE"

Repository: MatheMatrix/zstack

Length of output: 6270


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="core/src/main/java/org/zstack/core/Platform.java"
echo "== Locate getComponentLoader and loader fields =="
rg -n "getComponentLoader\\(|ComponentLoader loader|static .*loader" "$FILE"

echo
echo "== Read Platform loader-related portion (around first match) =="
LINE=$(rg -n "getComponentLoader\\(" "$FILE" | head -n 1 | cut -d: -f1)
START=$((LINE-80))
END=$((LINE+120))
sed -n "${START},${END}p" "$FILE" | cat -n

Repository: MatheMatrix/zstack

Length of output: 8106


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="header/src/main/java/org/zstack/header/errorcode/ErrorCode.java"
ls -l "$FILE" || true
rg -n "class ErrorCode|interface ErrorCode|isError\\(|getGlobalErrorCode\\(|getCode\\(|getDescription\\(|toString\\(" "$FILE" || true
sed -n '1,240p' "$FILE" | cat -n

Repository: MatheMatrix/zstack

Length of output: 8738


断连分支测试用 NPE 作为通过条件且缺少 fail(ErrorCode) 断言

rejectsDestroyWhenHostDisconnectedFlowTrigger.fail()/setError() 为空实现,且通过 catch (NullPointerException) 结束用例后仅断言 next() 未被调用;这既无法验证 VmDestroyOnHypervisorFlowchain.fail(err(ORG_ZSTACK_COMPUTE_VM_10331, HostErrors.HOST_IS_DISCONNECTED, ...)) 是否真正发生,也会让任何无关 NPE 都可能造成误报。建议在 fail(ErrorCode) 里记录 ErrorCode(如 AtomicReference<ErrorCode>),并断言其 getGlobalErrorCode()==ORG_ZSTACK_COMPUTE_VM_10331isError(HostErrors.HOST_IS_DISCONNECTED)(必要时再断言 next() 不应触发),避免用 NPE 推断走到了断连分支。

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@test/src/test/java/org/zstack/test/compute/vm/VmDestroyOnHypervisorFlowTest.java`
around lines 93 - 116, The test currently relies on catching a
NullPointerException and has an empty FlowTrigger.fail()/setError(), which fails
to verify the disconnected-host branch; modify the anonymous FlowTrigger used in
rejectsDestroyWhenHostDisconnected to store the ErrorCode into an
AtomicReference<ErrorCode> (implement both fail(ErrorCode) and
setError(ErrorCode) to set it) and remove using NPE as the pass condition; after
flow.run(...) assert that the recorded ErrorCode.getGlobalErrorCode() equals
ORG_ZSTACK_COMPUTE_VM_10331 and that
ErrorCode.isError(HostErrors.HOST_IS_DISCONNECTED) is true, and still assert
wentNext.get() is false to ensure next() was not called.

Comment thread test/src/test/java/org/zstack/test/compute/vm/VmDestroyOnHypervisorFlowTest.java Outdated
zstack and others added 6 commits May 25, 2026 22:02
…roy error path

For non-disconnect errors (where agent may have already acted), transition to
Unknown instead of blindly restoring original copy state. Extract the restore
logic into a private helper. Also fix elaboration message to match code.

Resolves: ZSTAC-84948

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…Batch

Replace direct DB merge with changeVmStateInDb so state events fire
properly on restore (Stopped→stopped event, Running→running event).

Resolves: ZSTAC-84948

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…nnected

Covers intermediate states (Connecting) where the host also cannot
accept a destroy message.

Resolves: ZSTAC-84948

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Move setLastHostUuid into the changeVmStateInDb callback so it is
persisted to DB along with hostUuid, instead of only updating the
in-memory self reference.

Resolves: ZSTAC-84948

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…disconnected host

Delete the mock+spy+catch-Throwable unit test and add a SubCase-based
integration test that creates a VM, sets host to Disconnected via DB,
calls destroyVmInstance via SDK, and verifies:
- API fails when host is Disconnected
- VM state/hostUuid/lastHostUuid preserved by restoreOriginalState

Resolves: ZSTAC-84948

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Fix three review findings:
1. Add missing ImageSpec/InstanceOfferingSpec/L3NetworkSpec imports to test
2. Replace else-branch changeVmStateInDb(unknown) with direct setState + updateAndRefresh
   to preserve Created/Paused/etc states instead of destroying them
3. Revert isHostDisconnected from Q.New(HostVO.class) back to dbf.findByUuid — both have
   same cross-module concern, Q.New added unnecessary churn

Resolves: ZSTAC-84948

Change-Id: Ie6330fadec0074970ffe4381d5e1128a588dd6a2
@zstack-robot-2

Copy link
Copy Markdown
Collaborator Author

Comment from yaohua.wu:

Review: MR !9947 — ZSTAC-84948

Background (preserved across rounds)

  • Jira: ZSTAC-84948 — 【5.5.6-后端】删除VM时宿主机Disconnected导致VM卡在Unknown状态且API返回成功
  • Bug summary: 在宿主机 Disconnected 状态下调用删除 VM 的 API,API 返回成功但 VM 实际未删除,状态被改为 Unknown,无恢复路径。截图显示 destroy 成功但 VM 仍在列表中。
  • Intent & scope: fail-fast 拒绝在 Disconnected 宿主机上 destroy VM;destroy 失败时(错误码包含 HOST_IS_DISCONNECTED)保留原 VM 状态而非切到 Unknown。Scope: compute/vm 模块,新增 1 个错误码、1 个 SubCase,主修复在 VmDestroyOnHypervisorFlow 前置 check + VmInstanceBase done() 回调分支。
  • Round 1 initial findings: 2 × P0 (修复不完整 / host 不存在分支错判), 6 × P1 (状态机绕行 / 测试覆盖不足 / 无关变更等)
  • Suggested fix direction: done() 回调改成"reload host 状态再决定走 unknown 还是 restore"以消除对错误码携带 HOST_IS_DISCONNECTED 的依赖;isHostDisconnectedhvo == null 也视为不可达;补 race-condition 用例和 Running VM 用例

P0 — Critical

# File:Line Issue Fix
1 compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java:235-243 修复不完整:race condition 未被覆盖。新逻辑只在 errCode.isError(HostErrors.HOST_IS_DISCONNECTED) 时 restore,但 Jira 描述的真实触发路径还包括 destroy 期间 宿主机失联/超时——此时 bus.send 返回的错误码大概率是 OPERATION_FAILURE_GC_ELIGIBLE / IO timeout 类型,cause chain 未必含 HOST_IS_DISCONNECTED。原 bug(VM 卡 Unknown)在该路径下依然存在。 done() 回调中直接 reload host 状态再分流,去除对错误码的依赖:
```java
} else {
String hu = originalCopy.getHostUuid() != null ? originalCopy.getHostUuid() : originalCopy.getLastHostUuid();
HostVO h = hu == null ? null : dbf.findByUuid(hu, HostVO.class);
if (h == null
2 compute/src/main/java/org/zstack/compute/vm/VmDestroyOnHypervisorFlow.java:73-76 isHostDisconnected 把 host 不存在判为"已连接"hvo == null 时返回 false,flow 继续走 bus.send,依赖底层 service 路由失败兜底——但 host 完全不存在时连错误码都未必带 HOST_IS_DISCONNECTED,又会回到上面 #1 的死路径。is X 语义方法用 false 表示"找不到"是反模式。 改为:
```java
public boolean isHostNotConnected(String hostUuid) {
HostVO hvo = dbf.findByUuid(hostUuid, HostVO.class);
return hvo == null

P1 — Warning

# File:Line Issue Fix
3 VmInstanceBase.java:265-270 restoreOriginalState 的 else 分支用 self.setState(origState) + dbf.updateAndRefresh(self) 绕过了 VmInstanceStateEvent 状态机,跳过所有 VmStateChangedExtensionPoint 等扩展点。如果原状态是 Paused/Pausing/Migrating 等,观察者会看不到状态切换事件。 把分支扩展开来,对每种原状态调用对应的 VmInstanceStateEventpaused / migrating / unknown 等);只对真正没有事件可对应的状态(如 Destroyed)保留 setState fallback。
4 VmInstanceBase.java:259-260 restoreOriginalState 的 Stopped 分支未恢复 hostUuid/lastHostUuid,与 Running 分支处理不一致。如果原 Stopped VM 之前关联过 host(最近停的),destroy 失败后字段会被先前 destroy 流程清空。 与 Running 分支保持一致:changeVmStateInDb(VmInstanceStateEvent.stopped, () -> { self.setHostUuid(originalCopy.getHostUuid()); self.setLastHostUuid(originalCopy.getLastHostUuid()); });
5 test/src/test/groovy/org/zstack/test/integration/kvm/vm/VmDestroyOnDisconnectedHostCase.groovy:48-95 测试只覆盖了前置 check 路径:两个 case 都在 destroy 之前预先把 host 改成 Disconnected,验证 VmDestroyOnHypervisorFlow 前置拦截。未覆盖 race condition(destroy 进行中 host 失联),而这是原 Jira issue 的真实触发场景,也是 #1 P0 的潜在死点。 补一个 case:destroy 进行中通过 simulator hook 让 DestroyVmOnHypervisorMsg 回复 HOST_IS_DISCONNECTED 或 timeout 错误,验证 VM 状态正确 restore 到原值而非 Unknown。
6 VmDestroyOnDisconnectedHostCase.groovy:50,99 测试使用 VmCreationStrategy.JustCreate(Created 状态)VM,而生产 bug 发生在 Running VM 上。Created 状态 VM 的 destroy 路径与 Running VM 不同(done() 回调里 originalCopy.getState() 值不同),无法验证 Running 分支的 restoreOriginalState 至少补一个 Running VM 的 case:先正常 start 起来再把 host 切 Disconnected,然后 destroy。
7 VmInstanceBase.java:241 无关变更混入:原 changeVmStateInDb(VmInstanceStateEvent.unknown); 改成 self = changeVmStateInDb(...);。如果 changeVmStateInDb 返回新引用、原代码确实有 stale-self bug,应该是独立 commit + 独立测试;如果只是顺手改,会污染本次 fix 的 git blame。 拆成独立 commit 并在 message 中说明原因,或回滚此处保持本 MR 只做主修复。
8 VmDestroyOnHypervisorFlow.java:73 命名 isHostDisconnected 命名与语义不符HostStatusConnected / Disconnected / Connecting 三态,方法对非 Connected 一律返回 true,但错误信息硬编码 "is Disconnected"。host 处于 Connecting 时给用户误导。 见 P0 #2 的 fix:重命名 + 错误信息带上 current status。
9 VmDestroyOnDisconnectedHostCase.groovy 整体 SubCase 注册需确认:ZStack 测试框架要求 SubCase 在某个 main case 的 getSubCases() 中注册才会被执行。本 MR diff 中只看到新增 SubCase 文件,未见对应 main case 的注册改动——如未注册,此用例不会被 CI 跑到。 确认已在对应 main case(如 KvmVmTest / KvmVmCase 类似的入口)里把 VmDestroyOnDisconnectedHostCase.class 加进 getSubCases(),并跑一次稳定性测试:mvn test -Dtest=PremiumTestCaseStabilityTest -Dcases=org.zstack.test.integration.kvm.vm.VmDestroyOnDisconnectedHostCase -Dtimes=2 验证可执行。

P3 — Suggestion

# File:Line Issue Fix
10 VmDestroyOnHypervisorFlow.java:73 isHostDisconnectedrun() 内部使用,可见性应为 private public booleanprivate boolean
11 VmDestroyOnHypervisorFlow.java:8,17,21 新加的 import 空行分组与文件其他 import 紧凑风格不一致。 删除多余空行。

Coverage

  • 6 个文件全部 deep review(2 个 .java 主修复 + 1 测试 + 2 i18n + 1 错误码常量)
  • 跨文件追踪:done() 回调的完整路径未能在 VmInstanceBase.java 中拉取(文件 >420KB 超出单次读上限),改用 diff hunk + 函数签名推断;建议 maintainer 在合入前手动校对 originalCopy 来源与 changeVmStateInDb 返回值语义
  • 残余风险:race condition 路径的真实错误码未通过运行时验证;强烈建议在补完 P1 <fix>[kvm]: filter out stopped VM when invoke getVirtualizerInfo #5 测试后再次确认

Verdict: REVISION_REQUIRED

两条 P0 都指向同一个根因:修复只覆盖了"destroy 前 host 已 Disconnected"这一种情况,遗漏了 race condition 和 host 不存在的分支。改完 P0 #1/#2 + 补 P1 #5 测试后,原 bug 才算真正消除。其它 P1 是质量问题,建议本轮一并处理。


🤖 Robot Reviewer

…ion test

Use SDK createVmInstance with hostUuid parameter instead of JustCreate +
dbf.updateAndRefresh. Use SQL.New() for host status changes (standard
integration test pattern). Use queryVmInstance for assertions instead of
dbf.findByUuid.

Resolves: ZSTAC-84948

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@zstack-robot-2 zstack-robot-2 deleted the sync/songtao.liu/fix/ZSTAC-84948 branch May 26, 2026 09:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant