fix[vm]: reject destroy VM when host is Disconnected and preserve original state#4097
fix[vm]: reject destroy VM when host is Disconnected and preserve original state#4097zstack-robot-2 wants to merge 11 commits into
Conversation
DestroyVmOnHypervisorFlow now checks host status before sending destroy command. If host is Disconnected, the flow fails immediately with HOST_IS_DISCONNECTED error instead of silently creating a GC job and returning success. This prevents VM stuck in Unknown state while API reports success. Resolves: ZSTAC-84948
…nknown Resolves: ZSTAC-84948 Change-Id: Ib3df33d1a1cdd46877ac5570e3f393ae2b9fdab3
Correct err() call missing global error code (ORG_ZSTACK_COMPUTE_VM_10331). Extract isHostDisconnected() for testability without Spring/AspectJ. Add VmDestroyOnHypervisorFlowTest: 5 tests covering host status check and stopped-VM skip logic. Resolves: ZSTAC-84948 Change-Id: I947ace96c530f80eb6be37f2dab289822cdadd87
Replace dbf.findByUuid() with Q.New() for efficient column-level query. Add en_US and zh_CN elaborations for ORG_ZSTACK_COMPUTE_VM_10331. Make isHostDisconnected public for testability via Mockito.spy(). Resolves: ZSTAC-84948 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Warning Review limit reached
More reviews will be available in 16 minutes and 53 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml) Review profile: CHILL Plan: Pro Run ID: ⛔ Files ignored due to path filters (2)
📒 Files selected for processing (3)
走查本PR为VM销毁工作流增加主机连接状态前置检查机制,若主机已断连则直接拒绝销毁请求;同时改进销毁失败时的状态回滚,确保VM状态恢复准确。涵盖工作流逻辑、错误处理与完整单元测试。 变更内容主机断连检查与工作流改进
🎯 3 (Moderate) | ⏱️ ~25 分钟
Important Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional. ❌ Failed checks (1 error, 1 warning)
✅ Passed checks (3 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java`:
- Around line 239-248: The SQLBatch block directly mutates and merges the VM
(findByUuid/ setState/ setHostUuid/ setLastHostUuid/ merge) and thereby bypasses
the proper event/extension chain; replace that direct DB write with a call to
the VM state-change helper so listeners and extensions are notified: inside the
SQLBatch (to preserve transactional semantics) invoke changeVmStateInDb(...) (or
the VM state-change method used elsewhere in VmInstanceBase) to restore the
state/host/lastHost from originalCopy for self.getUuid(), and remove the manual
setState/setHostUuid/setLastHostUuid/merge sequence so the state transition and
extension callbacks are emitted.
In
`@test/src/test/java/org/zstack/test/compute/vm/VmDestroyOnHypervisorFlowTest.java`:
- Around line 143-152: In proceedsNormallyWhenHostConnected in
VmDestroyOnHypervisorFlowTest replace the broad catch(Throwable e) around
flow.run(trigger, data) with a targeted catch for the expected NoSuchMethodError
(or the exact AspectJ weaving error) and in that catch allow the comment/ignore;
for any other Exception or Error rethrow or call Assert.fail(...) with
e.toString() so real failures are not swallowed; additionally assert that
wentNext (and/or failError.get() == null) after flow.run to ensure the flow
actually proceeded; reference flow.run(trigger, data), failError and wentNext in
your changes.
- Around line 93-116: The test currently relies on catching a
NullPointerException and has an empty FlowTrigger.fail()/setError(), which fails
to verify the disconnected-host branch; modify the anonymous FlowTrigger used in
rejectsDestroyWhenHostDisconnected to store the ErrorCode into an
AtomicReference<ErrorCode> (implement both fail(ErrorCode) and
setError(ErrorCode) to set it) and remove using NPE as the pass condition; after
flow.run(...) assert that the recorded ErrorCode.getGlobalErrorCode() equals
ORG_ZSTACK_COMPUTE_VM_10331 and that
ErrorCode.isError(HostErrors.HOST_IS_DISCONNECTED) is true, and still assert
wentNext.get() is false to ensure next() was not called.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)
Review profile: CHILL
Plan: Pro
Run ID: 2d21bf0b-5a91-4f3c-831d-8139b6a84ca3
⛔ Files ignored due to path filters (2)
conf/i18n/globalErrorCodeMapping/global-error-en_US.jsonis excluded by!**/*.jsonconf/i18n/globalErrorCodeMapping/global-error-zh_CN.jsonis excluded by!**/*.json
📒 Files selected for processing (4)
compute/src/main/java/org/zstack/compute/vm/VmDestroyOnHypervisorFlow.javacompute/src/main/java/org/zstack/compute/vm/VmInstanceBase.javatest/src/test/java/org/zstack/test/compute/vm/VmDestroyOnHypervisorFlowTest.javautils/src/main/java/org/zstack/utils/clouderrorcode/CloudOperationsErrorCode.java
| FlowTrigger trigger = new FlowTrigger() { | ||
| @Override | ||
| public void fail(ErrorCode errorCode) { | ||
| } | ||
|
|
||
| @Override | ||
| public void next() { | ||
| wentNext.set(true); | ||
| } | ||
|
|
||
| @Override | ||
| public void setError(ErrorCode errorCode) { | ||
| } | ||
| }; | ||
|
|
||
| try { | ||
| flow.run(trigger, data); | ||
| } catch (NullPointerException e) { | ||
| // expected — Platform.err() requires Spring context in unit test, | ||
| // but the NPE proves we passed the isHostDisconnected check | ||
| } | ||
|
|
||
| Assert.assertFalse("should not go next when host Disconnected", wentNext.get()); | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="test/src/test/java/org/zstack/test/compute/vm/VmDestroyOnHypervisorFlowTest.java"
echo "== File: $FILE =="
ls -l "$FILE"
echo
# Show lines around 70-140 with numbers
nl -ba "$FILE" | sed -n '60,160p'
# Also locate any references to HOST_IS_DISCONNECTED or ORG_ZSTACK_COMPUTE_VM_10331 in this file
rg -n "HOST_IS_DISCONNECTED|ORG_ZSTACK_COMPUTE_VM_10331|Disconnected|host Disconnected|fail\\(" "$FILE" || trueRepository: MatheMatrix/zstack
Length of output: 348
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="test/src/test/java/org/zstack/test/compute/vm/VmDestroyOnHypervisorFlowTest.java"
echo "== File: $FILE =="
ls -l "$FILE"
echo
echo "== Lines 60-180 (numbered via cat -n) =="
cat -n "$FILE" | sed -n '60,180p'
echo
echo "== Search patterns =="
rg -n "HOST_IS_DISCONNECTED|ORG_ZSTACK_COMPUTE_VM_10331|host Disconnected|fail\\(|Disconnected" "$FILE" || trueRepository: MatheMatrix/zstack
Length of output: 5030
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "HOST_IS_DISCONNECTED|ORG_ZSTACK_COMPUTE_VM_10331" . || trueRepository: MatheMatrix/zstack
Length of output: 3170
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="compute/src/main/java/org/zstack/compute/vm/VmDestroyOnHypervisorFlow.java"
echo "== File: $FILE =="
ls -l "$FILE"
echo
# Print around the referenced lines
cat -n "$FILE" | sed -n '1,200p'
# Narrow to around the chain.fail usage we saw in rg
rg -n "ORG_ZSTACK_COMPUTE_VM_10331|isHostDisconnected|chain\\.fail\\(" "$FILE"Repository: MatheMatrix/zstack
Length of output: 8347
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "isError\\(HostErrors\\.HOST_IS_DISCONNECTED\\)|isError\\(HostErrors\\.|HOST_IS_DISCONNECTED\\)\\." -S . || trueRepository: MatheMatrix/zstack
Length of output: 2475
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find Platform.err implementation
rg -n "static .* err\\(" -S "core/src/main/java|header/src/main/java|utils/src/main/java|compute/src/main/java|.*" . || true
# Better: search specifically for "class Platform" and then err method
rg -n "class Platform" . || true
rg -n "Platform\\.err\\(" . || true
# Locate Platform.java
FILE=$(fd -a "Platform.java" . | head -n 20 | tr '\n' ' ')
echo "Possible Platform.java files: $FILE"Repository: MatheMatrix/zstack
Length of output: 9092
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect Platform.err implementation
PLATFORM="core/src/main/java/org/zstack/core/Platform.java"
echo "== Platform.err snippet =="
sed -n '930,1020p' "$PLATFORM" | cat -n
echo
echo "== Locate FlowTrigger, NoRollbackFlow =="
rg -n "interface FlowTrigger|class NoRollbackFlow|void fail\\(" core/src/main/java header/src/main/java compute/src/main/java test/src/test/java || true
echo
echo "== Read FlowTrigger definition =="
FLOWTRIGGER=$(fd -a "FlowTrigger.java" . | head -n 20 | head -n 1)
echo "FlowTrigger file: $FLOWTRIGGER"
sed -n '1,200p' "$FLOWTRIGGER" | cat -n
echo
echo "== Read NoRollbackFlow definition =="
NOROLLBACK=$(fd -a "NoRollbackFlow.java" . | head -n 20 | head -n 1)
echo "NoRollbackFlow file: $NOROLLBACK"
sed -n '1,220p' "$NOROLLBACK" | cat -nRepository: MatheMatrix/zstack
Length of output: 46776
🏁 Script executed:
#!/bin/bash
set -euo pipefail
PLATFORM="core/src/main/java/org/zstack/core/Platform.java"
echo "== Platform.err snippet =="
sed -n '930,1020p' "$PLATFORM" | cat -n
echo
echo "== Locate FlowTrigger, NoRollbackFlow =="
rg -n "interface FlowTrigger|class NoRollbackFlow|void fail\\(" core/src/main/java header/src/main/java compute/src/main/java test/src/test/java || true
echo
echo "== Read FlowTrigger definition =="
FLOWTRIGGER=$(fd -a "FlowTrigger.java" . | head -n 1)
echo "FlowTrigger file: $FLOWTRIGGER"
sed -n '1,220p' "$FLOWTRIGGER" | cat -n
echo
echo "== Read NoRollbackFlow definition =="
NOROLLBACK=$(fd -a "NoRollbackFlow.java" . | head -n 1)
echo "NoRollbackFlow file: $NOROLLBACK"
sed -n '1,260p' "$NOROLLBACK" | cat -nRepository: MatheMatrix/zstack
Length of output: 46776
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="test/src/test/java/org/zstack/test/compute/vm/VmDestroyOnHypervisorFlowTest.java"
cat -n "$FILE"Repository: MatheMatrix/zstack
Length of output: 6270
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="core/src/main/java/org/zstack/core/Platform.java"
echo "== Locate getComponentLoader and loader fields =="
rg -n "getComponentLoader\\(|ComponentLoader loader|static .*loader" "$FILE"
echo
echo "== Read Platform loader-related portion (around first match) =="
LINE=$(rg -n "getComponentLoader\\(" "$FILE" | head -n 1 | cut -d: -f1)
START=$((LINE-80))
END=$((LINE+120))
sed -n "${START},${END}p" "$FILE" | cat -nRepository: MatheMatrix/zstack
Length of output: 8106
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="header/src/main/java/org/zstack/header/errorcode/ErrorCode.java"
ls -l "$FILE" || true
rg -n "class ErrorCode|interface ErrorCode|isError\\(|getGlobalErrorCode\\(|getCode\\(|getDescription\\(|toString\\(" "$FILE" || true
sed -n '1,240p' "$FILE" | cat -nRepository: MatheMatrix/zstack
Length of output: 8738
断连分支测试用 NPE 作为通过条件且缺少 fail(ErrorCode) 断言
rejectsDestroyWhenHostDisconnected 中 FlowTrigger.fail()/setError() 为空实现,且通过 catch (NullPointerException) 结束用例后仅断言 next() 未被调用;这既无法验证 VmDestroyOnHypervisorFlow 的 chain.fail(err(ORG_ZSTACK_COMPUTE_VM_10331, HostErrors.HOST_IS_DISCONNECTED, ...)) 是否真正发生,也会让任何无关 NPE 都可能造成误报。建议在 fail(ErrorCode) 里记录 ErrorCode(如 AtomicReference<ErrorCode>),并断言其 getGlobalErrorCode()==ORG_ZSTACK_COMPUTE_VM_10331 且 isError(HostErrors.HOST_IS_DISCONNECTED)(必要时再断言 next() 不应触发),避免用 NPE 推断走到了断连分支。
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@test/src/test/java/org/zstack/test/compute/vm/VmDestroyOnHypervisorFlowTest.java`
around lines 93 - 116, The test currently relies on catching a
NullPointerException and has an empty FlowTrigger.fail()/setError(), which fails
to verify the disconnected-host branch; modify the anonymous FlowTrigger used in
rejectsDestroyWhenHostDisconnected to store the ErrorCode into an
AtomicReference<ErrorCode> (implement both fail(ErrorCode) and
setError(ErrorCode) to set it) and remove using NPE as the pass condition; after
flow.run(...) assert that the recorded ErrorCode.getGlobalErrorCode() equals
ORG_ZSTACK_COMPUTE_VM_10331 and that
ErrorCode.isError(HostErrors.HOST_IS_DISCONNECTED) is true, and still assert
wentNext.get() is false to ensure next() was not called.
…roy error path For non-disconnect errors (where agent may have already acted), transition to Unknown instead of blindly restoring original copy state. Extract the restore logic into a private helper. Also fix elaboration message to match code. Resolves: ZSTAC-84948 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…Batch Replace direct DB merge with changeVmStateInDb so state events fire properly on restore (Stopped→stopped event, Running→running event). Resolves: ZSTAC-84948 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…nnected Covers intermediate states (Connecting) where the host also cannot accept a destroy message. Resolves: ZSTAC-84948 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Move setLastHostUuid into the changeVmStateInDb callback so it is persisted to DB along with hostUuid, instead of only updating the in-memory self reference. Resolves: ZSTAC-84948 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…disconnected host Delete the mock+spy+catch-Throwable unit test and add a SubCase-based integration test that creates a VM, sets host to Disconnected via DB, calls destroyVmInstance via SDK, and verifies: - API fails when host is Disconnected - VM state/hostUuid/lastHostUuid preserved by restoreOriginalState Resolves: ZSTAC-84948 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Fix three review findings: 1. Add missing ImageSpec/InstanceOfferingSpec/L3NetworkSpec imports to test 2. Replace else-branch changeVmStateInDb(unknown) with direct setState + updateAndRefresh to preserve Created/Paused/etc states instead of destroying them 3. Revert isHostDisconnected from Q.New(HostVO.class) back to dbf.findByUuid — both have same cross-module concern, Q.New added unnecessary churn Resolves: ZSTAC-84948 Change-Id: Ie6330fadec0074970ffe4381d5e1128a588dd6a2
|
Comment from yaohua.wu: Review: MR !9947 — ZSTAC-84948Background (preserved across rounds)
P0 — Critical
P1 — Warning
P3 — Suggestion
Coverage
Verdict: REVISION_REQUIRED两条 P0 都指向同一个根因:修复只覆盖了"destroy 前 host 已 Disconnected"这一种情况,遗漏了 race condition 和 host 不存在的分支。改完 P0 #1/#2 + 补 P1 #5 测试后,原 bug 才算真正消除。其它 P1 是质量问题,建议本轮一并处理。 🤖 Robot Reviewer |
…ion test Use SDK createVmInstance with hostUuid parameter instead of JustCreate + dbf.updateAndRefresh. Use SQL.New() for host status changes (standard integration test pattern). Use queryVmInstance for assertions instead of dbf.findByUuid. Resolves: ZSTAC-84948 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Root Cause
Destroying a VM during host disconnect left the VM in Unknown state with no recovery path.
Fix
Resolves: ZSTAC-84948
sync from gitlab !9947