
[ZSTAC-83890][5.4.8] HA pre-fence leftover qemu on suspect host via sibling #3899

Closed
MatheMatrix wants to merge 501 commits into 5.4.8 from sync/yingzhe.hu/fix/ZSTAC-83890@@3


Conversation

@MatheMatrix

Owner

Background

TIC-5513: When watch_ping is delayed (e.g. the source host is OOM or the storage network is blocked), the Ceph OSD evicts the watcher after osd_client_watch_timeout (default 30s). rbd status then briefly returns Watchers: none even though qemu is still alive. If HA's second watcher check lands in this transient empty window, the VM is started on a new host, and split-brain follows once the source host recovers.

Fix

Add HaPreFenceVmExtensionPoint, called by VmInstanceBase.handle(HaStartVmInstanceMsg) before BeforeHaStartVmInstanceExtensionPoint. The KVM implementation, KvmHaPreFenceVmExtension, works as follows:

  1. Read vm.hostUuid (HA flow has not yet nulled it - this is the suspect host).
  2. Pick a sibling KVM host. Prefer HaStartVmInstanceMsg.siblingHostUuids (the HA decision already vetted them via HaKvmHostSiblingChecker); fall back to a fresh query for Connected + Enabled KVM hosts in the same cluster.
  3. Send KVMHostAsyncHttpCallMsg with FenceVmFromPeerCmd to that sibling.
  4. The sibling's kvmagent SSHes into the suspect host and runs virsh destroy, then pkill -9 -f qemu.*vmUuid, then verifies with pgrep.
  5. Verdict:
    • qemuConfirmedDead -> HA proceeds
    • targetHostUnreachable -> HA proceeds (host truly down)
    • qemuStillAlive -> HA refused (refuse to split-brain)
    • no sibling available -> HA refused (fail-safe)
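
The sibling selection in step 2 and the verdict handling in step 5 can be sketched as follows. This is a minimal Python illustration only, not the code in this PR (the management-node side is Java). The names `Verdict`, `pick_sibling`, and `ha_may_proceed` are hypothetical; only the verdict strings come from the description above.

```python
from enum import Enum
from typing import Optional, Sequence


class Verdict(Enum):
    # Verdict names taken from the PR description; the enum itself is hypothetical.
    QEMU_CONFIRMED_DEAD = "qemuConfirmedDead"
    TARGET_HOST_UNREACHABLE = "targetHostUnreachable"
    QEMU_STILL_ALIVE = "qemuStillAlive"


def pick_sibling(suspect_host_uuid: str,
                 vetted_siblings: Sequence[str],
                 cluster_candidates: Sequence[str]) -> Optional[str]:
    """Prefer siblings already vetted by the HA decision, then fall back to
    hosts returned by a fresh same-cluster query. Never pick the suspect host."""
    for pool in (vetted_siblings, cluster_candidates):
        for host_uuid in pool:
            if host_uuid != suspect_host_uuid:
                return host_uuid
    return None


def ha_may_proceed(sibling_uuid: Optional[str],
                   verdict: Optional[Verdict]) -> bool:
    """Return True only when starting the VM elsewhere cannot cause split-brain."""
    if sibling_uuid is None:
        # No sibling available: nobody could check the suspect host, so refuse.
        return False
    # Proceed when qemu is confirmed dead or the suspect host is truly down;
    # anything else (including a missing verdict) is treated as a refusal.
    return verdict in (Verdict.QEMU_CONFIRMED_DEAD,
                       Verdict.TARGET_HOST_UNREACHABLE)
```

Treating every unlisted outcome as a refusal keeps the sketch fail-safe, in line with the "no sibling available -> HA refused" rule.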

Companion MRs

  • premium: yingzhe.hu/premium fix/ZSTAC-83890@@3 -> 5.4.8 (sibling propagation chain + integration test HaPreFenceVmFromPeerCase)
  • zstack-utility: yingzhe.hu/zstack-utility fix/ZSTAC-83890@@3 -> 5.4.8 (agent endpoint /ha/fence/vm/from/peer)
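
For orientation, the agent-side sequence from step 4 might look roughly like the sketch below. It is an assumption-laden illustration, not the zstack-utility implementation. The helper names, the `ConnectTimeout=5` option, and the mapping from exit codes to verdicts are all hypothetical. What it relies on is standard behaviour: `ssh` exits with 255 when the connection itself fails, and `pgrep` exits with 1 when nothing matches and 0 when at least one process matches.

```python
import shlex
from typing import List

SSH_CONNECT_FAILED = 255  # ssh's own exit status when it cannot connect


def build_fence_commands(suspect_host_ip: str, vm_uuid: str) -> List[List[str]]:
    """Build the three remote steps: destroy, force-kill, verify.
    Each entry is an argv list suitable for subprocess.run()."""
    pattern = "qemu.*%s" % vm_uuid
    remote_steps = [
        "virsh destroy %s" % shlex.quote(vm_uuid),
        "pkill -9 -f %s" % shlex.quote(pattern),
        "pgrep -f %s" % shlex.quote(pattern),
    ]
    return [["ssh", "-o", "ConnectTimeout=5", suspect_host_ip, step]
            for step in remote_steps]


def classify_pgrep_exit(exit_code: int) -> str:
    """Map the exit status of the final ssh+pgrep step to a verdict string."""
    if exit_code == SSH_CONNECT_FAILED:
        return "targetHostUnreachable"
    if exit_code == 1:
        return "qemuConfirmedDead"   # pgrep matched nothing
    # Exit 0 means a qemu process still matches; any other status is
    # ambiguous and is treated as "still alive" to stay fail-safe.
    return "qemuStillAlive"
```

A real handler would also need to make sure the remote shell that carries the pattern on its own command line is not itself matched by `pkill -f` or `pgrep -f`; the sketch does not address that.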

Refs

Resolves: ZSTAC-83890
Refs: TIC-5513

sync from gitlab !9783

AlanJager and others added 30 commits February 17, 2026 19:32
…anup

Resolves: ZSTAC-80821

Change-Id: I59284c4e69f5d2ee357b1836b7c243200e30949a
Resolves: ZSTAC-77544

Change-Id: I1f711bff9c1e87a8cbf6a2eb310ca6086f0f99ba
Resolves: ZSTAC-80821

Change-Id: Ia9a9597feceb96b3e6e22259e2d0be7bde8ae499
…V bond check

add new error code constant for SR-IOV bond slave NIC count validation.

Resolves: ZSTAC-81163

Change-Id: Ie2a74411129a98c3c03a4a085e94a3bd45922da5
…tion

Resolves: ZSTAC-80991

Change-Id: I7677ddc25c8859e35e8ba80fd3105406bc761a76
Resolves: ZSTAC-74908

Change-Id: I48054139babb1e8092ab81e4367743ae3fd8aefb
Resolves: ZSTAC-81354

Change-Id: Iff2131b3a878444fa27641f24dd727fe4fa176fb
- UpdateQueryImpl: guard val.getClass() NPE
- LogSafeGson: return JsonNull when input is null
- HostAllocatorChain: null check completion
- VmInstanceVO: use Objects.equals to avoid NPE
- SessionManagerImpl: guard null session
- VmCapabilitiesJudger: guard null PS type result
- CephPSMonBase: guard null self when evicted
- CephPSBase: guard null mon.getSelf()
- HostBase: guard null self when HostVO deleted
- ExternalPSFactory: guard null URI protocol
- LocalStorageBase: guard null errorCode.getCause()

Resolves: ZSTAC-69300, ZSTAC-69957, ZSTAC-71973,
 ZSTAC-81294, ZSTAC-70180, ZSTAC-70181,
 ZSTAC-78309, ZSTAC-78310, ZSTAC-70668,
 ZSTAC-71909, ZSTAC-80555, ZSTAC-81270,
 ZSTAC-70101, ZSTAC-72034, ZSTAC-73197,
 ZSTAC-79921, ZSTAC-81160, ZSTAC-81224,
 ZSTAC-81805, ZSTAC-72304, ZSTAC-81804,
 ZSTAC-74898, ZSTAC-69215, ZSTAC-70151,
 ZSTAC-68933

Change-Id: I910e9b542ecd254fdf7e956f943316988a56a1f9
- LdapUtil: CRE to OperationFailureException
- QueryFacadeImpl: CRE to OperationFailureException
- HostAllocatorManagerImpl: CRE to warn + clamp
- CloudOperationsErrorCode: add LDAP/PROMETHEUS codes

Resolves: ZSTAC-81334

Change-Id: Iab947b0476e9174d5a61baa095847b521b1f59fa
<fix>[pciDevice]: add error code ORG_ZSTACK_PCIDEVICE_10077

See merge request zstackio/zstack!9205
Add ORG_ZSTACK_AI_10134 for ModelCenter disconnected
check and ORG_ZSTACK_PCIDEVICE_10077 for SR-IOV bond
validation.

Resolves: ZSTAC-72783

Change-Id: I504a415a6e822513df955be600188ae88e2e1058
fix(ai): add error codes for AI and PCI modules [ZSTAC-72783]

See merge request zstackio/zstack!9215
<fix>[multi]: batch fix CRE quality issues

See merge request zstackio/zstack!9214
<fix>[volumebackup]: add backup cancel timeout error code

See merge request zstackio/zstack!9204
Companion DB migration for premium MR !13025. Adds normalizedModelName column and index to GpuDeviceSpecVO for GPU spec dedup by normalized model name.

Resolves: ZSTAC-75319

Change-Id: If15e615bcbda955cc1d6c58527bae27d4af4b497
<fix>[compute]: respect vm.migrationQuantity during host maintenance

See merge request zstackio/zstack!9209
…ttachedPSMountPath NPE

Resolves: ZSTAC-69300

Change-Id: I1b39a9e7b76751e8a4ef4cc53e9ac2028386e334
Resolves: ZSTAC-61988

Change-Id: Id3d5a48801fda21e2a51d96949c743bac254b2e6
<fix>[kvm]: configurable orphan skip timeout

See merge request zstackio/zstack!9203
Resolves: ZSTAC-61988

Change-Id: I0908fce97128904f9954198645290f4e5709252e
- Add Javadoc: NULL resourceType = universal (backward compatible)
- Add resourceType to TagPatternVO_ metamodel and TagPatternInventory
- Add groovy integration test (3 scenarios: universal/scoped/combined filter)

Resolves: ZSTAC-74908

Change-Id: I6fc05535ae688e50290759f1e129501f0240696c
<fix>[multi]: batch guard NPE quality issues

See merge request zstackio/zstack!9213
<fix>[telemetry]: fix Sentry transaction loss and add debug logging

See merge request zstackio/zstack!9220
<fix>[gpu]: add normalizedModelName migration SQL

See merge request zstackio/zstack!9218
<fix>[zbs]: sync MDS node statuses to DB when reconnect fails

See merge request zstackio/zstack!9161
Add missing error codes ORG_ZSTACK_STORAGE_PRIMARY_10048
and ORG_ZSTACK_STORAGE_PRIMARY_10051 to all 10 language
files. Fix zh_CN mistranslations replaced with correct
term. Fix zh_TW garbled characters in error messages.

Resolves: ZSTAC-72656

Change-Id: I5f08109d1c415b751ec130285b9d92522f1e0a34
Resolves: ZSTAC-77454

Change-Id: I3e91cc5eb349960d56f775097ed55f34f1866be2
add resourceType field to TagPatternInventory

Resolves: ZSTAC-74908

Change-Id: I34f60a714fa6f6be302d3e15cb8149321a1badc4
fix(ZSTAC-72656): improve i18n error messages for PS UUID conflicts

See merge request zstackio/zstack!9224
Add GoTestTemplate.groovy to auto-generate unit tests
and integration tests during SDK code generation.

Change-Id: Ic6f05df8609b406350c76fca9e7723298fa4b72a
Signed-off-by: AlanJager <ye.zou@zstack.io>
@MatheMatrix MatheMatrix force-pushed the sync/yingzhe.hu/fix/ZSTAC-83890@@3 branch 9 times, most recently from de880f2 to 572954d on May 11, 2026 11:30
shixin.ruan and others added 4 commits May 11, 2026 21:54
Existing host reconnect can fail before kvmagent echo.

Ansible masks libvirt sockets with systemctl during deploy.

If host systemd D-Bus is stuck, that optional step times out.

Continue only for this known mask timeout on reconnect.

New host deploy and other ansible failures still fail.

Resolves: ZSTAC-77120

Change-Id: I0ef4a535065ff797c9e4cfae5b39c2daa321a4cc
fix(ZSTAC-77120): continue reconnect on systemd timeout

See merge request zstackio/zstack!9810
Remove an unused MariaDB socket before zstack-server start.
Restart MariaDB so MN can recover after power loss.

Resolves: ZSTAC-83507

Change-Id: Ifb8255f7c9021ef12353f058a0230fe66989b590
…to avoid stale config cache

Resolves: ZSTAC-79075

Change-Id: I6868776c0ea5ef4905ec409181fbf2a431f3034a
@MatheMatrix MatheMatrix force-pushed the sync/yingzhe.hu/fix/ZSTAC-83890@@3 branch from 06546d0 to afc0c5b on May 12, 2026 07:43
<feature>[utils]: Network group high availability strategy

See merge request zstackio/zstack!9779
@MatheMatrix MatheMatrix force-pushed the sync/yingzhe.hu/fix/ZSTAC-83890@@3 branch from afc0c5b to 847cc3b on May 12, 2026 08:09
gitlab and others added 11 commits May 12, 2026 09:21
<feature>[conf]: recover stale mariadb socket

See merge request zstackio/zstack!9809
<fix>[virtualRouter]: skip grayscale upgrade check on auto reconnect to avoid stale config cache

See merge request zstackio/zstack!9814
Store VmModelMountVO lastAttachedEpoch in database so restore cleanup can distinguish successful asynchronous attach from stale failure callbacks.

Tested: docker verify-case runMavenProfile premium
Tested: docker verify-case VmModelMountCase

Resolves: ZSTAC-84246

Change-Id: Ib426797059be9401c3e2556ecc31c1879dd04049
Resolves: ZSTAC-84919

Change-Id: Iff44de2bbb1dad50e75678180cb89d3430c562b1
<fix>[ai]: persist mount restore epoch schema

See merge request zstackio/zstack!9833
<fix>[compute]: ZSTAC-84919 avoid stale iso detach NPE

See merge request zstackio/zstack!9834
Move the lastAttachedEpoch schema change out of the 5.5.16 upgrade file by restoring that file to match the 5.5.16 release branch. The 5.5.22 upgrade file already carries the ADD_COLUMN migration, which keeps the change scoped to the target release.

Constraint: 5.5.16 release schema must remain byte-for-byte aligned with upstream/5.5.16
Rejected: Leave the column in V5.5.16__schema.sql | would mutate an already released upgrade file
Confidence: high
Scope-risk: narrow
Tested: git diff --exit-code upstream/5.5.16 -- conf/db/upgrade/V5.5.16__schema.sql
Tested: git diff --exit-code upstream/5.5.22 -- conf/db/upgrade/V5.5.22__schema.sql
Tested: git diff --check

Resolves: ZSTAC-84246

Change-Id: Ibf18c6665d9d3c90361ed7f57d65fc609f6a1bfb
<fix>[ai]: keep 5.5.16 schema unchanged

See merge request zstackio/zstack!9874
APIImpact

Resolves: ZSTAC-85167

Change-Id: I6279726f7061786e6f6a66726368647074646c78
<fix>[loadBalancer]: support disabling UDP listener health check

See merge request zstackio/zstack!9860
Pre-fence leftover QEMU processes through a reachable sibling host
and pass that sibling through HA VM start. Use the agent success
flag as the pre-fence verdict and drop redundant response fields.

Test: mvn -pl plugin/kvm,simulator/simulatorImpl,testlib -am -DskipTests compile

Resolves: ZSTAC-83890

Change-Id: I168adf82338f9df9e76287619b7f76a8e5be695f
(cherry picked from commit 847cc3b)
@MatheMatrix MatheMatrix force-pushed the sync/yingzhe.hu/fix/ZSTAC-83890@@3 branch from 847cc3b to 28f5849 on May 15, 2026 10:16
@MatheMatrix MatheMatrix deleted the sync/yingzhe.hu/fix/ZSTAC-83890@@3 branch May 15, 2026 10:21
