[ZSTAC-83890][5.4.8] HA pre-fence leftover qemu on suspect host via sibling#3899
Closed
MatheMatrix wants to merge 501 commits into
…anup Resolves: ZSTAC-80821 Change-Id: I59284c4e69f5d2ee357b1836b7c243200e30949a
Resolves: ZSTAC-77544 Change-Id: I1f711bff9c1e87a8cbf6a2eb310ca6086f0f99ba
Resolves: ZSTAC-80821 Change-Id: Ia9a9597feceb96b3e6e22259e2d0be7bde8ae499
…V bond check add new error code constant for SR-IOV bond slave NIC count validation. Resolves: ZSTAC-81163 Change-Id: Ie2a74411129a98c3c03a4a085e94a3bd45922da5
…tion Resolves: ZSTAC-80991 Change-Id: I7677ddc25c8859e35e8ba80fd3105406bc761a76
Resolves: ZSTAC-74908 Change-Id: I48054139babb1e8092ab81e4367743ae3fd8aefb
Resolves: ZSTAC-81354 Change-Id: Iff2131b3a878444fa27641f24dd727fe4fa176fb
- UpdateQueryImpl: guard val.getClass() NPE
- LogSafeGson: return JsonNull when input is null
- HostAllocatorChain: null check completion
- VmInstanceVO: use Objects.equals to avoid NPE
- SessionManagerImpl: guard null session
- VmCapabilitiesJudger: guard null PS type result
- CephPSMonBase: guard null self when evicted
- CephPSBase: guard null mon.getSelf()
- HostBase: guard null self when HostVO deleted
- ExternalPSFactory: guard null URI protocol
- LocalStorageBase: guard null errorCode.getCause()
Resolves: ZSTAC-69300, ZSTAC-69957, ZSTAC-71973, ZSTAC-81294, ZSTAC-70180, ZSTAC-70181, ZSTAC-78309, ZSTAC-78310, ZSTAC-70668, ZSTAC-71909, ZSTAC-80555, ZSTAC-81270, ZSTAC-70101, ZSTAC-72034, ZSTAC-73197, ZSTAC-79921, ZSTAC-81160, ZSTAC-81224, ZSTAC-81805, ZSTAC-72304, ZSTAC-81804, ZSTAC-74898, ZSTAC-69215, ZSTAC-70151, ZSTAC-68933
Change-Id: I910e9b542ecd254fdf7e956f943316988a56a1f9
- LdapUtil: CRE to OperationFailureException
- QueryFacadeImpl: CRE to OperationFailureException
- HostAllocatorManagerImpl: CRE to warn + clamp
- CloudOperationsErrorCode: add LDAP/PROMETHEUS codes
Resolves: ZSTAC-81334
Change-Id: Iab947b0476e9174d5a61baa095847b521b1f59fa
<fix>[pciDevice]: add error code ORG_ZSTACK_PCIDEVICE_10077 See merge request zstackio/zstack!9205
Add ORG_ZSTACK_AI_10134 for ModelCenter disconnected check and ORG_ZSTACK_PCIDEVICE_10077 for SR-IOV bond validation. Resolves: ZSTAC-72783 Change-Id: I504a415a6e822513df955be600188ae88e2e1058
fix(ai): add error codes for AI and PCI modules [ZSTAC-72783] See merge request zstackio/zstack!9215
<fix>[multi]: batch fix CRE quality issues See merge request zstackio/zstack!9214
<fix>[volumebackup]: add backup cancel timeout error code See merge request zstackio/zstack!9204
Companion DB migration for premium MR !13025. Adds normalizedModelName column and index to GpuDeviceSpecVO for GPU spec dedup by normalized model name. Resolves: ZSTAC-75319 Change-Id: If15e615bcbda955cc1d6c58527bae27d4af4b497
<fix>[compute]: respect vm.migrationQuantity during host maintenance See merge request zstackio/zstack!9209
…ttachedPSMountPath NPE Resolves: ZSTAC-69300 Change-Id: I1b39a9e7b76751e8a4ef4cc53e9ac2028386e334
Resolves: ZSTAC-61988 Change-Id: Id3d5a48801fda21e2a51d96949c743bac254b2e6
<fix>[kvm]: configurable orphan skip timeout See merge request zstackio/zstack!9203
Resolves: ZSTAC-61988 Change-Id: I0908fce97128904f9954198645290f4e5709252e
- Add Javadoc: NULL resourceType = universal (backward compatible)
- Add resourceType to TagPatternVO_ metamodel and TagPatternInventory
- Add groovy integration test (3 scenarios: universal/scoped/combined filter)
Resolves: ZSTAC-74908
Change-Id: I6fc05535ae688e50290759f1e129501f0240696c
<fix>[multi]: batch guard NPE quality issues See merge request zstackio/zstack!9213
<fix>[telemetry]: fix Sentry transaction loss and add debug logging See merge request zstackio/zstack!9220
<fix>[gpu]: add normalizedModelName migration SQL See merge request zstackio/zstack!9218
<fix>[zbs]: sync MDS node statuses to DB when reconnect fails See merge request zstackio/zstack!9161
Add missing error codes ORG_ZSTACK_STORAGE_PRIMARY_10048 and ORG_ZSTACK_STORAGE_PRIMARY_10051 to all 10 language files. Fix zh_CN mistranslations by replacing them with the correct term. Fix garbled zh_TW characters in error messages. Resolves: ZSTAC-72656 Change-Id: I5f08109d1c415b751ec130285b9d92522f1e0a34
Resolves: ZSTAC-77454 Change-Id: I3e91cc5eb349960d56f775097ed55f34f1866be2
add resourceType field to TagPatternInventory Resolves: ZSTAC-74908 Change-Id: I34f60a714fa6f6be302d3e15cb8149321a1badc4
fix(ZSTAC-72656): improve i18n error messages for PS UUID conflicts See merge request zstackio/zstack!9224
Add GoTestTemplate.groovy to auto-generate unit tests and integration tests during SDK code generation. Change-Id: Ic6f05df8609b406350c76fca9e7723298fa4b72a Signed-off-by: AlanJager <ye.zou@zstack.io>
de880f2 to 572954d
Existing host reconnect can fail before kvmagent echo. Ansible masks libvirt sockets with systemctl during deploy. If host systemd D-Bus is stuck, that optional step times out. Continue only for this known mask timeout on reconnect. New host deploy and other ansible failures still fail. Resolves: ZSTAC-77120 Change-Id: I0ef4a535065ff797c9e4cfae5b39c2daa321a4cc
fix(ZSTAC-77120): continue reconnect on systemd timeout See merge request zstackio/zstack!9810
Remove an unused MariaDB socket before zstack-server start. Restart MariaDB so MN can recover after power loss. Resolves: ZSTAC-83507 Change-Id: Ifb8255f7c9021ef12353f058a0230fe66989b590
…to avoid stale config cache Resolves: ZSTAC-79075 Change-Id: I6868776c0ea5ef4905ec409181fbf2a431f3034a
06546d0 to afc0c5b
<feature>[utils]: Network group high availability strategy See merge request zstackio/zstack!9779
afc0c5b to 847cc3b
<feature>[conf]: recover stale mariadb socket See merge request zstackio/zstack!9809
<fix>[virtualRouter]: skip grayscale upgrade check on auto reconnect to avoid stale config cache See merge request zstackio/zstack!9814
Store VmModelMountVO lastAttachedEpoch in database so restore cleanup can distinguish successful asynchronous attach from stale failure callbacks.
Tested: docker verify-case runMavenProfile premium
Tested: docker verify-case VmModelMountCase
Resolves: ZSTAC-84246
Change-Id: Ib426797059be9401c3e2556ecc31c1879dd04049
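The epoch idea in this commit message can be sketched roughly as follows. This is a minimal illustration, not the committed code: the class `MountEpochGuard`, its method names, and the comparison rule (a failure callback is stale when an attach with the same or a newer epoch has already succeeded) are assumptions; only the existence of a persisted `lastAttachedEpoch` comes from the message above.

```java
/**
 * Sketch of epoch-guarded cleanup. Hypothetical names and comparison rule;
 * only the idea of a persisted lastAttachedEpoch comes from the commit message.
 */
final class MountEpochGuard {
    // Would be a database column in the real schema; kept in memory here.
    private long lastAttachedEpoch;

    /** Record that the attach attempt started at the given epoch has succeeded. */
    void onAttachSucceeded(long attemptEpoch) {
        lastAttachedEpoch = Math.max(lastAttachedEpoch, attemptEpoch);
    }

    /**
     * A failure callback triggers cleanup only if no attach with the same or a
     * newer epoch has succeeded; otherwise the callback is treated as stale.
     */
    boolean shouldCleanUpAfterFailure(long attemptEpoch) {
        return attemptEpoch > lastAttachedEpoch;
    }

    public static void main(String[] args) {
        MountEpochGuard guard = new MountEpochGuard();
        guard.onAttachSucceeded(7L);
        System.out.println("failure from epoch 6 cleans up: " + guard.shouldCleanUpAfterFailure(6L));
        System.out.println("failure from epoch 8 cleans up: " + guard.shouldCleanUpAfterFailure(8L));
    }
}
```

Persisting the epoch rather than holding it in memory matters because the restore path may run after a management node restart, when in-memory state about earlier attach attempts is gone.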
Resolves: ZSTAC-84919 Change-Id: Iff44de2bbb1dad50e75678180cb89d3430c562b1
<fix>[ai]: persist mount restore epoch schema See merge request zstackio/zstack!9833
<fix>[compute]: ZSTAC-84919 avoid stale iso detach NPE See merge request zstackio/zstack!9834
Move the lastAttachedEpoch schema change out of the 5.5.16 upgrade file by restoring that file to match the 5.5.16 release branch. The 5.5.22 upgrade file already carries the ADD_COLUMN migration, which keeps the change scoped to the target release.
Constraint: 5.5.16 release schema must remain byte-for-byte aligned with upstream/5.5.16
Rejected: Leave the column in V5.5.16__schema.sql | would mutate an already released upgrade file
Confidence: high
Scope-risk: narrow
Tested: git diff --exit-code upstream/5.5.16 -- conf/db/upgrade/V5.5.16__schema.sql
Tested: git diff --exit-code upstream/5.5.22 -- conf/db/upgrade/V5.5.22__schema.sql
Tested: git diff --check
Resolves: ZSTAC-84246
Change-Id: Ibf18c6665d9d3c90361ed7f57d65fc609f6a1bfb
<fix>[ai]: keep 5.5.16 schema unchanged See merge request zstackio/zstack!9874
APIImpact Resolves: ZSTAC-85167 Change-Id: I6279726f7061786e6f6a66726368647074646c78
<fix>[loadBalancer]: support disabling UDP listener health check See merge request zstackio/zstack!9860
Pre-fence leftover QEMU processes through a reachable sibling host and pass that sibling through HA VM start. Use the agent success flag as the pre-fence verdict and drop redundant response fields. Test: mvn -pl plugin/kvm,simulator/simulatorImpl,testlib -am -DskipTests compile Resolves: ZSTAC-83890 Change-Id: I168adf82338f9df9e76287619b7f76a8e5be695f (cherry picked from commit 847cc3b)
847cc3b to 28f5849
Background
TIC-5513: The Ceph OSD option osd_client_watch_timeout (default 30s) evicts a watcher when its watch_ping is delayed (e.g. the source host is OOM or the storage network is blocked). During that window, rbd status briefly returns Watchers: none even though qemu is still alive. HA's second watcher check can land in this transient empty window, so the VM is started on a new host, and split-brain follows once the source host recovers.
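The timing of that window can be illustrated with a toy model. This is a sketch for intuition only, not Ceph or ZStack code: the class `WatcherWindow`, its methods, and the millisecond bookkeeping are hypothetical, and the exact eviction boundary is simplified. Only the 30s default of osd_client_watch_timeout comes from the description above.

```java
/**
 * Toy timing model of the transient "Watchers: none" window.
 * Hypothetical illustration only; not Ceph or ZStack code.
 */
final class WatcherWindow {
    /** Default osd_client_watch_timeout quoted in the description, in milliseconds. */
    static final long WATCH_TIMEOUT_MS = 30_000L;

    /** In this model a watcher stays listed only while its last ping is recent enough. */
    static boolean watcherListed(long lastPingMs, long nowMs) {
        return nowMs - lastPingMs <= WATCH_TIMEOUT_MS;
    }

    /**
     * An empty watcher list alone proves nothing about qemu: the process may be
     * alive with its pings delayed. Starting elsewhere is safe only when the
     * absence of qemu has been confirmed by some other means.
     */
    static boolean safeToStartElsewhere(boolean watcherListed, boolean qemuConfirmedGone) {
        return !watcherListed && qemuConfirmedGone;
    }

    public static void main(String[] args) {
        long lastPing = 0L;
        long now = 45_000L; // pings stalled for 45 s, e.g. storage network blocked
        boolean listed = watcherListed(lastPing, now);
        System.out.println("watcher listed: " + listed);
        System.out.println("safe without confirmation: " + safeToStartElsewhere(listed, false));
        System.out.println("safe after confirmation: " + safeToStartElsewhere(listed, true));
    }
}
```

The point of the second method is that the watcher check is a necessary signal but not a sufficient one, which is why an extra confirmation step on the suspect host is needed before the VM is started elsewhere.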
Fix
Add HaPreFenceVmExtensionPoint, called by VmInstanceBase.handle(HaStartVmInstanceMsg) before BeforeHaStartVmInstanceExtensionPoint. KVM impl KvmHaPreFenceVmExtension:
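As a rough illustration of the call order only (this is not the KvmHaPreFenceVmExtension code, whose details are not shown here), the sketch below runs pre-fence hooks before the before-start hooks. Every name in it is a hypothetical stand-in rather than a ZStack interface or signature, and the rule that a failed pre-fence stops the HA start is an assumption drawn from the commit message's "use the agent success flag as the pre-fence verdict".

```java
/**
 * Sketch of the call order: pre-fence hooks run first and can veto the HA
 * start; before-start hooks run only after every pre-fence hook passes.
 * All names are hypothetical stand-ins, not the ZStack interfaces.
 */
final class HaStartOrderSketch {
    /** Stand-in for a pre-fence extension; returns the agent's success flag. */
    interface PreFenceHook {
        boolean preFence(String vmUuid, String suspectHostUuid, String siblingHostUuid);
    }

    /** Stand-in for an extension that runs just before the HA start. */
    interface BeforeStartHook {
        void beforeStart(String vmUuid);
    }

    /** Returns true when the HA start may proceed, false when pre-fence vetoed it. */
    static boolean runHaStart(String vmUuid, String suspectHostUuid, String siblingHostUuid,
                              PreFenceHook[] preFenceHooks, BeforeStartHook[] beforeStartHooks) {
        for (PreFenceHook hook : preFenceHooks) {
            if (!hook.preFence(vmUuid, suspectHostUuid, siblingHostUuid)) {
                // Assumed behavior: leftover qemu could not be fenced, so stop here.
                return false;
            }
        }
        for (BeforeStartHook hook : beforeStartHooks) {
            hook.beforeStart(vmUuid);
        }
        return true;
    }

    public static void main(String[] args) {
        StringBuilder trace = new StringBuilder();
        PreFenceHook[] pre = { (vm, suspect, sibling) -> { trace.append("pre-fence;"); return true; } };
        BeforeStartHook[] before = { vm -> trace.append("before-start;") };
        boolean proceed = runHaStart("vm-1", "host-a", "host-b", pre, before);
        System.out.println(trace + " proceed=" + proceed);
    }
}
```

Passing the sibling host into the hook mirrors the commit message's note that the reachable sibling is carried through the HA VM start, since the suspect host itself may not be reachable from the management node.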
Companion MRs
Refs
Resolves: ZSTAC-83890
Refs: TIC-5513
sync from gitlab !9783