Shutdown expunged resources cleanup executor properly, and allow other components to configure/start/stop on error #9723

Open
wants to merge 3 commits into
base: main

sureshanaparti
Contributor

@sureshanaparti sureshanaparti commented Sep 23, 2024

Description

This PR shuts down the expunged resources cleanup executor only when the executor object is available (that is, when the config expunged.resources.purge.enabled is true), allows other components to configure/start/stop when one component fails, and adds some logs to the component lifecycle classes.
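
The guarded shutdown described above can be illustrated with a minimal Java sketch. This is not the PR's actual diff: the class `PurgeServiceSketch`, its `purgeEnabled` flag, and the 5-second timeout are hypothetical stand-ins, and it assumes the executor is created only when the purge setting is enabled.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a service whose executor exists only when a setting is enabled.
class PurgeServiceSketch {
    private final boolean purgeEnabled;
    private ScheduledExecutorService executor; // stays null when purgeEnabled is false

    PurgeServiceSketch(boolean purgeEnabled) {
        this.purgeEnabled = purgeEnabled;
    }

    boolean start() {
        if (purgeEnabled) {
            // A real service would also schedule its cleanup task here.
            executor = Executors.newSingleThreadScheduledExecutor();
        }
        return true;
    }

    boolean stop() {
        // Guard: without this check, stop() throws NullPointerException when the setting is off.
        if (executor != null) {
            executor.shutdown();
            try {
                // Assumed timeout; force termination if tasks do not finish in time.
                if (!executor.awaitTermination(5, TimeUnit.SECONDS)) {
                    executor.shutdownNow();
                }
            } catch (InterruptedException e) {
                executor.shutdownNow();
                Thread.currentThread().interrupt();
            }
        }
        return true;
    }

    boolean isExecutorRunning() {
        return executor != null && !executor.isShutdown();
    }
}
```

With the guard in place, calling `stop()` on an instance created with `purgeEnabled` set to false returns normally instead of throwing.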

Noticed this exception with custom logs; the remaining components fail to stop after this exception.

WARN  [o.a.c.s.l.CloudStackExtendedLifeCycleStart] (SpringContextShutdownHook:null) (logid:) Error on stopping beans - null
java.lang.NullPointerException
        at org.apache.cloudstack.resource.ResourceCleanupServiceImpl.stop(ResourceCleanupServiceImpl.java:584)
        at org.apache.cloudstack.spring.lifecycle.CloudStackExtendedLifeCycle$2.with(CloudStackExtendedLifeCycle.java:105)
        at org.apache.cloudstack.spring.lifecycle.CloudStackExtendedLifeCycle.with(CloudStackExtendedLifeCycle.java:159)
        at org.apache.cloudstack.spring.lifecycle.CloudStackExtendedLifeCycle.stopBeans(CloudStackExtendedLifeCycle.java:101)
        at org.apache.cloudstack.spring.lifecycle.CloudStackExtendedLifeCycleStart.stop(CloudStackExtendedLifeCycleStart.java:32)
        at org.apache.cloudstack.spring.lifecycle.AbstractSmartLifeCycle.stop(AbstractSmartLifeCycle.java:49)
        at org.springframework.context.support.DefaultLifecycleProcessor.doStop(DefaultLifecycleProcessor.java:234)
        at org.springframework.context.support.DefaultLifecycleProcessor.access$300(DefaultLifecycleProcessor.java:54)
        at org.springframework.context.support.DefaultLifecycleProcessor$LifecycleGroup.stop(DefaultLifecycleProcessor.java:373)
        at org.springframework.context.support.DefaultLifecycleProcessor.stopBeans(DefaultLifecycleProcessor.java:206)
        at org.springframework.context.support.DefaultLifecycleProcessor.onClose(DefaultLifecycleProcessor.java:129)
        at org.springframework.context.support.AbstractApplicationContext.doClose(AbstractApplicationContext.java:1069)
        at org.springframework.context.support.AbstractApplicationContext$1.run(AbstractApplicationContext.java:993)

Fixes #9722
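
The "allow other components to stop on error" behaviour from the description can be sketched as a loop that isolates each component's failure. This is a hedged illustration under simplified assumptions, not CloudStack's lifecycle code: `StoppableSketch` and `LifecycleStopSketch` are hypothetical names, and the log message only loosely mirrors the one in the trace above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal component interface; not CloudStack's actual lifecycle API.
interface StoppableSketch {
    String name();
    void stop();
}

class LifecycleStopSketch {
    // Stops every component, catching failures so one exception cannot skip the rest.
    // Returns the names of the components that failed to stop.
    static List<String> stopAll(List<StoppableSketch> components) {
        List<String> failed = new ArrayList<>();
        for (StoppableSketch component : components) {
            try {
                component.stop();
            } catch (RuntimeException e) {
                // Log and continue with the next component instead of aborting the loop.
                System.err.println("Error on stopping bean " + component.name() + " - " + e);
                failed.add(component.name());
            }
        }
        return failed;
    }
}
```

Wrapping only the single `stop()` call in the `try` block is the design choice that matters here: a `try` around the whole loop would still abandon every component after the first failure.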

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI
  • test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

Mgmt2 service stopped =>

[Screenshot: MgmtServers_4 20 0 0-SNAPSHOT_Fixed]

How Has This Been Tested?

Manually tested management server start & stop.

2024-09-23 18:20:51,111 INFO  [o.a.c.s.l.CloudStackExtendedLifeCycle] (SpringContextShutdownHook:[]) (logid:) stopping bean ClusterServiceServletAdapter
2024-09-23 18:20:51,111 INFO  [o.a.c.s.l.CloudStackExtendedLifeCycle] (SpringContextShutdownHook:[]) (logid:) stopping bean ClusterManagerImpl

[root@ol8 ~]# cat stopping-beans-check.txt | wc -l
665

stopping-beans-check.txt

How did you try to break this feature and the system with this change?

…n config expunged.resources.purge.enabled is true), and added some logs in component lifecycle classes
@sureshanaparti
Contributor Author

@blueorangutan package

@sureshanaparti sureshanaparti added this to the 4.20.0.0 milestone Sep 23, 2024
@blueorangutan

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

codecov bot commented Sep 23, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 4.48%. Comparing base (b1f683d) to head (a6eb0b0).
Report is 38 commits behind head on main.

❗ There is a different number of reports uploaded between BASE (b1f683d) and HEAD (a6eb0b0).

HEAD has 1 upload less than BASE

| Flag | BASE (b1f683d) | HEAD (a6eb0b0) |
| --- | --- | --- |
| unittests | 1 | 0 |

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #9723       +/-   ##
============================================
- Coverage     15.77%   4.48%   -11.30%     
============================================
  Files          5621     392     -5229     
  Lines        491564   32154   -459410     
  Branches      61174    5672    -55502     
============================================
- Hits          77562    1441    -76121     
+ Misses       405545   30707   -374838     
+ Partials       8457       6     -8451     

| Flag | Coverage Δ |
| --- | --- |
| uitests | 4.48% <ø> (+0.43%) ⬆️ |
| unittests | ? |

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11181

@sureshanaparti
Contributor Author

@blueorangutan test

@blueorangutan

@sureshanaparti a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@sureshanaparti
Contributor Author

@blueorangutan package

@sureshanaparti
Contributor Author

@blueorangutan package

@blueorangutan

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11182

@DaanHoogland
Contributor

@sureshanaparti, this looks like a good cleanup. I wonder what it fixes other than just the looks of the code, though. We have two shutdown issues:

  1. prolonged time of shutdown
  2. no status update for shutdown MSes

Does this address both, @sureshanaparti? (I can see you showed some evidence for the second.)

@sureshanaparti
Contributor Author

sureshanaparti commented Sep 24, 2024

@sureshanaparti, this looks like a good cleanup. I wonder what it fixes other than just the looks of the code, though. We have two shutdown issues:

  1. prolonged time of shutdown
  2. no status update for shutdown MSes

Does this address both, @sureshanaparti? (I can see you showed some evidence for the second.)

@DaanHoogland this updates the MS status to Down when the service is stopped or shut down. (It doesn't address the prolonged shutdown time.)

Contributor

@DaanHoogland DaanHoogland left a comment

clgtm

@blueorangutan

[SF] Trillian test result (tid-11539)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 62071 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9723-t11539-kvm-ol8.zip
Smoke tests completed. 134 look OK, 2 have errors, 5 did not run
Only failed and skipped tests results shown below:

| Test | Result | Time (s) | Test File |
| --- | --- | --- | --- |
| ContextSuite context=TestISOUsage>:setup | Error | 0.00 | test_usage.py |
| test_03_secured_to_nonsecured_vm_migration | Error | 375.65 | test_vm_life_cycle.py |
| test_04_nonsecured_to_secured_vm_migration | Error | 282.02 | test_vm_life_cycle.py |
| all_test_vpc_redundant | Skipped | --- | test_vpc_redundant.py |
| all_test_vpc_router_nics | Skipped | --- | test_vpc_router_nics.py |
| all_test_vpc_vpn | Skipped | --- | test_vpc_vpn.py |
| all_test_webhook_delivery | Skipped | --- | test_webhook_delivery.py |
| all_test_webhook_lifecycle | Skipped | --- | test_webhook_lifecycle.py |

@DaanHoogland DaanHoogland added the Severity:Critical Critical bug label Sep 24, 2024
Contributor

@JoaoJandre JoaoJandre left a comment

CLGTM, did not test it

@DaanHoogland
Contributor

@blueorangutan test keepEnv

@blueorangutan

@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-11551)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 125751 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9723-t11551-kvm-ol8.zip
Smoke tests completed. 133 look OK, 5 have errors, 3 did not run
Only failed and skipped tests results shown below:

| Test | Result | Time (s) | Test File |
| --- | --- | --- | --- |
| ContextSuite context=TestClusterDRS>:setup | Error | 0.00 | test_cluster_drs.py |
| ContextSuite context=TestISOUsage>:setup | Error | 0.00 | test_usage.py |
| test_01_secure_vm_migration | Error | 135.10 | test_vm_life_cycle.py |
| test_01_secure_vm_migration | Error | 135.11 | test_vm_life_cycle.py |
| test_12_start_vm_multiple_volumes_allocated | Error | 1109.52 | test_vm_life_cycle.py |
| test_12_start_vm_multiple_volumes_allocated | Error | 1109.53 | test_vm_life_cycle.py |
| test_13_destroy_and_expunge_vm | Error | 32.81 | test_vm_life_cycle.py |
| test_14_destroy_vm_delete_protection | Error | 38.62 | test_vm_life_cycle.py |
| ContextSuite context=TestVMLifeCycle>:teardown | Error | 81.55 | test_vm_life_cycle.py |
| ContextSuite context=TestCreateVolume>:setup | Error | 0.00 | test_volumes.py |
| ContextSuite context=TestVolumeEncryption>:setup | Error | 0.00 | test_volumes.py |
| ContextSuite context=TestVolumes>:setup | Error | 0.00 | test_volumes.py |
| test_01_create_redundant_VPC_2tiers_4VMs_4IPs_4PF_ACL | Error | 41204.43 | test_vpc_redundant.py |
| test_02_redundant_VPC_default_routes | Error | 50.88 | test_vpc_redundant.py |
| test_03_create_redundant_VPC_1tier_2VMs_2IPs_2PF_ACL_reboot_routers | Error | 172.41 | test_vpc_redundant.py |
| test_04_rvpc_network_garbage_collector_nics | Error | 67.11 | test_vpc_redundant.py |
| test_05_rvpc_multi_tiers | Error | 218.98 | test_vpc_redundant.py |
| ContextSuite context=TestVPCRedundancy>:teardown | Error | 326.28 | test_vpc_redundant.py |
| all_test_vm_strict_host_tags | Skipped | --- | test_vm_strict_host_tags.py |
| all_test_vnf_templates | Skipped | --- | test_vnf_templates.py |
| all_test_vpc_ipv6 | Skipped | --- | test_vpc_ipv6.py |

@blueorangutan

[SF] Trillian test result (tid-11554)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 59928 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9723-t11554-kvm-ol8.zip
Smoke tests completed. 122 look OK, 1 have errors, 18 did not run
Only failed and skipped tests results shown below:

| Test | Result | Time (s) | Test File |
| --- | --- | --- | --- |
| ContextSuite context=TestISOUsage>:setup | Error | 0.00 | test_usage.py |
| ContextSuite context=TestNatRuleUsage>:setup | Error | 973.44 | test_usage.py |
| ContextSuite context=TestPublicIPUsage>:setup | Error | 1990.90 | test_usage.py |
| ContextSuite context=TestSnapshotUsage>:setup | Error | 2414.42 | test_usage.py |
| ContextSuite context=TestTemplateUsage>:setup | Error | 2540.51 | test_usage.py |
| ContextSuite context=TestVmUsage>:setup | Error | 2627.83 | test_usage.py |
| ContextSuite context=TestVolumeUsage>:setup | Error | 2792.50 | test_usage.py |
| ContextSuite context=TestVpnUsage>:setup | Error | 2859.77 | test_usage.py |
| all_test_vm_autoscaling | Skipped | --- | test_vm_autoscaling.py |
| all_test_vm_deployment_planner | Skipped | --- | test_vm_deployment_planner.py |
| all_test_vm_life_cycle | Skipped | --- | test_vm_life_cycle.py |
| all_test_vm_lifecycle_unmanage_import | Skipped | --- | test_vm_lifecycle_unmanage_import.py |
| all_test_vm_schedule | Skipped | --- | test_vm_schedule.py |
| all_test_vm_snapshot_kvm | Skipped | --- | test_vm_snapshot_kvm.py |
| all_test_vm_snapshots | Skipped | --- | test_vm_snapshots.py |
| all_test_vm_strict_host_tags | Skipped | --- | test_vm_strict_host_tags.py |
| all_test_vnf_templates | Skipped | --- | test_vnf_templates.py |
| all_test_volumes | Skipped | --- | test_volumes.py |
| all_test_vpc_ipv6 | Skipped | --- | test_vpc_ipv6.py |
| all_test_vpc_redundant | Skipped | --- | test_vpc_redundant.py |
| all_test_vpc_router_nics | Skipped | --- | test_vpc_router_nics.py |
| all_test_vpc_vpn | Skipped | --- | test_vpc_vpn.py |
| all_test_webhook_delivery | Skipped | --- | test_webhook_delivery.py |
| all_test_webhook_lifecycle | Skipped | --- | test_webhook_lifecycle.py |
| all_test_host_maintenance | Skipped | --- | test_host_maintenance.py |
| all_test_hostha_kvm | Skipped | --- | test_hostha_kvm.py |

@sureshanaparti
Contributor Author

@blueorangutan package

@blueorangutan

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11214

@sureshanaparti
Contributor Author

@blueorangutan test

@blueorangutan

@sureshanaparti a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@DaanHoogland
Contributor

Tested. Both the status update and the prolonged shutdown time have been fixed by this.

@blueorangutan

[SF] Trillian test result (tid-11558)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 66748 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9723-t11558-kvm-ol8.zip
Smoke tests completed. 139 look OK, 2 have errors, 0 did not run
Only failed and skipped tests results shown below:

| Test | Result | Time (s) | Test File |
| --- | --- | --- | --- |
| ContextSuite context=TestISOUsage>:setup | Error | 0.00 | test_usage.py |
| test_01_migrate_VM_and_root_volume | Error | 136.04 | test_vm_life_cycle.py |

Labels
Severity:Critical Critical bug

Successfully merging this pull request may close these issues.

Management server state not Down after management service stop [in main]
5 participants