test_all_worker_nodes_short_network_failure fails on ROSA HCP #11103

Open
DanielOsypenko opened this issue Jan 6, 2025 · 0 comments

Worker nodes do not return to Ready in time after the restart, causing the test case to fail.

[2024-12-19T07:55:06.329Z]     def test_all_worker_nodes_short_network_failure(
[2024-12-19T07:55:06.329Z]         self, nodes, setup, mcg_obj, bucket_factory, node_restart_teardown
[2024-12-19T07:55:06.329Z]     ):
[2024-12-19T07:55:06.329Z]         """
[2024-12-19T07:55:06.329Z]         OCS-1432/OCS-1433:
[2024-12-19T07:55:06.329Z]         - Start DeploymentConfig based app pods
[2024-12-19T07:55:06.329Z]         - Make all the worker nodes unresponsive by doing abrupt network failure
[2024-12-19T07:55:06.329Z]         - Reboot the unresponsive node after short duration of ~300 seconds
[2024-12-19T07:55:06.329Z]         - When unresponsive node recovers, app pods and ceph cluster should recover
[2024-12-19T07:55:06.329Z]         - Again run IOs from app pods
[2024-12-19T07:55:06.329Z]         - Create OBC and read/write objects
[2024-12-19T07:55:06.329Z]         """
[2024-12-19T07:55:06.329Z]         pod_objs = setup
[2024-12-19T07:55:06.329Z]         worker_nodes = node.get_worker_nodes()
[2024-12-19T07:55:06.329Z]     
[2024-12-19T07:55:06.329Z]         # Run IO on pods
[2024-12-19T07:55:06.329Z]         logger.info(f"Starting IO on {len(pod_objs)} app pods")
[2024-12-19T07:55:06.329Z]         with ThreadPoolExecutor() as executor:
[2024-12-19T07:55:06.329Z]             for pod_obj in pod_objs:
[2024-12-19T07:55:06.329Z]                 logger.info(f"Starting IO on pod {pod_obj.name}")
[2024-12-19T07:55:06.329Z]                 storage_type = (
[2024-12-19T07:55:06.329Z]                     "block" if pod_obj.pvc.get_pvc_vol_mode == "Block" else "fs"
[2024-12-19T07:55:06.329Z]                 )
[2024-12-19T07:55:06.329Z]                 executor.submit(
[2024-12-19T07:55:06.329Z]                     pod_obj.run_io,
[2024-12-19T07:55:06.329Z]                     storage_type=storage_type,
[2024-12-19T07:55:06.329Z]                     size="2G",
[2024-12-19T07:55:06.329Z]                     runtime=30,
[2024-12-19T07:55:06.329Z]                     fio_filename=f"{pod_obj.name}_io_f1",
[2024-12-19T07:55:06.329Z]                 )
[2024-12-19T07:55:06.329Z]     
[2024-12-19T07:55:06.329Z]         logger.info(f"IO started on all {len(pod_objs)} app pods")
[2024-12-19T07:55:06.329Z]     
[2024-12-19T07:55:06.329Z]         # Wait for IO results
[2024-12-19T07:55:06.329Z]         for pod_obj in pod_objs:
[2024-12-19T07:55:06.329Z]             pod.get_fio_rw_iops(pod_obj)
[2024-12-19T07:55:06.329Z]     
[2024-12-19T07:55:06.329Z]         # Induce network failure on all worker nodes
[2024-12-19T07:55:06.329Z]         with ThreadPoolExecutor() as executor:
[2024-12-19T07:55:06.329Z]             for node_name in worker_nodes:
[2024-12-19T07:55:06.329Z]                 executor.submit(node.node_network_failure, node_name, False)
[2024-12-19T07:55:06.329Z]     
[2024-12-19T07:55:06.329Z]         node.wait_for_nodes_status(
[2024-12-19T07:55:06.329Z]             node_names=worker_nodes, status=constants.NODE_NOT_READY
[2024-12-19T07:55:06.329Z]         )
[2024-12-19T07:55:06.329Z]     
[2024-12-19T07:55:06.329Z]         logger.info(f"Waiting for {self.short_nw_fail_time} seconds")
[2024-12-19T07:55:06.329Z]         sleep(self.short_nw_fail_time)
[2024-12-19T07:55:06.329Z]     
[2024-12-19T07:55:06.329Z]         # Reboot the worker nodes
[2024-12-19T07:55:06.329Z]         logger.info(f"Stop and start the worker nodes: {worker_nodes}")
[2024-12-19T07:55:06.329Z]         worker_node_objs = node.get_node_objs(worker_nodes)
[2024-12-19T07:55:06.329Z]         if config.ENV_DATA["platform"].lower() == constants.GCP_PLATFORM:
[2024-12-19T07:55:06.329Z]             nodes.restart_nodes_by_stop_and_start(worker_node_objs, force=False)
[2024-12-19T07:55:06.329Z]         else:
[2024-12-19T07:55:06.329Z]             nodes.restart_nodes_by_stop_and_start(worker_node_objs)
[2024-12-19T07:55:06.329Z]     
[2024-12-19T07:55:06.329Z]         try:
[2024-12-19T07:55:06.329Z] >           node.wait_for_nodes_status(
[2024-12-19T07:55:06.329Z]                 node_names=worker_nodes, status=constants.NODE_READY
[2024-12-19T07:55:06.329Z]             )
[2024-12-19T09:17:40.774Z] E               ocs_ci.ocs.exceptions.TimeoutExpiredError: Timed out after 180s running get_node_objs(['ip-10-0-0-158.us-west-2.compute.internal', 'ip-10-0-0-187.us-west-2.compute.internal', 'ip-10-0-0-195.us-west-2.compute.internal', 'ip-10-0-0-224.us-west-2.compute.internal', 'ip-10-0-0-24.us-west-2.compute.internal', 'ip-10-0-0-76.us-west-2.compute.internal'])
  1. Try increasing the Ready-wait timeout for ROSA HCP worker nodes (a hedged sketch follows below)
  2. Investigate how restart_nodes_by_stop_and_start behaves for ROSA HCP worker nodes
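
To illustrate suggestion 1, here is a minimal sketch of giving the Ready wait a longer, platform-specific timeout. It assumes `wait_for_nodes_status()` accepts a `timeout` argument (the "Timed out after 180s" above suggests a 180s default) and that a ROSA HCP platform constant exists in `ocs_ci.ocs.constants`; the timeout value and constant name are illustrative assumptions, not the actual fix.

```python
import logging

from ocs_ci.framework import config
from ocs_ci.ocs import constants, node
from ocs_ci.ocs.exceptions import TimeoutExpiredError

logger = logging.getLogger(__name__)

# Assumption: the ROSA HCP platform constant may be named differently in ocs-ci,
# so fall back to a literal if it is absent.
ROSA_HCP = getattr(constants, "ROSA_HCP_PLATFORM", "rosa_hcp")

worker_nodes = node.get_worker_nodes()

# Give hosted-control-plane workers more time to rejoin after stop/start;
# 900s is an arbitrary illustrative value, keep the 180s default elsewhere.
nodes_ready_timeout = 900 if config.ENV_DATA["platform"].lower() == ROSA_HCP else 180

try:
    # Assumption: wait_for_nodes_status() exposes a `timeout` parameter.
    node.wait_for_nodes_status(
        node_names=worker_nodes,
        status=constants.NODE_READY,
        timeout=nodes_ready_timeout,
    )
except TimeoutExpiredError:
    logger.error(
        "Worker nodes did not reach Ready within %ss", nodes_ready_timeout
    )
    raise
```

If the longer wait still times out, that would point at suggestion 2, i.e. the stop/start path itself not bringing ROSA HCP workers back.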