test_all_worker_nodes_short_network_failure fails on ROSA HCP #11103

Open
DanielOsypenko opened this issue Jan 6, 2025 · 0 comments

Worker nodes do not return to Ready in time after the restart, causing the test case to fail.

[2024-12-19T07:55:06.329Z]     def test_all_worker_nodes_short_network_failure(
[2024-12-19T07:55:06.329Z]         self, nodes, setup, mcg_obj, bucket_factory, node_restart_teardown
[2024-12-19T07:55:06.329Z]     ):
[2024-12-19T07:55:06.329Z]         """
[2024-12-19T07:55:06.329Z]         OCS-1432/OCS-1433:
[2024-12-19T07:55:06.329Z]         - Start DeploymentConfig based app pods
[2024-12-19T07:55:06.329Z]         - Make all the worker nodes unresponsive by doing abrupt network failure
[2024-12-19T07:55:06.329Z]         - Reboot the unresponsive node after short duration of ~300 seconds
[2024-12-19T07:55:06.329Z]         - When unresponsive node recovers, app pods and ceph cluster should recover
[2024-12-19T07:55:06.329Z]         - Again run IOs from app pods
[2024-12-19T07:55:06.329Z]         - Create OBC and read/write objects
[2024-12-19T07:55:06.329Z]         """
[2024-12-19T07:55:06.329Z]         pod_objs = setup
[2024-12-19T07:55:06.329Z]         worker_nodes = node.get_worker_nodes()
[2024-12-19T07:55:06.329Z]     
[2024-12-19T07:55:06.329Z]         # Run IO on pods
[2024-12-19T07:55:06.329Z]         logger.info(f"Starting IO on {len(pod_objs)} app pods")
[2024-12-19T07:55:06.329Z]         with ThreadPoolExecutor() as executor:
[2024-12-19T07:55:06.329Z]             for pod_obj in pod_objs:
[2024-12-19T07:55:06.329Z]                 logger.info(f"Starting IO on pod {pod_obj.name}")
[2024-12-19T07:55:06.329Z]                 storage_type = (
[2024-12-19T07:55:06.329Z]                     "block" if pod_obj.pvc.get_pvc_vol_mode == "Block" else "fs"
[2024-12-19T07:55:06.329Z]                 )
[2024-12-19T07:55:06.329Z]                 executor.submit(
[2024-12-19T07:55:06.329Z]                     pod_obj.run_io,
[2024-12-19T07:55:06.329Z]                     storage_type=storage_type,
[2024-12-19T07:55:06.329Z]                     size="2G",
[2024-12-19T07:55:06.329Z]                     runtime=30,
[2024-12-19T07:55:06.329Z]                     fio_filename=f"{pod_obj.name}_io_f1",
[2024-12-19T07:55:06.329Z]                 )
[2024-12-19T07:55:06.329Z]     
[2024-12-19T07:55:06.329Z]         logger.info(f"IO started on all {len(pod_objs)} app pods")
[2024-12-19T07:55:06.329Z]     
[2024-12-19T07:55:06.329Z]         # Wait for IO results
[2024-12-19T07:55:06.329Z]         for pod_obj in pod_objs:
[2024-12-19T07:55:06.329Z]             pod.get_fio_rw_iops(pod_obj)
[2024-12-19T07:55:06.329Z]     
[2024-12-19T07:55:06.329Z]         # Induce network failure on all worker nodes
[2024-12-19T07:55:06.329Z]         with ThreadPoolExecutor() as executor:
[2024-12-19T07:55:06.329Z]             for node_name in worker_nodes:
[2024-12-19T07:55:06.329Z]                 executor.submit(node.node_network_failure, node_name, False)
[2024-12-19T07:55:06.329Z]     
[2024-12-19T07:55:06.329Z]         node.wait_for_nodes_status(
[2024-12-19T07:55:06.329Z]             node_names=worker_nodes, status=constants.NODE_NOT_READY
[2024-12-19T07:55:06.329Z]         )
[2024-12-19T07:55:06.329Z]     
[2024-12-19T07:55:06.329Z]         logger.info(f"Waiting for {self.short_nw_fail_time} seconds")
[2024-12-19T07:55:06.329Z]         sleep(self.short_nw_fail_time)
[2024-12-19T07:55:06.329Z]     
[2024-12-19T07:55:06.329Z]         # Reboot the worker nodes
[2024-12-19T07:55:06.329Z]         logger.info(f"Stop and start the worker nodes: {worker_nodes}")
[2024-12-19T07:55:06.329Z]         worker_node_objs = node.get_node_objs(worker_nodes)
[2024-12-19T07:55:06.329Z]         if config.ENV_DATA["platform"].lower() == constants.GCP_PLATFORM:
[2024-12-19T07:55:06.329Z]             nodes.restart_nodes_by_stop_and_start(worker_node_objs, force=False)
[2024-12-19T07:55:06.329Z]         else:
[2024-12-19T07:55:06.329Z]             nodes.restart_nodes_by_stop_and_start(worker_node_objs)
[2024-12-19T07:55:06.329Z]     
[2024-12-19T07:55:06.329Z]         try:
[2024-12-19T07:55:06.329Z] >           node.wait_for_nodes_status(
[2024-12-19T07:55:06.329Z]                 node_names=worker_nodes, status=constants.NODE_READY
[2024-12-19T07:55:06.329Z]             )
[2024-12-19T09:17:40.774Z] E               ocs_ci.ocs.exceptions.TimeoutExpiredError: Timed out after 180s running get_node_objs(['ip-10-0-0-158.us-west-2.compute.internal', 'ip-10-0-0-187.us-west-2.compute.internal', 'ip-10-0-0-195.us-west-2.compute.internal', 'ip-10-0-0-224.us-west-2.compute.internal', 'ip-10-0-0-24.us-west-2.compute.internal', 'ip-10-0-0-76.us-west-2.compute.internal'])
  1. Try increasing the Ready-wait timeout for ROSA HCP worker nodes (a hedged sketch follows below)
  2. Investigate how restart_nodes_by_stop_and_start behaves for ROSA HCP worker nodes
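
To illustrate suggestion 1, here is a minimal sketch of giving the Ready wait a longer, platform-specific timeout. It assumes `wait_for_nodes_status()` accepts a `timeout` argument (the "Timed out after 180s" above suggests a 180s default) and that a ROSA HCP platform constant exists in `ocs_ci.ocs.constants`; the timeout value and constant name are illustrative assumptions, not the actual fix.

```python
import logging

from ocs_ci.framework import config
from ocs_ci.ocs import constants, node
from ocs_ci.ocs.exceptions import TimeoutExpiredError

logger = logging.getLogger(__name__)

# Assumption: the ROSA HCP platform constant may be named differently in ocs-ci,
# so fall back to a literal if it is absent.
ROSA_HCP = getattr(constants, "ROSA_HCP_PLATFORM", "rosa_hcp")

worker_nodes = node.get_worker_nodes()

# Give hosted-control-plane workers more time to rejoin after stop/start;
# 900s is an arbitrary illustrative value, keep the 180s default elsewhere.
nodes_ready_timeout = 900 if config.ENV_DATA["platform"].lower() == ROSA_HCP else 180

try:
    # Assumption: wait_for_nodes_status() exposes a `timeout` parameter.
    node.wait_for_nodes_status(
        node_names=worker_nodes,
        status=constants.NODE_READY,
        timeout=nodes_ready_timeout,
    )
except TimeoutExpiredError:
    logger.error(
        "Worker nodes did not reach Ready within %ss", nodes_ready_timeout
    )
    raise
```

If the longer wait still times out, that would point at suggestion 2, i.e. the stop/start path itself not bringing ROSA HCP workers back.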