
[Infra][add/remove topo] Improve vm_topology performance #16230

Merged (3 commits, Dec 27, 2024)

Conversation

lolyu
Contributor

@lolyu lolyu commented Dec 25, 2024

Description of PR

Summary:
Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Back port request

  • 202012
  • 202205
  • 202305
  • 202311
  • 202405

Approach

What is the motivation for this PR?

vm_topology builds up the testbed connections (veth links, OVS bridges, etc.) on the test server by running Linux commands, which involves a lot of waiting for I/O operations.

  • the vm_topology run-time statistics (time output) with restart-ptf:
real    18m50.615s
user    0m0.009s
sys     0m0.099s

Given this I/O-bound nature, the vm_topology run time can be greatly reduced by using a thread pool to parallelize the I/O operations.
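A minimal illustration of the idea (an assumed sketch, not the actual vm_topology code): blocking commands can be fanned out to Python's concurrent.futures.ThreadPoolExecutor so that threads overlap their I/O waits. The run_cmd helper and the sleep commands below are placeholders standing in for the real veth/OVS setup commands:

```python
from concurrent.futures import ThreadPoolExecutor
import subprocess

def run_cmd(cmd):
    # Each call blocks waiting on the external command; while one thread waits,
    # other threads can have their own commands in flight.
    return subprocess.run(cmd, capture_output=True, text=True, check=True)

# Placeholder workload: ten 1-second commands stand in for the real setup commands.
commands = [["sleep", "1"] for _ in range(10)]

# Run sequentially this takes ~10s; with a pool of 13 workers it finishes in ~1s.
with ThreadPoolExecutor(max_workers=13) as pool:
    results = list(pool.map(run_cmd, commands))
```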

Signed-off-by: Longxiang <lolv@microsoft.com>

How did you do it?

Introduce a thread pool to vm_topology so that time-consuming functions can be run in parallel.

  • restart-ptf on dualtor-120 vm_topology profile statistics (profiler screenshot omitted):

Top three functions by total run time:

function name                           total run time
add_host_ports                          1040s
bind_fp_ports                           96.3s
init                                    16.7s

  • remove-topo on dualtor-120 vm_topology profile statistics (profiler screenshot omitted):

Top three functions by total run time:

function name                           total run time
remove_host_ports                       165s
unbind_fp_ports                         40.6s
remove_injected_fp_ports_from_docker    3.3s
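These per-function totals come from profiling the script. One way to collect comparable numbers (an assumption about the method, not necessarily how the statistics above were produced) is Python's built-in cProfile:

```python
import cProfile
import pstats

def add_host_ports():
    # Placeholder for the real vm_topology function being measured.
    pass

profiler = cProfile.Profile()
profiler.enable()
add_host_ports()
profiler.disable()

# Report cumulative run time per function, largest first (top 3 entries).
pstats.Stats(profiler).sort_stats("cumulative").print_stats(3)
```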

Use a thread pool to run in parallel the following functions, which account for most of the run time in the statistics above:

  • add_host_ports
  • remove_host_ports
  • bind_fp_ports
  • unbind_fp_ports

Two new classes are introduced to support this feature (a minimal sketch of both follows the list):

  • class VMTopologyWorker: a worker class that runs work items in either single-thread mode or thread-pool mode.
  • class ThreadBufferHandler: a logging handler that buffers the logs from each task submitted to the VMTopologyWorker and flushes them when the task ends. This keeps vm_topology logs grouped by task, so logs from different tasks are not interleaved.
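A minimal sketch of how the two classes could fit together, assuming a concurrent.futures-based pool and a flush hook invoked when each task finishes; apart from the two class names, every method and parameter name here is illustrative and not the actual sonic-mgmt implementation:

```python
import logging
import threading
from concurrent.futures import ThreadPoolExecutor


class ThreadBufferHandler(logging.Handler):
    """Buffer log records per worker thread and flush them to a target handler
    when the task ends, so records from concurrent tasks are not interleaved."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self._local = threading.local()

    def _records(self):
        if not hasattr(self._local, "records"):
            self._local.records = []
        return self._local.records

    def emit(self, record):
        self._records().append(record)

    def flush_task(self):
        # Called when a task finishes: hand this thread's records to the target in one block.
        for record in self._records():
            self.target.handle(record)
        self._records().clear()


class VMTopologyWorker:
    """Run callables either inline (single-thread mode) or through a thread pool."""

    def __init__(self, use_thread_pool=False, max_workers=13, log_handler=None):
        self.log_handler = log_handler
        self.pool = ThreadPoolExecutor(max_workers=max_workers) if use_thread_pool else None

    def _run(self, func, *args, **kwargs):
        try:
            return func(*args, **kwargs)
        finally:
            if self.log_handler is not None:
                self.log_handler.flush_task()

    def submit(self, func, *args, **kwargs):
        if self.pool is None:
            return self._run(func, *args, **kwargs)  # single-thread mode: plain sequential call
        return self.pool.submit(self._run, func, *args, **kwargs)

    def shutdown(self):
        if self.pool is not None:
            self.pool.shutdown(wait=True)
```

In thread-pool mode the caller submits the per-port work items (for example, one add_host_ports work item per port) and calls shutdown() once everything is queued; in single-thread mode the same call path degenerates to plain sequential execution, so the previous behavior remains available as a fallback.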

How did you verify/test it?

This PR was tested on a dualtor-120 testbed with a thread pool of 13 worker threads.

operation      vm_topology run time without this PR    vm_topology run time with this PR
remove-topo    3m19.786s                                1m18.430s
restart-ptf    18m50.615s                               3m58.963s

  • restart-ptf with-this-PR vm_topology profile statistics (profiler screenshot omitted):

function name     total run time without this PR    total run time with this PR
add_host_ports    1040s                             169s
bind_fp_ports     96.3s                             39.3s

  • remove-topo with-this-PR vm_topology profile statistics (profiler screenshot omitted):

function name        total run time without this PR    total run time with this PR
remove_host_ports    165s                              68.8s
unbind_fp_ports      40.6s                             8.4s

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

Signed-off-by: Longxiang <lolv@microsoft.com>
@mssonicbld
Collaborator

/azp run


Azure Pipelines successfully started running 1 pipeline(s).

@lolyu lolyu marked this pull request as ready for review December 26, 2024 05:18
@mssonicbld
Collaborator

/azp run


Azure Pipelines successfully started running 1 pipeline(s).

@lolyu lolyu force-pushed the improve_vm_topology branch from 12c0a77 to fa6b408 Compare December 26, 2024 05:23
@mssonicbld
Collaborator

/azp run


Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Longxiang <lolv@microsoft.com>
@lolyu lolyu force-pushed the improve_vm_topology branch from fa6b408 to 14b8df8 Compare December 26, 2024 05:24
@mssonicbld
Collaborator

/azp run


Azure Pipelines successfully started running 1 pipeline(s).

@lolyu lolyu changed the title from "Improve vm_topology performance" to "[Infra][add/remove topo] Improve vm_topology performance" on Dec 26, 2024
Signed-off-by: Longxiang <lolv@microsoft.com>
@mssonicbld
Collaborator

/azp run


Azure Pipelines successfully started running 1 pipeline(s).

@lolyu lolyu requested review from wangxin and yxieca December 26, 2024 06:38
@wangxin wangxin merged commit 2a67d11 into sonic-net:master Dec 27, 2024
17 checks passed
@mssonicbld
Collaborator

@lolyu PR conflicts with 202405 branch

mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Dec 27, 2024
…#16230)
@mssonicbld
Collaborator

Cherry-pick PR to 202411: #16245

@davidm-arista
Contributor

This is mixing threading with the underlying ansible multi-processing, and is a recipe for deadlocks. Please note from https://docs.ansible.com/ansible/latest/dev_guide/developing_api.html that Ansible is not thread safe.

mssonicbld pushed a commit that referenced this pull request Dec 28, 2024
@lolyu
Contributor Author

lolyu commented Jan 2, 2025

This is mixing threading with the underlying ansible multi-processing, and is a recipe for deadlocks. Please note from https://docs.ansible.com/ansible/latest/dev_guide/developing_api.html that Ansible is not thread safe.

Hi @davidm-arista, this doc (https://docs.ansible.com/ansible/latest/dev_guide/developing_api.html) is about the Ansible Python API. AFAIK, sonic-mgmt doesn't have any use case where this Python API is used to call vm_topology. Do you have a use case that hits the deadlock situation?

lolyu added a commit to lolyu/sonic-mgmt that referenced this pull request Jan 2, 2025
…#16230)
lolyu added a commit to lolyu/sonic-mgmt that referenced this pull request Jan 2, 2025
…#16230)
wangxin pushed a commit that referenced this pull request Jan 6, 2025
…16288)