[Infra][add/remove topo] Improve vm_topology
performance
#16230
Conversation
Signed-off-by: Longxiang <lolv@microsoft.com>
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Force-pushed from 12c0a77 to fa6b408
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Signed-off-by: Longxiang <lolv@microsoft.com>
Force-pushed from fa6b408 to 14b8df8
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Signed-off-by: Longxiang <lolv@microsoft.com>
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
@lolyu PR conflicts with 202405 branch
…#16230) Signed-off-by: Longxiang <lolv@microsoft.com>
Cherry-pick PR to 202411: #16245
This is mixing threading with the underlying ansible multi-processing, and is a recipe for deadlocks. Please note from https://docs.ansible.com/ansible/latest/dev_guide/developing_api.html that Ansible is not thread safe.
Hi @davidm-arista, this doc: https://docs.ansible.com/ansible/latest/dev_guide/developing_api.html is about the Ansible Python API. AFAIK, sonic-mgmt doesn't have any use case of this Python API to call
…#16230) Signed-off-by: Longxiang <lolv@microsoft.com>
…#16230) Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
…16288) Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
Description of PR
Summary:
Fixes # (issue)
Type of change
Back port request
Approach
What is the motivation for this PR?
`vm_topology` builds up the testbed connections (veth links, ovs bridges, etc.) on the test server by running Linux commands, which involves a lot of waiting for I/O operations. The `vm_topology` script running statistics with `restart-ptf`:

real    18m50.615s
user    0m0.009s
sys     0m0.099s

Given the I/O-bound nature of this work, the `vm_topology` runtime can be greatly decreased by using a thread pool to parallelize the I/O operations.

Signed-off-by: Longxiang lolv@microsoft.com
How did you do it?
Introduce a thread pool to `vm_topology` to run time-consuming functions in parallel.

* `restart-ptf` on `dualtor-120` `vm_topology` profile statistics, top three function calls by total run time:

| function name | total run time |
| --- | --- |
| `add_host_ports` | 1040s |
| `bind_fp_ports` | 96.3s |
| `init` | 16.7s |

* `remove-topo` on `dualtor-120` `vm_topology` profile statistics, top three function calls by total run time:

| function name | total run time |
| --- | --- |
| `remove_host_ports` | 165s |
| `unbind_fp_ports` | 40.6s |
| `remove_injected_fp_ports_from_docker` | 3.3s |
Let's use the thread pool to run in parallel the following functions, which account for most of the time in the above statistics (a call-site sketch follows this list):

* `add_host_ports`
* `remove_host_ports`
* `bind_fp_ports`
* `unbind_fp_ports`
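The per-port work inside these functions is what gets fanned out. Below is a minimal sketch of the call-site pattern, assuming a plain `concurrent.futures` thread pool; `run_per_port` and the example command are illustrative, not the exact code in this PR:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import subprocess


def run_per_port(per_port_fn, ports, max_workers=13):
    """Fan I/O-bound per-port work out to a thread pool; re-raise the first failure."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(per_port_fn, port): port for port in ports}
        for future in as_completed(futures):
            future.result()  # propagate exceptions raised in worker threads


# Example: each per-port call mostly waits on an external command, so threads overlap the I/O.
run_per_port(lambda port: subprocess.run(["ip", "link", "show", port], check=False),
             ["Ethernet0", "Ethernet4", "Ethernet8"])
```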
Two new classes are introduced to support this feature (a minimal sketch of both is shown below):

* class `VMTopologyWorker`: a worker class that runs work either in single-thread mode or in thread pool mode.
* class `ThreadBufferHandler`: a logging handler that buffers the logs from each task submitted to the `VMTopologyWorker` and flushes them when the task ends. This ensures `vm_topology` logs are grouped by task, so logs from different tasks are not mixed together.
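A hedged sketch of how the two classes could fit together; the method names, constructor arguments, and logging wiring below are illustrative assumptions, not the exact implementation in this PR:

```python
import logging
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


class ThreadBufferHandler(logging.Handler):
    """Buffer log records per worker thread and flush them as one group when a task ends."""

    def __init__(self, target):
        super().__init__()
        self.target = target              # handler that ultimately emits the records
        self.buffers = defaultdict(list)  # thread ident -> buffered records

    def emit(self, record):
        self.buffers[threading.get_ident()].append(record)

    def flush_current_task(self):
        for record in self.buffers.pop(threading.get_ident(), []):
            self.target.handle(record)


class VMTopologyWorker:
    """Run submitted functions inline (single-thread mode) or via a thread pool."""

    def __init__(self, use_thread_pool=False, max_workers=13):
        self.use_thread_pool = use_thread_pool
        self.executor = ThreadPoolExecutor(max_workers=max_workers) if use_thread_pool else None
        self.log_handler = ThreadBufferHandler(logging.StreamHandler())
        logging.getLogger().addHandler(self.log_handler)

    def _run(self, func, *args, **kwargs):
        try:
            return func(*args, **kwargs)
        finally:
            # flush this task's buffered logs as one contiguous block
            self.log_handler.flush_current_task()

    def submit(self, func, *args, **kwargs):
        if self.use_thread_pool:
            return self.executor.submit(self._run, func, *args, **kwargs)
        return self._run(func, *args, **kwargs)

    def shutdown(self):
        if self.executor is not None:
            self.executor.shutdown(wait=True)
```

In thread pool mode, each task's records are emitted only when the task finishes, which is what keeps the per-task log groups contiguous in the `vm_topology` output.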
How did you verify/test it?

Let's test this PR on a `dualtor-120` testbed, with the thread pool configured with 13 thread workers.

| operation | `vm_topology` run time without this PR | `vm_topology` run time with this PR |
| --- | --- | --- |
| remove-topo | 3m19.786s | 1m18.430s |
| restart-ptf | 18m50.615s | 3m58.963s |

* `restart-ptf` with-this-PR `vm_topology` profile statistics:

| function name | total run time without this PR | total run time with this PR |
| --- | --- | --- |
| `add_host_ports` | 1040s | 169s |
| `bind_fp_ports` | 96.3s | 39.3s |

* `remove-topo` with-this-PR `vm_topology` profile statistics:

| function name | total run time without this PR | total run time with this PR |
| --- | --- | --- |
| `remove_host_ports` | 165s | 68.8s |
| `unbind_fp_ports` | 40.6s | 8.4s |
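For reference, per-function total run times like the ones above can be collected with Python's built-in profiler; this is a generic sketch, not the exact measurement setup used for `vm_topology`:

```python
import cProfile
import pstats
import time


def slow_io_task():
    time.sleep(0.1)  # stand-in for an I/O-bound Linux command invocation


profiler = cProfile.Profile()
profiler.enable()
for _ in range(3):
    slow_io_task()
profiler.disable()

# Print the functions with the largest cumulative (total) run time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
```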
Any platform specific information?
Supported testbed topology if it's a new test case?
Documentation