
[Infra][add/remove topo] Improve vm_topology performance #16230

Merged (3 commits, Dec 27, 2024)

Conversation

lolyu
Contributor

@lolyu lolyu commented Dec 25, 2024

Description of PR

Summary:
Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Back port request

  • 202012
  • 202205
  • 202305
  • 202311
  • 202405

Approach

What is the motivation for this PR?

vm_topology builds up the testbed connections (veth links, OVS bridges, etc.) on the test server by running Linux commands, which involves a lot of waiting for I/O operations.

  • the vm_topology run-time statistics (time output) with restart-ptf:
real    18m50.615s
user    0m0.009s
sys     0m0.099s

Given this I/O-bound nature, the vm_topology run time can be greatly reduced by using a thread pool to parallelize the I/O operations.
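A minimal illustration of the idea (an assumed sketch, not the actual vm_topology code): blocking commands can be fanned out to Python's concurrent.futures.ThreadPoolExecutor so that threads overlap their I/O waits. The run_cmd helper and the sleep commands below are placeholders standing in for the real veth/OVS setup commands:

```python
from concurrent.futures import ThreadPoolExecutor
import subprocess

def run_cmd(cmd):
    # Each call blocks waiting on the external command; while one thread waits,
    # other threads can have their own commands in flight.
    return subprocess.run(cmd, capture_output=True, text=True, check=True)

# Placeholder workload: ten 1-second commands stand in for the real setup commands.
commands = [["sleep", "1"] for _ in range(10)]

# Run sequentially this takes ~10s; with a pool of 13 workers it finishes in ~1s.
with ThreadPoolExecutor(max_workers=13) as pool:
    results = list(pool.map(run_cmd, commands))
```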

Signed-off-by: Longxiang <lolv@microsoft.com>

How did you do it?

Introduce a thread pool to vm_topology so that time-consuming functions can be run in parallel.

  • restart-ptf on dualtor-120 vm_topology profile statistics (profiler screenshot omitted):

Top three functions by total run time:

function name                           total run time
add_host_ports                          1040s
bind_fp_ports                           96.3s
init                                    16.7s

  • remove-topo on dualtor-120 vm_topology profile statistics (profiler screenshot omitted):

Top three functions by total run time:

function name                           total run time
remove_host_ports                       165s
unbind_fp_ports                         40.6s
remove_injected_fp_ports_from_docker    3.3s
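These per-function totals come from profiling the script. One way to collect comparable numbers (an assumption about the method, not necessarily how the statistics above were produced) is Python's built-in cProfile:

```python
import cProfile
import pstats

def add_host_ports():
    # Placeholder for the real vm_topology function being measured.
    pass

profiler = cProfile.Profile()
profiler.enable()
add_host_ports()
profiler.disable()

# Report cumulative run time per function, largest first (top 3 entries).
pstats.Stats(profiler).sort_stats("cumulative").print_stats(3)
```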

Use a thread pool to run in parallel the following functions, which account for most of the run time in the statistics above:

  • add_host_ports
  • remove_host_ports
  • bind_fp_ports
  • unbind_fp_ports

Two new classes are introduced to support this feature (a minimal sketch of both follows the list):

  • class VMTopologyWorker: a worker class that runs work items in either single-thread mode or thread-pool mode.
  • class ThreadBufferHandler: a logging handler that buffers the logs from each task submitted to the VMTopologyWorker and flushes them when the task ends. This keeps vm_topology logs grouped by task, so logs from different tasks are not interleaved.
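A minimal sketch of how the two classes could fit together, assuming a concurrent.futures-based pool and a flush hook invoked when each task finishes; apart from the two class names, every method and parameter name here is illustrative and not the actual sonic-mgmt implementation:

```python
import logging
import threading
from concurrent.futures import ThreadPoolExecutor


class ThreadBufferHandler(logging.Handler):
    """Buffer log records per worker thread and flush them to a target handler
    when the task ends, so records from concurrent tasks are not interleaved."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self._local = threading.local()

    def _records(self):
        if not hasattr(self._local, "records"):
            self._local.records = []
        return self._local.records

    def emit(self, record):
        self._records().append(record)

    def flush_task(self):
        # Called when a task finishes: hand this thread's records to the target in one block.
        for record in self._records():
            self.target.handle(record)
        self._records().clear()


class VMTopologyWorker:
    """Run callables either inline (single-thread mode) or through a thread pool."""

    def __init__(self, use_thread_pool=False, max_workers=13, log_handler=None):
        self.log_handler = log_handler
        self.pool = ThreadPoolExecutor(max_workers=max_workers) if use_thread_pool else None

    def _run(self, func, *args, **kwargs):
        try:
            return func(*args, **kwargs)
        finally:
            if self.log_handler is not None:
                self.log_handler.flush_task()

    def submit(self, func, *args, **kwargs):
        if self.pool is None:
            return self._run(func, *args, **kwargs)  # single-thread mode: plain sequential call
        return self.pool.submit(self._run, func, *args, **kwargs)

    def shutdown(self):
        if self.pool is not None:
            self.pool.shutdown(wait=True)
```

In thread-pool mode the caller submits the per-port work items (for example, one add_host_ports work item per port) and calls shutdown() once everything is queued; in single-thread mode the same call path degenerates to plain sequential execution, so the previous behavior remains available as a fallback.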

How did you verify/test it?

This PR was tested on a dualtor-120 testbed with a thread pool of 13 worker threads.

operation      vm_topology run time without this PR    vm_topology run time with this PR
remove-topo    3m19.786s                                1m18.430s
restart-ptf    18m50.615s                               3m58.963s

  • restart-ptf with-this-PR vm_topology profile statistics (profiler screenshot omitted):

function name     total run time without this PR    total run time with this PR
add_host_ports    1040s                             169s
bind_fp_ports     96.3s                             39.3s

  • remove-topo with-this-PR vm_topology profile statistics (profiler screenshot omitted):

function name        total run time without this PR    total run time with this PR
remove_host_ports    165s                              68.8s
unbind_fp_ports      40.6s                             8.4s

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

Signed-off-by: Longxiang <lolv@microsoft.com>
@mssonicbld
Collaborator

/azp run


Azure Pipelines successfully started running 1 pipeline(s).

@lolyu lolyu marked this pull request as ready for review December 26, 2024 05:18
@mssonicbld
Collaborator

/azp run


Azure Pipelines successfully started running 1 pipeline(s).

@lolyu lolyu force-pushed the improve_vm_topology branch from 12c0a77 to fa6b408 Compare December 26, 2024 05:23
@mssonicbld
Collaborator

/azp run


Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Longxiang <lolv@microsoft.com>
@lolyu lolyu force-pushed the improve_vm_topology branch from fa6b408 to 14b8df8 Compare December 26, 2024 05:24
@mssonicbld
Collaborator

/azp run


Azure Pipelines successfully started running 1 pipeline(s).

@lolyu lolyu changed the title from "Improve vm_topology performance" to "[Infra][add/remove topo] Improve vm_topology performance" on Dec 26, 2024
Signed-off-by: Longxiang <lolv@microsoft.com>
@mssonicbld
Collaborator

/azp run


Azure Pipelines successfully started running 1 pipeline(s).

@lolyu lolyu requested review from wangxin and yxieca December 26, 2024 06:38
@wangxin wangxin merged commit 2a67d11 into sonic-net:master Dec 27, 2024
17 checks passed
@mssonicbld
Collaborator

@lolyu PR conflicts with 202405 branch

mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Dec 27, 2024
…#16230)
@mssonicbld
Collaborator

Cherry-pick PR to 202411: #16245

@davidm-arista
Contributor

This is mixing threading with the underlying ansible multi-processing, and is a recipe for deadlocks. Please note from https://docs.ansible.com/ansible/latest/dev_guide/developing_api.html that Ansible is not thread safe.

mssonicbld pushed a commit that referenced this pull request Dec 28, 2024
@lolyu
Contributor Author

lolyu commented Jan 2, 2025

This is mixing threading with the underlying ansible multi-processing, and is a recipe for deadlocks. Please note from https://docs.ansible.com/ansible/latest/dev_guide/developing_api.html that Ansible is not thread safe.

Hi @davidm-arista, this doc (https://docs.ansible.com/ansible/latest/dev_guide/developing_api.html) is about the Ansible Python API. AFAIK, sonic-mgmt doesn't have any use case where this Python API is used to call vm_topology. Do you have a use case that hits the deadlock situation?

lolyu added a commit to lolyu/sonic-mgmt that referenced this pull request Jan 2, 2025
…#16230)
lolyu added a commit to lolyu/sonic-mgmt that referenced this pull request Jan 2, 2025
…#16230)
wangxin pushed a commit that referenced this pull request Jan 6, 2025
…16288)