Fix for redis memory check failure after link flap and also sometimes cpu usage high failure #15732

Merged
merged 5 commits into from
Dec 9, 2024

Conversation

wumiaont
Contributor

@wumiaont wumiaont commented Nov 25, 2024

Description of PR

The Redis memory reading obtained with "redis-cli info memory | grep used_memory_human" is not a stable value. Even on a stable system (BGP converged, no port flapping, etc.), consecutive readings can differ by more than 0.2M.

The following CLI output is from 202405 and 202205.

202405:
admin@ixre-egl-board30: redis-cli info memory | grep used_memory_human | sed -e 's/.*:\(.*\)M/\1/'
2.64
admin@ixre-egl-board30: redis-cli info memory | grep used_memory_human | sed -e 's/.*:\(.*\)M/\1/'
2.74
admin@ixre-egl-board30: redis-cli info memory | grep used_memory_human | sed -e 's/.*:\(.*\)M/\1/'
2.52

202205:
admin@ixre-egl-board64: redis-cli info memory | grep used_memory_human | sed -e 's/.*:\(.*\)M/\1/'
6.02
admin@ixre-egl-board64: redis-cli info memory | grep used_memory_human | sed -e 's/.*:\(.*\)M/\1/'
6.26
admin@ixre-egl-board64: redis-cli info memory | grep used_memory_human | sed -e 's/.*:\(.*\)M/\1/'
6.14
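
For readers who would rather do the extraction in Python than with sed, here is a minimal sketch. The function name parse_used_memory_mib is hypothetical and is not taken from this PR. It assumes the usual "used_memory_human:<number><unit>" line in the INFO output and treats the K/M/G suffixes as powers of 1024.

```python
import re

# Multipliers that convert each unit suffix to MiB (assumed powers of 1024).
_UNIT_TO_MIB = {"B": 1 / (1024 * 1024), "K": 1 / 1024, "M": 1.0, "G": 1024.0}


def parse_used_memory_mib(info_output):
    """Return the used_memory_human value from `redis-cli info memory`, in MiB.

    Hypothetical helper for illustration; not part of the PR.
    """
    match = re.search(
        r"^used_memory_human:([0-9.]+)([BKMG])\s*$", info_output, re.MULTILINE
    )
    if match is None:
        raise ValueError("used_memory_human not found in output")
    value, unit = match.groups()
    return float(value) * _UNIT_TO_MIB[unit]
```

Unlike the sed expression, this version does not silently return the wrong number if the value happens to be reported in K or G instead of M.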

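We can see that 202405 has some memory optimization for Redis and does not use as much memory as 202205. Against the smaller 202405 baseline of roughly 2.6M, a 0.2M fluctuation is about 8% and exceeds the 5% memory usage threshold on its own, whereas against the roughly 6.1M baseline in 202205 the same fluctuation is only about 3%.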
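
The arithmetic behind this can be checked directly from the six readings quoted above. The snippet below is only an illustration; relative_spread is a made-up helper name, and "spread" here means max minus min divided by the mean of the three samples.

```python
def relative_spread(samples):
    """Spread (max - min) of the samples as a fraction of their mean."""
    mean = sum(samples) / len(samples)
    return (max(samples) - min(samples)) / mean


# Readings copied from the CLI output in the PR description.
readings = {
    "202405": [2.64, 2.74, 2.52],
    "202205": [6.02, 6.26, 6.14],
}

for release, samples in readings.items():
    print(f"{release}: spread is {relative_spread(samples):.1%} of the mean")
```

On these numbers the idle fluctuation alone is above 5% for 202405 and below 5% for 202205, which matches the observation that only the newer release trips the old threshold.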

The solution is to compare the average Redis memory usage before and after the link flap: query the memory usage 5 times at 5-second intervals and take the average. The threshold is also raised from 5% to 10%. With this fix, the Redis memory check no longer fails on 202405 after link flap.
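
A rough sketch of the averaging idea follows. It is not the code from this PR: average_memory, memory_within_threshold and the read_memory callback are hypothetical names, and the sketch assumes the check only cares about growth relative to the "before" average.

```python
import time


def average_memory(read_memory, samples=5, interval=5, sleep=time.sleep):
    """Average `samples` readings taken `interval` seconds apart.

    `read_memory` is any zero-argument callable returning a number
    (hypothetical; in a real test it would query Redis on the DUT).
    """
    readings = []
    for i in range(samples):
        if i > 0:
            sleep(interval)  # wait between readings, not before the first one
        readings.append(read_memory())
    return sum(readings) / len(readings)


def memory_within_threshold(before, after, threshold=0.10):
    """True if memory grew by no more than `threshold` relative to `before`."""
    return (after - before) / before <= threshold
```

A test would call average_memory once before flapping the links and once afterwards, then assert memory_within_threshold(before, after). Passing sleep as a parameter keeps the helper testable without real delays.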

This commit also fixes an intermittent failure of the orchagent CPU utilization check after link flap. The cause is that in a scaled setup (34k routes), orchagent takes more time to settle.
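
The description does not say how the CPU check was changed. One plausible shape, shown here purely as an assumption, is to poll until the CPU reading drops below the limit instead of checking once. The helper name, the read_cpu_percent callback and all parameter values are hypothetical.

```python
import time


def wait_for_cpu_to_settle(read_cpu_percent, threshold, timeout, interval=5,
                           sleep=time.sleep, clock=time.monotonic):
    """Poll until read_cpu_percent() <= threshold or `timeout` seconds pass.

    Returns True if the reading settled in time, False otherwise.
    Hypothetical helper; not taken from the PR.
    """
    deadline = clock() + timeout
    while True:
        if read_cpu_percent() <= threshold:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

With a larger timeout, a scaled setup gets more time to settle, while a small setup still passes as soon as its first reading is under the threshold.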

Summary:
Fixes #15733

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Back port request

  • 202012
  • 202205
  • 202305
  • 202311
  • 202405

Approach

What is the motivation for this PR?

Fix test failures

How did you verify/test it?

Ran the OC tests with the fix. The test failures were not seen.

@wumiaont wumiaont requested a review from prgeor as a code owner November 25, 2024 15:15
@mssonicbld
Collaborator

The pre-commit check detected issues in the files touched by this pull request.
The pre-commit check is a mandatory check, please fix detected issues.

Detailed pre-commit check results:
trim trailing whitespace.................................................Passed
fix end of files.........................................................Passed
check yaml...........................................(no files to check)Skipped
check for added large files..............................................Passed
check python ast.........................................................Passed
flake8...................................................................Failed
- hook id: flake8
- exit code: 1

tests/platform_tests/link_flap/link_flap_utils.py:134:1: E302 expected 2 blank lines, found 1
tests/platform_tests/link_flap/test_link_flap.py:7:86: E231 missing whitespace after ','

flake8...............................................(no files to check)Skipped
check conditional mark sort..........................(no files to check)Skipped

To run the pre-commit checks locally, you can follow below steps:

  1. Ensure that default python is python3. In sonic-mgmt docker container, default python is python2. You can run
    the check by activating the python3 virtual environment in sonic-mgmt docker container or outside of sonic-mgmt
    docker container.
  2. Ensure that the pre-commit package is installed:
sudo pip install pre-commit
  3. Go to repository root folder
  4. Install the pre-commit hooks:
pre-commit install
  5. Use pre-commit to check staged file:
pre-commit
  6. Alternatively, you can check committed files using:
pre-commit run --from-ref <commit_id> --to-ref <commit_id>

Contributor

@arista-nwolfe arista-nwolfe left a comment

Change looks good to me, thanks!

tests/platform_tests/link_flap/test_cont_link_flap.py Outdated
@mssonicbld
Collaborator

The pre-commit check detected issues in the files touched by this pull request.
The pre-commit check is a mandatory check, please fix detected issues.

Detailed pre-commit check results:
trim trailing whitespace.................................................Passed
fix end of files.........................................................Failed
- hook id: end-of-file-fixer
- exit code: 1
- files were modified by this hook

Fixing tests/platform_tests/link_flap/link_flap_utils.py
Fixing tests/platform_tests/link_flap/test_cont_link_flap.py

check yaml...........................................(no files to check)Skipped
check for added large files..............................................Passed
check python ast.........................................................Passed
flake8...................................................................Failed
- hook id: flake8
- exit code: 1

tests/platform_tests/link_flap/test_link_flap.py:79:23: E127 continuation line over-indented for visual indent

flake8...............................................(no files to check)Skipped
check conditional mark sort..........................(no files to check)Skipped


@mssonicbld
Collaborator

The pre-commit check detected issues in the files touched by this pull request.
The pre-commit check is a mandatory check, please fix detected issues.

Detailed pre-commit check results:
trim trailing whitespace.................................................Passed
fix end of files.........................................................Failed
- hook id: end-of-file-fixer
- exit code: 1
- files were modified by this hook

Fixing tests/platform_tests/link_flap/link_flap_utils.py
Fixing tests/platform_tests/link_flap/test_cont_link_flap.py

check yaml...........................................(no files to check)Skipped
check for added large files..............................................Passed
check python ast.........................................................Passed
flake8...................................................................Passed
flake8...............................................(no files to check)Skipped
check conditional mark sort..........................(no files to check)Skipped


@yejianquan yejianquan merged commit 700c02b into sonic-net:master Dec 9, 2024
16 checks passed
@mssonicbld
Collaborator

@wumiaont PR conflicts with 202405 branch

yejianquan pushed a commit to yejianquan/sonic-mgmt that referenced this pull request Dec 9, 2024
… cpu usage high failure (sonic-net#15732)

co-authorized by: jianquanye@microsoft.com
yejianquan pushed a commit to yejianquan/sonic-mgmt that referenced this pull request Dec 9, 2024
… cpu usage high failure (sonic-net#15732)

yejianquan added a commit that referenced this pull request Dec 10, 2024
#15955)

* Fix for redis memory check failure after link flap and also sometimes cpu usage high failure (#15732)

* Fix syntax issue

---------

Co-authored-by: wumiao_nokia <wu.miao@nokia.com>