Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update TRANSCIEVER_INFO table for ports that doesn't exist in Config DB #556

Closed

Conversation

noaOrMlnx
Copy link
Collaborator

@noaOrMlnx noaOrMlnx commented Nov 5, 2024

Description

This change

# Skip deleting intf_tbl for avoiding OA to trigger Tx disable signal
, caused TRANSCEIVER_INFO table to remain after config reload.

When TRANSCEIVER_INFO table is not deleted from State DB, we have old ports there.
if we do split to 8 in SPC4, we must deleted the adjacent port. so we have all old ports in TRANSCIVER_INFO but not in Config DB.

To not cause any crash when trying to search for non-exist tables and ports, we should delete all ports which are not in Config DB from TRANSCEIVER_INFO table.

example to a possible crash -
image

Motivation and Context

Motivation is to be able to use cmis host mgmt feature in case we split to 8 and delete some ports from Config DB.

How Has This Been Tested?

with a split to 8 port

Additional Information (Optional)

@@ -2178,6 +2178,15 @@ def init(self):

self.log_notice("XCVRD INIT: After port config is done")
port_mapping_data = port_event_helper.get_port_mapping(self.namespaces)
# remove ports from TRANSCEIVER_INFO table, if they don't exist in CONFIG DB
logical_ports_list = port_mapping_data.logical_port_list
asic_id = multi_asic.get_asic_index_from_namespace(self.namespaces)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is self.namesapces a list? Maybe we need a loop here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Fixed

@noaOrMlnx noaOrMlnx force-pushed the port_not_exist_xcvrd_fix branch from aa0135a to 7f9e669 Compare November 7, 2024 09:32
@noaOrMlnx noaOrMlnx changed the title Fix issue when port is not in config DB, but TRANSCIEVER_INFO table is changed Update TRANSCIEVER_INFO table for ports that doesn't exist in Config DB Nov 7, 2024
@noaOrMlnx noaOrMlnx marked this pull request as ready for review November 7, 2024 12:46
@prgeor
Copy link
Collaborator

prgeor commented Nov 7, 2024

@noaOrMlnx How did we land up in this situation where TRANSCEIVER_INFO table exists for the ports NOT in CONFIG_DB ?

@prgeor
Copy link
Collaborator

prgeor commented Nov 7, 2024

@noaOrMlnx What crash are we talking about? backtrace please?

Copy link
Contributor

@mihirpat1 mihirpat1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@noaOrMlnx Can you please explain why on_remove_logical_port is not taking care of deleting the TRANSCEIEVR_INFO table for the relevant ports?

sonic-xcvrd/xcvrd/xcvrd.py Outdated Show resolved Hide resolved
sonic-xcvrd/xcvrd/xcvrd.py Outdated Show resolved Hide resolved
@noaOrMlnx
Copy link
Collaborator Author

@noaOrMlnx How did we land up in this situation where TRANSCEIVER_INFO table exists for the ports NOT in CONFIG_DB ?

@prgeor please check following code -

# Skip deleting intf_tbl for avoiding OA to trigger Tx disable signal

MSFT didn't want 'config reload' to remove TRANSCEIVER_INFO table, as it retriggers host tx ready setting.

@noaOrMlnx
Copy link
Collaborator Author

@noaOrMlnx Can you please explain why on_remove_logical_port is not taking care of deleting the TRANSCEIEVR_INFO table for the relevant ports?

@mihirpat1
on_remove_logical_port function is triggered when a port is removed on run time.
If user removes port from config DB and do 'config reload' it will not be removed from TRANSCEIVER_INFO table.

@liat-grozovik
Copy link
Collaborator

@mihirpat1 please let me know of any further questions otherwise please approve so i can merge.

@mihirpat1
Copy link
Contributor

@mihirpat1 please let me know of any further questions otherwise please approve so i can merge.

@noaOrMlnx Can you please help in addressing #556 (comment)

Also, can you please add the traceback of crash to the description of the PR?

@noaOrMlnx
Copy link
Collaborator Author

@mihirpat1 @prgeor all comments were addressed

@mihirpat1
Copy link
Contributor

@noaOrMlnx Please help in fixing the build failure

@noaOrMlnx
Copy link
Collaborator Author

@noaOrMlnx Please help in fixing the build failure

@mihirpat1 Done

@prgeor
Copy link
Collaborator

prgeor commented Nov 13, 2024

@noaOrMlnx Can you please explain why on_remove_logical_port is not taking care of deleting the TRANSCEIEVR_INFO table for the relevant ports?

@mihirpat1 on_remove_logical_port function is triggered when a port is removed on run time. If user removes port from config DB and do 'config reload' it will not be removed from TRANSCEIVER_INFO table.

@noaOrMlnx what do you mean by "If user removes port from config DB" ? What is this usecase?

@prgeor
Copy link
Collaborator

prgeor commented Nov 13, 2024

@noaOrMlnx please share the steps to reproduce this issue

Copy link
Collaborator

@prgeor prgeor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@noaOrMlnx as discussed, the reason to NOT delete TRANSCIEVER_INFO table during Xcvrd deinit was to prevent link flap on Mellanox platform because on mellanox platform as per the design TRANSCIEVER_INFO table update is considered as optics insertion/deletion and OA uses this event to reinitialize the port. To prevent link flap during Xcvrd restart/crash on Mellanox platform we had to take this approach.

To fix the present issue which is to reconfigure the complete switch with new configuration, please delete all of the following tables during config reload command so that Xcvrd populates them as per new configuration push:-

1. TRANSCEIVER_INFO_TABLE = 'TRANSCEIVER_INFO'    
2. TRANSCEIVER_FIRMWARE_INFO_TABLE = 'TRANSCEIVER_FIRMWARE_INFO'    
3. TRANSCEIVER_DOM_SENSOR_TABLE = 'TRANSCEIVER_DOM_SENSOR'    
4. TRANSCEIVER_DOM_THRESHOLD_TABLE = 'TRANSCEIVER_DOM_THRESHOLD'    
5. TRANSCEIVER_STATUS_TABLE = 'TRANSCEIVER_STATUS'    
6. TRANSCEIVER_PM_TABLE = 'TRANSCEIVER_PM'

@prgeor
Copy link
Collaborator

prgeor commented Dec 1, 2024

@noaOrMlnx closing as not needed.

@prgeor prgeor closed this Dec 1, 2024
@Junchao-Mellanox Junchao-Mellanox deleted the port_not_exist_xcvrd_fix branch December 2, 2024 02:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants