From 2ab6d1b7188547be0adc1d7da06445e759b6679a Mon Sep 17 00:00:00 2001 From: Phil Jones Date: Tue, 15 Oct 2024 07:44:11 -0500 Subject: [PATCH] Adds design document for an Omega error handler --- components/omega/doc/design/Error.md | 308 +++++++++++++++++++++++++++ components/omega/doc/index.md | 1 + 2 files changed, 309 insertions(+) create mode 100644 components/omega/doc/design/Error.md diff --git a/components/omega/doc/design/Error.md b/components/omega/doc/design/Error.md new file mode 100644 index 000000000000..54c4ec5fe729 --- /dev/null +++ b/components/omega/doc/design/Error.md @@ -0,0 +1,308 @@ +(omega-design-error)= +# ErrorHandler (Error) + +## 1 Overview + +Any simulation code requires a means to check for errors, +report errors and abort if the error is critical. Omega +includes a logging facility that can perform some of these +functions. However, we desire a facility that can check +for errors in a manner that improves code readability and +that also provides a stack trace to better understand the +path taken that resulted in the error. + +## 2 Requirements + +### 2.1 Requirement: Logging errors + +We require the ability to log an error with an error message +that includes the routine and line number. Note that the +current Omega Logging facility already supports this and +should be leveraged for this purpose. + +### 2.2 Requirement: Levels of error severity + +Omega must support several levels of error severity. The +Logging facility already supports a common hierarchy with +debug, info, warn, error and critical levels. Because the +debug and info simply write information, they will continue +to be managed by the Logging interfaces. + +The warn level may be shared by both the Logging facility and +error handler. For example, if only a warning message is required, +the Logging facility should manage the message. If we wish to +inform the calling routine that an error at the warning level +has occurred, the Error handler should manage the warning. + +The error handler should be the primary interface for the more +severe errors at the error and critical levels. In a canonical +hierarchy, critical errors are those that cause the application +to fail, while the error level refers to errors that are severe +enough to be addressed but still allow the application to proceed +without crashing. In Omega, these latter errors typically cause +incorrect or unintended simulation results and are considered +an application failure. We prefer to use critical errors that +abort the simulation for most such errors. The error level +should be reserved for cases in which an error is encountered +but we must pass error information back to the calling routine +(eg through a return code) to judge the response to the error. + +Note that at the initial creation of this document, many of +the current errors reported by Omega are treated at the error +level and should be promoted to critical errors following the +description above. + +### 2.3 Requirement: Abort on error + +When a critical error is encountered, we must have the ability +to abort the simulation. + +### 2.4 Requirement: Generate stack trace + +The error log must include a stack trace to determine and +output the calling sequence that resulted in the error being +reported. + +### 2.5 Requirement: Identify task id + +We require a means of determining which parallel task is +encountering this error. + +### 2.6 Requirement: Output location + +We require errors and the related stack trace to be logged +in an error log file. A separate error log file for each MPI +task may be desirable both for ease of implementation and to +better satisfy requirements 2.4 and 2.5 above. An abort error +message should also be sent to the default log file when +possible so that users know to look for an error log for +details. + +### 2.7 Requirement: Success/fail error codes + +The error handler should include a public success error code +(typically zero) to standardize a successful return code for +Omega routines. Any return code not equal to success should +be interpreted as a failure. However, a standard Fail code +can also be provided as a default failure code. + +### 2.8 Requirement: Error messages and formatting + +We require the ability to define an error message that will +be output to an error log for any error encountered. We also +require the ability to include some variable or run-time +information as part of the the logged message to better +describe the cause of the error, similar to the existing +Logging capability (eg "Value of MyVar is {} but expected +{}"). Ideally, the formatting would be identical to that +provided by the Logging facility and would leverage that +facility for the logging of error messages. + +### 2.9 Requirement: Accumulated error + +In some cases, we wish to accumulate (sum) non-critical errors +encountered within a routine rather than returning or aborting +immediately. For example, a unit test might prefer to report +all errors rather than aborting on the first error encountered. +Among other things, this is a requirment that error codes must +all have the same sign to avoid summing to zero (the nominal +success code). + +### 2.10 Desired: Compact error checking + +We desire error checking to be very concise - ideally a +single line of code - so that the error checking does +not impair code readability. + +### 2.11 Desired: Registering error codes and messages + +To facilitate 2.10 or to provide unique error codes, it may +be desirable to pre-register an error code and related error +message during initialization of a module so that it can +be used later and referred to simply by the assigned code. + +## 3 Algorithmic Formulation + +No algorithms are needed. + +## 4 Design + +We will build the error handler on the existing Logging +functionality, the spdlog library and the cpptrace library +that provides a stack trace capability. Similar to the +Logging facility, we will make use of defined cpp macros +for many interfaces. These macros allow us to both condense +multi-line patterns into a single-line interface, but also +allow us to incorporate the LINE and FILE macros at the +actual line where the error is encountered. + +### 4.1 Data types and parameters + +#### 4.1.1 Parameters + +There will be two public integer parameters (constexpr) +``Err::Success`` and ``Err::Fail``. Success will be the only +identifier for success and will be equal to zero. Any error code +not equal to success will denote failure. This allows multiple +failure error codes and accumulation of error. The ``Err::Fail`` +code is defined to provide a default failure code if one is +not defined or if a specific value is not required. + +#### 4.1.2 Class/structs/data types + +There will be an Error class that primarily holds defined +error codes and associated messages. +```c++ +class Error { + public: + + static constexpr int Success = 0; + static constexpr int Fail = 1; + + // public methods (see below) + + private: + + // container for pre-defined error codes/messages + static std::map AllCodes; + + // private methods (see below) +}; +``` + +### 4.2 Methods + +#### 4.2.1 Macros + +We provide a number of macros for error checking. As described +in section 2.2, we expect the most common interface should be +the critical error macro: +```c++ +ERROR_CRITICAL(ErrCode, ErrMsg, additional args for ErrMsg) +``` +The error message (ErrMsg) can be either a simple string or +can follow format of the Logging facility in which ``{}`` +placeholders can be used with additional arguments to +include variable information +(see the [Logging documentation](#omega-user-logging)). +If the input ErrCode is ``Err::Success`` nothing happens and +the code continues. If the ErrCode matches a code predefined +using the ``defineCode`` function described below, the ErrMsg +from that predefined code is used with any of the additional +arguments to fill in any ``{}`` placeholders. Finally, if the +ErrCode is not equal to Success and is not a predefined code, +the ErrMsg is used, again with the subsequent arguments to print +an error message. For the failure modes, the macro also prints +a stack trace using cpptrace calls and then aborts the +simulation using the ``Err::abort`` function described below. +Typical use cases might be: +```c++ +// Checking an error condition +if (MyFailCondition) ERROR_CRITICAL(Err::Fail, "MySimpleMessage"); + +// or +if (MyFailCondition) + ERROR_CRITICAL(Err::Fail, "Variable {} has value {}", + MyVarName, Value); + +// or if the code is predefined +if (MyFailCondition) ERROR_CRITICAL(DefinedCode, " "); + +// Checking a return code from another routine +Err = MyFunction(...); +ERROR_CRITICAL(Err, "Error encountered in MyFunction"); + +``` + +For less severe errors, we provide several different macros +to match the behavior desired when continuing the simulation. +The first interface is similar to the critical interface above: +```c++ +ERROR_CHECK(ErrCode, ErrMsg, addition args for message); +ERROR_WARN(ErrCode, ErrMsg, addition args for message); +``` +which simply prints the message when the ErrCode is not +equal to Success. The ``ERROR_WARN`` interface is used when +only a warning message is desired. Predefined +error codes can be used here as well. Unlike the +critical interface, this does not print a stack trace or abort +the simulation and the behavior is more similar to the +``LOG_ERROR`` or ``LOG_WARN`` macros. + +If a routine has a return code and a return-on-error is +desired, we provide an interface: +```c++ +ERROR_RETURN(ReturnCode, ErrCode, ErrMsg, added vars if needed); +ERROR_RETURN_WARN(ReturnCode, ErrCode, ErrMsg, added vars); +``` +This behaves similar to the ``ERROR_CHECK`` except that after +the message is printed, the error code is added to the return +code and a return statement is executed to exit the routine. +A use case might be: +```c++ +if (MyFailCondition) ERROR_RETURN(ReturnCode, Err::Fail, "MyErrMsg"); + +// which is equivalent to: + +if (MyFailCondition) { + LOG_ERROR("MyErrMsg"); + ReturnCode += Err::Fail; + return ReturnCode; +} +``` + +Finally, a macro is provided to accumulate errors: +```c++ +ERROR_SUM(ErrSum, ErrCode, ErrMsg, added vars if needed); +``` +which after printing the messages, adds the most recent +error code to the running sum stored in ``ErrSum``. + +Note that in the latter two cases, care must be taken so that +the accumulated error or return codes do not match a +predefined code and cause confustion in a later use of the +code in an ``ERROR_CRITICAL`` or ``ERROR_CHECK`` call. + +#### 4.2.2 Other Error methods + +Some additional functions are needed to complement the macros +described above. + +To allow more concise error checking, developers can pre-define +and error code and associated error message using: +```c++ +Error::defineCode(ErrCode, ErrMsg); +``` +This routine addes the ErrMsg to a map of predefined messages +and returns the new ErrCode assigned to that message for later +use in the macros described above. This function should be +called at some point during model (or module) initialization +and the ErrCode saved for use during the run sequence. + +The abort routine: +```c++ +Error::abort(); +``` +is provided to abort a simulation. A minimal version would +simply call the MPI abort function, but it is possible to +add additional cleanup tasks. + +A finalize routine: +```c++ +Error::finalize(ErrCode); +``` +will clean up all of the predefined error codes. If the +ErrCode is not equal to success, it will set the error code +to the standard abnormal termination exit code (1) and avoid other +standard c++ exit codes for other run-time errors. + +## 5 Verification and Testing + +### 5.1 Test all failure modes + +A unit test will test all non-critical error interfaces first +and then end with a critical failure. The ``WILL_FAIL`` property +in ctest will ensure the expected unit test failure will pass. +An inspection of the log and error files will be used to verify +messages are printed as expected. + - tests all requirement diff --git a/components/omega/doc/index.md b/components/omega/doc/index.md index 711853fc6f6f..17df3a720d13 100644 --- a/components/omega/doc/index.md +++ b/components/omega/doc/index.md @@ -92,6 +92,7 @@ design/Config design/DataTypes design/Decomp design/Driver +design/Error design/Halo design/HorzMeshClass design/Logging