forked from E3SM-Project/E3SM
Adds design document for an Omega error handler
1 parent
aa4fe34
commit 2ab6d1b
Showing
2 changed files
with
309 additions
and
0 deletions.
(omega-design-error)=
# ErrorHandler (Error)

## 1 Overview

Any simulation code requires a means to check for errors,
report errors and abort if the error is critical. Omega
includes a logging facility that can perform some of these
functions. However, we desire a facility that can check
for errors in a manner that improves code readability and
that also provides a stack trace to better understand the
path taken that resulted in the error.

## 2 Requirements

### 2.1 Requirement: Logging errors

We require the ability to log an error with an error message
that includes the routine and line number. Note that the
current Omega Logging facility already supports this and
should be leveraged for this purpose.

### 2.2 Requirement: Levels of error severity

Omega must support several levels of error severity. The
Logging facility already supports a common hierarchy with
debug, info, warn, error and critical levels. Because the
debug and info levels simply write information, they will
continue to be managed by the Logging interfaces.

The warn level may be shared by both the Logging facility and
the error handler. For example, if only a warning message is
required, the Logging facility should manage the message. If we
wish to inform the calling routine that an error at the warning
level has occurred, the error handler should manage the warning.

The error handler should be the primary interface for the more
severe errors at the error and critical levels. In a canonical
hierarchy, critical errors are those that cause the application
to fail, while the error level refers to errors that are severe
enough to be addressed but still allow the application to proceed
without crashing. In Omega, these latter errors typically cause
incorrect or unintended simulation results and are considered
an application failure. We prefer to use critical errors that
abort the simulation for most such errors. The error level
should be reserved for cases in which an error is encountered
but we must pass error information back to the calling routine
(e.g. through a return code) to judge the response to the error.

Note that at the initial creation of this document, many of
the current errors reported by Omega are treated at the error
level and should be promoted to critical errors following the
description above.

### 2.3 Requirement: Abort on error

When a critical error is encountered, we must have the ability
to abort the simulation.

### 2.4 Requirement: Generate stack trace

The error log must include a stack trace to determine and
output the calling sequence that resulted in the error being
reported.

### 2.5 Requirement: Identify task id

We require a means of determining which parallel task is
encountering this error.

### 2.6 Requirement: Output location

We require errors and the related stack trace to be logged
in an error log file. A separate error log file for each MPI
task may be desirable both for ease of implementation and to
better satisfy requirements 2.4 and 2.5 above. An abort error
message should also be sent to the default log file when
possible so that users know to look for an error log for
details.

### 2.7 Requirement: Success/fail error codes

The error handler should include a public success error code
(typically zero) to standardize a successful return code for
Omega routines. Any return code not equal to success should
be interpreted as a failure. However, a standard Fail code
can also be provided as a default failure code.

### 2.8 Requirement: Error messages and formatting

We require the ability to define an error message that will
be output to an error log for any error encountered. We also
require the ability to include some variable or run-time
information as part of the logged message to better
describe the cause of the error, similar to the existing
Logging capability (e.g. "Value of MyVar is {} but expected
{}"). Ideally, the formatting would be identical to that
provided by the Logging facility and would leverage that
facility for the logging of error messages.

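The ``{}`` placeholder style can be illustrated with a minimal sketch. This is a stand-in written only with the C++ standard library, not the formatter used by the Logging facility; the names ``formatInto`` and ``formatMessage`` are hypothetical and exist only for this illustration. Each ``{}`` is replaced, left to right, by the next argument.

```c++
#include <sstream>
#include <string>

// Base case: no arguments remain, so copy the rest of the format text.
inline void formatInto(std::ostringstream &Out, const std::string &Fmt) {
   Out << Fmt;
}

// Recursive case: write the text up to the next "{}", then the next
// argument, then continue with the remainder of the format text.
template <typename T, typename... Rest>
void formatInto(std::ostringstream &Out, const std::string &Fmt,
                const T &Value, const Rest &...Others) {
   const std::string::size_type Pos = Fmt.find("{}");
   if (Pos == std::string::npos) { // more arguments than placeholders
      Out << Fmt;
      return;
   }
   Out << Fmt.substr(0, Pos) << Value;
   formatInto(Out, Fmt.substr(Pos + 2), Others...);
}

// Hypothetical entry point: returns the message with placeholders filled.
template <typename... Args>
std::string formatMessage(const std::string &Fmt, const Args &...MsgArgs) {
   std::ostringstream Out;
   formatInto(Out, Fmt, MsgArgs...);
   return Out.str();
}
```

With this sketch, ``formatMessage("Value of MyVar is {} but expected {}", 3, 4)`` yields ``Value of MyVar is 3 but expected 4``. Extra arguments are silently dropped here; a production formatter would likely treat that as an error.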
### 2.9 Requirement: Accumulated error

In some cases, we wish to accumulate (sum) non-critical errors
encountered within a routine rather than returning or aborting
immediately. For example, a unit test might prefer to report
all errors rather than aborting on the first error encountered.
Among other things, this requires that all error codes have the
same sign so that they cannot sum to zero (the nominal
success code).

### 2.10 Desired: Compact error checking

We desire error checking to be very concise - ideally a
single line of code - so that the error checking does
not impair code readability.

### 2.11 Desired: Registering error codes and messages

To facilitate 2.10 or to provide unique error codes, it may
be desirable to pre-register an error code and related error
message during initialization of a module so that it can
be used later and referred to simply by the assigned code.

## 3 Algorithmic Formulation

No algorithms are needed.

## 4 Design

We will build the error handler on the existing Logging
functionality, the spdlog library and the cpptrace library
that provides a stack trace capability. Similar to the
Logging facility, we will make use of defined cpp macros
for many interfaces. These macros allow us both to condense
multi-line patterns into a single-line interface and to
incorporate the ``__LINE__`` and ``__FILE__`` macros at the
actual line where the error is encountered.

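The reason macros are needed for the location information can be shown with a small sketch. The ``CallSite`` struct, ``siteFromFunction`` and ``SKETCH_CALL_SITE`` are hypothetical names used only for illustration. ``__FILE__`` and ``__LINE__`` are expanded where they appear in the source, so a helper function always reports its own line, while a macro reports the line where the check is written.

```c++
#include <string>

// Hypothetical helper, not part of Omega: where a message originated.
struct CallSite {
   std::string File;
   int Line;
};

// A function can only report its own location, because __LINE__ is
// expanded here, inside the function body.
inline CallSite siteFromFunction() { return CallSite{__FILE__, __LINE__}; }

// A macro is expanded at the point of use, so __FILE__ and __LINE__
// describe the line on which the macro is invoked.
#define SKETCH_CALL_SITE() (CallSite{__FILE__, __LINE__})
```

Two invocations of ``SKETCH_CALL_SITE()`` on different lines report different line numbers, whereas every call to ``siteFromFunction()`` reports the same one.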
### 4.1 Data types and parameters

#### 4.1.1 Parameters

There will be two public integer parameters (constexpr)
``Err::Success`` and ``Err::Fail``. Success will be the only
identifier for success and will be equal to zero. Any error code
not equal to success will denote failure. This allows multiple
failure error codes and accumulation of error. The ``Err::Fail``
code is defined to provide a default failure code if one is
not defined or if a specific value is not required.

#### 4.1.2 Class/structs/data types

There will be an Error class that primarily holds defined
error codes and associated messages.
```c++
#include <map>
#include <string>

class Error {
 public:
   static constexpr int Success = 0;
   static constexpr int Fail    = 1;

   // public methods (see below)

 private:
   // container for pre-defined error codes/messages
   static std::map<int, std::string> AllCodes;

   // private methods (see below)
};
```
### 4.2 Methods | ||
#### 4.2.1 Macros | ||
We provide a number of macros for error checking. As described | ||
in section 2.2, we expect the most common interface should be | ||
the critical error macro: | ||
```c++ | ||
ERROR_CRITICAL(ErrCode, ErrMsg, additional args for ErrMsg) | ||
``` | ||
The error message (ErrMsg) can be either a simple string or
can follow the format of the Logging facility, in which ``{}``
placeholders are filled from additional arguments to
include variable information
(see the [Logging documentation](#omega-user-logging)).
The macro behaves in one of three ways. If the input ErrCode is
``Err::Success``, nothing happens and the code continues. If the
ErrCode matches a code predefined using the ``defineCode``
function described below, the ErrMsg from that predefined code is
used, with any additional arguments filling its ``{}``
placeholders. Finally, if the ErrCode is not equal to Success and
is not a predefined code, the ErrMsg passed to the macro is used,
again with the subsequent arguments, to print an error message.
In both failure cases, the macro also prints a stack trace using
cpptrace calls and then aborts the simulation using the
``Error::abort`` function described below.
Typical use cases might be:
```c++
// Checking an error condition
if (MyFailCondition) ERROR_CRITICAL(Err::Fail, "MySimpleMessage");

// or
if (MyFailCondition)
   ERROR_CRITICAL(Err::Fail, "Variable {} has value {}",
                  MyVarName, Value);

// or if the code is predefined
if (MyFailCondition) ERROR_CRITICAL(DefinedCode, " ");

// Checking a return code from another routine
// (the variable is named ErrCode to avoid clashing with the Err:: scope)
int ErrCode = MyFunction(...);
ERROR_CRITICAL(ErrCode, "Error encountered in MyFunction");
```
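The control flow of the critical macro can be sketched as follows. This is a simplified illustration under stated assumptions, not the Omega implementation: the names ``sketch::critical``, ``sketch::CriticalError`` and ``SKETCH_ERROR_CRITICAL`` are hypothetical, the predefined-code lookup and ``{}`` formatting are omitted, and an exception stands in for the stack trace and abort so that the behavior can be exercised in a test.

```c++
#include <sstream>
#include <stdexcept>
#include <string>

namespace sketch {

constexpr int Success = 0;
constexpr int Fail    = 1;

// Stand-in for the abort path: the real handler would log the message,
// print a stack trace and abort the simulation instead of throwing.
struct CriticalError : std::runtime_error {
   using std::runtime_error::runtime_error;
};

inline void critical(int ErrCode, const std::string &ErrMsg,
                     const char *File, int Line) {
   if (ErrCode == Success)
      return; // nothing happens and the code continues

   std::ostringstream Out;
   Out << "[critical] " << File << ":" << Line << " (code " << ErrCode
       << ") " << ErrMsg;
   throw CriticalError(Out.str());
}

} // namespace sketch

// The macro supplies the file and line of the line being checked.
#define SKETCH_ERROR_CRITICAL(ErrCode, ErrMsg) \
   sketch::critical((ErrCode), (ErrMsg), __FILE__, __LINE__)
```

Because the success check lives inside the helper, a return code can be passed straight to the macro without a surrounding ``if``, which is what keeps the call to a single line.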
For less severe errors, we provide several different macros
to match the behavior desired when continuing the simulation.
The first interface is similar to the critical interface above:
```c++
ERROR_CHECK(ErrCode, ErrMsg, additional args for message);
ERROR_WARN(ErrCode, ErrMsg, additional args for message);
```
which simply prints the message when the ErrCode is not
equal to Success. The ``ERROR_WARN`` interface is used when
only a warning message is desired. Predefined
error codes can be used here as well. Unlike the
critical interface, these macros do not print a stack trace or
abort the simulation; their behavior is closer to that of the
``LOG_ERROR`` and ``LOG_WARN`` macros.

If a routine has a return code and a return-on-error is
desired, we provide an interface:
```c++ | ||
ERROR_RETURN(ReturnCode, ErrCode, ErrMsg, added vars if needed); | ||
ERROR_RETURN_WARN(ReturnCode, ErrCode, ErrMsg, added vars); | ||
``` | ||
This behaves similarly to ``ERROR_CHECK``, except that after
the message is printed, the error code is added to the return
code and a return statement is executed to exit the routine.
A use case might be: | ||
```c++
if (MyFailCondition) ERROR_RETURN(ReturnCode, Err::Fail, "MyErrMsg");
// which is equivalent to:
if (MyFailCondition) {
   LOG_ERROR("MyErrMsg");
   ReturnCode += Err::Fail;
   return ReturnCode;
}
```

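One way such a return-on-error macro could be written is sketched below. The names ``SKETCH_ERROR_RETURN`` and ``checkLayerCount`` are hypothetical, and ``std::cerr`` stands in for the Logging facility. The ``do { ... } while (0)`` wrapper is an assumption about the implementation; it makes the macro behave as a single statement so that it is safe after an ``if`` without braces, as in the use case above.

```c++
#include <iostream>

namespace sketch {
constexpr int Success = 0;
constexpr int Fail    = 1;
} // namespace sketch

// Prints the message, adds ErrCode to ReturnCode and returns from the
// enclosing function, but only when ErrCode signals a failure.
#define SKETCH_ERROR_RETURN(ReturnCode, ErrCode, ErrMsg) \
   do { \
      if ((ErrCode) != sketch::Success) { \
         std::cerr << "[error] " << __FILE__ << ":" << __LINE__ << " " \
                   << (ErrMsg) << "\n"; \
         (ReturnCode) += (ErrCode); \
         return (ReturnCode); \
      } \
   } while (0)

// Hypothetical routine that validates an input and returns a code.
int checkLayerCount(int NLayers) {
   int ReturnCode = sketch::Success;
   const int Err  = (NLayers > 0) ? sketch::Success : sketch::Fail;
   SKETCH_ERROR_RETURN(ReturnCode, Err, "NLayers must be positive");
   return ReturnCode; // reached only when no error occurred
}
```

Note that the macro arguments are evaluated more than once, so they should be plain variables or constants rather than expressions with side effects.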
Finally, a macro is provided to accumulate errors: | ||
```c++ | ||
ERROR_SUM(ErrSum, ErrCode, ErrMsg, added vars if needed); | ||
``` | ||
which, after printing the message, adds the most recent
error code to the running sum stored in ``ErrSum``.
Note that in the latter two cases, care must be taken so that
the accumulated error or return codes do not match a
predefined code and cause confusion in a later use of the
code in an ``ERROR_CRITICAL`` or ``ERROR_CHECK`` call.
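The accumulation pattern, and the reason requirement 2.9 asks for error codes of a single sign, can be sketched as follows. ``SKETCH_ERROR_SUM`` and ``sumErrors`` are hypothetical names, and ``std::cerr`` again stands in for the Logging facility.

```c++
#include <iostream>
#include <vector>

namespace sketch {
constexpr int Success = 0;
} // namespace sketch

// Prints the message and adds ErrCode to the running sum when ErrCode
// signals a failure; execution then continues with the next statement.
#define SKETCH_ERROR_SUM(ErrSum, ErrCode, ErrMsg) \
   do { \
      if ((ErrCode) != sketch::Success) { \
         std::cerr << "[error] " << (ErrMsg) << " (code " << (ErrCode) \
                   << ")\n"; \
         (ErrSum) += (ErrCode); \
      } \
   } while (0)

// Hypothetical driver: accumulates a list of return codes, as a unit
// test might do to report every failure instead of stopping at the first.
int sumErrors(const std::vector<int> &Codes) {
   int ErrSum = sketch::Success;
   for (const int Code : Codes) {
      SKETCH_ERROR_SUM(ErrSum, Code, "check failed");
   }
   return ErrSum;
}
```

With codes of one sign, ``sumErrors({0, 1, 0, 1})`` returns a non-zero sum. With mixed signs, ``sumErrors({1, -1})`` returns zero and two failures would be reported as success, which is the case the same-sign requirement rules out.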
#### 4.2.2 Other Error methods | ||
Some additional functions are needed to complement the macros
described above.
To allow more concise error checking, developers can pre-define
an error code and associated error message using:
```c++
Error::defineCode(ErrCode, ErrMsg);
```
This routine adds the ErrMsg to a map of predefined messages
and returns the new ErrCode assigned to that message for later
use in the macros described above. This function should be
called at some point during model (or module) initialization
and the ErrCode saved for use during the run sequence.

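A registry of predefined codes could look like the sketch below. It is one possible reading of the interface rather than the actual design: the class name ``ErrorRegistry`` and the ``message`` lookup are hypothetical, and this version takes only the message and returns the newly assigned code, whereas the interface above also lists ``ErrCode`` as an argument.

```c++
#include <map>
#include <string>

class ErrorRegistry {
 public:
   static constexpr int Success = 0;
   static constexpr int Fail    = 1;

   // Stores ErrMsg under a new unique code and returns that code.
   // Codes start above Fail so they never collide with the defaults.
   static int defineCode(const std::string &ErrMsg) {
      static int NextCode = Fail + 1;
      const int NewCode   = NextCode++;
      codes()[NewCode]    = ErrMsg;
      return NewCode;
   }

   // Returns the predefined message for ErrCode, or Fallback when the
   // code was never registered (the message passed to the macro).
   static std::string message(int ErrCode, const std::string &Fallback) {
      const auto It = codes().find(ErrCode);
      return (It != codes().end()) ? It->second : Fallback;
   }

 private:
   // Function-local static avoids initialization-order problems.
   static std::map<int, std::string> &codes() {
      static std::map<int, std::string> AllCodes;
      return AllCodes;
   }
};
```

A module would call ``defineCode`` once during initialization, keep the returned code, and later pass it to the checking macros, which would use ``message`` to pick the predefined text over the message given at the call site.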
The abort routine: | ||
```c++ | ||
Error::abort(); | ||
``` | ||
is provided to abort a simulation. A minimal version would
simply call the MPI abort function, but further cleanup tasks
can be added.

A finalize routine:
```c++ | ||
Error::finalize(ErrCode); | ||
``` | ||
will clean up all of the predefined error codes. If the
ErrCode is not equal to success, it will set the error code
to the standard abnormal termination exit code (1), avoiding the
other standard C++ exit codes used for other run-time errors.
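The exit-code mapping can be sketched as below. The function name ``finalizeSketch`` is hypothetical and the map is passed in explicitly only to keep the example self-contained; in the design it is a private member of the Error class.

```c++
#include <map>
#include <string>

// Assumed values, following the text above: zero for success and 1 for
// abnormal termination.
constexpr int ExitNormal   = 0;
constexpr int ExitAbnormal = 1;

// Clears the predefined codes and collapses any non-success error code
// (including an accumulated sum) to a single abnormal exit status.
int finalizeSketch(std::map<int, std::string> &AllCodes, int ErrCode) {
   AllCodes.clear();
   return (ErrCode == 0) ? ExitNormal : ExitAbnormal;
}
```

Collapsing to one value also guards against accumulated sums: on POSIX systems a parent process typically sees only the low 8 bits of the exit status, so returning a sum such as 256 directly could look like success.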
## 5 Verification and Testing | ||
### 5.1 Test all failure modes | ||
A unit test will exercise all non-critical error interfaces first
and then end with a critical failure. The ``WILL_FAIL`` property
in ctest will ensure that this expected failure is reported as a
passing test. An inspection of the log and error files will be
used to verify that messages are printed as expected.
  - tests all requirements