Skip to content

Commit

Permalink
Adds design document for an Omega error handler
Browse files Browse the repository at this point in the history
  • Loading branch information
philipwjones committed Oct 15, 2024
1 parent aa4fe34 commit 2ab6d1b
Show file tree
Hide file tree
Showing 2 changed files with 309 additions and 0 deletions.
308 changes: 308 additions & 0 deletions components/omega/doc/design/Error.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,308 @@
(omega-design-error)=
# ErrorHandler (Error)

## 1 Overview

Any simulation code requires a means to check for errors,
report errors and abort if the error is critical. Omega
includes a logging facility that can perform some of these
functions. However, we desire a facility that can check
for errors in a manner that improves code readability and
that also provides a stack trace to better understand the
path taken that resulted in the error.

## 2 Requirements

### 2.1 Requirement: Logging errors

We require the ability to log an error with an error message
that includes the routine and line number. Note that the
current Omega Logging facility already supports this and
should be leveraged for this purpose.

### 2.2 Requirement: Levels of error severity

Omega must support several levels of error severity. The
Logging facility already supports a common hierarchy with
debug, info, warn, error and critical levels. Because the
debug and info simply write information, they will continue
to be managed by the Logging interfaces.

The warn level may be shared by both the Logging facility and
error handler. For example, if only a warning message is required,
the Logging facility should manage the message. If we wish to
inform the calling routine that an error at the warning level
has occurred, the Error handler should manage the warning.

The error handler should be the primary interface for the more
severe errors at the error and critical levels. In a canonical
hierarchy, critical errors are those that cause the application
to fail, while the error level refers to errors that are severe
enough to be addressed but still allow the application to proceed
without crashing. In Omega, these latter errors typically cause
incorrect or unintended simulation results and are considered
an application failure. We prefer to use critical errors that
abort the simulation for most such errors. The error level
should be reserved for cases in which an error is encountered
but we must pass error information back to the calling routine
(eg through a return code) to judge the response to the error.

Note that at the initial creation of this document, many of
the current errors reported by Omega are treated at the error
level and should be promoted to critical errors following the
description above.

### 2.3 Requirement: Abort on error

When a critical error is encountered, we must have the ability
to abort the simulation.

### 2.4 Requirement: Generate stack trace

The error log must include a stack trace to determine and
output the calling sequence that resulted in the error being
reported.

### 2.5 Requirement: Identify task id

We require a means of determining which parallel task is
encountering this error.

### 2.6 Requirement: Output location

We require errors and the related stack trace to be logged
in an error log file. A separate error log file for each MPI
task may be desirable both for ease of implementation and to
better satisfy requirements 2.4 and 2.5 above. An abort error
message should also be sent to the default log file when
possible so that users know to look for an error log for
details.

### 2.7 Requirement: Success/fail error codes

The error handler should include a public success error code
(typically zero) to standardize a successful return code for
Omega routines. Any return code not equal to success should
be interpreted as a failure. However, a standard Fail code
can also be provided as a default failure code.

### 2.8 Requirement: Error messages and formatting

We require the ability to define an error message that will
be output to an error log for any error encountered. We also
require the ability to include some variable or run-time
information as part of the the logged message to better
describe the cause of the error, similar to the existing
Logging capability (eg "Value of MyVar is {} but expected
{}"). Ideally, the formatting would be identical to that
provided by the Logging facility and would leverage that
facility for the logging of error messages.

### 2.9 Requirement: Accumulated error

In some cases, we wish to accumulate (sum) non-critical errors
encountered within a routine rather than returning or aborting
immediately. For example, a unit test might prefer to report
all errors rather than aborting on the first error encountered.
Among other things, this is a requirment that error codes must
all have the same sign to avoid summing to zero (the nominal
success code).

### 2.10 Desired: Compact error checking

We desire error checking to be very concise - ideally a
single line of code - so that the error checking does
not impair code readability.

### 2.11 Desired: Registering error codes and messages

To facilitate 2.10 or to provide unique error codes, it may
be desirable to pre-register an error code and related error
message during initialization of a module so that it can
be used later and referred to simply by the assigned code.

## 3 Algorithmic Formulation

No algorithms are needed.

## 4 Design

We will build the error handler on the existing Logging
functionality, the spdlog library and the cpptrace library
that provides a stack trace capability. Similar to the
Logging facility, we will make use of defined cpp macros
for many interfaces. These macros allow us to both condense
multi-line patterns into a single-line interface, but also
allow us to incorporate the LINE and FILE macros at the
actual line where the error is encountered.

### 4.1 Data types and parameters

#### 4.1.1 Parameters

There will be two public integer parameters (constexpr)
``Err::Success`` and ``Err::Fail``. Success will be the only
identifier for success and will be equal to zero. Any error code
not equal to success will denote failure. This allows multiple
failure error codes and accumulation of error. The ``Err::Fail``
code is defined to provide a default failure code if one is
not defined or if a specific value is not required.

#### 4.1.2 Class/structs/data types

There will be an Error class that primarily holds defined
error codes and associated messages.
```c++
class Error {
public:

static constexpr int Success = 0;
static constexpr int Fail = 1;

// public methods (see below)

private:

// container for pre-defined error codes/messages
static std::map<int, std::string> AllCodes;

// private methods (see below)
};
```
### 4.2 Methods
#### 4.2.1 Macros
We provide a number of macros for error checking. As described
in section 2.2, we expect the most common interface should be
the critical error macro:
```c++
ERROR_CRITICAL(ErrCode, ErrMsg, additional args for ErrMsg)
```
The error message (ErrMsg) can be either a simple string or
can follow format of the Logging facility in which ``{}``
placeholders can be used with additional arguments to
include variable information
(see the [Logging documentation](#omega-user-logging)).
If the input ErrCode is ``Err::Success`` nothing happens and
the code continues. If the ErrCode matches a code predefined
using the ``defineCode`` function described below, the ErrMsg
from that predefined code is used with any of the additional
arguments to fill in any ``{}`` placeholders. Finally, if the
ErrCode is not equal to Success and is not a predefined code,
the ErrMsg is used, again with the subsequent arguments to print
an error message. For the failure modes, the macro also prints
a stack trace using cpptrace calls and then aborts the
simulation using the ``Err::abort`` function described below.
Typical use cases might be:
```c++
// Checking an error condition
if (MyFailCondition) ERROR_CRITICAL(Err::Fail, "MySimpleMessage");

// or
if (MyFailCondition)
ERROR_CRITICAL(Err::Fail, "Variable {} has value {}",
MyVarName, Value);

// or if the code is predefined
if (MyFailCondition) ERROR_CRITICAL(DefinedCode, " ");

// Checking a return code from another routine
Err = MyFunction(...);
ERROR_CRITICAL(Err, "Error encountered in MyFunction");

```
For less severe errors, we provide several different macros
to match the behavior desired when continuing the simulation.
The first interface is similar to the critical interface above:
```c++
ERROR_CHECK(ErrCode, ErrMsg, addition args for message);
ERROR_WARN(ErrCode, ErrMsg, addition args for message);
```
which simply prints the message when the ErrCode is not
equal to Success. The ``ERROR_WARN`` interface is used when
only a warning message is desired. Predefined
error codes can be used here as well. Unlike the
critical interface, this does not print a stack trace or abort
the simulation and the behavior is more similar to the
``LOG_ERROR`` or ``LOG_WARN`` macros.

If a routine has a return code and a return-on-error is
desired, we provide an interface:
```c++
ERROR_RETURN(ReturnCode, ErrCode, ErrMsg, added vars if needed);
ERROR_RETURN_WARN(ReturnCode, ErrCode, ErrMsg, added vars);
```
This behaves similar to the ``ERROR_CHECK`` except that after
the message is printed, the error code is added to the return
code and a return statement is executed to exit the routine.
A use case might be:
```c++
if (MyFailCondition) ERROR_RETURN(ReturnCode, Err::Fail, "MyErrMsg");
// which is equivalent to:
if (MyFailCondition) {
LOG_ERROR("MyErrMsg");
ReturnCode += Err::Fail;
return ReturnCode;
}
```

Finally, a macro is provided to accumulate errors:
```c++
ERROR_SUM(ErrSum, ErrCode, ErrMsg, added vars if needed);
```
which after printing the messages, adds the most recent
error code to the running sum stored in ``ErrSum``.
Note that in the latter two cases, care must be taken so that
the accumulated error or return codes do not match a
predefined code and cause confustion in a later use of the
code in an ``ERROR_CRITICAL`` or ``ERROR_CHECK`` call.
#### 4.2.2 Other Error methods
Some additional functions are needed to complement the macros
described above.
To allow more concise error checking, developers can pre-define
and error code and associated error message using:
```c++
Error::defineCode(ErrCode, ErrMsg);
```
This routine addes the ErrMsg to a map of predefined messages
and returns the new ErrCode assigned to that message for later
use in the macros described above. This function should be
called at some point during model (or module) initialization
and the ErrCode saved for use during the run sequence.

The abort routine:
```c++
Error::abort();
```
is provided to abort a simulation. A minimal version would
simply call the MPI abort function, but it is possible to
add additional cleanup tasks.

A finalize routine:
```c++
Error::finalize(ErrCode);
```
will clean up all of the predefined error codes. If the
ErrCode is not equal to success, it will set the error code
to the standard abnormal termination exit code (1) and avoid other
standard c++ exit codes for other run-time errors.
## 5 Verification and Testing
### 5.1 Test all failure modes
A unit test will test all non-critical error interfaces first
and then end with a critical failure. The ``WILL_FAIL`` property
in ctest will ensure the expected unit test failure will pass.
An inspection of the log and error files will be used to verify
messages are printed as expected.
- tests all requirement
1 change: 1 addition & 0 deletions components/omega/doc/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -92,6 +92,7 @@ design/Config
design/DataTypes
design/Decomp
design/Driver
design/Error
design/Halo
design/HorzMeshClass
design/Logging
Expand Down

0 comments on commit 2ab6d1b

Please sign in to comment.