Skip to content

Commit

Permalink
Add developer doc for model development debugging. (iree-org#17732)
Browse files Browse the repository at this point in the history
Collecting some lessons learned and debugging tips from
nod-ai/shark-ai#22 into a single document.
  • Loading branch information
ScottTodd authored Jun 25, 2024
1 parent 247de36 commit 1c53595
Show file tree
Hide file tree
Showing 2 changed files with 158 additions and 5 deletions.
152 changes: 152 additions & 0 deletions docs/website/docs/developers/debugging/model-development.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,152 @@
---
hide:
- tags
tags:
icon: octicons/bug-16
---

# Model development debugging

Bringing up new models or diagnosing regressions in existing models written
using one of IREE's supported [ML frameworks](../../guides/ml-frameworks/index.md)
or downstream projects like [sharktank](https://github.com/nod-ai/sharktank) can
involve debugging up and down the tech stack. Here are some tips to make that
process easier.

## Helpful build settings

### Use a debug build

Build with `-DCMAKE_BUILD_TYPE=Debug` or `-DCMAKE_BUILD_TYPE=RelWithDebInfo` to
include debug information in binaries you build.

### Enable assertions

Build with `-DIREE_ENABLE_ASSERTIONS=ON` to ensure that asserts in compiler
and runtime code are included in your program binaries. If an assert is missed
and the program compiles anyways, the output should not be trusted. The compiler
_must_ not crash on valid input programs, so assert failures should be fixed and
not worked around.

!!! note - "Note: release builds and some CI jobs may not have asserts enabled!"

### Run using sanitizers (ASan/TSan/UBSan)

Building and running using [sanitizers](../debugging/sanitizers.md) can catch
memory usage issues (ASan), thread synchronization issues (TSan), and undefined
behavior (UBSan).

## Helpful compiler and runtime flags

### VM execution tracing

The `--trace_execution` flag to runtime tools like `iree-run-module` will print
each VM instruction as it is executed. This can help with associating other logs
and system behavior with the compiled VM program.

### Tensor tracing

* The `--iree-flow-trace-dispatch-tensors` flag to `iree-compile` inserts
trace markers for all dispatch operation tensor inputs and outputs. This lets
you see tensor contents change as the program runs.
* The `--iree-flow-break-dispatch` flag to `iree-compile` inserts breaks after
a specified dispatch, allowing early termination of the program and shorter
logs when focusing debugging around a specific dispatch

### Executable substitution

Executable sources can be dumped, edited, and then loaded back into a program
using `--iree-hal-dump-executable-sources-to` and
`--iree-hal-substitute-executable-source`. This can be used for performace
tuning or for debugging (e.g. by replacing a complicated dispatch with a
simpler one).

See <https://github.com/iree-org/iree/pull/12240> for examples.

## Alternate perspectives

### Try using other data types

Nearly all targets support the `i32` and `f32` data types well, while higher
and lower bit depth types and more esoteric types like `bf16` and `complex` may
be supported partially or not at all on some targets.

If a program fails to compile or produces incorrect outputs, consider checking
if the program works after converting to other data types.

!!! tip

These compiler options automatically convert between several types on
import:

* `--iree-input-demote-i64-to-i32`
* `--iree-input-demote-f32-to-f16`
* `--iree-input-demote-f64-to-f32`
* `--iree-input-promote-f16-to-f32`
* `--iree-input-promote-bf16-to-f32`

If using `iree-run-module --input=@path/to/input_values.npy`, consider also
using `.bin` binary files instead of `.npy` numpy files, since IREE supports
different types than numpy and signedness information is lost at that level.

### Try using other targets / devices

Large parts of IREE's compilation pipelines and runtime libraries are shared
between compiler target backends and runtime HAL devices/drivers. If a program
works in one configuration but fails in another, that indicates an issue or
missing functionality in the failing configuration.

Some configurations also offer unique debugging functionality:

Compiler target | Runtime device | Notable properties for debugging
-- | -- | --
`vmvx` | `local-sync` | Easy to step into generated code, limited type support
`llvm-cpu` | `local-sync` | Single-threaded, broad type support
`llvm-cpu` | `local-task` | Multi-threaded, broad type support
`vulkan-spirv` | `vulkan` | Compatible with Renderdoc ([docs here](../performance/profiling-gpu-vulkan.md#renderdoc))
`cuda` | `cuda` | Compatible with [NVIDIA Nsight Graphics](https://developer.nvidia.com/nsight-graphics)
`rocm` | `hip` | Compatible with [Omniperf](https://github.com/ROCm/omniperf)
`metal-spirv` | `metal` | Compatible with the [Metal Debugger](https://developer.apple.com/documentation/xcode/metal-debugger/)

!!! tip

See the
[deployment configurations](../../guides/deployment-configurations/index.md)
pages for more information about each backend and device.

### Run natively and via Python bindings

Some problems manifest only when running through the Python (or some other
language/framework) bindings. The Python bindings have some non-trivial interop
and memory management across the C/C++/Python boundary.

Try extracting standalone `.mlir` files, compiling through `iree-compile`, then
running through `iree-run-module`. Extracting these artifacts can also help
other developers follow your reproduction steps.

## Reducing complexity

### Top-down reduction

Starting from a full program, try to reduce the program size and complexity
while keeping the issue you are debugging present. This can be either a manual
process or the `iree-reduce` tool can automate it. For manual reduction, here
are some general strategies:

* Reduce tensor sizes (e.g. image dimensions, context lengths) in your ML
framework
* Cut out duplicate layers (e.g. attention blocks in LLMs)
* If your program has multiple functions, test each in isolation

### Bottom-up reduction

Consider writing unit tests for individual ops or combinations of ops to see
if crashes, bugs, numerical issues, etc. can be reproduced at that scale.

Some existing test suites can be found at these locations:

* <https://github.com/iree-org/iree/tree/main/tests/e2e>
* <https://github.com/nod-ai/SHARK-TestSuite/tree/main/iree_tests/onnx/node/generated>
* <https://github.com/nod-ai/SHARK-TestSuite/tree/main/e2eshark/onnx/operators>
* <https://github.com/nod-ai/SHARK-TestSuite/tree/main/e2eshark/pytorch/operators>
* <https://github.com/openxla/stablehlo/tree/main/stablehlo/tests/interpret>
11 changes: 6 additions & 5 deletions docs/website/mkdocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,7 @@ theme:
- navigation.tabs # Show primary sections in tabs below the header
- navigation.tabs.sticky # Keep tabs visible at the top when scrolled
- navigation.top # "Back to top" button
- navigation.indexes # Section names can link to index.md pages
- navigation.indexes # Section names can link to index.md pages

- toc.follow # Scroll the TOC panel to follow the reader

Expand Down Expand Up @@ -99,7 +99,7 @@ markdown_extensions:
# Support for embedding external files (e.g. source code samples) via snippets
# https://squidfunk.github.io/mkdocs-material/reference/code-blocks/#embedding-external-files
- pymdownx.snippets:
base_path: ["../../"] # Use paths relative to the repository root
base_path: ["../../"] # Use paths relative to the repository root
check_paths: true
# Diagram support, see
# https://squidfunk.github.io/mkdocs-material/reference/diagrams/
Expand All @@ -115,8 +115,8 @@ markdown_extensions:
- pymdownx.tasklist:
custom_checkbox: true
- tables
- toc: # Table of Contents
permalink: 'link' # Use Material font's "link" icon; see iree.css
- toc: # Table of Contents
permalink: "link" # Use Material font's "link" icon; see iree.css
toc_depth: 4

# Navigation with explicit ordering and nesting.
Expand Down Expand Up @@ -196,6 +196,7 @@ nav:
- "developers/debugging/compile-time-regressions.md"
- "developers/debugging/gpu.md"
- "developers/debugging/integration-tests.md"
- "developers/debugging/model-development.md"
- "developers/debugging/releases.md"
- "developers/debugging/sanitizers.md"
- "Performance":
Expand Down Expand Up @@ -243,7 +244,7 @@ plugins:

# https://github.com/mkdocs/mkdocs-redirects
- redirects:
redirect_maps: # old -> new
redirect_maps: # old -> new
"extensions/index.md": "reference/extensions.md"

# "getting-started/" moved under "guides/ml-frameworks/"
Expand Down

0 comments on commit 1c53595

Please sign in to comment.