
[NOT-BUG] The purpose of this project #755

Open
xushijie opened this issue Jan 6, 2025 · 1 comment

Comments


xushijie commented Jan 6, 2025

I am new to IREE and recently came across this project. My first question is about its motivation: what kinds of problems does this project try to solve?
I know it is a library for model serving, but there are already vLLM and other serving systems, so how does shark-ai distinguish itself from them? I could not find any detailed information about it beyond the fact that AMD acquired it, so I am raising this question.

@ScottTodd
Member

We just published a release that has some context: https://github.com/nod-ai/shark-ai/releases/tag/v3.1.0

The full vertically-integrated SHARK AI stack is now available for deploying machine learning models:

  • The sharktank package builds bridges from popular machine learning models, coming from model repositories like Hugging Face and frameworks like llama.cpp, to the IREE compiler. This model export and compilation pipeline features whole-program optimization and efficient cross-target code generation without depending on operator libraries (a compile-side sketch follows this list).
  • The shortfin package provides serving applications built on top of the IREE runtime, with integration points to other ecosystem projects like the SGLang frontend. These applications are lightweight, portable, and packed with optimizations to improve serving efficiency.
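
To make the compile side concrete, here is a minimal sketch using the IREE Python compiler bindings directly rather than the sharktank exporters; the tiny `simple_mul` MLIR module and the `llvm-cpu` target are illustrative placeholders, not a real model from the stack:

```python
# Minimal compile-side sketch: whole-program compilation of an MLIR module
# into a self-contained IREE VM flatbuffer (.vmfb). The module below is a
# toy placeholder, not a sharktank-exported model.
import iree.compiler as ireec

SIMPLE_MUL_MLIR = """
module @arithmetic {
  func.func @simple_mul(%lhs: tensor<4xf32>, %rhs: tensor<4xf32>) -> tensor<4xf32> {
    %0 = arith.mulf %lhs, %rhs : tensor<4xf32>
    return %0 : tensor<4xf32>
  }
}
"""

# Compile for the CPU backend; choosing a different target_backends entry
# changes the generated code but not the shape of the deployment artifact.
vmfb = ireec.tools.compile_str(SIMPLE_MUL_MLIR, target_backends=["llvm-cpu"])

with open("simple_mul_cpu.vmfb", "wb") as f:
    f.write(vmfb)
```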

Together, these packages simplify model deployment by eliminating the need for complex Docker containers or vendor-specific libraries while continuing to provide competitive performance and flexibility. Here are some metrics:

  • The native shortfin serving library, including a GPU runtime, fits in less than 2MB.
  • The self-contained compiler fits within 70MB. Once a model is compiled, it can be deployed using shortfin with no additional dependencies.

Expanding on that with regard to your specific questions and comparisons to other projects:

  • IREE supports programs from multiple ML frameworks, including TensorFlow, TensorFlow Lite / LiteRT, JAX, PyTorch, and ONNX. As a serving framework built on IREE, shortfin can serve those programs too. We have an example of serving a MobileNet ONNX model here: https://github.com/nod-ai/shark-ai/tree/main/shortfin/examples/python/mobilenet_server (a runtime-side sketch follows this list). In the grand scheme of things we would like to support the full matrix of programs off the shelf across all hardware. That said, we are focusing development on the latest popular models (currently SDXL and Llama 3.1, with more to come) on the latest AMD hardware, with our own optimized implementations using sharktank to squeeze as much as we can from the tech stack.
  • This tech stack is fully open source and does not depend on operator libraries (for any backend - CPU/CUDA/ROCm/Vulkan/Metal/etc.). Some other serving frameworks may be closed source or may have such complex dependencies that they recommend installation via carefully managed Docker containers.
  • Implementing the base layers of the runtime and serving stack in systems languages like C and C++ instead of Python allows for more flexible deployment options, on devices ranging from embedded systems up to datacenter servers.
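
And the runtime side, continuing the compile sketch above: the same `.vmfb` can be loaded and invoked through the plain `iree.runtime` Python bindings. This is only a sketch; shortfin's serving apps wrap the same underlying runtime with their own batching and serving logic, and the file/function names here just continue the earlier placeholder:

```python
# Minimal runtime-side sketch: load the compiled .vmfb and invoke the
# exported function through the iree.runtime Python bindings.
import numpy as np
import iree.runtime as ireert

# "local-task" is IREE's multithreaded CPU driver; a GPU deployment would
# select a different driver (e.g. "hip" or "vulkan").
module = ireert.load_vm_flatbuffer_file("simple_mul_cpu.vmfb", driver="local-task")

lhs = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
rhs = np.array([10.0, 20.0, 30.0, 40.0], dtype=np.float32)
result = module.simple_mul(lhs, rhs)
print(np.asarray(result))  # [ 10.  40.  90. 160.]
```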

We'll also be updating the project documentation to highlight this more. Thanks for the feedback.
