
[NOT-BUG] The purpose of this project #755

Open
xushijie opened this issue Jan 6, 2025 · 1 comment

Comments


xushijie commented Jan 6, 2025

I am new to IREE and recently came across this project. My first question is about its motivation: what kinds of problems does this project try to solve?
I know it is a library for model serving, but there are already vLLM and other serving systems, so how does shark-ai distinguish itself from them? I could not find any detailed information about it beyond the fact that AMD acquired it, so I am raising this question.

@ScottTodd
Member

We just published a release that has some context: https://github.com/nod-ai/shark-ai/releases/tag/v3.1.0

The full vertically-integrated SHARK AI stack is now available for deploying machine learning models:

  • The sharktank package builds bridges from popular machine learning models, coming from model repositories like Hugging Face and frameworks like llama.cpp, to the IREE compiler. This model export and compilation pipeline features whole-program optimization and efficient cross-target code generation without depending on operator libraries (a compile-side sketch follows this list).
  • The shortfin package provides serving applications built on top of the IREE runtime, with integration points to other ecosystem projects like the SGLang frontend. These applications are lightweight, portable, and packed with optimizations to improve serving efficiency.
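
To make the compile side concrete, here is a minimal sketch using the IREE Python compiler bindings directly rather than the sharktank exporters; the tiny `simple_mul` MLIR module and the `llvm-cpu` target are illustrative placeholders, not a real model from the stack:

```python
# Minimal compile-side sketch: whole-program compilation of an MLIR module
# into a self-contained IREE VM flatbuffer (.vmfb). The module below is a
# toy placeholder, not a sharktank-exported model.
import iree.compiler as ireec

SIMPLE_MUL_MLIR = """
module @arithmetic {
  func.func @simple_mul(%lhs: tensor<4xf32>, %rhs: tensor<4xf32>) -> tensor<4xf32> {
    %0 = arith.mulf %lhs, %rhs : tensor<4xf32>
    return %0 : tensor<4xf32>
  }
}
"""

# Compile for the CPU backend; choosing a different target_backends entry
# changes the generated code but not the shape of the deployment artifact.
vmfb = ireec.tools.compile_str(SIMPLE_MUL_MLIR, target_backends=["llvm-cpu"])

with open("simple_mul_cpu.vmfb", "wb") as f:
    f.write(vmfb)
```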

Together, these packages simplify model deployment by eliminating the need for complex Docker containers or vendor-specific libraries while continuing to provide competitive performance and flexibility. Here are some metrics:

  • The native shortfin serving library, including a GPU runtime, fits in less than 2MB.
  • The self-contained compiler fits within 70MB. Once a model is compiled, it can be deployed using shortfin with no additional dependencies.

Expanding on that with regard to your specific questions and comparisons to other projects:

  • IREE supports programs from multiple ML frameworks, including TensorFlow, TensorFlow Lite / LiteRT, JAX, PyTorch, and ONNX. As a serving framework built on IREE, shortfin can serve those programs too. We have an example of serving a MobileNet ONNX model here: https://github.com/nod-ai/shark-ai/tree/main/shortfin/examples/python/mobilenet_server (a runtime-side sketch follows this list). In the grand scheme of things we would like to support the full matrix of programs off the shelf across all hardware. That said, we are focusing development on the latest popular models (currently SDXL and Llama 3.1, with more to come) on the latest AMD hardware, with our own optimized implementations using sharktank to squeeze as much as we can from the tech stack.
  • This tech stack is fully open source and does not depend on operator libraries (for any backend - CPU/CUDA/ROCm/Vulkan/Metal/etc.). Some other serving frameworks may be closed source or may have such complex dependencies that they recommend installation via carefully managed Docker containers.
  • Implementing the base layers of the runtime and serving stack in systems languages like C and C++ instead of Python allows for more flexible deployment options, on devices ranging from embedded systems up to datacenter servers.
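
And the runtime side, continuing the compile sketch above: the same `.vmfb` can be loaded and invoked through the plain `iree.runtime` Python bindings. This is only a sketch; shortfin's serving apps wrap the same underlying runtime with their own batching and serving logic, and the file/function names here just continue the earlier placeholder:

```python
# Minimal runtime-side sketch: load the compiled .vmfb and invoke the
# exported function through the iree.runtime Python bindings.
import numpy as np
import iree.runtime as ireert

# "local-task" is IREE's multithreaded CPU driver; a GPU deployment would
# select a different driver (e.g. "hip" or "vulkan").
module = ireert.load_vm_flatbuffer_file("simple_mul_cpu.vmfb", driver="local-task")

lhs = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
rhs = np.array([10.0, 20.0, 30.0, 40.0], dtype=np.float32)
result = module.simple_mul(lhs, rhs)
print(np.asarray(result))  # [ 10.  40.  90. 160.]
```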

We'll also be updating the project documentation to highlight this more. Thanks for the feedback.
