docs for 8.0b2 (apple#2313)

aseemw authored Aug 16, 2024
1 parent 8053bf1 commit 5a3740b
Showing 82 changed files with 480 additions and 201 deletions.
37 changes: 37 additions & 0 deletions docs-guides/_sources/source/mlmodel-utilities.md
@@ -172,3 +172,40 @@ config = cto.coreml.OptimizationConfig(
compressed_mlmodel = cto.coreml.palettize_weights(mlmodel, config)

```

## Bisect Model

In certain scenarios, you may want to break a large Core ML model into two smaller models. For instance, a model deployed to run on the Neural Engine on an iPhone cannot be larger than 1 GB. If you are working with, say, the [Stable Diffusion](https://github.com/apple/ml-stable-diffusion) 1.5 model, which is 1.72 GB in Float16 precision, it needs to be broken up into two chunks, each less than 1 GB. The utility `ct.models.utils.bisect_model` allows you to do exactly that. When using this API, you can also opt in to package the two chunks into a pipeline model, so that the result is still a single mlpackage file, with the two models arranged to run sequentially.

The example below shows how to bisect a model, test the accuracy of the chunks, and save them to disk.

```python

import coremltools as ct

model_path = "my_model.mlpackage"
output_dir = "./output/"

# The following code will produce two smaller models:
# `./output/my_model_chunk1.mlpackage` and `./output/my_model_chunk2.mlpackage`
# It also compares the numerical outputs of the original Core ML model with those of the chunked models.
ct.models.utils.bisect_model(
model_path,
output_dir,
)

# The following code will produce a single pipeline model `./output/my_model_chunked_pipeline.mlpackage`
ct.models.utils.bisect_model(
model_path,
output_dir,
merge_chunks_to_pipeline=True,
)

# You can also pass the MLModel object directly
mlmodel = ct.models.MLModel(model_path)
ct.models.utils.bisect_model(
mlmodel,
output_dir,
merge_chunks_to_pipeline=True,
)
```
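
Once the chunks have been merged into a pipeline, the resulting mlpackage can be loaded and used like any other Core ML model. Below is a minimal sketch; the prediction line is commented out because the input names and shapes depend on your model and are not specified here:

```python
import coremltools as ct

# Load the merged pipeline produced by bisect_model with merge_chunks_to_pipeline=True.
pipeline_model = ct.models.MLModel("./output/my_model_chunked_pipeline.mlpackage")

# Replace "input" with your model's actual input name and provide a correctly shaped array.
# prediction = pipeline_model.predict({"input": example_input_array})
```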
2 changes: 1 addition & 1 deletion docs-guides/_sources/source/opt-palettization-api.md
@@ -83,7 +83,7 @@ from coremltools.optimize.torch.palettization import PostTrainingPalettizer, \
# load model
torch_model = get_torch_model()
palettization_config_dict = {
"global_config": {"n_bits": 4, "granulatity": "per_grouped_channel", "group_size": 4},
"global_config": {"n_bits": 4, "granularity": "per_grouped_channel", "group_size": 4},
}
palettization_config = PostTrainingPalettizerConfig.from_dict(palettization_config_dict)
palettizer = PostTrainingPalettizer(torch_model, palettization_config)
2 changes: 1 addition & 1 deletion docs-guides/_sources/source/opt-quantization-overview.md
@@ -56,5 +56,5 @@ of the network can also be quantized with their own scale factors.

Activations are quantized using `per-tensor` mode. During training, or while passing calibration data through the model, the values of the intermediate activations are observed, and their max and min values are used to compute the quantization scales, which are stored in the model and used during inference. Quantizing the intermediate tensors may help the inference of networks that are bottlenecked by memory bandwidth due to large activations.
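
To make the `per-tensor` computation concrete, here is a minimal NumPy sketch (an illustrative assumption of the simplest affine int8 scheme, not the coremltools implementation) of how an observed min/max maps a float activation tensor to `int8`:

```python
import numpy as np

def quantize_per_tensor_int8(activations: np.ndarray):
    """Affine per-tensor int8 quantization from the observed min/max of a tensor."""
    observed_min = float(activations.min())
    observed_max = float(activations.max())
    # A single scale and zero point for the whole tensor ("per-tensor" mode).
    scale = max((observed_max - observed_min) / 255.0, 1e-8)  # 256 int8 levels
    zero_point = round(-128 - observed_min / scale)
    quantized = np.clip(np.round(activations / scale) + zero_point, -128, 127).astype(np.int8)
    return quantized, scale, zero_point

activations = np.random.randn(1, 64, 32).astype(np.float32)
q, scale, zero_point = quantize_per_tensor_int8(activations)
dequantized = (q.astype(np.float32) - zero_point) * scale  # approximate reconstruction
```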

On newer hardware, e.g. iPhone 15 pro (A17 pro), quantizing both activations and weight to `int8` can leverage optimized compute on the Neural Engine. This can help improve runtime latency in compute-bound models.
On newer hardware with A17 Pro or M4 chips, e.g. iPhone 15 Pro, quantizing both activations and weights to `int8` can leverage optimized compute on the Neural Engine. This can help improve runtime latency in compute-bound models.

6 changes: 3 additions & 3 deletions docs-guides/_sources/source/opt-quantization-perf.md
@@ -10,7 +10,7 @@ compute units (CPU and sometimes GPU) that employ load-time weight decompression
load time, and they need to be decompressed at runtime, slowing down the inference. Therefore it is recommended to use
activation quantization only when your model is fully or mostly running on the Neural Engine (NE).
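
As one way to follow this recommendation, you can restrict the compute units when loading the model so that it is scheduled only on the CPU and Neural Engine. A minimal sketch (the mlpackage path is a placeholder):

```python
import coremltools as ct

# Load a weight-and-activation quantized model restricted to the CPU and
# Neural Engine, following the recommendation above (path is a placeholder).
mlmodel = ct.models.MLModel(
    "my_w8a8_model.mlpackage",
    compute_units=ct.ComputeUnit.CPU_AND_NE,
)
```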

In newer hardware, e.g. iPhone 15 pro (A17 pro) , M4 iPads etc, there is increased throughput possible for int8-int8
In newer hardware with A17 Pro or M4 chips (e.g. iPhone 15 Pro), there is increased throughput possible for int8-int8
compute on the Neural Engine, compared to previous versions. Hence, activation and weight quantization for networks running on
the Neural Engine can give even more latency gains. This can be seen in the [table below](#results) (e.g.,
the ResNet50 model with `W8A8` mode runs considerably faster than its `W16A16` equivalent).
@@ -29,7 +29,7 @@ selection is `all` unless otherwise noted. The latency numbers are sensitive to
on the device state and build versions.

- Device: iPhone 14 Pro (A16), unless otherwise mentioned
- iOS build: iOS17
- iOS build: iOS 17
- Xcode: Xcode 15

For more details on base models and compression methodology, please refer to docs [here](opt-palettization-perf.md).
@@ -56,4 +56,4 @@ For more details on base models and compression methodology, please refer to doc
| [MobileViTv2-1.0](https://ml-assets.apple.com/coreml/quantized_models/post_training_compressed/quantized/MobileViTV2Alpha1WeightOnlySymmetricQuantized.mlpackage.zip) | Weight-only | Post Training | 1.92 | 77.66 | 1.43 | 1.37 |
| [MobileViTv2-1.0](https://ml-assets.apple.com/coreml/quantized_models/training_time_compressed/quantized/MobileViTV2Alpha1SymmetricPerChannel.mlpackage.zip) | [Weight & activation](https://ml-assets.apple.com/coreml/quantized_models/training_time_compressed/quantized/MobileViTV2Alpha1SymmetricPerChannel.yaml) | Training Time | 1.89 | 76.89 ± 0.07 | 1.18 | 1.03 |

**Note**: The trained and compressed models and the `coremltools.optimize.torch` config files used for compression can be downloaded by clicking the respective links embedded in the model and config names.
**Note**: The trained and compressed models and the `coremltools.optimize.torch` config files used for compression can be downloaded by clicking the respective links embedded in the model and config names.
213 changes: 213 additions & 0 deletions docs-guides/_sources/source/stateful-models.md
@@ -227,6 +227,219 @@ potential runtime performance improvements.
For instance, please check out the [2024 WWDC session](https://developer.apple.com/videos/play/wwdc2024/10159/) for an
example that uses the [Mistral 7B model](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)
and utilizes the stateful prediction feature for improved performance on a GPU on a MacBook Pro.
The code for converting and deploying the Mistral 7B model is released in the [Hugging Face Mistral7B Example](
https://github.com/huggingface/swift-transformers/blob/preview/Examples/Mistral7B/export.py) along with the
[blog article](https://huggingface.co/blog/mistral-coreml).

## Example: Toy Attention Model with Stateful KV-Cache

To help you better understand how to make an attention model stateful with a key cache and a value cache, here is a
toy example.

We start with a toy model with simple attention, where the `query`, `key`, and `value` are each computed by a linear layer
and then fed into `scaled_dot_product_attention`.
This toy example focuses only on the stateful kv-cache, so it omits other details such as multiple heads, multiple layers,
positional encoding, the final logits, etc.
```python
import torch
import torch.nn as nn

class SimpleAttention(nn.Module):
def __init__(self, embed_size):
super().__init__()
self.query = nn.Linear(embed_size, embed_size)
self.key = nn.Linear(embed_size, embed_size)
self.value = nn.Linear(embed_size, embed_size)

def forward(self, x):
Q = self.query(x) # (batch_size, seq_len, embed_size)
K = self.key(x) # (batch_size, seq_len, embed_size)
V = self.value(x) # (batch_size, seq_len, embed_size)
return torch.nn.functional.scaled_dot_product_attention(Q, K, V)

class ToyModel(nn.Module):
def __init__(self, vocab_size, embed_size):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embed_size)
self.attention = SimpleAttention(embed_size)
self.fc = nn.Linear(embed_size, embed_size)

def forward(self, x):
embedded = self.embedding(x)
attention_output = self.attention(embedded)
return self.fc(attention_output)
```

To use a key cache and value cache for attention, we can write a new class `SimpleAttentionWithKeyValueCache` that
inherits from `SimpleAttention`. In its `forward` function we re-use the previously computed `k` and `v`, and write the
newly computed `k` and `v` back into the cache.
```python
class SimpleAttentionWithKeyValueCache(SimpleAttention):
"""Add kv-cache into SimpleAttention."""

def forward(self, x, attention_mask, k_cache, v_cache):
Q = self.query(x)
newly_computed_k = self.key(x)
newly_computed_v = self.value(x)

# Update kv-cache in-place.
q_len = Q.shape[-2]
end_step = attention_mask.shape[-1]
past_kv_len = end_step - q_len
k_cache[:, past_kv_len:end_step, :] = newly_computed_k
v_cache[:, past_kv_len:end_step, :] = newly_computed_v

# The K and V we need is (batch_size, q_len + past_kv_len, embed_size).
K = k_cache[:, :end_step, :]
V = v_cache[:, :end_step, :]

return torch.nn.functional.scaled_dot_product_attention(
Q, K, V, attn_mask=attention_mask
)
```

The toy model with the kv-cache then looks like this:
```python
class ToyModelWithKeyValueCache(nn.Module):
def __init__(self, vocab_size, embed_size, batch_size, max_seq_len):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embed_size)
self.attention = SimpleAttentionWithKeyValueCache(embed_size)
self.fc = nn.Linear(embed_size, embed_size)

self.kvcache_shape = (batch_size, max_seq_len, embed_size)
self.register_buffer("k_cache", torch.zeros(self.kvcache_shape))
self.register_buffer("v_cache", torch.zeros(self.kvcache_shape))

def forward(
self,
input_ids, # [batch_size, seq_len]
causal_mask, # [batch_size, seq_len, seq_len + past_kv_len]
):
embedded = self.embedding(input_ids)
attention_output = self.attention(embedded, causal_mask, self.k_cache, self.v_cache)
return self.fc(attention_output)
```

Now let's compare the speed of the original model with that of the stateful kv-cache model.

First we set up some hyper-parameters:
```python
vocab_size = 32000
embed_size = 1024
batch_size = 1
seq_len = 5
max_seq_len = 1024
num_iterations = 100
```

The original model can be initialized and converted with the following code snippet:
```python
import numpy as np
import coremltools as ct

torch_model = ToyModel(vocab_size, embed_size)
torch_model.eval()
input_ids = torch.randint(0, vocab_size, (batch_size, seq_len))
torch_output = torch_model(input_ids).detach().numpy()
traced_model = torch.jit.trace(torch_model, [input_ids])
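# Allow the sequence length to vary at prediction time, from a single token up to max_seq_len.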
query_length = ct.RangeDim(lower_bound=1, upper_bound=max_seq_len, default=1)
inputs = [ct.TensorType(shape=(batch_size, query_length), dtype=np.int32, name="input_ids")]
outputs = [ct.TensorType(dtype=np.float16, name="output")]

converted_model = ct.convert(
traced_model,
inputs=inputs,
outputs=outputs,
minimum_deployment_target=ct.target.iOS18,
compute_units=ct.ComputeUnit.CPU_AND_GPU,
)
```
Notice that `minimum_deployment_target=ct.target.iOS18` is not necessary if you only
want to use the stateless model, as stateless models are supported before iOS 18.
Here we set it just for a fair comparison with the stateful kv-cache model later.

We can time the predictions of the stateless model:
```python
from time import perf_counter

t_start: float = perf_counter()
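# Without a kv-cache, the full token sequence so far is re-fed and re-processed at every step.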
for token_id in range(num_iterations):
inputs = {"input_ids": np.array([list(range(token_id + 1))], dtype=np.int32)}
converted_model.predict(inputs)
print(f"Time without kv-cache: {(perf_counter() - t_start) * 1.0e3} ms")
```

Now let's initialize and convert the stateful kv-cache model in a similar way:
```python
past_kv_len = 0
torch_model_kvcache = ToyModelWithKeyValueCache(
vocab_size, embed_size, batch_size, max_seq_len
)
torch_model_kvcache.load_state_dict(torch_model.state_dict(), strict=False)
torch_model_kvcache.eval()
causal_mask = torch.zeros((batch_size, seq_len, seq_len + past_kv_len), dtype=torch.float32)

# Make sure the output matches the non-kv-cache version.
torch_kvcache_output = torch_model_kvcache(input_ids, causal_mask).detach().numpy()
np.testing.assert_allclose(torch_output, torch_kvcache_output)

traced_model_kvcache = torch.jit.trace(torch_model_kvcache, [input_ids, causal_mask])
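# Both the query length and the attention-mask length are flexible at prediction time.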
query_length = ct.RangeDim(lower_bound=1, upper_bound=max_seq_len, default=1)
end_step_dim = ct.RangeDim(lower_bound=1, upper_bound=max_seq_len, default=1)
inputs = [
ct.TensorType(shape=(batch_size, query_length), dtype=np.int32, name="input_ids"),
ct.TensorType(
shape=(batch_size, query_length, end_step_dim), dtype=np.float16, name="causal_mask"
),
]
outputs = [ct.TensorType(dtype=np.float16, name="output")]

# In addition to `inputs` and `outputs`, we need `states` which uses the same name as the
# registered buffers in `ToyModelWithKeyValueCache`.
states = [
ct.StateType(
wrapped_type=ct.TensorType(
shape=torch_model_kvcache.kvcache_shape, dtype=np.float16
),
name="k_cache",
),
ct.StateType(
wrapped_type=ct.TensorType(
shape=torch_model_kvcache.kvcache_shape, dtype=np.float16
),
name="v_cache",
),
]
converted_model_kvcache = ct.convert(
traced_model_kvcache,
inputs=inputs,
outputs=outputs,
states=states,
minimum_deployment_target=ct.target.iOS18,
compute_units=ct.ComputeUnit.CPU_AND_GPU,
)
```

We can also time the predictions of this stateful kv-cache model:
```python
past_kv_len = 0
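# Create a fresh state object holding the k_cache and v_cache buffers for this sequence.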
kv_cache_state = converted_model_kvcache.make_state()
t_start: float = perf_counter()
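# With the kv-cache held in model state, only the newest token is fed at each step.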
for token_id in range(num_iterations):
inputs = {
"input_ids": np.array([[token_id]], dtype=np.int32),
"causal_mask": np.zeros((1, 1, past_kv_len + 1), dtype=np.float16),
}
converted_model_kvcache.predict(inputs, kv_cache_state)
past_kv_len += 1
print(f"Time with kv-cache: {(perf_counter() - t_start) * 1.0e3} ms")
```

After running the predictions, we get output similar to the following (measured on a MacBook Pro with an M3 Max chip):
```text
Time without kv-cache: 4245.6 ms
Time with kv-cache: 238.0 ms
```
This example demonstrates how to modify the attention module to get a stateful model with a kv-cache, which runs much
faster than the original stateless model.
2 changes: 1 addition & 1 deletion docs-guides/index.html
@@ -7,7 +7,7 @@

<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="generator" content="Docutils 0.18.1: http://docutils.sourceforge.net/" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="viewport" content="width=device-width, initial-scale=1" />

<title>Core ML Tools &#8212; Guide to Core ML Tools</title>

2 changes: 1 addition & 1 deletion docs-guides/searchindex.js

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs-guides/source/classifiers.html
@@ -7,7 +7,7 @@

<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="generator" content="Docutils 0.18.1: http://docutils.sourceforge.net/" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="viewport" content="width=device-width, initial-scale=1" />

<title>Classifiers &#8212; Guide to Core ML Tools</title>

@@ -7,7 +7,7 @@

<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="generator" content="Docutils 0.18.1: http://docutils.sourceforge.net/" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="viewport" content="width=device-width, initial-scale=1" />

<title>Comparing ML Programs and Neural Networks &#8212; Guide to Core ML Tools</title>

8 changes: 5 additions & 3 deletions docs-guides/source/composite-operators.html
@@ -7,7 +7,7 @@

<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="generator" content="Docutils 0.18.1: http://docutils.sourceforge.net/" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="viewport" content="width=device-width, initial-scale=1" />

<title>Composite Operators &#8212; Guide to Core ML Tools</title>

@@ -464,10 +464,12 @@ <h2>Import and Convert the Pre-trained Model<a class="headerlink" href="#import-
<span id="index-1"></span><h2>Decompose into Existing MIL Operators<a class="headerlink" href="#decompose-into-existing-mil-operators" title="Permalink to this heading">#</a></h2>
<p>The TensorFlow <a class="reference external" href="https://www.tensorflow.org/api_docs/java/org/tensorflow/op/core/Einsum">documentation on Einsum</a> refers to Einstein summation notation. You can use this notation to represent a variety of tensor operations such as <code class="docutils literal notranslate"><span class="pre">reduce_sum</span></code>, <code class="docutils literal notranslate"><span class="pre">transpose</span></code>, and <code class="docutils literal notranslate"><span class="pre">trace</span></code>, using a string. Einsum is usually a complicated operation, but with this example you don’t need to know all the possible cases, just the particular notation that this model uses.</p>
<p>The error trace shows that the model uses the following notation for Einsum:</p>
<a class="imgnoborder reference internal image-reference" href="../_images/first_eq_300.png"><img alt="Notation for Einsum" class="imgnoborder align-center" src="../_images/first_eq_300.png" style="width: 400px;" /></a>
<a class="imgnoborder reference internal image-reference" href="../_images/first_eq_300.png"><img alt="Notation for Einsum" class="imgnoborder align-center" src="../_images/first_eq_300.png" style="width: 400px;" />
</a>
<hr class="docutils" />
<p>The above notation translates into the following mathematical expression:</p>
<a class="imgnoborder reference internal image-reference" href="../_images/second_eq_300.png"><img alt="Math expression" class="imgnoborder align-center" src="../_images/second_eq_300.png" style="width: 600px;" /></a>
<a class="imgnoborder reference internal image-reference" href="../_images/second_eq_300.png"><img alt="Math expression" class="imgnoborder align-center" src="../_images/second_eq_300.png" style="width: 600px;" />
</a>
<hr class="docutils" />
<p>While the above may look complicated, it is essentially a batched matrix multiplication with a transpose on the second input:</p>
<img alt="Batched matrix multiplication" class="imgnoborder align-center" src="../_images/third_eq_300.png" />
2 changes: 1 addition & 1 deletion docs-guides/source/conversion-options.html
@@ -7,7 +7,7 @@

<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="generator" content="Docutils 0.18.1: http://docutils.sourceforge.net/" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="viewport" content="width=device-width, initial-scale=1" />

<title>Conversion Options &#8212; Guide to Core ML Tools</title>

@@ -7,7 +7,7 @@

<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="generator" content="Docutils 0.18.1: http://docutils.sourceforge.net/" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="viewport" content="width=device-width, initial-scale=1" />

<title>Converting a TensorFlow 1 DeepSpeech Model &#8212; Guide to Core ML Tools</title>

@@ -7,7 +7,7 @@

<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="generator" content="Docutils 0.18.1: http://docutils.sourceforge.net/" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="viewport" content="width=device-width, initial-scale=1" />

<title>Converting a TensorFlow 1 Image Classifier &#8212; Guide to Core ML Tools</title>
