docs for 8.0b2 (apple#2313)

aseemw authored Aug 16, 2024
1 parent 8053bf1 commit 5a3740b
Showing 82 changed files with 480 additions and 201 deletions.
37 changes: 37 additions & 0 deletions docs-guides/_sources/source/mlmodel-utilities.md
@@ -172,3 +172,40 @@ config = cto.coreml.OptimizationConfig(
compressed_mlmodel = cto.coreml.palettize_weights(mlmodel, config)

```

## Bisect Model

In certain scenarios, you may want to break a large Core ML model into two smaller models. For instance, a model deployed to run on the Neural Engine on an iPhone cannot be larger than 1 GB. If you are working with, say, the [Stable Diffusion](https://github.com/apple/ml-stable-diffusion) 1.5 model, which is 1.72 GB in Float16 precision, it needs to be broken up into two chunks, each less than 1 GB. The utility `ct.models.utils.bisect_model` allows you to do exactly that. When using this API, you can also opt in to package the two chunks into a pipeline model, so that the result is still a single mlpackage file, with the two models arranged to run sequentially.

The example below shows how to bisect a model, test the accuracy of the chunks, and save them to disk.

```python

import coremltools as ct

model_path = "my_model.mlpackage"
output_dir = "./output/"

# The following code will produce two smaller models:
# `./output/my_model_chunk1.mlpackage` and `./output/my_model_chunk2.mlpackage`
# It also compares the numerical outputs of the original Core ML model with those of the chunked models.
ct.models.utils.bisect_model(
model_path,
output_dir,
)

# The following code will produce a single pipeline model `./output/my_model_chunked_pipeline.mlpackage`
ct.models.utils.bisect_model(
model_path,
output_dir,
merge_chunks_to_pipeline=True,
)

# You can also pass the MLModel object directly
mlmodel = ct.models.MLModel(model_path)
ct.models.utils.bisect_model(
mlmodel,
output_dir,
merge_chunks_to_pipeline=True,
)
```
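
Once the chunks have been merged into a pipeline, the resulting mlpackage can be loaded and used like any other Core ML model. Below is a minimal sketch; the prediction line is commented out because the input names and shapes depend on your model and are not specified here:

```python
import coremltools as ct

# Load the merged pipeline produced by bisect_model with merge_chunks_to_pipeline=True.
pipeline_model = ct.models.MLModel("./output/my_model_chunked_pipeline.mlpackage")

# Replace "input" with your model's actual input name and provide a correctly shaped array.
# prediction = pipeline_model.predict({"input": example_input_array})
```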
2 changes: 1 addition & 1 deletion docs-guides/_sources/source/opt-palettization-api.md
@@ -83,7 +83,7 @@ from coremltools.optimize.torch.palettization import PostTrainingPalettizer, \
# load model
torch_model = get_torch_model()
palettization_config_dict = {
"global_config": {"n_bits": 4, "granulatity": "per_grouped_channel", "group_size": 4},
"global_config": {"n_bits": 4, "granularity": "per_grouped_channel", "group_size": 4},
}
palettization_config = PostTrainingPalettizerConfig.from_dict(palettization_config_dict)
palettizer = PostTrainingPalettizer(torch_model, palettization_config)
2 changes: 1 addition & 1 deletion docs-guides/_sources/source/opt-quantization-overview.md
@@ -56,5 +56,5 @@ of the network can also be quantized with their own scale factors.

Activations are quantized using `per-tensor` mode. During training, or while passing calibration data through the model, the values of the intermediate activations are observed, and their max and min values are used to compute the quantization scales, which are stored in the model and used during inference. Quantizing the intermediate tensors may help the inference of networks that are bottlenecked by memory bandwidth due to large activations.
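
To make the `per-tensor` computation concrete, here is a minimal NumPy sketch (an illustrative assumption of the simplest affine int8 scheme, not the coremltools implementation) of how an observed min/max maps a float activation tensor to `int8`:

```python
import numpy as np

def quantize_per_tensor_int8(activations: np.ndarray):
    """Affine per-tensor int8 quantization from the observed min/max of a tensor."""
    observed_min = float(activations.min())
    observed_max = float(activations.max())
    # A single scale and zero point for the whole tensor ("per-tensor" mode).
    scale = max((observed_max - observed_min) / 255.0, 1e-8)  # 256 int8 levels
    zero_point = round(-128 - observed_min / scale)
    quantized = np.clip(np.round(activations / scale) + zero_point, -128, 127).astype(np.int8)
    return quantized, scale, zero_point

activations = np.random.randn(1, 64, 32).astype(np.float32)
q, scale, zero_point = quantize_per_tensor_int8(activations)
dequantized = (q.astype(np.float32) - zero_point) * scale  # approximate reconstruction
```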

On newer hardware, e.g. iPhone 15 pro (A17 pro), quantizing both activations and weight to `int8` can leverage optimized compute on the Neural Engine. This can help improve runtime latency in compute-bound models.
On newer hardware with A17 Pro or M4 chips, e.g. iPhone 15 Pro, quantizing both activations and weights to `int8` can leverage optimized compute on the Neural Engine. This can help improve runtime latency in compute-bound models.

6 changes: 3 additions & 3 deletions docs-guides/_sources/source/opt-quantization-perf.md
@@ -10,7 +10,7 @@ compute units (CPU and sometimes GPU) that employ load-time weight decompression
load time, and they need to be decompressed at runtime, slowing down the inference. Therefore it is recommended to use
activation quantization only when your model is fully or mostly running on the Neural Engine (NE).
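
As one way to follow this recommendation, you can restrict the compute units when loading the model so that it is scheduled only on the CPU and Neural Engine. A minimal sketch (the mlpackage path is a placeholder):

```python
import coremltools as ct

# Load a weight-and-activation quantized model restricted to the CPU and
# Neural Engine, following the recommendation above (path is a placeholder).
mlmodel = ct.models.MLModel(
    "my_w8a8_model.mlpackage",
    compute_units=ct.ComputeUnit.CPU_AND_NE,
)
```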

In newer hardware, e.g. iPhone 15 pro (A17 pro) , M4 iPads etc, there is increased throughput possible for int8-int8
In newer hardware with A17 Pro or M4 chips (e.g. iPhone 15 Pro), there is increased throughput possible for int8-int8
compute on the Neural Engine, compared to previous versions. Hence, activation and weight quantization for networks running on
the Neural Engine can give even more latency gains. This can be seen in the [table below](#results) (e.g.,
the ResNet50 model with `W8A8` mode runs considerably faster than its `W16A16` equivalent).
@@ -29,7 +29,7 @@ selection is `all` unless otherwise noted. The latency numbers are sensitive to
on the device state and build versions.

- Device: iPhone 14 Pro (A16), unless otherwise mentioned
- iOS build: iOS17
- iOS build: iOS 17
- Xcode: Xcode 15

For more details on base models and compression methodology, please refer to docs [here](opt-palettization-perf.md).
@@ -56,4 +56,4 @@ For more details on base models and compression methodology, please refer to doc
| [MobileViTv2-1.0](https://ml-assets.apple.com/coreml/quantized_models/post_training_compressed/quantized/MobileViTV2Alpha1WeightOnlySymmetricQuantized.mlpackage.zip) | Weight-only | Post Training | 1.92 | 77.66 | 1.43 | 1.37 |
| [MobileViTv2-1.0](https://ml-assets.apple.com/coreml/quantized_models/training_time_compressed/quantized/MobileViTV2Alpha1SymmetricPerChannel.mlpackage.zip) | [Weight & activation](https://ml-assets.apple.com/coreml/quantized_models/training_time_compressed/quantized/MobileViTV2Alpha1SymmetricPerChannel.yaml) | Training Time | 1.89 | 76.89 ± 0.07 | 1.18 | 1.03 |

**Note**: The trained and compressed models and the `coremltools.optimize.torch` config files used for compression can be downloaded by clicking the respective links embedded in the model and config names.
**Note**: The trained and compressed models and the `coremltools.optimize.torch` config files used for compression can be downloaded by clicking the respective links embedded in the model and config names.
213 changes: 213 additions & 0 deletions docs-guides/_sources/source/stateful-models.md
@@ -227,6 +227,219 @@ potential runtime performance improvements.
For instance, please check out the [2024 WWDC session](https://developer.apple.com/videos/play/wwdc2024/10159/) for an
example that uses the [Mistral 7B model](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)
and utilizes the stateful prediction feature for improved performance on a GPU on a MacBook Pro.
The code for converting and deploying the Mistral 7B model is released in the [Hugging Face Mistral7B Example](
https://github.com/huggingface/swift-transformers/blob/preview/Examples/Mistral7B/export.py) along with the
[blog article](https://huggingface.co/blog/mistral-coreml).

## Example: Toy Attention Model with Stateful KV-Cache

To help you better understand how to make an attention model stateful with a key cache and a value cache, here is a
toy example.

We start with a toy model with simple attention, where the `query`, `key`, and `value` are each computed by a linear layer
and then fed into `scaled_dot_product_attention`.
This toy example focuses only on the stateful kv-cache, so it omits other details such as multiple heads, multiple layers,
positional encoding, the final logits, etc.
```python
import torch
import torch.nn as nn

class SimpleAttention(nn.Module):
def __init__(self, embed_size):
super().__init__()
self.query = nn.Linear(embed_size, embed_size)
self.key = nn.Linear(embed_size, embed_size)
self.value = nn.Linear(embed_size, embed_size)

def forward(self, x):
Q = self.query(x) # (batch_size, seq_len, embed_size)
K = self.key(x) # (batch_size, seq_len, embed_size)
V = self.value(x) # (batch_size, seq_len, embed_size)
return torch.nn.functional.scaled_dot_product_attention(Q, K, V)

class ToyModel(nn.Module):
def __init__(self, vocab_size, embed_size):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embed_size)
self.attention = SimpleAttention(embed_size)
self.fc = nn.Linear(embed_size, embed_size)

def forward(self, x):
embedded = self.embedding(x)
attention_output = self.attention(embedded)
return self.fc(attention_output)
```

To use a key cache and value cache for attention, we can write a new class `SimpleAttentionWithKeyValueCache` that
inherits from `SimpleAttention`. In its `forward` function we re-use the previously computed `k` and `v`, and write the
newly computed `k` and `v` back into the cache.
```python
class SimpleAttentionWithKeyValueCache(SimpleAttention):
"""Add kv-cache into SimpleAttention."""

def forward(self, x, attention_mask, k_cache, v_cache):
Q = self.query(x)
newly_computed_k = self.key(x)
newly_computed_v = self.value(x)

# Update kv-cache in-place.
q_len = Q.shape[-2]
end_step = attention_mask.shape[-1]
past_kv_len = end_step - q_len
k_cache[:, past_kv_len:end_step, :] = newly_computed_k
v_cache[:, past_kv_len:end_step, :] = newly_computed_v

# The K and V we need is (batch_size, q_len + past_kv_len, embed_size).
K = k_cache[:, :end_step, :]
V = v_cache[:, :end_step, :]

return torch.nn.functional.scaled_dot_product_attention(
Q, K, V, attn_mask=attention_mask
)
```

The toy model with the kv-cache then looks like this:
```python
class ToyModelWithKeyValueCache(nn.Module):
def __init__(self, vocab_size, embed_size, batch_size, max_seq_len):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embed_size)
self.attention = SimpleAttentionWithKeyValueCache(embed_size)
self.fc = nn.Linear(embed_size, embed_size)

self.kvcache_shape = (batch_size, max_seq_len, embed_size)
self.register_buffer("k_cache", torch.zeros(self.kvcache_shape))
self.register_buffer("v_cache", torch.zeros(self.kvcache_shape))

def forward(
self,
input_ids, # [batch_size, seq_len]
causal_mask, # [batch_size, seq_len, seq_len + past_kv_len]
):
embedded = self.embedding(input_ids)
attention_output = self.attention(embedded, causal_mask, self.k_cache, self.v_cache)
return self.fc(attention_output)
```

Now let's compare the speed of the original model with that of the stateful kv-cache model.

First we set up some hyper-parameters:
```python
vocab_size = 32000
embed_size = 1024
batch_size = 1
seq_len = 5
max_seq_len = 1024
num_iterations = 100
```

The original model can be initialized and converted with the following code snippet:
```python
import numpy as np
import coremltools as ct

torch_model = ToyModel(vocab_size, embed_size)
torch_model.eval()
input_ids = torch.randint(0, vocab_size, (batch_size, seq_len))
torch_output = torch_model(input_ids).detach().numpy()
traced_model = torch.jit.trace(torch_model, [input_ids])
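# Allow the sequence length to vary at prediction time, from a single token up to max_seq_len.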
query_length = ct.RangeDim(lower_bound=1, upper_bound=max_seq_len, default=1)
inputs = [ct.TensorType(shape=(batch_size, query_length), dtype=np.int32, name="input_ids")]
outputs = [ct.TensorType(dtype=np.float16, name="output")]

converted_model = ct.convert(
traced_model,
inputs=inputs,
outputs=outputs,
minimum_deployment_target=ct.target.iOS18,
compute_units=ct.ComputeUnit.CPU_AND_GPU,
)
```
Notice that `minimum_deployment_target=ct.target.iOS18` is not necessary if you only
want to use the stateless model, as stateless models are supported before iOS 18.
Here we set it just for a fair comparison with the stateful kv-cache model later.

We can time the predictions of the stateless model:
```python
from time import perf_counter

t_start: float = perf_counter()
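# Without a kv-cache, the full token sequence so far is re-fed and re-processed at every step.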
for token_id in range(num_iterations):
inputs = {"input_ids": np.array([list(range(token_id + 1))], dtype=np.int32)}
converted_model.predict(inputs)
print(f"Time without kv-cache: {(perf_counter() - t_start) * 1.0e3} ms")
```

Now let's initialize and convert the stateful kv-cache model in a similar way:
```python
past_kv_len = 0
torch_model_kvcache = ToyModelWithKeyValueCache(
vocab_size, embed_size, batch_size, max_seq_len
)
torch_model_kvcache.load_state_dict(torch_model.state_dict(), strict=False)
torch_model_kvcache.eval()
causal_mask = torch.zeros((batch_size, seq_len, seq_len + past_kv_len), dtype=torch.float32)

# Make sure the output matches the non-kv-cache version.
torch_kvcache_output = torch_model_kvcache(input_ids, causal_mask).detach().numpy()
np.testing.assert_allclose(torch_output, torch_kvcache_output)

traced_model_kvcache = torch.jit.trace(torch_model_kvcache, [input_ids, causal_mask])
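# Both the query length and the attention-mask length are flexible at prediction time.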
query_length = ct.RangeDim(lower_bound=1, upper_bound=max_seq_len, default=1)
end_step_dim = ct.RangeDim(lower_bound=1, upper_bound=max_seq_len, default=1)
inputs = [
ct.TensorType(shape=(batch_size, query_length), dtype=np.int32, name="input_ids"),
ct.TensorType(
shape=(batch_size, query_length, end_step_dim), dtype=np.float16, name="causal_mask"
),
]
outputs = [ct.TensorType(dtype=np.float16, name="output")]

# In addition to `inputs` and `outputs`, we need `states` which uses the same name as the
# registered buffers in `ToyModelWithKeyValueCache`.
states = [
ct.StateType(
wrapped_type=ct.TensorType(
shape=torch_model_kvcache.kvcache_shape, dtype=np.float16
),
name="k_cache",
),
ct.StateType(
wrapped_type=ct.TensorType(
shape=torch_model_kvcache.kvcache_shape, dtype=np.float16
),
name="v_cache",
),
]
converted_model_kvcache = ct.convert(
traced_model_kvcache,
inputs=inputs,
outputs=outputs,
states=states,
minimum_deployment_target=ct.target.iOS18,
compute_units=ct.ComputeUnit.CPU_AND_GPU,
)
```

We can also time the predictions of this stateful kv-cache model:
```python
past_kv_len = 0
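# Create a fresh state object holding the k_cache and v_cache buffers for this sequence.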
kv_cache_state = converted_model_kvcache.make_state()
t_start: float = perf_counter()
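# With the kv-cache held in model state, only the newest token is fed at each step.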
for token_id in range(num_iterations):
inputs = {
"input_ids": np.array([[token_id]], dtype=np.int32),
"causal_mask": np.zeros((1, 1, past_kv_len + 1), dtype=np.float16),
}
converted_model_kvcache.predict(inputs, kv_cache_state)
past_kv_len += 1
print(f"Time with kv-cache: {(perf_counter() - t_start) * 1.0e3} ms")
```

After running the predictions, we get output similar to the following (measured on a MacBook Pro with an M3 Max chip):
```text
Time without kv-cache: 4245.6 ms
Time with kv-cache: 238.0 ms
```
This example demonstrates how to modify the attention module to get a stateful model with a kv-cache, which runs much
faster than the original stateless model.
2 changes: 1 addition & 1 deletion docs-guides/index.html
@@ -7,7 +7,7 @@

<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="generator" content="Docutils 0.18.1: http://docutils.sourceforge.net/" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="viewport" content="width=device-width, initial-scale=1" />

<title>Core ML Tools &#8212; Guide to Core ML Tools</title>

2 changes: 1 addition & 1 deletion docs-guides/searchindex.js

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs-guides/source/classifiers.html
@@ -7,7 +7,7 @@

<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="generator" content="Docutils 0.18.1: http://docutils.sourceforge.net/" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="viewport" content="width=device-width, initial-scale=1" />

<title>Classifiers &#8212; Guide to Core ML Tools</title>

@@ -7,7 +7,7 @@

<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="generator" content="Docutils 0.18.1: http://docutils.sourceforge.net/" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="viewport" content="width=device-width, initial-scale=1" />

<title>Comparing ML Programs and Neural Networks &#8212; Guide to Core ML Tools</title>

8 changes: 5 additions & 3 deletions docs-guides/source/composite-operators.html
@@ -7,7 +7,7 @@

<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="generator" content="Docutils 0.18.1: http://docutils.sourceforge.net/" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="viewport" content="width=device-width, initial-scale=1" />

<title>Composite Operators &#8212; Guide to Core ML Tools</title>

@@ -464,10 +464,12 @@ <h2>Import and Convert the Pre-trained Model<a class="headerlink" href="#import-
<span id="index-1"></span><h2>Decompose into Existing MIL Operators<a class="headerlink" href="#decompose-into-existing-mil-operators" title="Permalink to this heading">#</a></h2>
<p>The TensorFlow <a class="reference external" href="https://www.tensorflow.org/api_docs/java/org/tensorflow/op/core/Einsum">documentation on Einsum</a> refers to Einstein summation notation. You can use this notation to represent a variety of tensor operations such as <code class="docutils literal notranslate"><span class="pre">reduce_sum</span></code>, <code class="docutils literal notranslate"><span class="pre">transpose</span></code>, and <code class="docutils literal notranslate"><span class="pre">trace</span></code>, using a string. Einsum is usually a complicated operation, but with this example you don’t need to know all the possible cases, just the particular notation that this model uses.</p>
<p>The error trace shows that the model uses the following notation for Einsum:</p>
<a class="imgnoborder reference internal image-reference" href="../_images/first_eq_300.png"><img alt="Notation for Einsum" class="imgnoborder align-center" src="../_images/first_eq_300.png" style="width: 400px;" /></a>
<a class="imgnoborder reference internal image-reference" href="../_images/first_eq_300.png"><img alt="Notation for Einsum" class="imgnoborder align-center" src="../_images/first_eq_300.png" style="width: 400px;" />
</a>
<hr class="docutils" />
<p>The above notation translates into the following mathematical expression:</p>
<a class="imgnoborder reference internal image-reference" href="../_images/second_eq_300.png"><img alt="Math expression" class="imgnoborder align-center" src="../_images/second_eq_300.png" style="width: 600px;" /></a>
<a class="imgnoborder reference internal image-reference" href="../_images/second_eq_300.png"><img alt="Math expression" class="imgnoborder align-center" src="../_images/second_eq_300.png" style="width: 600px;" />
</a>
<hr class="docutils" />
<p>While the above may look complicated, it is essentially a batched matrix multiplication with a transpose on the second input:</p>
<img alt="Batched matrix multiplication" class="imgnoborder align-center" src="../_images/third_eq_300.png" />
2 changes: 1 addition & 1 deletion docs-guides/source/conversion-options.html
@@ -7,7 +7,7 @@

<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="generator" content="Docutils 0.18.1: http://docutils.sourceforge.net/" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="viewport" content="width=device-width, initial-scale=1" />

<title>Conversion Options &#8212; Guide to Core ML Tools</title>

@@ -7,7 +7,7 @@

<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="generator" content="Docutils 0.18.1: http://docutils.sourceforge.net/" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="viewport" content="width=device-width, initial-scale=1" />

<title>Converting a TensorFlow 1 DeepSpeech Model &#8212; Guide to Core ML Tools</title>

@@ -7,7 +7,7 @@

<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="generator" content="Docutils 0.18.1: http://docutils.sourceforge.net/" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="viewport" content="width=device-width, initial-scale=1" />

<title>Converting a TensorFlow 1 Image Classifier &#8212; Guide to Core ML Tools</title>
