optimize performance of mi300
samos123 committed Jan 19, 2025
1 parent da7d1d4 commit b37a1bf
Showing 4 changed files with 63 additions and 2 deletions.
charts/kubeai/values.yaml (4 additions, 1 deletion)

@@ -48,7 +48,10 @@ modelServers:
       # Source: https://github.com/drikster80/vllm/tree/gh200-docker
       # gh200: "drikster80/vllm-gh200-openai:v0.6.4.post1"
       gh200: "substratusai/vllm-gh200-openai:v0.6.4.post1"
-      amd-gpu: "substratusai/vllm-rocm:v0.6.6.post1"
+      # upstream vLLM seems to have broken ROCm support, so we are using a fork from AMD.
+      # Source: https://hub.docker.com/r/rocm/vllm-dev
+      # Source: https://github.com/ROCm/vllm
+      amd-gpu: substratusai/vllm-rocm:nightly_main_20250117
   OLlama:
     images:
       default: "ollama/ollama:latest"
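If you want to pin or override this image without editing the chart, a Helm `--set` override works; a minimal sketch, assuming the image map shown above lives under `modelServers.VLLM.images` (the hunk only shows the `images` block, so the parent key is inferred from the published chart):

```
helm upgrade --install kubeai kubeai/kubeai \
  --set modelServers.VLLM.images.amd-gpu=substratusai/vllm-rocm:nightly_main_20250117 \
  --wait
```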
charts/models/values.yaml (22 additions)

@@ -193,6 +193,28 @@ catalog:
     # You can also use nvidia-gpu-a100-80gb:8
     resourceProfile: nvidia-gpu-h100:8
     targetRequests: 500
+  llama-3.1-70b-instruct-fp8-mi300x:
+    enabled: false
+    features: [TextGeneration]
+    url: hf://amd/Llama-3.1-70B-Instruct-FP8-KV
+    engine: VLLM
+    env:
+      HIP_FORCE_DEV_KERNARG: "1"
+      NCCL_MIN_NCHANNELS: "112"
+      TORCH_BLAS_PREFER_HIPBLASLT: "1"
+      VLLM_USE_TRITON_FLASH_ATTN: "0"
+    args:
+      - --max-model-len=120000
+      - --max-num-batched-tokens=120000
+      - --max-num-seqs=1024
+      - --num-scheduler-steps=15
+      - --gpu-memory-utilization=0.9
+      - --disable-log-requests
+      - --kv-cache-dtype=fp8
+      - --enable-chunked-prefill=false
+      - --max-seq-len-to-capture=16384
+    resourceProfile: amd-gpu-mi300x:1
+    targetRequests: 1024
   llama-3.1-70b-instruct-fp8-gh200:
     enabled: false
     features: [TextGeneration]
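The new catalog entry ships disabled (`enabled: false`), and its `env` block appears to follow AMD's published MI300X tuning guidance for vLLM (device-side kernel arguments, more NCCL channels, hipBLASLt GEMMs, and the ROCm flash-attention path instead of the Triton one). A minimal sketch of enabling it, assuming the standard `kubeai/models` chart install flow:

```
cat <<'EOF' > kubeai-models-values.yaml
catalog:
  llama-3.1-70b-instruct-fp8-mi300x:
    enabled: true
EOF
helm upgrade --install kubeai-models kubeai/models -f ./kubeai-models-values.yaml
```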
docs/installation/any.md (11 additions, 1 deletion)

@@ -38,7 +38,7 @@ Optionally, inspect the values file to see the default resourceProfiles:
 helm show values kubeai/kubeai > values.yaml
 ```
 
-## Installation using GPUs
+## Installation using NVIDIA GPUs
 
 This section assumes you have a Kubernetes cluster with GPU resources available and
 installed the NVIDIA device plugin that adds GPU information labels to the nodes.

@@ -65,6 +65,16 @@ helm upgrade --install kubeai kubeai/kubeai \
   --wait
 ```
 
+## Installation using AMD GPUs
+
+```
+helm upgrade --install kubeai ./charts/kubeai \
+  -f charts/kubeai/values-amd-gpu-device-plugin.yaml \
+  --set secrets.huggingface.token=$HF_TOKEN \
+  --wait
+```
+
 ## Deploying models
 
 Take a look at the following how-to guides to deploy models:
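After the AMD install completes, it can help to confirm the device plugin is actually advertising GPUs before deploying a model; a quick check, assuming the plugin from the referenced values file exposes the usual `amd.com/gpu` resource on nodes:

```
kubectl get nodes -o custom-columns='NAME:.metadata.name,AMD_GPUS:.status.allocatable.amd\.com/gpu'
```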
manifests/models/llama-3.1-70b-instruct-fp8-mi300x.yaml (26 additions, new file)

@@ -0,0 +1,26 @@
+# Source: models/templates/models.yaml
+apiVersion: kubeai.org/v1
+kind: Model
+metadata:
+  name: llama-3.1-70b-instruct-fp8-mi300x
+spec:
+  features: [TextGeneration]
+  url: hf://amd/Llama-3.1-70B-Instruct-FP8-KV
+  engine: VLLM
+  args:
+    - --max-model-len=120000
+    - --max-num-batched-tokens=120000
+    - --max-num-seqs=1024
+    - --num-scheduler-steps=15
+    - --gpu-memory-utilization=0.9
+    - --disable-log-requests
+    - --kv-cache-dtype=fp8
+    - --enable-chunked-prefill=false
+    - --max-seq-len-to-capture=16384
+  env:
+    HIP_FORCE_DEV_KERNARG: "1"
+    NCCL_MIN_NCHANNELS: "112"
+    TORCH_BLAS_PREFER_HIPBLASLT: "1"
+    VLLM_USE_TRITON_FLASH_ATTN: "0"
+  targetRequests: 1024
+  resourceProfile: amd-gpu-mi300x:1
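To try the new manifest end to end, a sketch assuming a default KubeAI install in the `kubeai` namespace and the OpenAI-compatible endpoint KubeAI exposes on its service:

```
kubectl apply -f manifests/models/llama-3.1-70b-instruct-fp8-mi300x.yaml

# Port-forward the KubeAI service and send a test request; the model name
# matches metadata.name in the manifest above.
kubectl port-forward svc/kubeai -n kubeai 8000:80 &
curl http://localhost:8000/openai/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.1-70b-instruct-fp8-mi300x", "prompt": "Hello", "max_tokens": 10}'
```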
