Windows llama.cpp

A PowerShell automation to rebuild llama.cpp for a Windows environment. It automates the following steps:

  1. Fetching and extracting a specific release of OpenBLAS
  2. Fetching the latest version of llama.cpp
  3. Fixing OpenBLAS binding in the CMakeLists.txt
  4. Rebuilding the binaries with CMake
  5. Updating the Python dependencies
  6. Automatically detecting the best BLAS acceleration

BLAS support

This script currently supports OpenBLAS for CPU BLAS acceleration and CUDA for NVIDIA GPU BLAS acceleration.

Installation

1. Install Prerequisites

Download and install the latest versions:

Tip

When installing Visual Studio 2022, it is sufficient to install just the Build Tools for Visual Studio 2022 package. Also make sure that Desktop development with C++ is enabled in the installer.

2. Enable Hardware Accelerated GPU Scheduling (optional)

Execute the following in a PowerShell terminal with Administrator privileges to enable the Hardware Accelerated GPU Scheduling feature:

New-ItemProperty `
    -Path "HKLM:\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" `
    -Name "HwSchMode" `
    -Value "2" `
    -PropertyType DWORD `
    -Force

Then restart your computer to activate the feature.

3. Clone the repository from GitHub

Clone the repository to a convenient location on your machine:

git clone --recurse-submodules git@github.com:countzero/windows_llama.cpp.git

4. Create a new Conda environment

Create a new Conda environment for this project with a specific version of Python:

conda create --name llama.cpp python=3.12

5. Initialize Conda for shell interaction

To make Conda available in your current shell, execute the following:

conda init

Tip

You can always revert this via conda init --reverse.

6. Execute the build script

To build llama.cpp binaries for a Windows environment with the best available BLAS acceleration execute the script:

./rebuild_llama.cpp.ps1

Tip

If PowerShell is not configured to execute files, allow it by executing the following in an elevated PowerShell: Set-ExecutionPolicy RemoteSigned

7. Download a large language model

Download a large language model (LLM) with weights in the GGUF format into the ./vendor/llama.cpp/models directory. You can for example download the gemma-2-9b-it model in a quantized GGUF format:

Tip

See the 🤗 Open LLM Leaderboard and LMSYS Chatbot Arena Leaderboard for best in class open source LLMs.

Usage

Chat via server script

You can easily chat with a specific model by using the .\examples\server.ps1 script:

.\examples\server.ps1 -model ".\vendor\llama.cpp\models\gemma-2-9b-it-IQ4_XS.gguf"

Note

The script will automatically start the llama.cpp server with an optimal configuration for your machine.

Execute the following to get detailed help on further options of the server script:

Get-Help -Detailed .\examples\server.ps1

Chat via CLI

You can now chat with the model:

./vendor/llama.cpp/build/bin/Release/llama-cli `
    --model "./vendor/llama.cpp/models/gemma-2-9b-it-IQ4_XS.gguf" `
    --ctx-size 8192 `
    --threads 16 `
    --n-gpu-layers 33 `
    --reverse-prompt '[[USER_NAME]]:' `
    --prompt-cache "./cache/gemma-2-9b-it-IQ4_XS.gguf.prompt" `
    --file "./vendor/llama.cpp/prompts/chat-with-vicuna-v1.txt" `
    --color `
    --interactive

Chat via Webinterface

You can start llama.cpp as a webserver:

./vendor/llama.cpp/build/bin/Release/llama-server `
    --model "./vendor/llama.cpp/models/gemma-2-9b-it-IQ4_XS.gguf" `
    --ctx-size 8192 `
    --threads 16 `
    --n-gpu-layers 33

Then access llama.cpp through the web interface at:

Increase the context size

You can increase the context size of a model with minimal quality loss by setting the RoPE parameters. The parameters are calculated as follows:

context_scale = increased_context_size / original_context_size
rope_frequency_scale = 1 / context_scale
rope_frequency_base = 10000 * context_scale

Note

Increasing the context size of an openchat-3.6-8b-20240522 model from its original context size of 8192 to 32768 gives a context_scale of 4.0. The rope_frequency_scale is then 0.25 and the rope_frequency_base is 40000.
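The formula above can be sketched as a small Python helper, for example to run inside the Conda environment created during installation. This is only an illustration: the function name rope_parameters and its default_frequency_base argument are hypothetical and not part of llama.cpp or this repository.

```python
def rope_parameters(
    original_context_size: int,
    increased_context_size: int,
    default_frequency_base: float = 10000.0,  # the constant used in the formula above
) -> tuple[float, float]:
    """Return (rope_frequency_scale, rope_frequency_base) for the formula above."""
    if original_context_size <= 0 or increased_context_size <= 0:
        raise ValueError("context sizes must be positive")
    context_scale = increased_context_size / original_context_size
    rope_frequency_scale = 1 / context_scale
    rope_frequency_base = default_frequency_base * context_scale
    return rope_frequency_scale, rope_frequency_base


# Values from the note above: 8192 -> 32768 tokens
scale, base = rope_parameters(8192, 32768)
print(f"--rope-freq-scale {scale} --rope-freq-base {base:g}")
```

The printed values correspond to the --rope-freq-scale and --rope-freq-base arguments of llama-cli.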

To extend the context to 32k execute the following:

./vendor/llama.cpp/build/bin/Release/llama-cli `
    --model "./vendor/llama.cpp/models/gemma-2-9b-it-IQ4_XS.gguf" `
    --ctx-size 32768 `
    --rope-freq-scale 0.25 `
    --rope-freq-base 40000 `
    --threads 16 `
    --n-gpu-layers 33 `
    --reverse-prompt '[[USER_NAME]]:' `
    --prompt-cache "./cache/gemma-2-9b-it-IQ4_XS.gguf.prompt" `
    --file "./vendor/llama.cpp/prompts/chat-with-vicuna-v1.txt" `
    --color `
    --interactive

Enforce JSON response

You can enforce a specific grammar for the response generation. The following will always return a JSON response:

./vendor/llama.cpp/build/bin/Release/llama-cli `
    --model "./vendor/llama.cpp/models/gemma-2-9b-it-IQ4_XS.gguf" `
    --ctx-size 8192 `
    --threads 16 `
    --n-gpu-layers 33 `
    --prompt-cache "./cache/gemma-2-9b-it-IQ4_XS.gguf.prompt" `
    --prompt "The scientific classification (Taxonomy) of a Llama: " `
    --grammar-file "./vendor/llama.cpp/grammars/json.gbnf" `
    --color
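If you capture the command output for further processing, a short Python sketch like the following could check that it really contains parseable JSON. It assumes the captured text may include the echoed prompt before the generated JSON. The helper name extract_first_json and the sample string are made up for illustration and are not actual model output.

```python
import json


def extract_first_json(text: str):
    """Return the first JSON object or array found in text, or raise ValueError."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char in "{[":
            try:
                # raw_decode parses one JSON value beginning at index `start`
                value, _end = decoder.raw_decode(text, start)
                return value
            except json.JSONDecodeError:
                continue  # not valid JSON from here, try the next candidate
    raise ValueError("no JSON value found")


# Hypothetical captured output: echoed prompt followed by a JSON object
sample = (
    "The scientific classification (Taxonomy) of a Llama: "
    '{"kingdom": "Animalia", "genus": "Lama"}'
)
parsed = extract_first_json(sample)
print(parsed["genus"])
```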

Measure model perplexity

Execute the following to measure the perplexity of the GGUF-formatted model:

./vendor/llama.cpp/build/bin/Release/llama-perplexity `
    --model "./vendor/llama.cpp/models/gemma-2-9b-it-IQ4_XS.gguf" `
    --ctx-size 8192 `
    --threads 16 `
    --n-gpu-layers 33 `
    --file "./vendor/wikitext-2-raw-v1/wikitext-2-raw/wiki.test.raw"
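For orientation, perplexity is commonly defined as the exponential of the average negative log-likelihood per token, so lower values mean the model predicts the text better. The following Python sketch shows only that textbook definition. It is not how llama-perplexity is implemented, and the log-probabilities are hypothetical values.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(-mean(log p(token_i | previous tokens))), natural log."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_negative_log_likelihood = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_negative_log_likelihood)


# Hypothetical example: a model that assigns probability 0.25 to every token
# is as uncertain as a uniform choice between four tokens.
print(perplexity([math.log(0.25)] * 8))
```

The example prints a value of approximately 4.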

Count prompt tokens

You can easily count the tokens of a prompt for a specific model by using the .\examples\count_tokens.ps1 script:

 .\examples\count_tokens.ps1 `
     -model ".\vendor\llama.cpp\models\gemma-2-9b-it-IQ4_XS.gguf" `
     -file ".\prompts\chat_with_llm.txt"

To inspect the actual tokenization result you can use the -debug flag:

 .\examples\count_tokens.ps1 `
     -model ".\vendor\llama.cpp\models\gemma-2-9b-it-IQ4_XS.gguf" `
     -prompt "Hello World!" `
     -debug

Note

The script is a simple wrapper for the tokenize.cpp example of the llama.cpp project.

Execute the following to get detailed help on further options of the count tokens script:

Get-Help -Detailed .\examples\count_tokens.ps1

Build

Rebuild llama.cpp

Every time there is a new release of llama.cpp you can simply execute the script to automatically rebuild everything:

| Command | Description |
| --- | --- |
| ./rebuild_llama.cpp.ps1 | Automatically detects best BLAS acceleration |
| ./rebuild_llama.cpp.ps1 -blasAccelerator "OFF" | Without any BLAS acceleration |
| ./rebuild_llama.cpp.ps1 -blasAccelerator "OpenBLAS" | With CPU BLAS acceleration |
| ./rebuild_llama.cpp.ps1 -blasAccelerator "CUDA" | With NVIDIA GPU BLAS acceleration |

Build a specific version of llama.cpp

You can build a specific version of llama.cpp by specifying a git tag or commit:

| Command | Description |
| --- | --- |
| ./rebuild_llama.cpp.ps1 | The latest release |
| ./rebuild_llama.cpp.ps1 -version "b1138" | The tag b1138 |
| ./rebuild_llama.cpp.ps1 -version "1d16309" | The commit 1d16309 |