Merge pull request #15 from gotzmann/server
Server Mode
gotzmann authored Apr 28, 2023
2 parents ea45a8a + e274511 commit bf2bddd
Showing 13 changed files with 1,179 additions and 452 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -1,4 +1,5 @@
.env
*.bin
.idea
.vscode
*.pprof
9 changes: 8 additions & 1 deletion Makefile
@@ -1,5 +1,6 @@
TARGET = llama
VERSION = $(shell cat VERSION)
# $(shell cat VERSION)
VERSION = v1.4.0
OS = linux
ARCH = amd64
PACKAGE = github.com/gotzmann/$(TARGET)
@@ -140,3 +141,9 @@ fp16:
pprof:
	go tool pprof -pdf cpu.pprof > cpu.pdf

.PHONY: builds
builds:
	GOOS=windows GOARCH=amd64 go build -o ./builds/llama-go-$(VERSION).exe -ldflags "-s -w" main.go
	GOOS=darwin GOARCH=amd64 go build -o ./builds/llama-go-$(VERSION)-macos -ldflags "-s -w" main.go
	GOOS=linux GOARCH=amd64 go build -o ./builds/llama-go-$(VERSION)-linux -ldflags "-s -w" main.go

201 changes: 158 additions & 43 deletions README.md
@@ -2,83 +2,198 @@

![](./assets/images/terminal.png?raw=true)

## The Goal
## Motivation

We dream of a world where ML hackers are able to grok **REALLY BIG GPT** models without GPU clusters consuming shit tons of **$$$**, using only machines in their own homelabs.
We dream of a world where fellow ML hackers are grokking **REALLY BIG GPT** models in their homelabs without GPU clusters consuming shit tons of **$$$**.

The code of the project is based on the legendary **[ggml.cpp](https://github.com/ggerganov/llama.cpp)** framework of Georgi Gerganov written in C++
The code of the project is based on the legendary **[ggml.cpp](https://github.com/ggerganov/llama.cpp)** framework of Georgi Gerganov written in C++ with the same attitude to performance and elegance.

We hope using our beloved Golang instead of a *soo-powerful* but *too-low-level* language will allow much greater adoption of the **NoGPU** ideas.

The V1 supports only FP32 math, so you'll need at least 32GB RAM to work even with the smallest **LLaMA-7B** model. As a preliminary step you should have binary files converted from original LLaMA model locally.
We hope using Golang instead of a *soo-powerful* but *too-low-level* language will allow much greater adoption.

## V0 Roadmap

- [x] Run tensor math in pure Golang based on C++ source
- [x] Tensor math in pure Golang
- [x] Implement LLaMA neural net architecture and model loading
- [x] Run smaller LLaMA-7B model
- [x] Be sure Go inference works EXACT SAME way as C++
- [x] Let Go shine! Enable multi-threading and boost performance
- [x] Test with smaller LLaMA-7B model
- [x] Be sure Go inference works exactly the same way as C++
- [x] Let Go shine! Enable multi-threading and messaging to boost performance

## V1 Roadmap
## V1 Roadmap - Spring'23

- [x] Cross-platform compatibility with Mac, Linux and Windows
- [x] Release first stable version for ML hackers
- [x] Support bigger LLaMA models: 13B, 30B, 65B
- [x] ARM NEON support on Apple Silicon (modern Macs) and ARM servers
- [x] Performance boost with x64 AVX2 support for Intel and AMD
- [x] Release first stable version for ML hackers - v1.0
- [x] Enable bigger LLaMA models: 13B, 30B, 65B - v1.1
- [x] ARM NEON support on Apple Silicon (modern Macs) and ARM servers - v1.2
- [x] Performance boost with x64 AVX2 support for Intel and AMD - v1.2
- [x] Better memory use and GC optimizations - v1.3
- [x] Introduce Server Mode (embedded REST API) for use in real projects - v1.4
- [x] Release converted models for free access over the Internet - v1.4
- [ ] INT8 quantization to allow x4 bigger models fit same memory
- [ ] Benchmark LLaMA.go against some mainstream Python / C++ frameworks
- [ ] Enable some popular models of LLaMA family: Vicuna, Alpaca, etc
- [ ] Speed-up AVX2 with memory aligned tensors
- [ ] INT8 quantization to allow x4 bigger models to fit in the same memory
- [ ] Enable interactive mode for real-time chat with GPT
- [ ] Allow automatic download of converted model weights from the Internet
- [ ] Extensive logging for production monitoring
- [ ] Interactive mode for real-time chat with GPT

## V2 Roadmap - Summer'23

- [ ] Automatic CPU / GPU features detection
- [ ] Implement metrics for RAM and CPU usage
- [ ] Server Mode for use in Clouds as part of Microservice Architecture
- [ ] Standalone GUI or web interface for better access to framework
- [ ] Support popular open models: Open Assistant, StableLM, BLOOM, Anthropic, etc.
- [ ] AVX512 support - yet another performance boost for AMD Epyc and Intel Sapphire Rapids
- [ ] Nvidia GPUs support (CUDA or Tensor Cores)

## V2 Roadmap
## V3 Roadmap - Fall'23

- [ ] Allow plugins and external APIs for complex projects
- [ ] AVX512 support - yet another performance boost for AMD Epyc
- [ ] FP16 and BF16 support when hardware support there
- [ ] Support INT4 and GPTQ quantization
- [ ] Allow model training and fine-tuning
- [ ] Speed up execution on GPU cards and clusters
- [ ] FP16 and BF16 math if hardware support is there
- [ ] INT4 and GPTQ quantization
- [ ] AMD Radeon GPUs support with OpenCL

## How to Run?

First, obtain and convert original LLaMA models on your own, or just download ready-to-rock ones:

**LLaMA-7B:** [llama-7b-fp32.bin](https://nogpu.com/llama-7b-fp32.bin)

**LLaMA-13B:** [llama-7b-fp32.bin](https://nogpu.com/llama-7b-fp32.bin)

Both models store FP32 weights, so you'll need at least 32 GB of RAM (not VRAM or GPU RAM) for LLaMA-7B. Double that to 64 GB for LLaMA-13B.

## How to Run
Next, build the app binary from source (see the instructions below), or just download a prebuilt one:

**Windows:** [llama-go-v1.4.0.exe](./builds/llama-go-v1.4.0.exe)

**MacOS:** [llama-go-v1.4.0-macos](./builds/llama-go-v1.4.0-macos)

**Linux:** [llama-go-v1.4.0-linux](./builds/llama-go-v1.4.0-linux)

Now that you have both the executable and the model, go try it for yourself:

```shell
go run main.go \
--model ~/models/7B/ggml-model-f32.bin \
--temp 0.80 \
--context 128 \
--predict 128 \
--prompt "Why Golang is so popular?"
llama-go-v1.4.0-macos \
--model ~/models/llama-7b-fp32.bin \
--prompt "Why Golang is so popular?"
```

Or build it with Makefile and then run binary.

## Useful CLI parameters:
## Useful command line flags:

```shell
--prompt Text prompt from user to feed the model input
--model Path and file name of converted .bin LLaMA model
--model Path and file name of converted .bin LLaMA model [ llama-7b-fp32.bin, etc ]
--server Start in Server Mode acting as REST API endpoint
--host Host to allow requests from in Server Mode [ localhost by default ]
--port Port to listen on in Server Mode [ 8080 by default ]
--pods Maximum pods or units of parallel execution allowed in Server Mode [ 1 by default ]
--threads Adjust to the number of CPU cores you want to use [ all cores by default ]
--predict Number of tokens to predict [ 64 by default ]
--context Context size in tokens [ 64 by default ]
--temp Model temperature hyper parameter [ 0.8 by default ]
--silent Hide welcome logo and other output [ show by default ]
--context Context size in tokens [ 1024 by default ]
--predict Number of tokens to predict [ 512 by default ]
--temp Model temperature hyperparameter [ 0.5 by default ]
--silent Hide welcome logo and other output [ shown by default ]
--chat Chat with user in interactive mode instead of computing over a static prompt
--profile Profile CPU performance while running and store results to [cpu.pprof] file
--profile Profile CPU performance while running and store results to cpu.pprof file
--avx Enable x64 AVX2 optimizations for Intel and AMD machines
--neon Enable ARM NEON optimizations for Apple Macs and ARM servers
```
## Going Production
LLaMA.go embeds a standalone HTTP server exposing a REST API. To enable it, run the app with these flags:
```shell
llama-go-v1.4.0-macos \
--model ~/models/llama-7b-fp32.bin \
--server \
--host 127.0.0.1 \
--port 8080 \
--pods 4 \
--threads 4
```
Choose the **pods** and **threads** parameters wisely, depending on the model size, the number of CPU cores available, how many requests you want to process in parallel, and how fast you'd like to get answers.
**Pods** is the number of inference instances that may run in parallel.
The **threads** parameter sets how many cores are used for tensor math within a pod.
For example, if you have a machine with 16 hardware cores capable of running 32 hyper-threads in parallel, you might end up with something like this:
```shell
--server --pods 4 --threads 8
```
When no pod is free to handle an arriving request, the request is placed into a waiting queue and started as soon as some pod finishes its job.
# REST API examples
## Place new job
Send a POST request (with Postman) to your server address with JSON containing a unique UUID v4 and a prompt:
```json
{
"id": "5fb8ebd0-e0c9-4759-8f7d-35590f6c9fcb",
"prompt": "Why Golang is so popular?"
}
```
## Check job status
Send a GET request (with Postman or a browser) to a URL like http://host:port/jobs/status/:id
```shell
GET http://localhost:8080/jobs/status/5fb8ebd0-e0c9-4759-8f7d-35590f6c9fcb
```
## Get the results
Send a GET request (with Postman or a browser) to a URL like http://host:port/jobs/:id
```shell
GET http://localhost:8080/jobs/5fb8ebd0-e0c9-4759-8f7d-35590f6c9fcb
```
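
Both GET endpoints can also be called from Go. The sketch below builds the two URLs described above and prints whatever the server returns. The response format is not documented in this section, so the body is treated as raw text; `statusURL`, `resultURL` and `fetch` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

// statusURL builds the job status URL: http://host:port/jobs/status/:id
func statusURL(base, id string) string {
	return base + "/jobs/status/" + id
}

// resultURL builds the job results URL: http://host:port/jobs/:id
func resultURL(base, id string) string {
	return base + "/jobs/" + id
}

// fetch performs a GET request and returns the raw response body as text.
func fetch(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	data, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(data), nil
}

func main() {
	base := "http://localhost:8080"
	id := "5fb8ebd0-e0c9-4759-8f7d-35590f6c9fcb"

	for _, url := range []string{statusURL(base, id), resultURL(base, id)} {
		body, err := fetch(url)
		if err != nil {
			fmt.Println("request failed:", err)
			continue
		}
		fmt.Println(url, "->", body)
	}
}
```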
# How to build
First, install **Golang** and **git** (on Windows, you'll need to download the installers).
```shell
brew install git
brew install golang
```
Then clone the repo and enter the project folder:
```
git clone https://github.com/gotzmann/llama.go.git
cd llama.go
```
Some Go magic to install external dependencies:
```
go mod tidy
go mod vendor
```
Now we are ready to build the binary from the source code:
```shell
go build -o llama-go-v1.exe -ldflags "-s -w" main.go
```
## FAQ
**1] Where might I get original LLaMA model files?**
**1) Where can I obtain the original LLaMA models?**
Contact Meta directly or look around for some torrent alternatives
Contact Meta directly or just look around for some torrent alternatives.
**2] How to convert original LLaMA files into supported format?**
**2) How do I convert the original LLaMA files into the supported format?**
You'll need original FP16 files placed into **models** directory, then convert with command:
Place the original PyTorch FP16 files into the **models** directory, then convert them with this command:
```shell
python3 ./scripts/convert.py ~/models/LLaMA/7B/ 0
Binary file modified assets/images/terminal.png
Binary file added builds/llama-go-v1.4.0-linux
Binary file not shown.
Binary file added builds/llama-go-v1.4.0-macos
Binary file not shown.
Binary file added builds/llama-go-v1.4.0.exe
Binary file not shown.
16 changes: 14 additions & 2 deletions go.mod
@@ -3,6 +3,9 @@ module github.com/gotzmann/llama.go
go 1.20

require (
github.com/gofiber/fiber/v2 v2.44.0
github.com/google/uuid v1.3.0
github.com/gotzmann/llama.go/llama v0.0.0-20230412160549-c20730f209a3
github.com/gotzmann/llama.go/ml v0.0.0-20230412160549-c20730f209a3
github.com/jessevdk/go-flags v1.5.0
github.com/mattn/go-colorable v0.1.13
@@ -14,11 +17,20 @@ require (
)

require (
github.com/andybalholm/brotli v1.0.5 // indirect
github.com/felixge/fgprof v0.9.3 // indirect
github.com/google/pprof v0.0.0-20211214055906-6f57359322fd // indirect
github.com/mattn/go-isatty v0.0.17 // indirect
github.com/klauspost/compress v1.16.3 // indirect
github.com/mattn/go-isatty v0.0.18 // indirect
github.com/mattn/go-runewidth v0.0.14 // indirect
github.com/philhofer/fwd v1.1.2 // indirect
github.com/rivo/uniseg v0.2.0 // indirect
golang.org/x/sys v0.6.0 // indirect
github.com/savsgio/dictpool v0.0.0-20221023140959-7bf2e61cea94 // indirect
github.com/savsgio/gotils v0.0.0-20230208104028-c358bd845dee // indirect
github.com/tinylib/msgp v1.1.8 // indirect
github.com/valyala/bytebufferpool v1.0.0 // indirect
github.com/valyala/fasthttp v1.45.0 // indirect
github.com/valyala/tcplisten v1.0.0 // indirect
golang.org/x/sys v0.7.0 // indirect
golang.org/x/term v0.6.0 // indirect
)