
Commit 1b2037b

fix safetensors issue
1b5d committed May 5, 2023
1 parent 250ec5c commit 1b2037b
Showing 4 changed files with 33 additions and 8 deletions.
1 change: 1 addition & 0 deletions .pylintrc
@@ -1,2 +1,3 @@
 [MESSAGES CONTROL]
 disable=duplicate-code
+generated-members=numpy.*, torch.*
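
For context, `generated-members` tells pylint to skip its `no-member` (E1101) check for attributes matching these patterns, since torch and numpy create many members dynamically that pylint cannot resolve statically. Below is a small illustration of the kind of code this setting keeps pylint from flagging; the snippet is illustrative only and is not part of this repository:

```
import torch


def pick_device() -> torch.device:
    # Without generated-members=torch.*, pylint may report E1101 (no-member)
    # on members it cannot resolve statically, such as torch.cuda.is_available.
    if torch.cuda.is_available():
        return torch.device("cuda:0")
    return torch.device("cpu")
```
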
30 changes: 26 additions & 4 deletions README.md
@@ -18,6 +18,7 @@ tested on CPU with the following models :
 tested on GPU with GPTQ-for-LlaMa with

 - Koala 7B-4bit-128g
+- wizardLM 7B-4bit-128g

 Contribution for supporting more models is welcomed.

@@ -29,8 +30,8 @@ Contribution for supporting more models is welcomed.
 - [x] Support GPTQ-for-LLaMa
 - [ ] Lora support
 - [ ] huggingface pipeline
-- [ ] Write an implementation for OpenAI
-- [ ] Write an implementation for RWKV-LM
+- [ ] Support OpenAI
+- [ ] Support RWKV-LM

 # Usage

@@ -151,16 +152,37 @@ curl --location 'localhost:8000/generate' \

 ## Llama / Alpaca on GPU - using GPTQ-for-LLaMa (beta)

-**Note**: According to [nvidia-docker](https://github.com/NVIDIA/nvidia-docker), you might want to install the [NVIDIA Driver](https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html) on your host machine
+**Note**: According to [nvidia-docker](https://github.com/NVIDIA/nvidia-docker), you might want to install the [NVIDIA Driver](https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html) on your host machine. Verify that your nvidia environment is set up properly by running:

-You can also run the Llama model using GPTQ-for-LLaMa 4 bit quantization, you can use a docker image specially built for that purpose `1b5d/llm-api:0.0.2-gptq-llama-cuda` instead of the default image.
+```
+docker run --rm --gpus all nvidia/cuda:11.7.1-base-ubuntu20.04 nvidia-smi
+```
+
+You should see a table showing the current nvidia driver version and some other info:
+```
++---------------------------------------------------------------------------------------+
+| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 11.7 |
+|-----------------------------------------+----------------------+----------------------+
+...
+|=======================================================================================|
+| No running processes found |
++---------------------------------------------------------------------------------------+
+```
+
+You can also run the Llama model using GPTQ-for-LLaMa 4 bit quantization by using a docker image specially built for that purpose, `1b5d/llm-api:0.0.3-gptq-llama-cuda`, instead of the default image.

 a separate docker-compose file is also available to run this mode:

 ```
 docker compose -f docker-compose.gptq-llama-cuda.yaml up
 ```

+or by directly running the container:
+
+```
+docker run --gpus all -v $PWD/models/:/models:rw -v $PWD/config.yaml:/llm-api/config.yaml:ro -p 8000:8000 1b5d/llm-api:0.0.3-gptq-llama-cuda
+```
+
 Example config file:

 ```
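
Once one of the containers above is running, a quick client-side check against the `/generate` endpoint (the endpoint shown in the README's earlier curl example) can confirm that the API is serving. This is only a sketch; the request payload below is an assumed example, not the documented schema:

```
# Hypothetical sanity check for a running llm-api container.
# Assumption: the {"prompt": ...} payload shape is illustrative only; see the
# README's curl example for the actual request format.
import requests

response = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "What is the capital of France?"},
    timeout=60,
)
response.raise_for_status()
print(response.json())
```
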
6 changes: 5 additions & 1 deletion app/llms/gptq_llama/gptq_llama.py
@@ -95,7 +95,11 @@ def __init__(self, params: Dict[str, str]) -> None:
         os.environ["CUDA_VISIBLE_DEVICES"] = cuda_visible_devices
         self.device = torch.device(dev)
         self.model = self._load_quant(
-            settings.setup_params["repo_id"], model_path, wbits, group_size, self.device
+            settings.setup_params["repo_id"],
+            model_path,
+            wbits,
+            group_size,
+            cuda_visible_devices,
         )

         self.model.to(self.device)
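
The hunk above passes the raw `cuda_visible_devices` string to `_load_quant` instead of the already-constructed `torch.device`, and the commit message ties the change to safetensors loading. As a rough, hypothetical sketch of how a safetensors checkpoint can be placed on a specific GPU (the function name, arguments, and flow are assumptions, not this project's actual `_load_quant`):

```
import torch
from safetensors.torch import load_file


def load_safetensors_checkpoint(
    model: torch.nn.Module, checkpoint_path: str, gpu_index: str = "0"
) -> torch.nn.Module:
    # safetensors' load_file takes the target device as a string such as
    # "cuda:0", so passing around a plain device identifier is convenient.
    device = f"cuda:{gpu_index}" if torch.cuda.is_available() else "cpu"
    state_dict = load_file(checkpoint_path, device=device)
    model.load_state_dict(state_dict)
    return model
```
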
4 changes: 1 addition & 3 deletions docker-compose.gptq-llama-cuda.yaml
@@ -2,7 +2,7 @@ version: '3'

 services:
   app:
-    image: 1b5d/llm-api:0.0.2-gptq-llama-cuda
+    image: 1b5d/llm-api:0.0.3-gptq-llama-cuda
     container_name: llm-api-app
     ports:
       - "8000:8000"
@@ -11,8 +11,6 @@ services:
     volumes:
       - "./models:/models:rw"
       - "./config.yaml:/llm-api/config.yaml:ro"
-    ulimits:
-      memlock: 16000000000
     deploy:
       resources:
         reservations:
