
any advice for getting this running with gpt-x-alpaca models? #1

Open
darth-veitcher opened this issue Apr 9, 2023 · 20 comments

@darth-veitcher

First of all thanks for the repo, looks ideal.

I'm using gpt-x-alpaca-13b-native-4bit-128g-cuda.pt, which can be found in the anon8231489123/gpt4-x-alpaca-13b-native-4bit-128g repo on Hugging Face.

The error I'm receiving is:

invalid model file (bad magic [got 0x4034b50 want 0x67676a74])

Is this something which should be compatible?

@1b5d
Owner

1b5d commented Apr 9, 2023

Hey there! Thanks for the feedback. The current implementation can only run models in the ggml format, so that inference can be done on CPUs using the llama.cpp lib, but I think it would be interesting to add another implementation that runs inference from the PT format. I currently have limited access to my laptop, so it might take me longer to reply, but it would be a great help if you already have a Python code example of how to run inference with this model (I'm not familiar with gpt-x-alpaca) 👍
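
For what it's worth, the "bad magic" error above is a quick way to tell the two formats apart: a ggml file starts with llama.cpp's "ggjt" magic (0x67676a74), while a modern PyTorch .pt checkpoint is a zip archive whose first four bytes read as 0x04034b50. A small sketch (the filename is just the one mentioned above):

import struct

def file_magic(path: str) -> int:
    # read the first 4 bytes as a little-endian unsigned int, the way llama.cpp does
    with open(path, "rb") as f:
        (magic,) = struct.unpack("<I", f.read(4))
    return magic

magic = file_magic("gpt-x-alpaca-13b-native-4bit-128g-cuda.pt")
if magic == 0x67676A74:      # llama.cpp "ggjt" magic: a ggml model, usable by the CPU path
    print("ggml file")
elif magic == 0x04034B50:    # zip local-file header ("PK\x03\x04"): a PyTorch zip checkpoint
    print("PyTorch .pt checkpoint (not ggml)")
else:
    print(f"unknown magic: {hex(magic)}")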

@darth-veitcher
Author

Thanks for the quick reply. My understanding is that it's possible to use it inside text-generation-webui.

It seems to be referenced in a pull request from the other week, oobabooga/text-generation-webui#530, which itself links back to https://huggingface.co/ozcur/alpaca-native-4bit - that probably has the code you're after:

CUDA_VISIBLE_DEVICES=0 python llama_inference.py /root/alpaca-native-4bit --wbits 4 --groupsize 128 --load /root/alpaca-native-4bit/alpaca7b-4bit.pt --max_length 300 --text "$(cat test_prompt.txt)"

Source repo: https://github.com/qwopqwop200/GPTQ-for-LLaMa; the underlying inference implementation is https://github.com/qwopqwop200/GPTQ-for-LLaMa/blob/3274a12aad9e26d2ee883e56e48f977ee5f58fcb/llama_inference.py
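
For a rough idea of what that script does, a sketch only - load_quant is defined inside llama_inference.py itself, and its exact signature may differ between revisions of GPTQ-for-LLaMa:

# Sketch of the llama_inference.py flow; argument names/order are approximate
import torch
from transformers import AutoTokenizer
from llama_inference import load_quant   # helper defined in the GPTQ-for-LLaMa script

model_dir = "/root/alpaca-native-4bit"                     # HF-format dir with config + tokenizer
checkpoint = "/root/alpaca-native-4bit/alpaca7b-4bit.pt"   # 4-bit quantized weights

model = load_quant(model_dir, checkpoint, 4, 128)          # wbits=4, groupsize=128
model.to("cuda:0")

tokenizer = AutoTokenizer.from_pretrained(model_dir)
inputs = tokenizer("Tell me about alpacas.", return_tensors="pt").to("cuda:0")

with torch.no_grad():
    output = model.generate(**inputs, do_sample=True, max_length=300)

print(tokenizer.decode(output[0]))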

Hope that helps?

@fblissjr

fblissjr commented Apr 10, 2023

On a related note, @1b5d - any plans to incorporate GPTQ into this? Would love a lighter-weight API / LangChain integration for GPU inference.

@1b5d
Owner

1b5d commented Apr 10, 2023

I can give it a shot, though it might take me a few days until I'm able to dedicate time to it, but I think this API should definitely also support GPU inference!

@1b5d
Owner

1b5d commented Apr 23, 2023

@darth-veitcher @fblissjr I just pushed an image to run inference on GPU using GPTQ-for-LLaMa; you can find the image here.

I also added a section to the README.md file. Please try it out and let me know your thoughts; my access to GPUs is quite limited.

@darth-veitcher
Author

> @darth-veitcher @fblissjr I just pushed an image to run inference on GPU using GPTQ-for-LLaMa; you can find the image here.
>
> I also added a section to the README.md file. Please try it out and let me know your thoughts; my access to GPUs is quite limited.

Amazing, thanks @1b5d. I will have a look at this shortly.

@darth-veitcher
Author

@1b5d - tested briefly, with Windows as the host OS (using WSL2).

Can you advise which repos/models you've used successfully? I tried a couple with no joy (they all fail with the same error).

I needed to modify the docker-compose file to request a GPU as follows (I'm using the second GPU in the machine, with id 1):

version: '3'

services:
  app:
    image: 1b5d/llm-api:0.0.1-gptq-llama-cuda
    container_name: llm-api-app
    ports:
      - "8000:8000"
    environment:
      - LLM_API_MODELS_DIR=/models
    volumes:
      - "./models:/models:rw"
      - "./config.yaml:/llm-api/config.yaml:ro"
    ulimits:
      memlock: 16000000000
+    deploy:
+      resources:
+        reservations:
+          devices:
+          - driver: nvidia
+            #count: 1  # if not fussed about the device
+            device_ids: ['1']  # if you want to specify
+            capabilities: [gpu]
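
Equivalently, for anyone not using compose, the same reservation can be made with plain docker run (a sketch; paths are relative to the project directory):

docker run --rm -p 8000:8000 \
  --gpus '"device=1"' \
  --ulimit memlock=16000000000 \
  -e LLM_API_MODELS_DIR=/models \
  -v "$PWD/models:/models:rw" \
  -v "$PWD/config.yaml:/llm-api/config.yaml:ro" \
  1b5d/llm-api:0.0.1-gptq-llama-cuda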

Tests

With the following config:

# file: config.yaml

models_dir: /models
model_family: gptq_llama
setup_params:
  repo_id: TheBloke/koala-13B-GPTQ-4bit-128g
  filename: koala-13B-4bit-128g.safetensors
model_params:
  group_size: 128
  wbits: 4
  cuda_visible_devices: "0"
  device: "cuda:0"
The container fails on startup with:

llm-api-app  | Traceback (most recent call last):
llm-api-app  |   File "/llm-api/./app/main.py", line 59, in <module>
llm-api-app  |     llm = ModelClass(params=settings.model_params)
llm-api-app  |   File "/llm-api/app/llms/gptq_llama/gptq_llama.py", line 97, in __init__
llm-api-app  |     self.model = self._load_quant(
llm-api-app  |   File "/llm-api/app/llms/gptq_llama/gptq_llama.py", line 119, in _load_quant
llm-api-app  |     model = LlamaForCausalLM(config)
llm-api-app  |   File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 615, in __init__
llm-api-app  |     self.model = LlamaModel(config)
llm-api-app  |   File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 445, in __init__
llm-api-app  |     self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
llm-api-app  |   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/sparse.py", line 144, in __init__
llm-api-app  |     self.reset_parameters()
llm-api-app  |   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/sparse.py", line 153, in reset_parameters
llm-api-app  |     init.normal_(self.weight)
llm-api-app  | TypeError: GPTQLlamaLLM._load_quant.<locals>.noop() takes 0 positional arguments but 1 was given

@1b5d
Owner

1b5d commented Apr 23, 2023

@darth-veitcher Thank you for testing this on such short notice. I think this error is just a stupid typo in the code I recently pushed, and I've just published a fix for it. Could you please try again after pulling the same image name & tag? https://hub.docker.com/layers/1b5d/llm-api/0.0.1-gptq-llama-cuda/images/sha256-5b3942381f7a78cec97c2542b83e3fa866a2cac54d66f3bb34c5350933434ba1?context=explore

@darth-veitcher
Author

Yeah, I had a look and I think it's just this that needs to be rewritten. I was just building my own image locally with that change:

        def noop(*args, **kwargs):
            pass
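
For context, this noop gets monkeypatched over torch's weight-init functions so that building the empty LlamaForCausalLM skips random initialisation (the quantized checkpoint overwrites the weights anyway); the traceback above is nn.init.normal_ handing the weight tensor to a zero-argument noop. A sketch of the usual GPTQ-for-LLaMa-style pattern:

import torch

def noop(*args, **kwargs):
    # accept and ignore whatever tensor torch.nn.init passes in
    pass

# skip the costly (and unnecessary) random init while the model skeleton is built
torch.nn.init.kaiming_uniform_ = noop
torch.nn.init.uniform_ = noop
torch.nn.init.normal_ = noop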

@darth-veitcher
Author

darth-veitcher commented Apr 23, 2023

I've got a bit further now, thanks, but I seem to be getting issues with CUDA devices not being available in the container, despite the GPUs being visible from within it with commands such as nvidia-smi and nvcc --version.

I'll have another look tomorrow, but can you confirm the container was working at your end with a local Nvidia GPU? If so, posting your settings here (including a working docker-compose) would help.

llm-api-app  |     torch._C._cuda_init()
llm-api-app  | RuntimeError: No CUDA GPUs are available

If I force the container command to be nvidia-smi, the cards are clearly getting passed through.

llm-api-app  | +-----------------------------------------------------------------------------+
llm-api-app  | | NVIDIA-SMI 520.61.03    Driver Version: 522.06       CUDA Version: 11.7     |
llm-api-app  | |-------------------------------+----------------------+----------------------+
llm-api-app  | | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
llm-api-app  | | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
llm-api-app  | |                               |                      |               MIG M. |
llm-api-app  | |===============================+======================+======================|
llm-api-app  | |   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
llm-api-app  | | 67%   77C    P2   280W / 350W |   8453MiB / 24576MiB |     67%      Default |
llm-api-app  | |                               |                      |                  N/A |
llm-api-app  | +-------------------------------+----------------------+----------------------+
llm-api-app  | |   1  NVIDIA GeForce ...  On   | 00000000:04:00.0 Off |                  N/A |
llm-api-app  | |  0%   17C    P8    11W / 350W |      0MiB / 24576MiB |      0%      Default |
llm-api-app  | |                               |                      |                  N/A |
llm-api-app  | +-------------------------------+----------------------+----------------------+
llm-api-app  |
llm-api-app  | +-----------------------------------------------------------------------------+
llm-api-app  | | Processes:                                                                  |
llm-api-app  | |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
llm-api-app  | |        ID   ID                                                   Usage      |
llm-api-app  | |=============================================================================|
llm-api-app  | |    0   N/A  N/A        20      G   /Xwayland                       N/A      |
llm-api-app  | |    0   N/A  N/A        20      G   /Xwayland                       N/A      |
llm-api-app  | |    0   N/A  N/A        22      G   /Xwayland                       N/A      |
llm-api-app  | |    1   N/A  N/A        20      G   /Xwayland                       N/A      |
llm-api-app  | |    1   N/A  N/A        20      G   /Xwayland                       N/A      |
llm-api-app  | |    1   N/A  N/A        22      G   /Xwayland                       N/A      |
llm-api-app  | +-----------------------------------------------------------------------------+
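
One way to narrow down whether it's the torch build inside the image or the device pass-through (just standard torch introspection, run inside the container):

python3 - <<'EOF'
import torch
print("torch version:   ", torch.__version__)
print("built with CUDA: ", torch.version.cuda)        # None means a CPU-only wheel
print("cuda available:  ", torch.cuda.is_available())
print("device count:    ", torch.cuda.device_count())
EOF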

@1b5d
Owner

1b5d commented Apr 24, 2023

Maybe you already checked this, but in your compose file above you had device_ids: ['1'], while in your config file there is cuda_visible_devices: "0" - did you try setting the same value for both?

@darth-veitcher
Author

I'm not sure that's the issue, as in theory it means I'm passing the second device through to the container, and the container then uses the first GPU available to it for the model. That should be accurate (as that's the only one it should have visible).

I believe it's an issue with the Python environment configuration inside the container and will have a proper look at it today.

Can you confirm you have this container working on an Nvidia GPU? If so, could you share the settings / run command etc.?

@1b5d
Owner

1b5d commented Apr 24, 2023

cuda_visible_devices corresponds to the CUDA_VISIBLE_DEVICES environment variable, which uses the GPU identifiers, not the order in which they appear; what you described above is the behaviour of the device attribute, where e.g. the value cuda:0 selects the first visible device. Is this difficult to test?
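
To illustrate the difference: with only physical GPU 1 exposed, that GPU becomes cuda:0 inside the process.

CUDA_VISIBLE_DEVICES=1 python3 -c "import torch; print(torch.cuda.device_count())"   # prints 1; cuda:0 now maps to physical GPU 1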

@darth-veitcher
Author

darth-veitcher commented Apr 24, 2023

I'm pretty sure I ran it last night with a --gpus all command to rule this out, but I'll double check a bit later. Should be able to test in an hour or so 👍🏻

@darth-veitcher
Author

Same issue when running the container with the following command:

docker run --gpus all --rm -it 1b5d/llm-api:0.0.1-gptq-llama-cuda bash

Inside the container:

python3 -c 'import torch; print(torch.cuda.is_available())'

Answer: False

@1b5d
Owner

1b5d commented Apr 24, 2023

I see - I probably misunderstood the requirements from Nvidia for running in a container. Now I think the solution might be in this answer: https://stackoverflow.com/a/58432877 - would you be able to install the nvidia-container-toolkit on your host and try again? Sorry for the back and forth, but every host system is different and I'm still learning when it comes to GPUs.
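
For reference, on a Debian/Ubuntu host the usual steps are roughly the following (a sketch; NVIDIA's docs cover the apt repository setup, and WSL2 has its own driver/toolkit caveats):

sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker   # registers the nvidia runtime with Docker
sudo systemctl restart docker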

@darth-veitcher
Author

> I see - I probably misunderstood the requirements from Nvidia for running in a container. Now I think the solution might be in this answer: https://stackoverflow.com/a/58432877 - would you be able to install the nvidia-container-toolkit on your host and try again? Sorry for the back and forth, but every host system is different and I'm still learning when it comes to GPUs.

I think it's just the way the Python environment is configured inside the container you've created. I'll have a look today and see if I can get a chance to build one; it looks like you've done the actual hard work of implementing the code in the API, so this is hopefully just a container thing.

For example, here's what I get running the same commands inside another container on the same host:

docker run --gpus all --rm -it --entrypoint /bin/bash ghcr.io/huggingface/text-generation-inference:latest

Inside the container:

python3 -c 'import torch; print(torch.cuda.is_available())'

Answer: True

@1b5d
Owner

1b5d commented Apr 24, 2023

Please note that they also wrote on their package page:

Note: To use GPUs, you need to install the NVIDIA Container Toolkit. We also recommend using NVIDIA drivers with CUDA version 11.8 or higher.

Do you already have it installed by any chance?

@1b5d
Owner

1b5d commented Apr 25, 2023

Hey @darth-veitcher, I just pushed a new image, 1b5d/llm-api:0.0.2-gptq-llama-cuda, using a different base - would you please give it a try?

@darth-veitcher
Author

> Hey @darth-veitcher, I just pushed a new image, 1b5d/llm-api:0.0.2-gptq-llama-cuda, using a different base - would you please give it a try?

Ah, thanks. I got distracted with some side quests, but I did get a semi-working image built.

I needed to change some of the library code as well as the image, so I will share the modifications I made back upstream.

I'd also flag that I was getting some issues with .safetensors specifically. A .pt extension loaded fine, but I was getting a lot of "missing keys, extra keys" type errors when loading a safetensors file.
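
If it helps when comparing the two formats, a minimal way to surface those key mismatches (a sketch: model stands for whatever quantized LlamaForCausalLM skeleton the loader builds, and the filename is the one from my config above):

import torch
from safetensors.torch import load_file

# model = ...  # the quantized model skeleton built by the loader (assumed to exist here)

state_dict = load_file("koala-13B-4bit-128g.safetensors")
# state_dict = torch.load("model.pt", map_location="cpu")  # the .pt equivalent

# strict=False reports the mismatches instead of raising
result = model.load_state_dict(state_dict, strict=False)
print("missing keys:   ", result.missing_keys[:10])
print("unexpected keys:", result.unexpected_keys[:10])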

I'll review the new image and hopefully submit a PR tonight with the changes I made to get it working.
