Please upload GGUF model to hugging face link #7

Open
wardensc2 opened this issue Sep 19, 2024 · 20 comments

@wardensc2

Hi @MinusZoneAI

Please upload the GGUF model to Hugging Face. Downloading from the server in China is very slow.

Thank you

@wailovet
Contributor

https://huggingface.co/MinusZoneAI/ComfyUI-CogVideoX-MZ

@wardensc2
Author

Hi @wailovet, can you upload a Q8 GGUF version?

Thanks in advance

@wardensc2 wardensc2 reopened this Sep 20, 2024
@wailovet
Contributor

I don't have a Q8 GGUF version. You can download the model file directly from https://huggingface.co/alibaba-pai/CogVideoX-Fun-5b-InP/tree/main/transformer and put it in the unet folder. Select the fp8_e4m3 type to get 8-bit quantized inference.

@phr00t

phr00t commented Sep 20, 2024

How can we quantize it ourselves? I want a Q5 and maybe even a Q6, as I think Q4 is more heavily quantized than I want. I'm happy to generate it myself, but I haven't figured out how to convert the safetensors to GGUF. I've been trying to use the llama.cpp tools, but they say they can't open the safetensors (or recognize the config.json file).

@wailovet
Contributor

I only borrowed GGUF's quantization method. After some of the layers are quantized, the model is saved again as safetensors, so it is not strictly GGUF.
For the quantization method, see: https://github.com/Nexesenex/croco.cpp/blob/32d7ed1b6e6e2a9be4e9777b331373b198b3dac3/gguf-py/gguf/quants.py#L220
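
This also explains why GGUF tooling rejects these files: the container is still safetensors. A quick way to tell the two apart is to look at the first bytes. The sketch below is only illustrative, and `sniff_format` is a made-up helper name. It relies on two properties of the formats: a GGUF file begins with the 4-byte magic `GGUF`, while a safetensors file begins with an 8-byte little-endian header length followed by a JSON header.

```python
import json
import struct

MAX_HEADER = 100 * 1024 * 1024  # refuse absurd header sizes instead of reading them


def sniff_format(path):
    """Guess whether a file is GGUF or safetensors from its first bytes."""
    with open(path, "rb") as f:
        head = f.read(8)
        if head[:4] == b"GGUF":
            return "gguf"
        if len(head) < 8:
            return "unknown"
        # safetensors: u64 little-endian header size, then that many bytes of JSON.
        (header_len,) = struct.unpack("<Q", head)
        if header_len > MAX_HEADER:
            return "unknown"
        raw = f.read(header_len)
    if len(raw) != header_len:
        return "unknown"
    try:
        header = json.loads(raw)
    except ValueError:  # covers invalid JSON and invalid UTF-8
        return "unknown"
    return "safetensors" if isinstance(header, dict) else "unknown"
```

A file that reports `safetensors` here will not load in a GGUF reader, whatever block format its tensors use internally.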

@phr00t

phr00t commented Sep 20, 2024

I see that this uses the Python gguf module and its quants methods.

However, I want to make a Q5_K_M version of CogVideoX-Fun. I'm wondering what steps were used to make the Q4_0 GGUF files, so I can do the same for a Q5_K_M version.

EDIT: Looks like this was hardcoded to only support Q4_0, if I'm not mistaken...

@wailovet
Contributor

I tried to find the torch quantization code for Q5_K_M, but it seems that it doesn't exist. This may be beyond my ability.

@realisticdreamer114514

Can you make a Q4 GGUF for the 5B-I2V model?

@phr00t

phr00t commented Sep 21, 2024

Looks like it is just referred to as "K" and not "K_M" in the source files. You want to use the "K" methods over the "_0" methods; they are newer and considered better at the same size. I'd really love a Q6_K quant of the CogVideoX-Fun model, and here is the code for that:

https://github.com/Nexesenex/croco.cpp/blob/32d7ed1b6e6e2a9be4e9777b331373b198b3dac3/gguf-py/gguf/quants.py#L554

@wailovet
Contributor

wailovet commented Sep 21, 2024

Look, there is no quantize_blocks in there.
I suggest using the fp8 type directly. Even Q4 only quantizes some of the layers; compared with fp8, it reduces VRAM usage by 30%. With Q6, the saving may not be very noticeable.

@phr00t

phr00t commented Sep 21, 2024

You are correct. How about Q5_1 then? It has a quantize_blocks and provides more precision than Q4_0, while still being marginally smaller than fp8:

https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/gguf/quants.py#L337

Also, if you don't care to do it, I'd be more than happy to do the quant myself if you shared the steps involved.
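
To put rough numbers on the size comparison, the storage cost of each block format can be worked out from its layout. This is a back-of-the-envelope sketch: the `BLOCK_LAYOUTS` table and helper names are made up for illustration, and the byte counts are the block sizes as I understand them from ggml, so check them against quants.py before relying on them.

```python
# (elements per block, bytes per block) for a few GGML block formats.
BLOCK_LAYOUTS = {
    "Q4_0": (32, 2 + 16),            # fp16 scale + 32 four-bit codes
    "Q5_1": (32, 2 + 2 + 4 + 16),    # fp16 scale + fp16 min + 32 high bits + 32 low nibbles
    "Q8_0": (32, 2 + 32),            # fp16 scale + 32 int8 codes
    "Q6_K": (256, 2 + 128 + 64 + 16),  # super-block with per-group scales
}

FP8_BITS = 8.0


def bits_per_weight(qtype):
    """Average storage cost of one weight, including the per-block metadata."""
    elements, nbytes = BLOCK_LAYOUTS[qtype]
    return nbytes * 8 / elements


def size_relative_to_fp8(qtype):
    """Size of a tensor in this format as a fraction of the same tensor in fp8."""
    return bits_per_weight(qtype) / FP8_BITS


for name in BLOCK_LAYOUTS:
    print(f"{name}: {bits_per_weight(name):.4f} bits/weight, "
          f"{size_relative_to_fp8(name):.2%} of fp8")
```

These ratios apply only to the tensors that are actually quantized. Since only some layers are converted, the whole-model saving is smaller than the per-tensor figure, which fits the 30% that wailovet reports for Q4 against fp8.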

@wailovet
Contributor

You would need to port the Q5_1 quantize_blocks to torch, quantize the weights of the original model, and save the result as a new model file. You would also need to modify https://github.com/MinusZoneAI/ComfyUI-CogVideoX-MZ/blob/main/mz_gguf_loader.py#L19 so that it recognizes the Q5_1 type and runs the matching dequantize_blocks during inference.

This is quite cumbersome. At least, it is not as easy as it might seem.
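
As a rough idea of what such a port has to reproduce, here is the Q5_1 layout worked through in plain Python on a single 32-value block. It is a sketch under assumptions, not the loader's code: the function names are hypothetical, and the field order (fp16 scale `d`, fp16 minimum `m`, 4 bytes of high bits, 16 bytes of packed low nibbles) is my reading of the gguf-py implementation linked above. A torch version would vectorize the same steps over all blocks.

```python
import struct

BLOCK = 32  # values per Q5_1 block


def quantize_block_q5_1(values):
    """Pack 32 floats into 24 bytes: d (fp16), m (fp16), qh (u32), qs (16 bytes)."""
    assert len(values) == BLOCK
    lo, hi = min(values), max(values)
    d = (hi - lo) / 31
    inv_d = 1.0 / d if d else 0.0
    # 5-bit codes in [0, 31], measured from the block minimum.
    q = [min(31, max(0, int((v - lo) * inv_d + 0.5))) for v in values]
    # Low nibbles: element i shares a byte with element i + 16.
    qs = bytes((q[i] & 0x0F) | ((q[i + 16] & 0x0F) << 4) for i in range(16))
    # High bits: bit i of a little-endian 32-bit word belongs to element i.
    qh = sum((q[i] >> 4) << i for i in range(BLOCK))
    return struct.pack("<eeI", d, lo, qh) + qs


def dequantize_block_q5_1(block):
    """Inverse of quantize_block_q5_1: value = d * code + m."""
    d, m, qh = struct.unpack("<eeI", block[:8])
    qs = block[8:]
    low = [b & 0x0F for b in qs] + [b >> 4 for b in qs]
    return [d * (low[i] | (((qh >> i) & 1) << 4)) + m for i in range(BLOCK)]


values = [i / 4 for i in range(BLOCK)]
block = quantize_block_q5_1(values)
restored = dequantize_block_q5_1(block)
```

The extra minimum and the separate high-bit word are what make Q5_1 more work than Q4_0: both the quantizer and the loader's dequantizer have to agree on this packing.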

@phr00t

phr00t commented Sep 21, 2024

Is the script you used to rewrite the blocks into Q4_0 available somewhere?

@wailovet
Contributor

There is no such script. I implemented it quickly by directly modifying my local ComfyUI code, so the messy code is mixed in with my local ComfyUI. After all, at the very beginning I only intended to use it once.

However, I can provide a code snippet rewritten in torch form.

import torch

# Minimal stand-in for the table this snippet assumes:
# qtype -> (elements per block, bytes per block). The Q4_0 entry matches gguf-py.
GGML_QUANT_SIZES = {"Q4_0": (32, 2 + 16)}


def split_block_dims(blocks, *args):
    # Split each block's bytes into fields of the given widths plus the remainder.
    n_max = blocks.shape[1]
    dims = list(args) + [n_max - sum(args)]
    return torch.split(blocks, dims, dim=1)


def quant_shape_to_byte_shape(shape, qtype) -> tuple[int, ...]:
    block_size, type_size = GGML_QUANT_SIZES[qtype]
    if shape[-1] % block_size != 0:
        raise ValueError(
            f"Quantized tensor row size ({shape[-1]}) is not a multiple of {qtype} block size ({block_size})")
    return (*shape[:-1], shape[-1] // block_size * type_size)


def quantize_blocks_Q4_0(data):
    block_size, type_size = GGML_QUANT_SIZES["Q4_0"]

    original_shape = data.shape

    # View the tensor as (n_blocks, block_size) so that each row is one Q4_0 block.
    n_blocks = data.numel() // block_size
    data = data.reshape((n_blocks, block_size))

    # Per-block scale, derived from the largest magnitude in the block.
    abs_max = torch.max(torch.abs(data), dim=-1, keepdim=True).values

    d = abs_max / -8
    inv_d = torch.where(d == 0, torch.zeros_like(d), 1 / d)

    # Map every value to a 4-bit code in [0, 15].
    qs = torch.trunc((data * inv_d) + 8.5).to(torch.uint8).clamp(0, 15)
    # Pack two codes per byte: first half of the block in the low nibble,
    # second half in the high nibble.
    qs = qs.reshape((n_blocks, 2, block_size // 2))
    qs = qs[..., 0, :] | (qs[..., 1, :] << 4)

    # Store the scale as float16, reinterpreted as two raw bytes.
    d = d.to(torch.float16).view(torch.uint8)

    out = torch.cat([d, qs], dim=-1)

    out = out.reshape(quant_shape_to_byte_shape(original_shape, qtype="Q4_0"))

    return out
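
One way to sanity-check output in this layout is to decode a block by hand. The sketch below is a plain-Python illustration with hypothetical function names, not code from this repo: `quantize_block_q4_0` mirrors the rule in the snippet above for a single block, and `dequantize_block_q4_0` reverses it, assuming each 18-byte block is an fp16 scale followed by 16 bytes of packed 4-bit codes.

```python
import struct

BLOCK = 32  # values per Q4_0 block


def quantize_block_q4_0(values):
    """Single-block mirror of the torch snippet: 32 floats -> 18 bytes."""
    assert len(values) == BLOCK
    abs_max = max(abs(v) for v in values)
    d = abs_max / -8
    inv_d = 1.0 / d if d else 0.0
    codes = [min(15, max(0, int(v * inv_d + 8.5))) for v in values]
    # Element i goes in the low nibble, element i + 16 in the high nibble.
    packed = bytes(codes[i] | (codes[i + 16] << 4) for i in range(16))
    return struct.pack("<e", d) + packed


def dequantize_block_q4_0(block):
    """Unpack one 18-byte block back into 32 floats: value = d * (code - 8)."""
    assert len(block) == 18
    (d,) = struct.unpack("<e", block[:2])
    packed = block[2:]
    codes = [b & 0x0F for b in packed] + [b >> 4 for b in packed]
    return [d * (c - 8) for c in codes]


# Values that sit exactly on the 16-level grid survive the round trip unchanged.
values = [float(8 - (i % 16)) for i in range(BLOCK)]
restored = dequantize_block_q4_0(quantize_block_q4_0(values))
```

Values that fall between grid points come back rounded to the nearest level, which is the precision loss being traded for the smaller file.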

@kijai

kijai commented Sep 22, 2024

Thank you for your work with quantizing the models! This is all new to me, but thanks to this discussion and the provided snippet I think I managed to quant the I2V model similarly, at least it works:

https://huggingface.co/Kijai/CogVideoX_GGUF/blob/main/CogVideoX_5b_I2V_GGUF_Q4_0.safetensors

@realisticdreamer114514

realisticdreamer114514 commented Sep 22, 2024

@kijai With my available VRAM (you've seen me asking in your own repo), how should I load GGUF quants so that it does not OOM?
[image]
Is it technically possible to use enable_sequential_cpu_offload (the main optimization for low VRAM) with GGUFs?
The main advantage of GGUF, at least with llama.cpp, is splitting inference memory between VRAM and CPU RAM, but I don't know how you and Illyasviel at Forge could implement that. If I assume the diffusers implementation takes as much VRAM as SAT, then Q4 should be 1/4 of that, and I could split it between 4.5GB of VRAM and 2GB of CPU RAM.
(Remember to add that model to this node too.)

@kijai

kijai commented Sep 22, 2024

I believe MinusZone AI has done that in this repo, I haven't tried that though.

@wailovet
Contributor

wailovet commented Sep 22, 2024

I only call pipe.enable_sequential_cpu_offload() in the loader. I think it should take effect.

Most of the time, the OOM I encounter occurs in VAE encode.

@realisticdreamer114514

realisticdreamer114514 commented Sep 22, 2024

With the setting in the image (main_device), it is much slower than using the transformer models, because it spills into shared memory (RAM). Wailovet probably means VRAM minimization, but what I, and probably many others with 8-12GB cards, prefer is optimization the way WebUI Forge or llama.cpp does it: fitting the model to the available VRAM, or even letting us pick how much it consumes. I tried bringing this up in the original implementation's repo, and they implied that it is not planned (by telling me to keep using the enable_sequential_cpu_offload option).
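
To make the request concrete: llama.cpp-style offloading boils down to deciding how many transformer blocks stay resident in VRAM under a budget, with the rest kept in CPU RAM. The sketch below only illustrates that planning step. `plan_layer_split`, the block count, the per-block size, and the reserve for activations are all hypothetical, and nothing here is taken from this repo, diffusers, or Forge.

```python
def plan_layer_split(layer_bytes, vram_budget_bytes, reserve_bytes=0):
    """Keep leading layers on the GPU until the VRAM budget is used up.

    layer_bytes: per-layer weight sizes in bytes, in execution order.
    reserve_bytes: VRAM held back for activations, VAE, and so on.
    Returns (n_gpu_layers, gpu_bytes, cpu_bytes).
    """
    available = max(0, vram_budget_bytes - reserve_bytes)
    used = 0
    n_gpu = 0
    for size in layer_bytes:
        if used + size > available:
            break
        used += size
        n_gpu += 1
    return n_gpu, used, sum(layer_bytes) - used


MIB = 1024 ** 2
# Assumed example: 42 equally sized blocks of 100 MiB each after quantization.
layers = [100 * MIB] * 42
n_gpu, gpu_bytes, cpu_bytes = plan_layer_split(
    layers, vram_budget_bytes=4608 * MIB, reserve_bytes=1536 * MIB)
print(n_gpu, gpu_bytes // MIB, cpu_bytes // MIB)
```

The planning is the easy part. The real work is in the loader and the forward pass, which have to move the CPU-resident blocks in and out (or run them on the CPU) at the right time.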
