Please upload GGUF model to hugging face link #7
Comments
Hi @wailovet, can you upload a Q8 GGUF version? Thanks in advance. |
I don't have a Q8 GGUF version. You can download the model file directly from https://huggingface.co/alibaba-pai/CogVideoX-Fun-5b-InP/tree/main/transformer and put it in the unet folder. Select the type fp8_e4m3 to get 8-bit quantized inference. |
How can we quantize the model ourselves? I want a Q5 and maybe even a Q6, as I think Q4 is more aggressively quantized than I want. I'm happy to generate it myself, but I haven't figured out how to convert the safetensors to GGUF. I've been trying to use the llama tools, but they say they can't open the safetensors file (or recognize the config.json file). |
I only referred to the quantization method of GGUF. After quantizing some layers, it is re-saved as safetensors. It is not strictly GGUF. |
I see that the Python gguf module and its quants methods are used. However, I want to make a Q5_K_M version of CogVideoX-Fun. I'm wondering what steps were used to make the Q4_0 GGUF files, so I can do the same for a Q5_K_M version. EDIT: Looks like this was hardcoded to only support Q4_0, if I'm not mistaken... |
I tried to find the torch quantization code for Q5_K_M, but it seems that it doesn't exist. This may be beyond my ability. |
Can you make a Q4 GGUF for the 5B-I2V model? |
Looks like it is just referred to as "K" and not "K_M" in the source files. The "K" methods are newer than the "_0" methods and are considered better at the same size, so they are the ones to prefer. I'd really love a Q6_K quant of the CogVideoX-Fun model, and here is the code for that: |
Look, there is no quantize_blocks inside. |
You are correct. How about Q5_1 then? It has a quantize_blocks, provides more precision than Q4_0, while still being marginally smaller than fp8: https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/gguf/quants.py#L337 Also, if you don't care to do it, I'd be more than happy to do the quant myself if you shared the steps involved. |
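For a rough sense of the size comparison above, here is a small sketch that computes bits per weight from the ggml block layouts (for example, a Q5_1 block stores 32 weights in 24 bytes). The `BLOCK_LAYOUTS` table and the `bits_per_weight` helper are names made up for this illustration, and the numbers only apply to tensors that are actually quantized, so a real file that keeps some layers in higher precision will be larger.

```python
# (weights per block, bytes per block) for a few ggml quantization types.
# fp8 is listed as a 1-weight, 1-byte "block" purely for comparison.
BLOCK_LAYOUTS = {
    "Q4_0": (32, 2 + 16),              # fp16 scale + 32 x 4-bit codes
    "Q5_1": (32, 2 + 2 + 4 + 16),      # fp16 scale + fp16 min + 32 high bits + 32 low nibbles
    "Q6_K": (256, 128 + 64 + 16 + 2),  # low 4 bits + high 2 bits + int8 sub-scales + fp16 scale
    "Q8_0": (32, 2 + 32),              # fp16 scale + 32 x int8 codes
    "fp8": (1, 1),
}


def bits_per_weight(name):
    """Average storage cost of one weight, in bits, for the given type."""
    weights, nbytes = BLOCK_LAYOUTS[name]
    return nbytes * 8 / weights


if __name__ == "__main__":
    for name in BLOCK_LAYOUTS:
        print(f"{name}: {bits_per_weight(name)} bits per weight")
```

By this count Q5_1 costs 6 bits per weight against 8 for fp8 and 4.5 for Q4_0.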
You need to rewrite the quantize_blocks of Q5_1 in torch and save a new model file after quantizing the weights of the original model. In addition, you need to modify https://github.com/MinusZoneAI/ComfyUI-CogVideoX-MZ/blob/main/mz_gguf_loader.py#L19 so that it recognizes the Q5_1 type and calls dequantize_blocks during inference. This is quite cumbersome, and not as easy as it might seem. |
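To make the Q5_1 layout concrete, here is a minimal sketch of a quantize/dequantize round trip for a single 32-value block. It is written in plain Python with `struct` rather than torch so that it runs without dependencies, so it is not the torch port described above and not the code used in this repo. It follows the block layout used by the gguf-py quants.py linked earlier (fp16 scale, fp16 minimum, 32 packed high bits, 16 bytes of low nibbles), and the function names are made up for this example.

```python
import struct

BLOCK_SIZE = 32  # Q5_1 quantizes weights in blocks of 32 values


def quantize_block_q5_1(values):
    """Pack 32 floats into one 24-byte Q5_1 block: d, m, qh, qs."""
    assert len(values) == BLOCK_SIZE
    lo, hi = min(values), max(values)
    d = (hi - lo) / 31                    # step between the 32 levels
    inv_d = 1.0 / d if d != 0 else 0.0
    # Round each value to the nearest 5-bit code in [0, 31].
    q = [min(31, max(0, int((v - lo) * inv_d + 0.5))) for v in values]
    # Low 4 bits: element i shares a byte with element i + 16.
    qs = bytes((q[i] & 0x0F) | ((q[i + 16] & 0x0F) << 4) for i in range(16))
    # Fifth bit of every code, packed into one little-endian 32-bit word.
    qh = 0
    for i, code in enumerate(q):
        qh |= (code >> 4) << i
    # struct's "e" format is IEEE fp16; it raises OverflowError if d or lo
    # does not fit, which is not expected for typical weight magnitudes.
    return struct.pack("<eeI", d, lo, qh) + qs


def dequantize_block_q5_1(block):
    """Unpack one 24-byte Q5_1 block back into 32 floats."""
    d, m, qh = struct.unpack_from("<eeI", block, 0)
    qs = block[8:24]
    out = []
    for i in range(BLOCK_SIZE):
        low = (qs[i % 16] >> (4 * (i // 16))) & 0x0F
        high = (qh >> i) & 1
        out.append((low | (high << 4)) * d + m)
    return out
```

A torch version would apply the same arithmetic to a tensor reshaped to `(n_blocks, 32)`, and the loader would need the matching dequantize step for the new type.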
Do you have the script that you used to rewrite the blocks into Q4_0 available somewhere? |
There is no such script. I implemented it quickly by rewriting my local ComfyUI code directly, so the messy code is mixed in with my local ComfyUI. After all, at the very beginning I only intended to use it once. However, I can provide a code snippet rewritten for torch.
|
Thank you for your work on quantizing the models! This is all new to me, but thanks to this discussion and the provided snippet I think I managed to quantize the I2V model in a similar way. At least it works: https://huggingface.co/Kijai/CogVideoX_GGUF/blob/main/CogVideoX_5b_I2V_GGUF_Q4_0.safetensors |
@kijai With my available VRAM (you've seen me asking in your own repo) how should I load GGUF quants for it to not OOM? |
I believe MinusZone AI has done that in this repo, I haven't tried that though. |
I only call pipe.enable_sequential_cpu_offload() in the loader. I think it should be effective. Most of the time, the OOM I encounter occurs in VAE encode. |
With the setting in the image (main_device) it is much slower than using the transformer models, because it spills into shared memory (RAM). Wailovet probably means VRAM minimization, but what I, and probably many others with 8-12GB cards, would prefer is the kind of optimization WebUI Forge or llama.cpp offers: fitting the model to the available VRAM, or even letting us pick how much it consumes. I tried bringing this up in the original implementation's repo, and they implied that it's not planned (by telling me to keep using the enable_sequential_cpu_offload option). |
Hi @MinusZoneAI
Please upload the GGUF model to Hugging Face. The link on the Chinese server is very slow to download from.
Thank you