
Note that all these models are derivatives of black-forest-labs/FLUX.1-dev and are therefore covered by the FLUX.1 [dev] Non-Commercial License.

Some models are derivatives of finetunes and are included with the permission of the finetuner.

Optimised Flux GGUF models

A collection of GGUF models using mixed quantization (different layers quantized to different precisions to optimise fidelity versus memory).

They were created using the convert.py script.

They can be loaded in ComfyUI using the ComfyUI GGUF Nodes. Just put the gguf files in your models/unet directory.
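If you want to check how a particular file mixes precisions, a minimal sketch along these lines should work, assuming the `gguf` Python package (the llama.cpp tooling that ComfyUI-GGUF also builds on) is installed; the file path below is just an example:

```python
# Minimal sketch: list the quantization type chosen for each tensor in a
# mixed-quant file. Assumes the `gguf` Python package is installed
# (pip install gguf); the file path below is just an example.
from collections import Counter

import gguf

reader = gguf.GGUFReader("models/unet/flux1-dev_mx8_2.gguf")

counts = Counter()
for tensor in reader.tensors:
    qtype = gguf.GGMLQuantizationType(tensor.tensor_type).name
    counts[qtype] += 1
    print(f"{tensor.name:60s} {qtype}")

print("tensors per quantization type:", dict(counts))
```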

Naming convention (mx for 'mixed')

[original_model_name]_mxN_N.gguf

where N_N is the average number of bits per parameter.
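As a rough illustration of where that number comes from, the average can be computed from the per-layer choices. This is not taken from convert.py: the bit costs below are nominal GGUF figures, and the layer names and parameter counts are purely hypothetical.

```python
# Rough illustration (not from convert.py): how the average bits-per-parameter
# in the file name could be derived from a per-layer recipe. Bit costs are
# nominal GGUF figures; the layer names and parameter counts are hypothetical.
BITS_PER_PARAM = {"F32": 32.0, "F16": 16.0, "Q8_0": 8.5, "Q6_K": 6.56, "Q5_K": 5.5, "Q4_K": 4.5}

# Hypothetical recipe: tensor name -> (parameter count, chosen quantization)
recipe = {
    "layer_a.weight": (28_000_000, "Q8_0"),
    "layer_b.weight": (38_000_000, "Q5_K"),
    "layer_c.weight": (66_000_000, "Q4_K"),
    "final_layer.weight": (200_000, "F32"),
}

total_bits = sum(n * BITS_PER_PARAM[q] for n, q in recipe.values())
total_params = sum(n for n, _ in recipe.values())
avg = total_bits / total_params
suffix = "mx" + f"{avg:.1f}".replace(".", "_")   # e.g. 5.7 -> mx5_7
print(f"average bits per parameter: {avg:.2f} -> {suffix}")
```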

Good choices to start with

-  3_8 might work on an 8 GB card
-  6_9 should be good for a 12 GB card
-  8_2 is a good choice for 16 GB cards if you want to add LoRAs, etc.
-  9_2 fits on a 16 GB card

Speed?

On an A40 (plenty of VRAM), with everything identical except the model, the time taken to generate an image (30 steps, deis sampler) was about 65% longer than for the full model.

Quantised models will generally be slower because the quantized weights have to be converted back into a native torch format each time they are used.
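To illustrate the overhead, here is a minimal torch sketch (not the real GGUF format or the ComfyUI-GGUF implementation): a quantized layer stores integer weights plus per-block scales, and has to expand them back to a float tensor on every forward pass before the matmul can run.

```python
# Torch sketch of the dequantization overhead (not the real GGUF format or the
# ComfyUI-GGUF implementation): int8 weights plus per-block scales have to be
# expanded back to a float tensor every time the layer is used.
import torch

def quantize_q8_blocks(w: torch.Tensor, block: int = 32):
    """Symmetric 8-bit quantization, one float scale per block of 32 weights."""
    flat = w.reshape(-1, block)
    scale = flat.abs().amax(dim=1, keepdim=True) / 127.0
    return torch.round(flat / scale).to(torch.int8), scale

def dequantize_q8_blocks(q: torch.Tensor, scale: torch.Tensor, shape) -> torch.Tensor:
    # This multiply-and-reshape is the extra work a quantized layer does per call.
    return (q.to(torch.float32) * scale).reshape(shape)

w = torch.randn(3072, 3072)
q, s = quantize_q8_blocks(w)
x = torch.randn(1, 3072)
y = x @ dequantize_q8_blocks(q, s, w.shape).T   # dequantize first, then matmul
```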

How is this optimised?

The process for optimisation is as follows (a code sketch of the cost measurement and recipe selection follows the list):

  - 240 prompts taken from popular Flux images on civit.ai were run through the full Flux.1-dev model with randomised resolution and step count.
  - For a randomly selected step in the inference, the hidden states before and after the layer stack were captured.
  - For each layer in turn, and for each candidate quantization:
    - A single layer was quantized
    - The initial hidden states were processed by the modified layer stack
    - The error (MSE) in the final hidden state was calculated
  - This gives a 'cost' for each possible layer quantization: how different its output is from the full model's
  - An optimised quantization is one that gives the desired reduction in size for the smallest total cost
    - A series of recipes for optimization has been created from the calculated costs
  - The various 'in' blocks, the final layer blocks, and all normalization scale parameters are stored in float32
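
A condensed sketch of this procedure is below. It is not the actual convert.py: `quantize_layer`, the layer stack, and the captured hidden states are hypothetical stand-ins, and the recipe selection shown is a simple greedy approximation of "smallest total cost for the target size".

```python
# Condensed sketch of the procedure described above, not the actual convert.py.
# `quantize_layer`, `layer_stack`, and the captured hidden states are
# hypothetical stand-ins; the recipe selection is a greedy approximation.
import copy
import torch

QUANT_OPTIONS = ["Q8_0", "Q5_K", "Q4_K"]                       # candidate precisions, largest first
BYTES_PER_PARAM = {"Q8_0": 1.06, "Q5_K": 0.69, "Q4_K": 0.56}   # nominal sizes

def cost_of(layer_stack, layer_idx, quant, hidden_in, hidden_ref):
    """MSE on the final hidden state when exactly one layer is quantized."""
    patched = copy.deepcopy(layer_stack)
    patched[layer_idx] = quantize_layer(patched[layer_idx], quant)  # hypothetical helper
    with torch.no_grad():
        hidden = hidden_in
        for layer in patched:
            hidden = layer(hidden)
    return torch.mean((hidden - hidden_ref) ** 2).item()

def build_cost_table(layer_stack, hidden_in, hidden_ref):
    return {(i, q): cost_of(layer_stack, i, q, hidden_in, hidden_ref)
            for i in range(len(layer_stack)) for q in QUANT_OPTIONS}

def choose_recipe(costs, param_counts, size_budget_bytes):
    """Greedy: start every layer at Q8_0, then repeatedly step down the layer
    whose next-smaller quant adds the least cost per byte saved, until the
    total size fits the budget."""
    recipe = {i: "Q8_0" for i in param_counts}

    def current_size():
        return sum(param_counts[i] * BYTES_PER_PARAM[recipe[i]] for i in recipe)

    while current_size() > size_budget_bytes:
        candidates = []
        for i, q in recipe.items():
            step = QUANT_OPTIONS.index(q) + 1
            if step < len(QUANT_OPTIONS):
                nxt = QUANT_OPTIONS[step]
                saved = param_counts[i] * (BYTES_PER_PARAM[q] - BYTES_PER_PARAM[nxt])
                added = costs[(i, nxt)] - costs[(i, q)]
                candidates.append((added / saved, i, nxt))
        if not candidates:
            break                      # budget unreachable with these options
        _, i, nxt = min(candidates)
        recipe[i] = nxt
    return recipe
```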

Also note

  - Tests using bitsandbytes quantizations showed they did not perform as well as equivalently sized GGUF quants
  - Using different quantizations for different parts of the same layer gave significantly worse results
  - Leaving bias in 16 bit made no relevant difference (the 'patched' models generally leave bias in 16 bit)
  - Costs were evaluated for the original Flux.1-dev model; they are probably essentially the same for finetunes