My version runs both Dev and Schnell on a 3090 using quantized models, with a gradio front end.
Both models are quantized on startup, so launch takes a few minutes, but after that I get image generations in under 2 minutes for Dev and just a few seconds for Schnell.
Here is the GitHub repo and a video explaining it:
https://github.com/NuclearGeekETH/NuclearGeek-Flux-Capacitor
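For reference, the core pattern looks roughly like this. This is a minimal sketch using diffusers + optimum-quanto with a bare-bones gradio wrapper, not the repo's actual code; the model IDs, fp8 weights, step counts, and the CPU-offload call are my assumptions:

```python
import torch
import gradio as gr
from diffusers import FluxPipeline
from optimum.quanto import freeze, qfloat8, quantize

def load_quantized(model_id: str) -> FluxPipeline:
    # Load in bf16, then quantize the two heaviest components
    # (the DiT transformer and the T5 text encoder) to fp8.
    pipe = FluxPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    quantize(pipe.transformer, weights=qfloat8)
    freeze(pipe.transformer)
    quantize(pipe.text_encoder_2, weights=qfloat8)
    freeze(pipe.text_encoder_2)
    # Assumption: offloading is what lets both models coexist on a 24 GB 3090.
    pipe.enable_model_cpu_offload()
    return pipe

# The slow startup step: quantize both models once, up front.
PIPES = {
    "dev": load_quantized("black-forest-labs/FLUX.1-dev"),
    "schnell": load_quantized("black-forest-labs/FLUX.1-schnell"),
}

def generate(prompt: str, model: str):
    if model == "dev":
        return PIPES["dev"](prompt, num_inference_steps=28).images[0]
    # Schnell is step-distilled: few steps, no real CFG, shorter T5 context.
    return PIPES["schnell"](
        prompt, num_inference_steps=4, guidance_scale=0.0, max_sequence_length=256
    ).images[0]

gr.Interface(
    fn=generate,
    inputs=[gr.Textbox(label="Prompt"), gr.Radio(["dev", "schnell"], value="dev")],
    outputs=gr.Image(label="Result"),
).launch()
```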
This makes one hell of a difference in inference speed. I tested with 28 steps and quantized text_encoder_2 (the T5 encoder), not text_encoder.
Ran in 35 seconds flat, nice!! I'm on an RTX 4090, so no CPU offloading is needed. The result was very good.
Note: if running fully on the GPU (without enable_model_cpu_offload()), you should quantize BEFORE sending the pipeline to device="cuda".
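In code, that ordering looks like this (a sketch assuming the diffusers FluxPipeline and optimum-quanto; the prompt and seed are placeholders, and presumably the point of the order is that the full bf16 weights never have to sit in VRAM before quantization):

```python
import torch
from diffusers import FluxPipeline
from optimum.quanto import freeze, qfloat8, quantize

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)

# Quantize while the weights are still on the CPU...
quantize(pipe.transformer, weights=qfloat8)
freeze(pipe.transformer)
quantize(pipe.text_encoder_2, weights=qfloat8)  # the T5 encoder, not CLIP
freeze(pipe.text_encoder_2)

# ...then move the whole pipeline to the GPU.
pipe.to("cuda")

image = pipe(
    "a flux capacitor on a workbench",
    num_inference_steps=28,
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]
image.save("flux.png")
```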
Running transformer freeze DEV
Running text_encoder freeze DEV
seed = 17894334164879757554
100%|██████████| 28/28 [00:35<00:00,  1.28s/it]