Recommendation to Revisit the Diffusers Default LoRA Parameters
Over the last year I have trained hundreds of LoRA finetunes with SDXL, and in the short time I've spent back in the consulting space, I have tested over a dozen startup apps that offer finetuning services on their platforms. I have seen, very consistently, the same general quality results from these training pipelines:
- If the style is generic, loud, and in particular 3D, the output generally looks acceptable to the average viewer.
- Concepts in the original dataset are nearly always slightly overfit, and sometimes incredibly overfit, appearing in most images to a greater or lesser degree.
- Minimalist styles, realistic photography styles, and styles trained from human-created datasets start to fall apart.
- Degradation is very obvious when you zoom in on the edges of many images.
Some examples: In this example you can see the level of fidelity; however, the prompt was "a small boy". Although this character was in a reasonably sized dataset, the concept became overfit before the fidelity of the model itself degraded. This indicates to me that training was too fast.
Alternatively, this image exhibits not just slight concept overfitting; you can also see line degradation. This can be hard to discern, because early overfit line degradation looks similar to underfitting. The tell is whether concepts also look overfit, which would not be the case in a truly underfit model.
Additionally, you can see in this example of realism that the more nuanced details which make realism convincing were not learned, despite broader concepts being understood. I speculate this is another result of training that is too fast and strong.
I was initially puzzled: why would training across the space so consistently exhibit the same set of problems, particularly when, in my experience, those problems are generally avoidable? So I started asking questions. I was lucky enough to start training SDXL using The Last Ben's runpod notebook, which has very reasonable suggested settings. However, I remembered that I had been presented with the Diffusers presets during early tests and had originally tried to work with them. So I recently started asking founders: are you working from the Diffusers presets?
The answer was, of course, yes.
For context, here are the presets from the Diffusers LoRA blog.
In my experience, these produce fast results, but the results are inferior. The solution, however, is a simple one. In my opinion, the Unet and Text Encoder learning rates are set much too high. This leads to very fast training, and unless one is training a somewhat generic concept whose dataset has been through a VAE before, I find that the results are often a complete failure. Even in the instances where the overall style is captured, I consistently see issues with fidelity, prompt coherence, and overfitting of concepts. Additionally, because the learning rate is much higher, there is a natural need to reduce the overall training steps. I find this also forces quick learning that does not produce a model I personally would be happy with.
From a practical standpoint, it is unrealistic for a startup to test every training edge case. Additionally, AI-generated datasets are much easier to curate, so in my experience many startups rely on them for at least part of their training process. Diffusers is also an important resource, so it seems natural to rely on its presets as a good jumping-off point. In my opinion, this is leading to widespread adoption of poor training practices.
(As a side note, the presets also set 512 as the image resolution for training, when it should almost certainly be 1024 for SDXL.)
Here is my suggestion for a revised preset:
- resolution: 1024
- train batch size: 4
- max training steps: [# of images x 250]*
- Unet LR: 5e-5
- Text Encoder LR: 1e-5
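To make the step arithmetic concrete, here is a minimal sketch of the revised preset in Python. The function and key names are my own illustration, not official Diffusers flags, so map them onto whichever training script or platform you use:

```python
# A minimal sketch of the revised preset. Key names are illustrative,
# not Diffusers CLI flags -- translate them to your training script.

def suggested_preset(num_images: int) -> dict:
    """Build the revised training configuration for a given dataset size."""
    return {
        "resolution": 1024,                   # SDXL's native resolution, not 512
        "train_batch_size": 4,
        "max_train_steps": num_images * 250,  # roughly 250 steps per training image
        "unet_lr": 5e-5,
        "text_encoder_lr": 1e-5,              # keep the TE below the Unet LR
    }

# Example: a 20-image dataset trains for 5,000 steps.
print(suggested_preset(num_images=20))
```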
To preserve complex details, I will often raise the overall step count and reduce the learning rate incrementally. I typically will not go lower than 9e-7, and I keep the text encoder at a lower rate than the Unet; I find subjects need more focused text encoder training than styles do.
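As an illustration of that incremental reduction, here is a small sketch. The halving factor and stage count are my assumptions for the example; the 9e-7 floor is the one described above:

```python
# Illustrative only: lower the Unet LR stage by stage while raising the
# step count, never dropping below the 9e-7 floor. The halving factor
# is an assumption, not a prescription.

def staged_unet_lr(stage: int, base_lr: float = 5e-5, floor: float = 9e-7) -> float:
    """Halve the learning rate each successive stage, clamped at the floor."""
    return max(base_lr * (0.5 ** stage), floor)

for stage in range(8):
    print(f"stage {stage}: unet_lr = {staged_unet_lr(stage):.1e}")
```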
* It may not serve everyone, but I think it is imperative to use stops in training, saving checkpoints at 60, 90, 120, 150, 180, and 210 steps per image in the dataset, or at other similarly spaced intervals. It is unrealistic to expect every dataset to need the same step count, even within the same style, and offering stops also gives users a sense of control over the final results. If you cannot do this, you may find that stopping closer to 150 steps per image is better in many cases.
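The stop schedule itself is easy to compute. A sketch, assuming you can save checkpoints at arbitrary absolute step counts; with a Diffusers script that only accepts a fixed --checkpointing_steps interval, an interval of 30 x the image count would hit every stop listed here:

```python
# Convert per-image "stops" into absolute step counts at which to save
# intermediate checkpoints, so users can pick the best stopping point.

STOPS_PER_IMAGE = (60, 90, 120, 150, 180, 210)

def checkpoint_steps(num_images: int) -> list[int]:
    """Absolute training steps at which to save a checkpoint."""
    return [per_image * num_images for per_image in STOPS_PER_IMAGE]

# Example: a 20-image dataset gets checkpoints at these steps.
print(checkpoint_steps(num_images=20))
# -> [1200, 1800, 2400, 3000, 3600, 4200]
```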
Positive Examples:
In this example you can see that the linework is very clean and, although digital, does not have the melty effect that obviously AI-generated linework can take on when it is slightly overfit.
Not only is the concept not overfit, but this image clearly shows the level of fidelity that is achievable by slowing down a training run.
In conclusion, I believe it would benefit the whole community if the default Diffusers parameters were revisited. I hope this helps, and I would certainly love to hear what results others get with my approach. I don't think my method is definitive, but I do think adoption of more reasonable presets would lead to better results throughout the space, and Diffusers has a responsibility to suggest a preset that leads to more successes.