Image Mixer is a model that lets you combine the concepts, styles, and compositions from multiple images (and text prompts too) and generate new images.
It was trained by Justin Pinkney at Lambda Labs.
Training details
This model is a fine-tuned version of Stable Diffusion Image Variations. It has been trained to accept multiple CLIP embeddings concatenated along the sequence dimension (as opposed to the single embedding in the original model). During training, up to 5 crops are taken from each training image and their CLIP embeddings extracted; these are concatenated and used as the conditioning for the model. At inference time, CLIP embeddings from multiple images can be used to generate images which are influenced by multiple inputs.
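The conditioning described above can be sketched as follows. This is a minimal illustration of concatenating embeddings along the sequence dimension, not the repo's actual code; the embedding width of 768 (CLIP ViT-L/14) and the array shapes are assumptions.

```python
import numpy as np

EMBED_DIM = 768  # assumed CLIP ViT-L/14 embedding width


def concat_conditionings(embeddings):
    """Concatenate per-input CLIP embeddings along the sequence dimension.

    Each embedding has shape (batch, 1, EMBED_DIM); the result has shape
    (batch, n_inputs, EMBED_DIM) and would serve as the model conditioning.
    """
    return np.concatenate(embeddings, axis=1)


# During training, up to 5 image crops contribute one embedding each.
crops = [np.random.randn(1, 1, EMBED_DIM) for _ in range(5)]
cond = concat_conditionings(crops)
print(cond.shape)  # (1, 5, 768)
```

The original Image Variations model would correspond to the single-input case, i.e. a conditioning of shape `(1, 1, 768)`.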
Training was done at 640x640 on a subset of the LAION improved-aesthetics dataset, using 8xA100 GPUs from Lambda GPU Cloud.
Note that text captions were not used during training. Although input text embeddings work to some extent at inference time, the model is primarily designed to accept image embeddings.
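Since image and text embeddings can both be supplied, one natural way to control their relative influence is to scale each embedding by a per-input strength before concatenating. The sketch below illustrates that idea only; it is an assumption for exposition, not the repo's exact mixing mechanism, and the 768-dim shape is likewise assumed.

```python
import numpy as np

EMBED_DIM = 768  # assumed CLIP ViT-L/14 embedding width


def mix_conditionings(embeddings, strengths):
    """Scale each CLIP embedding by a strength, then concatenate.

    `embeddings`: list of arrays of shape (1, 1, EMBED_DIM), from images
    or text prompts; `strengths`: one float per embedding. Illustrative
    only -- the actual app may weight inputs differently.
    """
    scaled = [s * e for e, s in zip(embeddings, strengths)]
    return np.concatenate(scaled, axis=1)


image_emb = np.random.randn(1, 1, EMBED_DIM)
text_emb = np.random.randn(1, 1, EMBED_DIM)  # text influence is weaker, per the note above
cond = mix_conditionings([image_emb, text_emb], [1.0, 0.5])
print(cond.shape)  # (1, 2, 768)
```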
Usage
The model is available on Hugging Face Spaces, or to run it locally do the following:
git clone https://github.com/justinpinkney/stable-diffusion.git
cd stable-diffusion
git checkout 1c8a598f312e54f614d1b9675db0e66382f7e23c
python -m venv .venv --prompt sd
. .venv/bin/activate
pip install -U pip
pip install -r requirements.txt
python scripts/gradio_image_mixer.py
Then navigate to the gradio demo link printed in the terminal.
For details on how to use the model outside the app, refer to the run function in gradio_image_mixer.py in the original repo.