Finetuning details/code
Thanks for this great experiment; it's thought-provoking!
I would be interested in pushing the limits a bit, seeing how we can produce "events" that last less than 10 seconds, something like: "groovy rock music with a 4-second sax riff". I'm also curious how we could deal with ambient events, like a dog barking or a bird singing. What do you suggest we do if we have shorter audio clips (like a 2-second bird song)? Should we just loop the sound to reach 10 seconds, or pad the audio with silence?
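To make the two options concrete, here's a rough numpy sketch of what I mean (the file name is a placeholder, and I'm assuming a mono clip):

```python
import numpy as np
import soundfile as sf  # assumption: soundfile, but any WAV loader works

# Hypothetical input: a short mono clip, e.g. a 2-second bird song.
audio, sr = sf.read("bird_song.wav")
target_len = 10 * sr  # 10 seconds' worth of samples

# Option A: loop the clip until it fills 10 seconds.
reps = int(np.ceil(target_len / len(audio)))
looped = np.tile(audio, reps)[:target_len]

# Option B: pad the clip with trailing silence up to 10 seconds.
padded = np.pad(audio, (0, target_len - len(audio)))
```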
What if we're trying to capture something that exceeds 10 seconds, like a long fog horn? I understand the 10 seconds is a choice you made to match the image size, but once we pick a limit, we're kind of stuck with it...
Finally, any chance you would share your fine-tuning setup? It would save us a lot of time as we try to push the envelope of what you've accomplished!
Thanks again!
I can get fine-tuning to run just using the `examples/train_text_to_image.py` script. You do need to format your data, but that's covered in the HF docs so it's not a huge issue.
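For reference, the format in the HF docs is the imagefolder layout: a folder of spectrogram images plus a `metadata.jsonl` mapping each file to a caption. A sketch of building that file (directory name, file names, and captions are made-up examples):

```python
import json
from pathlib import Path

# Hypothetical layout: train/ holds the spectrogram PNGs.
# train_text_to_image.py (pointed at it via --train_data_dir) matches
# each image to its caption through the "file_name" key; the script's
# default caption column is "text".
data_dir = Path("train")
captions = {
    "funk_groove_01.png": "groovy funk with slap bass",
    "lofi_beat_02.png": "lofi hip hop beat with vinyl crackle",
}
with open(data_dir / "metadata.jsonl", "w") as f:
    for file_name, text in captions.items():
        f.write(json.dumps({"file_name": file_name, "text": text}) + "\n")
```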
In my case I had to download the riffusion model, since I was getting errors when trying to fetch it directly from HF (i.e., in code). So I used git to download a local copy (you need `git-lfs`).
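For reference, this is roughly how I point diffusers at the local clone (the clone path is just an example):

```python
from diffusers import StableDiffusionPipeline

# Cloned beforehand with:
#   git lfs install
#   git clone https://huggingface.co/riffusion/riffusion-model-v1
# then loaded from the local directory instead of pulling from the hub:
pipe = StableDiffusionPipeline.from_pretrained("./riffusion-model-v1")
```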
The problem I'm having is that, when trying to run the model, I get a type error: `RuntimeError: Input type (c10::Half) and bias type (float) should be the same`. This is happening in `F.conv2d()`. Full trace:
```
Traceback (most recent call last):
  File "/home/james/anaconda3/envs/riffusion/lib/python3.9/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 565, in _run_script
    exec(code, module.__dict__)
  File "/home/james/src/somms/riffusion/riffusion/streamlit/pages/text_to_audio.py", line 102, in <module>
    render_text_to_audio()
  File "/home/james/src/somms/riffusion/riffusion/streamlit/pages/text_to_audio.py", line 78, in render_text_to_audio
    image = streamlit_util.run_txt2img(
  File "/home/james/anaconda3/envs/riffusion/lib/python3.9/site-packages/streamlit/runtime/caching/cache_utils.py", line 428, in wrapper
    return get_or_create_cached_value()
  File "/home/james/anaconda3/envs/riffusion/lib/python3.9/site-packages/streamlit/runtime/caching/cache_utils.py", line 401, in get_or_create_cached_value
    return_value = func(*args, **kwargs)
  File "/home/james/src/somms/riffusion/riffusion/streamlit/util.py", line 103, in run_txt2img
    output = pipeline(
  File "/home/james/anaconda3/envs/riffusion/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/james/anaconda3/envs/riffusion/lib/python3.9/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 531, in __call__
    noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
  File "/home/james/anaconda3/envs/riffusion/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/james/anaconda3/envs/riffusion/lib/python3.9/site-packages/diffusers/models/unet_2d_condition.py", line 421, in forward
    sample = self.conv_in(sample)
  File "/home/james/anaconda3/envs/riffusion/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/james/anaconda3/envs/riffusion/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/james/anaconda3/envs/riffusion/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (c10::Half) and bias type (float) should be the same
```
Any thoughts as to what might be going on?
Ah, false alarm. I had hacked some stuff yesterday while trying to get things working... removing my hacks, it works fine.
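For anyone who hits the same `RuntimeError` without a hack of their own to blame: it generally means fp16 and fp32 tensors are being mixed somewhere in the pipeline. A minimal sketch of the usual fix, keeping everything in one dtype (using riffusion's public model id; adapt to a local clone as needed):

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the whole pipeline in ONE dtype. Mixing fp16 UNet weights with
# fp32 inputs (or vice versa) is what triggers the conv2d error above.
pipe = StableDiffusionPipeline.from_pretrained(
    "riffusion/riffusion-model-v1",  # or a local clone, as above
    torch_dtype=torch.float16,
).to("cuda")

# Or stay in full float32 (e.g. for CPU):
# pipe = StableDiffusionPipeline.from_pretrained("riffusion/riffusion-model-v1")
```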
@jbmaxwell I'm working on the same thing. What's a good way to connect with you so we can share what we learn along the way?