camenduru committed 02ca9a4 (1 parent: 16eb622): thanks to TMElyralab ❤

README.md ADDED

---
license: creativeml-openrail-m
language:
- en
---
# MuseTalk

MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting
<br/>
Yue Zhang <sup>\*</sup>,
Minhao Liu<sup>\*</sup>,
Zhaokang Chen,
Bin Wu<sup>†</sup>,
Yingjie He,
Chao Zhan,
Wenjiang Zhou
(<sup>*</sup>Equal Contribution, <sup>†</sup>Corresponding Author, [email protected])

**[github](https://github.com/TMElyralab/MuseTalk)** **[huggingface](https://huggingface.co/TMElyralab/MuseTalk)** **Project (coming soon)** **Technical report (coming soon)**

We introduce `MuseTalk`, a **real-time, high-quality** lip-syncing model (30fps+ on an NVIDIA Tesla V100). MuseTalk can be applied to input videos, e.g., those generated by [MuseV](https://github.com/TMElyralab/MuseV), as part of a complete virtual-human solution.

# Overview
`MuseTalk` is a real-time, high-quality, audio-driven lip-syncing model trained in the latent space of `ft-mse-vae`. It

1. modifies an unseen face according to the input audio, with a face region size of `256 x 256`.
1. supports audio in various languages, such as Chinese, English, and Japanese.
1. supports real-time inference at 30fps+ on an NVIDIA Tesla V100.
1. supports modification of the center point of the proposed face region, which **SIGNIFICANTLY** affects the generation results.
1. provides a checkpoint trained on the HDTF dataset.
1. will provide training code (coming soon).

# News
- [04/02/2024] Released the MuseTalk project and pretrained models.

## Model
![Model Structure](assets/figs/musetalk_arc.jpg)
MuseTalk was trained in latent space, where the images were encoded by a frozen VAE and the audio by a frozen `whisper-tiny` model. The architecture of the generation network is borrowed from the UNet of `stable-diffusion-v1-4`, with the audio embeddings fused into the image embeddings by cross-attention. A rough sketch of how these pieces fit together is shown below.

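All of these building blocks are standard `diffusers` / `whisper` components, so an unofficial sketch of the wiring could look like the following. The 8-channel latent input and the dummy shapes are assumptions read off the shipped `musetalk/musetalk.json` (`in_channels: 8`, `cross_attention_dim: 384`), not the project's actual code:

```python
# Unofficial sketch of the MuseTalk building blocks (not the repo's code).
# Paths follow the ./models layout described under "Download weights".
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel

vae = AutoencoderKL.from_pretrained("./models/sd-vae-ft-mse")   # frozen image encoder/decoder
vae.requires_grad_(False)

unet = UNet2DConditionModel.from_config(
    UNet2DConditionModel.load_config("./models/musetalk/musetalk.json")
)
# Assumes pytorch_model.bin is a plain state dict for this UNet config.
unet.load_state_dict(torch.load("./models/musetalk/pytorch_model.bin", map_location="cpu"))

# Assumption: two 4-channel latents (masked face + reference face) are
# concatenated into the 8-channel UNet input; whisper features (384-dim)
# act as the cross-attention context.
latents = torch.randn(1, 8, 32, 32)        # 256 px face crop -> 32 x 32 latent grid
audio_context = torch.randn(1, 50, 384)    # dummy audio embeddings
with torch.no_grad():
    pred = unet(latents, timestep=0, encoder_hidden_states=audio_context).sample
    face = vae.decode(pred).sample          # back to pixel space
```
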
## Cases
### MuseV + MuseTalk bring human photos to life!
<table class="center">
<tr style="font-weight: bolder;text-align:center;">
<td width="33%">Image</td>
<td width="33%">MuseV</td>
<td width="33%">+MuseTalk</td>
</tr>
<tr>
<td>
<img src=assets/demo/musk/musk.png width="95%">
</td>
<td>
<video src=assets/demo/yongen/yongen_musev.mp4 controls preload></video>
</td>
<td>
<video src=assets/demo/yongen/yongen_musetalk.mp4 controls preload></video>
</td>
</tr>
<tr>
<td>
<img src=assets/demo/yongen/yongen.jpeg width="95%">
</td>
<td>
<video src=https://github.com/TMElyralab/MuseTalk/assets/163980830/57ef9dee-a9fd-4dc8-839b-3fbbbf0ff3f4 controls preload></video>
</td>
<td>
<video src=https://github.com/TMElyralab/MuseTalk/assets/163980830/94d8dcba-1bcd-4b54-9d1d-8b6fc53228f0 controls preload></video>
</td>
</tr>
<tr>
<td>
<img src=assets/demo/monalisa/monalisa.png width="95%">
</td>
<td>
<video src=https://github.com/TMElyralab/MuseTalk/assets/163980830/1568f604-a34f-4526-a13a-7d282aa2e773 controls preload></video>
</td>
<td>
<video src=https://github.com/TMElyralab/MuseTalk/assets/163980830/a40784fc-a885-4c1f-9b7e-8f87b7caf4e0 controls preload></video>
</td>
</tr>
<tr>
<td>
<img src=assets/demo/sun1/sun.png width="95%">
</td>
<td>
<video src=https://github.com/TMElyralab/MuseTalk/assets/163980830/37a3a666-7b90-4244-8d3a-058cb0e44107 controls preload></video>
</td>
<td>
<video src=https://github.com/TMElyralab/MuseTalk/assets/163980830/172f4ff1-d432-45bd-a5a7-a07dec33a26b controls preload></video>
</td>
</tr>
<tr>
<td>
<img src=assets/demo/sun2/sun.png width="95%">
</td>
<td>
<video src=https://github.com/TMElyralab/MuseTalk/assets/163980830/37a3a666-7b90-4244-8d3a-058cb0e44107 controls preload></video>
</td>
<td>
<video src=https://github.com/TMElyralab/MuseTalk/assets/163980830/85a6873d-a028-4cce-af2b-6c59a1f2971d controls preload></video>
</td>
</tr>
</table>

* The character in the last two rows, `Xinying Sun`, is a supermodel KOL. You can follow her on [douyin](https://www.douyin.com/user/MS4wLjABAAAAWDThbMPN_6Xmm_JgXexbOii1K-httbu2APdG8DvDyM8).

## Video dubbing
<table class="center">
<tr style="font-weight: bolder;text-align:center;">
<td width="70%">MuseTalk</td>
<td width="30%">Original videos</td>
</tr>
<tr>
<td>
<video src=https://github.com/TMElyralab/MuseTalk/assets/163980830/4d7c5fa1-3550-4d52-8ed2-52f158150f24 controls preload></video>
</td>
<td>
<a href="https://www.bilibili.com/video/BV1wT411b7HU">Link</a>
</td>
</tr>
</table>

* For video dubbing, we applied a self-developed tool that detects the talking person.


# TODO:
- [x] trained models and inference code.
- [ ] technical report.
- [ ] training code.
- [ ] online UI.
- [ ] a better model (may take longer).

# Getting Started
We provide a detailed tutorial covering the installation and basic usage of MuseTalk for new users:
## Installation
To prepare the Python environment and install additional packages such as opencv, diffusers, mmcv, etc., please follow the steps below:
### Build environment

We recommend Python >= 3.10 and CUDA 11.7. Set up the environment as follows:

```shell
pip install -r requirements.txt
```
### whisper
Install whisper to extract audio features (only the encoder is used):
```
pip install --editable ./musetalk/whisper
```

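To check that feature extraction works, a quick sketch along these lines pulls out the encoder features. It uses the standard `whisper` API and a placeholder `audio.wav`; the editable fork under `./musetalk/whisper` may expose additional helpers:

```python
# Rough sketch: extract whisper-tiny encoder features for a short clip.
# Assumes the standard whisper API; "audio.wav" is a placeholder path.
import torch
import whisper

model = whisper.load_model("tiny")                        # e.g. ./models/whisper/tiny.pt
audio = whisper.pad_or_trim(whisper.load_audio("audio.wav"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)

with torch.no_grad():
    features = model.encoder(mel.unsqueeze(0))            # only the encoder is needed
print(features.shape)                                     # (1, 1500, 384) for the tiny model
```
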
### mmlab packages
```bash
pip install --no-cache-dir -U openmim
mim install mmengine
mim install "mmcv>=2.0.1"
mim install "mmdet>=3.1.0"
mim install "mmpose>=1.1.0"
```

### Download ffmpeg-static
Download the ffmpeg-static build and set:
```
export FFMPEG_PATH=/path/to/ffmpeg
```
for example:
```
export FFMPEG_PATH=/musetalk/ffmpeg-4.4-amd64-static
```
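A quick, optional sanity check (not part of the repo) that the variable points at a usable binary:

```python
# Optional check that FFMPEG_PATH points at a usable ffmpeg binary.
# Not part of MuseTalk; falls back to any ffmpeg found on PATH.
import os
import shutil
import subprocess

ffmpeg_dir = os.environ.get("FFMPEG_PATH", "")
ffmpeg_bin = os.path.join(ffmpeg_dir, "ffmpeg") if ffmpeg_dir else shutil.which("ffmpeg")

if not ffmpeg_bin or not os.path.isfile(ffmpeg_bin):
    raise RuntimeError("ffmpeg not found; set FFMPEG_PATH to the ffmpeg-static directory")
subprocess.run([ffmpeg_bin, "-version"], check=True)
```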
### Download weights
You can download the weights manually as follows:

1. Download our trained [weights](https://huggingface.co/TMElyralab/MuseTalk).

2. Download the weights of the other components:
   - [sd-vae-ft-mse](https://huggingface.co/stabilityai/sd-vae-ft-mse)
   - [whisper](https://openaipublic.azureedge.net/main/whisper/models/65147644a518d12f04e32d6f3b26facc3f8dd46e5390956a9424a650c0ce22b9/tiny.pt)
   - [dwpose](https://huggingface.co/yzd-v/DWPose/tree/main)
   - [face-parse-bisent](https://github.com/zllrunning/face-parsing.PyTorch)
   - [resnet18](https://download.pytorch.org/models/resnet18-5c106cde.pth)

Finally, these weights should be organized in `models` as follows:
```
./models/
├── musetalk
│   ├── musetalk.json
│   └── pytorch_model.bin
├── dwpose
│   └── dw-ll_ucoco_384.pth
├── face-parse-bisent
│   ├── 79999_iter.pth
│   └── resnet18-5c106cde.pth
├── sd-vae-ft-mse
│   ├── config.json
│   └── diffusion_pytorch_model.bin
└── whisper
    └── tiny.pt
```
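If you prefer to script the Hugging Face downloads, a sketch like the following (using `huggingface_hub`; the dwpose, face-parse-bisent, resnet18, and whisper files still need to be fetched from the direct links above) fills in most of that layout:

```python
# Sketch: download the Hugging Face-hosted weights into ./models/.
# The dwpose / face-parse-bisent / resnet18 / whisper files come from the
# direct links listed above and are not handled here.
from huggingface_hub import snapshot_download

# MuseTalk UNet config + weights -> ./models/musetalk/
snapshot_download(
    repo_id="TMElyralab/MuseTalk",
    local_dir="./models",
    allow_patterns=["musetalk/*"],
)

# VAE -> ./models/sd-vae-ft-mse/
snapshot_download(
    repo_id="stabilityai/sd-vae-ft-mse",
    local_dir="./models/sd-vae-ft-mse",
    allow_patterns=["config.json", "diffusion_pytorch_model.bin"],
)
```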
## Quickstart

### Inference
Here, we provide the inference script:
```
python -m scripts.inference --inference_config configs/inference/test.yaml
```
`configs/inference/test.yaml` is the path to the inference configuration file, which specifies `video_path` and `audio_path`.
The `video_path` should be either a video file or a directory of images.

#### Use of bbox_shift to obtain adjustable results
:mag_right: We have found that the upper bound of the mask has an important impact on mouth openness. Thus, to control the mask region, we suggest using the `bbox_shift` parameter. Positive values (moving towards the lower half of the face) increase mouth openness, while negative values (moving towards the upper half) decrease it.

You can start by running with the default configuration to obtain the adjustable value range, and then re-run the script within this range.

For example, in the case of `Xinying Sun`, after running the default configuration, the reported adjustable value range is [-9, 9]. Then, to decrease the mouth openness, we set the value to `-7`:
```
python -m scripts.inference --inference_config configs/inference/test.yaml --bbox_shift -7
```
:pushpin: More technical details can be found in [bbox_shift](assets/BBOX_SHIFT.md).

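To compare several settings side by side, a small hypothetical wrapper like this simply shells out to the same command for a few values inside the reported range:

```python
# Hypothetical helper: sweep a few bbox_shift values within the range
# reported by the default run (e.g. [-9, 9]) and render one result per value.
import subprocess

CONFIG = "configs/inference/test.yaml"

for shift in (-7, -3, 0, 3, 7):
    print(f"Rendering with bbox_shift={shift} ...")
    subprocess.run(
        [
            "python", "-m", "scripts.inference",
            "--inference_config", CONFIG,
            "--bbox_shift", str(shift),
        ],
        check=True,
    )
```
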
#### Combining MuseV and MuseTalk

For a complete virtual-human solution, we suggest first applying [MuseV](https://github.com/TMElyralab/MuseV) to generate a video (text-to-video, image-to-video, or pose-to-video) by referring to [this guide](https://github.com/TMElyralab/MuseV?tab=readme-ov-file#text2video). Then, use `MuseTalk` to generate a lip-synced video by referring to [this section](https://github.com/TMElyralab/MuseTalk?tab=readme-ov-file#inference).

# Note

If you want to launch online video chats, we suggest generating the videos with MuseV and applying the necessary pre-processing, such as face detection, in advance. During online chatting, only the UNet and the VAE decoder are involved, which is what makes MuseTalk real-time.

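To make that concrete, here is a very rough timing sketch of the online path under those assumptions (dummy tensors, randomly initialized UNet, model paths as in the architecture sketch above; throughput depends entirely on your GPU):

```python
# Rough timing sketch of the online path: one UNet forward + one VAE decode
# per frame. All tensors are dummies; frame latents are assumed pre-encoded.
import time
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel

device = "cuda" if torch.cuda.is_available() else "cpu"
vae = AutoencoderKL.from_pretrained("./models/sd-vae-ft-mse").to(device).eval()
unet = UNet2DConditionModel.from_config(
    UNet2DConditionModel.load_config("./models/musetalk/musetalk.json")
).to(device).eval()

frame_latents = torch.randn(30, 8, 32, 32, device=device)  # pretend: 1 s of prepared frames
audio_chunks = torch.randn(30, 50, 384, device=device)     # matching audio features

start = time.time()
with torch.no_grad():
    for lat, aud in zip(frame_latents, audio_chunks):
        pred = unet(lat[None], timestep=0, encoder_hidden_states=aud[None]).sample
        frame = vae.decode(pred).sample                     # pixel-space frame
print(f"{len(frame_latents) / (time.time() - start):.1f} fps on {device}")
```
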
# Acknowledgement
1. We thank the open-source components [whisper](https://github.com/isaacOnline/whisper/tree/extract-embeddings), [dwpose](https://github.com/IDEA-Research/DWPose), [face-alignment](https://github.com/1adrianb/face-alignment), [face-parsing](https://github.com/zllrunning/face-parsing.PyTorch), and [S3FD](https://github.com/yxlijun/S3FD.pytorch).
1. MuseTalk draws heavily on [diffusers](https://github.com/huggingface/diffusers).
1. MuseTalk is built on the `HDTF` dataset.

Thanks for open-sourcing!

# Limitations
- Resolution: Though MuseTalk uses a face region size of 256 x 256, which makes it better than other open-source methods, it has not yet reached the theoretical resolution bound. We will continue working on this problem.
If you need higher resolution, you could apply super-resolution models such as [GFPGAN](https://github.com/TencentARC/GFPGAN) in combination with MuseTalk.

- Identity preservation: Some details of the original face, such as the mustache and the lip shape and color, are not well preserved.

- Jitter: Some jitter exists because the current pipeline adopts single-frame generation.

# Citation
```bib
@article{musetalk,
  title={MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting},
  author={Zhang, Yue and Liu, Minhao and Chen, Zhaokang and Wu, Bin and He, Yingjie and Zhan, Chao and Zhou, Wenjiang},
  journal={arxiv},
  year={2024}
}
```
# Disclaimer/License
1. `code`: The code of MuseTalk is released under the MIT License. There is no limitation for either academic or commercial usage.
1. `model`: The trained model is available for any purpose, even commercially.
1. `other opensource models`: Other open-source models used in this project must comply with their own licenses, such as `whisper`, `ft-mse-vae`, `dwpose`, `S3FD`, etc.
1. The test data are collected from the internet and are available for non-commercial research purposes only.
1. `AIGC`: This project strives to impact the domain of AI-driven video generation positively. Users are granted the freedom to create videos using this tool, but they are expected to comply with local laws and use it responsibly. The developers do not assume any responsibility for potential misuse by users.

dwpose/README.md ADDED
---
license: apache-2.0
---

dwpose/dw-ll_ucoco.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:e9600664e7927229ed594197d552023e3be213f810beb38847a959ec8261e0f7
size 404734742

dwpose/dw-ll_ucoco_384.onnx ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:724f4ff2439ed61afb86fb8a1951ec39c6220682803b4a8bd4f598cd913b1843
size 134399116

dwpose/dw-ll_ucoco_384.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:0d9408b13cd863c4e95a149dd31232f88f2a12aa6cf8964ed74d7d97748c7a07
size 406878486

dwpose/dw-mm_ucoco.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:b24f27f57d18d8bb7abc3af8e09bcc5f77ee9ecae13439f70a8f7d1b885413cf
size 216812378

dwpose/dw-ss_ucoco.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:c13dfb1dc63aac2d794ac130bb89734330b3c74a1aff921a40fcde1d87361ffc
size 102933707

dwpose/dw-tt_ucoco.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:7097f5af7f100609acffe58eb01734f02ffbfe22794fe029c2ea0a4d68d0f42d
size 68475107

dwpose/rtm-l_ucoco_256-95bb32f5_20230822.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:95bb32f5c6ef235a01e5787b33040d5330f7d315afbbefb66832cabe83b6e49b
size 134223626

dwpose/rtm-x_ucoco_256-05f5bcb7_20230822.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:05f5bcb76599e0e23389a9c21f3390b2aa1a56363b27d844556c3be4b138c536
size 226726579

dwpose/rtm-x_ucoco_384-f5b50679_20230822.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:f5b506794e8e4facfa6ae0bf2a19c7c43d67836a90b69a19beced4ddb54732b4
size 227246772

dwpose/yolox_l.onnx ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:7860ae79de6c89a3c1eb72ae9a2756c0ccfbe04b7791bb5880afabd97855a411
size 216746733

face-parse-bisent/79999_iter.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:468e13ca13a9b43cc0881a9f99083a430e9c0a38abd935431d1c28ee94b26567
size 53289463

face-parse-bisent/resnet18-5c106cde.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:5c106cde386e87d4033832f2996f5493238eda96ccf559d1d62760c4de0613f8
size 46827520

musetalk/musetalk.json ADDED
{
  "_class_name": "UNet2DConditionModel",
  "_diffusers_version": "0.6.0.dev0",
  "act_fn": "silu",
  "attention_head_dim": 8,
  "block_out_channels": [
    320,
    640,
    1280,
    1280
  ],
  "center_input_sample": false,
  "cross_attention_dim": 384,
  "down_block_types": [
    "CrossAttnDownBlock2D",
    "CrossAttnDownBlock2D",
    "CrossAttnDownBlock2D",
    "DownBlock2D"
  ],
  "downsample_padding": 1,
  "flip_sin_to_cos": true,
  "freq_shift": 0,
  "in_channels": 8,
  "layers_per_block": 2,
  "mid_block_scale_factor": 1,
  "norm_eps": 1e-05,
  "norm_num_groups": 32,
  "out_channels": 4,
  "sample_size": 64,
  "up_block_types": [
    "UpBlock2D",
    "CrossAttnUpBlock2D",
    "CrossAttnUpBlock2D",
    "CrossAttnUpBlock2D"
  ]
}

musetalk/pytorch_model.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:0ee7d5ea03ea75d8dca50ea7a76df791e90633687a135c4a69393abfc0475ffe
size 3400076549

sd-vae-ft-mse/README.md ADDED
---
license: mit
tags:
- stable-diffusion
- stable-diffusion-diffusers
inference: false
---
# Improved Autoencoders

## Utilizing
These weights are intended to be used with the [🧨 diffusers library](https://github.com/huggingface/diffusers). If you are looking for the model to use with the original [CompVis Stable Diffusion codebase](https://github.com/CompVis/stable-diffusion), [come here](https://huggingface.co/stabilityai/sd-vae-ft-mse-original).

#### How to use with 🧨 diffusers
You can integrate this fine-tuned VAE decoder into your existing `diffusers` workflows by passing a `vae` argument to the `StableDiffusionPipeline`:
```py
from diffusers.models import AutoencoderKL
from diffusers import StableDiffusionPipeline

model = "CompVis/stable-diffusion-v1-4"
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
pipe = StableDiffusionPipeline.from_pretrained(model, vae=vae)
```

+
24
+ ## Decoder Finetuning
25
+ We publish two kl-f8 autoencoder versions, finetuned from the original [kl-f8 autoencoder](https://github.com/CompVis/latent-diffusion#pretrained-autoencoding-models) on a 1:1 ratio of [LAION-Aesthetics](https://laion.ai/blog/laion-aesthetics/) and LAION-Humans, an unreleased subset containing only SFW images of humans. The intent was to fine-tune on the Stable Diffusion training set (the autoencoder was originally trained on OpenImages) but also enrich the dataset with images of humans to improve the reconstruction of faces.
26
+ The first, _ft-EMA_, was resumed from the original checkpoint, trained for 313198 steps and uses EMA weights. It uses the same loss configuration as the original checkpoint (L1 + LPIPS).
27
+ The second, _ft-MSE_, was resumed from _ft-EMA_ and uses EMA weights and was trained for another 280k steps using a different loss, with more emphasis
28
+ on MSE reconstruction (MSE + 0.1 * LPIPS). It produces somewhat ``smoother'' outputs. The batch size for both versions was 192 (16 A100s, batch size 12 per GPU).
29
+ To keep compatibility with existing models, only the decoder part was finetuned; the checkpoints can be used as a drop-in replacement for the existing autoencoder.
30
+
31
+ _Original kl-f8 VAE vs f8-ft-EMA vs f8-ft-MSE_
32
+
33
+ ## Evaluation
34
+ ### COCO 2017 (256x256, val, 5000 images)
35
+ | Model | train steps | rFID | PSNR | SSIM | PSIM | Link | Comments
36
+ |----------|---------|------|--------------|---------------|---------------|-----------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------|
37
+ | | | | | | | | |
38
+ | original | 246803 | 4.99 | 23.4 +/- 3.8 | 0.69 +/- 0.14 | 1.01 +/- 0.28 | https://ommer-lab.com/files/latent-diffusion/kl-f8.zip | as used in SD |
39
+ | ft-EMA | 560001 | 4.42 | 23.8 +/- 3.9 | 0.69 +/- 0.13 | 0.96 +/- 0.27 | https://huggingface.co/stabilityai/sd-vae-ft-ema-original/resolve/main/vae-ft-ema-560000-ema-pruned.ckpt | slightly better overall, with EMA |
40
+ | ft-MSE | 840001 | 4.70 | 24.5 +/- 3.7 | 0.71 +/- 0.13 | 0.92 +/- 0.27 | https://huggingface.co/stabilityai/sd-vae-ft-mse-original/resolve/main/vae-ft-mse-840000-ema-pruned.ckpt | resumed with EMA from ft-EMA, emphasis on MSE (rec. loss = MSE + 0.1 * LPIPS), smoother outputs |
41
+
42
+
43
+ ### LAION-Aesthetics 5+ (256x256, subset, 10000 images)
44
+ | Model | train steps | rFID | PSNR | SSIM | PSIM | Link | Comments
45
+ |----------|-----------|------|--------------|---------------|---------------|-----------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------|
46
+ | | | | | | | | |
47
+ | original | 246803 | 2.61 | 26.0 +/- 4.4 | 0.81 +/- 0.12 | 0.75 +/- 0.36 | https://ommer-lab.com/files/latent-diffusion/kl-f8.zip | as used in SD |
48
+ | ft-EMA | 560001 | 1.77 | 26.7 +/- 4.8 | 0.82 +/- 0.12 | 0.67 +/- 0.34 | https://huggingface.co/stabilityai/sd-vae-ft-ema-original/resolve/main/vae-ft-ema-560000-ema-pruned.ckpt | slightly better overall, with EMA |
49
+ | ft-MSE | 840001 | 1.88 | 27.3 +/- 4.7 | 0.83 +/- 0.11 | 0.65 +/- 0.34 | https://huggingface.co/stabilityai/sd-vae-ft-mse-original/resolve/main/vae-ft-mse-840000-ema-pruned.ckpt | resumed with EMA from ft-EMA, emphasis on MSE (rec. loss = MSE + 0.1 * LPIPS), smoother outputs |
50
+
51
+
### Visual
_Visualization of reconstructions on 256x256 images from the COCO2017 validation dataset._

<p align="center">
<br>
<b>
256x256: ft-EMA (left), ft-MSE (middle), original (right)</b>
</p>

<p align="center">
<img src=https://huggingface.co/stabilityai/stable-diffusion-decoder-finetune/resolve/main/eval/ae-decoder-tuning-reconstructions/merged/00025_merged.png />
</p>

<p align="center">
<img src=https://huggingface.co/stabilityai/stable-diffusion-decoder-finetune/resolve/main/eval/ae-decoder-tuning-reconstructions/merged/00011_merged.png />
</p>

<p align="center">
<img src=https://huggingface.co/stabilityai/stable-diffusion-decoder-finetune/resolve/main/eval/ae-decoder-tuning-reconstructions/merged/00037_merged.png />
</p>

<p align="center">
<img src=https://huggingface.co/stabilityai/stable-diffusion-decoder-finetune/resolve/main/eval/ae-decoder-tuning-reconstructions/merged/00043_merged.png />
</p>

<p align="center">
<img src=https://huggingface.co/stabilityai/stable-diffusion-decoder-finetune/resolve/main/eval/ae-decoder-tuning-reconstructions/merged/00053_merged.png />
</p>

<p align="center">
<img src=https://huggingface.co/stabilityai/stable-diffusion-decoder-finetune/resolve/main/eval/ae-decoder-tuning-reconstructions/merged/00029_merged.png />
</p>

sd-vae-ft-mse/config.json ADDED
{
  "_class_name": "AutoencoderKL",
  "_diffusers_version": "0.4.2",
  "act_fn": "silu",
  "block_out_channels": [
    128,
    256,
    512,
    512
  ],
  "down_block_types": [
    "DownEncoderBlock2D",
    "DownEncoderBlock2D",
    "DownEncoderBlock2D",
    "DownEncoderBlock2D"
  ],
  "in_channels": 3,
  "latent_channels": 4,
  "layers_per_block": 2,
  "norm_num_groups": 32,
  "out_channels": 3,
  "sample_size": 256,
  "up_block_types": [
    "UpDecoderBlock2D",
    "UpDecoderBlock2D",
    "UpDecoderBlock2D",
    "UpDecoderBlock2D"
  ]
}

sd-vae-ft-mse/diffusion_pytorch_model.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:1b4889b6b1d4ce7ae320a02dedaeff1780ad77d415ea0d744b476155c6377ddc
size 334707217

sd-vae-ft-mse/diffusion_pytorch_model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:a1d993488569e928462932c8c38a0760b874d166399b14414135bd9c42df5815
size 334643276

whisper/tiny.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:65147644a518d12f04e32d6f3b26facc3f8dd46e5390956a9424a650c0ce22b9
size 75572083