patrickvonplaten committed on
Commit be7e653
1 Parent(s): 364c933

add diffusion models

README.md CHANGED
@@ -9,15 +9,18 @@ tags:
 - vision
 datasets:
 - Laion2B-en
+widget:
+- text: "A high tech solarpunk utopia in the Amazon rainforest"
+  example_title: Amazon rainforest
 ---
 
-# Versatile Diffusion (v1.0, four-flow)
+# Versatile Diffusion V1.0 Model Card
 
 We built **Versatile Diffusion (VD), the first unified multi-flow multimodal diffusion framework**, as a step towards **Universal Generative AI**. Versatile Diffusion can natively support image-to-text, image-variation, text-to-image, and text-variation, and can be further extended to other applications such as semantic-style disentanglement, image-text dual-guided generation, latent image-to-text-to-image editing, and more. Future versions will support more modalities such as speech, music, video and 3D.
 
 Resources for more information: [GitHub](https://github.com/SHI-Labs/Versatile-Diffusion), [arXiv](https://arxiv.org/abs/2211.08332).
 
-# Model Description
+# Model Details
 
 One single flow of Versatile Diffusion contains a VAE, a diffuser, and a context encoder, and thus handles one task (e.g., text-to-image) under one data type (e.g., image) and one context type (e.g., text). The multi-flow structure of Versatile Diffusion is shown in the following diagram:
 
@@ -25,22 +28,93 @@ One single flow of Versatile Diffusion contains a VAE, a diffuser, and a context
 <img src="https://huggingface.co/shi-labs/versatile-diffusion-model/resolve/main/assets/figures/VD_framework.png" width="99%">
 </p>
 
-# Cautions, Biases, and Content Acknowledgment
 
-We would like the raise the awareness of users of this demo of its potential issues and concerns. Like previous large foundation models, Versatile Diffusion could be problematic in some cases, partially due to the imperfect training data and pretrained network (VAEs / context encoders) with limited scope. In its future research phase, VD may do better on tasks such as text-to-image, image-to-text, etc., with the help of more powerful VAEs, more sophisticated network designs, and more cleaned data. So far, we have kept all features available for research testing both to show the great potential of the VD framework and to collect important feedback to improve the model in the future. We welcome researchers and users to report issues with the HuggingFace community discussion feature or email the authors.
 
-Beware that VD may output content that reinforces or exacerbates societal biases, as well as realistic faces, pornography, and violence. VD was trained on the LAION-2B dataset, which scraped non-curated online images and text, and may contain unintended exceptions as we removed illegal content. VD in this demo is meant only for research purposes.
 
-# Citation
 
 ```
-@article{xu2022versatile,
-  title = {Versatile Diffusion: Text, Images and Variations All in One Diffusion Model},
-  author = {Xingqian Xu, Zhangyang Wang, Eric Zhang, Kai Wang, Humphrey Shi},
-  year = 2022,
-  url = {https://arxiv.org/abs/2211.08332},
-  eprint = {2211.08332},
-  archiveprefix = {arXiv},
-  primaryclass = {cs.CV}
-}
-```
+- **Developed by:** Xingqian Xu, Atlas Wang, Eric Zhang, Kai Wang, and Humphrey Shi
+- **Model type:** Diffusion-based multimodal generation model
+- **Language(s):** English
+- **License:** MIT
+- **Resources for more information:** [GitHub Repository](https://github.com/SHI-Labs/Versatile-Diffusion), [Paper](https://arxiv.org/abs/2211.08332).
+- **Cite as:**
+```
+@article{xu2022versatile,
+  title = {Versatile Diffusion: Text, Images and Variations All in One Diffusion Model},
+  author = {Xingqian Xu, Zhangyang Wang, Eric Zhang, Kai Wang, Humphrey Shi},
+  year = 2022,
+  url = {https://arxiv.org/abs/2211.08332},
+  eprint = {2211.08332},
+  archiveprefix = {arXiv},
+  primaryclass = {cs.CV}
+}
+```
+
+You can use the model both with the [🧨Diffusers library](https://github.com/huggingface/diffusers) and the [SHI-Labs Versatile Diffusion codebase](https://github.com/SHI-Labs/Versatile-Diffusion).
+
+### Diffusers
+#### Text to Image
+```py
+from diffusers import VersatileDiffusionTextToImagePipeline
+import torch
+
+pipe = VersatileDiffusionTextToImagePipeline.from_pretrained("diffusers/vd-official-test", torch_dtype=torch.float16)
+pipe.remove_unused_weights()
+pipe = pipe.to("cuda")
+
+generator = torch.Generator(device="cuda").manual_seed(0)
+image = pipe("an astronaut riding on a horse on mars", generator=generator).images[0]
+image.save("./astronaut.png")
+```
+#### Image variations
+```py
+from diffusers import VersatileDiffusionImageVariationPipeline
+import torch
+import requests
+from io import BytesIO
+from PIL import Image
+
+# download an initial image
+url = "https://huggingface.co/datasets/diffusers/images/resolve/main/benz.jpg"
+response = requests.get(url)
+image = Image.open(BytesIO(response.content)).convert("RGB")
+
+pipe = VersatileDiffusionImageVariationPipeline.from_pretrained("diffusers/vd-official-test", torch_dtype=torch.float16)
+pipe = pipe.to("cuda")
+
+generator = torch.Generator(device="cuda").manual_seed(0)
+image = pipe(image, generator=generator).images[0]
+image.save("./car_variation.png")
+```
+#### Dual-guided generation
+```py
+from diffusers import VersatileDiffusionDualGuidedPipeline
+import torch
+import requests
+from io import BytesIO
+from PIL import Image
+
+# download an initial image
+url = "https://huggingface.co/datasets/diffusers/images/resolve/main/benz.jpg"
+response = requests.get(url)
+image = Image.open(BytesIO(response.content)).convert("RGB")
+text = "a red car in the sun"
+
+pipe = VersatileDiffusionDualGuidedPipeline.from_pretrained("diffusers/vd-official-test", torch_dtype=torch.float16)
+pipe.remove_unused_weights()
+pipe = pipe.to("cuda")
+
+generator = torch.Generator(device="cuda").manual_seed(0)
+text_to_image_strength = 0.75
+
+image = pipe(prompt=text, image=image, text_to_image_strength=text_to_image_strength, generator=generator).images[0]
+image.save("./red_car.png")
+```
+
+### Original GitHub Repository
+
+Follow the instructions [here](https://github.com/SHI-Labs/Versatile-Diffusion/#evaluation).
+
+# Cautions, Biases, and Content Acknowledgment
+
+We would like to raise awareness among users of this demo of its potential issues and concerns. Like previous large foundation models, Versatile Diffusion could be problematic in some cases, partly due to imperfect training data and pretrained networks (VAEs / context encoders) with limited scope. In its future research phase, VD may do better on tasks such as text-to-image and image-to-text with the help of more powerful VAEs, more sophisticated network designs, and cleaner data. So far, we have kept all features available for research testing both to show the great potential of the VD framework and to collect important feedback to improve the model in the future. We welcome researchers and users to report issues via the HuggingFace community discussion feature or by emailing the authors.
+
+Beware that VD may output content that reinforces or exacerbates societal biases, as well as realistic faces, pornography, and violence. VD was trained on the LAION-2B dataset, which scraped non-curated online images and text; although we removed illegal content, unintended exceptions may remain. VD in this demo is meant only for research purposes.
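
The Model Details section above describes one flow per task, each built from a VAE, a diffuser, and a context encoder. As a complement to the three task-specific pipelines in the examples, here is a minimal sketch of driving all flows from the single umbrella pipeline named in model_index.json below; the `text_to_image`, `image_variation`, and `dual_guided` method names are assumptions based on current diffusers releases, and the repo id is the one used in the README examples.

```py
import torch
from diffusers import VersatileDiffusionPipeline

# Load every component once; each call below exercises a different flow
# of the same checkpoint (text-to-image, image-variation, dual-guided).
pipe = VersatileDiffusionPipeline.from_pretrained(
    "diffusers/vd-official-test", torch_dtype=torch.float16
).to("cuda")

generator = torch.Generator(device="cuda").manual_seed(0)

image = pipe.text_to_image("an astronaut riding on a horse on mars", generator=generator).images[0]
variation = pipe.image_variation(image, generator=generator).images[0]
dual = pipe.dual_guided(
    prompt="a red car in the sun", image=image, text_to_image_strength=0.75, generator=generator
).images[0]
```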
image_encoder/config.json ADDED
@@ -0,0 +1,23 @@
+{
+  "_name_or_path": "openai/clip-vit-large-patch14",
+  "architectures": [
+    "CLIPVisionModelWithProjection"
+  ],
+  "attention_dropout": 0.0,
+  "dropout": 0.0,
+  "hidden_act": "quick_gelu",
+  "hidden_size": 1024,
+  "image_size": 224,
+  "initializer_factor": 1.0,
+  "initializer_range": 0.02,
+  "intermediate_size": 4096,
+  "layer_norm_eps": 1e-05,
+  "model_type": "clip_vision_model",
+  "num_attention_heads": 16,
+  "num_channels": 3,
+  "num_hidden_layers": 24,
+  "patch_size": 14,
+  "projection_dim": 768,
+  "torch_dtype": "float32",
+  "transformers_version": "4.25.0.dev0"
+}
image_encoder/pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:89d2aa29b5fdf64f3ad4f45fb4227ea98bc45156bbae673b85be1af7783dbabb
+size 1215993967
image_feature_extractor/preprocessor_config.json ADDED
@@ -0,0 +1,28 @@
+{
+  "crop_size": {
+    "height": 224,
+    "width": 224
+  },
+  "do_center_crop": true,
+  "do_convert_rgb": true,
+  "do_normalize": true,
+  "do_rescale": true,
+  "do_resize": true,
+  "feature_extractor_type": "CLIPFeatureExtractor",
+  "image_mean": [
+    0.48145466,
+    0.4578275,
+    0.40821073
+  ],
+  "image_processor_type": "CLIPImageProcessor",
+  "image_std": [
+    0.26862954,
+    0.26130258,
+    0.27577711
+  ],
+  "resample": 3,
+  "rescale_factor": 0.00392156862745098,
+  "size": {
+    "shortest_edge": 224
+  }
+}
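
The two configs above describe the CLIP vision encoder and its preprocessing. As a small, hedged sketch, they can also be loaded standalone; the subfolder names are the ones added in this commit, the repo id is the one used in the README examples, and subfolder loading via transformers' `from_pretrained` is assumed.

```py
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

repo = "diffusers/vd-official-test"

# Vision encoder that maps 224x224 crops to 768-dim projected embeddings (projection_dim above).
image_encoder = CLIPVisionModelWithProjection.from_pretrained(repo, subfolder="image_encoder")
# Matching preprocessing: resize/center-crop to 224 and normalize with the CLIP mean/std above.
image_processor = CLIPImageProcessor.from_pretrained(repo, subfolder="image_feature_extractor")
```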
image_unet/config.json ADDED
@@ -0,0 +1,36 @@
+{
+  "_class_name": "UNet2DConditionModel",
+  "_diffusers_version": "0.8.0.dev0",
+  "act_fn": "silu",
+  "attention_head_dim": 8,
+  "block_out_channels": [
+    320,
+    640,
+    1280,
+    1280
+  ],
+  "center_input_sample": false,
+  "cross_attention_dim": 768,
+  "down_block_types": [
+    "CrossAttnDownBlock2D",
+    "CrossAttnDownBlock2D",
+    "CrossAttnDownBlock2D",
+    "DownBlock2D"
+  ],
+  "downsample_padding": 1,
+  "flip_sin_to_cos": true,
+  "freq_shift": 0,
+  "in_channels": 4,
+  "layers_per_block": 2,
+  "mid_block_scale_factor": 1,
+  "norm_eps": 1e-05,
+  "norm_num_groups": 32,
+  "out_channels": 4,
+  "sample_size": null,
+  "up_block_types": [
+    "UpBlock2D",
+    "CrossAttnUpBlock2D",
+    "CrossAttnUpBlock2D",
+    "CrossAttnUpBlock2D"
+  ]
+}
image_unet/diffusion_pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:3899338e2c2a00a02e6ad0e33933da4fed163bf7a16187522c1019c82519cff2
+size 3438354725
model_index.json ADDED
@@ -0,0 +1,36 @@
+{
+  "_class_name": "VersatileDiffusionPipeline",
+  "_diffusers_version": "0.8.0.dev0",
+  "image_encoder": [
+    "transformers",
+    "CLIPVisionModelWithProjection"
+  ],
+  "image_feature_extractor": [
+    "transformers",
+    "CLIPImageProcessor"
+  ],
+  "image_unet": [
+    "diffusers",
+    "UNet2DConditionModel"
+  ],
+  "scheduler": [
+    "diffusers",
+    "DDIMScheduler"
+  ],
+  "text_encoder": [
+    "transformers",
+    "CLIPTextModelWithProjection"
+  ],
+  "text_unet": [
+    "versatile_diffusion",
+    "UNetFlatConditionModel"
+  ],
+  "tokenizer": [
+    "transformers",
+    "CLIPTokenizer"
+  ],
+  "vae": [
+    "diffusers",
+    "AutoencoderKL"
+  ]
+}
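
model_index.json wires the folders added in this commit into the `VersatileDiffusionPipeline` class; a minimal sketch of inspecting that wiring after loading, using the repo id from the README examples:

```py
from diffusers import VersatileDiffusionPipeline

pipe = VersatileDiffusionPipeline.from_pretrained("diffusers/vd-official-test")

# Each entry of model_index.json becomes an attribute of the loaded pipeline,
# e.g. pipe.vae, pipe.image_unet, pipe.text_unet, pipe.text_encoder, ...
for name, module in pipe.components.items():
    print(f"{name}: {type(module).__name__}")
```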
scheduler/scheduler_config.json ADDED
@@ -0,0 +1,13 @@
+{
+  "_class_name": "DDIMScheduler",
+  "_diffusers_version": "0.6.0",
+  "beta_end": 0.012,
+  "beta_schedule": "scaled_linear",
+  "beta_start": 0.00085,
+  "num_train_timesteps": 1000,
+  "set_alpha_to_one": false,
+  "skip_prk_steps": true,
+  "steps_offset": 1,
+  "trained_betas": null,
+  "clip_sample": false
+}
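
The scheduler is a standard DDIM setup (scaled-linear betas over 1000 training timesteps). A minimal sketch of loading it on its own, assuming the repo id from the README examples:

```py
from diffusers import DDIMScheduler

scheduler = DDIMScheduler.from_pretrained("diffusers/vd-official-test", subfolder="scheduler")

# Values come straight from scheduler_config.json above.
print(scheduler.config.beta_schedule)        # scaled_linear
print(scheduler.config.num_train_timesteps)  # 1000
```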
text_encoder/config.json ADDED
@@ -0,0 +1,25 @@
+{
+  "_name_or_path": "openai/clip-vit-large-patch14",
+  "architectures": [
+    "CLIPTextModelWithProjection"
+  ],
+  "attention_dropout": 0.0,
+  "bos_token_id": 0,
+  "dropout": 0.0,
+  "eos_token_id": 2,
+  "hidden_act": "quick_gelu",
+  "hidden_size": 768,
+  "initializer_factor": 1.0,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "layer_norm_eps": 1e-05,
+  "max_position_embeddings": 77,
+  "model_type": "clip_text_model",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 1,
+  "projection_dim": 768,
+  "torch_dtype": "float32",
+  "transformers_version": "4.25.0.dev0",
+  "vocab_size": 49408
+}
text_encoder/pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:85f5bcf101dde33d8ab9f7e5e1678339fa4258ea07bc65e6ca66e01f9de99622
+size 494664885
text_unet/config.json ADDED
@@ -0,0 +1,44 @@
+{
+  "_class_name": "UNetFlatConditionModel",
+  "_diffusers_version": "0.8.0.dev0",
+  "act_fn": "silu",
+  "attention_head_dim": 8,
+  "block_out_channels": [
+    320,
+    640,
+    1280,
+    1280
+  ],
+  "center_input_sample": false,
+  "cross_attention_dim": 768,
+  "down_block_types": [
+    "CrossAttnDownBlockFlat",
+    "CrossAttnDownBlockFlat",
+    "CrossAttnDownBlockFlat",
+    "DownBlockFlat"
+  ],
+  "downsample_padding": 1,
+  "flip_sin_to_cos": true,
+  "freq_shift": 0,
+  "in_channels": [
+    768,
+    1,
+    1
+  ],
+  "layers_per_block": 2,
+  "mid_block_scale_factor": 1,
+  "norm_eps": 1e-05,
+  "norm_num_groups": 32,
+  "out_channels": [
+    768,
+    1,
+    1
+  ],
+  "sample_size": null,
+  "up_block_types": [
+    "UpBlockFlat",
+    "CrossAttnUpBlockFlat",
+    "CrossAttnUpBlockFlat",
+    "CrossAttnUpBlockFlat"
+  ]
+}
text_unet/diffusion_pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:dab7a69b6d6f52cd90717966d93ca30d13004d0eaf2994e1f0fe526473ac827c
+size 6835669073
tokenizer/merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer/special_tokens_map.json ADDED
@@ -0,0 +1,24 @@
+{
+  "bos_token": {
+    "content": "<|startoftext|>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": "<|endoftext|>",
+  "unk_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  }
+}
tokenizer/tokenizer_config.json ADDED
@@ -0,0 +1,34 @@
+{
+  "add_prefix_space": false,
+  "bos_token": {
+    "__type": "AddedToken",
+    "content": "<|startoftext|>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "do_lower_case": true,
+  "eos_token": {
+    "__type": "AddedToken",
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "errors": "replace",
+  "model_max_length": 77,
+  "name_or_path": "openai/clip-vit-large-patch14",
+  "pad_token": "<|endoftext|>",
+  "special_tokens_map_file": "./special_tokens_map.json",
+  "tokenizer_class": "CLIPTokenizer",
+  "unk_token": {
+    "__type": "AddedToken",
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  }
+}
tokenizer/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
vae/config.json ADDED
@@ -0,0 +1,29 @@
+{
+  "_class_name": "AutoencoderKL",
+  "_diffusers_version": "0.8.0.dev0",
+  "act_fn": "silu",
+  "block_out_channels": [
+    128,
+    256,
+    512,
+    512
+  ],
+  "down_block_types": [
+    "DownEncoderBlock2D",
+    "DownEncoderBlock2D",
+    "DownEncoderBlock2D",
+    "DownEncoderBlock2D"
+  ],
+  "in_channels": 3,
+  "latent_channels": 4,
+  "layers_per_block": 2,
+  "norm_num_groups": 32,
+  "out_channels": 3,
+  "sample_size": 256,
+  "up_block_types": [
+    "UpDecoderBlock2D",
+    "UpDecoderBlock2D",
+    "UpDecoderBlock2D",
+    "UpDecoderBlock2D"
+  ]
+}
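
As a quick consistency check, the VAE's 4 latent channels line up with the `in_channels`/`out_channels` of image_unet/config.json earlier in this commit, since the image diffuser operates in the VAE's latent space. A minimal sketch, using the repo id from the README examples:

```py
from diffusers import AutoencoderKL, UNet2DConditionModel

repo = "diffusers/vd-official-test"
vae = AutoencoderKL.from_pretrained(repo, subfolder="vae")
image_unet = UNet2DConditionModel.from_pretrained(repo, subfolder="image_unet")

# The image flow denoises 4-channel latents produced by this VAE.
assert vae.config.latent_channels == image_unet.config.in_channels == 4
```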
vae/diffusion_pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:1b134cded8eb78b184aefb8805b6b572f36fa77b255c483665dda931fa0130c5
+size 334707217