---
license: other
license_name: stable-cascade-nc-community
license_link: https://huggingface.co/stabilityai/stable-cascade/blob/main/LICENSE
language:
- en
tags:
- stable-cascade
- SDXL
- art
- artstyle
- fantasy
- anime
- aiart
- ketengan
- SomniumSC
pipeline_tag: text-to-image
library_name: diffusers
---

# SomniumSC-v1.1 Model Showcase
<p align="center">
  <img src="01.png" width=70% height=70%>
</p>

`Ketengan-Diffusion/SomniumSC-v1.1` is a fine-tuned Stage C Stable Cascade model based on [stabilityai/stable-cascade](https://huggingface.co/stabilityai/stable-cascade).

SomniumSC is a fine-tune of Stability AI's new Stable Cascade model (also known as Würstchen v3) with a 2D (cartoonish) style, trained on the Stage C 3.6B model. The text encoder was also trained toward the 2D style, so the model can generate from booru-tag prompts as well as natural-language prompts.

The model uses the same dataset size and curation method as AnySomniumXL v2: 33,000+ curated images selected from hundreds of thousands of images from various sources. The dataset was built by keeping images with an aesthetic score of at least 17 and at most 50 (the upper bound keeps the model cartoonish rather than too realistic; the scale comes from our proprietary aesthetic scoring mechanism) and by discarding images that contain text or watermarks such as signatures, as well as comic/manga images.

# Demo

Huggingface Space: [spaces/Ketengan-Diffusion/SomniumSC-v1.1-Demo](https://huggingface.co/spaces/Ketengan-Diffusion/SomniumSC-v1.1-Demo)

Our Official Demo (Temporary Backup): somniumscdemo.ketengan.com

# Training Process

SomniumSC v1.1 technical specifications:

- Epochs: 30 (the released SomniumSC results used epoch 40)
- Captioned by a proprietary multimodal LLM (better than LLaVA)
- Trained with bucket sizes of 1024x1024 and 1536x1536 (multi-resolution)
- Shuffle caption: yes
- Clip skip: 0
- Trained on 1x NVIDIA A100 80GB


# Our Dataset Curation Process
<p align="center">
  <img src="Curation.png" width=70% height=70%>
</p>

Image source: [Source1](https://danbooru.donmai.us/posts/3143351) [Source2](https://danbooru.donmai.us/posts/3272710) [Source3](https://danbooru.donmai.us/posts/3320417)

Our dataset is scored using the pretrained CLIP+MLP aesthetic scoring model from https://github.com/christophschuhmann/improved-aesthetic-predictor, and we adjusted our script to detect text and watermarks using OCR via pytesseract.

<p align="center">
  <img src="Chart.png" width=70% height=70%>
</p>

This scoring method uses a scale from -1 to 100. We take a minimum threshold of around 17-20 and a maximum of around 50-75 to preserve the 2D style of the dataset; any image containing text returns a score of -1. Images scoring below 17 or above 65 are deleted.

The dataset curation process runs on an NVIDIA T4 16GB machine and takes about 7 days to curate 1,000,000 images.
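
As a rough illustration, here is a minimal sketch of the filtering step described above. The `predict_aesthetic_score` helper is hypothetical (a stand-in for a wrapper around the improved-aesthetic-predictor model), and the thresholds simply follow the numbers quoted in this section; this is not the production pipeline.

```
# Minimal curation-filter sketch, not the production pipeline.
# `predict_aesthetic_score` is a hypothetical stand-in for a wrapper around the
# pretrained CLIP+MLP aesthetic predictor linked above.
from pathlib import Path

import pytesseract
from PIL import Image

MIN_SCORE, MAX_SCORE = 17.0, 50.0

def predict_aesthetic_score(image: Image.Image) -> float:
    # Hypothetical: load the CLIP+MLP model and return its aesthetic score.
    raise NotImplementedError

def keep_image(path: Path) -> bool:
    image = Image.open(path).convert("RGB")
    # Any detectable text (signatures, watermarks, comic/manga pages) means discard;
    # the pipeline treats these as a score of -1.
    if pytesseract.image_to_string(image).strip():
        return False
    score = predict_aesthetic_score(image)
    return MIN_SCORE <= score <= MAX_SCORE

if __name__ == "__main__":
    for path in sorted(Path("raw_images").glob("*.jpg")):
        print(path.name, "keep" if keep_image(path) else "discard")
```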

# Captioning process
We use a combination of a proprietary multimodal LLM and open-source multimodal LLMs such as LLaVA 1.5 for captioning, which produces more detailed results than plain BLIP2. Details such as clothing, atmosphere, situation, scene, place, gender, skin, and so on are generated by the LLM.
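
The exact captioning pipeline is proprietary, but as a point of reference, a single image can be captioned with the open-source LLaVA 1.5 via Hugging Face `transformers` roughly like this (the checkpoint name, prompt template, and generation settings are assumptions, not our actual setup):

```
# Illustrative only: caption one image with open-source LLaVA 1.5.
# Checkpoint and prompt are assumptions, not the proprietary pipeline above.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.png").convert("RGB")
prompt = "USER: <image>\nDescribe this image in detail, including clothing, scene, and atmosphere. ASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```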

# Tagging Process
We simply use booru tags retrieved from booru boards, so the tags were applied manually by humans, which makes them more accurate.
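
For illustration, the human-applied tags for a single board post can be retrieved via the Danbooru JSON API roughly as follows (the endpoint and the `tag_string` field reflect Danbooru's public API as we understand it, not our own tagging scripts):

```
# Illustrative sketch: fetch the tag string for one Danbooru post.
# The JSON endpoint and "tag_string" field are assumptions about the public API.
import requests

def fetch_booru_tags(post_id: int) -> list[str]:
    url = f"https://danbooru.donmai.us/posts/{post_id}.json"
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json().get("tag_string", "").split()

if __name__ == "__main__":
    print(fetch_booru_tags(3143351))  # one of the example posts shown above
```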

# Limitations:

✓ Still requires broader dataset training for more variety in poses and styles

✓ Text cannot be generated correctly and usually comes out garbled

✓ This model is optimized for generating humans or mutated humans. Non-human subjects such as SCPs, ponies, and others may not produce what you expect

✓ Faces may look compressed; generating the image at 1536px can help

A smaller half-size version and a Stable Cascade Lite version will be released soon.

# How to use SomniumSC:

Currently, Stable Cascade is only supported by ComfyUI.

You can follow the tutorial [here](https://gist.github.com/comfyanonymous/0f09119a342d0dd825bb2d99d19b781c#file-stable_cascade_workflow_test-json) or [here](https://medium.com/@codeandbird/run-new-stable-cascade-model-in-comfyui-now-officially-supported-f66a37e9a8ad).

To make it clear which models to download, here is where to get each one:

For Stage A, download from the [official stabilityai/stable-cascade repo](https://huggingface.co/stabilityai/stable-cascade).

For Stage B, download from the [official stabilityai/stable-cascade repo](https://huggingface.co/stabilityai/stable-cascade).

For Stage C, download the safetensors file from the Files tab of this Hugging Face repo.

For the text encoder, download it from the `text_encoder` folder of this Hugging Face repo.
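
As a rough guide (this assumes the standard ComfyUI folder layout used by the official Stable Cascade example workflows; your installation may differ), the downloaded files typically go here:

```
ComfyUI/models/unet/   <- Stage C (this repo's safetensors) and Stage B
ComfyUI/models/vae/    <- Stage A
ComfyUI/models/clip/   <- the text encoder from our text_encoder folder
```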

# Deploying SomniumSC v1.1 with Diffusers 🧨

⚠️ Warning: You must install this `diffusers` branch for the code below to work with the Stable Cascade architecture:

```
pip install git+https://github.com/kashif/diffusers.git@a3dc21385b7386beb3dab3a9845962ede6765887
```

A simple SomniumSC-v1.1 inference example:

```
import torch
from diffusers import StableCascadeDecoderPipeline, StableCascadePriorPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
num_images_per_prompt = 1
print(f"Running on: {device}")

# Stage C (prior): point this to the fine-tuned model you want to use.
prior = StableCascadePriorPipeline.from_pretrained("Ketengan-Diffusion/SomniumSC-v1.1", torch_dtype=torch.bfloat16).to(device)
# Stages A and B (decoder): the "mother" model from stabilityai.
decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade", torch_dtype=torch.float16).to(device)

prompt = "An astronaut riding a horse"
negative_prompt = ""

# Stage C generates image embeddings from the prompt.
prior_output = prior(
    prompt=prompt,
    height=1024,
    width=1024,
    negative_prompt=negative_prompt,
    guidance_scale=12.0,
    num_images_per_prompt=num_images_per_prompt,
    num_inference_steps=50
)
# Stages B and A decode the embeddings into the final image.
decoder_output = decoder(
    image_embeddings=prior_output.image_embeddings.half(),
    prompt=prompt,
    negative_prompt=negative_prompt,
    guidance_scale=1.0,
    output_type="pil",
    num_inference_steps=10
).images

decoder_output[0].save("output.png")
```

# SomniumSC Pro tips:

A negative prompt is a must for better-quality output. The recommended negative prompt is: `lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry, artist name`

If the model produces unwanted pointy ears on a character, add `elf` or `pointy ears` to the negative prompt.

If the model produces a "compressed" face, use a 1536px resolution so the model can render the face clearly.
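
Continuing from the Diffusers example above (it reuses the already-loaded `prior` pipeline), here is a sketch of how these tips might be applied; the prompt itself is just an illustrative placeholder:

```
# Reuses the `prior` pipeline loaded in the Diffusers example above.
negative_prompt = (
    "lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, "
    "fewer digits, cropped, worst quality, low quality, normal quality, "
    "jpeg artifacts, signature, watermark, username, blurry, artist name"
)

prior_output = prior(
    prompt="1girl, solo, reading a book in a sunlit library",  # placeholder prompt
    negative_prompt=negative_prompt,
    height=1536,   # larger resolution helps avoid "compressed" faces
    width=1536,
    guidance_scale=12.0,
    num_inference_steps=50,
)
```

Decode `prior_output` with the decoder pipeline exactly as in the example above.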


# Disclaimer:

This model is released under the STABILITY AI NON-COMMERCIAL RESEARCH COMMUNITY LICENSE, which means the model cannot be sold and derivative works cannot be commercialized. As far as we know, the exception is purchasing a Stability AI membership, which allows you to commercialize derivative works based on this model. Please support Stability AI so they can keep providing open-source models for us. You are still free to merge our model.